At 6pm every day, millions of people open their fridge and stare into the void.
No plan. No inspiration. Just ingredients and rising hunger.
You had planned to eat healthy and save money, yet this moment often ends with you ordering a takeaway, blowing both your budget and your plan.
We built Mise to solve that moment.
Mise is a voice-first, vision-enabled kitchen companion that sees what ingredients you have, suggests what to cook, and closes the loop to a shopping basket, all using natural conversation.
No typing. No searching. No switching apps.
You talk. Mise listens, looks, and acts.
The Problem: Decision Paralysis in the Kitchen
Recipe apps assume you already know what you want to cook.
Supermarket sites assume you already know what you want to buy.
Neither helps when you are standing in front of your fridge after a long day at work, with brain fog in your head and hunger in your belly.
Right now, millions of families are trying to stretch their budget further at every mealtime. The ingredients are there. The intention is there. What’s missing is the moment of inspiration.
Mise targets the decision moment before cooking begins.
Why Gemini Live?
At Say It Now we started out building voice-first experiences on Alexa and Google Assistant in 2018.
Our vision is to make it possible for every brand to engage in delightful conversations with their customers.
The technology has improved enormously over the last few years, and we have been waiting for Google to ship something like Gemini Live so we could create experiences that feel alive, not like chatting with a textbox.
Gemini Live made this possible through real-time multimodal streaming.
How We Built It Using Google AI and Google Cloud
Mise runs entirely on Google infrastructure and models.
Core AI Model
We used:
Gemini Live (gemini-live-2.5-flash native audio)
This model handles:
- Real-time speech recognition
- Natural voice generation
- Vision processing from camera input
- Multimodal reasoning
- Conversational continuity
Audio and video stream continuously to the model, allowing the agent to respond without turn-taking delays.
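As an illustration, here is a minimal sketch of such a session using the google-genai Python SDK on Vertex AI. The project ID, the capture pipeline, and the exact method signatures are assumptions; the SDK surface varies between versions.

```python
# Minimal sketch of a live session that streams both audio and a camera
# frame to the model, and receives spoken audio back incrementally.
# Assumptions: pcm_chunk is raw 16 kHz PCM from the microphone and
# jpeg_frame is a JPEG capture from the camera.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

async def stream_turn(pcm_chunk: bytes, jpeg_frame: bytes) -> bytes:
    async with client.aio.live.connect(
        model="gemini-live-2.5-flash",
        config={"response_modalities": ["AUDIO"]},
    ) as session:
        # Audio and video share one session; there is no explicit turn-taking.
        await session.send_realtime_input(
            audio=types.Blob(data=pcm_chunk, mime_type="audio/pcm;rate=16000")
        )
        await session.send_realtime_input(
            media=types.Blob(data=jpeg_frame, mime_type="image/jpeg")
        )
        # Spoken replies arrive chunk by chunk on response.data.
        reply = bytearray()
        async for response in session.receive():
            if response.data:
                reply.extend(response.data)
        return bytes(reply)
```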
Backend Architecture
Our backend is a lightweight real-time proxy built on Google Cloud.
Stack:
- FastAPI + WebSockets
- Google authentication
- Vertex AI endpoint
- Cloud Run deployment
The browser connects to a Cloud Run service, which streams data to the Gemini Live BidiGenerateContent endpoint.
This architecture enables true live interaction rather than request-response cycles.
The system streams audio and video bidirectionally with no buffering, allowing natural conversation flow.
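A simplified sketch of that proxy is below. The endpoint path, message framing, and helper structure are illustrative rather than our exact production code; the real bridge also carries camera frames and the state tokens described in the next section.

```python
# Simplified sketch of the Cloud Run proxy: each browser WebSocket is
# bridged to a Gemini Live session, with audio piped in both directions.
import asyncio
from fastapi import FastAPI, WebSocket
from google import genai
from google.genai import types

app = FastAPI()
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

@app.websocket("/live")
async def live(ws: WebSocket):
    await ws.accept()
    async with client.aio.live.connect(
        model="gemini-live-2.5-flash",
        config={"response_modalities": ["AUDIO"]},
    ) as session:

        async def browser_to_model():
            # Binary WebSocket messages carry raw PCM microphone audio.
            while True:
                chunk = await ws.receive_bytes()
                await session.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )

        async def model_to_browser():
            # Generated audio is relayed straight back down the same socket.
            async for response in session.receive():
                if response.data:
                    await ws.send_bytes(response.data)

        await asyncio.gather(browser_to_model(), model_to_browser())
```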
Token-Driven State Machine
A major challenge with multimodal agents is keeping the UI in sync with the conversation.
We solved this with a token-based state machine.
Mise outputs structured tokens that represent conversation states, for example:
- SCAN — detecting ingredients
- CONFIRM — verifying results
- SUGGEST — proposing recipes
- GAP — identifying missing items
- IMPACT — summarising cost savings
The backend strips these tokens from spoken output and sends them to the frontend as UI triggers.
This allows the conversation itself to drive the interface.
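As a sketch, the filter can be as small as a regular expression. The bracketed token syntax below is illustrative, not our exact wire format; what matters is that the marker drives the UI but never reaches the user's ears.

```python
# Illustrative token filter: strip state markers like [SCAN] from the
# model's text before speech, and forward them to the frontend as events.
import re

STATE_TOKENS = {"SCAN", "CONFIRM", "SUGGEST", "GAP", "IMPACT"}
TOKEN_RE = re.compile(r"\[(" + "|".join(sorted(STATE_TOKENS)) + r")\]")

def split_tokens(model_text: str) -> tuple[str, list[str]]:
    """Return (clean text for speech, UI events) for one model turn."""
    events = TOKEN_RE.findall(model_text)
    clean = TOKEN_RE.sub("", model_text).strip()
    return clean, events

clean, events = split_tokens("[SCAN] I can see tomatoes, eggs and spinach.")
# clean  == "I can see tomatoes, eggs and spinach."
# events == ["SCAN"]
```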
Frontend Experience
The app is delivered as a Progressive Web App hosted on Firebase.
Key design principles:
Voice replaces every tap
There is no search bar or form.
The camera is the interface
Users simply point at ingredients. Vision AI handles detection.
Outputs adapt to priorities
If a user says “cheap,” price leads the recommendation.
If they say “healthy,” nutrition leads.
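A hypothetical sketch of how that prioritisation could work once candidate recipes are scored; the fields and weighting are illustrative, not our production logic.

```python
# Illustrative ranking: the user's stated priority decides which signal
# leads. Costs and health scores are assumed to be precomputed per recipe.
from dataclasses import dataclass

@dataclass
class Recipe:
    name: str
    missing_cost: float   # estimated cost of missing ingredients, in pounds
    health_score: float   # 0.0 (least healthy) to 1.0 (most healthy)

def rank(recipes: list[Recipe], priority: str) -> list[Recipe]:
    if priority == "cheap":
        return sorted(recipes, key=lambda r: r.missing_cost)
    if priority == "healthy":
        return sorted(recipes, key=lambda r: -r.health_score)
    # "both": naive equal blend of normalised cost and health
    max_cost = max(r.missing_cost for r in recipes) or 1.0
    return sorted(
        recipes, key=lambda r: r.missing_cost / max_cost - r.health_score
    )
```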
The goal was a single fluid conversation rather than a sequence of screens.
Example User Flow
- User launches Mise
- Agent asks what matters most: cost, health, or both
- Camera scans ingredients
- Recipes are suggested based on real items detected
- Missing ingredients are identified with prices
- User can add items to a shopping basket
- Agent summarises savings and impact
We believe this is a valuable product for grocery stores and providers, and we would be delighted to talk to them about commercialisation.

Why This Matters
Voice commerce today is dominated by scripted commands like:
“Alexa, add milk to my basket.”
Mise demonstrates something different: an ambient decision engine that acts before a search begins.
Instead of reacting to a command, it helps form the intention.
This could apply far beyond cooking:
- Retail decision support
- Healthcare triage
- Travel planning
- Education
- Home services
Anywhere people face complex choices in real time.
What We Learned
Multimodal agents require orchestration
Speech, vision, UI, and backend logic must move together. The model alone is not enough.
Latency determines believability
Even small delays break the illusion of conversation.
State management is critical
Without explicit control, agent interactions drift or feel inconsistent.
Iterative prototyping is essential
We evolved the system through rapid build-test cycles, starting from reference implementations and refining continuously.
What Comes Next
Mise is currently a prototype, but the architecture is designed to scale.
Because the system is configurable, it can be white-labelled for different retailers or contexts without rewriting the core agent.
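For instance, a per-retailer configuration could supply everything the core agent needs to change skins. The field names below are hypothetical, sketched to show the shape of the idea.

```python
# Hypothetical white-label configuration: the core agent stays the same,
# while each retailer supplies its own catalogue, persona and prompt.
RETAILER_CONFIGS = {
    "example-grocer": {
        "catalogue_api": "https://api.example-grocer.test/v1/products",
        "currency": "GBP",
        "persona": "warm, budget-conscious home cook",
        "system_prompt": "You are Mise for Example Grocer. Prioritise ...",
    },
}
```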
The long-term vision is a universal “intent layer” between people and services.