Real-time AI cooking assistant that sees your kitchen through your webcam and guides you step-by-step through natural voice conversation.
Built with Gemini Live API for multimodal real-time interaction (vision + audio).
- 🎥 Ingredient Scanner — Point your camera, AI identifies ingredients in real-time
- 🗣️ Voice Interaction — Natural conversation with interruption support
- 📋 Recipe Suggestions — Get recipe ideas based on visible ingredients
- 👨‍🍳 Live Cooking Guide — Step-by-step voice instructions as you cook
- 👁️ Real-time Feedback — "Those look perfectly golden, time to flip!"
- 🔄 Substitutions — "I don't have cumin" → suggests alternatives
- ⚠️ Safety Alerts — Proactive food safety warnings
```
Browser (Webcam + Mic)
  ↕ WebSocket (binary PCM audio + base64 JPEG video frames)
FastAPI Backend (server.py)
  ↕ Gemini Live API (send_realtime_input / receive)
Gemini 2.5 Flash Native Audio
  → Audio responses + Text + Function Calls (tool use)
  ↕
Cloud Firestore (recipes + sessions persistence)
```
Key data flows:
- Browser → Backend → Gemini: Webcam frames (768×768 JPEG, 1 FPS) + microphone audio (16kHz PCM) streamed over WebSocket
- Gemini → Backend → Browser: Voice responses (24kHz PCM) + text + function calls (timers, recipes, checklists) routed back in real time
- Backend ↔ Firestore: Saved recipes and session tokens persisted (with automatic in-memory fallback)
- Function Calling: Gemini proactively invokes 6 tools → backend executes them → UI updates sent to browser → results returned to Gemini
- Python 3.10+
- A Google AI API key (Get one here)
1. Clone and navigate:

   ```bash
   cd ntt
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set your API key:

   ```bash
   # Copy the template
   cp .env.example .env
   # Edit .env and add your key
   GEMINI_API_KEY=your-actual-api-key
   ```

4. Run the server:

   ```bash
   python server.py
   ```

5. Open in Chrome: http://localhost:8080

6. Allow camera & microphone when prompted, then click Start Session!
Option A: One-command deploy script (deploy.sh)
```bash
# Make executable (first time only)
chmod +x deploy.sh

# Deploy with defaults
./deploy.sh

# Or override project/region
./deploy.sh --project YOUR_PROJECT_ID --region us-central1
```

The script automatically:
- Loads `GEMINI_API_KEY` from `.env`
- Enables required GCP APIs (Cloud Run, Cloud Build, Artifact Registry)
- Builds the Docker image from source
- Deploys to Cloud Run with optimized settings (3600s timeout for WebSockets, 512Mi memory)
- Prints the live HTTPS URL
Option B: CI/CD with Cloud Build (cloudbuild.yaml)
Automated pipeline that runs on every push to main:
```bash
# One-time: create the trigger
gcloud builds triggers create github \
  --repo-name=ntt --repo-owner=YOUR_GITHUB_USER \
  --branch-pattern="^main$" \
  --build-config=cloudbuild.yaml
```

The pipeline (defined in `cloudbuild.yaml`):
- Builds the Docker image and tags it with the commit SHA
- Pushes to Artifact Registry
- Deploys to Cloud Run with all production settings
Option C: Manual deploy with gcloud

```bash
gcloud config set project YOUR_PROJECT_ID
gcloud run deploy cookcam \
  --source . \
  --port 8080 \
  --allow-unauthenticated \
  --set-env-vars GEMINI_API_KEY=your-key \
  --region us-central1 \
  --timeout 3600
```

Recipes and sessions persist automatically on Cloud Run (Firestore is auto-detected).
For local development with Firestore:
- Create a Firestore database in your GCP project (Native mode)
- Set credentials:

  ```bash
  export GOOGLE_CLOUD_PROJECT=your-project-id
  export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
  ```

To skip Firestore and use in-memory storage, set:

```bash
export FIRESTORE_DISABLED=true
```

| Component | Technology |
|---|---|
| AI Model | Gemini 2.5 Flash Native Audio (Live API) |
| Backend | Python, FastAPI, WebSockets |
| Frontend | Vanilla HTML/CSS/JS, Web Audio API, AudioWorklet |
| Cloud | Google Cloud Run |
| Storage | Cloud Firestore (with in-memory fallback) |
| SDK | Google GenAI Python SDK |
```
ntt/
├── server.py               # FastAPI backend + Gemini Live API proxy
├── requirements.txt        # Python dependencies
├── Dockerfile              # Cloud Run container
├── cloudbuild.yaml         # CI/CD pipeline — automated build & deploy
├── deploy.sh               # One-command deploy script
├── .env.example            # Environment template
├── README.md
├── api/
│   ├── gemini_client.py    # Gemini Live API WebSocket client
│   ├── tools.py            # Gemini function-call tool declarations & handlers
│   ├── firestore_client.py # Firestore persistence (recipes, sessions)
│   ├── rate_limiter.py     # Token-bucket rate limiter
│   └── session_store.py    # Session management (Firestore-backed)
├── core/
│   └── config.py           # API keys, model config, system prompt
├── utils/
│   └── logger.py           # Structured logging
└── static/
    ├── index.html          # Main web page
    ├── css/styles.css      # Premium dark theme
    ├── js/app.js           # Frontend logic (WebSocket, webcam, audio)
    └── audio-processor.js  # AudioWorklet for PCM mic capture
```
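The token-bucket limiter in `api/rate_limiter.py` can be sketched roughly as follows — a minimal sketch, not the actual implementation; class and parameter names are illustrative:

```python
import time

class TokenBucket:
    """Allow up to `capacity` actions in a burst, refilled at `rate` tokens/second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=2, rate=0.0)  # rate=0: no refill, for demonstration
results = [bucket.allow() for _ in range(3)]  # first two pass, third is throttled
```

A bucket like this sits naturally in front of the WebSocket handler, where it can cap per-session frame or message rates before they reach the Gemini API.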
- Camera captures video at 768×768 JPEG, sent at 1 FPS to Gemini
- Microphone captures audio, downsampled to 16kHz PCM via AudioWorklet
- FastAPI proxies both streams to Gemini Live API via WebSocket
- Gemini processes vision + audio, returns voice responses at 24kHz
- Browser plays AI audio seamlessly with scheduled buffer playback
- Interruption: speak while AI is talking → it stops and responds to you
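In the browser, the 16 kHz downsampling happens inside `audio-processor.js`; the same decimation idea is shown below in Python for illustration (a naive nearest-sample resampler under the assumption of a 48 kHz source, not the production code):

```python
def downsample(samples: list[float], src_rate: int = 48000, dst_rate: int = 16000) -> list[int]:
    """Resample float samples in [-1, 1] to signed 16-bit PCM at dst_rate
    by picking the nearest source index (simple decimation)."""
    ratio = src_rate / dst_rate
    out = []
    i = 0.0
    while int(i) < len(samples):
        s = max(-1.0, min(1.0, samples[int(i)]))  # clamp before quantizing
        out.append(int(s * 32767))                # float -> 16-bit integer
        i += ratio
    return out

# 6 samples at 48 kHz -> 2 samples at 16 kHz (every 3rd sample kept)
pcm = downsample([0.0, 1.0, -1.0, 0.5, 0.25, 0.0])
```

A production worklet would typically low-pass filter before decimating to avoid aliasing; the sketch only shows the rate conversion and 16-bit quantization.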
- Tool Calling / Function Calling — Wire up Gemini tool use (`TOOLS_LIST` in `api/tools.py`) for structured actions like setting timers, saving recipes, and unit conversions
- Timer System — Voice-activated cooking timers ("Set a timer for 12 minutes") with audio alerts and on-screen countdown
- Recipe Memory & History — Save completed recipes and session transcripts so users can revisit past cooks
- Multi-Step Recipe Tracking — Visual step indicator on the UI showing current step, next step, and overall progress
- Ingredient Checklist — Auto-generate an ingredient list from the chosen recipe with tap-to-check-off support
- Dark/Light Theme Toggle — Add a theme switcher in the top bar
- Text Chat Input — Allow users to type messages in addition to voice (useful in noisy kitchens)
- Recipe Card Display — Render structured recipe cards (ingredients, steps, servings, time) in the activity panel
- Responsive Tablet Layout — Optimize the two-panel layout for tablet-sized screens
- Accessibility (a11y) — ARIA labels, keyboard navigation, screen reader support for all controls
- Loading / Skeleton States — Improve perceived performance with skeleton UI while connecting
- Nutritional Info — Ask "How many calories is this?" and get per-serving nutritional estimates
- Dietary Preference Profiles — Let users set preferences (vegetarian, gluten-free, nut allergy) so suggestions are always relevant
- Skill-Level Adaptation — Detect beginner vs. experienced cooks and adjust instruction verbosity
- Multi-Language Support — Respond in the user's preferred language (leverage Gemini's multilingual capabilities)
- Plating & Presentation Tips — AI suggests plating ideas based on what it sees
- Session Persistence — Store session state so users can reconnect after a dropped connection
- Rate Limiting & Auth — Add API key rotation, per-user rate limits, and optional user authentication
- Error Recovery & Reconnect — Auto-reconnect WebSocket with exponential backoff on connection drops
- Logging Dashboard — Structured logging with request tracing; optional integration with Cloud Logging
- Health Check Endpoint — Add a `/health` route for Cloud Run liveness/readiness probes
- PWA Support — Add a web app manifest and service worker so CookCam can be installed on mobile home screens
- Share Recipe — Generate a shareable link or image card of the recipe cooked
- Grocery List Export — Export missing ingredients to a shopping list (Google Keep, Apple Reminders, etc.)
- Smart Display Mode — A hands-free, large-font "kitchen display" mode optimized for viewing from a distance
- Unit Tests — Add pytest tests for `GeminiLiveClient`, config loading, and tool functions
- Frontend Tests — Basic integration tests for WebSocket connection, audio pipeline, and UI state
- CI/CD Pipeline — GitHub Actions workflow for linting, testing, and auto-deploying to Cloud Run
- Load Testing — Verify concurrent WebSocket session limits and Gemini API quota handling
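For the tool-calling item above, a declaration-plus-handler pair for a timer tool could look roughly like this. It is a sketch only: the tool name `set_timer`, the handler, and the result shapes are hypothetical — the real declarations belong in `api/tools.py`:

```python
# Hypothetical JSON-schema-style function declaration for Gemini tool use
SET_TIMER_TOOL = {
    "name": "set_timer",
    "description": "Start a cooking timer the UI will display and announce.",
    "parameters": {
        "type": "object",
        "properties": {
            "minutes": {"type": "number", "description": "Timer length in minutes"},
            "label": {"type": "string", "description": "What the timer is for, e.g. 'pasta'"},
        },
        "required": ["minutes"],
    },
}

def handle_set_timer(args: dict) -> dict:
    """Backend handler: validate args, emit a UI update for the browser,
    and return a result payload to send back to Gemini."""
    seconds = int(float(args["minutes"]) * 60)
    return {
        "ui_update": {"type": "timer", "seconds": seconds, "label": args.get("label", "timer")},
        "tool_result": {"status": "started", "seconds": seconds},
    }

result = handle_set_timer({"minutes": 12, "label": "fries"})
```

The split into `ui_update` and `tool_result` mirrors the function-calling flow described earlier: the UI update goes to the browser over the WebSocket while the tool result is returned to Gemini so it can narrate the outcome.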
Built for the Gemini Live Hackathon 2026 🏆
🔗 GitHub: github.com/shafisma/cookcam
