
Luna — always-on AI voice participant #56

@madawei2699

Goal

Luna joins the room as a real audio participant: appearing in the participant grid, listening to the conversation, and speaking to everyone simultaneously. No @luna mention is needed.

This is the only voice feature worth building. Partial approaches (client-side TTS, KV audio relay) are not worth implementing because they degrade UX compared to plain text replies.


What's blocking this

No server-side RTK participant API

RTK's Add Participant API (POST .../meetings/{id}/participants) only mints auth tokens for browser SDK clients. There is no server-side media injection path:

  • No participant_type: "server" or bot flag
  • No audio inject endpoint
  • No server SDK (@cloudflare/realtimekit-node does not exist)

@cloudflare/voice SFU cannot bridge into RTK rooms

@cloudflare/voice uses raw Cloudflare Calls SFU (rtc.live.cloudflare.com/v1). RTK is built on top of this same SFU infrastructure, but RTK never exposes the underlying SFU App ID — there is no way to inject a foreign SFU session into an RTK meeting room.

CF private pipeline not public

PR cloudflare/agents#785 implemented a RealtimeKitTransport that joins an RTK room server-side using a private agents.realtime.cloudflare.com pipeline. It works, but was closed in May 2026 without merging (missing tests, hardcoded internals, no client example). The pipeline service is gated behind a private beta header and is not publicly documented.


Known workaround: Headless Chrome bot

This is the Recall.ai / Daily Bots pattern: launch a headless Chrome process that joins the RTK room using the standard Web SDK, intercept room audio via the Web Audio API, exchange PCM with the BotSession DO over a WebSocket, and play TTS output back through an AudioContext.
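The PCM exchange step can be sketched as follows. The Web Audio API hands out Float32 samples in the range [-1, 1], so the bot would need to convert them before sending frames over the WebSocket, and convert TTS output back before playback. This is a minimal sketch that assumes 16-bit signed PCM (the issue does not specify a sample format); `floatToPcm16` and `pcm16ToFloat` are hypothetical helper names, not part of any SDK.

```typescript
// Hypothetical helpers: the 16-bit PCM format is an assumption, not stated in the issue.

// Convert Web Audio samples (Float32, nominally in [-1, 1]) to 16-bit signed PCM.
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first: processing nodes can produce values slightly outside [-1, 1].
    const s = Math.max(-1, Math.min(1, samples[i]));
    // The negative int16 range is one step wider than the positive range.
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}

// Convert 16-bit signed PCM (for example, TTS output) back to Float32 for playback.
function pcm16ToFloat(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] < 0 ? pcm[i] / 0x8000 : pcm[i] / 0x7fff;
  }
  return out;
}
```

The resulting `Int16Array` can be sent as a binary WebSocket message via its underlying `buffer`; framing, sample rate, and channel layout would still have to be agreed between the bot and the DO.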

Why we're not doing this now:

WebRTC requires UDP for the media plane (ICE/SRTP). Cloudflare Containers do not support inbound UDP — only HTTP/WebSocket. So the headless Chrome process must run on an external server (Fly.io, Cloud Run, etc.), communicating with the DO over the network. This adds:

  • An external service to maintain
  • Cross-network latency between the external server and the DO
  • Operational complexity that isn't justified while a cleaner CF-native path is on the roadmap

What would unblock a clean implementation

Any one of:

  • CF opens agents.realtime.cloudflare.com — the private pipeline from PR #785 becomes public. This is clearly the direction CF is heading.
  • RTK exposes underlying SFU App ID — lets a DO use the SFU WebSocket adapter (/adapters/websocket/new) to push audio directly into a meeting's SFU session.
  • CF Containers get UDP support — headless Chrome bot becomes fully CF-native, no external server needed.
  • RTK adds WHIP ingest — any server with a real network stack can push audio without implementing RTK's proprietary signaling.

Architecture (when unblocked)

All participants' audio
  → STT (OpenAI gpt-4o-transcribe, multilingual zh+en)
  → LLM (glm-4.7-flash, reuse BotSession history + rate limiting)
  → TTS (Azure zh-CN-XiaoxiaoNeural, ~$0.0012/turn)
  → Luna's audio track broadcast to all participants

Luna uses the existing BotSession Durable Object for conversation history, rate limiting, and LLM calls — no new AI infrastructure needed.
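One way to picture the flow above is as a single turn function with the three stages injected. This is only a sketch: `VoiceStages` and `runTurn` are hypothetical names, the stage signatures are assumptions, and the real STT, LLM, and TTS calls (including the BotSession history and rate limiting) would sit behind these interfaces.

```typescript
// Hypothetical stage interfaces; none of these names come from a real SDK.
type Pcm = Int16Array;

interface VoiceStages {
  transcribe(audio: Pcm): Promise<string>; // STT over the room's mixed audio
  reply(transcript: string): Promise<string>; // LLM call, e.g. routed through BotSession
  synthesize(text: string): Promise<Pcm>; // TTS producing Luna's audio
}

// Runs one turn: room audio in, Luna's audio out.
// Returns null when there is nothing to say, so no track data is broadcast.
async function runTurn(stages: VoiceStages, audio: Pcm): Promise<Pcm | null> {
  const transcript = (await stages.transcribe(audio)).trim();
  if (transcript === "") return null; // silence or unintelligible audio

  const text = (await stages.reply(transcript)).trim();
  if (text === "") return null; // e.g. rate limited, or the model chose not to answer

  return stages.synthesize(text);
}
```

Keeping the stages behind an interface means the same turn logic could be driven by any of the transports listed under the unblocking conditions, and it can be exercised with stubs before a server-side audio path exists.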

Cost estimate (when built)

Component                               Cost
STT (streaming, text-triggered only)    $0 (no always-on mic monitoring)
LLM glm-4.7-flash                       ~$0.00002/turn
TTS Azure XiaoxiaoNeural                ~$0.0012/turn
Total                                   ~$0.0013/turn

Platform-funded at current scale. BYOK for STT if always-on listening is added later.
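As a quick check of the arithmetic, the per-turn figures can be summed and scaled. The three line items add up to about $0.00122 per turn, in line with the approximate total in the table, and TTS accounts for nearly all of it. The sketch below copies the per-turn numbers from the table; `turnsPerMonth` is a hypothetical input, since the issue gives no usage figures.

```typescript
// Per-turn figures copied from the cost table above (USD).
const COST_PER_TURN = { stt: 0, llm: 0.00002, tts: 0.0012 };

// Sum of all components for a single turn.
function costPerTurn(): number {
  return COST_PER_TURN.stt + COST_PER_TURN.llm + COST_PER_TURN.tts;
}

// turnsPerMonth is a hypothetical input; the issue gives no usage numbers.
function monthlyCost(turnsPerMonth: number): number {
  return costPerTurn() * turnsPerMonth;
}

// Fraction of the per-turn cost that comes from TTS.
function ttsShare(): number {
  return COST_PER_TURN.tts / costPerTurn();
}
```

At a hypothetical 10,000 turns per month this works out to roughly $12.20, which is why the estimate is dominated by the TTS price rather than the LLM.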


Status

Parking until CF ships a clean server-side path. Revisit when any of the unblocking conditions above are met.

Labels: enhancement (New feature or request)
