Goal
Luna joins the room as a real audio participant — appears in the participant grid, listens to the conversation, and speaks to everyone simultaneously. No @luna mention needed.
This is the only voice feature worth building. Partial approaches (client-side TTS, KV audio relay) would deliver a worse experience than plain text replies, so they are not worth implementing.
What's blocking this
No server-side RTK participant API
RTK's Add Participant API (POST .../meetings/{id}/participants) only mints auth tokens for browser SDK clients. There is no server-side media injection path:
- No participant_type: "server" or bot flag
- No audio inject endpoint
- No server SDK (@cloudflare/realtimekit-node does not exist)
@cloudflare/voice SFU cannot bridge into RTK rooms
@cloudflare/voice uses raw Cloudflare Calls SFU (rtc.live.cloudflare.com/v1). RTK is built on top of this same SFU infrastructure, but RTK never exposes the underlying SFU App ID — there is no way to inject a foreign SFU session into an RTK meeting room.
CF private pipeline not public
PR cloudflare/agents#785 implemented a RealtimeKitTransport that joins an RTK room server-side using a private agents.realtime.cloudflare.com pipeline. It works, but it was closed in May 2026 without merging (missing tests, hardcoded internals, no client example). The pipeline service is gated behind a private beta header and is not publicly documented.
Known workaround: Headless Chrome bot
This is the Recall.ai / Daily Bots pattern: launch a headless Chrome process that joins the RTK room using the standard Web SDK, intercepts room audio via the Web Audio API, exchanges PCM with the BotSession DO over a WebSocket, and plays TTS output back through an AudioContext.
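The PCM exchange step needs a wire format, because the Web Audio API hands out Float32 samples in the range -1..1. The sketch below assumes 16-bit little-endian PCM over the WebSocket; the source does not specify a format, and the function names `floatToPcm16` and `pcm16ToFloat` are hypothetical.

```typescript
// Assumption: the bot and the BotSession DO agree on 16-bit little-endian PCM.
// Neither the format nor these helper names come from the original design.

// Encode Web Audio samples (Float32, nominally -1..1) as 16-bit LE PCM.
function floatToPcm16(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(scaled), true);
  }
  return buffer;
}

// Decode 16-bit LE PCM (e.g. TTS output) back to Float32 for an AudioContext.
function pcm16ToFloat(buffer: ArrayBuffer): Float32Array {
  const view = new DataView(buffer);
  const out = new Float32Array(Math.floor(buffer.byteLength / 2));
  for (let i = 0; i < out.length; i++) {
    const sample = view.getInt16(i * 2, true);
    out[i] = sample < 0 ? sample / 0x8000 : sample / 0x7fff;
  }
  return out;
}
```

In the browser, the encoded buffer could be passed straight to `WebSocket.send`, and decoded TTS frames copied into an `AudioBuffer` for playback.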
Why we're not doing this now:
WebRTC requires UDP for the media plane (ICE/SRTP). Cloudflare Containers do not support inbound UDP — only HTTP/WebSocket. So the headless Chrome process must run on an external server (Fly.io, Cloud Run, etc.), communicating with the DO over the network. This adds:
- An external service to maintain
- Cross-network latency between container and DO
- Operational complexity that isn't justified while a cleaner CF-native path is on the roadmap
What would unblock a clean implementation
Any one of:
- CF opens agents.realtime.cloudflare.com: the private pipeline from PR #785 becomes public. This is clearly the direction CF is heading.
- RTK exposes underlying SFU App ID: lets a DO use the SFU WebSocket adapter (/adapters/websocket/new) to push audio directly into a meeting's SFU session.
- CF Containers get UDP support: headless Chrome bot becomes fully CF-native, no external server needed.
- RTK adds WHIP ingest: any server with a real network stack can push audio without implementing RTK's proprietary signaling.
Architecture (when unblocked)
All participants' audio
→ STT (OpenAI gpt-4o-transcribe, multilingual zh+en)
→ LLM (glm-4.7-flash, reuse BotSession history + rate limiting)
→ TTS (Azure zh-CN-XiaoxiaoNeural, ~$0.0012/turn)
→ Luna's audio track broadcast to all participants
Luna uses the existing BotSession Durable Object for conversation history, rate limiting, and LLM calls — no new AI infrastructure needed.
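The STT → LLM → TTS flow above can be sketched as one turn handler. This is a minimal illustration rather than the planned implementation: the interfaces `SpeechToText`, `ChatModel`, `TextToSpeech` and the class `VoiceTurnPipeline` are hypothetical names, and the in-memory history stands in for what the BotSession Durable Object already stores.

```typescript
// Hypothetical stage interfaces; real STT/LLM/TTS clients would sit behind these.
type Turn = { role: "user" | "assistant"; text: string };

interface SpeechToText {
  transcribe(pcm: ArrayBuffer): Promise<string>;
}
interface ChatModel {
  reply(history: Turn[], userText: string): Promise<string>;
}
interface TextToSpeech {
  synthesize(text: string): Promise<ArrayBuffer>;
}

class VoiceTurnPipeline {
  private stt: SpeechToText;
  private llm: ChatModel;
  private tts: TextToSpeech;
  // Stand-in for the conversation history kept by BotSession.
  private history: Turn[] = [];

  constructor(stt: SpeechToText, llm: ChatModel, tts: TextToSpeech) {
    this.stt = stt;
    this.llm = llm;
    this.tts = tts;
  }

  // Runs one turn: room audio in, Luna's audio out (or null if nothing was said).
  async handleUtterance(pcm: ArrayBuffer): Promise<ArrayBuffer | null> {
    const text = (await this.stt.transcribe(pcm)).trim();
    if (text === "") return null; // empty transcript: skip the LLM and TTS calls
    const answer = await this.llm.reply(this.history, text);
    this.history.push({ role: "user", text }, { role: "assistant", text: answer });
    return this.tts.synthesize(answer);
  }

  getHistory(): readonly Turn[] {
    return this.history;
  }
}
```

Keeping each stage behind an interface means a provider can be swapped without touching the turn logic, and the pipeline can be exercised with fakes before any media path into RTK exists.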
Cost estimate (when built)
| Component | Cost |
| --- | --- |
| STT (streaming, text-triggered only) | $0 (no always-on mic monitoring) |
| LLM glm-4.7-flash | ~$0.00002/turn |
| TTS Azure XiaoxiaoNeural | ~$0.0012/turn |
| Total | ~$0.0013/turn |
Platform-funded at current scale. BYOK for STT if always-on listening is added later.
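As a quick check of the arithmetic, the per-turn figures can be summed and projected to a monthly volume. The per-turn numbers are copied from the estimate above; the volume of 10,000 turns in the usage note and the names `PerTurnCostUsd`, `totalPerTurn`, and `monthlyCost` are assumptions for illustration only.

```typescript
// Per-turn costs in USD, taken from the cost estimate in this document.
interface PerTurnCostUsd {
  stt: number;
  llm: number;
  tts: number;
}

const estimate: PerTurnCostUsd = { stt: 0, llm: 0.00002, tts: 0.0012 };

// Sum of all components for a single turn.
function totalPerTurn(cost: PerTurnCostUsd): number {
  return cost.stt + cost.llm + cost.tts;
}

// Projected spend for a given number of turns (the volume is a hypothetical input).
function monthlyCost(cost: PerTurnCostUsd, turnsPerMonth: number): number {
  return totalPerTurn(cost) * turnsPerMonth;
}
```

With these inputs the per-turn sum is about $0.00122, slightly under the rounded ~$0.0013 total, and `monthlyCost(estimate, 10000)` comes to roughly $12. TTS dominates, so always-on STT would be the main thing that changes the picture.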
Status
Parked until CF ships a clean server-side path. Revisit when any of the unblocking conditions above is met.