Feature/voice agent#386

Open
Co-vengers wants to merge 53 commits into GetBindu:main from Co-vengers:feature/voice-agent

@Co-vengers
Contributor

Voice Agent Extension — Progress & Documentation PR

Overview

This PR introduces the initial implementation of the Voice Agent Extension for Bindu, enabling real-time voice conversations between users and agents. The extension integrates backend, frontend, and testing components, following the architecture and plan outlined in docs/VOICE_AGENT_PLAN.md.


What’s Included

Backend

  • New voice extension module: bindu/extensions/voice/ with:
    • __init__.py, voice_agent_extension.py, service_factory.py, pipeline_builder.py, session_manager.py, agent_bridge.py, audio_config.py
  • Endpoints: bindu/server/endpoints/voice_endpoints.py (REST + WebSocket)
  • Settings: bindu/settings.py updated with VoiceSettings
  • App integration: bindu/server/applications.py updated for conditional voice route registration and session manager
  • Capabilities: bindu/utils/capabilities.py updated for voice extension helpers
  • Penguin integration: bindu/penguin/bindufy.py updated to accept voice config and add the extension

Frontend

  • Voice UI and client:
    • frontend/src/lib/services/voice-client.ts: WebSocket client, audio capture/playback
    • frontend/src/lib/stores/voice.ts: Svelte stores for voice state and transcripts
    • frontend/src/lib/components/voice/VoiceCallPanel.svelte, VoiceCallButton.svelte, LiveTranscript.svelte: UI components for voice session
  • Integration: Existing chat and agent message handler files updated for voice support

Tests

  • Unit tests for all major backend components:
    • tests/unit/extensions/voice/test_voice_extension.py
    • tests/unit/extensions/voice/test_session_manager.py
    • tests/unit/extensions/voice/test_service_factory.py
    • tests/unit/extensions/voice/test_agent_bridge.py
    • tests/unit/extensions/voice/test_voice_endpoints.py

Examples & Docs

  • Example agent: examples/voice-agent/main.py, .env.example, and README.md
  • Plan: docs/VOICE_AGENT_PLAN.md (implementation plan)

Current Progress

  • All major backend, frontend, and test files are present and staged.
  • Integration into the main app and settings is in progress.
  • Endpoints and frontend integration are actively being refined.
  • Unit tests for the extension and its components are included.
  • Example agent and configuration are provided.
  • Documentation plan is present; full user-facing docs (docs/VOICE.md) are planned.

How to Test

  1. Install dependencies:
    • Ensure pipecat-ai[deepgram,elevenlabs,silero] and websockets are installed (see pyproject.toml voice group).
  2. Set environment variables:
    • VOICE__STT_API_KEY, VOICE__TTS_API_KEY (see .env.example)
  3. Run backend tests:
    • uv run pytest tests/unit/extensions/voice/ -v
  4. Run frontend:
    • Start the Svelte frontend and verify the voice call UI appears for voice-enabled agents.
  5. Manual E2E:
    • Start a voice session from the UI, speak, and verify agent responses and transcripts.
  6. Check task persistence:
    • After a session, verify conversation history via GET /tasks/get.
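
Step 2 can be checked before starting the backend. The following is a minimal sketch, assuming only the two variable names listed above; the helper `missing_voice_vars` is hypothetical and not part of this PR.

```python
import os

# Variable names come from step 2 above; the helper itself is hypothetical.
REQUIRED_VOICE_VARS = ("VOICE__STT_API_KEY", "VOICE__TTS_API_KEY")


def missing_voice_vars(env=None):
    """Return the required voice variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VOICE_VARS if not env.get(name, "").strip()]


# Typical use before running the example agent:
#   missing = missing_voice_vars()
#   if missing:
#       raise SystemExit("Missing voice settings: " + ", ".join(missing))
```

Passing a plain dict instead of `os.environ` keeps the check easy to unit test.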

Next Steps & Improvements

  • Complete and verify all items in the implementation plan checklist (see docs/VOICE_AGENT_PLAN.md)
  • Finalize and publish user documentation (docs/VOICE.md)
  • Polish frontend UI/UX and error handling
  • Expand test coverage (integration, E2E, edge cases)
  • Lint and format: uv run pre-commit run --all-files
  • Optimize session cleanup and resource management
  • Add more example agents and configuration scenarios
  • Prepare for future extensions (telephony, WebRTC, multi-language, etc.)

For questions or feedback, please comment on this PR.

@chandan-1427
Contributor

Hey, thanks for working on adding voice support — really appreciate the effort here.

I went through the implementation and there are a few areas we’ll need to address before merging:

  • Multi-worker compatibility: The current session handling relies on a local store, which won’t work reliably with Uvicorn’s multi-worker setup. We’ll need to move this to a centralized solution (e.g., Redis) to avoid state inconsistencies.

  • Transport & latency: The current flow is based on HTTP requests. For voice interactions, we should aim for a real-time streaming approach (like WebSockets or SSE) to reduce latency and improve responsiveness.

  • Base branch alignment: It looks like this was built on an older version of Bindu. There are conflicts with recent changes, so rebasing onto the latest main would help before proceeding.
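
To illustrate the multi-worker point, here is a minimal sketch of session state kept in a shared key-value store rather than in per-process memory. All names (`FakeKV`, `SharedSessionStore`, the key prefix, the TTL) are hypothetical and not taken from this PR. `FakeKV` is a dict-backed stand-in so the example runs without a server. It mimics the `get`, `set(name, value, ex=...)`, and `delete` calls of a redis-py client, which could take its place in a real deployment.

```python
import json


class FakeKV:
    """Dict-backed stand-in for a shared key-value server (illustration only)."""

    def __init__(self):
        self._data = {}

    def set(self, name, value, ex=None):
        # `ex` (TTL in seconds) is accepted but ignored by this fake.
        self._data[name] = value

    def get(self, name):
        return self._data.get(name)

    def delete(self, name):
        self._data.pop(name, None)


class SharedSessionStore:
    """Keeps voice session state in an external store, not in worker memory."""

    def __init__(self, kv, ttl_seconds=300, prefix="voice:session:"):
        self._kv = kv
        self._ttl = ttl_seconds
        self._prefix = prefix

    def save(self, session_id, state):
        self._kv.set(self._prefix + session_id, json.dumps(state), ex=self._ttl)

    def load(self, session_id):
        raw = self._kv.get(self._prefix + session_id)
        return None if raw is None else json.loads(raw)

    def end(self, session_id):
        self._kv.delete(self._prefix + session_id)


# Two store objects stand in for two Uvicorn workers sharing one backend.
backend = FakeKV()
worker_a = SharedSessionStore(backend)
worker_b = SharedSessionStore(backend)

worker_a.save("abc123", {"state": "active", "context_id": "ctx-1"})
seen_by_b = worker_b.load("abc123")   # visible to the other worker
worker_b.end("abc123")
gone_for_a = worker_a.load("abc123")  # None once any worker ends the session
```

With a plain module-level dict per worker, `worker_b` would not see the session created by `worker_a`, which is the inconsistency described above.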

Looking forward to the update!

Co-vengers force-pushed the feature/voice-agent branch from 6b8acb4 to 20e3419 on March 24, 2026, 14:52
@Co-vengers
Contributor Author

Thanks for the review. I'll look into the improvements as suggested.

…Bindu#353)

Worker accessed task_operation["_current_span"] but scheduler now sends
primitive trace_id/span_id strings. Add _reconstruct_span() helper to
rebuild a NonRecordingSpan from hex-encoded IDs with graceful fallback.
Replace math.inf buffer size with a constant of 100 to prevent
unbounded memory growth while still allowing task enqueue before the
worker loop is ready.
Add SpanContext, TraceFlags, NonRecordingSpan, and INVALID_SPAN_CONTEXT
mocks. Register opentelemetry.trace.span submodule so worker imports
resolve in the test environment.
…audio config, pipeline builder, service factory, session manager, extension class)
Co-vengers force-pushed the feature/voice-agent branch from 5592bd1 to 51e7e91 on March 28, 2026, 08:52
- Add asyncio.Lock to AgentBridgeProcessor to prevent concurrent processing race conditions
- Change pipeline_builder.build_voice_pipeline to synchronous (non-async entry point)
- Add lock-protected state updates in VoiceSessionManager.update_state
- Fix conversation history to pass copy to avoid mutation during retry
- Add proper host detection for WebSocket URL with fallback to client host
- Wrap audio buffer flush in try-catch to prevent transcription errors from crashing cleanup
- Wrap session state updates and WebSocket close in individual try-catch blocks
- Ensure robust cleanup on WebSocket disconnect regardless of errors
- Change from await send() to send_nowait() to fail fast when buffer is full
- Catch anyio.WouldBlock and raise clear RuntimeError with buffer full message
- Prevents API handler from hanging indefinitely on full task queue
- Adds documentation about non-blocking behavior and buffer capacity
- Constrain grpcio to >=1.67.0,<1.78.0 to avoid version conflicts
- Add note about bindu:// internal routing scheme in VoiceSettings
- Update both base dependencies and grpc extra
ChatInput:
- Add error handling for file reading with user alert fallback
- Clear files and input after successful submit

ChatWindow:
- Add submit function with proper error handling and cleanup
- Fix voice error state clearing when error resolves

LiveTranscript:
- Use item.ts as key instead of index for proper list updates

VoiceCallButton:
- Add disabled state styling and correct aria-label

VoiceCallPanel:
- Add proper audio playback error handling and cleanup on unmount

Mime types:
- Expand DOCUMENT_MIME_ALLOWLIST to include more document types

Voice client:
- Return boolean from sendControl to only update state on success
- Fix mute/unmute to only set state when control message sent

Chat store:
- Only add message if text part exists and has non-empty text

Voice store:
- Clean up existing client before creating new one
- Attach handlers only after successful session start
- Add proper error handling and state reset on failure
Quiz agent:
- Validate OPENROUTER_API_KEY is defined before creating LLM
- Add validation for user input with proper error handling

Voice example:
- Mask API key in debug output instead of exposing it

Test stubs:
- Create proper NonRecordingSpan class that returns False for is_recording

Test agent bridge:
- Add assertions for assistant messages in history to verify response handling
- Update dependency version to use ~= for compatibility
- Add security and authentication settings (session auth, rate limiting, etc.)
- Add privacy and compliance settings (transcript storage, retention, consent)
- Add production-grade AgentBridgeProcessor flow description
- Add WebSocket security requirements and frame types
- Update audio capture pipeline with more detailed configuration
- Add integration, E2E, performance, and browser test sections
- Update example voice handler signature to match A2A protocol
- Add security, performance, and browser compatibility checklists
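
The "mask API key in debug output" change listed for the voice example above can be sketched as follows. This is a minimal illustration, not the code from this PR; the function name `mask_secret` and the choice to reveal the last four characters are assumptions.

```python
def mask_secret(value, visible=4):
    """Return a log-safe form of a secret, showing only the last few characters."""
    if not value:
        return "<unset>"
    if len(value) <= visible:
        # Too short to reveal anything safely; hide it entirely.
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


print(mask_secret("sk-1234567890"))  # *********7890
print(mask_secret(""))               # <unset>
```

Showing a short suffix is usually enough to tell which key is loaded without exposing it in logs.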
@Co-vengers
Contributor Author

Hey @chandan-1427
I have implemented Redis-backed multi-worker voice session management and real-time WebSocket streaming for low-latency interactions, along with config/lifecycle wiring and pre-commit cleanup.
