Room→SIP egress mutes continuous low-level audio in a periodic rebuffer cycle (fine to WebRTC subscribers, 55-64% loss on PSTN leg) #722

@gmunhoz0810

Summary

Continuous low-level background audio (office-ambience bed mixed under an AI agent's voice) is delivered intact to WebRTC subscribers but arrives on the PSTN leg with 55–64% of its frames replaced by digital silence, in a metronomic pattern (~100–150 ms of audio passing every ~600–750 ms). Agent speech is never affected — only the quiet continuous bed between utterances. Callers hear the background "pumping" in and out whenever the agent stops talking.

Setup

  • LiveKit Cloud + SIP (inbound), Twilio Elastic SIP Trunking
  • Python agent (pipecat) publishing 16 kHz mono Opus with default TrackPublishOptions and dtx: false; the track carries TTS speech plus a constant ambience bed (about -38 dBFS effective, measured to never fall below -45 dBFS in any 50 ms window)
  • Inbound trunk: krisp_enabled: false, no media encryption
  • Dispatch rule: dispatch_rule_individual

Evidence (same-call dual tap)

For a single PSTN call we recorded simultaneously:

  • Tap A — a hidden room participant subscribing to the agent track (what the SFU forwards)
  • Tap B — Twilio dual-channel call recording (what LiveKit SIP egress sends to the trunk)

50 ms RMS windows, "bed" = -60..-30 dBFS, "dropout" = below -60 dBFS:

| Tap | Non-speech windows | Bed present | Dropout |
|---|---|---|---|
| Tap A (room subscriber) | 688 | 96% | 4% |
| Tap B (PSTN, same call) | 634 | 28% | 61% |
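
The window classification above can be sketched as follows. This is a minimal illustration, not the analysis script mentioned in this issue: the function names (`rms_dbfs`, `classify`, `window_stats`) and the "other" label for windows louder than -30 dBFS are assumptions made here, and the sketch does not separate speech from non-speech windows, a step the numbers above rely on but that is not described.

```python
import math
from collections import Counter

FULL_SCALE = 32768.0  # reference level for signed 16-bit PCM


def rms_dbfs(window):
    """RMS level of a list of int16 samples, in dB relative to full scale."""
    if not window:
        return float("-inf")
    mean_sq = sum(s * s for s in window) / len(window)
    if mean_sq == 0:
        return float("-inf")  # pure digital silence
    return 10.0 * math.log10(mean_sq / (FULL_SCALE * FULL_SCALE))


def classify(level_db):
    """Thresholds from the issue: dropout below -60 dBFS, bed in -60..-30 dBFS."""
    if level_db < -60.0:
        return "dropout"
    if level_db <= -30.0:
        return "bed"
    return "other"  # hypothetical label for anything louder than the bed range


def window_stats(samples, sample_rate, window_ms=50):
    """Count labels over consecutive non-overlapping windows (partial tail ignored)."""
    size = sample_rate * window_ms // 1000
    counts = Counter()
    for start in range(0, len(samples) - size + 1, size):
        counts[classify(rms_dbfs(samples[start:start + size]))] += 1
    return counts


# Synthetic check at 16 kHz: 3 quiet windows, 2 silent windows, 1 loud window.
demo = [328] * 2400 + [0] * 1600 + [20000] * 800
stats = window_stats(demo, 16000)
print(dict(stats))
```

At 16 kHz each 50 ms window holds 800 samples; a real capture would first be decoded to int16 mono before being passed to `window_stats`.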

Tap A contains unbroken 5+ second stretches of continuous bed. Tap B never has more than ~150 ms of bed at a time outside agent speech; the rest is digital silence (-90 dBFS), alternating with the bed in a regular ~750 ms cycle. Levels confirm Tap A is the real bed (median -41.8 dBFS tracking the asset's dynamics), not decoder comfort noise.

Three separate probe calls reproduce the same pattern at the same magnitude.

Codec-independent

Forcing the SIP leg to PCMU via the dispatch rule's media config (only_listed_codecs: true, codecs: [PCMU/8000]) changed nothing (61% vs 64% dropout), so this is not Opus DTX on the SIP leg.

Where it seems to come from

The behavior matches the mixer input state machine in media-sdk (mixer/mixer.go):

func (i *Input) readSample(bufMin int, out msdk.PCM16Sample) (int, error) {
    if i.buffering {
        if i.buf.Len() < bufMin {
            return 0, nil // keep buffering -> mixer emits silence for this input
        }
        i.buffering = false
    }
    n, err := i.buf.Read(out)
    if n == 0 {
        i.buffering = true // starving; pause the input and start buffering again
    }
    ...
}

A momentary starvation mutes the input until bufMin re-accumulates, then it plays briefly and starves again — i.e. a short producer hiccup is amplified into a repeating mute/burst cycle on the phone leg. During TTS speech the upstream buffers are full (TTS delivers faster than real time), which would explain why speech never chops while the just-in-time-paced bed does.
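
To make the amplification concrete, here is a toy Python model of the branch logic quoted above. It is a sketch under stated assumptions, not the media-sdk implementation: the class and function names, the `rebuffer` switch, `buf_min=3`, the one-frame-per-mixer-tick pacing, and the initial buffering state are all hypothetical, and the elided remainder of `readSample` is not modeled.

```python
from collections import deque


class Input:
    """Toy model of the quoted state machine; one unit = one audio frame."""

    def __init__(self, buf_min, rebuffer=True):
        self.buf = deque()
        self.buf_min = buf_min
        self.rebuffer = rebuffer  # hypothetical switch, for comparison only
        self.buffering = True     # assumption: the input starts out buffering

    def read_frame(self):
        """Return a frame, or None when the mixer would emit silence."""
        if self.buffering:
            if len(self.buf) < self.buf_min:
                return None  # keep buffering
            self.buffering = False
        if not self.buf:
            if self.rebuffer:
                self.buffering = True  # starved: mute until buf_min is back
            return None
        return self.buf.popleft()


def simulate(arrivals, buf_min, rebuffer=True):
    """arrivals[t] = frames delivered just before mixer tick t."""
    inp = Input(buf_min, rebuffer)
    played = []
    seq = 0
    for count in arrivals:
        for _ in range(count):
            inp.buf.append(seq)
            seq += 1
        played.append(inp.read_frame() is not None)
    return played


# Real-time producer with one 3-tick gap in the middle.
arrivals = [1] * 6 + [0] * 3 + [1] * 6
with_rebuffer = simulate(arrivals, buf_min=3)
without_rebuffer = simulate(arrivals, buf_min=3, rebuffer=False)
print("".join("#" if p else "." for p in with_rebuffer))     # ..######...####
print("".join("#" if p else "." for p in without_rebuffer))  # ..######.######
```

In this toy run the same three-tick producer gap costs three silent ticks with rebuffering (two of them after the producer has already resumed) and one without. The variant without rebuffering is left with no cushion afterwards, so it is shown only for contrast, and the model makes no attempt to reproduce the measured duty cycle.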

Questions

  1. Is this a known issue? Related reports: "Audio stream stutter/lag when talking over SIP" (#348) and "Audio quality issue with sip outbound call" (agents#4026), which describe the same "fine on WebRTC, choppy only over SIP" symptom.
  2. Is there a Cloud-side flag to relax/disable the rebuffer-mute (or pre-buffer more) on the room→SIP direction?
  3. If useful we can share both WAV captures and the analysis script.
