Room→SIP egress mutes continuous low-level audio in a periodic rebuffer cycle (fine to WebRTC subscribers, 55-64% loss on PSTN leg)

## Summary

Continuous low-level background audio (office-ambience bed mixed under an AI agent's voice) is delivered intact to WebRTC subscribers but arrives on the PSTN leg with **55–64% of its frames replaced by digital silence**, in a metronomic pattern (~100–150 ms of audio passing every ~600–750 ms). Agent speech is never affected — only the quiet continuous bed between utterances. Callers hear the background "pumping" in and out whenever the agent stops talking.

## Setup

- LiveKit Cloud + SIP (inbound), Twilio Elastic SIP Trunking
- Python agent (pipecat) publishing 16 kHz mono Opus; `TrackPublishOptions` with `dtx: false`, otherwise defaults; the track carries TTS speech plus a constant ambience bed (measured at about -38 dBFS effective, never below -45 dBFS in any 50 ms window)
- Inbound trunk: `krisp_enabled: false`, no media encryption
- Dispatch rule: `dispatch_rule_individual`

## Evidence (same-call dual tap)

For a single PSTN call we recorded simultaneously:

- **Tap A** — a hidden room participant subscribing to the agent track (what the SFU forwards)
- **Tap B** — Twilio dual-channel call recording (what LiveKit SIP egress sends to the trunk)

Both taps were analyzed in 50 ms RMS windows, where "bed" means a window level of -60 to -30 dBFS and "dropout" means below -60 dBFS:

| Tap | Non-speech windows | Bed present | Dropout |
|---|---|---|---|
| Tap A (room subscriber) | 688 | 96% | **4%** |
| Tap B (PSTN, same call) | 634 | 28% | **61%** |

Tap A contains unbroken 5+ second stretches of continuous bed. Tap B never has more than ~150 ms of bed at a time outside agent speech; the rest is digital silence (-90 dBFS), alternating with the bed in a regular ~750 ms cycle. Levels confirm Tap A is the real bed (median -41.8 dBFS tracking the asset's dynamics), not decoder comfort noise.
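
For reference, the window classification used for the table can be sketched as follows. This is a minimal illustration, not the actual analysis script: the function name `classify_windows`, the `"other"` label for windows above -30 dBFS, and the dBFS convention (RMS relative to a full-scale amplitude of 1.0, so a full-scale sine reads about -3 dBFS) are all assumptions.

```python
import math


def classify_windows(samples, sample_rate, window_ms=50,
                     floor_db=-60.0, ceil_db=-30.0):
    """Label each non-overlapping window as 'dropout', 'bed' or 'other'.

    samples: sequence of floats in [-1.0, 1.0] (mono).
    Thresholds follow the issue text: bed = -60..-30 dBFS,
    dropout = below -60 dBFS. Anything louder is labelled 'other'
    (assumed here to be speech or other foreground audio).
    """
    win = int(sample_rate * window_ms / 1000)
    labels = []
    for start in range(0, len(samples) - win + 1, win):
        chunk = samples[start:start + win]
        rms = math.sqrt(sum(x * x for x in chunk) / win)
        # Digital silence has rms == 0; treat it as -infinity dBFS.
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        if level_db < floor_db:
            labels.append("dropout")
        elif level_db <= ceil_db:
            labels.append("bed")
        else:
            labels.append("other")
    return labels


def dropout_fraction(labels):
    """Share of non-'other' windows that are dropouts (0.0 if none)."""
    quiet = [label for label in labels if label != "other"]
    return quiet.count("dropout") / len(quiet) if quiet else 0.0
```

Running the same function over both captures of one call is what makes the comparison meaningful: any difference in `dropout_fraction` is then attributable to the path, not to the content.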

Three separate probe calls reproduce the same pattern at the same magnitude.

## Codec-independent

Forcing the SIP leg to PCMU via the dispatch rule's `media` config (`only_listed_codecs: true, codecs: [PCMU/8000]`) changed nothing (61% vs 64% dropout), so this is not Opus DTX on the SIP leg.

## Where it seems to come from

The behavior matches the mixer input state machine in media-sdk (`mixer/mixer.go`):

```go
func (i *Input) readSample(bufMin int, out msdk.PCM16Sample) (int, error) {
    if i.buffering {
        if i.buf.Len() < bufMin {
            return 0, nil // keep buffering -> mixer emits silence for this input
        }
        i.buffering = false
    }
    n, err := i.buf.Read(out)
    if n == 0 {
        i.buffering = true // starving; pause the input and start buffering again
    }
    ...
}
```

A momentary starvation mutes the input until `bufMin` re-accumulates, then it plays briefly and starves again — i.e. a short producer hiccup is amplified into a repeating mute/burst cycle on the phone leg. During TTS speech the upstream buffers are full (TTS delivers faster than real time), which would explain why speech never chops while the just-in-time-paced bed does.
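
To make the hypothesized cycle concrete, here is a toy Python model of the state machine quoted above. It is a sketch under stated assumptions, not a reproduction of media-sdk: the buffer is counted in whole frames per mixer tick, the input is assumed to start in the buffering state, and the producer (one frame per tick, skipping every fourth tick) is entirely hypothetical.

```python
def simulate_input(arrivals, buf_min):
    """Model the buffering/starving logic of the quoted readSample.

    arrivals[t] is the number of frames the producer delivers before
    mixer tick t. Returns one bool per tick: True if the mixer got a
    frame from this input, False if it got silence.
    """
    buffered = 0
    buffering = True  # assumption: a fresh input starts out buffering
    played = []
    for delivered in arrivals:
        buffered += delivered
        if buffering:
            if buffered < buf_min:
                played.append(False)  # still refilling -> silence
                continue
            buffering = False
        if buffered == 0:
            buffering = True          # starved -> mute until refilled
            played.append(False)
        else:
            buffered -= 1             # consume one frame this tick
            played.append(True)
    return played


def longest_run(flags, value):
    """Length of the longest consecutive run of `value` in `flags`."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag == value else 0
        best = max(best, run)
    return best


# Hypothetical producer: one frame per tick, but every 4th tick is missed.
arrivals = [0 if t % 4 == 3 else 1 for t in range(48)]
played = simulate_input(arrivals, buf_min=3)
print("".join("#" if p else "." for p in played))  # '#' = audio, '.' = silence
```

In this model the producer's gaps are isolated single ticks, but the output shows contiguous three-tick mutes separated by nine-tick bursts, which is the mute/burst shape heard on the phone leg. Note that the silent ticks still add up to exactly the producer's shortfall; the state machine only regroups them. The model does not include buffer overflow or frame drops, so it cannot say whether those contribute to the measured dropout rate on the real path.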

## Questions

1. Is this a known issue? (Related reports: livekit/sip#348, livekit/agents#4026 — "fine on WebRTC, choppy only over SIP".)
2. Is there a Cloud-side flag to relax/disable the rebuffer-mute (or pre-buffer more) on the room→SIP direction?
3. If useful we can share both WAV captures and the analysis script.

