[Bug]: Malformed or wrong-encoding setup request permanently wedges the agent's setup socket #28

@Ninjabippo1205

Prerequisites

  • I have searched existing issues and confirmed this is not a duplicate.
  • I have verified the bug against the latest commit on main.
  • I have read CONTRIBUTING.md.

Description

A single malformed or wrong-encoding setup request permanently wedges the agent's setup socket. The setup path is served on a ZMQ REQ/REP socket; when an incoming setupRequest fails to decode (e.g. a JSON-speaking dApp connecting to an ASN.1-configured agent, or any garbage bytes), the handler logs the decode failure and returns without sending a reply. The REP state machine is then stuck in its "must send before next recv" state, so the agent can never receive another setup request: every subsequent dApp setup times out until the whole agent process (in our deployment, the gNB) is restarted. One misconfigured client takes down dApp onboarding for the entire agent.

Steps to reproduce

  1. Build main (Release + tests + examples, all defaults):
    ./build_libe3 -c -d build -j $(nproc) -r -t
  2. Start the bundled example agent with its defaults (ASN.1 encoding, setup socket on tcp://*:9990):
    ./build/simple_agent
  3. From a second shell, send one undecodable setup request (raw garbage; a well-formed JSON setupRequest against the ASN.1-configured agent reproduces it identically) and observe no reply:
python3 - <<'EOF'
import zmq
ctx = zmq.Context()
s = ctx.socket(zmq.REQ); s.setsockopt(zmq.RCVTIMEO, 3000); s.setsockopt(zmq.LINGER, 0)
s.connect("tcp://127.0.0.1:9990")
s.send(b"not-a-valid-e3-setup")
try:
    print("reply:", s.recv())
except zmq.error.Again:
    print("no reply within 3 s")
EOF
  4. Now attempt a valid setup, e.g. run the bundled example dApp:
    ./build/simple_dapp
    → it times out waiting for the setup response. Any further setup attempt from any client does the same.
  5. Restart simple_agent and run ./build/simple_dapp again → setup succeeds immediately, confirming the failure in step 4 was agent-side wedged state, not networking.

Expected behavior

A setup request that fails to decode should affect only that request, never the agent. The agent should send back a best-effort negative/empty reply (the requester sees its setup rejected or times out once), log the decode failure, and keep the setup socket serviceable, so the next valid setupRequest from any dApp succeeds without restarting the agent. A single misbehaving or misconfigured client must not be able to disable dApp onboarding process-wide.
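One way to get this behavior is to make reply construction total: every request, decodable or not, maps to some reply, and the loop sends it unconditionally. The sketch below is illustrative only. It uses JSON as a stand-in codec, and the function name build_reply and the fields type, accepted, and reason are hypothetical, not libe3's actual message schema or API.

```python
import json


def build_reply(raw: bytes) -> bytes:
    """Return a reply for every request, including undecodable ones.

    Hypothetical schema: the "type", "accepted" and "reason" fields are
    placeholders for illustration, not the real E3 setup PDU.
    """
    try:
        msg = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Decode failure: still answer, so the REP socket can receive again.
        return json.dumps({"accepted": False, "reason": "decode_failed"}).encode()

    if not isinstance(msg, dict) or msg.get("type") != "setupRequest":
        # Decoded, but not the PDU expected on the setup channel.
        return json.dumps({"accepted": False, "reason": "unexpected_pdu"}).encode()

    return json.dumps({"accepted": True}).encode()


garbage_reply = build_reply(b"not-a-valid-e3-setup")
valid_reply = build_reply(b'{"type": "setupRequest"}')
```

With a helper shaped like this, the serving loop reduces to "receive, build reply, send" with no early exit between the receive and the send, so a bad request costs one negative reply and nothing more.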

Actual behavior

There is no crash, sanitizer report, or stack trace — the failure mode is silence. On the malformed request the agent logs exactly one line and nothing else:

[E3Interface] ERROR: Failed to decode setup request; ret=20

After that, every setup attempt from any client times out with no agent-side output at all (the probe in the reproduction prints no reply within 3 s; simple_dapp hangs waiting for its setup response). The agent process stays alive and looks healthy.

Root cause is visible in src/core/e3_interface.cpp (main @ 6295811, setup loop around lines 318–335): all three early-exit branches (decode failure, "Failed to decode setup request"; wrong PDU type, "Unexpected PDU type in setup"; and variant extraction failure, "Failed to get SetupRequest from PDU") execute continue; without first sending a reply on the REP socket. A ZMQ REP socket must send a reply before it can receive again, so every subsequent recv on the setup socket fails; those failures are swallowed by the loop's ret <= 0 → continue path, which is why nothing further is logged. All three branches reproduce the same wedge.
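The mechanism can be illustrated with a toy model of the REP socket's strict receive/send alternation. This is a sketch, not libzmq or the libe3 code: ToyRepSocket and serve are hypothetical names, the b"SETUP" prefix check stands in for a real decode, and returning None from recv models the failed receive (libzmq reports EFSM when a REP socket is asked to receive while a reply is still owed).

```python
class ToyRepSocket:
    """Toy model of a REP socket: each recv must be followed by a send."""

    def __init__(self, inbox):
        self._inbox = list(inbox)  # requests queued by clients
        self._must_send = False    # True between a recv and its reply
        self.replies = []          # replies delivered to clients

    def recv(self):
        if self._must_send:
            return None            # models the failed recv in the wrong state
        if not self._inbox:
            return None            # nothing to receive
        self._must_send = True
        return self._inbox.pop(0)

    def send(self, reply):
        if not self._must_send:
            raise RuntimeError("send without a pending request")
        self.replies.append(reply)
        self._must_send = False


def serve(sock, reply_on_error, max_iterations=10):
    """Setup loop sketch; reply_on_error=False mirrors the reported behavior."""
    for _ in range(max_iterations):
        raw = sock.recv()
        if raw is None:
            continue               # mirrors the "ret <= 0 -> continue" path
        if not raw.startswith(b"SETUP"):  # stand-in for a decode failure
            if reply_on_error:
                sock.send(b"REJECT")
            continue               # without the send, the socket is wedged
        sock.send(b"OK")
    return sock.replies


wedged = serve(ToyRepSocket([b"garbage", b"SETUP-1"]), reply_on_error=False)
healthy = serve(ToyRepSocket([b"garbage", b"SETUP-1"]), reply_on_error=True)
print(wedged)   # []
print(healthy)  # [b'REJECT', b'OK']
```

In the wedged run the valid request is never even received, which matches the silent agent log: the loop keeps spinning through the failed-receive path without ever reaching a log statement.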

Deterministic in-tree evidence: the regression test added on the fix branch (test_setup_bad_request) times out on unfixed main and passes with the fix applied.

Build type

Release (-r)

Exact build command

./build_libe3 -c -d build -j $(nproc) -r -t

CMake feature flags (mark non-defaults)

  • LIBE3_ENABLE_ZMQ disabled
  • LIBE3_ENABLE_ASN1 disabled
  • LIBE3_ENABLE_JSON enabled (mutually exclusive with ASN1)
  • LIBE3_BUILD_TESTS disabled
  • LIBE3_BUILD_EXAMPLES disabled
  • LIBE3_ENABLE_ASAN enabled
  • LIBE3_ENABLE_TSAN enabled

How is libe3 being used?

Standalone — examples/simple_agent or similar

libe3 version

0.0.5 (latest main, commit 6295811)

Operating system + architecture

Ubuntu 24.04 aarch64 (container on NVIDIA GH200) — also reproduced on macOS 15 arm64; the wedge is platform-independent ZMQ REQ/REP state-machine behavior.

Compiler and version

gcc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0 — primary environment; also reproduced with Apple clang version 21.0.0 (clang-2100.1.1.101) on macOS.

Test output (if relevant)

No test on current main exercises this path — that's part of the gap this report covers. Below is the regression test from the fix branch (test_setup_bad_request) run against unfixed main sources (Release, dual-encoder config, Ubuntu 24.04 aarch64):

1/1 Test #16: test_setup_bad_request ...........***Failed    5.61 sec

[FAIL] SetupChannel_garbageRequest_repliesAndChannelSurvives
  Exception: tests/test_setup_bad_request.cpp:120: Assertion failed: n >= 0 (got -1 vs 0)

========================================
Results: 0 passed, 1 failed out of 1
========================================

0% tests passed, 1 tests failed out of 1

The following tests FAILED:
  16 - test_setup_bad_request (Failed)

The failing assertion is the receive after the garbage request: n = -1 means no reply ever arrives and the channel is dead from that point on. The same test passes with the fix applied (verified 18/18 on the full suite, both Release dual-encoder and JSON-only configurations).

Logs / sanitizer output

Agent-side stderr around the event. Two setup requests were sent (one garbage, then one valid); note that only the first request is logged (the receive line plus the decode error). The second request produces nothing because the wedged REP socket never receives it:

[2026-06-11 23:58:12.480] [INFO ] [E3Interface] Setup request received: 20 bytes
[2026-06-11 23:58:12.480] [ERROR] [E3Interface] Failed to decode setup request; ret=20

(No further output, indefinitely. The same applies via the sibling branches "Unexpected PDU type in setup" and "Failed to get SetupRequest from PDU" — all three continue without replying.)

Client-side, the only observable is a receive timeout (from the regression test, assertion on the post-garbage receive):

[FAIL] SetupChannel_garbageRequest_repliesAndChannelSurvives
       Exception: tests/test_setup_bad_request.cpp:120: Assertion failed: n >= 0 (got -1 vs 0)

ASan/TSan/UBSan: not applicable and nothing reported — this is a deterministic protocol-state defect (ZMQ REQ/REP state machine left in send-state), not a memory or threading error. GDB backtrace: n/a, the process never crashes; attaching shows the setup thread parked in zmq_recv on the REP socket, which can no longer deliver (every receive fails until a send occurs).

Additional context

No response
