Fix/setup robustness and multi agent#29
Conversation
test_asn1_size drives the APER encoder directly and test_e2e_report_path hardcodes EncodingFormat::ASN1 for both the agent under test and its fake-dApp peer. On a JSON-only build (-DLIBE3_ENABLE_ASN1=OFF) the encoder factory returns nullptr for ASN.1, so both tests fail with 'encoder unavailable' (-12) rather than telling us anything about the build. Exclude the two targets at configure time when ASN.1 support is compiled out, mirroring the existing test_json_encoder skip. The tests themselves are unchanged, so ASN.1-enabled builds keep full coverage.
The RAN side of the setup channel is a ZMQ REP socket, which must send exactly one reply per received request before it can receive again. The setup loop bailed out without replying whenever the incoming message could not be answered (undecodable bytes from a wrong-encoding or misbehaving dApp, an unexpected PDU type, or a SetupResponse encode failure), leaving the socket stuck in the send state: every subsequent zmq_recv fails and all future dApp setups stall until the connector happens to reset or the agent restarts. Complete the REQ/REP exchange on every such path with a best-effort empty reply (new send_empty_setup_reply helper). The peer treats the empty reply as a failed setup and retries; the setup channel stays usable for the next dApp. On the POSIX connector, which has no REP state machine, the empty send is a harmless zero-length frame. Malformed-message logs are demoted to one concise warn line per message. Add test_setup_bad_request: a raw ZMQ REQ peer sends garbage to a running agent and must still get a reply, and a well-formed SetupRequest on the same socket afterwards must still receive a positive SetupResponse. The test fails by timeout without this fix and uses whichever encoding the build provides (JSON preferred), so it runs on JSON-only builds.
ServiceModel::ran_function_data() defaults to an empty vector, but the E3AP schema (messages/asn1/V1/e3ap-1.0.0.asn1) declares E3-RanFunctionDefinition.ranFunctionData as a MANDATORY OCTET STRING (SIZE (1..32768)) — it can be neither omitted nor empty. Encoding a setupResponse that advertises such an SM therefore violates the APER size constraint and aborts the encode of the whole PDU: one data-less SM silently breaks setup for every dApp, including the RAN functions that do provide data. Since the field is mandatory (omitting it is not an option), substitute a 1-byte 0x00 placeholder when the SM provides no data, and say so in a comment next to the schema reference. SMs with real data are encoded exactly as before. Verified against the asn1c runtime generated from this grammar: encoding E3-RanFunctionDefinition with an empty ranFunctionData fails the PER size constraint, while the 1-byte placeholder encodes and decodes. Extend test_asn1_size (the ASN.1-only test binary) with a setupResponse that carries one SM with data and one without: the encode must succeed, the result must decode, the real payload must round-trip byte-for-byte, and the data-less SM must surface the 0x00 placeholder.
SmRegistry was a process-wide singleton, so every E3Interface in a process
shared one SM table. That breaks any multi-agent process in two ways:
- registering an SM with the same RAN function id on a second,
completely independent agent fails with SM_ALREADY_REGISTERED, and
- E3Interface::stop() cleared the SHARED registry, wiping the Service
Models of every other RAN-role agent still running in the process
(the existing role guard only protected against the dApp-role case).
Each E3Interface now owns an SmRegistry member, and all call sites in
e3_interface.cpp (register_sm, get_available_ran_functions, the setup and
sm-data loops, handle_setup_request, handle_control_action,
on_sm_lifecycle_change, and the teardown clear) go through the owning
interface's registry; stop() clears only the agent's own SMs. The
SmRegistry class API is unchanged apart from dropping instance() and
making construction public. Nothing else references the singleton — the
C API and SWIG layer reach SMs through E3Agent, which already routes via
its E3Interface — so single-agent behaviour is identical, including the
clear-on-stop semantics.
Add test_multi_agent_registry: two RAN-role E3Agents in one process (on
distinct IPC endpoints) both register an SM with ran_function_id 1 — both
must succeed — then one agent is stopped and destroyed and the survivor
must still be running, still advertise RAN function 1, and still serve a
full setup handshake to a real dApp-role agent that is offered that RAN
function. The test fails on the singleton design (second registration
returns SM_ALREADY_REGISTERED) and uses the build's available encoding
(JSON preferred), so it runs on JSON-only builds.
⏱️ Full-loop Latency Benchmark (commit
|
| Phase | mean | p50 | p99 | max |
|---|---|---|---|---|
| 1. Collect indication data | 0 | 0 | 0 | 0 |
| 2. Create & encode indication | 0 | 0 | 1 | 11 |
| 3. Deliver indication (RAN -> dApp) | 95 | 93 | 159 | 181 |
| 4. Decode indication | 0 | 0 | 1 | 10 |
| 5. Process data | 0 | 0 | 0 | 10 |
| 6. Create & encode control | 0 | 0 | 0 | 2 |
| 7. Deliver control (dApp -> RAN) | 110 | 106 | 172 | 791 |
| 8. Decode & handle control | 0 | 0 | 0 | 9 |
| Total round-trip | 207 | 196 | 324 | 898 |
Benchmarked on
ubuntu-latest, Release build, ZMQ + IPC, ASN.1 APER.
🔄 E2E dApp Integration Results (commit
|
🔀 E2E Topologies — multi-dApp / multi-RAN (commit
|
|
let's use this branch to work exclusively on the fix |
Thecave3
left a comment
There was a problem hiding this comment.
Thanks for the thorough writeup and the regression tests — the diagnosis in #28 is spot-on and the test harness is genuinely good. I'd like to split this before merge, though: #28 is only the setup-socket wedge, and the PR currently bundles two unrelated changes under it. Concretely:
Fix 1 (setup wedge) — keep, but reply with a real negative SetupResponse, not an empty frame. Our encoding is fixed and known a priori by every peer, so on a malformed/undecodable request we can always encode a proper SetupResponse with ResponseCode::NEGATIVE. When the bytes don't decode we don't have the original request_id — just default it (0 / next-available), since the dApp's lazy-pirate loop doesn't correlate replies by id anyway. That removes the need for the empty-frame helper entirely: the peer gets an explicit rejection instead of a silent timeout. test_setup_bad_request should then assert it receives a decodable negative response, not just n >= 0.
Fix 2 (0x00 padding) — please drop this. ranFunctionData being mandatory SIZE(1..32768) is correct; the issue is that an SM should never advertise empty data in the first place (a real SM always carries at least its name — see simple_agent / e3sm_simple), and ranFunctionList is OPTIONAL, so "nothing to advertise" means omitting the entry, not padding it. The only way an empty value reaches the encoder today is the C API's ran_function_data "may be NULL / len 0" contract, which directly contradicts the schema. Padding hides that mismatch (and makes a decoded {0x00} indistinguishable from real data). Tracked as #30 — let's fix it at the source there instead.
Fix 3 (per-interface SmRegistry) — let's pull this into its own PR. It's a structural change to a public header and ~10 call sites, unrelated to #28, and it carries a behavioral change (SMs destroyed without their destroy() hook if stop() isn't called). The multi-agent-in-one-process motivation is reasonable but deserves its own issue and review on its own merits — opened #31 to discuss whether/why we want to support that configuration. (For what it's worth, no external consumer references SmRegistry::instance(), so the API break is internal-only — splitting it costs nothing downstream.) test_multi_agent_registry moves with it.
The ASN.1-only CI gating (LIBE3_ENABLE_ASN1 test exclusion) is a clean, standalone improvement — happy to see that land either here or as its own tiny PR.
So: this PR becomes Fix 1 (negative-response variant) + test_setup_bad_request, closing #28; Fixes 2 and 3 each get their own issue/PR (#30, #31). Thanks again!
Summary
negative reply instead of leaving the ZMQ REP socket permanently wedged (one bad client
could previously disable dApp onboarding until the agent/gNB was restarted), and the
ASN.1 setupResponse no longer fails outright when a registered SM has empty
ran_function_data(the schema mandatesOCTET STRING (SIZE (1..32768)); a 1-byteplaceholder is substituted).
SmRegistrya per-E3Interfacemember instead of a process-wide singleton: twoE3Agentinstances in one process no longer collide on RAN-function registration, andone agent's teardown no longer destroys the other agent's Service Models.
test_asn1_sizeandtest_e2e_report_pathhardcode ASN.1 and are nowonly built when
LIBE3_ENABLE_ASN1=ON. Two new regression tests added(
test_setup_bad_request,test_multi_agent_registry), both verified to fail onunfixed
main.VERSIONbumped to 0.0.6.Type of change
Linked issue
Closes #28
Mandatory test checklist
These mirror what CI (
.github/workflows/pr-tests.yml) enforces. All boxes must be ticked before review../build_libe3 -c -d build -j $(nproc) -r -tpasses (Release build + tests)./build_libe3 -c -d build -j $(nproc) -g -tpasses (Debug build + tests)cd build && ctest --output-on-failureis clean./build/test_bench_mpmc_queue) shows no regression vsmainVERSIONbumped per SemVer if the public API or ABI changedinclude/were touched,./build_libe3 --docsrenders without new Doxygen warnings./build_libe3 -I(update the script if needed)libe3.pcinterface changed, downstream consumers (dApp-openairinterface5g) still link cleanlyCI checklist
Unit Testsworkflow is green (Debug + Release matrix onubuntu-latest)MPMC Queue Benchmarkjob has posted results to this PR with no regressionTwin-repo coordination
libe3 is paired with
dappsanddApp-openairinterface5g. We do not accept patches that break or reduce compatibility with the twin repositories.Paired PR(s):
Workflow confirmation
CONTRIBUTING.md§ Pull Request Process.)CONTRIBUTING.md.