Fix/setup robustness and multi agent by Ninjabippo1205 · Pull Request #29 · wineslab/libe3

Ninjabippo1205 · 2026-06-12T04:40:27Z

Summary

Harden the setup path: a malformed/wrong-encoding setupRequest now gets a best-effort
negative reply instead of leaving the ZMQ REP socket permanently wedged (one bad client
could previously disable dApp onboarding until the agent/gNB was restarted), and the
ASN.1 setupResponse no longer fails outright when a registered SM has empty
ran_function_data (the schema mandates OCTET STRING (SIZE (1..32768)); a 1-byte
placeholder is substituted).
Make SmRegistry a per-E3Interface member instead of a process-wide singleton: two
E3Agent instances in one process no longer collide on RAN-function registration, and
one agent's teardown no longer destroys the other agent's Service Models.
Fix JSON-only CI: test_asn1_size and test_e2e_report_path hardcode ASN.1 and are now
only built when LIBE3_ENABLE_ASN1=ON. Two new regression tests added
(test_setup_bad_request, test_multi_agent_registry), both verified to fail on
unfixed main. VERSION bumped to 0.0.6.

Type of change

[X ] Bug fix
New feature / enhancement
New Service Model
Refactor (no behavior change)
Documentation
[ X] Test / CI / packaging
Other (explain):

Linked issue

Closes #28

Mandatory test checklist

These mirror what CI (.github/workflows/pr-tests.yml) enforces. All boxes must be ticked before review.

[ X] ./build_libe3 -c -d build -j $(nproc) -r -t passes (Release build + tests)
[ X] ./build_libe3 -c -d build -j $(nproc) -g -t passes (Debug build + tests)
[ X] cd build && ctest --output-on-failure is clean
[ X] MPMC queue benchmark (./build/test_bench_mpmc_queue) shows no regression vs main
[ X] VERSION bumped per SemVer if the public API or ABI changed
[ X] If public headers under include/ were touched, ./build_libe3 --docs renders without new Doxygen warnings
[ X] If new build dependencies were added, they are installed by ./build_libe3 -I (update the script if needed)
[ X] If the libe3.pc interface changed, downstream consumers (dApp-openairinterface5g) still link cleanly

CI checklist

Unit Tests workflow is green (Debug + Release matrix on ubuntu-latest)
MPMC Queue Benchmark job has posted results to this PR with no regression

Twin-repo coordination

libe3 is paired with dapps and dApp-openairinterface5g. We do not accept patches that break or reduce compatibility with the twin repositories.

[ X] This PR does not change the E3 wire protocol or public ABI, OR a paired PR exists in each affected twin repo (link below).

Paired PR(s):

Workflow confirmation

[ X] Commits will be squashed at merge time. Updates to this PR will be applied via rebase only — no merge commits, no duplicated history. (See CONTRIBUTING.md § Pull Request Process.)
[ X] I have read and followed CONTRIBUTING.md.

test_asn1_size drives the APER encoder directly and test_e2e_report_path hardcodes EncodingFormat::ASN1 for both the agent under test and its fake-dApp peer. On a JSON-only build (-DLIBE3_ENABLE_ASN1=OFF) the encoder factory returns nullptr for ASN.1, so both tests fail with 'encoder unavailable' (-12) rather than telling us anything about the build. Exclude the two targets at configure time when ASN.1 support is compiled out, mirroring the existing test_json_encoder skip. The tests themselves are unchanged, so ASN.1-enabled builds keep full coverage.

The RAN side of the setup channel is a ZMQ REP socket, which must send exactly one reply per received request before it can receive again. The setup loop bailed out without replying whenever the incoming message could not be answered (undecodable bytes from a wrong-encoding or misbehaving dApp, an unexpected PDU type, or a SetupResponse encode failure), leaving the socket stuck in the send state: every subsequent zmq_recv fails and all future dApp setups stall until the connector happens to reset or the agent restarts. Complete the REQ/REP exchange on every such path with a best-effort empty reply (new send_empty_setup_reply helper). The peer treats the empty reply as a failed setup and retries; the setup channel stays usable for the next dApp. On the POSIX connector, which has no REP state machine, the empty send is a harmless zero-length frame. Malformed-message logs are demoted to one concise warn line per message. Add test_setup_bad_request: a raw ZMQ REQ peer sends garbage to a running agent and must still get a reply, and a well-formed SetupRequest on the same socket afterwards must still receive a positive SetupResponse. The test fails by timeout without this fix and uses whichever encoding the build provides (JSON preferred), so it runs on JSON-only builds.

ServiceModel::ran_function_data() defaults to an empty vector, but the E3AP schema (messages/asn1/V1/e3ap-1.0.0.asn1) declares E3-RanFunctionDefinition.ranFunctionData as a MANDATORY OCTET STRING (SIZE (1..32768)) — it can be neither omitted nor empty. Encoding a setupResponse that advertises such an SM therefore violates the APER size constraint and aborts the encode of the whole PDU: one data-less SM silently breaks setup for every dApp, including the RAN functions that do provide data. Since the field is mandatory (omitting it is not an option), substitute a 1-byte 0x00 placeholder when the SM provides no data, and say so in a comment next to the schema reference. SMs with real data are encoded exactly as before. Verified against the asn1c runtime generated from this grammar: encoding E3-RanFunctionDefinition with an empty ranFunctionData fails the PER size constraint, while the 1-byte placeholder encodes and decodes. Extend test_asn1_size (the ASN.1-only test binary) with a setupResponse that carries one SM with data and one without: the encode must succeed, the result must decode, the real payload must round-trip byte-for-byte, and the data-less SM must surface the 0x00 placeholder.

SmRegistry was a process-wide singleton, so every E3Interface in a process shared one SM table. That breaks any multi-agent process in two ways: - registering an SM with the same RAN function id on a second, completely independent agent fails with SM_ALREADY_REGISTERED, and - E3Interface::stop() cleared the SHARED registry, wiping the Service Models of every other RAN-role agent still running in the process (the existing role guard only protected against the dApp-role case). Each E3Interface now owns an SmRegistry member, and all call sites in e3_interface.cpp (register_sm, get_available_ran_functions, the setup and sm-data loops, handle_setup_request, handle_control_action, on_sm_lifecycle_change, and the teardown clear) go through the owning interface's registry; stop() clears only the agent's own SMs. The SmRegistry class API is unchanged apart from dropping instance() and making construction public. Nothing else references the singleton — the C API and SWIG layer reach SMs through E3Agent, which already routes via its E3Interface — so single-agent behaviour is identical, including the clear-on-stop semantics. Add test_multi_agent_registry: two RAN-role E3Agents in one process (on distinct IPC endpoints) both register an SM with ran_function_id 1 — both must succeed — then one agent is stopped and destroyed and the survivor must still be running, still advertise RAN function 1, and still serve a full setup handshake to a real dApp-role agent that is offered that RAN function. The test fails on the singleton design (second registration returns SM_ALREADY_REGISTERED) and uses the build's available encoding (JSON preferred), so it runs on JSON-only builds.

github-actions · 2026-06-12T04:41:49Z

⏱️ Full-loop Latency Benchmark (commit `ef2a2a7`)

Full-loop latency benchmark (N=1033 after 50 warmup)

All values in microseconds (us). Transport: ZMQ over IPC, encoding: ASN.1 APER.

Phase	mean	p50	p99	max
1. Collect indication data	0	0	0	0
2. Create & encode indication	0	0	1	11
3. Deliver indication (RAN -> dApp)	95	93	159	181
4. Decode indication	0	0	1	10
5. Process data	0	0	0	10
6. Create & encode control	0	0	0	2
7. Deliver control (dApp -> RAN)	110	106	172	791
8. Decode & handle control	0	0	0	9
Total round-trip	207	196	324	898

Benchmarked on ubuntu-latest, Release build, ZMQ + IPC, ASN.1 APER.

github-actions · 2026-06-12T04:42:46Z

🔄 E2E dApp Integration Results (commit `ef2a2a7`)

posix/ipc

dApp exit: 0
Indications received: 7

posix/tcp

dApp exit: 0
Indications received: 7

zmq/ipc

dApp exit: 0
Indications received: 7

zmq/tcp

dApp exit: 0
Indications received: 7

example_simple_agent + example_simple_dapp on ubuntu-latest, Release build

github-actions · 2026-06-12T04:43:07Z

🔀 E2E Topologies — multi-dApp / multi-RAN (commit `ef2a2a7`)

zmq/ipc

✅ 1 RAN - 1 dApp: indications=5
- dapp peer=t11 ran=ran-solo sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.2 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
✅ 1 RAN - 2 dApps: dApp#1 ind=5 sub=1, dApp#2 ind=6 sub=2, RAN saw 2 dApps
- dapp1 peer=t12 ran=ran-shared sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.4 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
- dapp2 peer=t12 ran=ran-shared sub=2 indications=6 seq=[0..5] dropped=0 (0%) age_ms(avg=0.333333 max=1 @seq=0) hist[<=1:6 2-5:0 6-10:0 >10:0]
✅ 2 RANs - 1 dApp: from ran-a ind=5, from ran-b ind=5
- dapp peer=t2a ran=ran-a sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0 max=0 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
- dapp peer=t2b ran=ran-b sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.4 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]

zmq/tcp

✅ 1 RAN - 1 dApp: indications=5
- dapp peer=default ran=ran-solo sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0 max=0 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
✅ 1 RAN - 2 dApps: dApp#1 ind=5 sub=1, dApp#2 ind=6 sub=2, RAN saw 2 dApps
- dapp1 peer=default ran=ran-shared sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.2 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
- dapp2 peer=default ran=ran-shared sub=2 indications=6 seq=[0..5] dropped=0 (0%) age_ms(avg=0.166667 max=1 @seq=0) hist[<=1:6 2-5:0 6-10:0 >10:0]
✅ 2 RANs - 1 dApp: from ran-a ind=5, from ran-b ind=5
- dapp peer=default ran=ran-a sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.8 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
- dapp peer=off100 ran=ran-b sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.4 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]

example_simple_agent + example_simple_dapp on ubuntu-latest, Release. Indication age is report-only.

Thecave3 · 2026-06-12T14:36:06Z

let's use this branch to work exclusively on the fix

Thecave3

Thanks for the thorough writeup and the regression tests — the diagnosis in #28 is spot-on and the test harness is genuinely good. I'd like to split this before merge, though: #28 is only the setup-socket wedge, and the PR currently bundles two unrelated changes under it. Concretely:

Fix 1 (setup wedge) — keep, but reply with a real negative SetupResponse, not an empty frame. Our encoding is fixed and known a priori by every peer, so on a malformed/undecodable request we can always encode a proper SetupResponse with ResponseCode::NEGATIVE. When the bytes don't decode we don't have the original request_id — just default it (0 / next-available), since the dApp's lazy-pirate loop doesn't correlate replies by id anyway. That removes the need for the empty-frame helper entirely: the peer gets an explicit rejection instead of a silent timeout. test_setup_bad_request should then assert it receives a decodable negative response, not just n >= 0.

Fix 2 (0x00 padding) — please drop this. ranFunctionData being mandatory SIZE(1..32768) is correct; the issue is that an SM should never advertise empty data in the first place (a real SM always carries at least its name — see simple_agent / e3sm_simple), and ranFunctionList is OPTIONAL, so "nothing to advertise" means omitting the entry, not padding it. The only way an empty value reaches the encoder today is the C API's ran_function_data "may be NULL / len 0" contract, which directly contradicts the schema. Padding hides that mismatch (and makes a decoded {0x00} indistinguishable from real data). Tracked as #30 — let's fix it at the source there instead.

Fix 3 (per-interface SmRegistry) — let's pull this into its own PR. It's a structural change to a public header and ~10 call sites, unrelated to #28, and it carries a behavioral change (SMs destroyed without their destroy() hook if stop() isn't called). The multi-agent-in-one-process motivation is reasonable but deserves its own issue and review on its own merits — opened #31 to discuss whether/why we want to support that configuration. (For what it's worth, no external consumer references SmRegistry::instance(), so the API break is internal-only — splitting it costs nothing downstream.) test_multi_agent_registry moves with it.

The ASN.1-only CI gating (LIBE3_ENABLE_ASN1 test exclusion) is a clean, standalone improvement — happy to see that land either here or as its own tiny PR.

So: this PR becomes Fix 1 (negative-response variant) + test_setup_bad_request, closing #28; Fixes 2 and 3 each get their own issue/PR (#30, #31). Thanks again!

Ninjabippo1205 added 4 commits June 11, 2026 23:29

Ninjabippo1205 requested a review from Thecave3 as a code owner June 12, 2026 04:40

This was referenced Jun 12, 2026

C API allows empty ran_function_data but E3AP schema mandates SIZE(1..32768) #30

Open

Discussion: per-interface SmRegistry for multi-agent-in-one-process #31

Open

Thecave3 requested changes Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/setup robustness and multi agent#29

Fix/setup robustness and multi agent#29
Ninjabippo1205 wants to merge 4 commits into
mainfrom
fix/setup-robustness-and-multi-agent

Ninjabippo1205 commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Thecave3 commented Jun 12, 2026

Uh oh!

Thecave3 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Ninjabippo1205 commented Jun 12, 2026

Summary

Type of change

Linked issue

Mandatory test checklist

CI checklist

Twin-repo coordination

Workflow confirmation

Uh oh!

github-actions Bot commented Jun 12, 2026

⏱️ Full-loop Latency Benchmark (commit ef2a2a7)

Full-loop latency benchmark (N=1033 after 50 warmup)

Uh oh!

github-actions Bot commented Jun 12, 2026

🔄 E2E dApp Integration Results (commit ef2a2a7)

posix/ipc

posix/tcp

zmq/ipc

zmq/tcp

Uh oh!

github-actions Bot commented Jun 12, 2026

🔀 E2E Topologies — multi-dApp / multi-RAN (commit ef2a2a7)

zmq/ipc

zmq/tcp

Uh oh!

Thecave3 commented Jun 12, 2026

Uh oh!

Thecave3 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

⏱️ Full-loop Latency Benchmark (commit `ef2a2a7`)

🔄 E2E dApp Integration Results (commit `ef2a2a7`)

🔀 E2E Topologies — multi-dApp / multi-RAN (commit `ef2a2a7`)