Skip to content

Fix/setup robustness and multi agent#29

Open
Ninjabippo1205 wants to merge 4 commits into
mainfrom
fix/setup-robustness-and-multi-agent
Open

Fix/setup robustness and multi agent#29
Ninjabippo1205 wants to merge 4 commits into
mainfrom
fix/setup-robustness-and-multi-agent

Conversation

@Ninjabippo1205

Copy link
Copy Markdown
Contributor

Summary

  • Harden the setup path: a malformed/wrong-encoding setupRequest now gets a best-effort
    negative reply instead of leaving the ZMQ REP socket permanently wedged (one bad client
    could previously disable dApp onboarding until the agent/gNB was restarted), and the
    ASN.1 setupResponse no longer fails outright when a registered SM has empty
    ran_function_data (the schema mandates OCTET STRING (SIZE (1..32768)); a 1-byte
    placeholder is substituted).
  • Make SmRegistry a per-E3Interface member instead of a process-wide singleton: two
    E3Agent instances in one process no longer collide on RAN-function registration, and
    one agent's teardown no longer destroys the other agent's Service Models.
  • Fix JSON-only CI: test_asn1_size and test_e2e_report_path hardcode ASN.1 and are now
    only built when LIBE3_ENABLE_ASN1=ON. Two new regression tests added
    (test_setup_bad_request, test_multi_agent_registry), both verified to fail on
    unfixed main. VERSION bumped to 0.0.6.

Type of change

  • [X ] Bug fix
  • New feature / enhancement
  • New Service Model
  • Refactor (no behavior change)
  • Documentation
  • [ X] Test / CI / packaging
  • Other (explain):

Linked issue

Closes #28

Mandatory test checklist

These mirror what CI (.github/workflows/pr-tests.yml) enforces. All boxes must be ticked before review.

  • [ X] ./build_libe3 -c -d build -j $(nproc) -r -t passes (Release build + tests)
  • [ X] ./build_libe3 -c -d build -j $(nproc) -g -t passes (Debug build + tests)
  • [ X] cd build && ctest --output-on-failure is clean
  • [ X] MPMC queue benchmark (./build/test_bench_mpmc_queue) shows no regression vs main
  • [ X] VERSION bumped per SemVer if the public API or ABI changed
  • [ X] If public headers under include/ were touched, ./build_libe3 --docs renders without new Doxygen warnings
  • [ X] If new build dependencies were added, they are installed by ./build_libe3 -I (update the script if needed)
  • [ X] If the libe3.pc interface changed, downstream consumers (dApp-openairinterface5g) still link cleanly

CI checklist

  • Unit Tests workflow is green (Debug + Release matrix on ubuntu-latest)
  • MPMC Queue Benchmark job has posted results to this PR with no regression

Twin-repo coordination

libe3 is paired with dapps and dApp-openairinterface5g. We do not accept patches that break or reduce compatibility with the twin repositories.

  • [ X] This PR does not change the E3 wire protocol or public ABI, OR a paired PR exists in each affected twin repo (link below).

Paired PR(s):

Workflow confirmation

  • [ X] Commits will be squashed at merge time. Updates to this PR will be applied via rebase only — no merge commits, no duplicated history. (See CONTRIBUTING.md § Pull Request Process.)
  • [ X] I have read and followed CONTRIBUTING.md.

test_asn1_size drives the APER encoder directly and test_e2e_report_path
hardcodes EncodingFormat::ASN1 for both the agent under test and its
fake-dApp peer. On a JSON-only build (-DLIBE3_ENABLE_ASN1=OFF) the encoder
factory returns nullptr for ASN.1, so both tests fail with 'encoder
unavailable' (-12) rather than telling us anything about the build.

Exclude the two targets at configure time when ASN.1 support is compiled
out, mirroring the existing test_json_encoder skip. The tests themselves
are unchanged, so ASN.1-enabled builds keep full coverage.
The RAN side of the setup channel is a ZMQ REP socket, which must send
exactly one reply per received request before it can receive again. The
setup loop bailed out without replying whenever the incoming message could
not be answered (undecodable bytes from a wrong-encoding or misbehaving
dApp, an unexpected PDU type, or a SetupResponse encode failure), leaving
the socket stuck in the send state: every subsequent zmq_recv fails and
all future dApp setups stall until the connector happens to reset or the
agent restarts.

Complete the REQ/REP exchange on every such path with a best-effort empty
reply (new send_empty_setup_reply helper). The peer treats the empty reply
as a failed setup and retries; the setup channel stays usable for the next
dApp. On the POSIX connector, which has no REP state machine, the empty
send is a harmless zero-length frame. Malformed-message logs are demoted
to one concise warn line per message.

Add test_setup_bad_request: a raw ZMQ REQ peer sends garbage to a running
agent and must still get a reply, and a well-formed SetupRequest on the
same socket afterwards must still receive a positive SetupResponse. The
test fails by timeout without this fix and uses whichever encoding the
build provides (JSON preferred), so it runs on JSON-only builds.
ServiceModel::ran_function_data() defaults to an empty vector, but the
E3AP schema (messages/asn1/V1/e3ap-1.0.0.asn1) declares
E3-RanFunctionDefinition.ranFunctionData as a MANDATORY
OCTET STRING (SIZE (1..32768)) — it can be neither omitted nor empty.
Encoding a setupResponse that advertises such an SM therefore violates
the APER size constraint and aborts the encode of the whole PDU: one
data-less SM silently breaks setup for every dApp, including the RAN
functions that do provide data.

Since the field is mandatory (omitting it is not an option), substitute a
1-byte 0x00 placeholder when the SM provides no data, and say so in a
comment next to the schema reference. SMs with real data are encoded
exactly as before.

Verified against the asn1c runtime generated from this grammar: encoding
E3-RanFunctionDefinition with an empty ranFunctionData fails the PER size
constraint, while the 1-byte placeholder encodes and decodes.

Extend test_asn1_size (the ASN.1-only test binary) with a setupResponse
that carries one SM with data and one without: the encode must succeed,
the result must decode, the real payload must round-trip byte-for-byte,
and the data-less SM must surface the 0x00 placeholder.
SmRegistry was a process-wide singleton, so every E3Interface in a process
shared one SM table. That breaks any multi-agent process in two ways:

  - registering an SM with the same RAN function id on a second,
    completely independent agent fails with SM_ALREADY_REGISTERED, and

  - E3Interface::stop() cleared the SHARED registry, wiping the Service
    Models of every other RAN-role agent still running in the process
    (the existing role guard only protected against the dApp-role case).

Each E3Interface now owns an SmRegistry member, and all call sites in
e3_interface.cpp (register_sm, get_available_ran_functions, the setup and
sm-data loops, handle_setup_request, handle_control_action,
on_sm_lifecycle_change, and the teardown clear) go through the owning
interface's registry; stop() clears only the agent's own SMs. The
SmRegistry class API is unchanged apart from dropping instance() and
making construction public. Nothing else references the singleton — the
C API and SWIG layer reach SMs through E3Agent, which already routes via
its E3Interface — so single-agent behaviour is identical, including the
clear-on-stop semantics.

Add test_multi_agent_registry: two RAN-role E3Agents in one process (on
distinct IPC endpoints) both register an SM with ran_function_id 1 — both
must succeed — then one agent is stopped and destroyed and the survivor
must still be running, still advertise RAN function 1, and still serve a
full setup handshake to a real dApp-role agent that is offered that RAN
function. The test fails on the singleton design (second registration
returns SM_ALREADY_REGISTERED) and uses the build's available encoding
(JSON preferred), so it runs on JSON-only builds.
@Ninjabippo1205 Ninjabippo1205 requested a review from Thecave3 as a code owner June 12, 2026 04:40
@github-actions

Copy link
Copy Markdown
Contributor

⏱️ Full-loop Latency Benchmark (commit ef2a2a7)

Full-loop latency benchmark (N=1033 after 50 warmup)

All values in microseconds (us). Transport: ZMQ over IPC, encoding: ASN.1 APER.

Phase mean p50 p99 max
1. Collect indication data 0 0 0 0
2. Create & encode indication 0 0 1 11
3. Deliver indication (RAN -> dApp) 95 93 159 181
4. Decode indication 0 0 1 10
5. Process data 0 0 0 10
6. Create & encode control 0 0 0 2
7. Deliver control (dApp -> RAN) 110 106 172 791
8. Decode & handle control 0 0 0 9
Total round-trip 207 196 324 898

Benchmarked on ubuntu-latest, Release build, ZMQ + IPC, ASN.1 APER.

@github-actions

Copy link
Copy Markdown
Contributor

🔄 E2E dApp Integration Results (commit ef2a2a7)

posix/ipc

  • dApp exit: 0
  • Indications received: 7

posix/tcp

  • dApp exit: 0
  • Indications received: 7

zmq/ipc

  • dApp exit: 0
  • Indications received: 7

zmq/tcp

  • dApp exit: 0
  • Indications received: 7

example_simple_agent + example_simple_dapp on ubuntu-latest, Release build

@github-actions

Copy link
Copy Markdown
Contributor

🔀 E2E Topologies — multi-dApp / multi-RAN (commit ef2a2a7)

zmq/ipc

  • 1 RAN - 1 dApp: indications=5
    • dapp peer=t11 ran=ran-solo sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.2 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
  • 1 RAN - 2 dApps: dApp#1 ind=5 sub=1, dApp#2 ind=6 sub=2, RAN saw 2 dApps
    • dapp1 peer=t12 ran=ran-shared sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.4 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
    • dapp2 peer=t12 ran=ran-shared sub=2 indications=6 seq=[0..5] dropped=0 (0%) age_ms(avg=0.333333 max=1 @seq=0) hist[<=1:6 2-5:0 6-10:0 >10:0]
  • 2 RANs - 1 dApp: from ran-a ind=5, from ran-b ind=5
    • dapp peer=t2a ran=ran-a sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0 max=0 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
    • dapp peer=t2b ran=ran-b sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.4 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]

zmq/tcp

  • 1 RAN - 1 dApp: indications=5
    • dapp peer=default ran=ran-solo sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0 max=0 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
  • 1 RAN - 2 dApps: dApp#1 ind=5 sub=1, dApp#2 ind=6 sub=2, RAN saw 2 dApps
    • dapp1 peer=default ran=ran-shared sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.2 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
    • dapp2 peer=default ran=ran-shared sub=2 indications=6 seq=[0..5] dropped=0 (0%) age_ms(avg=0.166667 max=1 @seq=0) hist[<=1:6 2-5:0 6-10:0 >10:0]
  • 2 RANs - 1 dApp: from ran-a ind=5, from ran-b ind=5
    • dapp peer=default ran=ran-a sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.8 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]
    • dapp peer=off100 ran=ran-b sub=1 indications=5 seq=[0..4] dropped=0 (0%) age_ms(avg=0.4 max=1 @seq=0) hist[<=1:5 2-5:0 6-10:0 >10:0]

example_simple_agent + example_simple_dapp on ubuntu-latest, Release. Indication age is report-only.

@Thecave3

Copy link
Copy Markdown
Collaborator

let's use this branch to work exclusively on the fix

@Thecave3 Thecave3 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough writeup and the regression tests — the diagnosis in #28 is spot-on and the test harness is genuinely good. I'd like to split this before merge, though: #28 is only the setup-socket wedge, and the PR currently bundles two unrelated changes under it. Concretely:

Fix 1 (setup wedge) — keep, but reply with a real negative SetupResponse, not an empty frame. Our encoding is fixed and known a priori by every peer, so on a malformed/undecodable request we can always encode a proper SetupResponse with ResponseCode::NEGATIVE. When the bytes don't decode we don't have the original request_id — just default it (0 / next-available), since the dApp's lazy-pirate loop doesn't correlate replies by id anyway. That removes the need for the empty-frame helper entirely: the peer gets an explicit rejection instead of a silent timeout. test_setup_bad_request should then assert it receives a decodable negative response, not just n >= 0.

Fix 2 (0x00 padding) — please drop this. ranFunctionData being mandatory SIZE(1..32768) is correct; the issue is that an SM should never advertise empty data in the first place (a real SM always carries at least its name — see simple_agent / e3sm_simple), and ranFunctionList is OPTIONAL, so "nothing to advertise" means omitting the entry, not padding it. The only way an empty value reaches the encoder today is the C API's ran_function_data "may be NULL / len 0" contract, which directly contradicts the schema. Padding hides that mismatch (and makes a decoded {0x00} indistinguishable from real data). Tracked as #30 — let's fix it at the source there instead.

Fix 3 (per-interface SmRegistry) — let's pull this into its own PR. It's a structural change to a public header and ~10 call sites, unrelated to #28, and it carries a behavioral change (SMs destroyed without their destroy() hook if stop() isn't called). The multi-agent-in-one-process motivation is reasonable but deserves its own issue and review on its own merits — opened #31 to discuss whether/why we want to support that configuration. (For what it's worth, no external consumer references SmRegistry::instance(), so the API break is internal-only — splitting it costs nothing downstream.) test_multi_agent_registry moves with it.

The ASN.1-only CI gating (LIBE3_ENABLE_ASN1 test exclusion) is a clean, standalone improvement — happy to see that land either here or as its own tiny PR.

So: this PR becomes Fix 1 (negative-response variant) + test_setup_bad_request, closing #28; Fixes 2 and 3 each get their own issue/PR (#30, #31). Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Malformed or wrong-encoding setup request permanently wedges the agent's setup socket

2 participants