Gate 7b — EVPN VTEP Linux dataplane reconciler (foundation)#34
Open
Introduces the portable domain surface that crates/evpn-linux will consume across the watch-channel boundary defined by ADR-0054. These types deliberately live in crates/evpn rather than crates/evpn-linux so the daemon can construct intents on platforms that don't compile the netlink crate (macOS dev), and so a future RR-only feature flag can drop crates/evpn-linux entirely without touching the intent surface.
New module crates/evpn/src/mac.rs exports MacAddress (re-exported from rustbgpd-wire so consumers don't need a wire dep), RemoteMacEntry, RemoteMacSource, RemoteMacTable + RemoteMacTableBuilder, and LocalMacObservation. The builder rejects duplicate (VNI, MAC) keys with a typed error so the daemon-side projection from RIB best-paths must resolve mobility races deterministically before the dataplane sees inconsistent intent. Iteration is BTreeMap-ordered for deterministic diff output and reproducible property tests.
New module crates/evpn/src/dataplane.rs exports DataplaneIntent (generation + Arc<EvpnInstanceTable> + Arc<RemoteMacTable>), DataplaneReport (intent_generation echoes the snapshot's gen for correlation; reconcile_generation is the actor's own monotonic counter), InstanceDataplaneStatus, InstanceState (Ready / NotReady / Unbound), AppliedOp / FailedOp, and DataplaneOpKind (AddRemoteFdb / UpdateRemoteFdb / RemoveRemoteFdb).
EvpnInstanceTable gains PartialEq + Eq derives so DataplaneIntent can derive them too — semantically right because the table value is content-equal across reload cycles when nothing changed.
12 new unit tests (5 dataplane + 7 mac) cover builder duplicate rejection, deterministic iteration order, instance-scoped iteration, empty-intent invariants, op-kind discrimination, and the publish struct-update pattern.
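The duplicate-rejecting builder and BTreeMap-ordered iteration described above can be sketched roughly as follows. This is a minimal illustration, not the crate's code: the `Vni` and `Mac` aliases, the `DuplicateKey` error name, and the field layout are all assumptions made for the example.

```rust
use std::collections::BTreeMap;
use std::net::IpAddr;

// Hypothetical stand-ins for the real key types.
type Vni = u32;
type Mac = [u8; 6];

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RemoteMacEntry {
    pub vni: Vni,
    pub mac: Mac,
    pub remote_vtep: IpAddr,
}

// Hypothetical typed error returned on a repeated (VNI, MAC) key.
#[derive(Debug, PartialEq, Eq)]
pub struct DuplicateKey {
    pub vni: Vni,
    pub mac: Mac,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct RemoteMacTable {
    entries: BTreeMap<(Vni, Mac), RemoteMacEntry>,
}

impl RemoteMacTable {
    // BTreeMap is sorted by key, so iteration is always in (VNI, MAC) order.
    pub fn iter(&self) -> impl Iterator<Item = &RemoteMacEntry> {
        self.entries.values()
    }
}

#[derive(Default)]
pub struct RemoteMacTableBuilder {
    entries: BTreeMap<(Vni, Mac), RemoteMacEntry>,
}

impl RemoteMacTableBuilder {
    // Rejects a second entry for the same key instead of overwriting it,
    // so the caller has to pick a winner before building.
    pub fn insert(&mut self, entry: RemoteMacEntry) -> Result<(), DuplicateKey> {
        let key = (entry.vni, entry.mac);
        if self.entries.contains_key(&key) {
            return Err(DuplicateKey { vni: entry.vni, mac: entry.mac });
        }
        self.entries.insert(key, entry);
        Ok(())
    }

    pub fn build(self) -> RemoteMacTable {
        RemoteMacTable { entries: self.entries }
    }
}
```

Refusing the overwrite, rather than keeping the last writer, is what forces the mobility tie-break to happen upstream in the projection.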
New workspace crate crates/evpn-linux carrying the kernel-side reconciliation contract from ADR-0054. Phase 2 ships everything except the real netlink impl (Phase 4) and the reconcile actor (Phase 3): the trait abstraction, the kernel-snapshot types, the pure compute_diff function with foreign-entry preservation, and a fully-functional InMemoryDataplane fake the actor tests will drive in Phase 3.
Crate layout:
- src/dataplane.rs — Dataplane trait (native async fn in trait, + Send bounds; no async_trait crate). DataplaneOp + KernelEvent.
- src/error.rs — DataplaneError with FailureClass classifier (Transient / Permanent / Conflict) so the Phase 3 actor can decide retry vs escalate-to-NotReady vs log-and-skip.
- src/snapshot.rs — KernelSnapshot, KernelFdbEntry + KernelFdbFlags (extern_learn / permanent / noarp / master / self_flag — one-to-one with NTF_*/NUD_* kernel bits, hence the struct_excessive_bools allow), KernelLinkInfo + KernelVxlanInfo for the probe pass, InstanceProbes (Ready / NotReady / Unbound), OwnedSet (the (VNI, MAC) keys we have programmed) + OwnedEntry (last applied dst + mobility seq).
- src/diff.rs — pure compute_diff(desired, snapshot, last_applied, probes) -> Plan. Foreign-entry preservation is structural: pass 1 iterates desired (creates/updates only where we own it or kernel has nothing), pass 2 iterates last_applied — never the kernel snapshot — for deletes, so kernel-learned local MACs and operator-static FDB entries are invisible to the delete pass.
- src/in_memory.rs — InMemoryDataplane fake with cloneable handle for test-side mutation (pre-load FDB, set probes, inject failures, push kernel events, observe apply count). Apply mutates the snapshot with extern_learn flags so subsequent dumps round-trip the fact that we own the entry.
Workspace registers the new crate as a member with a path-only, publish=false dep slot.
28 new tests (3 error, 6 snapshot, 5 in_memory, 11 diff incl. all the cases ADR-0054 calls out + a key-grounding cross-check). Workspace test count climbs from 1406 -> 1434, all green; clippy + doc + fmt clean. Phase 4's LinuxDataplane will plug into the same Dataplane trait without the actor (Phase 3) needing to change.
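The two-pass shape of compute_diff can be illustrated with a much-reduced sketch. It assumes plain maps in place of the real KernelSnapshot / OwnedSet types, omits the probes argument, and uses a hypothetical `Op` enum; only the iteration sources mirror what the commit describes.

```rust
use std::collections::BTreeMap;
use std::net::IpAddr;

// Hypothetical stand-in for the (VNI, MAC) key.
type Key = (u32, [u8; 6]);

#[derive(Debug, PartialEq, Eq)]
pub enum Op {
    Add(Key, IpAddr),
    Update(Key, IpAddr),
    Remove(Key),
}

// Pass 1 walks `desired`; pass 2 walks `last_applied`. The kernel map is
// only looked up by key and never iterated, so an entry we did not program
// can never be selected for deletion.
pub fn compute_diff(
    desired: &BTreeMap<Key, IpAddr>,
    kernel: &BTreeMap<Key, IpAddr>,
    last_applied: &BTreeMap<Key, IpAddr>,
) -> Vec<Op> {
    let mut plan = Vec::new();
    for (key, dst) in desired {
        match (kernel.get(key), last_applied.contains_key(key)) {
            // Kernel has nothing for this key: create it.
            (None, _) => plan.push(Op::Add(*key, *dst)),
            // We own the entry and it points at a stale destination.
            (Some(cur), true) if cur != dst => plan.push(Op::Update(*key, *dst)),
            // Already correct, or a foreign entry holds the key: leave it alone.
            _ => {}
        }
    }
    for key in last_applied.keys() {
        if !desired.contains_key(key) {
            plan.push(Op::Remove(*key));
        }
    }
    plan
}
```

With a foreign entry present in `kernel` but absent from `last_applied`, the resulting plan contains no op for that key, which is the preservation property the tests lock in.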
Implements the level-triggered ReconcileActor<D: Dataplane> from ADR-0054 §6/§7/§8. The actor's tokio::select! loop reacts to four inputs:
- new DataplaneIntent on the watch::Receiver,
- KernelEvent from Dataplane::next_event(),
- 60s periodic full-dump timer (configurable),
- per-failed-op retry timer fed by RetrySchedule.
Plus a fifth: a CancellationToken triggers a bounded 5s drain that deletes only owned remote-MAC FDB entries (foreign entries survive because OwnedSet is the iteration source for both the diff loop's delete pass and drain).
New module crates/evpn-linux/src/backoff.rs:
- RetrySchedule keyed by (EvpnInstanceId, MacAddress) so a single stuck op doesn't gate unrelated successful ones.
- Geometric growth from 100ms -> 5s cap with ±25% jitter via a small LCG (deterministic seed for reproducible tests). Stays in u64 ms end-to-end so no u128/i64 cast lints.
- record_failure / record_success / earliest_due / keys_due (the actor only uses earliest_due in production but keys_due aids diagnostics).
- 7 unit tests cover initial backoff, geometric doubling, cap, jitter band ±25%, success-clears-tracking, keys_due filtering, earliest_due.
New module crates/evpn-linux/src/reconcile.rs:
- ReconcileActor<D> + ReconcileActorConfig (production / for_tests).
- ActorState carries OwnedSet, RetrySchedule, last_intent_generation, reconcile_generation, an Instant epoch for retry-schedule millisecond timestamps.
- coalesce_and_reconcile() applies the configured coalesce_window so fast intent supersession folds into one pass; the level-triggered reconcile pass itself reads borrow_and_update() so subsequent changed() fires only on actual new publishes.
- apply_plan() classifies failures via FailureClass and records into the retry schedule; on success it mirrors into OwnedSet so the next diff treats the entry as ours.
- emit_report() correlates intent_generation back to the daemon and carries reconcile_generation for telemetry.
- drain() runs inside tokio::time::timeout(drain_timeout, ...) for the bounded shutdown-drain ADR-0054 §7 specifies.
7 integration tests in crates/evpn-linux/tests/reconcile_actor.rs exercise the full actor lifecycle against InMemoryDataplane:
- initial reconcile emits Apply for desired MAC
- fast intent supersession reconciles only the latest (validates watch::Receiver coalescing semantics, not mpsc-style queueing)
- failed apply retries on the backoff timer
- shutdown drain preserves a pre-loaded foreign static FDB entry
- periodic dump fires on the 60s cadence
- kernel event triggers immediate reconcile
- NotReady instance emits status row with no ops attempted
Crate adds tokio-util 0.7 for CancellationToken (no other workspace crate uses it; the dep is local to crates/evpn-linux). crates/evpn re-exports rustbgpd_wire::RouteDistinguisher so consumers of the domain crate (the new actor tests, and Phase 5's projection layer) don't need to take a direct rustbgpd-wire dep just to construct an EvpnInstance.
Workspace test count climbs 1434 -> 1448 (14 new). Clippy + doc + fmt clean across the workspace.
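A rough sketch of the per-key backoff described for backoff.rs: geometric doubling from 100 ms, a 5 s cap, and a ±25% jitter band drawn from a seeded LCG, all in u64 milliseconds. The struct layout, the LCG constants, and the choice to apply jitter after the cap are assumptions for illustration; the real RetrySchedule may differ on each point.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for (EvpnInstanceId, MacAddress).
type Key = (u32, [u8; 6]);

const BASE_MS: u64 = 100;
const CAP_MS: u64 = 5_000;

pub struct RetrySchedule {
    // Per key: (consecutive failures, next-due timestamp in ms).
    entries: BTreeMap<Key, (u32, u64)>,
    lcg_state: u64,
}

impl RetrySchedule {
    pub fn new(seed: u64) -> Self {
        Self { entries: BTreeMap::new(), lcg_state: seed }
    }

    // One step of a 64-bit LCG (constants assumed; any full-period pair works).
    fn next_rand(&mut self) -> u64 {
        self.lcg_state = self
            .lcg_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.lcg_state >> 33
    }

    // Records a failure and returns the new due time for that key only.
    pub fn record_failure(&mut self, key: Key, now_ms: u64) -> u64 {
        let failures = self.entries.get(&key).map_or(0, |e| e.0);
        // 100, 200, 400, ... capped at 5000; the shift is bounded so it cannot overflow.
        let nominal = (BASE_MS << failures.min(16)).min(CAP_MS);
        // Uniform in [0.75 * nominal, 1.25 * nominal].
        let jittered = nominal - nominal / 4 + self.next_rand() % (nominal / 2 + 1);
        let due = now_ms + jittered;
        self.entries.insert(key, (failures + 1, due));
        due
    }

    // A success clears tracking so the next failure starts from the base delay.
    pub fn record_success(&mut self, key: &Key) {
        self.entries.remove(key);
    }

    // The deadline the actor's retry timer would sleep until.
    pub fn earliest_due(&self) -> Option<u64> {
        self.entries.values().map(|e| e.1).min()
    }
}
```

Keying the schedule per (instance, MAC) is what keeps one stuck op from delaying unrelated ones: each key carries its own failure count and deadline.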
Lays down the cfg(target_os = "linux") module structure for the real netlink integration without yet pulling in rtnetlink / netlink-packet-* dependencies. The stub LinuxDataplane:
- opens no netlink socket (safe to instantiate without CAP_NET_ADMIN);
- reports Unbound for instances with bridge = None;
- reports NotReady with a "phase 4 stub: real netlink integration not yet wired" reason for instances that have a bridge name configured;
- returns Ok(KernelSnapshot::new()) from dump_snapshot;
- returns DataplaneError::Other(stub_reason) from apply;
- never produces a KernelEvent (next_event() returns std::future::pending, so the actor's tokio::select! ignores the branch and falls back to its periodic dump cadence + retry timer).
This shape exists so Phase 5 (daemon wiring) can compile end-to-end against a real binary on Linux without the netlink work being a blocker on its own merge. The reconcile actor's foreign-entry-preservation, shutdown-drain, and report semantics are identical against the stub as against a real kernel impl, so Phase 5's binary-spawn integration test (Phase 6 extends it) covers the wiring even when the stub is the actual dataplane in production builds.
What lands when the real netlink slice arrives (queued as the next commit on this same feature branch):
- crates/evpn-linux/src/linux/links.rs — bridge + VXLAN inventory
- crates/evpn-linux/src/linux/fdb.rs — RTM_NEWNEIGH/RTM_DELNEIGH with NTF_EXT_LEARNED + NTF_MASTER (bridge/master path per ADR-0054 §5)
- crates/evpn-linux/src/linux/notify.rs — RTNLGRP_LINK / NEIGH / NOTIFY subscription with the ADR §6 startup buffering rule
- crates/evpn-linux/src/linux/probe.rs — VLAN-aware-bridge rejection + kernel-too-old (NTF_EXT_LEARNED EINVAL fallback) detection
- [target.'cfg(target_os = "linux")'.dependencies] = rtnetlink 0.21, netlink-packet-route 0.30, netlink-packet-utils 0.6, netlink-sys 0.8
- tests/netns_dataplane.rs gated on EVPN_LINUX_NETNS=1 (CAP_NET_ADMIN netns required; not a PR-CI gate)
4 new unit tests cover the stub's behavior end-to-end (Unbound vs NotReady probe outcomes, empty dump, apply-always-errors), giving the daemon wiring layer a fixed contract to integrate against. Workspace tests 1448 -> 1452.
ADR-0054 §1 forbids crates/evpn-linux from depending on crates/rib, so the daemon owns the conversion from RIB best-path EVPN routes into the portable RemoteMacTable the dataplane consumes. This commit ships the pure half of that conversion in crates/evpn so the projection is testable without booting the daemon and without taking a wire / RIB dep into evpn-linux.
New module crates/evpn/src/projection.rs:
- ProjectedEvpnRoute — small portable struct the daemon constructs from EvpnRibRoute at the call site (rd, mac, host_ip, label1, next_hop, mobility_sequence). Carries the wire-shaped RouteDistinguisher because it's already public on EvpnInstance and surfaces useful collision messages.
- project_evpn_routes(instances, IntoIterator<ProjectedEvpnRoute>) -> RemoteMacTable. Pure, deterministic.
Mobility tie-break (RFC 7432 §15):
1. Higher mobility_sequence wins (Some(N+1) > Some(N) > None).
2. On equal sequence, lower next_hop IP wins (arbitrary but deterministic — same inputs always pick the same winner).
Routes whose label1's VNI doesn't match a local EvpnInstance are dropped silently; they belong to other VTEPs' EVIs and the dataplane has no business programming them. VNI 0 from MplsLabel::as_vni is also dropped (RFC 8365 §5 reserved), so a malformed wire-side label doesn't cause an unbuildable EvpnInstanceId.
10 unit tests cover empty inputs, single-route round-trip, unknown VNI / zero-label drop, all three tie-break legs (higher seq wins; equal-seq breaks on lower next_hop; no-seq-at-all also breaks on lower next_hop), distinct MACs in same VNI both land, same MAC in different VNIs both land, and reorder-determinism (running the projection twice with shuffled input produces equal output).
Workspace test count climbs 1452 -> 1462 (10 new). Clippy + doc clean. Phase 5b (the daemon-side supervisor that wires this projection into a watch::Sender<Arc<DataplaneIntent>>) lands on the next commit on this branch.
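The mobility tie-break above reduces to a two-level comparison. The sketch below is an illustration under assumptions: `Candidate`, `preference`, and `pick_winner` are hypothetical names, and the real ProjectedEvpnRoute carries more fields than the two that matter for the ordering.

```rust
use std::cmp::Ordering;
use std::net::IpAddr;

// Hypothetical reduction of a projected route to the tie-break inputs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub next_hop: IpAddr,
    pub mobility_sequence: Option<u32>,
}

// Returns Ordering::Greater when `a` should win over `b`.
pub fn preference(a: &Candidate, b: &Candidate) -> Ordering {
    // Option<u32> already orders None < Some(0) < Some(1) < ...,
    // which matches "absent is older than any sequence".
    a.mobility_sequence
        .cmp(&b.mobility_sequence)
        // On equal sequence the LOWER next hop wins, so the IP comparison is reversed.
        .then_with(|| b.next_hop.cmp(&a.next_hop))
}

// Picks the winning route for one (VNI, MAC) key; input order does not matter
// as long as no two candidates share both sequence and next hop.
pub fn pick_winner<I: IntoIterator<Item = Candidate>>(routes: I) -> Option<Candidate> {
    routes.into_iter().max_by(|a, b| preference(a, b))
}
```

Because the comparison is a total order over (sequence, next hop), shuffling the input cannot change the winner, which is the property the reorder-determinism test checks.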
…hase 5b)
Adds the daemon-side glue between the RIB's EVPN best-path table and crates/evpn-linux's reconcile actor. ADR-0054 §1 forbids the dataplane crate from depending on rustbgpd-rib; the daemon binary owns this coordination layer.
New module src/evpn_dataplane.rs:
- spawn(config, &Arc<EvpnInstanceTable>, rib_tx, shutdown) returns Option<EvpnDataplaneHandle>. Empty [[evpn_instances]] -> None (RR-only deployments don't open netlink and don't spawn the actor — ADR-0054 §1 invariant).
- spawn_with_dataplane is generic over D: Dataplane + Send + Sync, so integration tests can inject InMemoryDataplane and the production path uses LinuxDataplane on Linux. Non-Linux platforms return None at the spawn level (logged once at startup).
- supervisor_loop polls the RIB every config.poll_interval (default 5 s) via the existing RibUpdate::QueryEvpnRoutes channel, projects each EvpnRibRoute (filtering to Type 2 MacIp variants) into a ProjectedEvpnRoute via project_one(), runs project_evpn_routes() to build the RemoteMacTable, wraps it in a DataplaneIntent with a monotonic generation counter, and publishes via watch::Sender.
- project_one extracts the MAC mobility sequence from path-attribute ExtendedCommunities (RFC 7432 §15) using the existing ExtendedCommunity::as_mac_mobility helper. Absent extcomm = None, which the projection's tie-break treats as "older than any sequence".
- Status logger task drains DataplaneReports and logs failure summaries; Phase 6 will replace this with the gRPC status surface.
- EvpnDataplaneHandle::shutdown() cancels the token + bounded-awaits both spawned tasks.
Polling cadence note: the reconcile actor's 60 s periodic dump backstop and the per-op 100ms->5s retry already handle kernel drift at finer granularity than the supervisor's 5 s poll. When operator demand pushes for sub-second MAC convergence, the path forward is a tokio::sync::Notify pinged from the RIB's best-path apply path (documented as Gate 7c).
src/main.rs wires:
- new mod evpn_dataplane declaration alongside the existing modules,
- a CancellationToken just before the gRPC server spawn,
- evpn_dataplane::spawn(...) right after the Arc<EvpnInstanceTable> is constructed (line 1018) and before it's moved into ServeConfig at line 1036. The handle is bound to _evpn_dataplane_handle so it lives for the daemon lifetime; dropping at exit cancels the reconcile actor's drain.
Cargo.toml gains rustbgpd-evpn-linux (workspace) and tokio-util 0.7 (direct dep with the rt feature for CancellationToken — only used inside the daemon binary; the evpn-linux crate already had it).
5 new tests cover project_one's Type 2 vs non-Type-2 filtering, absent-extcomm sequence handling, the RR-only spawn-returns-None path, and an end-to-end supervisor_publishes_intent_built_from_rib_query that wires a stub RIB responder to the supervisor + actor + InMemoryDataplane and asserts the projected MAC lands in the fake kernel snapshot.
Workspace test count climbs 1462 -> 1467 (5 new). Clippy + doc clean across the whole workspace. Phase 6 (status surface + binary tests + CHANGELOG) lands next.
CHANGELOG.md gains an [Unreleased] entry summarizing the six-commit Gate 7b foundation: domain types, diff loop with foreign-entry preservation, reconcile actor with backoff + drain, InMemoryDataplane fake, cfg-gated LinuxDataplane stub, RIB-projection, and daemon supervisor wiring. Test count called out (1406 -> 1467 -> 1468 over the branch). Packaging notes cover the new crates/evpn-linux workspace member and the tokio-util binary dep.
New tests/evpn_dataplane_rr_only.rs: ADR-0054 §1 invariant test. Boots the real rustbgpd binary with NO [[evpn_instances]] and asserts:
- the supervisor's "no EVPN instances configured — dataplane actor not spawned" log line lands in the daemon's structured-log output (tracing_subscriber::fmt() writes to stdout — the test captures both stdout and stderr because the early console banner goes to stderr while the JSON-formatted info log lands on stdout);
- the actor's "EVPN dataplane reconcile applied" / "EVPN dataplane apply failures" logs DO NOT appear (they're only emitted from the spawned report-logger task, which short-circuits when the supervisor returns None).
This closes the architectural invariant from ADR-0054 §1: route-reflector deployments incur zero dataplane cost. The assertion is on log content rather than a structured "is the actor spawned" probe because gRPC status surfacing for the dataplane is queued under a follow-up commit on the same branch.
ROADMAP.md gains a "Next Up — Pre-v1.0 Polish" entry tracking the Gate 7b foundation as in-flight on feat/evpn-linux-dataplane, with explicit framing of what's already there (the contract surface, diff loop, actor, supervisor, RR-only short-circuit) versus what the next commit on the branch ships (the rtnetlink/netlink-packet-route LinuxDataplane and the privileged tests/netns_dataplane.rs gated by EVPN_LINUX_NETNS=1).
Workspace test count climbs 1467 -> 1468. Clippy + doc + fmt clean across the whole workspace.
Replaces the cfg(target_os = "linux") stub from `16d17c2` with a
working netlink integration backed by rtnetlink 0.14 +
netlink-packet-route 0.19. The level-triggered ReconcileActor and the
daemon supervisor are unchanged — the trait surface is fixed, so all
of phase 2/3/5's tests continue to pass against this real impl
without modification.
Three new submodules under crates/evpn-linux/src/linux/:
- links.rs — Walks LinkHandle::get once per dump, splits into
bridges (with vlan_filtering state) and VXLAN ports (with vni,
local IP, learning_disabled). A second pass stitches each VXLAN
port onto its master bridge via the Controller (IFLA_MASTER)
attribute. If two VXLAN ports race onto the same bridge the slot
is cleared, surfacing the ambiguity as NotReady through probe.rs
(ADR-0054 §4 requires "exactly one VXLAN port for the instance
VNI"). Builds a LinkCache the dump_snapshot path persists onto
LinuxDataplane so apply() can resolve bridge ifindex by VNI
without a second netlink round-trip.
- fdb.rs — Bridge-family neighbour dump turns the kernel FDB into
KernelFdbEntry rows keyed by (EvpnInstanceId, MacAddress). NTF flags
are mapped onto KernelFdbFlags so the diff loop can distinguish
rustbgpd-owned entries (NTF_EXT_LEARNED) from operator-static
(NUD_PERMANENT/NUD_NOARP) and kernel-learned local entries
(dynamic, no extern_learn). apply_op() turns DataplaneOp::AddRemoteFdb
/ UpdateRemoteFdb into NeighbourAddRequest::add_bridge() with
NTF_EXT_LEARNED + NUD_NOARP and `.replace()` for idempotency;
RemoveRemoteFdb constructs a NeighbourMessage and calls
NeighbourHandle::del.
- probe.rs — Per-instance readiness check covering the ADR-0054 §4
checklist: bridge exists, exactly one VXLAN port, VNI matches,
local_vtep_ip matches, learning disabled, and the bridge is NOT
VLAN-aware. Failed checks produce operator-facing reason strings;
bridge=None instances produce Unbound. 8 unit tests cover each
rejection leg + the happy path + a multi-instance walk.
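The readiness checks listed for probe.rs can be pictured as a chain of early exits. The following is a minimal sketch, assuming simplified stand-in types (`Instance`, `Bridge`, `VxlanPort`) and made-up reason strings; the real probe works against the link cache built by links.rs.

```rust
use std::net::IpAddr;

#[derive(Debug, PartialEq, Eq)]
pub enum InstanceState {
    Ready,
    NotReady(String),
    Unbound,
}

// Hypothetical, simplified views of the configured instance and kernel links.
pub struct Instance {
    pub vni: u32,
    pub local_vtep_ip: IpAddr,
    pub bridge: Option<String>,
}

pub struct VxlanPort {
    pub vni: u32,
    pub local_ip: IpAddr,
    pub learning_disabled: bool,
}

pub struct Bridge {
    pub name: String,
    pub vlan_filtering: bool,
    pub vxlan_ports: Vec<VxlanPort>,
}

pub fn probe(inst: &Instance, bridges: &[Bridge]) -> InstanceState {
    // No bridge configured: the instance is Unbound, not an error.
    let Some(name) = inst.bridge.as_deref() else {
        return InstanceState::Unbound;
    };
    let Some(bridge) = bridges.iter().find(|b| b.name == name) else {
        return InstanceState::NotReady(format!("bridge {name} not found"));
    };
    // Exactly one VXLAN port must be attached to the bridge.
    let [port] = bridge.vxlan_ports.as_slice() else {
        return InstanceState::NotReady(format!(
            "bridge {name} has {} VXLAN ports, expected exactly one",
            bridge.vxlan_ports.len()
        ));
    };
    let reason = if port.vni != inst.vni {
        "VXLAN port VNI does not match the instance VNI"
    } else if port.local_ip != inst.local_vtep_ip {
        "VXLAN local IP does not match local_vtep_ip"
    } else if !port.learning_disabled {
        "VXLAN port has learning enabled"
    } else if bridge.vlan_filtering {
        "bridge is VLAN-aware"
    } else {
        return InstanceState::Ready;
    };
    InstanceState::NotReady(reason.to_string())
}
```

Each failed check maps to exactly one reason string, which keeps the per-rejection-leg unit tests straightforward.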
LinuxDataplane::connect() opens netlink and spawns the rtnetlink
connection driver task. Returns DataplaneError::Io if the socket
fails (no CAP_NET_ADMIN, AF_NETLINK unavailable). Daemon's spawn
path catches this and logs warn! rather than crashing — running
rustbgpd in an unprivileged container becomes a no-op for EVPN
instead of a fatal error.
next_event() is intentionally still pending(): RTNLGRP_NEIGH /
RTNLGRP_LINK subscription is queued as a follow-up. The
level-triggered design (60 s periodic dump + per-op retry) repairs
kernel drift structurally, so the design tolerates the gap: drift is
repaired at the next periodic dump or retry rather than immediately.
Cargo.toml gains the cfg-gated target dep block:
rtnetlink = "0.14"
netlink-packet-route = "0.19"
netlink-packet-core = "0.7"
netlink-packet-utils = "0.5"
netlink-sys = { version = "0.8", features = ["tokio_socket"] }
futures = "0.3"
Pinned to rtnetlink 0.14 (paired with netlink-packet-route 0.19) —
newer 0.21+ releases changed the message-shape ABI and pulled in
async-std incompatibly with the workspace's tokio/tonic stack.
Workspace test count climbs 1468 -> 1473 (5 new probe.rs tests +
the existing connect-doesnt-panic smoke). All previous Phase 2/3/5
tests pass unchanged against the real impl on Linux. PR-CI doesn't
have CAP_NET_ADMIN so connect() may surface DataplaneError::Io;
the daemon warn-and-disable path covers that gracefully.
Adds crates/evpn-linux/tests/netns_dataplane.rs gated on EVPN_LINUX_NETNS=1. The test:
1. Creates a Linux network namespace and a bridge + VXLAN port inside it (vni 2_000_100, local 127.0.0.10, nolearning).
2. Pre-loads a foreign static FDB entry the dataplane must preserve.
3. Re-execs itself inside the netns via `ip netns exec` so the inner test process opens netlink against the namespace's own FDB (rather than the host's).
4. Calls LinuxDataplane::apply with AddRemoteFdb, asserts the entry appears in `bridge fdb show` with extern_learn (or offload).
5. Calls RemoveRemoteFdb, asserts the entry is gone.
6. Back in the outer process, verifies the foreign static entry is still present — end-to-end validation of ADR-0054 §5/§7 foreign-entry preservation.
Skips cleanly when EVPN_LINUX_NETNS is unset (PR-CI default), emitting a "skipping: set EVPN_LINUX_NETNS=1" notice. CI runners without CAP_NET_ADMIN never attempt the privileged operations.
CHANGELOG and ROADMAP updated to reflect that the real netlink integration has landed:
- CHANGELOG: replace the "phase 4 stub" bullet with the real rtnetlink integration description (links/fdb/probe submodules, the connect() warn-and-disable path, NTF_EXT_LEARNED + NUD_NOARP programming with .replace() idempotency); add a netns-test bullet under Tests; update the test count to 50 new / 1474 workspace; expand the Packaging block with the cfg-gated dep versions and the rationale for pinning rtnetlink 0.14 (newer releases pull async-std incompatibly with tokio/tonic).
- ROADMAP: rewrite the in-flight entry to remove "stub" framing and call out the remaining follow-up (RTNLGRP_NEIGH/LINK notification subscription — level-triggered design tolerates the gap via the 60 s periodic dump).
Workspace test count climbs 1473 -> 1474 (1 new netns gate test). Clippy + doc + fmt all clean.
Six review findings, all addressed in one commit on the branch.
== Blocker 1: FDB targeted bridge ifindex with wrong flags ==
Linux EVPN remote MACs program with the `bridge fdb add MAC dev vxlanX master dst REMOTE` shape — the netlink message's ifindex is the *VXLAN port*, not the bridge, and the entry carries NTF_MASTER. The previous impl targeted the bridge ifindex with only NTF_EXT_LEARNED, which (a) wouldn't reach the right device on the wire and (b) wouldn't activate the bridge/master path switchdev offload requires.
Fix: KernelVxlanInfo gains an `ifindex` field; LinkCache replaces `bridge_ifindex_to_name` with `vxlan_ifindex_to_vni` (FDB messages arrive keyed by VXLAN ifindex, not bridge); apply_op() resolves VNI to VXLAN ifindex via the cache, calls add_bridge() with that ifindex, and includes both NeighbourFlag::Controller (NTF_MASTER) and NeighbourFlag::ExtLearned. Delete path constructs a NeighbourMessage on the VXLAN port with the same Controller flag. Parse path now keys on header.ifindex via the new map and reads the Controller flag explicitly into KernelFdbFlags.master rather than inferring it from the absence of NTF_SELF.
== Blocker 2: shutdown drain never wired ==
`_evpn_dataplane_handle = evpn_dataplane::spawn(...)` was held with a leading underscore and never used, leaving `EvpnDataplaneHandle::shutdown()` as dead code. Dropping a JoinHandle detaches; dropping the CancellationToken doesn't cancel anything that already cloned it. The actor's drain path was unreachable.
Fix: drop the leading underscore on `evpn_dataplane_handle`, and add a `2.5` step in the daemon's coordinated shutdown block that calls `handle.shutdown().await` after PeerManager and before BMP. Drain runs the actor's bounded 5s remote-MAC delete pass (foreign entries still survive structurally), with a 10s outer timeout so a stuck task can't wedge the daemon's exit. The dead_code allow on EvpnDataplaneHandle::shutdown is removed; the doc comment is rewritten to clarify that *Drop* doesn't run async drain — the caller must call shutdown().await explicitly.
== Should-fix 3: retry/backoff recorded but not enforced ==
Previous reconcile_once() ran the full plan unconditionally on every wake, so any watch update / kernel event / periodic dump bypassed the per-op backoff and re-attempted permanently-failed ops at maximum frequency. RetrySchedule::record_failure was effectively just telemetry.
Fix: apply_plan() filters every op through:
1. `permanent_failures` set — keys that hit FailureClass::Permanent are suppressed entirely until the next intent generation (intent.generation != permanent_anchor_generation clears them so an operator's fix actually retries).
2. `RetrySchedule::next_due_for(vni, mac)` — transient failures defer until the per-op backoff deadline elapses. The actor's outer tokio::select! re-fires on the retry timer when the earliest-due deadline arrives, so a deferred op runs as soon as it's ready instead of waiting for the 60s periodic dump.
New backoff::RetrySchedule::next_due_for accessor returns the per-key deadline. New reconcile_actor.rs test `permanent_failure_is_suppressed_until_next_intent_generation` locks the suppression contract: inject KernelTooOld on op N, verify apply_count stops growing across periodic-dump cycles, verify a fresh intent generation re-runs it.
== Should-fix 4: classify_apply_error lumped everything Transient ==
The previous classifier put rtnetlink::Error::NetlinkError into DataplaneError::Other, which classifies as Transient — so an EPERM failure from a missing CAP_NET_ADMIN would retry forever instead of the warn-and-disable behavior the connect() docstring promised.
Fix: classify_apply_error now string-matches errno mnemonics in the rendered error and maps EPERM/EACCES + EOPNOTSUPP to KernelTooOld (Permanent class) and EINVAL to InvalidArgument (also Permanent). Anything else stays in Other (Transient). String-matching the Display impl is conservative because rtnetlink 0.14's ErrorMessage::raw_code() isn't part of the public API at this version.
== Should-fix 5: netns test bypassed link-cache priming ==
The previous netns test called `dp.apply()` directly on a fresh LinuxDataplane, which would hit `LinkNotFound` because the link cache is empty until probe() or dump_snapshot() runs.
Fix: build an EvpnInstanceTable matching the netns topology, call `dp.probe(&table).await` first (which populates the cache), assert the instance reports `Ready`, *then* run apply(). Now the test exercises the same precondition path the real reconcile actor takes; if either the probe or the FDB program path regresses, the test fails on a privileged runner.
== Should-fix 6: VXLAN ambiguity counter + learning fail-open ==
Two edge cases in links.rs:
1. The "two VXLAN ports" code toggled `Some -> None -> Some` so three attaches reset to the first port's info. Fix: track an explicit `vxlan_attach_count` per bridge; once the count exceeds 1 the slot is cleared and never re-set, and probe() reports NotReady citing the count.
2. `learning_disabled: bool` defaulted to `true` so a kernel that omitted IFLA_VXLAN_LEARNING quietly passed the readiness check. Fix: change to `Option<bool>`; probe() fails closed on `None` with "VXLAN port did not report IFLA_VXLAN_LEARNING".
Two new probe.rs tests:
- not_ready_when_learning_attribute_missing
- not_ready_when_two_vxlan_ports_attached
== Open question 7: self-originated Type 2 not filtered ==
projection.rs filtered routes by VNI but not by next-hop equal to the local VTEP. A locally-originated or controller-injected Type 2 route in the RIB would project as a remote FDB entry pointing back at our own VTEP, creating a black hole.
Fix: project_evpn_routes skips routes whose next_hop == the matched EvpnInstance's local_vtep_ip.
Two new tests:
- self_originated_route_is_dropped
- self_filter_does_not_affect_other_vnis_with_same_vtep
== Test count + gates ==
Workspace test count climbs 1474 -> 1479 (5 new). Existing tests all pass unchanged. Clippy + doc + fmt clean. CHANGELOG and the Gate 7b PR #34 review punch list updated.
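The attach-count fix from Should-fix 6 can be shown in isolation. This is a sketch under assumptions: `BridgeSlot` and `VxlanInfo` are hypothetical reductions of the link-cache entry, keeping only the slot and the counter.

```rust
// Hypothetical reduction of the per-port info kept in the link cache.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VxlanInfo {
    pub ifindex: u32,
    pub vni: u32,
}

#[derive(Debug, Default)]
pub struct BridgeSlot {
    pub vxlan: Option<VxlanInfo>,
    pub vxlan_attach_count: u32,
}

impl BridgeSlot {
    // Records one VXLAN port found with this bridge as its master.
    pub fn attach(&mut self, port: VxlanInfo) {
        self.vxlan_attach_count += 1;
        // Only the first attach may fill the slot. Any later attach clears it,
        // and because the decision depends on the count rather than on the
        // slot's current state, a third attach cannot re-fill it.
        self.vxlan = if self.vxlan_attach_count == 1 { Some(port) } else { None };
    }
}
```

A toggle keyed on the slot itself would see `None` on the third attach and treat it as the first, which is the `Some -> None -> Some` bug the counter removes.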
…ssifier)
Three more findings from the second review round, all addressed.
== Should-fix 1: permanent-failure suppression defeated by every poll ==
The reconcile actor cleared `permanent_failures` whenever the intent
generation changed, but the daemon supervisor previously bumped the
generation on every 5 s poll regardless of whether the projected
RemoteMacTable was actually different. Net effect: an EPERM /
EOPNOTSUPP / EINVAL on op N stayed suppressed for ~5 s, then the
next poll cleared the set and the op got re-applied at full
frequency. The whole permanent-suppression contract was undermined
by the supervisor's polling cadence.
Fix: supervisor_loop now caches the last published RemoteMacTable
and skips the `intent_tx.send` when the new projection equals the
last one. Generation only advances on semantic change. The
EvpnInstanceTable is pinned at startup (ADR-0052) so equality on
RemoteMacTable alone is sufficient for now; comment notes the path
for extending if instances ever become mutable.
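The publish-only-on-change rule can be sketched without the async machinery. The `Publisher` type below is hypothetical, and the tuple-vector `RemoteMacTable` is a stand-in; in the daemon the returned intent would be handed to the watch sender rather than returned to the caller.

```rust
use std::sync::Arc;

// Stand-in for the real table; only PartialEq matters for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RemoteMacTable(pub Vec<(u32, [u8; 6])>);

pub struct DataplaneIntent {
    pub generation: u64,
    pub remote_macs: Arc<RemoteMacTable>,
}

// Hypothetical helper holding the supervisor's publish state.
pub struct Publisher {
    last: Option<Arc<RemoteMacTable>>,
    generation: u64,
}

impl Publisher {
    pub fn new() -> Self {
        Self { last: None, generation: 0 }
    }

    // Called once per poll. Returns Some(intent) when the projection differs
    // from the last published table, and None when nothing changed, so the
    // generation only advances on a semantic change.
    pub fn on_poll(&mut self, projected: RemoteMacTable) -> Option<DataplaneIntent> {
        if self.last.as_deref() == Some(&projected) {
            return None;
        }
        let table = Arc::new(projected);
        self.last = Some(Arc::clone(&table));
        self.generation += 1;
        Some(DataplaneIntent { generation: self.generation, remote_macs: table })
    }
}
```

With a stable RIB, repeated polls return None after the first publish, so anything keyed on the generation (such as failure suppression in the actor) is not disturbed by the polling cadence.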
New test:
src/evpn_dataplane.rs::supervisor_does_not_bump_generation_on_stable_table
spawns supervisor_loop directly, points it at a stub RIB that
returns the same routes every call, and asserts the watch only
fires `changed()` at most twice (cold-start gen=0 + first publish
gen=1).
== Should-fix 2: classifier string-matched Debug, mislabeled EPERM ==
classify_apply_error did `format!("{err:?}").contains("EPERM")` —
operator-visible message was "kernel too old" for missing
CAP_NET_ADMIN (the wrong root cause), and the implementation
depended on the rtnetlink ErrorMessage Debug rendering staying
stable across versions.
Fix: read `ErrorMessage::raw_code()`, take the `unsigned_abs()` to
get the positive errno, dispatch on `libc::EPERM` / `EACCES` /
`EINVAL` / `EOPNOTSUPP`. EPERM and EACCES now map to a new typed
`DataplaneError::PermissionDenied(detail)` variant — its Display is
"permission denied: <kernel msg> (CAP_NET_ADMIN missing or LSM-
blocked)". `class()` returns `Permanent` for the new variant.
Refactored into a pure `errno_to_dataplane_error(errno, detail) ->
DataplaneError` helper so unit tests can exercise the per-errno
mapping without forging an `ErrorMessage` (`#[non_exhaustive]`, no
public constructor). Five new tests in `linux/fdb.rs::tests` cover
EPERM, EACCES, EINVAL, EOPNOTSUPP, and an unknown-errno-stays-
transient case.
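A pure helper of the shape described above might look like the following sketch. The errno constants are hard-coded Linux values so the example needs no `libc` dependency (an assumption for self-containment; the commit uses the `libc` constants), and `from_raw_code` is a hypothetical wrapper showing the `unsigned_abs()` step.

```rust
// Linux errno values, hard-coded here instead of taken from the libc crate.
const EPERM: u32 = 1;
const EACCES: u32 = 13;
const EINVAL: u32 = 22;
const EOPNOTSUPP: u32 = 95;

#[derive(Debug, PartialEq, Eq)]
pub enum FailureClass {
    Transient,
    Permanent,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DataplaneError {
    PermissionDenied(String),
    KernelTooOld(String),
    InvalidArgument(String),
    Other(String),
}

impl DataplaneError {
    // Only the catch-all stays retryable in this sketch.
    pub fn class(&self) -> FailureClass {
        match self {
            Self::Other(_) => FailureClass::Transient,
            _ => FailureClass::Permanent,
        }
    }
}

// Pure mapping from a positive errno to a typed error.
pub fn errno_to_dataplane_error(errno: u32, detail: &str) -> DataplaneError {
    let detail = detail.to_string();
    match errno {
        EPERM | EACCES => DataplaneError::PermissionDenied(detail),
        EOPNOTSUPP => DataplaneError::KernelTooOld(detail),
        EINVAL => DataplaneError::InvalidArgument(detail),
        _ => DataplaneError::Other(detail),
    }
}

// Netlink error acks carry a negative errno; unsigned_abs() recovers the
// positive value without overflowing on i32::MIN.
pub fn from_raw_code(raw_code: i32, detail: &str) -> DataplaneError {
    errno_to_dataplane_error(raw_code.unsigned_abs(), detail)
}
```

Keeping the mapping free of any rtnetlink type is what lets the unit tests cover each errno without constructing an `ErrorMessage`.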
Cargo.toml gains the existing workspace `libc` dep on the
`cfg(target_os = "linux")` target — already pulled in transitively
via tokio/socket2, this just names it explicitly for the errno
constants.
== Nit 3: stale doc comment about master ifindex ==
snapshot.rs's `KernelSnapshot` doc still said "Phase 4 derives the
VNI from the FDB entry's `master` ifindex". The implementation has
since switched to using the VXLAN port ifindex (per ADR-0054 §4 and
the bridge-FDB-on-VXLAN-device kernel convention). Doc updated to
match: VNI now comes from `header.ifindex` via the link cache's
`vxlan_ifindex_to_vni` table.
== Test count + gates ==
Workspace test count climbs 1479 -> 1486 (7 new: 5 errno mapping,
1 supervisor stability, 1 PermissionDenied-class smoke). Existing
tests all pass unchanged. Clippy + doc + fmt clean. CHANGELOG
updated.
…2 nits

Three more findings from the review.

== Should-fix: per-op-fingerprint permanent-failure suppression ==

Previously: when an intent generation changed, the actor cleared permanent_failures for ALL keys at once. So a single unrelated RemoteMacTable change (e.g., the operator added MAC X) would clear suppression for an unrelated MAC Y that hit PermissionDenied or EOPNOTSUPP, causing the actor to retry the impossible op against the kernel again.

Fix: change permanent_failures from BTreeSet<(VNI, MAC)> to BTreeMap<(VNI, MAC), DataplaneOp>. The value is the exact failed op shape (Add/Update/Remove + dst). On every reconcile pass:
- If the current op for (VNI, MAC) equals the recorded op shape, suppress (operator change wouldn't help).
- If the op shape differs (mobility move → different dst, or the desired-state transitioned add↔remove), drop the suppression inline and try the new shape.

Drop the permanent_anchor_generation field: generation-wide clearing is gone; per-op-fingerprint clears lazily and locally, without touching other keys. Cross-key isolation is now structural.

Also dropped the `clear permanent_failures on intent.generation change` block from reconcile_once; that's exactly the generation-wide behavior the reviewer flagged.

Two new tests lock the contract:
- permanent_failure_suppression_is_per_op_fingerprint: same op shape across generations stays suppressed; different op shape clears suppression and runs.
- permanent_failure_does_not_leak_across_keys: a permanent-fail on (VNI 100, MAC 1) does not block (VNI 100, MAC 2) from being applied successfully in the same plan.

The previous test permanent_failure_is_suppressed_until_next_intent_generation is replaced; its premise (generation-wide clear is the right model) no longer holds.

== Nit: stale CHANGELOG line about EPERM mapping ==

CHANGELOG.md§55 still said "EPERM/EACCES and EOPNOTSUPP map to KernelTooOld" alongside the new (correct) PermissionDenied text elsewhere. Rewrote the bullet to match: EPERM/EACCES → PermissionDenied, EOPNOTSUPP → KernelTooOld, EINVAL → InvalidArgument. Also added the operator-facing-message rationale.

== Nit: stale doc comment about Phase 4 stub ==

src/evpn_dataplane.rs::spawn doc still said the Linux dataplane "is currently the Phase 4 stub pending the netlink integration". Updated to "rtnetlink-backed FDB program/withdraw against the bridge/master path".

== Test count + gates ==

Workspace 1486 → 1487 (added permanent_failure_does_not_leak_across_keys; permanent_failure_suppression_is_per_op_fingerprint replaces the old generation-clearing test). All green. Clippy + doc + fmt clean.
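
The per-op-fingerprint rule from the should-fix item can be sketched as follows. This is a minimal illustration, not the actor's code: `Vni`, `Mac`, the simplified `DataplaneOp` enum, and the `Suppression` struct with its two methods are hypothetical stand-ins, assuming only what the message states (a `BTreeMap<(VNI, MAC), DataplaneOp>` whose value is the exact failed op shape).

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the crate's key and op types.
type Vni = u32;
type Mac = [u8; 6];

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum DataplaneOp {
    AddRemoteFdb { dst: [u8; 4] },
    UpdateRemoteFdb { dst: [u8; 4] },
    RemoveRemoteFdb,
}

#[derive(Default)]
struct Suppression {
    // Value is the exact op shape that failed permanently for this key.
    permanent_failures: BTreeMap<(Vni, Mac), DataplaneOp>,
}

impl Suppression {
    fn record_permanent_failure(&mut self, key: (Vni, Mac), op: DataplaneOp) {
        self.permanent_failures.insert(key, op);
    }

    /// Returns true when `op` should be skipped on this pass.
    /// A different op shape for the same key drops the record (lazy, local
    /// clear) and lets the new shape run; other keys are never touched.
    fn should_suppress(&mut self, key: (Vni, Mac), op: &DataplaneOp) -> bool {
        let same_shape = self.permanent_failures.get(&key).map(|failed| failed == op);
        match same_shape {
            Some(true) => true,
            Some(false) => {
                self.permanent_failures.remove(&key);
                false
            }
            None => false,
        }
    }
}
```

Because the lookup is keyed and the comparison is against the stored op, a change to one (VNI, MAC) cannot clear or create suppression for another, which is the cross-key isolation the two new tests describe.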
Real-VTEP smoke against a Linux kernel via containerlab caught a correctness gap that no unit test had: apply_op was programming only the bridge-master FDB row, not the NTF_SELF+dst VXLAN-encap row. Result: control plane looked fine (MAC with extern_learn appeared in bridge fdb show via the master path) but the data plane couldn't actually encap to the remote VTEP because vxlan100 had no dst entry for the MAC.

== The wire shape ==

`strace` on iproute2's `bridge fdb add MAC dev vxlanX master dst REMOTE self extern_learn` shows ONE RTM_NEWNEIGH carrying:

  ndm_state  = NUD_NOARP | NUD_PERMANENT (0x40 | 0x80 = 0xC0)
  ndm_flags  = NTF_SELF | NTF_MASTER | NTF_EXT_LEARNED
  NDA_LLADDR = MAC
  NDA_DST    = REMOTE

The kernel programs both rows from that single message: the NTF_SELF + NDA_DST anchors the VXLAN-encap entry on vxlanX (so the VXLAN driver knows where to tunnel for this MAC), and the NTF_MASTER plumbs the bridge-FDB entry on br100 (so the bridge knows the MAC is reachable via vxlanX). Splitting into two separate calls (which I tried first) returns EINVAL on the master leg; the kernel expects the combined form for a remote-VTEP entry.

== The fix ==

apply_op for AddRemoteFdb / UpdateRemoteFdb now sends one message with all three NTF flags + NDA_DST + ndm_state = NeighbourState::Other(0x40 | 0x80). RemoveRemoteFdb sends one RTM_DELNEIGH with NTF_SELF | NTF_MASTER; the kernel cleans up both rows from that one message.

The crate's NeighbourState enum doesn't represent the combined NUD_NOARP | NUD_PERMANENT bitmask; it has separate Noarp and Permanent variants. Used the `Other(u16)` escape hatch with the explicit constant `0x40 | 0x80 = 0xC0`.

== M36 smoke ==

New tests/interop/m36-evpn-vtep-smoke.clab.yml + script:
- Topology: rustbgpd VTEP (10.0.0.1, AS 65000) ↔ FRR originator (10.0.0.2, AS 65000) over iBGP L2VPN/EVPN.
- start-rustbgpd-vtep.sh pre-creates br100 / vxlan100 (nolearning, local 10.0.0.1) inside the container netns and pre-loads a foreign static FDB entry 02:99:99:99:99:99 → 10.0.0.99.
- start-frr-vtep.sh (existing) sets up FRR's bridge+vxlan+dummy topology so MAC injection on dummy100 triggers Type 2 origination.
- Test asserts: session Established, MAC programmed with extern_learn AND correct dst=10.0.0.2, withdraw cleans up, foreign static survives both cycles.

6/6 PASS locally (Linux 6.17, kernel VXLAN, no privileged-CI runner needed; runs on any Docker host with containerlab).

== Netns test ==

Updated crates/evpn-linux/tests/netns_dataplane.rs to also assert the dst column appears in the program-cycle FDB dump (previously only checked extern_learn).

== Test count + gates ==

Workspace 1487 tests still pass. Clippy + doc + fmt clean. The new netns assertion will fail loudly if the wire shape ever regresses on the privileged path.
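
The bit arithmetic behind the wire shape above can be checked with a few constants. The values are the ones from the Linux uapi header `linux/neighbour.h`; the three helper functions are hypothetical names used only for this sketch and do not appear in the crate.

```rust
// Neighbour state bits (ndm_state), from <linux/neighbour.h>.
const NUD_NOARP: u16 = 0x40;
const NUD_PERMANENT: u16 = 0x80;

// Neighbour flag bits (ndm_flags), from <linux/neighbour.h>.
const NTF_SELF: u8 = 0x02;
const NTF_MASTER: u8 = 0x04;
const NTF_EXT_LEARNED: u8 = 0x10;

/// ndm_state for the single RTM_NEWNEIGH that programs a remote-VTEP entry.
/// (Hypothetical helper name, for illustration only.)
fn remote_fdb_add_state() -> u16 {
    NUD_NOARP | NUD_PERMANENT
}

/// ndm_flags for the same RTM_NEWNEIGH: VXLAN row, bridge row, extern_learn.
fn remote_fdb_add_flags() -> u8 {
    NTF_SELF | NTF_MASTER | NTF_EXT_LEARNED
}

/// ndm_flags for the single RTM_DELNEIGH that withdraws both rows.
fn remote_fdb_del_flags() -> u8 {
    NTF_SELF | NTF_MASTER
}
```

Here `remote_fdb_add_state()` evaluates to `0xC0`, the value the message passes through `NeighbourState::Other`.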
Pull request overview
Adds the EVPN Linux dataplane reconciliation foundation: new EVPN domain types, a Linux-specific reconciler crate, daemon-side supervision that projects RIB EVPN routes into dataplane intent, and smoke/integration coverage around the new VTEP workflow.
Changes:
- Introduces `crates/evpn-linux` with diffing, reconciliation, backoff, snapshot, in-memory fake, and Linux netlink dataplane plumbing.
- Extends `crates/evpn` and the daemon to publish `DataplaneIntent` snapshots from EVPN RIB state and to manage reconciler lifecycle/shutdown.
- Adds RR-only, netns, reconcile-actor, and interop/containerlab coverage plus changelog/roadmap/workspace updates.
Reviewed changes
Copilot reviewed 31 out of 32 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `tests/interop/scripts/test-m36-evpn-vtep-smoke.sh` | Adds end-to-end smoke script for FRR ↔ rustbgpd VTEP FDB programming. |
| `tests/interop/scripts/start-rustbgpd-vtep.sh` | Prepares bridge/VXLAN topology and launches rustbgpd in the VTEP container. |
| `tests/interop/m36-evpn-vtep-smoke.clab.yml` | Defines the containerlab topology for the VTEP smoke scenario. |
| `tests/interop/configs/rustbgpd-m36-vtep.toml` | Adds rustbgpd EVPN VTEP config for the smoke topology. |
| `tests/interop/configs/frr-bgpd-m36-originator.conf` | Adds FRR EVPN originator config for the smoke topology. |
| `tests/evpn_dataplane_rr_only.rs` | Verifies RR-only deployments do not spawn the dataplane actor. |
| `src/main.rs` | Wires daemon startup/shutdown to the EVPN dataplane supervisor. |
| `src/evpn_dataplane.rs` | Implements daemon-side RIB polling, projection, watch publication, and actor spawning. |
| `ROADMAP.md` | Tracks Gate 7b EVPN dataplane work in roadmap status. |
| `crates/evpn/src/projection.rs` | Adds pure projection from EVPN RIB routes into remote-MAC intent. |
| `crates/evpn/src/mac.rs` | Adds remote/local MAC domain types and table builder. |
| `crates/evpn/src/lib.rs` | Exports new EVPN dataplane-related domain modules and types. |
| `crates/evpn/src/instance.rs` | Makes EvpnInstanceTable comparable for intent equality/dedup. |
| `crates/evpn/src/dataplane.rs` | Adds dataplane intent/report/status operation types. |
| `crates/evpn-linux/tests/reconcile_actor.rs` | Adds end-to-end actor tests using the in-memory dataplane. |
| `crates/evpn-linux/tests/netns_dataplane.rs` | Adds privileged Linux netns integration coverage for real FDB programming. |
| `crates/evpn-linux/src/snapshot.rs` | Defines kernel snapshot, probe, and owned-entry state models. |
| `crates/evpn-linux/src/reconcile.rs` | Implements the level-triggered reconcile actor and shutdown drain. |
| `crates/evpn-linux/src/linux/probe.rs` | Adds Linux readiness probing for bridge/VXLAN topology. |
| `crates/evpn-linux/src/linux/mod.rs` | Adds Linux dataplane integration over rtnetlink. |
| `crates/evpn-linux/src/linux/links.rs` | Builds bridge/VXLAN inventory from netlink link dumps. |
| `crates/evpn-linux/src/linux/fdb.rs` | Implements netlink FDB dump/apply/remove and errno classification. |
| `crates/evpn-linux/src/lib.rs` | Exposes the EVPN Linux dataplane crate surface. |
| `crates/evpn-linux/src/in_memory.rs` | Adds the in-memory dataplane fake used by tests. |
| `crates/evpn-linux/src/error.rs` | Defines dataplane error taxonomy and retry classes. |
| `crates/evpn-linux/src/diff.rs` | Implements pure desired-vs-kernel diff planning. |
| `crates/evpn-linux/src/dataplane.rs` | Defines the abstract dataplane trait and operation/event types. |
| `crates/evpn-linux/src/backoff.rs` | Adds per-op exponential backoff scheduling. |
| `crates/evpn-linux/Cargo.toml` | Adds the new workspace crate and Linux-only deps. |
| `CHANGELOG.md` | Documents the Gate 7b dataplane foundation and tests. |
| `Cargo.toml` | Registers the new crate and daemon dependencies. |
| `Cargo.lock` | Locks newly added EVPN Linux/netlink dependencies. |
Comment on lines +127 to +134

    let still_in_kernel = snapshot.find_fdb(vni, mac).is_some();

    // Withdraw if (no longer desired OR instance went NotReady) AND
    // the kernel still has the entry. If the kernel already
    // dropped it (interface flap, manual `bridge fdb del`), we
    // emit no op now — the actor will reconcile its OwnedSet on
    // the next successful pass instead.
    let should_remove = (!still_desired || !instance_ready) && still_in_kernel;
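
The predicate in the excerpt above combines with the ownership rule stated elsewhere in this PR (the delete pass iterates only entries the reconciler programmed). A minimal sketch of that combination, assuming plain sets in place of the crate's `OwnedSet`, `KernelSnapshot`, and probe types; `Key` and `plan_removals` are hypothetical names:

```rust
use std::collections::BTreeSet;

// Hypothetical (VNI, MAC) key; the real crate uses richer types.
type Key = (u32, [u8; 6]);

/// Delete pass: walk only the keys we programmed (`owned`), never the
/// kernel dump, so foreign entries cannot be selected for removal.
fn plan_removals(
    owned: &BTreeSet<Key>,
    desired: &BTreeSet<Key>,
    kernel: &BTreeSet<Key>,
    ready_vnis: &BTreeSet<u32>,
) -> Vec<Key> {
    owned
        .iter()
        .copied()
        .filter(|key| {
            let still_desired = desired.contains(key);
            let instance_ready = ready_vnis.contains(&key.0);
            let still_in_kernel = kernel.contains(key);
            // Same shape as the excerpt: withdraw only what is both
            // unwanted (or on a NotReady instance) and still present.
            (!still_desired || !instance_ready) && still_in_kernel
        })
        .collect()
}
```

A key that exists only in `kernel` (an operator-static or kernel-learned entry) is never visited by the iterator, so preservation does not depend on any flag check.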
Comment on lines +107 to +113

    for inst in instances.iter() {
        probes.insert(
            inst.id,
            crate::snapshot::InstanceProbe::NotReady {
                reason: format!("kernel link dump failed: {e}"),
            },
        );

Comment on lines +151 to +162

    ns.exec(
        "bridge",
        &[
            "fdb",
            "add",
            &foreign_mac,
            "dev",
            vxlan,
            "dst",
            "127.0.0.99",
            "permanent",
        ],

    tracing::warn!(
        ?op,
        error = %err,
        "dataplane op failed permanently; suppressed until next intent generation"

Comment on lines +56 to +59

    rb_fdb_has_extern_learn() {
        local mac=${1:?}
        rb_fdb | grep -iF "$mac" | grep -qE 'extern_learn|offload'
    }
Summary
End-to-end foundation for ADR-0054's EVPN Linux dataplane boundary. Six commits on this branch land everything except the real rtnetlink/netlink-packet-route integration (which is the natural next slice on the same branch).
The contract surface is closed and unit-tested: the daemon publishes `Arc<DataplaneIntent>` snapshots over `tokio::sync::watch`, a level-triggered `ReconcileActor` consumes them through a portable `Dataplane` trait, and the diff loop's foreign-entry-preservation invariant is structural (delete pass iterates `OwnedSet`, never the kernel snapshot).

Scope per phase
d497b27 `crates/evpn`: `DataplaneIntent`, `RemoteMacTable` + builder, `LocalMacObservation`, `InstanceState`, `DataplaneOpKind`
43e26fc `crates/evpn-linux` crate: `Dataplane` trait, `KernelSnapshot` + `OwnedSet`, pure `compute_diff` (11 explicit cases), `InMemoryDataplane` fake
edb25e4 `ReconcileActor<D>` with `tokio::select!` over watch + events + 60s periodic + backoff retry, 5s drain on shutdown. Per-failed-op exponential backoff with deterministic ±25% jitter
16d17c2 `LinuxDataplane` honest stub. No netlink socket opened. Real netlink integration queued as next commit on the branch
77810c2 `crates/evpn::projection`: pure `project_evpn_routes(instances, routes)` with RFC 7432 §15 mobility tie-break (higher seq wins; ties on lower next_hop)
50ea6f2 `src/evpn_dataplane.rs`: polling supervisor → `RibUpdate::QueryEvpnRoutes` → projection → watch publish. Empty `[[evpn_instances]]` short-circuits the spawn entirely
2884c73

Tests
Workspace test count climbs 1406 → 1468 (62 new):
- `crates/evpn` (mac.rs + dataplane.rs)
- `compute_diff` cases + key-grounding cross-check in `crates/evpn-linux/src/diff.rs`
- `crates/evpn-linux/tests/reconcile_actor.rs` covering initial reconcile, fast intent supersession (validates `watch` semantics), failed-apply retry on backoff timer, foreign-entry-preservation through shutdown drain, periodic-dump cadence, kernel-event reconcile, NotReady status emission
- `RibUpdate::QueryEvpnRoutes` → projection → fake-kernel-FDB)
- `tests/evpn_dataplane_rr_only.rs`)

Architectural invariants verified

- `compute_diff` iterates the `OwnedSet` of rustbgpd-programmed keys, so kernel-learned local MACs and operator-static FDB entries cannot be deleted by the algorithm. `tests/reconcile_actor.rs::shutdown_drain_preserves_foreign_static_entry` validates end-to-end.
- An empty `[[evpn_instances]]` short-circuits before any netlink socket or background task is created. Verified by `tests/evpn_dataplane_rr_only.rs` against the real binary.
- `crates/evpn-linux` depends only on `crates/evpn` + tokio + tracing + tokio-util; never on `crates/rib` or `crates/transport`. The daemon's `src/evpn_dataplane.rs` is the only site that touches both `crates/rib` and `crates/evpn`, and it does so by passing pure value types.

Test plan

- `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo test --workspace`, `cargo doc --workspace --no-deps` (all green locally)
- `LinuxDataplane` impl + `tests/netns_dataplane.rs` privileged integration test gated by `EVPN_LINUX_NETNS=1`
- `br100` + VXLAN port on a Linux dev box, run `rustbgpd` against an FRR peer originating Type 2 routes, observe FDB programming via `bridge fdb show`

Status
Draft. Will lift to ready-for-review after the netlink integration commit lands so the merge is one logical PR for the operator-facing feature, not two.
Reference
- `docs/adr/0054-evpn-linux-dataplane-boundary.md` (locked contract)
- `docs/evpn-enablement.md` Gate 7b
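
For readers following the projection step in the scope list, the stated mobility tie-break (higher sequence wins; ties go to the lower next hop) can be sketched as a comparator. `Candidate` and `pick_winner` are hypothetical names for illustration, and the sketch assumes IPv4 next hops; the real `project_evpn_routes` signature and types are not shown in this PR text.

```rust
use std::net::Ipv4Addr;

// Hypothetical view of one competing route for a single (VNI, MAC) key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Candidate {
    mobility_seq: u32,
    next_hop: Ipv4Addr,
}

/// Picks the winning route: highest mobility sequence number first,
/// then the numerically lowest next hop as a deterministic tie-break.
fn pick_winner(candidates: &[Candidate]) -> Option<Candidate> {
    candidates.iter().copied().max_by(|a, b| {
        a.mobility_seq
            .cmp(&b.mobility_seq)
            // Reversed comparison: a lower next hop ranks higher.
            .then_with(|| b.next_hop.cmp(&a.next_hop))
    })
}
```

Because the ordering is total over (sequence, next hop), the result does not depend on the order in which candidate routes are presented, which is what lets the table builder reject duplicate (VNI, MAC) keys without ambiguity.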