Gate 7b — EVPN VTEP Linux dataplane reconciler (foundation)#34

Open
lance0 wants to merge 13 commits into main from
feat/evpn-linux-dataplane

Conversation

lance0 (Owner) commented May 4, 2026

Summary

End-to-end foundation for ADR-0054's EVPN Linux dataplane boundary. Six commits on this branch land everything except the real rtnetlink/netlink-packet-route integration (which is the natural next slice on the same branch).

The contract surface is closed and unit-tested: the daemon publishes Arc<DataplaneIntent> snapshots over tokio::sync::watch, a level-triggered ReconcileActor consumes them through a portable Dataplane trait, and the diff loop's foreign-entry-preservation invariant is structural (delete pass iterates OwnedSet, never the kernel snapshot).

Scope per phase

| # | Commit | Scope |
|---|--------|-------|
| 1 | d497b27 | Domain types in crates/evpn: DataplaneIntent, RemoteMacTable + builder, LocalMacObservation, InstanceState, DataplaneOpKind |
| 2 | 43e26fc | New crates/evpn-linux crate: Dataplane trait, KernelSnapshot + OwnedSet, pure compute_diff (11 explicit cases), InMemoryDataplane fake |
| 3 | edb25e4 | ReconcileActor<D> with tokio::select! over watch + events + 60s periodic + backoff retry, 5s drain on shutdown. Per-failed-op exponential backoff with deterministic ±25% jitter |
| 4 | 16d17c2 | cfg(target_os = "linux")-gated LinuxDataplane honest stub. No netlink socket opened. Real netlink integration queued as next commit on the branch |
| 5a | 77810c2 | crates/evpn::projection: pure project_evpn_routes(instances, routes) with RFC 7432 §15 mobility tie-break (higher seq wins; ties on lower next_hop) |
| 5b | 50ea6f2 | Daemon src/evpn_dataplane.rs: polling supervisor → RibUpdate::QueryEvpnRoutes → projection → watch publish. Empty [[evpn_instances]] short-circuits the spawn entirely |
| 6 | 2884c73 | CHANGELOG, RR-only invariant integration test, ROADMAP entry |

Tests

Workspace test count climbs 1406 → 1468 (62 new):

  • 12 domain-type tests in crates/evpn (mac.rs + dataplane.rs)
  • 11 compute_diff cases + key-grounding cross-check in crates/evpn-linux/src/diff.rs
  • 14 backoff + InMemoryDataplane + actor-loop tests across reconcile/backoff/in_memory
  • 7 actor-lifecycle integration tests in crates/evpn-linux/tests/reconcile_actor.rs covering initial reconcile, fast intent supersession (validates watch semantics), failed-apply retry on backoff timer, foreign-entry-preservation through shutdown drain, periodic-dump cadence, kernel-event reconcile, NotReady status emission
  • 4 LinuxDataplane stub tests
  • 10 projection tests (mobility tie-break, unknown-VNI drop, reorder-determinism)
  • 5 daemon supervisor tests (RR-only spawn-returns-None, end-to-end RibUpdate::QueryEvpnRoutes → projection → fake-kernel-FDB)
  • 1 binary-spawn integration test asserting RR-only deployments don't spawn the actor (tests/evpn_dataplane_rr_only.rs)

Architectural invariants verified

  1. Foreign-entry preservation is structural, not a runtime check. The delete pass in compute_diff iterates the OwnedSet of rustbgpd-programmed keys, so kernel-learned local MACs and operator-static FDB entries cannot be deleted by the algorithm. tests/reconcile_actor.rs::shutdown_drain_preserves_foreign_static_entry validates end-to-end.
  2. RR-only deployments incur zero dataplane cost. Empty [[evpn_instances]] short-circuits before any netlink socket or background task is created. Verified by tests/evpn_dataplane_rr_only.rs against the real binary.
  3. Dependency direction matches ADR-0054 §1. crates/evpn-linux depends only on crates/evpn + tokio + tracing + tokio-util; never on crates/rib or crates/transport. The daemon's src/evpn_dataplane.rs is the only site that touches both crates/rib and crates/evpn, and it does so by passing pure value types.

Test plan

  • PR-CI: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace, cargo doc --workspace --no-deps (all green locally)
  • Follow-up commit (queued on this branch): real rtnetlink/netlink-packet-route LinuxDataplane impl + tests/netns_dataplane.rs privileged integration test gated by EVPN_LINUX_NETNS=1
  • Hand-smoke once netlink lands: pre-create br100 + VXLAN port on a Linux dev box, run rustbgpd against an FRR peer originating Type 2 routes, observe FDB programming via bridge fdb show

Status

Draft. Will lift to ready-for-review after the netlink integration commit lands so the merge is one logical PR for the operator-facing feature, not two.

Reference

  • docs/adr/0054-evpn-linux-dataplane-boundary.md (locked contract)
  • docs/evpn-enablement.md Gate 7b
  • ADR-0050 (EVPN RR Phase 1), ADR-0052 (Gate 7a foundation)

lance0 added 11 commits May 4, 2026 15:48
Introduces the portable domain surface that crates/evpn-linux will
consume across the watch-channel boundary defined by ADR-0054. These
types deliberately live in crates/evpn rather than crates/evpn-linux
so the daemon can construct intents on platforms that don't compile
the netlink crate (macOS dev), and so a future RR-only feature flag
can drop crates/evpn-linux entirely without touching the intent
surface.

New module crates/evpn/src/mac.rs exports MacAddress (re-exported
from rustbgpd-wire so consumers don't need a wire dep), RemoteMacEntry,
RemoteMacSource, RemoteMacTable + RemoteMacTableBuilder, and
LocalMacObservation. The builder rejects duplicate (VNI, MAC) keys
with a typed error so the daemon-side projection from RIB best-paths
must resolve mobility races deterministically before the dataplane
sees inconsistent intent. Iteration is BTreeMap-ordered for
deterministic diff output and reproducible property tests.
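
A minimal sketch of the duplicate-rejecting, BTreeMap-ordered builder described above. The names here (`Key`, `DuplicateKey`, `TableBuilder`) and the plain tuple key are illustrative assumptions, not the actual types in crates/evpn:

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;
use std::net::IpAddr;

/// Hypothetical (VNI, MAC) key; the real crate uses its own newtypes.
pub type Key = (u32, [u8; 6]);

/// Typed error returned when the same (VNI, MAC) is inserted twice.
#[derive(Debug, PartialEq, Eq)]
pub struct DuplicateKey(pub Key);

#[derive(Default)]
pub struct TableBuilder {
    entries: BTreeMap<Key, IpAddr>,
}

impl TableBuilder {
    /// Insert one entry, refusing to overwrite an existing key so the
    /// caller has to resolve mobility races before building.
    pub fn insert(&mut self, vni: u32, mac: [u8; 6], next_hop: IpAddr) -> Result<(), DuplicateKey> {
        match self.entries.entry((vni, mac)) {
            Entry::Occupied(_) => Err(DuplicateKey((vni, mac))),
            Entry::Vacant(slot) => {
                slot.insert(next_hop);
                Ok(())
            }
        }
    }

    /// BTreeMap iteration is key-ordered, so output order does not
    /// depend on insertion order.
    pub fn build(self) -> BTreeMap<Key, IpAddr> {
        self.entries
    }
}
```

Because the map is keyed and ordered, two builders fed the same entries in different orders yield identical iteration output, which is what makes diff output and property tests reproducible.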

New module crates/evpn/src/dataplane.rs exports DataplaneIntent
(generation + Arc<EvpnInstanceTable> + Arc<RemoteMacTable>),
DataplaneReport (intent_generation echoes the snapshot's gen for
correlation; reconcile_generation is the actor's own monotonic
counter), InstanceDataplaneStatus, InstanceState (Ready / NotReady /
Unbound), AppliedOp / FailedOp, and DataplaneOpKind
(AddRemoteFdb / UpdateRemoteFdb / RemoveRemoteFdb).

EvpnInstanceTable gains PartialEq + Eq derives so DataplaneIntent
can derive them too — semantically right because the table value is
content-equal across reload cycles when nothing changed.

12 new unit tests (5 dataplane + 7 mac) cover builder duplicate
rejection, deterministic iteration order, instance-scoped iteration,
empty-intent invariants, op-kind discrimination, and the publish
struct-update pattern.
New workspace crate crates/evpn-linux carrying the kernel-side
reconciliation contract from ADR-0054. Phase 2 ships everything
except the real netlink impl (Phase 4) and the reconcile actor
(Phase 3): the trait abstraction, the kernel-snapshot types, the
pure compute_diff function with foreign-entry preservation, and a
fully-functional InMemoryDataplane fake the actor tests will drive
in Phase 3.

Crate layout:

- src/dataplane.rs — Dataplane trait (native async fn in trait, +
  Send bounds; no async_trait crate). DataplaneOp + KernelEvent.
- src/error.rs — DataplaneError with FailureClass classifier (Transient
  / Permanent / Conflict) so the Phase 3 actor can decide retry vs
  escalate-to-NotReady vs log-and-skip.
- src/snapshot.rs — KernelSnapshot, KernelFdbEntry + KernelFdbFlags
  (extern_learn / permanent / noarp / master / self_flag —
  one-to-one with NTF_*/NUD_* kernel bits, hence the
  struct_excessive_bools allow), KernelLinkInfo + KernelVxlanInfo for
  the probe pass, InstanceProbes (Ready / NotReady / Unbound),
  OwnedSet (the (VNI, MAC) keys we have programmed) + OwnedEntry
  (last applied dst + mobility seq).
- src/diff.rs — pure compute_diff(desired, snapshot, last_applied,
  probes) -> Plan. Foreign-entry preservation is structural: pass 1
  iterates desired (creates/updates only where we own it or kernel
  has nothing), pass 2 iterates last_applied — never the kernel
  snapshot — for deletes, so kernel-learned local MACs and
  operator-static FDB entries are invisible to the delete pass.
- src/in_memory.rs — InMemoryDataplane fake with cloneable handle for
  test-side mutation (pre-load FDB, set probes, inject failures, push
  kernel events, observe apply count). Apply mutates the snapshot
  with extern_learn flags so subsequent dumps round-trip the fact
  that we own the entry.
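
The two-pass shape of the diff can be sketched roughly as follows. This is a simplified illustration only: it omits the probes argument, uses plain maps in place of KernelSnapshot and OwnedSet, and the names `Op` and `compute_diff_sketch` are hypothetical:

```rust
use std::collections::BTreeMap;
use std::net::IpAddr;

/// Hypothetical (VNI, MAC) key.
pub type Key = (u32, [u8; 6]);

#[derive(Debug, PartialEq, Eq)]
pub enum Op {
    Add(Key, IpAddr),
    Update(Key, IpAddr),
    Remove(Key),
}

/// Simplified diff: `kernel` is what a dump reports, `owned` is what
/// we programmed ourselves on earlier passes.
pub fn compute_diff_sketch(
    desired: &BTreeMap<Key, IpAddr>,
    kernel: &BTreeMap<Key, IpAddr>,
    owned: &BTreeMap<Key, IpAddr>,
) -> Vec<Op> {
    let mut plan = Vec::new();

    // Pass 1: walk desired state. Create where the kernel has nothing,
    // update only where we own the existing entry.
    for (key, dst) in desired {
        match (kernel.get(key), owned.contains_key(key)) {
            (None, _) => plan.push(Op::Add(*key, *dst)),
            (Some(cur), true) if cur != dst => plan.push(Op::Update(*key, *dst)),
            _ => {} // already correct, or a foreign entry we must not touch
        }
    }

    // Pass 2: walk the owned set, never the kernel snapshot, so foreign
    // entries cannot appear in a Remove op.
    for key in owned.keys() {
        if !desired.contains_key(key) {
            plan.push(Op::Remove(*key));
        }
    }
    plan
}
```

The point of the structure is that no branch ever reads a key out of `kernel` to build a Remove, so preservation of foreign entries does not rely on a runtime flag check.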

Workspace registers the new crate as a member with a path-only,
publish=false dep slot.

28 new tests (3 error, 6 snapshot, 5 in_memory, 11 diff incl. all
the cases ADR-0054 calls out + a key-grounding cross-check).
Workspace test count climbs from 1406 -> 1434, all green; clippy +
doc + fmt clean.

Phase 4's LinuxDataplane will plug into the same Dataplane trait
without the actor (Phase 3) needing to change.
Implements the level-triggered ReconcileActor<D: Dataplane> from
ADR-0054 §6/§7/§8. The actor's tokio::select! loop reacts to four
inputs:

- new DataplaneIntent on the watch::Receiver,
- KernelEvent from Dataplane::next_event(),
- 60s periodic full-dump timer (configurable),
- per-failed-op retry timer fed by RetrySchedule.

Plus a fifth: a CancellationToken triggers a bounded 5s drain that
deletes only owned remote-MAC FDB entries (foreign entries survive
because OwnedSet is the iteration source for both the diff loop's
delete pass and drain).

New module crates/evpn-linux/src/backoff.rs:
- RetrySchedule keyed by (EvpnInstanceId, MacAddress) so a single
  stuck op doesn't gate unrelated successful ones.
- Geometric growth from 100ms -> 5s cap with ±25% jitter via a small
  LCG (deterministic seed for reproducible tests). Stays in u64 ms
  end-to-end so no u128/i64 cast lints.
- record_failure / record_success / earliest_due / keys_due (the
  actor only uses earliest_due in production but keys_due aids
  diagnostics).
- 7 unit tests cover initial backoff, geometric doubling, cap, jitter
  band ±25%, success-clears-tracking, keys_due filtering, earliest_due.
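
For illustration, the growth-plus-jitter arithmetic could look something like the sketch below. The LCG constants, the function names, and the choice to apply jitter after the cap are assumptions made for the example, not a description of the actual backoff.rs:

```rust
/// Small linear congruential generator; seeded explicitly so tests
/// are reproducible. Constants are Knuth's MMIX parameters.
pub struct Lcg(u64);

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33 // the high bits are the better-distributed ones
    }
}

pub const BASE_MS: u64 = 100;
pub const CAP_MS: u64 = 5_000;

/// Geometric doubling from 100 ms, capped at 5 s. Everything stays in
/// u64 milliseconds.
pub fn base_delay_ms(attempt: u32) -> u64 {
    let shift = attempt.min(16); // bound the shift so it cannot overflow
    (BASE_MS << shift).min(CAP_MS)
}

/// Pick a delay uniformly in [0.75 * base, 1.25 * base].
pub fn jittered_delay_ms(attempt: u32, rng: &mut Lcg) -> u64 {
    let base = base_delay_ms(attempt);
    let band = base / 2; // total width of the ±25% band
    let offset = rng.next_u64() % (band + 1);
    base - base / 4 + offset
}
```

Two generators built from the same seed produce the same delay sequence, which is what lets a jitter test assert exact values rather than only a range.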

New module crates/evpn-linux/src/reconcile.rs:
- ReconcileActor<D> + ReconcileActorConfig (production / for_tests).
- ActorState carries OwnedSet, RetrySchedule, last_intent_generation,
  reconcile_generation, an Instant epoch for retry-schedule
  millisecond timestamps.
- coalesce_and_reconcile() applies the configured coalesce_window so
  fast intent supersession folds into one pass; the level-triggered
  reconcile pass itself reads borrow_and_update() so subsequent
  changed() fires only on actual new publishes.
- apply_plan() classifies failures via FailureClass and records into
  the retry schedule; on success it mirrors into OwnedSet so the next
  diff treats the entry as ours.
- emit_report() correlates intent_generation back to the daemon and
  carries reconcile_generation for telemetry.
- drain() runs inside tokio::time::timeout(drain_timeout, ...) for
  the bounded shutdown-drain ADR-0054 §7 specifies.

7 integration tests in crates/evpn-linux/tests/reconcile_actor.rs
exercise the full actor lifecycle against InMemoryDataplane:
- initial reconcile emits Apply for desired MAC
- fast intent supersession reconciles only the latest (validates
  watch::Receiver coalescing semantics, not mpsc-style queueing)
- failed apply retries on the backoff timer
- shutdown drain preserves a pre-loaded foreign static FDB entry
- periodic dump fires on the 60s cadence
- kernel event triggers immediate reconcile
- NotReady instance emits status row with no ops attempted

Crate adds tokio-util 0.7 for CancellationToken (no other workspace
crate uses it; the dep is local to crates/evpn-linux).

crates/evpn re-exports rustbgpd_wire::RouteDistinguisher so consumers
of the domain crate (the new actor tests, and Phase 5's projection
layer) don't need to take a direct rustbgpd-wire dep just to construct
an EvpnInstance.

Workspace test count climbs 1434 -> 1448 (14 new). Clippy + doc + fmt
clean across the workspace.
Lays down the cfg(target_os = "linux") module structure for the real
netlink integration without yet pulling in rtnetlink / netlink-packet-*
dependencies. The stub LinuxDataplane:

- opens no netlink socket (safe to instantiate without
  CAP_NET_ADMIN);
- reports Unbound for instances with bridge = None;
- reports NotReady with a "phase 4 stub: real netlink integration not
  yet wired" reason for instances that have a bridge name configured;
- returns Ok(KernelSnapshot::new()) from dump_snapshot;
- returns DataplaneError::Other(stub_reason) from apply;
- never produces a KernelEvent (next_event() returns std::future::pending,
  so the actor's tokio::select! ignores the branch and falls back to
  its periodic dump cadence + retry timer).

This shape exists so Phase 5 (daemon wiring) can compile end-to-end
against a real binary on Linux without the netlink work being a
blocker on its own merge. The reconcile actor's foreign-entry-
preservation, shutdown-drain, and report semantics are identical
against the stub as against a real kernel impl, so Phase 5's
binary-spawn integration test (Phase 6 extends it) covers the wiring
even when the stub is the actual dataplane in production builds.

What lands when the real netlink slice arrives (queued as the next
commit on this same feature branch):

- crates/evpn-linux/src/linux/links.rs — bridge + VXLAN inventory
- crates/evpn-linux/src/linux/fdb.rs — RTM_NEWNEIGH/RTM_DELNEIGH
  with NTF_EXT_LEARNED + NTF_MASTER (bridge/master path per
  ADR-0054 §5)
- crates/evpn-linux/src/linux/notify.rs — RTNLGRP_LINK / NEIGH /
  NOTIFY subscription with the ADR §6 startup buffering rule
- crates/evpn-linux/src/linux/probe.rs — VLAN-aware-bridge rejection
  + kernel-too-old (NTF_EXT_LEARNED EINVAL fallback) detection
- [target.'cfg(target_os = "linux")'.dependencies] = rtnetlink 0.21,
  netlink-packet-route 0.30, netlink-packet-utils 0.6, netlink-sys 0.8
- tests/netns_dataplane.rs gated on EVPN_LINUX_NETNS=1 (CAP_NET_ADMIN
  netns required; not a PR-CI gate)

4 new unit tests cover the stub's behavior end-to-end (Unbound vs
NotReady probe outcomes, empty dump, apply-always-errors), giving
the daemon wiring layer a fixed contract to integrate against.

Workspace tests 1448 -> 1452.
ADR-0054 §1 forbids crates/evpn-linux from depending on crates/rib,
so the daemon owns the conversion from RIB best-path EVPN routes
into the portable RemoteMacTable the dataplane consumes. This commit
ships the pure half of that conversion in crates/evpn so the
projection is testable without booting the daemon and without taking
a wire / RIB dep into evpn-linux.

New module crates/evpn/src/projection.rs:

- ProjectedEvpnRoute — small portable struct the daemon constructs
  from EvpnRibRoute at the call site (rd, mac, host_ip, label1,
  next_hop, mobility_sequence). Carries the wire-shaped
  RouteDistinguisher because it's already public on EvpnInstance and
  surfaces useful collision messages.
- project_evpn_routes(instances, IntoIterator<ProjectedEvpnRoute>)
  -> RemoteMacTable. Pure, deterministic.

Mobility tie-break (RFC 7432 §15):
1. Higher mobility_sequence wins (Some(N+1) > Some(N) > None).
2. On equal sequence, lower next_hop IP wins (arbitrary but
   deterministic — same inputs always pick the same winner).
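
The two rules above map onto a single comparator. A minimal sketch, with `Candidate`, `prefer`, and `pick_winner` as hypothetical names rather than the projection module's real API:

```rust
use std::cmp::Ordering;
use std::net::IpAddr;

/// Hypothetical per-route view carrying only the tie-break inputs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub mobility_sequence: Option<u32>,
    pub next_hop: IpAddr,
}

/// Returns Ordering::Greater when `a` is preferred over `b`.
pub fn prefer(a: &Candidate, b: &Candidate) -> Ordering {
    // Option<u32> already orders None below every Some(n), which gives
    // Some(N+1) > Some(N) > None for free.
    a.mobility_sequence
        .cmp(&b.mobility_sequence)
        // Equal sequence: the lower next hop wins, so compare reversed.
        .then_with(|| b.next_hop.cmp(&a.next_hop))
}

/// Pick the winning candidate independent of input order.
pub fn pick_winner(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates.iter().max_by(|a, b| prefer(a, b))
}
```

Since the comparator is a total order over (sequence, next hop), shuffling the input cannot change the winner.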

Routes whose label1's VNI doesn't match a local EvpnInstance are
dropped silently; they belong to other VTEPs' EVIs and the dataplane
has no business programming them.

VNI 0 from MplsLabel::as_vni is also dropped (RFC 8365 §5 reserved),
so a malformed wire-side label doesn't cause an unbuildable
EvpnInstanceId.

10 unit tests cover empty inputs, single-route round-trip, unknown
VNI / zero-label drop, all three tie-break legs (higher seq wins;
equal-seq breaks on lower next_hop; no-seq-at-all also breaks on
lower next_hop), distinct MACs in same VNI both land, same MAC in
different VNIs both land, and reorder-determinism (running the
projection twice with shuffled input produces equal output).

Workspace test count climbs 1452 -> 1462 (10 new). Clippy + doc
clean. Phase 5b (the daemon-side supervisor that wires this
projection into a watch::Sender<Arc<DataplaneIntent>>) lands on the
next commit on this branch.
…hase 5b)

Adds the daemon-side glue between the RIB's EVPN best-path table and
crates/evpn-linux's reconcile actor. ADR-0054 §1 forbids the dataplane
crate from depending on rustbgpd-rib; the daemon binary owns this
coordination layer.

New module src/evpn_dataplane.rs:

- spawn(config, &Arc<EvpnInstanceTable>, rib_tx, shutdown) returns
  Option<EvpnDataplaneHandle>. Empty [[evpn_instances]] -> None
  (RR-only deployments don't open netlink and don't spawn the actor —
  ADR-0054 §1 invariant).
- spawn_with_dataplane is generic over D: Dataplane + Send + Sync,
  so integration tests can inject InMemoryDataplane and the
  production path uses LinuxDataplane on Linux. Non-Linux platforms
  return None at the spawn level (logged once at startup).
- supervisor_loop polls the RIB every config.poll_interval (default
  5 s) via the existing RibUpdate::QueryEvpnRoutes channel, projects
  each EvpnRibRoute (filtering to Type 2 MacIp variants) into a
  ProjectedEvpnRoute via project_one(), runs project_evpn_routes() to
  build the RemoteMacTable, wraps it in a DataplaneIntent with a
  monotonic generation counter, and publishes via watch::Sender.
- project_one extracts the MAC mobility sequence from path-attribute
  ExtendedCommunities (RFC 7432 §15) using the existing
  ExtendedCommunity::as_mac_mobility helper. Absent extcomm = None,
  which the projection's tie-break treats as "older than any
  sequence".
- Status logger task drains DataplaneReports and logs failure
  summaries; Phase 6 will replace this with the gRPC status surface.
- EvpnDataplaneHandle::shutdown() cancels the token + bounded-awaits
  both spawned tasks.

Polling cadence note: the reconcile actor's 60 s periodic dump
backstop and the per-op 100ms->5s retry already handle kernel drift
at finer granularity than the supervisor's 5 s poll. When operator
demand pushes for sub-second MAC convergence, the path forward is a
tokio::sync::Notify pinged from the RIB's best-path apply path
(documented as Gate 7c).

src/main.rs wires:

- new mod evpn_dataplane declaration alongside the existing modules,
- a CancellationToken just before the gRPC server spawn,
- evpn_dataplane::spawn(...) right after the Arc<EvpnInstanceTable>
  is constructed (line 1018) and before it's moved into ServeConfig
  at line 1036. The handle is bound to _evpn_dataplane_handle so it
  lives for the daemon lifetime; dropping at exit cancels the
  reconcile actor's drain.

Cargo.toml gains rustbgpd-evpn-linux (workspace) and tokio-util 0.7
(direct dep with the rt feature for CancellationToken — only used
inside the daemon binary; the evpn-linux crate already had it).

5 new tests cover project_one's Type 2 vs non-Type-2 filtering,
absent-extcomm sequence handling, the RR-only spawn-returns-None
path, and an end-to-end supervisor_publishes_intent_built_from_rib_query
that wires a stub RIB responder to the supervisor + actor +
InMemoryDataplane and asserts the projected MAC lands in the fake
kernel snapshot.

Workspace test count climbs 1462 -> 1467 (5 new). Clippy + doc clean
across the whole workspace. Phase 6 (status surface + binary tests +
CHANGELOG) lands next.
CHANGELOG.md gains an [Unreleased] entry summarizing the six-commit
Gate 7b foundation: domain types, diff loop with foreign-entry
preservation, reconcile actor with backoff + drain, InMemoryDataplane
fake, cfg-gated LinuxDataplane stub, RIB-projection, and daemon
supervisor wiring. Test count called out (1406 -> 1467 -> 1468 over
the branch). Packaging notes cover the new crates/evpn-linux
workspace member and the tokio-util binary dep.

New tests/evpn_dataplane_rr_only.rs: ADR-0054 §1 invariant test.
Boots the real rustbgpd binary with NO [[evpn_instances]] and
asserts:

- the supervisor's "no EVPN instances configured — dataplane actor
  not spawned" log line lands in the daemon's structured-log output
  (tracing_subscriber::fmt() writes to stdout — the test captures
  both stdout and stderr because the early console banner goes to
  stderr while the JSON-formatted info log lands on stdout);
- the actor's "EVPN dataplane reconcile applied" / "EVPN dataplane
  apply failures" logs DO NOT appear (they're only emitted from the
  spawned report-logger task, which short-circuits when the
  supervisor returns None).

This closes the architectural invariant from ADR-0054 §1: route-
reflector deployments incur zero dataplane cost. The assertion is on
log content rather than a structured "is the actor spawned" probe
because gRPC status surfacing for the dataplane is queued under a
follow-up commit on the same branch.

ROADMAP.md gains a "Next Up — Pre-v1.0 Polish" entry tracking the
Gate 7b foundation as in-flight on feat/evpn-linux-dataplane, with
explicit framing of what's already there (the contract surface,
diff loop, actor, supervisor, RR-only short-circuit) versus what
the next commit on the branch ships (the rtnetlink/netlink-packet-route
LinuxDataplane and the privileged tests/netns_dataplane.rs gated by
EVPN_LINUX_NETNS=1).

Workspace test count climbs 1467 -> 1468. Clippy + doc + fmt clean
across the whole workspace.
Replaces the cfg(target_os = "linux") stub from `16d17c2` with a
working netlink integration backed by rtnetlink 0.14 +
netlink-packet-route 0.19. The level-triggered ReconcileActor and the
daemon supervisor are unchanged — the trait surface is fixed, so all
of phase 2/3/5's tests continue to pass against this real impl
without modification.

Three new submodules under crates/evpn-linux/src/linux/:

- links.rs — Walks LinkHandle::get once per dump, splits into
  bridges (with vlan_filtering state) and VXLAN ports (with vni,
  local IP, learning_disabled). A second pass stitches each VXLAN
  port onto its master bridge via the Controller (IFLA_MASTER)
  attribute. If two VXLAN ports race onto the same bridge the slot
  is cleared, surfacing the ambiguity as NotReady through probe.rs
  (ADR-0054 §4 requires "exactly one VXLAN port for the instance
  VNI"). Builds a LinkCache the dump_snapshot path persists onto
  LinuxDataplane so apply() can resolve bridge ifindex by VNI
  without a second netlink round-trip.

- fdb.rs — Bridge-family neighbour dump turns the kernel FDB into
  KernelFdbEntry rows keyed by (EvpnInstanceId, MacAddress). NTF flags
  are mapped onto KernelFdbFlags so the diff loop can distinguish
  rustbgpd-owned entries (NTF_EXT_LEARNED) from operator-static
  (NUD_PERMANENT/NUD_NOARP) and kernel-learned local entries
  (dynamic, no extern_learn). apply_op() turns DataplaneOp::AddRemoteFdb
  / UpdateRemoteFdb into NeighbourAddRequest::add_bridge() with
  NTF_EXT_LEARNED + NUD_NOARP and `.replace()` for idempotency;
  RemoveRemoteFdb constructs a NeighbourMessage and calls
  NeighbourHandle::del.

- probe.rs — Per-instance readiness check covering the ADR-0054 §4
  five-point list: bridge exists, exactly one VXLAN port, VNI matches,
  local_vtep_ip matches, learning disabled, and the bridge is NOT
  VLAN-aware. Failed checks produce operator-facing reason strings;
  bridge=None instances produce Unbound. 8 unit tests cover each
  rejection leg + the happy path + a multi-instance walk.

LinuxDataplane::connect() opens netlink and spawns the rtnetlink
connection driver task. Returns DataplaneError::Io if the socket
fails (no CAP_NET_ADMIN, AF_NETLINK unavailable). Daemon's spawn
path catches this and logs warn! rather than crashing — running
rustbgpd in an unprivileged container becomes a no-op for EVPN
instead of a fatal error.

next_event() is intentionally still pending(): RTNLGRP_NEIGH /
RTNLGRP_LINK subscription is queued as a follow-up. The level-
triggered design (60 s periodic dump + per-op retry) still repairs
kernel drift, so the gap costs convergence latency rather than
correctness.

Cargo.toml gains the cfg-gated target dep block:

    rtnetlink = "0.14"
    netlink-packet-route = "0.19"
    netlink-packet-core = "0.7"
    netlink-packet-utils = "0.5"
    netlink-sys = { version = "0.8", features = ["tokio_socket"] }
    futures = "0.3"

Pinned to rtnetlink 0.14 (paired with netlink-packet-route 0.19) —
newer 0.21+ releases changed the message-shape ABI and pulled in
async-std incompatibly with the workspace's tokio/tonic stack.

Workspace test count climbs 1468 -> 1473 (5 new probe.rs tests +
the existing connect-doesnt-panic smoke). All previous Phase 2/3/5
tests pass unchanged against the real impl on Linux. PR-CI doesn't
have CAP_NET_ADMIN so connect() may surface DataplaneError::Io;
the daemon warn-and-disable path covers that gracefully.
Adds crates/evpn-linux/tests/netns_dataplane.rs gated on
EVPN_LINUX_NETNS=1. The test:

1. Creates a Linux network namespace and a bridge + VXLAN port
   inside it (vni 2_000_100, local 127.0.0.10, nolearning).
2. Pre-loads a foreign static FDB entry the dataplane must preserve.
3. Re-execs itself inside the netns via `ip netns exec` so the
   inner test process opens netlink against the namespace's own
   FDB (rather than the host's).
4. Calls LinuxDataplane::apply with AddRemoteFdb, asserts the entry
   appears in `bridge fdb show` with extern_learn (or offload).
5. Calls RemoveRemoteFdb, asserts the entry is gone.
6. Back in the outer process, verifies the foreign static entry is
   still present — end-to-end validation of ADR-0054 §5/§7 foreign-
   entry preservation.

Skips cleanly when EVPN_LINUX_NETNS is unset (PR-CI default),
emitting a "skipping: set EVPN_LINUX_NETNS=1" notice. CI runners
without CAP_NET_ADMIN never attempt the privileged operations.

CHANGELOG and ROADMAP updated to reflect that the real netlink
integration has landed:

- CHANGELOG: replace the "phase 4 stub" bullet with the real
  rtnetlink integration description (links/fdb/probe submodules,
  the connect() warn-and-disable path, NTF_EXT_LEARNED + NUD_NOARP
  programming with .replace() idempotency); add a netns-test
  bullet under Tests; update the test count to 50 new / 1474
  workspace; expand the Packaging block with the cfg-gated dep
  versions and the rationale for pinning rtnetlink 0.14 (newer
  releases pull async-std incompatibly with tokio/tonic).
- ROADMAP: rewrite the in-flight entry to remove "stub" framing
  and call out the remaining follow-up (RTNLGRP_NEIGH/LINK
  notification subscription — level-triggered design tolerates
  the gap via the 60 s periodic dump).

Workspace test count climbs 1473 -> 1474 (1 new netns gate test).
Clippy + doc + fmt all clean.
Six review findings, all addressed in one commit on the branch.

== Blocker 1: FDB targeted bridge ifindex with wrong flags ==

Linux EVPN remote MACs program with the
`bridge fdb add MAC dev vxlanX master dst REMOTE` shape — the netlink
message's ifindex is the *VXLAN port*, not the bridge, and the entry
carries NTF_MASTER. The previous impl targeted the bridge ifindex
with only NTF_EXT_LEARNED, which (a) wouldn't reach the right device
on the wire and (b) wouldn't activate the bridge/master path
switchdev offload requires.

Fix: KernelVxlanInfo gains an `ifindex` field; LinkCache replaces
`bridge_ifindex_to_name` with `vxlan_ifindex_to_vni` (FDB messages
arrive keyed by VXLAN ifindex, not bridge); apply_op() resolves VNI
to VXLAN ifindex via the cache, calls add_bridge() with that ifindex,
and includes both NeighbourFlag::Controller (NTF_MASTER) and
NeighbourFlag::ExtLearned. Delete path constructs a NeighbourMessage
on the VXLAN port with the same Controller flag. Parse path now
keys on header.ifindex via the new map and reads the Controller
flag explicitly into KernelFdbFlags.master rather than inferring it
from the absence of NTF_SELF.

== Blocker 2: shutdown drain never wired ==

`_evpn_dataplane_handle = evpn_dataplane::spawn(...)` was held with
a leading underscore and never used, leaving `EvpnDataplaneHandle::
shutdown()` as dead code. Dropping a JoinHandle detaches; dropping
the CancellationToken doesn't cancel anything that already cloned
it. The actor's drain path was unreachable.

Fix: drop the leading underscore on `evpn_dataplane_handle`, and
add a `2.5` step in the daemon's coordinated shutdown block that
calls `handle.shutdown().await` after PeerManager and before BMP.
Drain runs the actor's bounded 5s remote-MAC delete pass (foreign
entries still survive structurally), with a 10s outer timeout so a
stuck task can't wedge the daemon's exit. The dead_code allow on
EvpnDataplaneHandle::shutdown is removed; the doc comment is
rewritten to clarify that *Drop* doesn't run async drain — the
caller must call shutdown().await explicitly.

== Should-fix 3: retry/backoff recorded but not enforced ==

Previous reconcile_once() ran the full plan unconditionally on
every wake, so any watch update / kernel event / periodic dump
bypassed the per-op backoff and re-attempted permanently-failed ops
at maximum frequency. RetrySchedule::record_failure was effectively
just telemetry.

Fix: apply_plan() filters every op through:

1. `permanent_failures` set — keys that hit FailureClass::Permanent
   are suppressed entirely until the next intent generation
   (intent.generation != permanent_anchor_generation clears them
   so an operator's fix actually retries).
2. `RetrySchedule::next_due_for(vni, mac)` — transient failures
   defer until the per-op backoff deadline elapses. The actor's
   outer tokio::select! re-fires on the retry timer when the
   earliest-due deadline arrives, so a deferred op runs as soon as
   it's ready instead of waiting for the 60s periodic dump.
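
The two filters can be pictured as a small gate consulted before each op. This is a sketch under assumptions: `ApplyGate` and its methods are hypothetical names, time is passed in as plain milliseconds, and only the field names `permanent_failures` and `permanent_anchor_generation` come from the description above:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical (VNI, MAC) key.
pub type Key = (u32, [u8; 6]);

#[derive(Default)]
pub struct ApplyGate {
    permanent_failures: BTreeSet<Key>,
    permanent_anchor_generation: u64,
    next_due_ms: BTreeMap<Key, u64>,
}

impl ApplyGate {
    /// A new intent generation clears permanent suppression so an
    /// operator's fix gets retried.
    pub fn observe_generation(&mut self, generation: u64) {
        if generation != self.permanent_anchor_generation {
            self.permanent_failures.clear();
            self.permanent_anchor_generation = generation;
        }
    }

    pub fn record_permanent(&mut self, key: Key) {
        self.permanent_failures.insert(key);
    }

    pub fn record_transient(&mut self, key: Key, due_ms: u64) {
        self.next_due_ms.insert(key, due_ms);
    }

    pub fn record_success(&mut self, key: Key) {
        self.next_due_ms.remove(&key);
    }

    /// Filter 1: permanently failed keys are skipped outright.
    /// Filter 2: transiently failed keys wait for their deadline.
    pub fn may_attempt(&self, key: &Key, now_ms: u64) -> bool {
        if self.permanent_failures.contains(key) {
            return false;
        }
        self.next_due_ms.get(key).map_or(true, |due| now_ms >= *due)
    }
}
```

With a gate like this in front of every op, a periodic dump or kernel event can still trigger a reconcile pass without re-attempting ops that are suppressed or not yet due.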

New backoff::RetrySchedule::next_due_for accessor returns the per-
key deadline. New reconcile_actor.rs test
`permanent_failure_is_suppressed_until_next_intent_generation`
locks the suppression contract: inject KernelTooOld on op N,
verify apply_count stops growing across periodic-dump cycles,
verify a fresh intent generation re-runs it.

== Should-fix 4: classify_apply_error lumped everything Transient ==

The previous classifier put rtnetlink::Error::NetlinkError into
DataplaneError::Other, which classifies as Transient — so an EPERM
failure from a missing CAP_NET_ADMIN would retry forever instead of
the warn-and-disable behavior the connect() docstring promised.

Fix: classify_apply_error now string-matches errno mnemonics in the
rendered error and maps EPERM/EACCES + EOPNOTSUPP to KernelTooOld
(Permanent class) and EINVAL to InvalidArgument (also Permanent).
Anything else stays in Other (Transient). String-matching the
Display impl is conservative because rtnetlink 0.14's
ErrorMessage::raw_code() isn't part of the public API at this
version.
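
A rough sketch of the mnemonic-matching classifier follows. It operates on an already-rendered string, so it makes no claim about how rtnetlink formats its errors; `ApplyError`, `classify_rendered`, and the two-variant `FailureClass` are simplified stand-ins for the real types:

```rust
/// Simplified stand-in for the crate's error enum.
#[derive(Debug, PartialEq, Eq)]
pub enum ApplyError {
    KernelTooOld,
    InvalidArgument,
    Other(String),
}

/// Simplified stand-in; the Conflict class is omitted here.
#[derive(Debug, PartialEq, Eq)]
pub enum FailureClass {
    Transient,
    Permanent,
}

/// Classify by looking for errno mnemonics in the rendered message.
pub fn classify_rendered(rendered: &str) -> ApplyError {
    const TOO_OLD: [&str; 3] = ["EPERM", "EACCES", "EOPNOTSUPP"];
    if TOO_OLD.iter().any(|m| rendered.contains(*m)) {
        ApplyError::KernelTooOld
    } else if rendered.contains("EINVAL") {
        ApplyError::InvalidArgument
    } else {
        // Unknown errors stay retryable.
        ApplyError::Other(rendered.to_owned())
    }
}

impl ApplyError {
    pub fn class(&self) -> FailureClass {
        match self {
            ApplyError::KernelTooOld | ApplyError::InvalidArgument => FailureClass::Permanent,
            ApplyError::Other(_) => FailureClass::Transient,
        }
    }
}
```

Defaulting unknown messages to Transient keeps the failure mode safe: a misclassified error is retried with backoff rather than silently suppressed.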

== Should-fix 5: netns test bypassed link-cache priming ==

The previous netns test called `dp.apply()` directly on a fresh
LinuxDataplane, which would hit `LinkNotFound` because the link
cache is empty until probe() or dump_snapshot() runs.

Fix: build an EvpnInstanceTable matching the netns topology, call
`dp.probe(&table).await` first (which populates the cache), assert
the instance reports `Ready`, *then* run apply(). Now the test
exercises the same precondition path the real reconcile actor
takes; if either the probe or the FDB program path regresses, the
test fails on a privileged runner.

== Should-fix 6: VXLAN ambiguity counter + learning fail-open ==

Two edge cases in links.rs:

1. The "two VXLAN ports" code toggled `Some -> None -> Some` so
   three attaches reset to the first port's info. Fix: track an
   explicit `vxlan_attach_count` per bridge; once the count exceeds
   1 the slot is cleared and never re-set, and probe() reports
   NotReady citing the count.
2. `learning_disabled: bool` defaulted to `true` so a kernel that
   omitted IFLA_VXLAN_LEARNING quietly passed the readiness check.
   Fix: change to `Option<bool>`; probe() fails closed on `None`
   with "VXLAN port did not report IFLA_VXLAN_LEARNING".

Two new probe.rs tests:
- not_ready_when_learning_attribute_missing
- not_ready_when_two_vxlan_ports_attached
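
Both fixes can be illustrated with a small per-bridge slot. This is a sketch only: `VxlanInfo`, `BridgeSlot`, and the reason strings other than the IFLA_VXLAN_LEARNING one are hypothetical, and the real probe checks more than is shown here:

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VxlanInfo {
    pub vni: u32,
    /// None means the kernel did not report the attribute at all.
    pub learning_disabled: Option<bool>,
}

#[derive(Debug, Default)]
pub struct BridgeSlot {
    pub vxlan: Option<VxlanInfo>,
    pub vxlan_attach_count: u32,
}

impl BridgeSlot {
    /// Count every attach. Only the first one fills the slot; any
    /// later attach clears it and it is never re-set.
    pub fn attach(&mut self, port: VxlanInfo) {
        self.vxlan_attach_count += 1;
        self.vxlan = if self.vxlan_attach_count == 1 {
            Some(port)
        } else {
            None
        };
    }

    /// Fail closed: ambiguity and a missing learning attribute are
    /// both reported as not ready.
    pub fn readiness(&self) -> Result<(), String> {
        if self.vxlan_attach_count > 1 {
            return Err(format!(
                "{} VXLAN ports attached; expected exactly one",
                self.vxlan_attach_count
            ));
        }
        let port = self
            .vxlan
            .as_ref()
            .ok_or_else(|| "no VXLAN port attached".to_string())?;
        match port.learning_disabled {
            Some(true) => Ok(()),
            Some(false) => Err("VXLAN port has learning enabled".to_string()),
            None => Err("VXLAN port did not report IFLA_VXLAN_LEARNING".to_string()),
        }
    }
}
```

Using a counter instead of toggling the Option means a third attach cannot accidentally restore the slot, which was the bug in the original toggle.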

== Open question 7: self-originated Type 2 not filtered ==

projection.rs filtered routes by VNI but not by next-hop equal to
the local VTEP. A locally-originated or controller-injected Type 2
route in the RIB would project as a remote FDB entry pointing back
at our own VTEP, creating a black hole.

Fix: project_evpn_routes skips routes whose next_hop == the matched
EvpnInstance's local_vtep_ip. Two new tests:
- self_originated_route_is_dropped
- self_filter_does_not_affect_other_vnis_with_same_vtep
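
A rough sketch of the self-origination filter, under simplified
assumptions: `Instance`, `Type2Route`, and `project` are stand-ins for
the real types and for project_evpn_routes, and the mobility
tie-break is left out (a later route for the same key simply
overwrites an earlier one here).

```rust
use std::collections::BTreeMap;
use std::net::IpAddr;

/// Simplified stand-in for an EVPN instance (hypothetical shape).
struct Instance {
    vni: u32,
    local_vtep_ip: IpAddr,
}

/// Simplified stand-in for a Type 2 route from the RIB (hypothetical shape).
struct Type2Route {
    vni: u32,
    mac: [u8; 6],
    next_hop: IpAddr,
}

/// Project routes into a (VNI, MAC) -> remote VTEP table.
fn project(instances: &[Instance], routes: &[Type2Route]) -> BTreeMap<(u32, [u8; 6]), IpAddr> {
    let mut table = BTreeMap::new();
    for route in routes {
        // Routes for a VNI with no configured instance are ignored.
        let inst = match instances.iter().find(|i| i.vni == route.vni) {
            Some(inst) => inst,
            None => continue,
        };
        // Self-originated: the next hop is this instance's own VTEP.
        // Programming it would point the FDB entry back at ourselves.
        if route.next_hop == inst.local_vtep_ip {
            continue;
        }
        table.insert((route.vni, route.mac), route.next_hop);
    }
    table
}
```

The comparison is against the matched instance's VTEP only, so a
route in another VNI whose instance uses a different local VTEP is
unaffected even if its next hop equals this one's address.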

== Test count + gates ==

Workspace test count climbs 1474 -> 1479 (5 new). Existing tests
all pass unchanged. Clippy + doc + fmt clean. CHANGELOG and the
Gate 7b PR #34 review punch list updated.
…ssifier)

Three more findings from the second review round, all addressed.

== Should-fix 1: permanent-failure suppression defeated by every poll ==

The reconcile actor cleared `permanent_failures` whenever the intent
generation changed, but the daemon supervisor previously bumped the
generation on every 5 s poll regardless of whether the projected
RemoteMacTable was actually different. Net effect: an EPERM /
EOPNOTSUPP / EINVAL on op N stayed suppressed for ~5 s, then the
next poll cleared the set and the op got re-applied at full
frequency. The whole permanent-suppression contract was undermined
by the supervisor's polling cadence.

Fix: supervisor_loop now caches the last published RemoteMacTable
and skips the `intent_tx.send` when the new projection equals the
last one. Generation only advances on semantic change. The
EvpnInstanceTable is pinned at startup (ADR-0052) so equality on
RemoteMacTable alone is sufficient for now; comment notes the path
for extending if instances ever become mutable.

New test:
src/evpn_dataplane.rs::supervisor_does_not_bump_generation_on_stable_table
spawns supervisor_loop directly, points it at a stub RIB that
returns the same routes every call, and asserts the watch fires
`changed()` at most twice (cold-start gen=0 + first publish
gen=1).
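
The publish-on-change rule can be sketched without tokio. This is an
illustrative model only: `Publisher` and `offer` are hypothetical
names, and the real supervisor_loop sends over a watch channel
rather than returning the generation.

```rust
/// Hypothetical dedup wrapper: remembers the last published value and
/// advances the generation only when the projection actually changes.
struct Publisher<T> {
    last: Option<T>,
    generation: u64,
}

impl<T: PartialEq + Clone> Publisher<T> {
    fn new() -> Self {
        Publisher { last: None, generation: 0 }
    }

    /// Returns `Some(new_generation)` if the value was published,
    /// or `None` if it equals the last published value and was skipped.
    fn offer(&mut self, projected: &T) -> Option<u64> {
        if self.last.as_ref() == Some(projected) {
            return None; // stable table: no send, no generation bump
        }
        self.last = Some(projected.clone());
        self.generation += 1;
        Some(self.generation)
    }
}
```

With this shape a poll that projects an identical table is a no-op
for every consumer of the generation counter, which is what keeps
the actor's suppression state from being reset on each poll.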

== Should-fix 2: classifier string-matched Debug, mislabeled EPERM ==

classify_apply_error did `format!("{err:?}").contains("EPERM")` —
operator-visible message was "kernel too old" for missing
CAP_NET_ADMIN (the wrong root cause), and the implementation
depended on the rtnetlink ErrorMessage Debug rendering staying
stable across versions.

Fix: read `ErrorMessage::raw_code()`, take the `unsigned_abs()` to
get the positive errno, dispatch on `libc::EPERM` / `EACCES` /
`EINVAL` / `EOPNOTSUPP`. EPERM and EACCES now map to a new typed
`DataplaneError::PermissionDenied(detail)` variant — its Display is
"permission denied: <kernel msg> (CAP_NET_ADMIN missing or LSM-
blocked)". `class()` returns `Permanent` for the new variant.

Refactored into a pure `errno_to_dataplane_error(errno, detail) ->
DataplaneError` helper so unit tests can exercise the per-errno
mapping without forging an `ErrorMessage` (`#[non_exhaustive]`, no
public constructor). Five new tests in `linux/fdb.rs::tests` cover
EPERM, EACCES, EINVAL, EOPNOTSUPP, and an unknown-errno-stays-
transient case.

Cargo.toml gains the existing workspace `libc` dep on the
`cfg(target_os = "linux")` target — already pulled in transitively
via tokio/socket2, this just names it explicitly for the errno
constants.
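
For reference, the per-errno mapping described above looks roughly
like this. It is a sketch under assumptions: the errno constants are
written out as local Linux values so the example has no dependencies
(the real code uses the `libc` constants), the `String` payloads on
the variants are guessed, and `classify_raw_code` is a hypothetical
wrapper standing in for the part that reads `raw_code()`.

```rust
// Linux errno values, inlined for the sketch (the real code uses libc::*).
const EPERM: i32 = 1;
const EACCES: i32 = 13;
const EINVAL: i32 = 22;
const EOPNOTSUPP: i32 = 95;

#[derive(Debug, PartialEq, Eq)]
enum ErrorClass {
    Transient,
    Permanent,
}

#[derive(Debug, PartialEq, Eq)]
enum DataplaneError {
    PermissionDenied(String),
    KernelTooOld(String),
    InvalidArgument(String),
    Other(String),
}

impl DataplaneError {
    /// Only the catch-all variant is retried.
    fn class(&self) -> ErrorClass {
        match self {
            DataplaneError::Other(_) => ErrorClass::Transient,
            _ => ErrorClass::Permanent,
        }
    }
}

/// Pure mapping from a positive errno to a typed error.
fn errno_to_dataplane_error(errno: i32, detail: String) -> DataplaneError {
    match errno {
        EPERM | EACCES => DataplaneError::PermissionDenied(detail),
        EOPNOTSUPP => DataplaneError::KernelTooOld(detail),
        EINVAL => DataplaneError::InvalidArgument(detail),
        _ => DataplaneError::Other(detail),
    }
}

/// Netlink reports errors as negative codes; normalize before dispatch.
fn classify_raw_code(raw_code: i32, detail: String) -> DataplaneError {
    errno_to_dataplane_error(raw_code.unsigned_abs() as i32, detail)
}
```

Keeping the mapping free of any netlink type is what lets the unit
tests cover each errno without constructing an `ErrorMessage`.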

== Nit 3: stale doc comment about master ifindex ==

snapshot.rs's `KernelSnapshot` doc still said "Phase 4 derives the
VNI from the FDB entry's `master` ifindex". The implementation has
since switched to using the VXLAN port ifindex (per ADR-0054 §4 and
the bridge-FDB-on-VXLAN-device kernel convention). Doc updated to
match: VNI now comes from `header.ifindex` via the link cache's
`vxlan_ifindex_to_vni` table.

== Test count + gates ==

Workspace test count climbs 1479 -> 1486 (7 new: 5 errno mapping,
1 supervisor stability, 1 PermissionDenied-class smoke). Existing
tests all pass unchanged. Clippy + doc + fmt clean. CHANGELOG
updated.
lance0 marked this pull request as ready for review May 5, 2026 00:18
lance0 added 2 commits May 4, 2026 20:36
…2 nits

Three more findings from the review.

== Should-fix: per-op-fingerprint permanent-failure suppression ==

Previously: when an intent generation changed, the actor cleared
permanent_failures for ALL keys at once. So a single unrelated
RemoteMacTable change (e.g., the operator added MAC X) would clear
suppression for an unrelated MAC Y that hit PermissionDenied or
EOPNOTSUPP, causing the actor to retry the impossible op against
the kernel again.

Fix: change permanent_failures from BTreeSet<(VNI, MAC)> to
BTreeMap<(VNI, MAC), DataplaneOp>. The value is the exact failed
op shape (Add/Update/Remove + dst). On every reconcile pass:

- If the current op for (VNI, MAC) equals the recorded op shape,
  suppress (the intent change did not touch this op, so a retry
  would fail the same way).
- If the op shape differs (mobility move → different dst, or the
  desired-state transitioned add↔remove), drop the suppression
  inline and try the new shape.

Drop the permanent_anchor_generation field — generation-wide
clearing is gone; per-op-fingerprint clears lazily and locally,
without touching other keys. Cross-key isolation is now structural.

Also dropped the `clear permanent_failures on intent.generation
change` block from reconcile_once — that's exactly the
generation-wide behavior the reviewer flagged.

Two new tests lock the contract:

- permanent_failure_suppression_is_per_op_fingerprint: same op
  shape across generations stays suppressed; different op shape
  clears suppression and runs.
- permanent_failure_does_not_leak_across_keys: a permanent-fail on
  (VNI 100, MAC 1) does not block (VNI 100, MAC 2) from being
  applied successfully in the same plan.

The previous test
permanent_failure_is_suppressed_until_next_intent_generation is
replaced — its premise (generation-wide clear is the right model)
no longer holds.
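
The per-op-fingerprint check can be sketched as below. The names are
illustrative: `Key`, this simplified `DataplaneOp`, and
`is_suppressed` are assumptions for the example, and the real op type
may carry more fields than a destination.

```rust
use std::collections::BTreeMap;
use std::net::IpAddr;

/// (VNI, MAC) key, as in the commit message.
type Key = (u32, [u8; 6]);

/// Simplified op shape (hypothetical): kind plus destination VTEP.
#[derive(Debug, Clone, PartialEq, Eq)]
enum DataplaneOp {
    Add { dst: IpAddr },
    Update { dst: IpAddr },
    Remove,
}

/// Returns true if `op` must be skipped because the identical op
/// already failed permanently for this key. A different op shape
/// drops the stale record for this key only and lets the op run.
fn is_suppressed(
    permanent_failures: &mut BTreeMap<Key, DataplaneOp>,
    key: Key,
    op: &DataplaneOp,
) -> bool {
    if permanent_failures.get(&key) == Some(op) {
        return true;
    }
    // Either no record exists or the shape changed; clearing is local
    // to this key, so other keys keep their suppression state.
    permanent_failures.remove(&key);
    false
}
```

Because the lookup is keyed and the comparison is on the op itself,
nothing in this path depends on the intent generation.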

== Nit: stale CHANGELOG line about EPERM mapping ==

CHANGELOG.md§55 still said "EPERM/EACCES and EOPNOTSUPP map to
KernelTooOld" alongside the new (correct) PermissionDenied text
elsewhere. Rewrote the bullet to match: EPERM/EACCES →
PermissionDenied, EOPNOTSUPP → KernelTooOld, EINVAL →
InvalidArgument. Also added the operator-facing-message rationale.

== Nit: stale doc comment about Phase 4 stub ==

src/evpn_dataplane.rs::spawn doc still said the Linux dataplane
"is currently the Phase 4 stub pending the netlink integration".
Updated to "rtnetlink-backed FDB program/withdraw against the
bridge/master path".

== Test count + gates ==

Workspace 1486 → 1487 (added permanent_failure_does_not_leak_across_keys;
permanent_failure_suppression_is_per_op_fingerprint replaces the
old generation-clearing test). All green. Clippy + doc + fmt clean.
Real-VTEP smoke against a Linux kernel via containerlab caught a
correctness gap that no unit test had: apply_op was programming
only the bridge-master FDB row, not the NTF_SELF+dst VXLAN-encap
row. Result: control plane looked fine (MAC with extern_learn
appeared in bridge fdb show via the master path) but the data
plane couldn't actually encap to the remote VTEP because vxlan100
had no dst entry for the MAC.

== The wire shape ==

`strace` on iproute2's `bridge fdb add MAC dev vxlanX master dst
REMOTE self extern_learn` shows ONE RTM_NEWNEIGH carrying:

  ndm_state = NUD_NOARP | NUD_PERMANENT (0x40 | 0x80 = 0xC0)
  ndm_flags = NTF_SELF | NTF_MASTER | NTF_EXT_LEARNED
  NDA_LLADDR = MAC
  NDA_DST    = REMOTE

The kernel programs both rows from that single message: the
NTF_SELF + NDA_DST anchors the VXLAN-encap entry on vxlanX (so
the VXLAN driver knows where to tunnel for this MAC), and the
NTF_MASTER plumbs the bridge-FDB entry on br100 (so the bridge
knows the MAC is reachable via vxlanX).

Splitting into two separate calls (which I tried first) returns
EINVAL on the master leg — the kernel expects the combined form
for a remote-VTEP entry.

== The fix ==

apply_op for AddRemoteFdb / UpdateRemoteFdb now sends one message
with all three NTF flags + NDA_DST + ndm_state =
NeighbourState::Other(0x40 | 0x80). RemoveRemoteFdb sends one
RTM_DELNEIGH with NTF_SELF | NTF_MASTER; the kernel cleans up
both rows from that one message.

The crate's NeighbourState enum doesn't represent the combined
NUD_NOARP | NUD_PERMANENT bitmask — it has separate Noarp and
Permanent variants. Used the `Other(u16)` escape hatch with the
explicit constant `0x40 | 0x80 = 0xC0`.
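
As a quick check of the bit arithmetic, the header fields can be
written out with the values from linux/neighbour.h. The helper
function names are hypothetical, and this does not touch the netlink
crate's message types; it only shows the raw numbers that end up in
ndm_state and ndm_flags.

```rust
// Neighbour state bits (ndm_state is a u16 in struct ndmsg).
const NUD_NOARP: u16 = 0x40;
const NUD_PERMANENT: u16 = 0x80;

// Neighbour flag bits (ndm_flags is a u8 in struct ndmsg).
const NTF_SELF: u8 = 0x02;
const NTF_MASTER: u8 = 0x04;
const NTF_EXT_LEARNED: u8 = 0x10;

/// State carried on the RTM_NEWNEIGH for a remote-VTEP entry.
const fn remote_fdb_state() -> u16 {
    NUD_NOARP | NUD_PERMANENT
}

/// Flags for programming: both rows, marked externally learned.
const fn program_flags() -> u8 {
    NTF_SELF | NTF_MASTER | NTF_EXT_LEARNED
}

/// Flags for the RTM_DELNEIGH that removes both rows.
const fn withdraw_flags() -> u8 {
    NTF_SELF | NTF_MASTER
}
```

The state value is the one that would be passed through the
`Other(u16)` escape hatch mentioned above.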

== M36 smoke ==

New tests/interop/m36-evpn-vtep-smoke.clab.yml + script:

- Topology: rustbgpd VTEP (10.0.0.1, AS 65000) ↔ FRR originator
  (10.0.0.2, AS 65000) over iBGP L2VPN/EVPN.
- start-rustbgpd-vtep.sh pre-creates br100 / vxlan100 (nolearning,
  local 10.0.0.1) inside the container netns and pre-loads a
  foreign static FDB entry 02:99:99:99:99:99 → 10.0.0.99.
- start-frr-vtep.sh (existing) sets up FRR's bridge+vxlan+dummy
  topology so MAC injection on dummy100 triggers Type 2
  origination.
- Test asserts: session Established, MAC programmed with
  extern_learn AND correct dst=10.0.0.2, withdraw cleans up,
  foreign static survives both cycles.

6/6 PASS locally (Linux 6.17, kernel VXLAN, no privileged-CI
runner needed — runs on any Docker host with containerlab).

== Netns test ==

Updated crates/evpn-linux/tests/netns_dataplane.rs to also assert
the dst column appears in the program-cycle FDB dump (previously
only checked extern_learn).

== Test count + gates ==

Workspace 1487 tests still pass. Clippy + doc + fmt clean.
The new netns assertion will fail loudly if the wire shape ever
regresses on the privileged path.

Copilot AI left a comment

Pull request overview

Adds the EVPN Linux dataplane reconciliation foundation: new EVPN domain types, a Linux-specific reconciler crate, daemon-side supervision that projects RIB EVPN routes into dataplane intent, and smoke/integration coverage around the new VTEP workflow.

Changes:

  • Introduces crates/evpn-linux with diffing, reconciliation, backoff, snapshot, in-memory fake, and Linux netlink dataplane plumbing.
  • Extends crates/evpn and the daemon to publish DataplaneIntent snapshots from EVPN RIB state and to manage reconciler lifecycle/shutdown.
  • Adds RR-only, netns, reconcile-actor, and interop/containerlab coverage plus changelog/roadmap/workspace updates.

Reviewed changes

Copilot reviewed 31 out of 32 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/interop/scripts/test-m36-evpn-vtep-smoke.sh | Adds end-to-end smoke script for FRR ↔ rustbgpd VTEP FDB programming. |
| tests/interop/scripts/start-rustbgpd-vtep.sh | Prepares bridge/VXLAN topology and launches rustbgpd in the VTEP container. |
| tests/interop/m36-evpn-vtep-smoke.clab.yml | Defines the containerlab topology for the VTEP smoke scenario. |
| tests/interop/configs/rustbgpd-m36-vtep.toml | Adds rustbgpd EVPN VTEP config for the smoke topology. |
| tests/interop/configs/frr-bgpd-m36-originator.conf | Adds FRR EVPN originator config for the smoke topology. |
| tests/evpn_dataplane_rr_only.rs | Verifies RR-only deployments do not spawn the dataplane actor. |
| src/main.rs | Wires daemon startup/shutdown to the EVPN dataplane supervisor. |
| src/evpn_dataplane.rs | Implements daemon-side RIB polling, projection, watch publication, and actor spawning. |
| ROADMAP.md | Tracks Gate 7b EVPN dataplane work in roadmap status. |
| crates/evpn/src/projection.rs | Adds pure projection from EVPN RIB routes into remote-MAC intent. |
| crates/evpn/src/mac.rs | Adds remote/local MAC domain types and table builder. |
| crates/evpn/src/lib.rs | Exports new EVPN dataplane-related domain modules and types. |
| crates/evpn/src/instance.rs | Makes EvpnInstanceTable comparable for intent equality/dedup. |
| crates/evpn/src/dataplane.rs | Adds dataplane intent/report/status operation types. |
| crates/evpn-linux/tests/reconcile_actor.rs | Adds end-to-end actor tests using the in-memory dataplane. |
| crates/evpn-linux/tests/netns_dataplane.rs | Adds privileged Linux netns integration coverage for real FDB programming. |
| crates/evpn-linux/src/snapshot.rs | Defines kernel snapshot, probe, and owned-entry state models. |
| crates/evpn-linux/src/reconcile.rs | Implements the level-triggered reconcile actor and shutdown drain. |
| crates/evpn-linux/src/linux/probe.rs | Adds Linux readiness probing for bridge/VXLAN topology. |
| crates/evpn-linux/src/linux/mod.rs | Adds Linux dataplane integration over rtnetlink. |
| crates/evpn-linux/src/linux/links.rs | Builds bridge/VXLAN inventory from netlink link dumps. |
| crates/evpn-linux/src/linux/fdb.rs | Implements netlink FDB dump/apply/remove and errno classification. |
| crates/evpn-linux/src/lib.rs | Exposes the EVPN Linux dataplane crate surface. |
| crates/evpn-linux/src/in_memory.rs | Adds the in-memory dataplane fake used by tests. |
| crates/evpn-linux/src/error.rs | Defines dataplane error taxonomy and retry classes. |
| crates/evpn-linux/src/diff.rs | Implements pure desired-vs-kernel diff planning. |
| crates/evpn-linux/src/dataplane.rs | Defines the abstract dataplane trait and operation/event types. |
| crates/evpn-linux/src/backoff.rs | Adds per-op exponential backoff scheduling. |
| crates/evpn-linux/Cargo.toml | Adds the new workspace crate and Linux-only deps. |
| CHANGELOG.md | Documents the Gate 7b dataplane foundation and tests. |
| Cargo.toml | Registers the new crate and daemon dependencies. |
| Cargo.lock | Locks newly added EVPN Linux/netlink dependencies. |

Comment on lines +127 to +134
let still_in_kernel = snapshot.find_fdb(vni, mac).is_some();

// Withdraw if (no longer desired OR instance went NotReady) AND
// the kernel still has the entry. If the kernel already
// dropped it (interface flap, manual `bridge fdb del`), we
// emit no op now — the actor will reconcile its OwnedSet on
// the next successful pass instead.
let should_remove = (!still_desired || !instance_ready) && still_in_kernel;
Comment on lines +107 to +113
for inst in instances.iter() {
    probes.insert(
        inst.id,
        crate::snapshot::InstanceProbe::NotReady {
            reason: format!("kernel link dump failed: {e}"),
        },
    );
Comment on lines +151 to +162
ns.exec(
    "bridge",
    &[
        "fdb",
        "add",
        &foreign_mac,
        "dev",
        vxlan,
        "dst",
        "127.0.0.99",
        "permanent",
    ],

tracing::warn!(
    ?op,
    error = %err,
    "dataplane op failed permanently; suppressed until next intent generation"
Comment on lines +56 to +59
rb_fdb_has_extern_learn() {
    local mac=${1:?}
    rb_fdb | grep -iF "$mac" | grep -qE 'extern_learn|offload'
}