From 5f672cafc4700a62d29c813a3827daa1595ef5bc Mon Sep 17 00:00:00 2001 From: Chloe Crozier Date: Fri, 29 May 2026 15:35:56 -0700 Subject: [PATCH 1/4] #69 - address first round of Slack feedback Signed-off-by: Chloe Crozier --- AGENTS.md | 9 +- docs/api-reference/configuration.md | 27 +- docs/api-reference/cpp.md | 16 +- docs/api-reference/index.md | 11 +- docs/concepts.md | 263 ++++++++++++-------- docs/getting-started.md | 48 ++-- docs/index.html | 65 ++--- docs/stylesheets/extra.css | 19 ++ docs/tutorials/benchmarking_examples.md | 4 +- docs/tutorials/configuration-walkthrough.md | 36 +-- docs/tutorials/system_configuration.md | 16 +- mkdocs.yml | 2 +- 12 files changed, 310 insertions(+), 206 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index fbbb0e5..b2ab39d 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -93,14 +93,17 @@ The web docs live in `docs/` and are built with [MkDocs Material](https://squidf **Structure:** - `docs/index.html` — custom HTML landing page (not generated by MkDocs, hand-maintained) - `docs/getting-started.md` — system requirements, build instructions, CMake options -- `docs/concepts.md` — terminology glossary (kernel bypass, GPUDirect, packet/burst/segment, flow/queue, memory region, zero-copy ownership, RX reorder). Meant to be opened in parallel with the rest of the docs. +- `docs/concepts.md` — terminology glossary (stream types and protocols, GPUDirect, packet/burst/segment, flow/queue, memory region, zero-copy ownership, RX reorder). Meant to be opened in parallel with the rest of the docs. 
- `docs/api-reference/index.md` — API guide (6-step application lifecycle, configuration-first model) - `docs/api-reference/configuration.md`, `docs/api-reference/cpp.md`, `docs/api-reference/python.md` — YAML schema, C++ API, and Python bindings docs -- `docs/tutorials/` — tutorial walkthroughs (system config, benchmarking, config files) +- `docs/tutorials/` — tutorial walkthroughs (system config, config-file walkthrough) +- `docs/tutorials/benchmarking_examples.md` — surfaced as a top-level "Benchmarks" nav entry in `mkdocs.yml` and `docs/index.html`; file kept at its original path for inbound-link stability - `docs/stylesheets/extra.css` — custom theme overrides +**User-facing vocabulary:** docs and the YAML schema use `stream_type` (`raw`, `socket`, future `pcie`) and `protocol` (`udp`, `tcp`, `roce`). The word "backend" is internal-only — accurate for `src/managers//`, the `Manager` ABC, CMake `DAQIRI_MGR`, and API-reference function blurbs, but should not appear in tutorials, the landing page, or concept pages. The mapping: `stream_type: "raw"` is implemented by the `dpdk` manager; `stream_type: "socket"` with `protocol: "udp"` / `"tcp"` is implemented by the `socket` manager; `stream_type: "socket"` with `protocol: "roce"` is implemented by the `rdma` manager. 
+ **Keeping docs in sync with code:** before committing changes, scan for the recurring drift hotspots: -- **Backend list** (`src/managers/*/`) — README Backends table, `docs/getting-started.md`, `docs/concepts.md` (Kernel Bypass section + Backend Maturity admonition), `docs/api-reference/configuration.md` +- **Stream-type list** (`src/managers/*/`) — README Backends table, `docs/getting-started.md`, `docs/concepts.md` (Stream Types section + Maturity admonition), `docs/api-reference/configuration.md` - **CMake options / `DAQIRI_MGR` default** (`src/CMakeLists.txt:137`) — README Quick Start, `docs/getting-started.md`, this file's Build & run section - **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, and the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage) - **Public API include** (`#include `; source files under `include/daqiri/`) — `docs/api-reference/index.md`, `docs/api-reference/cpp.md`, `docs/api-reference/python.md`; if the change adds or renames a user-facing concept, also `docs/concepts.md` diff --git a/docs/api-reference/configuration.md b/docs/api-reference/configuration.md index 30c6b7c..303a2a8 100644 --- a/docs/api-reference/configuration.md +++ b/docs/api-reference/configuration.md @@ -68,9 +68,10 @@ and their `kind` determines the receive mode (CPU-only, header-data split, or ba - values: `local`, `rdma_read`, `rdma_write` - **`num_bufs`**: Number of buffers in this region. Higher values give more processing headroom but consume more memory (GPU BAR1 for `device`). Too low risks dropped packets - on RX or higher latency on TX. Rule of thumb: 3x-5x `batch_size`. For the DPDK - backend, `num_bufs` below 1.5x the NIC ring size deadlocks the worker; `daqiri_init` - auto-bumps such MRs to 3x the ring (24576 with the default 8192) and logs a `WARN`. 
+ on RX or higher latency on TX. Rule of thumb: 3x-5x `batch_size`. For Raw Ethernet + (`stream_type: "raw"`), `num_bufs` below 1.5x the NIC ring size deadlocks the worker; + `daqiri_init` auto-bumps such MRs to 3x the ring (24576 with the default 8192) and + logs a `WARN`. - type: `integer` - **`buf_size`**: Size of each buffer in bytes. Should match the expected packet size, or the segment size when using header-data split. @@ -104,8 +105,9 @@ memory_regions: - **`name`**: Interface name. Used to look up port IDs at runtime via `get_port_id()`. - type: `string` -- **`address`**: PCIe BDF address (from `lspci`) or Linux interface name for DPDK, or IP - address for RDMA. +- **`address`**: PCIe BDF address (from `lspci`) or Linux interface name for Raw Ethernet + (`stream_type: "raw"`), or IP address for RoCE (`stream_type: "socket"`, + `protocol: "roce"`). - type: `string` ### RDMA Configuration @@ -201,7 +203,8 @@ Unmatched packets are dropped. When `false`, unmatched packets go to a default q ### Hardware Timestamps -`rx.hardware_timestamps:` — Enable per-packet hardware RX timestamps for the DPDK backend. +`rx.hardware_timestamps:` — Enable per-packet hardware RX timestamps for Raw Ethernet +(`stream_type: "raw"`). When enabled, DAQIRI requires `RTE_ETH_RX_OFFLOAD_TIMESTAMP` support from the NIC/PMD and initialization fails if DAQIRI cannot provide nanosecond timestamps for the selected PMD. Timestamps are returned by `get_packet_rx_timestamp()` in nanoseconds in the NIC timestamp @@ -210,12 +213,12 @@ clock domain, not wall-clock time. - type: `boolean` - default: `false` -### RX Reorder Configs (DPDK v1) +### RX Reorder Configs -`rx.reorder_configs:` — Optional automatic packet reordering/aggregation plans. In v1 this is -implemented for the DPDK backend only. GPU reorder requires CUDA-addressable packet buffers -(`device` or `host_pinned` memory regions). CPU reorder requires CPU-addressable packet buffers -(`host`, `host_pinned`, or `huge` memory regions). 
+`rx.reorder_configs:` — Optional automatic packet reordering/aggregation plans. Implemented +for Raw Ethernet (`stream_type: "raw"`) only in v1. GPU reorder requires CUDA-addressable +packet buffers (`device` or `host_pinned` memory regions). CPU reorder requires CPU-addressable +packet buffers (`host`, `host_pinned`, or `huge` memory regions). v1 source-memory requirement: - Reorder queues must use exactly one RX source memory region. @@ -316,7 +319,7 @@ enabled, use `set_packet_tx_time()` to schedule packets. Requires ConnectX-7 or - type: `boolean` - default: `false` -## Complete Example (DPDK, Header-Data Split) +## Complete Example (Raw Ethernet, Header-Data Split) ```yaml %YAML 1.2 diff --git a/docs/api-reference/cpp.md b/docs/api-reference/cpp.md index 0442d85..db26c4b 100644 --- a/docs/api-reference/cpp.md +++ b/docs/api-reference/cpp.md @@ -28,9 +28,9 @@ auto status = daqiri::daqiri_init(config); After `daqiri_init()` returns `Status::SUCCESS`, all memory regions are allocated, NIC queues are configured, and worker threads are running. -If GPU RX `reorder_configs` are configured for the DPDK backend, set one CUDA stream -per GPU reorder plan before pulling reordered bursts. CPU reorder configs do not use a -CUDA stream. See the [Configuration YAML Reference](configuration.md#rx-reorder-configs-dpdk-v1) +If GPU RX `reorder_configs` are configured for Raw Ethernet (`stream_type: "raw"`), set +one CUDA stream per GPU reorder plan before pulling reordered bursts. CPU reorder configs do not use a +CUDA stream. See the [Configuration YAML Reference](configuration.md#rx-reorder-configs) for reorder configuration constraints. ```cpp @@ -81,11 +81,11 @@ for (int i = 0; i < daqiri::get_num_packets(burst); i++) { } ``` -RX hardware timestamps are available only when the DPDK backend is configured with -`rx.hardware_timestamps: true` and the NIC supports `RTE_ETH_RX_OFFLOAD_TIMESTAMP`. 
-DAQIRI converts the NIC timestamp counter to nanoseconds internally using DPDK's -matching device clock when available, or the PMD's nanosecond timestamp format when -the driver already supplies nanoseconds. DAQIRI does not expose NIC clock reads or +RX hardware timestamps are available only when Raw Ethernet (`stream_type: "raw"`) is +configured with `rx.hardware_timestamps: true` and the NIC supports +`RTE_ETH_RX_OFFLOAD_TIMESTAMP`. DAQIRI converts the NIC timestamp counter to nanoseconds +internally using the matching device clock when available, or the PMD's nanosecond +timestamp format when the driver already supplies nanoseconds. DAQIRI does not expose NIC clock reads or convert timestamps to wall-clock time. For reordered aggregate bursts, `get_packet_rx_timestamp(burst, 0, &ts)` returns the timestamp of the first source packet accepted into the aggregate. diff --git a/docs/api-reference/index.md b/docs/api-reference/index.md index 8da6f22..a123a0b 100644 --- a/docs/api-reference/index.md +++ b/docs/api-reference/index.md @@ -15,16 +15,17 @@ For the terminology and conceptual background it relies on A DAQIRI application starts from a YAML configuration file (or an equivalent `NetworkConfig` struct built in code). The configuration -defines the active backend, NIC interfaces, RX and TX queues, memory -regions, flow steering rules, flow isolation, header-data split, and -optional reorder plans. After initialization, the language API operates -on those configured ports, queues, buffers, and flows. +defines the active stream type and protocol, NIC interfaces, RX and TX +queues, memory regions, flow steering rules, flow isolation, +header-data split, and optional reorder plans. After initialization, +the language API operates on those configured ports, queues, buffers, +and flows. The language APIs do **not** discover queues, memory, or flow steering rules on their own. 
They are runtime handles over the topology declared in the configuration (YAML file or `NetworkConfig` struct). The configuration is the source of truth for queue IDs, memory placement, -protocol/backend selection, and flow routing. +stream-type / protocol selection, and flow routing. The configuration schema lives in the [Configuration YAML Reference](configuration.md). For an annotated diff --git a/docs/concepts.md b/docs/concepts.md index ca0a546..7896a6a 100644 --- a/docs/concepts.md +++ b/docs/concepts.md @@ -8,71 +8,111 @@ hide: This page is the DAQIRI glossary. It defines the terms used across the [API Guide](api-reference/index.md), [Configuration Reference](api-reference/configuration.md), and -[tutorials](tutorials/system_configuration.md): **kernel bypass**, -**GPUDirect**, **packet / burst / segment**, **flow / queue**, -**memory region**, **zero-copy ownership**, and **RX reorder**. - -## Kernel Bypass - -**Kernel bypass** means bypassing the operating system's kernel to talk -directly to the network interface (NIC). That removes the latency and -overhead of the Linux network stack and lets the application work with NIC -ring buffers in user space. - -DAQIRI is a thin, common interface over multiple kernel-bypass technologies. -All of its backends are Ethernet-based, but they differ in their model, -features, and footprint: - -- **DPDK**: the [Data Plane Development Kit](https://www.dpdk.org/) is a - Linux Foundation project with strong, long-running community support. Its - RTE Flow capability is generally considered the most flexible solution for - splitting ingress and egress data into per-queue streams. -- **RDMA**: Remote Direct Memory Access, using the open-source - [`rdma-core`](https://github.com/linux-rdma/rdma-core) library. RDMA - differs from the other Ethernet-based backends with its server/client - model and **RoCE** (RDMA over Converged Ethernet) protocol. 
It costs more - to set up on both ends but offers a simpler user interface, orders packets - on arrival, and provides a NIC-level reliable transport mode (RC). -- **Socket**: a socket-oriented interface (UDP and TCP via - the Linux kernel, plus a RoCE path that delegates to the RDMA backend). - Useful as a comparison baseline against DPDK and RDMA, and as a path to - first results when no NVIDIA NIC is available. - -Which backend is best for your use case depends on multiple factors: packet -size, batch size, data type, whether you need ordering or reliability, and -whether both ends of the link are under your control. DAQIRI's goal is to -abstract the interface to these backends so developers can focus on -application logic and experiment with different configurations to find the -best technology for their workload. - -??? example "Backend maturity" +[tutorials](tutorials/system_configuration.md): **stream types and +protocols**, **GPUDirect**, **packet / burst / segment**, +**flow / queue**, **memory region**, **zero-copy ownership**, and +**RX reorder**. + +## Stream Types + +DAQIRI exposes a single C++ API on top of several packet-I/O stacks. The +choice is configured per-application in YAML by two keys: + +- `stream_type` — the I/O stack family. +- `protocol` — required when `stream_type: "socket"`; selects the + socket-level protocol. + +### Raw Ethernet — `stream_type: "raw"` + +Kernel-bypass raw Ethernet. The application talks directly to NIC ring +buffers in user space, skipping the Linux network stack entirely. This +is the highest-performance path and the only one with hardware flow +steering (see [Flows](#flow) below). Currently implemented on top of +[DPDK](https://www.dpdk.org/); the DPDK dependency is an implementation +detail, not a user-facing concept. + +Requires an NVIDIA SmartNIC (ConnectX-6 Dx or later). + +### Socket — `stream_type: "socket"` + +Socket-style interfaces. 
The specific transport is chosen by `protocol`: + +- **`protocol: "udp"`** / **`protocol: "tcp"`** — Linux kernel UDP and + TCP sockets. No NIC privileges required, no special hardware. Useful + as a comparison baseline against the kernel-bypass paths and as a way + to get first results on a system without an NVIDIA NIC. +- **`protocol: "roce"`** — RDMA over Converged Ethernet, using the + open-source [`rdma-core`](https://github.com/linux-rdma/rdma-core) + library. A server/client connection model, NIC-level reliable + transport (RC), and in-order delivery. Primarily intended for + workloads where **one** endpoint is a third-party device (an FPGA, an + instrument, or another customer-supplied black box) that already + speaks RoCE. When both peers run DAQIRI, prefer an upper-layer + library such as MPI, NCCL, or UCX rather than wiring RoCE directly. + +### PCIe — `stream_type: "pcie"` *(future)* + +Placeholder for an upcoming direct-PCIe stream type. Not implemented +yet. + +### Choosing a stream type + +The right choice depends on packet size, batch size, latency target, +whether you need ordering or hardware reliability, and what the other +end of the link looks like. DAQIRI's job is to make swapping among them +a configuration change rather than a code change. + +For a use-case-driven decision tree (baseline throughput, GPU reorder, +header-data split, multi-queue flow steering, packet recording, RDMA, +sockets), see +[Choosing an example config](tutorials/configuration-walkthrough.md#choosing-an-example-config) +in the configuration walkthrough. + +??? example "Maturity" The DAQIRI library integration testing infrastructure is under active development. As such: - - The **DPDK** backend is supported and distributed with the DAQIRI - library, and is the only backend actively tested at this time. - - The **RDMA / RoCE** backend is supported and distributed with the - DAQIRI library; integration testing is under development. 
- - The **Socket** backend (UDP/TCP via the Linux kernel, plus the RoCE - path that delegates to RDMA) is supported and distributed; integration + - **Raw Ethernet** (`stream_type: "raw"`) is supported, distributed + with the DAQIRI library, and is the only stream type actively + tested at this time. + - **Socket — UDP / TCP** (`stream_type: "socket"`, `protocol: "udp"` + / `"tcp"`) is supported and distributed; integration testing is + under development. + - **Socket — RoCE** (`stream_type: "socket"`, + `protocol: "roce"`) is supported and distributed; integration testing is under development. ## GPUDirect -**GPUDirect** allows the NIC to read and write data from/to a GPU without -having to first stage it through system memory. That decreases CPU overhead -and significantly reduces latency. An implementation of GPUDirect is -supported by every DAQIRI backend. +**GPUDirect** allows the NIC to read and write data from/to a GPU +without staging it through system memory first. That decreases CPU +overhead and significantly reduces latency. An implementation of +GPUDirect is supported by every DAQIRI stream type. + +The two paths look like this: + +```mermaid +flowchart LR + subgraph withGPUDirect [With GPUDirect] + nicA[NIC] -->|"PCIe peer-to-peer DMA"| gpuA[GPU memory] + end + subgraph withoutGPUDirect [Without GPUDirect] + nicB[NIC] -->|"DMA"| cpuB[CPU staging buffer] -->|"cudaMemcpy"| gpuB[GPU memory] + end +``` + +The GPUDirect path skips the CPU-side staging buffer and the +`cudaMemcpy` that goes with it. !!! warning - GPUDirect is only supported on Workstation/Quadro/RTX GPUs and Data - Center GPUs. It is not supported on GeForce cards. + GPUDirect is only supported on RTX GPUs and Data Center GPUs. It is + not supported on GeForce cards. ??? info "How does that relate to peermem or dma-buf?" 
- There are two interfaces to enable GPUDirect: + There are two kernel interfaces to enable GPUDirect: - The [`nvidia-peermem`](https://docs.nvidia.com/cuda/gpudirect-rdma/) kernel module, distributed with the NVIDIA DKMS GPU drivers. @@ -95,50 +135,46 @@ For step-by-step system setup, see the ## Packets, Bursts, and Segments -These three terms describe the units of data that flow through DAQIRI. -They appear throughout the API, configuration, and code paths. +DAQIRI is a batch processing library. Packets are received from DAQIRI +and sent to DAQIRI in batches called **bursts**. Larger bursts can +increase throughput at the expense of latency; smaller bursts decrease +latency but cap total throughput because of the per-burst processing +overhead. The terms below appear throughout the API, configuration, and +code paths. ### Packet -A **packet** is a single Ethernet frame including headers and payload as one -logical unit. DAQIRI never delivers packets one at a time; the unit of -delivery is a *burst*. +A **packet** is a single, contiguous block of memory representing +either received data or data to transmit. Packets can be far larger +than an Ethernet MTU in some cases (for example with `protocol: "roce"` +or `protocol: "tcp"`/`"udp"`); the underlying stack fragments and +reassembles them on the wire transparently. ### Burst (`BurstParams`) -A **burst** is a batch of packets grouped together for efficient transfer -between DAQIRI internals and the application. Bursts are the way the -application receives, transmits, and frees packets. - -The C++ type for a burst is `BurstParams`. A burst carries: - -- Pointers to the underlying packet buffers -- Packet count, port ID, queue ID, segment count -- Per-packet byte totals and lengths -- Flow IDs (when flow steering is configured) -- Optional RX hardware timestamps - -`BurstParams` is meant to be opaque. Applications use helper functions -(`get_packet_ptr`, `get_packet_length`, `get_num_packets`, ...) 
to inspect -or modify it rather than touching its fields directly. +A **burst** is the metadata container DAQIRI uses to describe a batch +of packets being transmitted or received. The C++ type for a burst is +`BurstParams`. It is intentionally opaque — applications use helper +functions (`get_packet_ptr`, `get_packet_length`, `get_num_packets`, +...) to inspect or modify it rather than touching its fields directly. ### Segment -A **segment** is one contiguous memory region inside a packet. A packet can -have one segment or multiple segments. The number of segments a packet has -is set by the receive mode configured in the YAML: +A **segment** is one contiguous memory region inside a packet. A packet +can have one segment or multiple segments: -- **Single segment**: used for CPU-only or batched-GPU paths that do not - split headers from payloads. -- **Two segments (header-data split)**: segment 0 holds headers in CPU - memory, segment 1 holds payload data in GPU memory. +- **Single segment**: the whole packet fills one contiguous region. +- **Multiple segments**: each segment is assigned to a different memory + region. The memory regions can be of any kind (CPU or GPU) in any + order. A common use case is *header-data split* (HDS) below. ### Header-Data Split (HDS) -**Header-data split** is the most common multi-segment configuration: +**Header-data split** is the canonical multi-segment configuration: headers go to CPU memory (segment 0), payload goes to GPU memory -(segment 1). This keeps the GPU payload path zero-copy for downstream GPU -workloads while still letting the CPU parse and steer on the headers. +(segment 1). This keeps the GPU payload path zero-copy for downstream +GPU workloads while still letting the CPU parse and steer on the +headers. Use HDS when the application needs to inspect headers (UDP source/destination ports, application-layer sequence numbers, etc.) 
but @@ -161,51 +197,64 @@ buffers (CPU hugepages, GPU device memory, or pinned host memory). ### Flow -A **flow** is a rule that maps packets matching a given pattern to a -specific queue. A flow has a match (e.g. UDP destination port 4096, -IPv4 length 1050) and an action (e.g. *queue 0*). Multiple flows can -target the same queue; the matching flow's ID is available at runtime -so the application can distinguish them. Flows are configured under -`rx.flows` in the YAML. +A **flow** is a match pattern paired with an action. The common action +is to steer matching packets into a specific queue. For example, all +UDP-destination-port-4096 packets can be routed into a queue backed by +GPU memory. Matching and the resulting action both run entirely in NIC +hardware. + +Flow rules are only available in Raw Ethernet (`stream_type: "raw"`). + +A flow's match can combine fields such as `udp_src`, `udp_dst`, and +`ipv4_len`; multiple flows can target the same queue, and the matching +flow's ID is available at runtime so the application can distinguish +them. Flows are configured under `rx.flows` in the YAML. ### Flow Steering -**Flow steering** is the NIC-level mechanism that classifies an incoming -packet against the configured flows and writes it into the matching -queue's buffer, entirely in hardware. Multi-queue RX works by routing -each flow to a separate queue for parallel processing. +**Flow steering** is the NIC-level mechanism that classifies an +incoming packet against the configured flows and writes it into the +matching queue's buffer, entirely in hardware. Multi-queue RX works by +routing each flow to a separate queue for parallel processing. -For DPDK, flow steering is implemented on top of RTE Flow. The YAML -options are documented in +For Raw Ethernet, flow steering is implemented on top of RTE Flow. The +YAML options are documented in [Configuration YAML Reference → Flows](api-reference/configuration.md#flows). 
## Memory Regions A **memory region** is a named pool of buffers where packet data lives. -Memory regions are declared at the top of the YAML and referenced by name -from each queue. +Memory regions are declared at the top of the YAML and referenced by +name from each queue. -The kind of a memory region determines whether packet data ends up on the -CPU or the GPU: +The kind of a memory region determines whether packet data ends up on +the CPU or the GPU: - `huge`: CPU hugepages (recommended for CPU buffers). - `device`: GPU VRAM (discrete GPUs; requires GPUDirect via peermem or DMA-BUF). - `host_pinned`: pinned CPU pages allocated via `cudaHostAlloc`. - Recommended on integrated GPUs (NVIDIA GB10 / DGX Spark), where the NIC - cannot peer-DMA into device memory. + Recommended on integrated GPUs (NVIDIA GB10 / DGX Spark), where the + NIC cannot peer-DMA into device memory. - `host`: regular CPU memory (not recommended for hot paths). -Combining memory regions on a single queue is how *header-data split* is -expressed in the YAML: queue 0's first memory region is a `huge` CPU pool -(for headers, segment 0); its second region is a `device` GPU pool (for -payload, segment 1). +The size of the memory region (`buf_size`) dictates the largest +contiguous chunk that can be stored in a single *segment*. For example, +with a 60-byte region the first 60 bytes of each packet land in that +segment before the remainder spills into the next region in the +queue's list. Region buffers can be much larger than a single Ethernet +frame for fragmented transports (for example, `protocol: "roce"`). + +Combining memory regions on a single queue is how *header-data split* +is expressed in the YAML: queue 0's first memory region is a `huge` CPU +pool (for headers, segment 0); its second region is a `device` GPU pool +(for payload, segment 1). ## Zero-Copy Ownership DAQIRI is designed around zero-copy packet delivery. 
When a receive API -returns packet data, the application is reading the buffers the NIC DMA'd -into; the API passes pointers and metadata, not copies. +returns packet data, the application is reading the buffers the NIC +DMA'd into; the API passes pointers and metadata, not copies. That zero-copy model makes **buffer release part of the API contract**. Applications must free RX bursts after processing and free or send TX @@ -236,7 +285,7 @@ GPU-only or CPU-only. Reordering packets whose segments span two memory regions (for example, an HDS pair with CPU-side headers and GPU-side payloads) is not yet supported but is planned. -See [Configuration YAML Reference → RX Reorder Configs](api-reference/configuration.md#rx-reorder-configs-dpdk-v1) +See [Configuration YAML Reference → RX Reorder Configs](api-reference/configuration.md#rx-reorder-configs) for the configuration constraints and [C++ API Usage → Reordered RX bursts](api-reference/cpp.md#reordered-rx-bursts) for how to consume them from C++. @@ -250,4 +299,4 @@ for how to consume them from C++. - [C++ API Usage](api-reference/cpp.md): initialization, RX/TX, file writes, utilities, and the C++ function reference. - [System Configuration tutorial](tutorials/system_configuration.md): - the hardware and OS setup the concepts above depend on. \ No newline at end of file + the hardware and OS setup the concepts above depend on. diff --git a/docs/getting-started.md b/docs/getting-started.md index 0000854..ea843a1 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -7,23 +7,19 @@ hide: ## System Requirements -DAQIRI requires a system with an [**NVIDIA SmartNIC**](https://www.nvidia.com/en-us/networking/ethernet-adapters/) (ConnectX-6 Dx or later) and a [**discrete GPU**](https://www.nvidia.com/en-us/design-visualization/desktop-graphics/). +DAQIRI's baseline requirements depend on which [stream type](concepts.md#stream-types) you plan to use. 
The Linux Sockets path (`stream_type: "socket"`, `protocol: "udp"`/`"tcp"`) runs on any modern Linux box. The Raw Ethernet kernel-bypass path and GPUDirect impose additional hardware requirements, listed below. | Component | Requirement | |-----------|-------------| | **OS** | Linux (kernel 5.4+), Ubuntu 22.04 recommended | -| **NIC** | NVIDIA ConnectX-6 Dx or later, with MLNX_OFED or inbox drivers | -| **GPU** | Workstation/Quadro/RTX or Data Center GPU (GPUDirect-capable) | -| **CUDA** | CUDA Toolkit 11.7+ | -| **DPDK** | Included in the DAQIRI container; see [Dockerfile](https://github.com/NVIDIA/daqiri/blob/main/Dockerfile) for bare-metal deps | -| **RDMA** | `libibverbs` and `librdmacm` (for the RDMA backend) | +| **CUDA** | CUDA Toolkit 12.2+ (the container ships CUDA 13.1) | +| **NIC** *(Raw Ethernet / GPUDirect / RoCE only)* | NVIDIA ConnectX-6 Dx or later. Default Ubuntu kernel drivers (inbox) are sufficient; we recommend also installing `doca-ofed` for the diagnostic utilities (`ibstat`, `ibv_devinfo`, `ibdev2netdev`, `mlnx_perf`, `mlxconfig`, …). | +| **GPU** *(GPUDirect only)* | RTX or Data Center GPU. GeForce is not supported. | +| **DPDK** | Included in the DAQIRI container (patched for dma-buf, so `nvidia-peermem` is **not required** inside the container); see [bare-metal dependencies](#bare-metal-dependencies) below for the host build. | +| **RoCE** | `libibverbs` and `librdmacm` (for `stream_type: "socket"`, `protocol: "roce"`). | | **GDS** | Optional `cufile.h` and `libcufile` for file writes from CUDA device memory. Runtime device-memory writes require a working cuFile installation; for regular `nvidia-fs` mode, the `nvidia-fs` kernel module must be loaded and the destination storage stack must be supported. 
| -Supported platforms include [NVIDIA Data Center](https://www.nvidia.com/en-us/data-center/) systems, edge systems like [NVIDIA IGX](https://www.nvidia.com/en-us/edge-computing/products/igx/) and [NVIDIA Project DIGITS](https://www.nvidia.com/en-us/project-digits/), and `x86_64` systems with the above components. - -!!! note - - If you use the DPDK bundled in the DAQIRI container, it is patched with dmabuf support and the `nvidia-peermem` kernel module is **not required**. +Supported platforms include [NVIDIA Data Center](https://www.nvidia.com/en-us/data-center/) systems, edge systems like [NVIDIA IGX](https://www.nvidia.com/en-us/edge-computing/products/igx/) and [NVIDIA DGX Spark](https://www.nvidia.com/en-us/products/workstations/dgx-spark/), and `x86_64` systems with the above components. For detailed instructions on verifying NIC drivers, configuring link layers, enabling GPUDirect, and tuning your system for maximum performance, see the [System Configuration tutorial](tutorials/system_configuration.md). @@ -73,7 +69,7 @@ Then build the DAQIRI library: === "Container build (recommended)" - The container bundles all user-space libraries for each networking backend, avoiding dependency issues on the host: + The container bundles all user-space libraries for each stream type, avoiding dependency issues on the host: ```bash git clone git@github.com:NVIDIA/daqiri.git @@ -95,6 +91,8 @@ Then build the DAQIRI library: === "CMake build (bare-metal)" + Install the dependencies listed under [Bare-metal dependencies](#bare-metal-dependencies) below first, then: + ```bash git clone git@github.com:NVIDIA/daqiri.git cd daqiri @@ -103,7 +101,27 @@ Then build the DAQIRI library: cmake --install build --prefix /opt/daqiri ``` - Inspect the [Dockerfile](https://github.com/NVIDIA/daqiri/blob/main/Dockerfile) to see the full list of user-space dependencies needed for a bare-metal build. +### Bare-metal dependencies + +The Ubuntu apt packages mirror the Dockerfile. 
Build DPDK from source with the patches under `dpdk_patches/` if you want GPUDirect without the `nvidia-peermem` kernel module. + +```bash +# Core build deps +sudo apt install -y \ + build-essential cmake git curl ca-certificates gnupg \ + pkgconf ninja-build meson python3-pip python3-dev python3-pyelftools + +# Raw Ethernet (DPDK) build deps +sudo apt install -y libnuma-dev + +# RoCE / RDMA + diagnostic utilities (from the DOCA APT repo, see above) +sudo apt install -y \ + libibverbs-dev librdmacm-dev libmlx5-1 ibverbs-utils infiniband-diags \ + mlnx-ofed-kernel-utils mft + +# Python bindings (only if -DDAQIRI_BUILD_PYTHON=ON) +sudo apt install -y pybind11-dev +``` ### Use an Installed Library @@ -130,7 +148,7 @@ Both methods use the same public C++ include: | Option | Default | Description | |--------|---------|-------------| -| `DAQIRI_MGR` | `"dpdk socket rdma"` | Space-separated list of backends to build. Valid values: `dpdk`, `socket`, `rdma`. | +| `DAQIRI_MGR` | `"dpdk socket rdma"` | Space-separated list of manager implementations to compile in. Valid values: `dpdk` (Raw Ethernet), `socket` (Linux UDP/TCP sockets), `rdma` (RoCE). | | `DAQIRI_BUILD_PYTHON` | `OFF` | Build pybind11 Python bindings. | | `DAQIRI_BUILD_EXAMPLES` | `ON` | Build benchmark executables. | | `DAQIRI_ENABLE_GDS` | `OFF` | Enable cuFile-backed burst file writes from CUDA device memory. Host-memory writes use POSIX APIs without GDS. | @@ -164,7 +182,7 @@ must configure the OpenTelemetry C++ SDK before or during DAQIRI initialization. Once DAQIRI is built, follow the tutorials to configure your system and run your first benchmark: -1. [**Concepts**](concepts.md) — terminology (packet, burst, segment, flow, queue, memory region), kernel-bypass backends, GPUDirect, and zero-copy ownership. Keep this open in a second tab. +1. [**Concepts**](concepts.md) — terminology (stream types and protocols, packet, burst, segment, flow, queue, memory region), GPUDirect, and zero-copy ownership. 
Keep this open in a second tab. 2. [**API Guide**](api-reference/index.md) — the six-step DAQIRI application lifecycle and configuration-first model 3. [**System Configuration**](tutorials/system_configuration.md) — NIC drivers, link layers, GPUDirect, hugepages, CPU isolation, GPU clocks, and more 4. [**Benchmarking Examples**](tutorials/benchmarking_examples.md) — run `daqiri_bench_raw_gpudirect` with a loopback test diff --git a/docs/index.html b/docs/index.html index 2984720..0b9b579 100644 --- a/docs/index.html +++ b/docs/index.html @@ -40,18 +40,18 @@ /* NAV */ #navbar { position:fixed; top:0; left:0; right:0; z-index:1000; height:var(--nav-h); display:flex; align-items:center; background:rgba(10,10,10,.92); backdrop-filter:blur(16px); border-bottom:1px solid var(--border); transition:box-shadow var(--ease); } #navbar.scrolled { box-shadow:0 4px 40px rgba(0,0,0,.6); } - .nav-inner { width:100%; max-width:1200px; margin:0 auto; padding:0 2rem; display:flex; align-items:center; gap:2rem; } + .nav-inner { width:100%; max-width:1200px; margin:0 auto; padding:0 2rem; display:flex; align-items:center; gap:1.25rem; } .nav-logo { display:flex; align-items:center; gap:.75rem; flex-shrink:0; text-decoration:none; } .nav-logo-icon { width:32px; height:32px; background:var(--nv-green); border-radius:6px; display:flex; align-items:center; justify-content:center; font-weight:900; font-size:.75rem; color:#000; letter-spacing:-.05em; } .nav-logo-text { font-weight:800; font-size:1.1rem; color:var(--text-pri); letter-spacing:.05em; } .nav-logo-badge { font-size:.65rem; font-weight:700; padding:2px 6px; background:rgba(118,185,0,.15); color:var(--nv-green); border:1px solid rgba(118,185,0,.3); border-radius:99px; letter-spacing:.08em; } - .nav-links { display:flex; align-items:center; gap:.25rem; flex:1; } - .nav-links a { color:var(--text-mut); font-size:.875rem; font-weight:500; padding:.4rem .75rem; border-radius:var(--radius); transition:color var(--ease),background 
var(--ease); } + .nav-links { display:flex; align-items:center; gap:.1rem; flex:1; } + .nav-links a { color:var(--text-mut); font-size:.875rem; font-weight:500; padding:.4rem .65rem; border-radius:var(--radius); transition:color var(--ease),background var(--ease); white-space:nowrap; } .nav-links a:hover { color:var(--text-pri); background:rgba(255,255,255,.05); } .nav-links a.active { color:var(--nv-green); } .nav-links a.nav-ext::after { content:'↗'; font-size:.72em; opacity:.55; margin-left:2px; } - .nav-actions { display:flex; align-items:center; gap:.75rem; margin-left:auto; } - .btn { display:inline-flex; align-items:center; gap:.5rem; font-size:.875rem; font-weight:600; padding:.5rem 1.25rem; border-radius:var(--radius); border:1.5px solid transparent; cursor:pointer; transition:all var(--ease); text-decoration:none; } + .nav-actions { display:flex; align-items:center; gap:.5rem; margin-left:auto; flex-shrink:0; } + .btn { display:inline-flex; align-items:center; gap:.5rem; font-size:.875rem; font-weight:600; padding:.5rem 1.25rem; border-radius:var(--radius); border:1.5px solid transparent; cursor:pointer; transition:all var(--ease); text-decoration:none; white-space:nowrap; } .btn-primary { background:var(--nv-green); color:#000; border-color:var(--nv-green); } .btn-primary:hover { background:var(--nv-green-l); border-color:var(--nv-green-l); color:#000; } .btn-outline { background:transparent; color:var(--text-mut); border-color:var(--border); } @@ -187,9 +187,10 @@ ::-webkit-scrollbar { width:6px; height:6px; } ::-webkit-scrollbar-track { background:var(--bg-dark); } ::-webkit-scrollbar-thumb { background:#333; border-radius:99px; } + @media (max-width:1100px) { .nav-links { display:none; } } @media (max-width:1000px) { .hero-inner { grid-template-columns:1fr; } .hero-logo-wrap { display:none; } } @media (max-width:900px) { .gs-layout { grid-template-columns:1fr; } .gs-code-panel { position:static; } .footer-inner { grid-template-columns:1fr 1fr; } } - 
@media (max-width:640px) { .nav-links { display:none; } section { padding:4rem 0; } .footer-inner { grid-template-columns:1fr; } .tut-meta { display:none; } } + @media (max-width:640px) { section { padding:4rem 0; } .footer-inner { grid-template-columns:1fr; } .tut-meta { display:none; } .nav-actions .btn-outline { display:none; } } @@ -203,7 +204,8 @@

Closing the Gap Between Sensor and GPU

-

Scientific and industrial instruments generate data that is richest at the source — before it is filtered, decimated, or summarized. DAQIRI places NVIDIA GPU hardware directly in that data path, forging a tight bond between upstream sensors, their data converters, and the NVIDIA compute ecosystem. The result is a new foundation for developers: the ability to work with instrument data in its rawest form, at wire speed, and to build a new class of autonomous experiments where AI can observe phenomena directly at the source, augment human analysis, and steer experiments in real time. Streaming Ethernet data in, GPU tensor out.

+

Scientific and industrial instruments generate data that is richest at the source — before it is filtered, decimated, or summarized. DAQIRI places NVIDIA GPU hardware directly in that data path, forging a tight bond between upstream sensors, their data converters, and the NVIDIA compute ecosystem. The result is a new foundation for developers: the ability to work with instrument data in its rawest form, at wire speed, and to build a new class of autonomous experiments where AI can observe phenomena directly at the source, augment human analysis, and steer experiments in real time. Stream data into and out of GPUs efficiently while leveraging common tensor-compute libraries.

AI Native DAQ Architecture @@ -282,7 +284,7 @@

GPUDirect Zero-Copy

🔀

Hardware Flow Steering

-

Route packets to specific queues by UDP port, IPv4 payload length, or custom flex items — all in NIC silicon, before any software runs.

+

Route packets based on header matching to steer different streams to different GPUs or CPUs — entirely in NIC silicon, before any software runs.

🔗
@@ -292,7 +294,7 @@

RDMA over Converged Ethernet

📄

YAML-Driven Configuration

-

Define memory regions, NIC interfaces, TX/RX queues, and flow rules in a single YAML file — or build the same config in C++ code. Switch backends, memory kinds, and buffer sizes without recompiling.

+

Define memory regions, NIC interfaces, TX/RX queues, and flow rules in a single YAML file — or build the same config in C++ code. Switch stream types, memory kinds, and buffer sizes without recompiling.

📦
@@ -310,7 +312,7 @@

Containerized Deployment

Build & Run in Minutes

-

Requires a ConnectX-6 Dx+ NIC, Linux (kernel 5.4+), and the CUDA Toolkit.

+

Runs on Linux (kernel 5.4+) with the CUDA Toolkit 12.2+. The kernel-bypass and GPUDirect paths additionally require an NVIDIA ConnectX-6 Dx (or newer) NIC.

Full Guide →
@@ -320,14 +322,15 @@

Build & Run in Minutes

1

Install Prerequisites

-

Install MLNX5/InfiniBand drivers with peermem support (inbox on Ubuntu ≥5.4 and <6.8, or OFED from DOCA-Host 2.8+). Install the CUDA Toolkit.

+

Install the CUDA Toolkit (12.2 or newer).

+

For the Raw Ethernet / GPUDirect / RoCE path, you also need an NVIDIA ConnectX-6 Dx (or newer) NIC. The default Ubuntu kernel drivers are sufficient; we recommend additionally installing doca-ofed for the diagnostic utilities (ibstat, ibv_devinfo, mlxconfig, mlnx_perf, …).

2

Build from Source

-

Select backends with DAQIRI_MGR. Valid values: dpdk, rdma.

+

Select implementations with DAQIRI_MGR. Valid values: dpdk, socket, rdma.
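The relationship between the build-time `DAQIRI_MGR` values and the runtime YAML keys can be sketched as a small lookup. This is only an illustration of the mapping stated in this patch (`dpdk` backs `stream_type: "raw"`, `socket` backs `stream_type: "socket"` with `protocol: "udp"` or `"tcp"`, `rdma` backs `protocol: "roce"`). The names `MANAGER_STREAMS`, `enabled_streams`, and `is_buildable` are hypothetical and are not part of DAQIRI.

```python
# Hypothetical helper, not DAQIRI code: models the documented mapping between
# DAQIRI_MGR entries (build time) and stream_type/protocol pairs (runtime YAML).
MANAGER_STREAMS = {
    "dpdk": [("raw", None)],                           # Raw Ethernet
    "socket": [("socket", "udp"), ("socket", "tcp")],  # Linux kernel sockets
    "rdma": [("socket", "roce")],                      # RoCE
}


def enabled_streams(daqiri_mgr="dpdk socket rdma"):
    """Return the (stream_type, protocol) pairs a DAQIRI_MGR value would enable."""
    pairs = []
    for manager in daqiri_mgr.split():
        if manager not in MANAGER_STREAMS:
            raise ValueError(f"unknown DAQIRI_MGR entry: {manager!r}")
        pairs.extend(MANAGER_STREAMS[manager])
    return pairs


def is_buildable(stream_type, protocol, daqiri_mgr="dpdk socket rdma"):
    """True if the given YAML combination is covered by the selected managers."""
    return (stream_type, protocol) in enabled_streams(daqiri_mgr)
```

For instance, under these assumptions a build configured with `DAQIRI_MGR="dpdk socket"` would cover `raw` and `socket` with `udp`/`tcp`, but not `socket` with `roce`.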

# Configure, build, install
 cmake -S . -B build \
   -DBUILD_SHARED_LIBS=ON \
@@ -351,8 +354,8 @@ 

Or Build the Container

4

Tune the System

-

Isolate CPU cores, enable hugepages, configure NUMA affinity. Run the diagnostic script:

-
python3 python/tune_system.py
+

Run the diagnostic script to surface common networking bottlenecks (CPU governor, hugepages, MRRS, NUMA, GPU clocks, MTU, BAR1, PCIe topology):

+
sudo python3 python/tune_system.py --check all
@@ -447,7 +450,7 @@

Examples

payload_ptr,payload_size); } daqiri::send_tx_burst(burst);
- +
@@ -501,21 +504,27 @@

Examples

         cpu_core: 9
         batch_size: 10240
         memory_regions:
-          - "Data_RX_CPU"
           - "Data_RX_GPU"
+      - name: "rx_q_1"
+        id: 1
+        cpu_core: 10
+        batch_size: 10240
+        memory_regions:
+          - "Data_RX_GPU_2"
     flows:
-      - name: "flow_0"
+      - name: "udp_4096"
         id: 0
         action: {type: queue, id: 0}
-        match:
-          udp_src: 4096
-          udp_dst: 4096
-          ipv4_len: 1050
-
+        match: {udp_dst: 4096}
+      - name: "udp_4097"
+        id: 1
+        action: {type: queue, id: 1}
+        match: {udp_dst: 4097}
+
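To make the two flow rules in this hunk concrete, here is a minimal software model of what they express: packets with UDP destination port 4096 go to queue 0, and packets with port 4097 go to queue 1. In DAQIRI this matching is declared in YAML and carried out by the NIC, so the `FLOWS` list and the `steer` function below are illustrative assumptions only, not library code.

```python
# Hypothetical model of the flow rules in the YAML above; not DAQIRI code.
FLOWS = [
    {"name": "udp_4096", "id": 0,
     "action": {"type": "queue", "id": 0}, "match": {"udp_dst": 4096}},
    {"name": "udp_4097", "id": 1,
     "action": {"type": "queue", "id": 1}, "match": {"udp_dst": 4097}},
]


def steer(packet_fields, flows=FLOWS):
    """Return the queue id of the first flow whose match fields all equal
    the corresponding packet header fields, or None if no flow matches."""
    for flow in flows:
        if all(packet_fields.get(key) == value
               for key, value in flow["match"].items()):
            return flow["action"]["id"]
    return None
```

A `None` result here only means that no rule in the sketch matched. What the hardware does with unmatched packets is outside the scope of this model.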
-
DPDK Benchmarks · bash
+
Raw Ethernet Benchmarks · bash
# Build with examples
 cmake -S . -B build \
   -DDAQIRI_BUILD_EXAMPLES=ON \
@@ -531,7 +540,7 @@ 

Examples

./build/examples/daqiri_bench_raw_hds \ examples/daqiri_bench_raw_tx_rx_hds.yaml \ --seconds 10
- +
@@ -569,14 +578,14 @@

Tutorials

Getting Started →
- 01
Requirements & Installation
Hardware (ConnectX-6 Dx+), driver setup (OFED from DOCA-Host 2.8+ or inbox on Ubuntu 5.4–6.7), and CUDA Toolkit installation on Linux 5.4+.
Beginner · ~15 min
+ 01
Requirements & Installation
Hardware (NVIDIA ConnectX-6 Dx or newer for kernel-bypass and GPUDirect), default Ubuntu kernel drivers plus optional doca-ofed for diagnostics, and CUDA Toolkit 12.2+ on Linux 5.4+.
Beginner · ~15 min
02
Building from Source with CMake
Configure DAQIRI_MGR, DAQIRI_BUILD_PYTHON, BUILD_SHARED_LIBS, and DAQIRI_BUILD_EXAMPLES. Build for A100/H100 (CUDA arches 80, 90).
Coming Soon
03
Container Build with Patched DPDK
Build the Docker image with build-container.sh. The container ships a dmabuf-patched DPDK, so peermem is not required.
Coming Soon
04
System Tuning for High-Performance Networking
Isolate CPU cores, configure hugepages, set NUMA affinity, and run python/tune_system.py to diagnose common configuration issues.
Intermediate · ~30 min
05
Benchmarking Examples
Run a TX/RX loopback test to validate your setup, and walk through interpreting throughput results.
Beginner · ~20 min
06
YAML Configuration Deep Dive
Memory regions (huge, device, host_pinned), RX/TX queue setup, flow steering rules, flex items, and RDMA client/server config schemas.
Intermediate · ~40 min
07
GPUDirect: Header-Data Split Pipeline
Configure a two-region memory layout, access CPU headers and GPU payloads per-packet with get_segment_packet_ptr(), and reorder scattered GPU buffers with the built-in CUDA kernel.
Coming Soon
-
08
RDMA Client/Server Setup
Configure the RDMA backend with RC transport, assign client and server roles across two hosts, and run daqiri_bench_rdma to validate the connection.
Coming Soon
+
08
RoCE (RDMA) Client/Server Setup
Configure stream_type: socket, protocol: roce with RC transport, assign client and server roles across two hosts, and run daqiri_bench_rdma to validate the connection.
Coming Soon
09
Timed TX with ConnectX-7
Enable accurate_send in the TX config and use set_packet_tx_time() for PTP-synchronized, hardware-scheduled packet transmission on ConnectX-7+.
Coming Soon
@@ -596,7 +605,7 @@

News

GitHub · 2025
DAQIRI Open-Sourced on GitHub
-
NVIDIA — Initial public release under Apache 2.0, featuring DPDK and RDMA backends with GPUDirect support for ConnectX-6 Dx and later NICs.
+
NVIDIA — Initial public release under Apache 2.0, featuring Raw Ethernet and RoCE stream types with GPUDirect support for ConnectX-6 Dx and later NICs.
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css index 084896a..454761d 100644 --- a/docs/stylesheets/extra.css +++ b/docs/stylesheets/extra.css @@ -118,6 +118,15 @@ background: #0d0d0d; } +/* ── Content width ───────────────────────────────────────────────────── */ +/* Material defaults to ~61rem of content width; combined with the smaller */ +/* 0.72rem typeset baseline above, tables with monospace cells (CMake-options*/ +/* table in getting-started.md) wrap mid-token. Widen the grid for the slate */ +/* theme so wide tables breathe. */ +[data-md-color-scheme="slate"] .md-grid { + max-width: 76rem; +} + /* ── Tables ──────────────────────────────────────────────────────────── */ [data-md-color-scheme="slate"] .md-typeset table:not([class]) { border: 1px solid var(--nv-border); @@ -135,6 +144,16 @@ [data-md-color-scheme="slate"] .md-typeset table:not([class]) td { border-bottom: 1px solid #181818; color: var(--nv-text-mut); + /* Prefer word-boundary wrapping over mid-word breaks, which were */ + /* producing single-character orphans in narrow columns. */ + word-break: normal; + overflow-wrap: anywhere; + hyphens: none; +} +/* Keep monospace tokens (CMake options, YAML keys) together. */ +[data-md-color-scheme="slate"] .md-typeset table:not([class]) td > code, +[data-md-color-scheme="slate"] .md-typeset table:not([class]) th > code { + white-space: nowrap; } /* ── Sidebar / Navigation ────────────────────────────────────────────── */ diff --git a/docs/tutorials/benchmarking_examples.md b/docs/tutorials/benchmarking_examples.md index f2921e8..79167fb 100644 --- a/docs/tutorials/benchmarking_examples.md +++ b/docs/tutorials/benchmarking_examples.md @@ -22,7 +22,7 @@ For a persistent allocation across reboots, use the grub recipe in [Step 4 of Sy ## Running the DAQIRI container -If you built DAQIRI using the container approach, use the following command to launch the container with DPDK and GPU support. 
The host system must be fully configured (see [System Configuration](system_configuration.md)) before the container can access the NIC and GPU hardware.
+If you built DAQIRI using the container approach, use the following command to launch the container with Raw Ethernet (DPDK) and GPU support. The host system must be fully configured (see [System Configuration](system_configuration.md)) before the container can access the NIC and GPU hardware.

 ```bash
 docker run --rm -it --privileged \
@@ -376,7 +376,7 @@ The `*_packets_phy` and `*_bytes_phy` counters are physical-link counters. They
     [CRITICAL] Cannot start device err=-95, port=0
     ```

-    The DPDK backend uses Hardware Steering (HWS) via the `dv_flow_en=2` mlx5 device argument. HWS requires compatible versions of both the NIC firmware and the host's MLNX_OFED kernel modules. Per the [DPDK mlx5 documentation](https://doc.dpdk.org/guides/nics/mlx5.html), the minimum requirements are ConnectX-6 Dx or later with firmware `xx.35.1012`+, but the host's OFED/kernel driver must also support the HWS features expected by the DPDK version in use.
+    Raw Ethernet (DPDK-backed) uses Hardware Steering (HWS) via the `dv_flow_en=2` mlx5 device argument. HWS requires compatible versions of both the NIC firmware and the host's MLNX_OFED kernel modules. Per the [DPDK mlx5 documentation](https://doc.dpdk.org/guides/nics/mlx5.html), the minimum requirements are ConnectX-6 Dx or later with firmware `xx.35.1012`+, but the host's OFED/kernel driver must also support the HWS features expected by the DPDK version in use.

Check your OFED and firmware versions: diff --git a/docs/tutorials/configuration-walkthrough.md b/docs/tutorials/configuration-walkthrough.md index af4abcc..68e0421 100644 --- a/docs/tutorials/configuration-walkthrough.md +++ b/docs/tutorials/configuration-walkthrough.md @@ -2,22 +2,24 @@ ## Choosing an example config -### Choosing the appropriate DAQIRI backend for your setup +### Choosing the appropriate DAQIRI stream type for your setup -DAQIRI ships three backends, selected at build time via `DAQIRI_MGR` (the default build enables all three). Which backend's example YAML you start from depends on your hardware and topology: +DAQIRI exposes a single API on top of multiple packet I/O stacks, selected at runtime via two YAML keys — `stream_type` and (when `stream_type: "socket"`) `protocol`. Pick the row that matches your hardware and the role of the other endpoint: -- **DPDK raw** — kernel-bypass raw Ethernet with GPUDirect zero-copy. Highest performance. Requires a [Mellanox/ConnectX-class NVIDIA NIC](https://www.nvidia.com/en-us/networking/ethernet-adapters/); `tx_port` and `rx_port` can share one physical NIC for a single-host closed-loop bench, or be split across two hosts. -- **RDMA / RoCE** — low-latency verbs over an RDMA-capable fabric. The natural choice when you have NVIDIA NICs at both endpoints of a host-to-host link. -- **Kernel TCP/UDP sockets** — no NIC, no privileges, no special CMake flags. Useful as a comparison baseline against DPDK and RDMA, or as a path to first results when no NVIDIA NIC is available. +- **Raw Ethernet** — `stream_type: "raw"`. Kernel-bypass with GPUDirect zero-copy. Highest performance. Requires an [NVIDIA ConnectX-class NIC](https://www.nvidia.com/en-us/networking/ethernet-adapters/); `tx_port` and `rx_port` can share one physical NIC for a single-host closed-loop bench, or be split across two hosts. +- **Socket — UDP / TCP** — `stream_type: "socket"`, `protocol: "udp"` or `"tcp"`. Plain Linux kernel sockets. 
No NIC, no privileges, no special CMake flags. Useful as a comparison baseline and as a path to first results on a system without an NVIDIA NIC. +- **Socket — RoCE (RDMA)** — `stream_type: "socket"`, `protocol: "roce"`. RDMA verbs over Ethernet, with a server/client connection model and a NIC-level reliable transport. Primarily intended for setups where **one** endpoint is a third-party RoCE implementation (FPGA, instrument, customer black box). When both peers run DAQIRI, prefer an upper-layer library such as MPI / NCCL / UCX instead. -If you don't have any NIC at all, the `*_sw_loopback*` variants of the DPDK configs need no hardware — useful for first-time build verification. +If you don't have any NIC at all, the `*_sw_loopback*` variants of the Raw Ethernet configs need no hardware — useful for first-time build verification. -With a backend in mind, read down the questions below and stop at the first one that matches what you're trying to do. Each section names the YAML, the binary that consumes it, and any platform-specific notes. +(`DAQIRI_MGR` at the CMake layer is the inverse selector: it tells the build which manager implementations to compile in — `dpdk` enables `stream_type: "raw"`, `socket` enables `stream_type: "socket"` with `protocol: "udp"`/`"tcp"`, and `rdma` enables `protocol: "roce"`. The default build enables all three.) + +With a stream type in mind, read down the questions below and stop at the first one that matches what you're trying to do. Each section names the YAML, the binary that consumes it, and any platform-specific notes. ??? question "1. I want to measure baseline throughput" - Pick the backend that matches your stack (see the [backend overview](#choosing-the-appropriate-daqiri-backend-for-your-setup) above), then the hardware or protocol variant. + Pick the stream type that matches your stack (see the [overview](#choosing-the-appropriate-daqiri-stream-type-for-your-setup) above), then the hardware or protocol variant. 
- **DPDK raw** — runs on `daqiri_bench_raw_gpudirect`. + **Raw Ethernet** (`stream_type: "raw"`) — runs on `daqiri_bench_raw_gpudirect`. - **Generic discrete GPU** (template — replace ``) — [`daqiri_bench_raw_tx_rx.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx.yaml). This is the file annotated line-by-line in the [walkthrough below](#annotated-walkthrough). - **Four queue closed-loop TX+RX** (template — replace ``) — [`daqiri_bench_raw_tx_rx_4q.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_4q.yaml). Uses one application worker per TX/RX queue, with each `bench_tx` entry sending a different UDP flow. @@ -28,12 +30,12 @@ With a backend in mind, read down the questions below and stop at the first one counters, use the Grafana compose stack described in [Watch live OpenTelemetry metrics in Grafana](benchmarking_examples.md#watch-live-opentelemetry-metrics-in-grafana). - **RDMA / RoCE** — runs on `daqiri_bench_rdma` (use `--mode {tx,rx,both}`). Configs use `kind: host_pinned` regardless of platform. + **Socket — RoCE (RDMA)** (`stream_type: "socket"`, `protocol: "roce"`) — runs on `daqiri_bench_rdma` (use `--mode {tx,rx,both}`). Configs use `kind: host_pinned` regardless of platform. - **Generic** (template — replace IPs) — [`daqiri_bench_rdma_tx_rx.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_rdma_tx_rx.yaml). - **DGX Spark** (prefilled) — [`daqiri_bench_rdma_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_rdma_tx_rx_spark.yaml). See the [Spark profile callout](benchmarking_examples.md#update-the-loopback-configuration) for run details. - **Kernel TCP/UDP sockets** — runs on `daqiri_bench_socket`. Both bind to `127.0.0.1`. + **Socket — UDP / TCP** (`stream_type: "socket"`, `protocol: "udp"` or `"tcp"`) — runs on `daqiri_bench_socket`. Both bind to `127.0.0.1`. 
- **UDP** — [`daqiri_bench_socket_udp_tx_rx.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_socket_udp_tx_rx.yaml). - **TCP** — [`daqiri_bench_socket_tcp_tx_rx.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_socket_tcp_tx_rx.yaml). @@ -71,7 +73,7 @@ With a backend in mind, read down the questions below and stop at the first one | [`daqiri_bench_raw_rx_reorder_seq_batch.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_rx_reorder_seq_batch.yaml) | `seq_batch_number` | GPU | RX-only | | [`daqiri_bench_raw_sw_loopback_reorder_seq_1024.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_sw_loopback_reorder_seq_1024.yaml) | `seq_packets_per_batch` (1024) | CPU | TX+RX, no NIC | - *Requires: DPDK build + Mellanox-class NIC (or the SW-loopback variant for first-time validation).* + *Requires: Raw Ethernet build (`DAQIRI_MGR` includes `dpdk`) + NVIDIA ConnectX-class NIC (or the SW-loopback variant for first-time validation).* A [diff-style walkthrough](#packet-reordering-on-the-gpu) of `daqiri_bench_raw_tx_rx_reorder_seq_1024.yaml` appears below. @@ -80,7 +82,7 @@ With a backend in mind, read down the questions below and stop at the first one Header-data split: segment 0 (CPU) holds the header, segment 1 (GPU) holds the payload via GPUDirect zero-copy. Pick this when the CPU needs to read small per-packet fields without ever touching the payload. - *Requires: DPDK build + Mellanox-class NIC.* + *Requires: Raw Ethernet build (`DAQIRI_MGR` includes `dpdk`) + NVIDIA ConnectX-class NIC.* A [diff-style walkthrough](#header-data-split-hds) of this config appears below. @@ -90,7 +92,7 @@ With a backend in mind, read down the questions below and stop at the first one The four-queue TX+RX config is self-contained and maps each `bench_tx`/`bench_rx` list entry to the matching DAQIRI queue. The RX-only config is for an external traffic source. 
Both demonstrate flow-rule-based routing across multiple RX queues, each pinned to its own CPU core. - *Requires: DPDK build + Mellanox-class NIC. The RX-only config also requires a separate TX traffic source.* + *Requires: Raw Ethernet build (`DAQIRI_MGR` includes `dpdk`) + NVIDIA ConnectX-class NIC. The RX-only config also requires a separate TX traffic source.* ??? question "5. I need to record packet data to disk" Sub-question: **which output format?** @@ -100,7 +102,7 @@ With a backend in mind, read down the questions below and stop at the first one - **Hardware loopback** — [`daqiri_example_pcap_writer_tx_rx.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_example_pcap_writer_tx_rx.yaml). - **No physical NIC available** — [`daqiri_example_pcap_writer_sw_loopback.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_example_pcap_writer_sw_loopback.yaml). - *Requires: DPDK build. No special CMake flag.* + *Requires: Raw Ethernet build (`DAQIRI_MGR` includes `dpdk`). No special CMake flag.* **5.2 Zero-copy GPU → NVMe writes** (advanced) — runs on `daqiri_example_gds_write`. Pick this *only* if the GPU-to-disk zero-copy path is the specific subject of investigation; otherwise pick PCAP (5.1). @@ -196,13 +198,13 @@ bench_tx: # (25)! ``` 1. The `daqiri` section configures the DAQIRI library, which is responsible for setting up the NIC. It is passed to `daqiri_init(...)` during application startup. Within this section, `name:` fields on interfaces, queues, flows, and memory regions are used only for logging — pick any descriptive string. -2. **`stream_type`** · `string` · *required* — High-level transport family selected for this config. **Supported:** `"raw"` (DPDK raw Ethernet, used here), `"socket"` (kernel sockets and RDMA/RoCE; the specific protocol is then set via a separate `protocol:` field). The actual backend implementation is chosen at build time via `DAQIRI_MGR` — `stream_type` only picks among the backends you built. +2. 
**`stream_type`** · `string` · *required* — High-level transport family selected for this config. **Supported:** `"raw"` (Raw Ethernet via kernel bypass, used here), `"socket"` (kernel sockets and RoCE; the specific protocol is then set via a separate `protocol:` field). The implementation backing each stream type is chosen at build time via `DAQIRI_MGR` — `stream_type` only picks among the implementations you built. 3. :material-wrench: **`master_core`** · `integer (CPU core ID)` · *required* — Core used for DAQIRI setup. Does not need to be isolated; recommended to differ from the `cpu_core` fields below that poll the NIC. 4. **`loopback`** · `string` · *default: `""`* — Loopback mode. **Supported:** `""` (no loopback; use the physical NIC), `"sw"` (software loopback — no NIC required, used by the `*_sw_loopback*` configs for first-time build verification). 5. The `memory_regions` section lists where the NIC will write/read data from/to when bypassing the OS kernel. Tip: when using GPU buffer regions, keeping the sum of their buffer sizes below 80% of your BAR1 size is generally a good rule of thumb. 6. :material-package-variant: **`kind`** · `string` · *required* — Type of memory backing the region. **Supported:** `device` (GPU VRAM via GPUDirect — preferred on discrete GPUs), `host_pinned` (CPU pinned memory — required on integrated GPUs like NVIDIA GB10/DGX Spark where peer-DMA isn't available), `huge` (hugepages, CPU), `host` (CPU unpinned). See the [memory regions reference](../api-reference/configuration.md#memory-regions). Choose based on whether packets are processed on the GPU or CPU and on the GPU class. 7. :material-wrench: **`affinity`** · `integer (GPU ID / NUMA node)` · *required* — GPU device ID when `kind: device` or `kind: host_pinned`; NUMA node ID for CPU memory regions (`huge`, `host`). -8. :material-package-variant: **`num_bufs`** · `integer` · *required* — Number of buffers in the region. 
Higher gives more time to process packets but uses more BAR1 space; too low risks NIC drops (RX) or buffering latency (TX). A good starting point is 3×–5× the queue `batch_size`. For the DPDK backend, `num_bufs` below 1.5× the NIC ring size deadlocks the worker; `daqiri_init` auto-bumps such regions to 3× the ring (24576 with the default 8192) and logs a `WARN`. +8. :material-package-variant: **`num_bufs`** · `integer` · *required* — Number of buffers in the region. Higher gives more time to process packets but uses more BAR1 space; too low risks NIC drops (RX) or buffering latency (TX). A good starting point is 3×–5× the queue `batch_size`. For Raw Ethernet (`stream_type: "raw"`), `num_bufs` below 1.5× the NIC ring size deadlocks the worker; `daqiri_init` auto-bumps such regions to 3× the ring (24576 with the default 8192) and logs a `WARN`. 9. :material-package-variant: **`buf_size`** · `integer (bytes)` · *required* — Size of each buffer in the region. Should equal your maximum packet size, or smaller when chaining regions per packet (e.g. header-data split — see the [HDS walkthrough](#header-data-split-hds) below). 10. The `interfaces` section lists the NIC interfaces that will be configured for the application. 11. :material-wrench: **`address`** · `string (PCIe BDF)` · *required* — PCIe bus address of this interface. **Must be changed for your system.** Both `tx_port` and `rx_port` may point to the same physical NIC for single-port closed-loop benches. diff --git a/docs/tutorials/system_configuration.md b/docs/tutorials/system_configuration.md index d4da790..0b25bec 100644 --- a/docs/tutorials/system_configuration.md +++ b/docs/tutorials/system_configuration.md @@ -1411,11 +1411,15 @@ DAQIRI requires an [**NVIDIA SmartNIC**](https://www.nvidia.com/en-us/networking ### Enable GPUDirect - !!! 
warning "Skip `nvidia_peermem` on GB10"
+    **No GPUDirect kernel-module setup is required on GB10.** Set `kind: "host_pinned"` in the YAML and you're done — there is no system-side step to perform. Buffers are allocated by DAQIRI via `cudaHostAlloc` (so they are CUDA-addressable) and registered with DPDK via `rte_extmem_register`. End-to-end TX↔RX over the QSFP loop with `kind: "host_pinned"`, `num_bufs: 51200`, `batch_size: 10240` reaches **~94 Gbps** unicast (verified against `main` 9ebd729, which contains [PR #41](https://github.com/nvidia/daqiri/pull/41)).

-        `sudo modprobe nvidia_peermem` returns `Invalid argument` (EINVAL, exit=1) on GB10. The module file ships in `/lib/modules/$(uname -r)/kernel/nvidia-580-open/nvidia-peermem.ko`, but loading fails by design: peermem maps the NIC into a separate GPU BAR1, and GB10's NVLink-C2C unified memory has no separate BAR1.
+    `kind: "huge"` works as a fallback at the same rate. `kind: "device"` does **not** work on GB10.
+
+    See the ready-to-run [`examples/daqiri_bench_raw_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_spark.yaml) for the complete config.
+
+    ??? info "Why peermem and DMA-BUF don't apply on GB10"

-    !!! note "DMA-BUF is also unreachable as of CUDA 13.1"
+        `sudo modprobe nvidia_peermem` returns `Invalid argument` (EINVAL, exit=1) on GB10. The module file ships in `/lib/modules/$(uname -r)/kernel/nvidia-580-open/nvidia-peermem.ko`, but loading fails by design: peermem maps the NIC into a separate GPU BAR1, and GB10's NVLink-C2C unified memory has no separate BAR1.

         The Open kernel module on Grace platforms expects the standard Linux **DMA-BUF** path instead of peermem, but as of CUDA 13.1 / driver 580.142 the device-attribute query reports `flag=0`:

@@ -1425,11 +1429,7 @@ DAQIRI requires an [**NVIDIA SmartNIC**](https://www.nvidia.com/en-us/networking
         cuDeviceGetAttribute(CU_DEVICE_ATTRIBUTE_INTEGRATED, 0) → SUCCESS, flag=1
         ```

-        DAQIRI's CUDA-DMA-BUF code path is therefore unreachable on Spark; `dpdk_patches/dmabuf.patch` still ships and is mandatory for the build, but the daqiri-side dma-buf branch never fires.
-
-        **The right configuration on Spark is `kind: "host_pinned"` in the YAML** — there is no system-side step. Buffers are allocated by daqiri via `cudaHostAlloc` (so they are CUDA-addressable) and registered with DPDK via `rte_extmem_register`. End-to-end TX↔RX over the QSFP loop with `kind: "host_pinned"`, `num_bufs: 51200`, `batch_size: 10240` reaches **~94 Gbps** unicast (verified against `main` 9ebd729, which contains [PR #41](https://github.com/nvidia/daqiri/pull/41)). `kind: "huge"` works as a fallback at the same rate; `kind: "device"` does **not** work and is not expected to on GB10.
-
-        See the ready-to-run [`examples/daqiri_bench_raw_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_spark.yaml) for the complete config.
+        DAQIRI's CUDA-DMA-BUF code path is therefore unreachable on Spark; `dpdk_patches/dmabuf.patch` still ships and is mandatory for the build, but the daqiri-side dma-buf branch never fires. The `host_pinned` path above sidesteps both interfaces entirely.

 ---

diff --git a/mkdocs.yml b/mkdocs.yml
index ea93621..eee551b 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -48,6 +48,7 @@ site_dir: site
 nav:
   - Getting Started: getting-started.md
   - Concepts: concepts.md
+  - Benchmarks: tutorials/benchmarking_examples.md
   - API Reference:
       - API Guide: api-reference/index.md
       - Configuration YAML Reference: api-reference/configuration.md
@@ -55,7 +56,6 @@ nav:
       - Python API Usage: api-reference/python.md
   - Tutorials:
       - System Configuration: tutorials/system_configuration.md
-      - Benchmarking Examples: tutorials/benchmarking_examples.md
       - Configuration YAML Walkthrough: tutorials/configuration-walkthrough.md

 markdown_extensions:

From f0a4aa2931cf1ad139e4047b60ca5f779cb1301a Mon Sep 17 00:00:00 2001
From: Chloe Crozier
Date: Fri, 29 May 2026 16:46:27 -0700
Subject: [PATCH 2/4] Addressing more feedback and fixing details noticed during local deployment

Signed-off-by: Chloe Crozier
---
 AGENTS.md                                   |   2 +-
 docs/api-reference/configuration.md         |   5 +
 docs/api-reference/cpp.md                   |   5 +
 docs/api-reference/index.md                 |   5 +
 docs/concepts.md                            |  15 ++-
 docs/index.html                             |  40 ++++--
 docs/javascripts/tab-dropdowns.js           |  80 ++++++++++++
 docs/stylesheets/extra.css                  | 130 ++++++++++++++++++--
 docs/tutorials/benchmarking_examples.md     |   5 +
 docs/tutorials/configuration-walkthrough.md |   5 +
 docs/tutorials/system_configuration.md      |   5 +
 mkdocs.yml                                  |   1 +
 12 files changed, 270 insertions(+), 28 deletions(-)
 create mode 100644 docs/javascripts/tab-dropdowns.js

diff --git a/AGENTS.md b/AGENTS.md
index b2ab39d..effe7ff 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -103,7 +103,7 @@ The web docs live in `docs/` and are built with [MkDocs Material](https://squidf
 **User-facing vocabulary:** docs and the YAML schema use `stream_type` (`raw`, `socket`, future `pcie`) and `protocol` (`udp`, `tcp`, `roce`). 
The word "backend" is internal-only — accurate for `src/managers//`, the `Manager` ABC, CMake `DAQIRI_MGR`, and API-reference function blurbs, but should not appear in tutorials, the landing page, or concept pages. The mapping: `stream_type: "raw"` is implemented by the `dpdk` manager; `stream_type: "socket"` with `protocol: "udp"` / `"tcp"` is implemented by the `socket` manager; `stream_type: "socket"` with `protocol: "roce"` is implemented by the `rdma` manager. **Keeping docs in sync with code:** before committing changes, scan for the recurring drift hotspots: -- **Stream-type list** (`src/managers/*/`) — README Backends table, `docs/getting-started.md`, `docs/concepts.md` (Stream Types section + Maturity admonition), `docs/api-reference/configuration.md` +- **Stream-type list** (`src/managers/*/`) — README Backends table, `docs/getting-started.md`, `docs/concepts.md` (Stream Types section + Support and testing admonition), `docs/api-reference/configuration.md` - **CMake options / `DAQIRI_MGR` default** (`src/CMakeLists.txt:137`) — README Quick Start, `docs/getting-started.md`, this file's Build & run section - **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, and the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage) - **Public API include** (`#include `; source files under `include/daqiri/`) — `docs/api-reference/index.md`, `docs/api-reference/cpp.md`, `docs/api-reference/python.md`; if the change adds or renames a user-facing concept, also `docs/concepts.md` diff --git a/docs/api-reference/configuration.md b/docs/api-reference/configuration.md index 303a2a8..47d6519 100644 --- a/docs/api-reference/configuration.md +++ b/docs/api-reference/configuration.md @@ -1,3 +1,8 @@ +--- +hide: + - navigation +--- + # Configuration YAML Reference DAQIRI is configured through a 
YAML file or a `NetworkConfig` struct built in code. diff --git a/docs/api-reference/cpp.md b/docs/api-reference/cpp.md index db26c4b..0af6728 100644 --- a/docs/api-reference/cpp.md +++ b/docs/api-reference/cpp.md @@ -1,3 +1,8 @@ +--- +hide: + - navigation +--- + # C++ API Usage This guide covers C++ initialization, RX/TX workflows, buffer lifecycle calls, file diff --git a/docs/api-reference/index.md b/docs/api-reference/index.md index a123a0b..1afab59 100644 --- a/docs/api-reference/index.md +++ b/docs/api-reference/index.md @@ -1,3 +1,8 @@ +--- +hide: + - navigation +--- + # API Guide DAQIRI is a library that moves bursts of packets between NICs and CPU or diff --git a/docs/concepts.md b/docs/concepts.md index 7896a6a..ba1b46d 100644 --- a/docs/concepts.md +++ b/docs/concepts.md @@ -22,7 +22,9 @@ choice is configured per-application in YAML by two keys: - `protocol` — required when `stream_type: "socket"`; selects the socket-level protocol. -### Raw Ethernet — `stream_type: "raw"` +### Raw Ethernet + +*YAML:* `stream_type: "raw"`. Kernel-bypass raw Ethernet. The application talks directly to NIC ring buffers in user space, skipping the Linux network stack entirely. This @@ -33,9 +35,10 @@ detail, not a user-facing concept. Requires an NVIDIA SmartNIC (ConnectX-6 Dx or later). -### Socket — `stream_type: "socket"` +### Socket -Socket-style interfaces. The specific transport is chosen by `protocol`: +*YAML:* `stream_type: "socket"`. The specific transport is chosen by +`protocol`: - **`protocol: "udp"`** / **`protocol: "tcp"`** — Linux kernel UDP and TCP sockets. No NIC privileges required, no special hardware. Useful @@ -50,7 +53,9 @@ Socket-style interfaces. The specific transport is chosen by `protocol`: speaks RoCE. When both peers run DAQIRI, prefer an upper-layer library such as MPI, NCCL, or UCX rather than wiring RoCE directly. -### PCIe — `stream_type: "pcie"` *(future)* +### PCIe (future) + +*YAML:* `stream_type: "pcie"`. 
Placeholder for an upcoming direct-PCIe stream type. Not implemented yet. @@ -68,7 +73,7 @@ sockets), see [Choosing an example config](tutorials/configuration-walkthrough.md#choosing-an-example-config) in the configuration walkthrough. -??? example "Maturity" +??? example "Support and testing" The DAQIRI library integration testing infrastructure is under active development. As such: diff --git a/docs/index.html b/docs/index.html index 0b9b579..ffb8e3d 100644 --- a/docs/index.html +++ b/docs/index.html @@ -46,10 +46,19 @@ .nav-logo-text { font-weight:800; font-size:1.1rem; color:var(--text-pri); letter-spacing:.05em; } .nav-logo-badge { font-size:.65rem; font-weight:700; padding:2px 6px; background:rgba(118,185,0,.15); color:var(--nv-green); border:1px solid rgba(118,185,0,.3); border-radius:99px; letter-spacing:.08em; } .nav-links { display:flex; align-items:center; gap:.1rem; flex:1; } - .nav-links a { color:var(--text-mut); font-size:.875rem; font-weight:500; padding:.4rem .65rem; border-radius:var(--radius); transition:color var(--ease),background var(--ease); white-space:nowrap; } - .nav-links a:hover { color:var(--text-pri); background:rgba(255,255,255,.05); } + .nav-links > a, .nav-item > a { color:var(--text-mut); font-size:.875rem; font-weight:500; padding:.4rem .65rem; border-radius:var(--radius); transition:color var(--ease),background var(--ease); white-space:nowrap; display:inline-block; } + .nav-links > a:hover, .nav-item > a:hover { color:var(--text-pri); background:rgba(255,255,255,.05); } .nav-links a.active { color:var(--nv-green); } - .nav-links a.nav-ext::after { content:'↗'; font-size:.72em; opacity:.55; margin-left:2px; } + /* Dropdown for nav items that map to a multi-page section (Tutorials, */ + /* API Reference). Hover/focus reveals a popover with sub-page links. 
*/ + .nav-item { position:relative; } + .nav-item.nav-has-dropdown > a::after { content:'▾'; font-size:.7em; opacity:.55; margin-left:.3em; } + .nav-dropdown { display:none; position:absolute; top:100%; left:0; margin:0; padding:.4rem 0; list-style:none; min-width:14rem; background:var(--bg-card); border:1px solid var(--border); border-radius:var(--radius); box-shadow:0 6px 28px rgba(0,0,0,.55); z-index:1100; } + .nav-dropdown::before { content:''; position:absolute; top:-.5rem; left:0; right:0; height:.5rem; } + .nav-item:hover > .nav-dropdown, .nav-item:focus-within > .nav-dropdown { display:block; } + .nav-dropdown li { list-style:none; margin:0; } + .nav-dropdown a { display:block; padding:.5rem 1rem; font-size:.825rem; font-weight:500; color:var(--text-mut); text-decoration:none; white-space:nowrap; transition:color var(--ease),background var(--ease); } + .nav-dropdown a:hover { color:var(--text-pri); background:rgba(118,185,0,.1); } .nav-actions { display:flex; align-items:center; gap:.5rem; margin-left:auto; flex-shrink:0; } .btn { display:inline-flex; align-items:center; gap:.5rem; font-size:.875rem; font-weight:600; padding:.5rem 1.25rem; border-radius:var(--radius); border:1.5px solid transparent; cursor:pointer; transition:all var(--ease); text-decoration:none; white-space:nowrap; } .btn-primary { background:var(--nv-green); color:#000; border-color:var(--nv-green); } @@ -124,9 +133,9 @@ .ex-title { color:var(--text-pri); font-size:.92rem; font-weight:600; } .ex-lang { font-size:.72rem; color:var(--text-dim); margin-left:auto; font-family:var(--font-mono); } .ex-body pre { border:none; border-radius:0; margin:0; max-height:210px; overflow:hidden; font-size:.77rem; background:#090909; } - .ex-footer { padding:.9rem 1.5rem; border-top:1px solid var(--border); display:flex; align-items:center; justify-content:space-between; } - .ex-desc { font-size:.8rem; color:var(--text-mut); } - .ex-link { font-size:.8rem; color:var(--nv-green); font-weight:600; } + 
.ex-footer { padding:.9rem 1.5rem; border-top:1px solid var(--border); display:flex; align-items:center; justify-content:space-between; gap:1rem; } + .ex-desc { font-size:.8rem; color:var(--text-mut); min-width:0; } + .ex-link { font-size:.8rem; color:var(--nv-green); font-weight:600; white-space:nowrap; flex-shrink:0; } /* TUTORIALS */ #tutorials { border-top:1px solid var(--border); } @@ -161,7 +170,7 @@ .pub-title { color:var(--text-pri); font-size:1rem; font-weight:600; margin-bottom:.75rem; line-height:1.4; } .pub-authors { font-size:.82rem; color:var(--text-mut); margin-bottom:1.25rem; } .pub-links { display:flex; gap:.75rem; } - .pub-link { font-size:.8rem; font-weight:600; padding:.3rem .75rem; border-radius:6px; border:1px solid var(--border); color:var(--text-mut); transition:all var(--ease); } + .pub-link { font-size:.8rem; font-weight:600; padding:.3rem .75rem; border-radius:6px; border:1px solid var(--border); color:var(--text-mut); transition:all var(--ease); white-space:nowrap; } .pub-link:hover { color:var(--nv-green); border-color:rgba(118,185,0,.4); background:rgba(118,185,0,.05); } /* CTA */ @@ -207,8 +216,21 @@ Concepts Benchmarks Examples - Tutorials - API Reference + + News