LuxonisEval is a modular evaluation framework for benchmarking neural network models across multiple inference backends. It supports inference on Luxonis devices (RVC2 and RVC4) through DepthAI, as well as host-side inference through ONNX Runtime, while reporting both quality metrics and throughput or latency performance.
The framework follows a registry-based architecture: each pluggable component (engines, dataloaders, parsers, metrics, and visualizers) registers itself automatically. This lets you swap, extend, or add parts of the evaluation pipeline without modifying the core evaluation loop. In practice, adding a new component usually means subclassing the appropriate base class and referencing it by name in the configuration.
- **Multiple Inference Backends**
  - DepthAI Engine - Run models exported as NNArchive files on Luxonis devices via DepthAI
  - ONNX Engine - Run models on CPU or GPU using ONNX Runtime
- **Dataset Loading**
  - `LuxonisLoader` - Load datasets stored in Luxonis Data Format (LDF)
  - `BaseEvalLoader` - Base class for custom dataloaders
- **Supported Tasks**
  - Classification - Image classification
  - Detection - Bounding box detection
  - SemanticSegmentation - Per-pixel class labeling
  - InstanceSegmentation - Per-instance masks with detection
  - KeypointDetection - Body or object keypoint localization
- **Built-In Metrics**
  - `TopKAccuracy` - Top-1 and Top-5 accuracy for classification
  - `BboxMeanAveragePrecision` - COCO-style mAP for bounding box detection
  - `MaskMeanAveragePrecision` - COCO-style mAP for instance segmentation
  - `KeypointMeanAveragePrecision` - OKS-based mAP for keypoint detection
  - `MIoU` - Mean Intersection over Union for semantic segmentation
  - `DiceCoefficient` - Dice score for semantic segmentation
  - `ThroughputMetric` - End-to-end throughput and latency reporting
- **Extensible Architecture** - Registry-based design powered by `AutoRegisterMeta`, making it straightforward to add custom engines, parsers, metrics, loaders, and visualizers
Get started with LuxonisEval in a few steps:

1. Install the project from source:

   ```bash
   pip install .
   ```

2. Prepare the example model and dataset (requires the `fiftyone` package):

   ```bash
   pip install fiftyone
   bash examples/quickstart_inst_seg/setup_example.sh
   ```

3. Run the evaluation:

   ```bash
   luxonis_eval eval --config configs/yolov8n_inst_seg_config.yaml
   ```

This quickstart runs instance segmentation evaluation with ONNX Runtime on CPU and does not require Luxonis hardware. For a fuller walkthrough, see `examples/quickstart_inst_seg/README.md`.
- 🌟 Overview
- 🚀 Quick Start
- 🛠️ Installation
- 📝 Usage
- 🏗️ Architecture
- ⚙️ Configuration
- 🧱 Extending the Framework
- 📄 License
LuxonisEval requires Python 3.10 or higher. We recommend using a virtual environment to keep dependencies isolated.
Install from source:
```bash
pip install .
```

This installs the `luxonis_eval` CLI in your environment.
Developer install:
```bash
pip install -e ".[dev]"
```

You can use LuxonisEval either from the command line or through the Python API. The CLI is the primary entry point for running evaluations from configuration files.
The CLI currently exposes the `eval` command:

```bash
luxonis_eval eval --help
```

Example invocations:

```bash
# Run evaluation with a config file
luxonis_eval eval --config path/to/config.yaml

# Run with CLI overrides
luxonis_eval eval \
    --config path/to/config.yaml \
    --dataset-name coco \
    --model-path path/to/model.tar.xz \
    --backend depthai

# Use the ONNX backend
luxonis_eval eval \
    --config path/to/config.yaml \
    --dataset-name coco \
    --model-path path/to/model.onnx \
    --backend onnx

# Specify device IP for RVC4
luxonis_eval eval \
    --config path/to/config.yaml \
    --device-ip 192.168.1.100
```

For programmatic usage, load an `EvalConfig` instance and pass it to `eval_run`:
```python
from luxonis_eval.__main__ import eval_run
from luxonis_eval.utils.config import EvalConfig

eval_cfg = EvalConfig.get_config(cfg="path/to/config.yaml")
eval_run(eval_cfg)
```

The repository is organized around a small set of core component types:
```
luxonis_eval/
├── engines/      # Inference backends
├── loaders/      # Dataset loaders
├── metrics/      # Evaluation metrics
├── parsers/      # Model output parsers
├── utils/        # Configuration and helper functions
├── visualizers/  # Result visualization
└── metadata/     # Class mapping files
```

| Base Class | Location | Purpose |
|---|---|---|
| `BaseEngine` | `engines/` | Abstract inference engine |
| `BaseParser` | `parsers/` | Abstract output parser |
| `BaseMetric` | `metrics/` | Abstract evaluation metric |
| `BaseEvalLoader` | `loaders/` | Abstract dataset loader |
| `BaseVisualizer` | `visualizers/` | Abstract result visualizer |
All base classes use the `AutoRegisterMeta` metaclass. Any subclass is registered automatically and becomes available by name in configuration files, with no manual wiring required.

The evaluation loop in `eval_run` is structured around abstract component interfaces rather than concrete implementations. That design keeps the pipeline modular and makes backend- or task-specific components easy to replace.
```
┌────────────┐     ┌─────────────┐     ┌─────────────┐     ┌───────────┐
│ DataLoader │────▶│   Engine    │────▶│   Parser    │────▶│  Metrics  │
│ (provides  │     │ (runs model │     │ (converts   │     │ (scores   │
│  samples)  │     │  inference) │     │ raw output) │     │ results)  │
└────────────┘     └─────────────┘     └─────────────┘     └───────────┘
                                              │
                   ┌────────────┐             │
                   │ Visualizer │◀────────────┘
                   │ (optional) │
                   └────────────┘
```

The pipeline works as follows:
- DataLoader provides images together with ground-truth annotations.
- Engine runs inference and returns raw backend outputs.
- Parser converts raw outputs into a structured prediction format.
- Metrics accumulate per-sample results and compute final scores.
- Visualizer optionally renders predictions for inspection.
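These steps can be sketched as a simplified loop with stand-in components. The class and method names below are illustrative only; they mirror the shape of the pipeline, not the framework's actual classes or `eval_run` internals:

```python
class FakeLoader:
    """Stand-in dataloader: yields (image, annotations) pairs."""
    def __init__(self, samples):
        self.samples = samples
    def __iter__(self):
        return iter(self.samples)

class FakeEngine:
    """Stand-in engine: 'inference' just measures the image."""
    def infer_once(self, img):
        return {"raw": [len(img)]}

class FakeParser:
    """Stand-in parser: converts raw output into a prediction dict."""
    def parse(self, raw_output):
        return {"predictions": raw_output["raw"]}

class FakeMetric:
    """Stand-in metric: averages the first 'prediction' value."""
    def __init__(self):
        self.total, self.count = 0, 0
    def update(self, preds, target):
        self.total += preds["predictions"][0]
        self.count += 1
    def compute(self):
        return self.total / self.count

def run_pipeline(loader, engine, parser, metric):
    # DataLoader -> Engine -> Parser -> Metrics, as in the diagram above.
    for image, annotations in loader:
        raw = engine.infer_once(image)
        preds = parser.parse(raw)
        metric.update(preds, annotations)
    return metric.compute()

loader = FakeLoader([([1, 2, 3], {}), ([1, 2], {})])
result = run_pipeline(loader, FakeEngine(), FakeParser(), FakeMetric())
print(result)  # 2.5
```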
Because each component is resolved from a registry at runtime, you can mix and match implementations freely. For example, you can:
- swap `depthai` for `onnx` in `engine` without changing the rest of the config
- add another metric under `metrics.metrics`
- introduce a custom parser and reference it by name
- replace `LuxonisLoader` with a dataset-specific custom loader
The main constraint is compatibility: the parser must produce predictions in the format the configured metrics expect, and the dataloader must provide the annotation keys those metrics require. `BaseMetric.validate_target_keys()` catches mismatches early and raises a clear error message.
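The idea behind that early check can be illustrated with a small stand-alone function. This is a sketch of the concept only, not the library's actual `validate_target_keys` implementation:

```python
def validate_target_keys(required: set, available: set) -> None:
    """Raise early if the dataloader does not provide the keys a metric needs."""
    missing = required - available
    if missing:
        raise ValueError(
            f"Metric requires annotation keys {sorted(missing)}, "
            f"but the dataloader only provides {sorted(available)}."
        )

# A detection metric needs boxes; the loader provides boxes and classes: passes.
validate_target_keys({"/boundingbox"}, {"/boundingbox", "/classification"})

# A keypoint metric against a boxes-only loader: fails with a clear message.
try:
    validate_target_keys({"/keypoints"}, {"/boundingbox"})
except ValueError as err:
    print(err)
```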
`ThroughputMetric` measures timing across the full evaluation pipeline. The reported rows mean:

> [!WARNING]
> Throughput values are end-to-end pipeline measurements, not isolated model-only benchmarks. Lower numbers than `modelconverter` benchmark results are expected.
- Throughput - Samples processed per second across the full evaluation pipeline
- End-to-end Latency - Average wall-clock time per sample for the whole run
- Inference - Time spent inside the inference engine
- Parsing - Time spent converting raw model outputs into predictions
- Metric Update - Time spent updating metrics for each sample
- Metric Compute - Time spent in the final metric aggregation after the sample loop
- Pipeline Overhead - Remaining time not covered by the rows above; this typically includes dataloader iteration, image decode, preprocessing such as resize or normalization, annotation reconstruction, visualization, progress bar updates, and general loop bookkeeping
Rule of thumb: End-to-end Latency ≈ Inference + Parsing + Metric Update + Metric Compute + Pipeline Overhead
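The same breakdown can be reproduced with plain `time.perf_counter()` accumulators. The sketch below is illustrative and not the actual `ThroughputMetric` code; the per-stage work is simulated:

```python
import time

timers = {"inference": 0.0, "parsing": 0.0, "metric_update": 0.0}

def timed(stage, fn, *args):
    # Accumulate wall-clock time for one pipeline stage.
    start = time.perf_counter()
    result = fn(*args)
    timers[stage] += time.perf_counter() - start
    return result

start_total = time.perf_counter()
n_samples = 100
for _ in range(n_samples):
    raw = timed("inference", lambda: sum(range(1000)))   # simulated inference
    preds = timed("parsing", lambda r: r * 2, raw)       # simulated parsing
    timed("metric_update", lambda p: None, preds)        # simulated metric update
end_to_end = time.perf_counter() - start_total

# "Pipeline Overhead" is the remainder of the end-to-end time.
overhead = end_to_end - sum(timers.values())
throughput = n_samples / end_to_end  # samples per second
print(f"throughput: {throughput:.0f} samples/s, overhead: {overhead:.6f} s")
```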
Evaluation runs are driven by a YAML configuration file. `EvalConfig` parses and validates the configuration at startup, ensuring that referenced components exist and that required fields are present before evaluation begins.
A complete configuration file is typically organized into the sections below.
This section defines which dataloader to use, which dataset it points to, and which preprocessing steps are applied before inference.
```yaml
loader:
  name: LuxonisLoader            # Registered dataloader name
  params:
    dataset_name: coco-2017      # Dataset identifier
    view: [val]                  # Dataset split(s) to use
  preprocessing:
    normalize:
      active: true               # Whether to apply normalization
      params:
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]
    color_space: RGB             # RGB | BGR | GRAY
    keep_aspect_ratio: false     # Preserve aspect ratio during resize
```

> [!NOTE]
> When using the `depthai` backend, normalization is usually handled by the model's own preprocessing pipeline. The engine will warn you if normalization is enabled together with DepthAI. DepthAI also expects BGR color space, so a warning is emitted if RGB is selected.
The parser converts raw model outputs into structured predictions. Different model architectures expose different tensor layouts, so the parser is responsible for translating backend-specific outputs into a format the metrics can consume.
```yaml
parser:
  name: YOLOInstanceSegmentationParser
  params:
    conf_thres: 0.25
    mask_thres: 0.25
    iou_thres: 0.45
```

Metrics are instantiated independently, updated for each sample, and computed at the end of the run. Throughput reporting is added automatically.
```yaml
metrics:
  metrics:
    - name: BboxMeanAveragePrecision
      params:
        iou_type: bbox
    - name: MaskMeanAveragePrecision
      params:
        iou_type: segm
```

Visualization is optional and can be enabled when you want to inspect predictions during the evaluation loop.
```yaml
visualizer:
  name: InstanceSegmentationVisualizer
  visualize: true
  params: {}
```

The engine section selects the backend and points to the model file. Configuration validation ensures that the model format matches the backend (`.tar.xz` for `depthai`, `.onnx` for `onnx`).
```yaml
engine:
  name: onnx                    # Registered engine name: onnx | depthai
  model_path: ./models/yolov11n/yolov11n.onnx
  params: {}                    # Engine-specific parameters, for example device_ip for RVC4
```

A complete configuration for a DepthAI run combines all of the sections above:

```yaml
loader:
  name: LuxonisLoader
  params:
    dataset_name: coco-2017
    view: [val]
  preprocessing:
    normalize:
      active: false
    color_space: BGR
    keep_aspect_ratio: false

parser:
  name: YOLOInstanceSegmentationParser
  params:
    conf_thres: 0.25
    mask_thres: 0.25
    iou_thres: 0.45

metrics:
  metrics:
    - name: BboxMeanAveragePrecision
      params:
        iou_type: bbox
    - name: MaskMeanAveragePrecision
      params:
        iou_type: segm

engine:
  name: depthai
  model_path: ./models/yolov11n-seg.rvc4.tar.xz
  params:
    device_ip: 192.168.1.100
```

LuxonisEval is designed around a simple rule: implement a new class that inherits from the appropriate base class, and the registry handles the rest. Every component type (`BaseEngine`, `BaseEvalLoader`, `BaseParser`, `BaseMetric`, `BaseVisualizer`) uses `AutoRegisterMeta`, so subclassing is enough to make a component available once its module is imported.
Every custom loader must inherit from `BaseEvalLoader` and implement four abstract methods:

- `load_classes()` - Returns a `dict[str, int]` mapping class names to integer indices. The result is assigned to `self.classes` and validated automatically.
- `get_class_mapping()` - Returns a tuple of `(ldf_class_map, native_class_map, class_index_map)`:
  - LDF class map (`dict[int, str]`): class ordering used inside Luxonis Data Format
  - Native class map (`dict[int, str]`): original class ordering used during training
  - Class index map (`dict[int, int]`): mapping from LDF indices to native indices
- `__getitem__(idx)` - Returns a `LoaderOutput` tuple for the requested sample
- `__len__()` - Returns the number of samples in the dataset
For LuxonisLoader-backed datasets, the LDF and native class maps often differ, so the class index map must encode the remapping explicitly. For custom datasets that inherit directly from BaseEvalLoader, the two class maps are usually identical and the class index map is typically an identity mapping.
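As a concrete illustration of the remapping, with hypothetical class names and orderings:

```python
# LDF stores classes in one order...
ldf_class_map = {0: "person", 1: "car", 2: "dog"}
# ...while the model was trained with another.
native_class_map = {0: "car", 1: "dog", 2: "person"}

# Build the LDF-index -> native-index remapping by matching class names.
name_to_native = {name: idx for idx, name in native_class_map.items()}
class_index_map = {ldf_idx: name_to_native[name]
                   for ldf_idx, name in ldf_class_map.items()}
print(class_index_map)  # {0: 2, 1: 0, 2: 1}
```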
> [!IMPORTANT]
> `__getitem__` must return `LoaderOutput` from `luxonis_ml.typing`, which is a tuple of `(image, annotations_dict)`.
>
> - `image` (`np.ndarray`) is a single image, for example with shape `(H, W, 3)`.
> - `annotations_dict` (`dict[str, np.ndarray]`) maps task-group annotation keys to arrays, such as `"/boundingbox"`, `"/classification"`, or `"/segmentation"`.
Every subclass implementation of `__getitem__` is wrapped by `@validate_loader_output`, which calls `check_loader_output` at runtime and raises a descriptive `TypeError` if the output format is invalid.
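A minimal stand-alone sketch of the four methods is shown below. It uses hypothetical data, plain nested lists in place of `np.ndarray`, a plain tuple in place of `LoaderOutput`, and omits the real `BaseEvalLoader` base class so that it runs without the framework installed:

```python
class TinyLoader:
    """Sketch of a custom dataset loader; the real class would inherit BaseEvalLoader."""

    def __init__(self, images, labels):
        self.images = images   # real code: list of np.ndarray images
        self.labels = labels
        self.classes = self.load_classes()

    def load_classes(self):
        # Class name -> integer index.
        return {"cat": 0, "dog": 1}

    def get_class_mapping(self):
        # For a custom dataset the two class maps usually coincide,
        # so the class index map is the identity.
        class_map = {idx: name for name, idx in self.classes.items()}
        return class_map, class_map, {i: i for i in class_map}

    def __getitem__(self, idx):
        # Real code returns LoaderOutput: (image, annotations_dict).
        return self.images[idx], {"/classification": [self.labels[idx]]}

    def __len__(self):
        return len(self.images)

loader = TinyLoader(images=[[[0]], [[1]]], labels=[0, 1])
print(len(loader), loader[0])
```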
Subclass `BaseEngine` and implement the six abstract methods:

- `setup()` - Initialize backend resources such as runtimes, sessions, or device connections
- `get_input_shape()` - Return the model input size as a `(width, height)` tuple
- `get_platform_name()` - Return a human-readable platform name such as `"RVC2"` or `"RVC4"`
- `infer_once(img)` - Run inference on a single preprocessed image and return the raw backend output
- `vis_frame()` - Return a copy of the input image suitable for visualization overlays
- `teardown()` - Release backend resources after evaluation finishes
Subclass `BaseParser` and implement the single abstract method:

- `parse(raw_output, **kwargs)` - Convert raw backend output into a structured prediction format
The parser bridges the gap between model-specific tensor layouts and the standardized message types that downstream metrics expect. The built-in parsers produce the following output types:
- `ClassificationParser` -> `depthai_nodes.Classifications`
- `YOLODetectionParser` -> `dai.ImgDetections`
- `YOLOInstanceSegmentationParser` -> `dai.ImgDetections`
- `YOLOKeypointDetectionParser` -> `dai.ImgDetections`
- `SemanticSegmentationParser` -> `depthai_nodes.SegmentationMask`
> [!IMPORTANT]
> The parser must produce outputs that the configured metrics can consume. For example, if a metric expects `dai.ImgDetections`, the parser must return that message type.
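The general shape of a parser can be sketched stand-alone. The example below converts a hypothetical raw detection array (rows of `[x1, y1, x2, y2, conf, cls]`) into plain dicts, whereas the real parsers emit message types such as `dai.ImgDetections`:

```python
def parse_detections(raw_output, conf_thres=0.25):
    """Keep detections whose confidence clears the threshold."""
    detections = []
    for x1, y1, x2, y2, conf, cls in raw_output:
        if conf >= conf_thres:
            detections.append({"bbox": (x1, y1, x2, y2),
                               "conf": conf,
                               "label": int(cls)})
    return detections

raw = [
    [0.1, 0.1, 0.5, 0.5, 0.9, 0],  # confident detection, kept
    [0.2, 0.2, 0.3, 0.3, 0.1, 1],  # below threshold, dropped
]
print(parse_detections(raw))  # one detection with label 0
```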
Subclass `BaseMetric` and implement the four abstract methods:

- `metric_keys()` - Declare which annotation keys the metric requires
- `_reset_impl()` - Reset internal state such as counters or accumulators
- `_update_impl(predictions, target, **kwargs)` - Update the metric state for one sample
- `_compute_impl()` - Return the final metric value
> [!IMPORTANT]
> Metrics must be compatible with the outputs generated by the configured parser. If the parser returns `dai.ImgDetections`, the metric must know how to process that object.
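A stand-alone sketch of the four hooks, using a toy accuracy metric and omitting the real `BaseMetric` base class so it runs on its own:

```python
class SketchAccuracy:
    """Sketch of the four BaseMetric hooks; the real class would inherit BaseMetric."""

    def __init__(self):
        self._reset_impl()

    def metric_keys(self):
        # Annotation keys this metric needs from the dataloader.
        return ["/classification"]

    def _reset_impl(self):
        # Reset internal counters.
        self.correct = 0
        self.total = 0

    def _update_impl(self, predictions, target):
        # Compare one predicted label against the ground-truth label.
        self.correct += int(predictions == target["/classification"][0])
        self.total += 1

    def _compute_impl(self):
        return self.correct / self.total if self.total else 0.0

metric = SketchAccuracy()
metric._update_impl(1, {"/classification": [1]})  # correct
metric._update_impl(0, {"/classification": [1]})  # wrong
print(metric._compute_impl())  # 0.5
```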
All extensions follow the same three-step workflow:
- Subclass the appropriate base class
- Implement the required abstract methods
- Reference the component by name in the YAML config
No manual registration, factory wiring, or extra boilerplate is required. As long as the module is imported, the metaclass makes the class available.
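The self-registration pattern can be sketched with a small metaclass. This is an illustration of the idea only, not the framework's actual `AutoRegisterMeta`:

```python
REGISTRY = {}

class AutoRegister(type):
    """Metaclass that records every concrete subclass under its class name."""
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if bases:  # skip the abstract base itself
            REGISTRY[name] = cls

class BaseThing(metaclass=AutoRegister):
    pass

# Merely defining the subclass registers it; no factory wiring needed.
class MyCustomThing(BaseThing):
    pass

# A config can now reference the component purely by name.
instance = REGISTRY["MyCustomThing"]()
print(type(instance).__name__)  # MyCustomThing
```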
This project is licensed under the Apache License 2.0.