Implement native protocol parsing #273
Codecov Report
✅ All modified and coverable lines are covered by tests.
Coverage diff, main vs. #273:
| Metric | main | #273 | +/- |
|---|---|---|---|
| Coverage | 99.52% | 99.35% | -0.17% |
| Files | 18 | 18 | |
| Lines | 1253 | 928 | -325 |
| Branches | 145 | 90 | -55 |
| Hits | 1247 | 922 | -325 |
| Misses | 3 | 3 | |
| Partials | 3 | 3 | |
Pull request overview
Implements native (Rust/PyO3) protocol parsing for Kafka entities and primitive readers, wiring the Python public API (kio.serial.*) to the native extension for large parse-time performance gains (per #214). Also updates packaging/CI/docs/benchmarks to build and validate the extension via maturin.
Changes:
- Adds a new Rust native extension (kio._kio_native) implementing primitive readers and entity parsing with schema compilation/caching.
- Switches Python parsing/reader entrypoints to call into the native extension and updates tests/docs accordingly.
- Migrates build system to maturin with Rust linting in CI and adds benchmark scripts/config.
Reviewed changes
Copilot reviewed 29 out of 31 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/serial/test_readers.py | Updates/extends reader tests for the native-backed reader functions. |
| src/kio/static/primitive.py | Adds _PhantomStr to allow nominal string types to be satisfied by plain str. |
| src/kio/serial/readers.py | Re-exports reader functions from kio._kio_native and defines Reader protocol/types. |
| src/kio/serial/errors.py | Adds new decode error types used by native readers. |
| src/kio/serial/_parse.py | Replaces Python parsing implementation with native extension entrypoints. |
| src/kio/serial/_implicit_defaults.py | Adds _PhantomStr to implicit-default mapping. |
| src/kio/schema/types.py | Moves nominal string schema types to subclass _PhantomStr. |
| src/kio/records/readers.py | Minor slicing fix while reading signed compact strings. |
| src/kio/_kio_native.pyi | Adds typing stubs for the native extension module. |
| rust/src/schema.rs | Implements schema compilation/caching and native entity parsing loop. |
| rust/src/readers.rs | Implements native primitive readers and buffer handling, plus a small Rust-only unit test. |
| rust/src/parse.rs | Implements native get_reader dispatch for schema introspection. |
| rust/src/lib.rs | Exposes native functions/classes through the _kio_native PyO3 module. |
| rust/src/entity.rs | Exposes entity/field/array reader callables to Python and caches them. |
| rust/rust-toolchain.toml | Pins to stable toolchain with rustfmt/clippy components. |
| rust/Cargo.toml | Adds Rust crate configuration for the extension build. |
| rust/Cargo.lock | Locks Rust dependencies. |
| pyproject.toml | Switches build backend to custom maturin wrapper and configures maturin packaging. |
| docs/pages/serial.rst | Replaces autodoc with explicit API docs for readers/parsing functions. |
| docs/conf.py | Adjusts Sphinx config to import installed kio instead of modifying sys.path. |
| codegen/generate_schema.py | Generates string nominal types as _PhantomStr subclasses. |
| build_kio.py | Custom PEP 517 backend wrapper to integrate setuptools-scm versioning with maturin. |
| benchmarks/ruff.toml | Benchmark-specific ruff configuration. |
| benchmarks/roundtrip-serialization.py | Updates benchmark to use bytes+offset reader API. |
| benchmarks/parsing_memory.py | Adds parsing memory benchmark script. |
| benchmarks/parsing.py | Adds parsing micro-benchmark script. |
| Makefile | Adds clean/nuke targets and a maturin develop build target. |
| MANIFEST.in | Updates sdist inclusion/exclusion rules. |
| .pre-commit-config.yaml | Adds Rust formatting hook and updates dependencies. |
| .gitignore | Ignores maturin debug symbol directories. |
| .github/workflows/ci.yaml | Adds Rust fmt/clippy/tests job and ensures Rust toolchain is available in other jobs. |
Introduce _PhantomStr as a base class for nominal string types such as TopicName, GroupId, and TransactionalId. This removes the need for a wrapping callable in the native parser since isinstance checks pass transparently for any plain str value.
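One way to get the isinstance behaviour described above is a metaclass that overrides `__instancecheck__`. The sketch below is only an illustration of that idea: the names `_PhantomMeta` and `PhantomStr` are hypothetical, and the actual `_PhantomStr` in this PR may be implemented differently.

```python
class _PhantomMeta(type):
    """Hypothetical metaclass: isinstance() accepts any plain str."""

    def __instancecheck__(cls, instance: object) -> bool:
        # Any str passes, regardless of which nominal subclass is asked for.
        return isinstance(instance, str)


class PhantomStr(str, metaclass=_PhantomMeta):
    """Stand-in for the PR's _PhantomStr (assumed mechanism)."""


class TopicName(PhantomStr):
    """Nominal type: distinct for type checkers, plain str at runtime."""


value = "orders"  # what a native reader could hand back: a plain str
print(isinstance(value, TopicName), type(value) is str)
```

With this shape, a parser can return plain `str` objects without wrapping each one in a constructor call, while annotations such as `TopicName` remain meaningful to type checkers.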
Low-level functions for reading every Kafka primitive type from a raw byte buffer at a given offset. Each reader returns (value, bytes_consumed) and raises Python exceptions on underflow or invalid data.
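To illustrate the `(value, bytes_consumed)` convention, here is a minimal pure-Python sketch of an unsigned varint reader. It is not the Rust implementation from this PR; the function name and the use of `ValueError` are assumptions made for the example, and the real readers raise the error classes defined in kio.serial.errors.

```python
def read_unsigned_varint(buffer: bytes, offset: int) -> tuple[int, int]:
    """Read a 32-bit unsigned varint starting at ``offset``.

    Returns (value, bytes_consumed). ValueError is a placeholder for
    the library's own decode errors.
    """
    value = 0
    shift = 0
    consumed = 0
    while True:
        if offset + consumed >= len(buffer):
            raise ValueError("buffer underflow while reading varint")
        byte = buffer[offset + consumed]
        consumed += 1
        # The low 7 bits carry payload, least-significant group first.
        value |= (byte & 0x7F) << shift
        # A clear high bit marks the final byte.
        if not byte & 0x80:
            return value, consumed
        shift += 7
        if shift > 28:
            # A 32-bit varint occupies at most five bytes.
            raise ValueError("varint is too long")


# Callers advance their own offset by the returned byte count.
buffer = b"\x00\xac\x02\x7f"
offset = 1
first, consumed = read_unsigned_varint(buffer, offset)
offset += consumed
second, consumed = read_unsigned_varint(buffer, offset)
```

Returning the consumed byte count, rather than mutating a stream position, lets the caller read from one immutable buffer without copying or slicing it.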
Compile a Python dataclass entity type into an EntitySchema once and cache it. parse_entity then walks the schema to read all fields directly in Rust, allocating the dataclass via tp_alloc and writing slot attributes with PyObject_GenericSetAttr, bypassing __init__ and per-field callables.
Wire up the Rust extension as the backend for kio.serial. readers.py re-exports the native reader functions. _parse.py delegates entity_reader and get_field_reader to the extension. Add InvalidUnicode and NegativeByteLength error classes and a _kio_native.pyi stub for type checkers.
Add tests for read_boolean, varint boundary and truncation cases, empty compact strings/bytes, and invalid UTF-8 sequences. The single Rust-level test covers zigzag_decode, an internal function not reachable from Python.
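For readers unfamiliar with zigzag encoding, the following Python sketch shows the standard decoding step that maps unsigned varint values back to signed integers. It illustrates the general technique only; the Rust `zigzag_decode` in this PR is internal and its exact signature is not shown here.

```python
def zigzag_decode(encoded: int) -> int:
    """Map 0, 1, 2, 3, ... to 0, -1, 1, -2, ... (standard zigzag)."""
    # The lowest bit holds the sign; the remaining bits hold the magnitude.
    return (encoded >> 1) ^ -(encoded & 1)


decoded = [zigzag_decode(n) for n in range(5)]
print(decoded)  # [0, -1, 1, -2, 2]
```

Zigzag keeps small negative numbers small on the wire, which is why signed varints are encoded this way before the variable-length step.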
Add pyperf-based CPU benchmark and memray/psutil-based memory benchmark comparing the native extension against the pure-Python baseline. Update the roundtrip benchmark to the new (buffer, offset) reader API. Add a nested ruff.toml to allow print statements in benchmark scripts.
Sphinx cannot introspect Rust-defined pyfunction attributes at runtime, so replace the automodule directives for kio.serial and kio.serial.readers with explicit py:function declarations. Writers and errors remain on autodoc since they are pure Python.
@keejon I don't think it's possible to meaningfully split this PR further.
If you have some cycles available for this, please take a look.
This implements #214: native protocol parsing with readers implemented in Rust.
Performance benchmarks
Benchmarks run on the same machine (macOS/arm64) against origin/main as baseline. The workload is parsing a MetadataResponse with 100 topics × 12 partitions each (170 KB wire payload).
Parsing time
origin/main vs. native-parsing
Result: 6.71× faster
Memory
The memory benchmark retains all parsed objects simultaneously to measure the true live working set: 1 000 loops × 5 parses per loop = 5 000 MetadataResponse objects alive at peak measurement.
origin/main vs. native-parsing
Peak working set: identical (0.09 MB difference, < 0.01%). The Rust extension returns the same Python dataclass instances as the pure-Python implementation, so there is no additional memory cost for the speedup. The allocation pattern does differ, though: the native code produces a higher volume of allocations because it cannot, as Python does, allocate an area up front and reuse it for objects. In practice, the performance gain should be well worth that.