Improve msgspec benchmarks #24
Conversation
Code Review
This pull request introduces the msgspec-msgpack serializer and refactors the existing msgspec implementation to utilize msgspec.Struct models. By generating these models from canonical dataclasses and performing data conversion outside the timed benchmark loop, the suite now better reflects high-performance usage patterns. Additionally, the benchmark runner was updated with prepare and prepare_data hooks to support this pre-timing setup. The review feedback focuses on improving the robustness of stream-based serialization by correctly handling buffer offsets and lengths when using encode_into and getbuffer().
```python
self._encoder.encode_into(obj, self._buffer)
stream.write(self._buffer)
```
While `msgspec.json.Encoder.encode_into` resizes a bytearray to fit the encoded data when `offset=0`, relying on this behavior for serialization to a stream is correct but subtle. If msgspec were to change its implementation to only grow the buffer (similar to how many other libraries handle bytearray buffers), this would result in trailing garbage being written to the stream. A more robust approach would be to use the return value of `encode_into` (which returns the number of bytes written) to slice the buffer or use a `memoryview`.
```diff
- self._encoder.encode_into(obj, self._buffer)
- stream.write(self._buffer)
+ n = self._encoder.encode_into(obj, self._buffer)
+ stream.write(memoryview(self._buffer)[:n])
```
It was locally verified that both `msgspec.json.Encoder.encode_into` and `msgspec.msgpack.Encoder.encode_into` return `None`; the docstring says the bytearray is truncated to the end of the serialized message. So using `n = encode_into(...)` would break, since `n` would always be `None`.
leo-gan left a comment
LGTM
I'll try to use the new base Serializer prepare() and prepare_data() methods on other serializers. Thanks for a good idea!!
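A sketch of what such hooks might look like on a base serializer. This is a hypothetical shape, not the actual runner code; the class and method bodies here are illustrative:

```python
# Hypothetical sketch of the base Serializer hooks discussed above:
# prepare() runs one-time setup, prepare_data() converts fixtures to the
# serializer's native model -- both execute before the timed loop.
class Serializer:
    def prepare(self) -> None:
        """One-time setup (e.g. build encoders/buffers) before timing."""

    def prepare_data(self, data):
        """Convert canonical fixtures outside the timed loop. Default: identity."""
        return data


class StructSerializer(Serializer):
    def __init__(self):
        self.buffer = None

    def prepare(self) -> None:
        self.buffer = bytearray()  # reusable scratch buffer for encode_into

    def prepare_data(self, data):
        # A real implementation would convert dataclasses to Structs here.
        return data
```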
Summary
This updates the Python `msgspec` benchmark to use the API patterns recommended for high-performance msgspec applications, and adds a separate native MessagePack benchmark using `msgspec.msgpack`.

The existing `msgspec` benchmark encoded the shared stdlib dataclass fixtures directly. While msgspec supports dataclasses, its documented and most efficient modeling API is `msgspec.Struct`. This PR changes the benchmark to convert the canonical dataclass fixtures to generated `Struct` models before timed repetitions, then measures serialization/deserialization of those `Struct` instances.

Changes
- Generate `Struct` types dynamically from the canonical dataclasses in `benchmark.data.models`.
- Use `array_like=True` for the generated Structs.
- Cache `msgspec.json.Encoder` / typed `msgspec.json.Decoder` instances per benchmark data type.
- Use `encode_into` with a reusable `bytearray` for stream serialization.
- Add a `msgspec-msgpack` serializer using `msgspec.msgpack.Encoder` / typed `msgspec.msgpack.Decoder`.
- Register the new serializer in the suite as `msgspec-msgpack`.

Why
The goal is to benchmark idiomatic, high-performance msgspec usage rather than compatibility-path dataclass handling.
Moving fixture conversion outside the timed loop models an application that already uses msgspec `Struct` types in its data model. The timed region still measures the important runtime costs: encoding, decoding, payload size, allocation, and semantic roundtrip correctness.

Generating Structs from the existing dataclasses avoids maintaining parallel hand-written msgspec models, which would be easy to let drift from the canonical benchmark fixtures.
About `array_like=True`

`array_like=True` encodes Struct instances as positional arrays instead of maps keyed by field name. This reduces payload size and avoids repeatedly writing field names for every object.

This is not intended to artificially boost msgspec at the expense of idiomatic usage. It is a documented msgspec option for schema-oriented workloads where both sides share the model definition. That matches other schema-based serializers already in the suite.
For a benchmark suite that includes schema-aware formats, array-like Structs are a fair representation of msgspec's compact schema-oriented mode.
Limitations
`ObjectGraph` remains unsupported for msgspec and `msgspec-msgpack`. The fixture contains circular references/object identity cycles, which JSON and MessagePack formats do not represent.

Benchmark Results
I compared this branch against the previous implementation using:
The first repetition was excluded as warmup, matching the benchmark runner's reporting logic.
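For reference, geomean speedup figures like the ones below are conventionally the geometric mean of per-fixture time ratios; a sketch with purely illustrative numbers:

```python
import math

# Hypothetical per-fixture speedup ratios (old_time / new_time);
# these numbers are illustrative, not measured.
speedups = [1.05, 1.20, 1.08]

# Geometric mean: exp of the arithmetic mean of the logs.
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

The geometric mean is preferred over the arithmetic mean for ratios because it is symmetric under inversion: a 2x speedup and a 2x slowdown average out to 1.0x.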
Overall:
- `msgspec` JSON: `1.11x` geomean faster for serialize+deserialize, average payload size `0.74x` of the previous implementation.
- `msgspec-msgpack`: `1.24x` geomean faster than the previous JSON baseline, average payload size `0.53x` of the previous implementation.

Validation
Ran:
Disclaimer
I coaxed Codex 5.5 xhigh into achieving what I wanted throughout the day when I had some downtime.