Improve msgspec benchmarks #24

Merged
leo-gan merged 2 commits into leo-gan:master from ofek:msgspec-improvements
Apr 30, 2026

Conversation

@ofek ofek commented Apr 29, 2026

Summary

This updates the Python msgspec benchmark to use the API patterns recommended for high-performance msgspec applications, and adds a separate native MessagePack benchmark using msgspec.msgpack.

The existing msgspec benchmark encoded the shared stdlib dataclass fixtures directly. While msgspec supports dataclasses, its documented and most efficient modeling API is msgspec.Struct. This PR changes the benchmark to convert the canonical dataclass fixtures to generated Struct models before timed repetitions, then measures serialization/deserialization of those Struct instances.

Changes

  • Generate msgspec Struct types dynamically from the canonical dataclasses in benchmark.data.models.
  • Use array_like=True for the generated Structs.
  • Pre-build and reuse msgspec.json.Encoder / typed msgspec.json.Decoder instances per benchmark data type.
  • Use encode_into with a reusable bytearray for stream serialization.
  • Add a new msgspec-msgpack serializer using msgspec.msgpack.Encoder / typed msgspec.msgpack.Decoder.
  • Add serializer lifecycle hooks so serializers can prepare schemas/codecs and serializer-native fixture objects outside the timed loop.
  • Keep correctness comparison against the original canonical dataclass fixture, rather than the serializer-native prepared object.
  • Update Python serializer docs and summary counts to include msgspec-msgpack.

Why

The goal is to benchmark idiomatic, high-performance msgspec usage rather than compatibility-path dataclass handling.

Moving fixture conversion outside the timed loop models an application that already uses msgspec Struct types in its data model. The timed region still measures the important runtime costs: encoding, decoding, payload size, allocation, and semantic roundtrip correctness.

Generating Structs from the existing dataclasses avoids maintaining parallel hand-written msgspec models, which would be easy to let drift from the canonical benchmark fixtures.

About array_like=True

array_like=True encodes Struct instances as positional arrays instead of maps keyed by field name. This reduces payload size and avoids repeatedly writing field names for every object.

This is not intended to artificially boost msgspec at the expense of idiomatic usage. It is a documented msgspec option for schema-oriented workloads where both sides share the model definition. That matches other schema-based serializers already in the suite:

  • protobuf encodes numbered field tags rather than full field names.
  • Avro uses a shared schema for schemaless binary encoding.

For a benchmark suite that includes schema-aware formats, array-like Structs are a fair representation of msgspec's compact schema-oriented mode.

Limitations

ObjectGraph remains unsupported for msgspec and msgspec-msgpack. The fixture contains circular references/object identity cycles, which JSON and MessagePack formats do not represent.
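The same constraint can be demonstrated with the stdlib json module, which rejects self-referential data outright:

```python
import json

# A minimal object identity cycle, similar in spirit to the ObjectGraph fixture.
node = {"name": "root", "children": []}
node["children"].append(node)  # node now contains itself

try:
    json.dumps(node)
    message = None
except ValueError as exc:
    message = str(exc)

assert message is not None and "Circular reference" in message
```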

Benchmark Results

I compared this branch against the previous implementation using:

```shell
python -m benchmark.runner 100 msgspec
```

The first repetition was excluded as warmup, matching the benchmark runner's reporting logic.

Overall:

  • Current msgspec JSON: 1.11x geomean faster for serialize+deserialize, average payload size 0.74x of the previous implementation.
  • New msgspec-msgpack: 1.24x geomean faster than the previous JSON baseline, average payload size 0.53x of the previous implementation.
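For reference, the 1.11x figure is the geometric mean of the per-row JSON speedups from the table below, computable with the stdlib:

```python
from statistics import geometric_mean

# JSON speedup column from the results table (bytes and stream rows).
json_speedups = [1.26, 1.23, 1.06, 0.94, 1.02, 1.01,
                 1.36, 1.14, 1.01, 0.98, 1.20, 1.21]
overall = geometric_mean(json_speedups)
assert round(overall, 2) == 1.11
```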
| Data / mode | Old msgspec ops/s | Current JSON ops/s | JSON speedup | JSON size (new vs old) | msgspec-msgpack ops/s | msgpack speedup | msgpack size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EDI_835 bytes | 23,102 | 29,109 | 1.26x | 591 vs 1,730 | 34,055 | 1.47x | 644 |
| EDI_835 stream | 20,893 | 25,701 | 1.23x | 591 vs 1,730 | 31,008 | 1.48x | 644 |
| Integer bytes | 385,899 | 410,281 | 1.06x | 10 vs 10 | 399,707 | 1.04x | 5 |
| Integer stream | 220,591 | 206,503 | 0.94x | 10 vs 10 | 203,534 | 0.92x | 5 |
| Person bytes | 37,319 | 37,909 | 1.02x | 466 vs 917 | 42,494 | 1.14x | 363 |
| Person stream | 32,325 | 32,735 | 1.01x | 466 vs 917 | 34,379 | 1.06x | 363 |
| SimpleObject bytes | 156,629 | 212,551 | 1.36x | 67 vs 102 | 234,831 | 1.50x | 56 |
| SimpleObject stream | 129,110 | 147,000 | 1.14x | 67 vs 102 | 124,659 | 0.97x | 56 |
| StringArray bytes | 21,090 | 21,205 | 1.01x | 1,893 vs 1,901 | 22,454 | 1.06x | 1,694 |
| StringArray stream | 19,459 | 19,041 | 0.98x | 1,893 vs 1,901 | 19,929 | 1.02x | 1,694 |
| Telemetry bytes | 33,807 | 40,416 | 1.20x | 2,070 vs 2,185 | 59,339 | 1.76x | 1,024 |
| Telemetry stream | 33,698 | 40,845 | 1.21x | 2,070 vs 2,185 | 61,395 | 1.82x | 1,024 |

Validation

Ran:

```shell
uv run --no-sync python -m compileall src
uv run --no-sync python -m benchmark.runner 5 msgspec
uv run --no-sync python -m benchmark.runner 1
uv run --no-sync python -m benchmark.runner 100 msgspec
git diff --check
```

Disclaimer

I coaxed Codex 5.5 xhigh into achieving what I wanted throughout the day when I had some downtime.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the msgspec-msgpack serializer and refactors the existing msgspec implementation to utilize msgspec.Struct models. By generating these models from canonical dataclasses and performing data conversion outside the timed benchmark loop, the suite now better reflects high-performance usage patterns. Additionally, the benchmark runner was updated with prepare and prepare_data hooks to support this pre-timing setup. The review feedback focuses on improving the robustness of stream-based serialization by correctly handling buffer offsets and lengths when using encode_into and getbuffer().

Comment on lines +157 to +158
```python
self._encoder.encode_into(obj, self._buffer)
stream.write(self._buffer)
```

Severity: medium

While msgspec.json.Encoder.encode_into resizes a bytearray to fit the encoded data when offset=0, relying on this behavior for serialization to a stream is correct but subtle. If msgspec were to change its implementation to only grow the buffer (similar to how many other libraries handle bytearray buffers), this would result in trailing garbage being written to the stream. A more robust approach would be to use the return value of encode_into (which returns the number of bytes written) to slice the buffer or use a memoryview.

Suggested change:

```diff
-self._encoder.encode_into(obj, self._buffer)
-stream.write(self._buffer)
+n = self._encoder.encode_into(obj, self._buffer)
+stream.write(memoryview(self._buffer)[:n])
```

@ofek ofek (author) replied:

It was locally verified that both msgspec.json.Encoder.encode_into and msgspec.msgpack.Encoder.encode_into return None; the docstring says the bytearray is truncated to the end of the serialized message. So using n = encode_into(...) would break.

Comment thread python/src/benchmark/serializers/json_msgspec.py

@leo-gan leo-gan left a comment


LGTM
I'll try to use the new base Serializer prepare() and prepare_data() methods on other serializers. Thanks for a good idea!!

@leo-gan leo-gan merged commit 2a1a7f2 into leo-gan:master Apr 30, 2026
4 checks passed
@ofek ofek deleted the msgspec-improvements branch April 30, 2026 02:35