Arrow IPC on the wire: C++ consumer PoC #40

Draft
jghoman wants to merge 1 commit into main from experiment/arrow-message-payload


jghoman commented Apr 17, 2026

Summary

  • Validates the zero-copy claim from WICKED_COOL_NEXT_STEPS.md: a C++ consumer wraps librdkafka's receive buffer in place via arrow::Buffer(ptr, len) — no memcpy between Kafka and DuckDB's arrow_scan
  • Python producer batches PostHog-shaped events into Arrow IPC record batches on the test-events-arrow topic
  • C++ consumer (~250 LOC): librdkafka poll → Arrow IPC reader → Arrow C Data Interface → DuckDB arrow_scan → INSERT into DuckLake
  • Standalone docker-compose stack with shifted ports for side-by-side comparison with the parent stack
  • Smoke test: 660K records ingested over ~30s, zero crashes, zero data loss

Smoke-test results (Apple Silicon, 2026-04-12)

| Metric | Value |
| --- | --- |
| Producer rate (rate-limited) | 10K rec/sec |
| Per-batch wire size | ~1.1 MB |
| Consumer lag | 0 across 8 partitions |
| Arrow IPC → parquet compression | ~3.2× |

Test plan

  • cd experiments/arrow-payload && docker compose up --build — all services healthy, records flowing
  • Consumer logs show flushed batch with record counts
  • MinIO console at :9101 shows parquet files under ducklake/data/main/events_arrow/

Validates the claim from WICKED_COOL_NEXT_STEPS.md that producing Arrow IPC
on the Kafka wire enables a near-zero-copy consumer path. The librdkafka
receive buffer is wrapped in place by Arrow C++ — no memcpy between
librdkafka and DuckDB's arrow_scan.

Includes:
- Python producer (pyarrow + confluent-kafka) generating PostHog-shaped
  events as Arrow IPC record batches
- C++ consumer (~250 LOC): librdkafka poll → Arrow IPC reader → DuckDB
  arrow_scan → INSERT into DuckLake
- Standalone docker-compose stack (shifted ports to avoid parent collisions)
- Smoke test results: 660K records ingested, zero crashes, zero data loss

See experiments/arrow-payload/FINDINGS.md for detailed results.

jghoman commented Apr 17, 2026

Not merging as this is just a PoC for now.

@jghoman jghoman marked this pull request as draft April 17, 2026 18:09
