perf(arrow): avoid per-element allocation in BFloat16Array::from#7500
perf(arrow): avoid per-element allocation in BFloat16Array::from#7500LuciferYang wants to merge 1 commit into
Conversation
…bf16>) The From<Vec<bf16>> impl built the byte buffer via flat_map + to_vec, heap-allocating a 2-byte Vec for every element (~2M allocations for a 4096x512 bf16 batch on the vector path). Write each value's little-endian bytes straight into the MutableBuffer with extend_from_slice. Behavior is unchanged; test_basics already pins from() == from_iter_values().
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
cc @Xuanwo |
| // Write each value's little-endian bytes straight into the buffer. Going | ||
| // through an intermediate `Vec` per element would allocate once per value. | ||
| for val in &data { | ||
| buffer.extend_from_slice(&val.to_bits().to_le_bytes()); | ||
| } |
There was a problem hiding this comment.
suggestion: I wonder if we can do one better and just pass ownership of the buffer directly via Buffer::from_vec. This would require the memory-layout to be the same, but I would think it is.
There was a problem hiding this comment.
seems can't go direct in this pr, bf16 doesn't implement ArrowNativeType, so Buffer::from_vec rejects it (arrow-rs registers the trait for f16/f32/f64 among the floats, but not bf16). The root cause sits one layer deeper: Apache Arrow's logical type system has no bf16 primitive at all, which is why Lance falls back to FixedSizeBinary(2) in the first place.
Maybe two ways to bridge it for a true zero-copy path, both larger than this pr's scope:
-
unsafetransmuteVec<bf16>→Vec<u16>viaVec::from_raw_parts, thenBuffer::from_vec. Sound becausebf16is#[repr(transparent)]overu16— same size and alignment, so the allocator Layout is preserved — but adds a newunsafeblock, which feels out of scope for a perf-only change. -
Enable the
bytemuckfeature onhalfin the workspace and usebytemuck::cast_vec::<bf16, u16>. Safe at the call site; the layouts match statically viarepr(transparent), so the runtime size/align check can't fire. Cost is a workspace dependency feature bump.
This pr already removes the per-element allocation, which was the hot-path cost on 4096×512 batches. What's left is one linear pass over ~4 MB of bf16 data — likely a smaller win than the allocation fix this
PR captures.
Personally, I prefer to keep the current PR as-is and pursue one of the options as a follow-up, such as Option 2. WDYT?
What
impl From<Vec<bf16>> for BFloat16Arraybuilt its byte buffer with:to_vec()heap-allocates a 2-byteVecfor every element. On the vector-ingestion path (coerce_float_vector→BFloat16Array::from) that is ~2M tiny allocations for a 4096×512 bf16 batch.Fix
Write each value's little-endian bytes straight into the pre-sized
MutableBuffer:No per-element allocation; the bytes produced are identical.
Test
Behavior-preserving — the existing
test_basicsalready assertsBFloat16Array::from(values) == BFloat16Array::from_iter_values(values)and checks the decoded values, andtest_nulls/test_coerce_float_vector_bfloat16cover the surrounding paths. All pass.cargo clippy -p lance-arrow --tests -- -D warningsandcargo fmt -- --checkare green.