Skip to content

perf(arrow): avoid per-element allocation in BFloat16Array::from#7500

Open
LuciferYang wants to merge 1 commit into
lance-format:mainfrom
LuciferYang:perf/bf16-from-vec-no-per-elem-alloc
Open

perf(arrow): avoid per-element allocation in BFloat16Array::from#7500
LuciferYang wants to merge 1 commit into
lance-format:mainfrom
LuciferYang:perf/bf16-from-vec-no-per-elem-alloc

Conversation

@LuciferYang

Copy link
Copy Markdown
Contributor

What

impl From<Vec<bf16>> for BFloat16Array built its byte buffer with:

let bytes = data.iter().flat_map(|val| {
    let bytes = val.to_bits().to_le_bytes();  // [u8; 2] on the stack
    bytes.to_vec()                            // heap allocation per element
});
buffer.extend(bytes);

to_vec() heap-allocates a 2-byte Vec for every element. On the vector-ingestion path (coerce_float_vectorBFloat16Array::from) that is ~2M tiny allocations for a 4096×512 bf16 batch.

Fix

Write each value's little-endian bytes straight into the pre-sized MutableBuffer:

for val in &data {
    buffer.extend_from_slice(&val.to_bits().to_le_bytes());
}

No per-element allocation; the bytes produced are identical.

Test

Behavior-preserving — the existing test_basics already asserts BFloat16Array::from(values) == BFloat16Array::from_iter_values(values) and checks the decoded values, and test_nulls / test_coerce_float_vector_bfloat16 cover the surrounding paths. All pass. cargo clippy -p lance-arrow --tests -- -D warnings and cargo fmt -- --check are green.

…bf16>)

The From<Vec<bf16>> impl built the byte buffer via flat_map + to_vec,
heap-allocating a 2-byte Vec for every element (~2M allocations for a
4096x512 bf16 batch on the vector path). Write each value's little-endian
bytes straight into the MutableBuffer with extend_from_slice. Behavior is
unchanged; test_basics already pins from() == from_iter_values().
@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@LuciferYang

Copy link
Copy Markdown
Contributor Author

cc @Xuanwo

Comment on lines +164 to +168
// Write each value's little-endian bytes straight into the buffer. Going
// through an intermediate `Vec` per element would allocate once per value.
for val in &data {
buffer.extend_from_slice(&val.to_bits().to_le_bytes());
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I wonder if we can do one better and just pass ownership of the buffer directly via Buffer::from_vec. This would require the memory-layout to be the same, but I would think it is.

@LuciferYang LuciferYang Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems can't go direct in this pr, bf16 doesn't implement ArrowNativeType, so Buffer::from_vec rejects it (arrow-rs registers the trait for f16/f32/f64 among the floats, but not bf16). The root cause sits one layer deeper: Apache Arrow's logical type system has no bf16 primitive at all, which is why Lance falls back to FixedSizeBinary(2) in the first place.

Maybe two ways to bridge it for a true zero-copy path, both larger than this pr's scope:

  1. unsafe transmute Vec<bf16>Vec<u16> via Vec::from_raw_parts, then Buffer::from_vec. Sound because bf16 is #[repr(transparent)] over u16 — same size and alignment, so the allocator Layout is preserved — but adds a new unsafe block, which feels out of scope for a perf-only change.

  2. Enable the bytemuck feature on half in the workspace and use bytemuck::cast_vec::<bf16, u16>. Safe at the call site; the layouts match statically via repr(transparent), so the runtime size/align check can't fire. Cost is a workspace dependency feature bump.

This pr already removes the per-element allocation, which was the hot-path cost on 4096×512 batches. What's left is one linear pass over ~4 MB of bf16 data — likely a smaller win than the allocation fix this
PR captures.

Personally, I prefer to keep the current PR as-is and pursue one of the options as a follow-up, such as Option 2. WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants