feat: blob v2 write support #560
Open
geruh wants to merge 2 commits into
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action.
Stacked PR that depends on #548, which adds Blob v2 descriptor reads.
This PR allows users to create a table with `file_format_version >= 2.2` and `<column>.lance.encoding = blob`, write `BINARY` values through SQL or DataFrames, and have Lance store them using Blob v2. Reads still expose the descriptor struct from #548.

The hard part here is that the read schema and write schema are logically different, and Spark doesn't like this (as far as I can tell). Reads expose Blob v2 as a descriptor struct, but writes still need to accept `BINARY`.

I tried an analyzer rule first, but it runs after Spark resolves the table schema, so the write has already failed by the time the rule gets a chance to do anything.
So this uses `ACCEPT_ANY_SCHEMA`, but only for tables with Blob v2 columns. It is basically a scoped bypass for this one schema mismatch, not a free pass for all writes. Lance still validates the incoming schema in `newWriteBuilder`: Blob v2 columns must be binary, column order/names must match, Spark SQL `VALUES` columns are handled, and nested structs still follow the normal checks.

At encode time, Spark still gives us `BINARY`, and the connector maps that into the Lance Blob v2 write struct. This PR only supports inline binary writes. URI-based blob writes are not included here and can be added separately.

Testing
`./mvnw test -pl lance-spark-base_2.12 -Dtest=BlobV2StructWriterTest,LanceWriteSchemaValidatorTest,SchemaConverterTest`

`make docker-build-test-base && make docker-build-test && make docker-test TEST_BACKENDS=local`
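
For reviewers, the write-side checks described above (Blob v2 columns must be binary, column order and names must match) can be sketched roughly as follows. This is only an illustration, not the connector's actual validator: the class `BlobWriteSchemaCheck`, the `Column` type, and the string type labels are all hypothetical, and it uses no Spark or Lance APIs. It also assumes exact, case-sensitive name matching and ignores nested structs and the `VALUES` column handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class BlobWriteSchemaCheck {

    /** Hypothetical stand-in for a schema field: a name plus a simple type label. */
    public static final class Column {
        public final String name;
        public final String type;

        public Column(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    public static Column col(String name, String type) {
        return new Column(name, type);
    }

    /**
     * Compares the incoming write schema against the table schema as seen on reads.
     * Returns a list of problems; an empty list means the write schema is accepted.
     */
    public static List<String> validate(
            List<Column> table, List<Column> incoming, Set<String> blobColumns) {
        List<String> errors = new ArrayList<>();
        if (table.size() != incoming.size()) {
            errors.add("expected " + table.size() + " columns, got " + incoming.size());
            return errors;
        }
        for (int i = 0; i < table.size(); i++) {
            Column expected = table.get(i);
            Column actual = incoming.get(i);
            // Column order and names must match position by position.
            if (!expected.name.equals(actual.name)) {
                errors.add("column " + i + ": expected '" + expected.name
                        + "', got '" + actual.name + "'");
                continue;
            }
            if (blobColumns.contains(expected.name)) {
                // The read schema shows a descriptor struct here, but writes must supply binary.
                if (!"binary".equals(actual.type)) {
                    errors.add("blob column '" + expected.name
                            + "' must be binary, got " + actual.type);
                }
            } else if (!expected.type.equals(actual.type)) {
                // Non-blob columns keep the ordinary type check.
                errors.add("column '" + expected.name + "': expected "
                        + expected.type + ", got " + actual.type);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<Column> table = Arrays.asList(col("id", "int"), col("payload", "struct"));
        Set<String> blobs = Collections.singleton("payload");

        List<Column> good = Arrays.asList(col("id", "int"), col("payload", "binary"));
        List<Column> bad = Arrays.asList(col("id", "int"), col("payload", "string"));

        System.out.println("good write: " + validate(table, good, blobs));
        System.out.println("bad write:  " + validate(table, bad, blobs));
    }
}
```

The point of the sketch is that bypassing Spark's own schema check does not mean skipping validation: the blob column is the only place where the incoming type is allowed to differ from the read schema.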