Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions arrow-json/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,7 @@ ryu = "1.0"
itoa = "1.0"

[dev-dependencies]
arrow-select = { workspace = true }
flate2 = { version = "1", default-features = false, features = ["rust_backend"] }
serde = { version = "1.0", default-features = false, features = ["derive"] }
futures = "0.3"
Expand Down
2 changes: 1 addition & 1 deletion arrow-json/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -86,7 +86,7 @@
pub mod reader;
pub mod writer;

pub use self::reader::{Reader, ReaderBuilder};
pub use self::reader::{ArrayDecoder, DecoderFactory, Reader, ReaderBuilder, Tape, TapeElement};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the key part that might be deemed controversial. Is Tape really a good thing to expose publicly? It's been a while since I wrote it, but I remember it not being especially friendly as an API, and something that stands a good chance of being changed in future - e.g. to avoid copying strings.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, and I don't remember seeing any discussion on the original PR I'm building on here:

Is there any way to allow users to customize parsing without exposing something? Other options might include:

  1. Create a new trait or wrapper that exposes the tape's information in a simplified/safe/stable way, to decouple users from the low-level details.
    • Maybe could work? Worth exploring?
  2. Convert the tape to variant, and shift the factory/decoder stuff over to variant-compute instead of json crate
    • We'd still need something to allow parsing JSON to variant, which I believe is a canonical extension types that should be supported directly.
    • Variant is insanely complex once shredding comes into the picture, so such an interface would not be easier or safer to use IMO.
    • The extra layer of conversion would impose significant overhead for somebody who just wants to parse a few misbehaving columns in a special way.
  3. Something else I'm not thinking of?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree if we want to allow users to override decoding behavior we are going to have to given them direct access to Tape / Tape Element - I don't really see any way around it

something that stands a good chance of being changed in future - e.g. to avoid copying strings.

@tustvold -- what strings are you referring to? I don't see any strings copied here:

https://github.com/apache/arrow-rs/blob/7e5076f1f775a6fd08a4d63389e26e2920fe3f6a/arrow-json/src/reader/tape.rs#L34-L33

pub struct Tape<'a> {
elements: &'a [TapeElement],
strings: &'a str,
string_offsets: &'a [usize],
num_rows: usize,
}

Copy link
Copy Markdown
Contributor

@tustvold tustvold Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were some discussions a while back about the way that it copies all strings from the source data being unfortunate. I'm not sure how this is avoidable with the push-based decoder interface, but it has been discussed.

TBC I am not against making Tape public, but it likely needs some TLC prior to that to ensure it is usable and vaguely future-proof. Even basic things like adding non_exhaustive, hiding methods like this that are a bit odd to expose, etc...

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone could come up with a list of changes that would make us comfortable making Tape public, I can try to put up a PR

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich is this something you can help drive? I think it is important but I really just don't have the bandwidth to give it the attention it deserves

From my perspective, the Arrow crates now permit making breaking API changes every 3 months (major releases) so if we expose a structure and then decide to make breaking changes to it, that isn't impossible

Thus in my mind, I think we should get something out and working

I don't suspect we are not likely to see changes to this interface unless we expose it and there are new use cases put forward.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't have time in the next couple of weeks, sadly. But yes I do want to see this through.

I do tend to agree we should get something out once it's reasonable, but we already know of several use cases and it's not obvious to me that even those known use cases are well-served by the current approach (speaking as somebody who wants to actually use whatever we come up with).

So: If we have something that works well for the use cases we've thought of, it's probably Good Enough and can evolve a quarter or two later as we discover more use cases or warts.

Is that a reasonable go/no-go criteria?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So: If we have something that works well for the use cases we've thought of, it's probably Good Enough and can evolve a quarter or two later as we discover more use cases or warts.

This is my personal preferred approach -- let's get it out rather than waiting on something that is perfect

pub use self::writer::{
ArrayWriter, Encoder, EncoderFactory, EncoderOptions, LineDelimitedWriter, Writer,
WriterBuilder,
Expand Down
2 changes: 1 addition & 1 deletion arrow-json/src/reader/list_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ impl<O: OffsetSizeTrait> ListArrayDecoder<O> {
DataType::LargeList(f) if O::IS_LARGE => f,
_ => unreachable!(),
};
let decoder = ctx.make_decoder(field.data_type(), field.is_nullable())?;
let decoder = ctx.make_decoder(field.data_type(), field.is_nullable(), field.metadata())?;

Ok(Self {
data_type: data_type.clone(),
Expand Down
12 changes: 10 additions & 2 deletions arrow-json/src/reader/map_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -53,8 +53,16 @@ impl MapArrayDecoder {
_ => unreachable!(),
};

let keys = ctx.make_decoder(fields[0].data_type(), fields[0].is_nullable())?;
let values = ctx.make_decoder(fields[1].data_type(), fields[1].is_nullable())?;
let keys = ctx.make_decoder(
fields[0].data_type(),
fields[0].is_nullable(),
fields[0].metadata(),
)?;
let values = ctx.make_decoder(
fields[1].data_type(),
fields[1].is_nullable(),
fields[1].metadata(),
)?;

Ok(Self {
data_type: data_type.clone(),
Expand Down
Loading
Loading