
Avro missing types#9291

Open
jecsand838 wants to merge 2 commits into apache:main from jecsand838:avro-missing-types

@jecsand838 jecsand838 commented Jan 28, 2026

Which issue does this PR close?

Rationale for this change

NOTE TO REVIEWERS: Over 1500 lines of this diff are tests.

arrow-avro currently cannot encode/decode a number of Arrow DataTypes, and some types have schema/encoding mismatches that can lead to incorrect data (even when encoding succeeds).

The goal is:

  • No more ArrowError::NotYetImplemented (or similar) when writing/reading an Arrow RecordBatch containing supported Arrow types, excluding Sparse Unions (will be handled separately).
  • When compiled with feature = "avro_custom_types": Arrow to Avro to Arrow should round-trip the Arrow DataType (including width, signedness, time units, and relevant metadata) using Arrow-specific custom logical types that follow the established arrow.* pattern.
  • When compiled without avro_custom_types: Arrow types should be encoded to the closest standard Avro primitive / logical type, with any necessary lossy conversions documented and consistently applied.
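
The difference between the two modes can be sketched for a single narrow integer. This is a minimal, std-only illustration and not the arrow-avro implementation: `ReadType`, `widen_i8`, and `read_back` are hypothetical names, and it assumes Int8 is carried as a 32-bit Avro `int` in both modes.

```rust
use std::convert::TryFrom;

/// Hypothetical: the Arrow value a reader reconstructs from an Avro `int`.
#[derive(Debug, PartialEq)]
enum ReadType {
    Int8(i8),
    Int32(i32),
}

/// Widen an Arrow Int8 value into the 32-bit Avro `int` domain.
/// The value itself is preserved; only the type width changes.
fn widen_i8(v: i8) -> i32 {
    i32::from(v)
}

/// Read an Avro `int` back into an Arrow value.
/// With a custom logical type annotation the reader can narrow to Int8;
/// without it, the reader only knows it saw an Avro `int` (Arrow Int32).
fn read_back(avro_int: i32, annotated_as_int8: bool) -> Result<ReadType, String> {
    if annotated_as_int8 {
        i8::try_from(avro_int)
            .map(ReadType::Int8)
            .map_err(|_| format!("{avro_int} does not fit in Int8"))
    } else {
        Ok(ReadType::Int32(avro_int))
    }
}
```

In this sketch the values survive either way, but only the annotated path returns the original Arrow width. That is the sense in which the non-custom fallback is lossy at the type level.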

What changes are included in this PR?

Implements all Arrow types previously missing from arrow-avro, except for Sparse Unions.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, additional type support is being added which is user-facing.

# What changes are included in this PR?

- Introduced support for Avro custom logical types under the `avro_custom_types` feature. Added mappings for:
  - Int8, Int16, UInt8, UInt16, UInt32, UInt64.
  - Float16.
  - Interval (YearMonth, DayTime).
  - Custom logical types for Time32, Time64, Timestamps, and Date64.

- Updated schema handling to generate appropriate Avro JSON based on feature flag.

- Added specialized encoders/decoders to handle custom types, ensuring compatibility with Avro's logical types.

- Adjusted `Codec` enum and related encoding paths for precise storage (e.g., UInt64 stored as fixed(8), Float16 as fixed(2)).
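
As a rough illustration of why fixed-size storage is used here: `u64::MAX` does not fit in Avro's signed 64-bit `long`, so a raw 8-byte `fixed` keeps the full range. The helpers below are hypothetical and std-only. The little-endian byte order is an assumption (the description above does not state it), and Float16 is represented by its raw `u16` bit pattern because std has no stable half-precision type.

```rust
use std::convert::TryFrom;

/// Encode a u64 as the 8-byte payload of an Avro `fixed(8)`.
/// Assumption: little-endian byte order.
fn u64_to_fixed8(v: u64) -> [u8; 8] {
    v.to_le_bytes()
}

/// Decode the 8-byte payload back into a u64 (same byte-order assumption).
fn fixed8_to_u64(b: [u8; 8]) -> u64 {
    u64::from_le_bytes(b)
}

/// Encode raw IEEE 754 half-precision bits as the payload of a `fixed(2)`.
fn f16_bits_to_fixed2(bits: u16) -> [u8; 2] {
    bits.to_le_bytes()
}

/// Decode the 2-byte payload back into raw half-precision bits.
fn fixed2_to_f16_bits(b: [u8; 2]) -> u16 {
    u16::from_le_bytes(b)
}

/// True when the value could also be stored in a signed 64-bit Avro `long`.
fn fits_in_avro_long(v: u64) -> bool {
    i64::try_from(v).is_ok()
}
```

For example, `fits_in_avro_long(u64::MAX)` is false, while `fixed8_to_u64(u64_to_fixed8(u64::MAX))` returns the original value.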

# Are these changes tested?

Yes, new unit tests verify:
- Schema and type mappings.
- Avro serialization and deserialization for custom logical types.
- Default value handling and boundary cases for custom types.

# Are there any user-facing changes?

Yes:
- New feature flag (`avro_custom_types`) enabling advanced logical types.
- Enhanced custom type support for integration with extended Avro schemas.
@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Jan 28, 2026
…custom_types` feature flag. Updates schema handling, encoders, and readers to leverage Arrow-native fixed(16) representation for custom logical type, preserving full range and signed values. Adds unit tests for round-trip serialization/deserialization.
@jecsand838 jecsand838 marked this pull request as ready for review January 29, 2026 22:33

@jecsand838 jecsand838 left a comment


Self review.

Comment on lines +1249 to +1251
let months = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
let days = u32::from_le_bytes([b[4], b[5], b[6], b[7]]);
let millis = u32::from_le_bytes([b[8], b[9], b[10], b[11]]);

Made this update to align with the newer code.
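
For context, the three lines under review read the Avro `duration` layout: a 12-byte `fixed` holding three little-endian unsigned 32-bit integers (months, days, milliseconds). A self-contained sketch of that decode with a matching encode follows. `AvroDuration`, `decode_duration`, and `encode_duration` are hypothetical names for illustration and are not taken from the diff.

```rust
use std::convert::TryFrom;

/// Decoded Avro `duration`: three unsigned 32-bit little-endian integers.
#[derive(Debug, PartialEq, Clone, Copy)]
struct AvroDuration {
    months: u32,
    days: u32,
    millis: u32,
}

/// Decode a 12-byte `fixed` payload; returns None if the length is not 12.
fn decode_duration(bytes: &[u8]) -> Option<AvroDuration> {
    let b = <&[u8; 12]>::try_from(bytes).ok()?;
    Some(AvroDuration {
        months: u32::from_le_bytes([b[0], b[1], b[2], b[3]]),
        days: u32::from_le_bytes([b[4], b[5], b[6], b[7]]),
        millis: u32::from_le_bytes([b[8], b[9], b[10], b[11]]),
    })
}

/// Encode the three components back into the 12-byte layout.
fn encode_duration(d: AvroDuration) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&d.months.to_le_bytes());
    out[4..8].copy_from_slice(&d.days.to_le_bytes());
    out[8..12].copy_from_slice(&d.millis.to_le_bytes());
    out
}
```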

DataType::Null => Value::String("null".into()),
DataType::Boolean => Value::String("boolean".into()),
DataType::Int8 | DataType::Int16 | DataType::UInt8 | DataType::UInt16 | DataType::Int32 => {
#[cfg(not(feature = "avro_custom_types"))]

This was added because these are not native Avro types. When #[cfg(feature = "avro_custom_types")] is enabled, we now annotate the metadata with a custom logicalType. This makes round-tripping easier and improves compatibility with Arrow DataTypes.
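
A minimal sketch of what that annotation might look like in the generated schema JSON. Everything here is an assumption for illustration: `narrow_int_schema` is a hypothetical helper, the runtime `custom_types` flag stands in for the compile-time feature, and the `arrow.<type>` logical type names are guessed from the arrow.* pattern rather than copied from the diff.

```rust
/// Hypothetical helper: render Avro schema JSON for an Arrow integer type
/// that is stored as a 32-bit Avro `int`.
/// `custom_types` stands in for `cfg(feature = "avro_custom_types")`.
fn narrow_int_schema(arrow_type: &str, custom_types: bool) -> String {
    if custom_types {
        // Assumed logical type name, e.g. "arrow.int8".
        format!(r#"{{"type":"int","logicalType":"arrow.{}"}}"#, arrow_type)
    } else {
        // Fallback: plain Avro `int`, so the original Arrow width is not recorded.
        r#"{"type":"int"}"#.to_string()
    }
}
```

Calling `narrow_int_schema("int8", true)` yields `{"type":"int","logicalType":"arrow.int8"}`, while passing `false` yields `{"type":"int"}`.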

assert_eq!(expected_str, actual_str);
Ok(())
}


Existing e2e tests are preserved to ensure backwards compatibility is maintained.

@jecsand838

@alamb @nathaniel-d-ef @mzabaluev @EmilyMatt @getChan

I ran into some unimplemented Arrow DataTypes while preparing to work on apache/datafusion#7679 and realized that now is the time to address them, given the upcoming Arrow v58 release.

Most of this PR involves ensuring that all Arrow DataTypes (except for sparse Unions) are implemented and, when the avro_custom_types feature flag is set, support round-tripping.

Roughly half of this PR is tests, but I know it's large. Any help with reviews would be huge!



Successfully merging this pull request may close these issues.

[arrow-avro] Add missing Arrow DataType support with avro_custom_types round-trip + non-custom fallbacks
