feat(arrow-avro): Configurable Arrow timezone ID for Avro timestamps#9280
feat(arrow-avro): Configurable Arrow timezone ID for Avro timestamps#9280mzabaluev wants to merge 2 commits intoapache:mainfrom
Conversation
In the ReaderBuilder API, add a new method `with_tz` that allows users to specify the timezone ID for Avro logical types that represent UTC timestamps. The choices are between "+00:00" and "UTC" and can be selected by the new Tz enumeration.
|
This makes more sense in my opinion than using "+00:00" exclusively |
3d58167 to
c61f7e2
Compare
jecsand838
left a comment
There was a problem hiding this comment.
@mzabaluev LMGTM. Really like this improvement to arrow-avro!
For me the big things are:
- An e2e test in
reader/mod.rsusing the newwith_tz. - Better documentation on the public
with_tzmethod.
| /// Sets the timezone representation for Avro timestamp fields. | ||
| pub fn with_tz(mut self, tz: Tz) -> Self { | ||
| self.use_tz = tz; | ||
| self | ||
| } | ||
|
|
There was a problem hiding this comment.
Would you mind adding an e2e test using this?
Also may help to explain in the doc comment what the default value is.
| Codec::TimestampMillis(Some(Tz::OffsetZero)) => Self::TimestampMillisUtc, | ||
| Codec::TimestampMillis(Some(Tz::Utc)) => Self::TimestampMillisUtc, | ||
| Codec::TimestampMillis(None) => Self::TimestampMillisLocal, | ||
| Codec::TimestampMicros(Some(Tz::OffsetZero)) => Self::TimestampMicrosUtc, | ||
| Codec::TimestampMicros(Some(Tz::Utc)) => Self::TimestampMicrosUtc, | ||
| Codec::TimestampMicros(None) => Self::TimestampMicrosLocal, | ||
| Codec::TimestampNanos(Some(Tz::OffsetZero)) => Self::TimestampNanosUtc, | ||
| Codec::TimestampNanos(Some(Tz::Utc)) => Self::TimestampNanosUtc, | ||
| Codec::TimestampNanos(None) => Self::TimestampNanosLocal, |
There was a problem hiding this comment.
This maybe a bit cleaner.
| Codec::TimestampMillis(Some(Tz::OffsetZero)) => Self::TimestampMillisUtc, | |
| Codec::TimestampMillis(Some(Tz::Utc)) => Self::TimestampMillisUtc, | |
| Codec::TimestampMillis(None) => Self::TimestampMillisLocal, | |
| Codec::TimestampMicros(Some(Tz::OffsetZero)) => Self::TimestampMicrosUtc, | |
| Codec::TimestampMicros(Some(Tz::Utc)) => Self::TimestampMicrosUtc, | |
| Codec::TimestampMicros(None) => Self::TimestampMicrosLocal, | |
| Codec::TimestampNanos(Some(Tz::OffsetZero)) => Self::TimestampNanosUtc, | |
| Codec::TimestampNanos(Some(Tz::Utc)) => Self::TimestampNanosUtc, | |
| Codec::TimestampNanos(None) => Self::TimestampNanosLocal, | |
| Codec::TimestampMillis(Some(_)) => Self::TimestampMillisUtc, | |
| Codec::TimestampMillis(None) => Self::TimestampMillisLocal, | |
| Codec::TimestampMicros(Some(_)) => Self::TimestampMicrosUtc, | |
| Codec::TimestampMicros(None) => Self::TimestampMicrosLocal, | |
| Codec::TimestampNanos(Some(_)) => Self::TimestampNanosUtc, | |
| Codec::TimestampNanos(None) => Self::TimestampNanosLocal, |
| impl Display for Tz { | ||
| fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { | ||
| match self { | ||
| Tz::OffsetZero => write!(f, "+00:00"), | ||
| Tz::Utc => write!(f, "UTC"), | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
What about doing something like this?
| impl Display for Tz { | |
| fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { | |
| match self { | |
| Tz::OffsetZero => write!(f, "+00:00"), | |
| Tz::Utc => write!(f, "UTC"), | |
| } | |
| } | |
| } | |
| impl Tz { | |
| pub fn as_str(&self) -> &'static str { | |
| match self { | |
| Self::OffsetZero => "+00:00", | |
| Self::Utc => "UTC", | |
| } | |
| } | |
| } | |
| impl Display for Tz { | |
| fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { | |
| f.write_str(self.as_str()) | |
| } | |
| } |
as_str() can be used to convert directly to Arc<str> without intermediate allocation. Then we can use it both in fmt and again below in Codec::data_type.
Meanwhile we can use f.write_str to avoid the overhead of the write! macro when parsing a format string for simple string literals.
| Self::TimestampMillis(tz) => DataType::Timestamp( | ||
| TimeUnit::Millisecond, | ||
| tz.as_ref().map(|tz| tz.to_string().into()), | ||
| ), | ||
| Self::TimestampMicros(tz) => DataType::Timestamp( | ||
| TimeUnit::Microsecond, | ||
| tz.as_ref().map(|tz| tz.to_string().into()), | ||
| ), | ||
| Self::TimestampNanos(tz) => DataType::Timestamp( | ||
| TimeUnit::Nanosecond, | ||
| tz.as_ref().map(|tz| tz.to_string().into()), | ||
| ), |
There was a problem hiding this comment.
Then down here you could do:
| Self::TimestampMillis(tz) => DataType::Timestamp( | |
| TimeUnit::Millisecond, | |
| tz.as_ref().map(|tz| tz.to_string().into()), | |
| ), | |
| Self::TimestampMicros(tz) => DataType::Timestamp( | |
| TimeUnit::Microsecond, | |
| tz.as_ref().map(|tz| tz.to_string().into()), | |
| ), | |
| Self::TimestampNanos(tz) => DataType::Timestamp( | |
| TimeUnit::Nanosecond, | |
| tz.as_ref().map(|tz| tz.to_string().into()), | |
| ), | |
| Self::TimestampMillis(tz) => DataType::Timestamp( | |
| TimeUnit::Millisecond, | |
| tz.as_ref().map(|tz| tz.as_str().into()), | |
| ), | |
| Self::TimestampMicros(tz) => DataType::Timestamp( | |
| TimeUnit::Microsecond, | |
| tz.as_ref().map(|tz| tz.as_str().into()), | |
| ), | |
| Self::TimestampNanos(tz) => DataType::Timestamp( | |
| TimeUnit::Nanosecond, | |
| tz.as_ref().map(|tz| tz.as_str().into()), | |
| ), |
Which issue does this PR close?
Rationale for this change
Enable an alternative representation of UTC timestamp data types with the "UTC" timezone ID, which is useful for interoperability with applications preferring that form.
What changes are included in this PR?
In the
ReaderBuilderAPI, add a new methodwith_tzthat allows users to specify the timezone ID for Avro logical types that represent UTC timestamps. The choices are between "+00:00" and "UTC" and can be selected by the newTzenumeration.Are these changes tested?
Added unit tests to verify the representation with different
Tzparameter values.Are there any user-facing changes?
A new
with_tzmethod is added toarrow_avro::reader::Builder.