-
Notifications
You must be signed in to change notification settings - Fork 1.2k
WIP: Arrow reader/writer for geometry and geography #8801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,210 @@ | ||
| use std::sync::Arc; | ||
|
|
||
| use arrow::array::{Array, ArrayRef, BinaryArray, BinaryViewArray, LargeBinaryArray, make_array}; | ||
| use arrow::buffer::NullBuffer; | ||
| use arrow::error::Result; | ||
| use arrow_schema::Field; | ||
| use arrow_schema::{ArrowError, DataType, extension::ExtensionType}; | ||
| use serde::{Deserialize, Serialize}; | ||
|
|
||
| #[derive(Copy, Clone, Debug, Serialize, Deserialize)] | ||
| pub enum Hint { | ||
| Geometry, | ||
| Geography, | ||
| } | ||
|
|
||
| #[derive(Clone, Debug, Default, Serialize, Deserialize)] | ||
| pub struct Metadata { | ||
| #[serde(skip_serializing_if = "Option::is_none")] | ||
| pub crs: Option<String>, // TODO: explore when this is valid JSON to avoid double escaping | ||
| #[serde(skip_serializing_if = "Option::is_none")] | ||
| pub algorithm: Option<String>, | ||
| #[serde(skip_serializing_if = "Option::is_none")] | ||
| type_hint: Option<Hint>, | ||
| } | ||
|
|
||
| impl Metadata { | ||
| pub fn new(crs: Option<String>, algorithm: Option<String>) -> Self { | ||
| Self { | ||
| crs, | ||
| algorithm, | ||
| type_hint: None, | ||
| } | ||
| } | ||
|
|
||
| pub fn with_type_hint(mut self, type_hint: Hint) -> Self { | ||
| self.type_hint = Some(type_hint); | ||
| self | ||
| } | ||
|
|
||
| pub fn set_type_hint(&mut self, type_hint: Hint) { | ||
| self.type_hint = Some(type_hint) | ||
| } | ||
|
|
||
| pub fn type_hint(&self) -> Option<Hint> { | ||
| if self.type_hint.is_some() { | ||
| return self.type_hint; | ||
| } | ||
|
|
||
| match &self.algorithm { | ||
| Some(s) if s.to_lowercase() == "planar" => Some(Hint::Geometry), | ||
| Some(_) => Some(Hint::Geography), | ||
| None => None, | ||
| } | ||
| } | ||
| } | ||
|
|
||
| #[derive(Debug, Default)] | ||
| pub struct WkbType(Metadata); | ||
|
|
||
| impl WkbType { | ||
| pub fn new_geometry(metadata: Option<Metadata>) -> Self { | ||
| Self(metadata.unwrap_or_default().with_type_hint(Hint::Geometry)) | ||
| } | ||
|
|
||
| pub fn new_geography(metadata: Option<Metadata>) -> Self { | ||
| Self(metadata.unwrap_or_default().with_type_hint(Hint::Geography)) | ||
| } | ||
| } | ||
|
|
||
| impl ExtensionType for WkbType { | ||
| const NAME: &'static str = "geoarrow.wkb"; | ||
|
|
||
| type Metadata = Metadata; | ||
|
|
||
| fn metadata(&self) -> &Self::Metadata { | ||
| &self.0 | ||
| } | ||
|
|
||
| fn serialize_metadata(&self) -> Option<String> { | ||
| serde_json::to_string(&self.0).ok() | ||
| } | ||
|
|
||
| fn deserialize_metadata(metadata: Option<&str>) -> Result<Self::Metadata> { | ||
| let Some(metadata) = metadata else { | ||
| return Ok(Self::Metadata::default()); | ||
| }; | ||
|
|
||
| serde_json::from_str(metadata).map_err(|e| ArrowError::JsonError(e.to_string())) | ||
| } | ||
|
|
||
| fn supports_data_type(&self, data_type: &arrow_schema::DataType) -> Result<()> { | ||
| match data_type { | ||
| DataType::Binary | DataType::LargeBinary | DataType::BinaryView => Ok(()), | ||
| dt => Err(ArrowError::InvalidArgumentError(format!( | ||
| "Geometry data type mismatch, expected one of Binary, LargeBinary, BinaryView. Found {dt}" | ||
| ))), | ||
| } | ||
| } | ||
|
|
||
| fn try_new(data_type: &arrow_schema::DataType, metadata: Self::Metadata) -> Result<Self> { | ||
| let wkb = Self(metadata); | ||
| wkb.supports_data_type(data_type)?; | ||
| Ok(wkb) | ||
| } | ||
| } | ||
|
|
||
| #[derive(Debug)] | ||
| pub struct WkbArray { | ||
| inner: ArrayRef, | ||
| metadata: Metadata, | ||
| } | ||
|
|
||
| impl WkbArray { | ||
|
Comment on lines
+108
to
+113
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This @alamb do you have some time to quickly look over this new type (as it's still WIP) and provide any feedback regarding the general API? I've tried to mimic the work done for the
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (It seems reasonable from my end although I don't really know how the extension array/extension type/metadata is supposed to work from the arrow-rs end of things and Andrew is definitely the right person to answer that) |
||
| pub fn try_new_geometry(inner: &dyn Array, metadata: Metadata) -> Result<Self> { | ||
| let inner = make_array(inner.to_data()); | ||
| let metadata = metadata.with_type_hint(Hint::Geometry); | ||
|
|
||
| Ok(Self { inner, metadata }) | ||
| } | ||
|
|
||
| pub fn try_new_geography(inner: &dyn Array, metadata: Metadata) -> Result<Self> { | ||
| let inner = make_array(inner.to_data()); | ||
| let metadata = metadata.with_type_hint(Hint::Geography); | ||
|
|
||
| Ok(Self { inner, metadata }) | ||
| } | ||
|
|
||
| /// Returns a reference to the underlying [`ArrayRef`] | ||
| pub fn inner(&self) -> &ArrayRef { | ||
| &self.inner | ||
| } | ||
|
|
||
| /// Returns the inner [`ArrayRef`], consuming self | ||
| pub fn into_inner(self) -> ArrayRef { | ||
| self.inner | ||
| } | ||
|
|
||
| /// Returns a reference to the [`Metadata`] associated with this [`WkbArray`] | ||
| pub fn metadata(&self) -> &Metadata { | ||
| &self.metadata | ||
| } | ||
|
|
||
| /// Return the WKB (Well-Known-Binary) data stored at the given row | ||
| pub fn value(&self, index: usize) -> &[u8] { | ||
| if let Some(bv) = self.inner.as_any().downcast_ref::<BinaryViewArray>() { | ||
| bv.value(index) | ||
| } else if let Some(b) = self.inner.as_any().downcast_ref::<BinaryArray>() { | ||
| b.value(index) | ||
| } else if let Some(lb) = self.inner.as_any().downcast_ref::<LargeBinaryArray>() { | ||
| lb.value(index) | ||
| } else { | ||
| // Panic safety: Our try_new method ensures our inner array is one of the expected | ||
| // binary array types | ||
| unreachable!() | ||
| } | ||
| } | ||
|
|
||
| /// Return a [`Field`] to represent this [`WkbArray`] in an [`arrow_schema::Schema`] with a particular name | ||
| pub fn field(&self, name: impl Into<String>) -> Field { | ||
| Field::new( | ||
| name.into(), | ||
| self.data_type().clone(), | ||
| self.inner.is_nullable(), | ||
| ) | ||
| .with_extension_type(WkbType(self.metadata().clone())) | ||
| } | ||
|
|
||
| /// Returns a new [`DataType`] representing this [`WkbArray`]'s inner type | ||
| pub fn data_type(&self) -> &DataType { | ||
| self.inner.data_type() | ||
| } | ||
|
|
||
| pub fn slice(&self, offset: usize, length: usize) -> Self { | ||
| let inner = self.inner.slice(offset, length); | ||
| Self { | ||
| inner, | ||
| metadata: self.metadata().clone(), | ||
| } | ||
| } | ||
|
|
||
| pub fn len(&self) -> usize { | ||
| self.inner.len() | ||
| } | ||
|
|
||
| pub fn is_empty(&self) -> bool { | ||
| self.inner.is_empty() | ||
| } | ||
|
|
||
| pub fn nulls(&self) -> Option<&NullBuffer> { | ||
| self.inner.nulls() | ||
| } | ||
|
|
||
| /// Is the element at index null? | ||
| pub fn is_null(&self, index: usize) -> bool { | ||
| self.nulls().is_some_and(|n| n.is_null(index)) | ||
| } | ||
|
|
||
| /// Is the element at index valid (not null)? | ||
| pub fn is_valid(&self, index: usize) -> bool { | ||
| !self.is_null(index) | ||
| } | ||
|
|
||
| // TODO: implement iter() | ||
| } | ||
|
|
||
| impl From<WkbArray> for ArrayRef { | ||
| fn from(wkb_array: WkbArray) -> Self { | ||
| Arc::new(wkb_array.into_inner()) | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -55,6 +55,25 @@ pub(crate) fn try_add_extension_type( | |
| LogicalType::Json => { | ||
| arrow_field.try_with_extension_type(arrow_schema::extension::Json::default())?; | ||
| } | ||
| #[cfg(feature = "geospatial")] | ||
| LogicalType::Geometry { crs } => { | ||
| use parquet_geospatial::WkbTypeHint; | ||
|
|
||
| let md = parquet_geospatial::WkbMetadata::new(crs, None) | ||
| .with_type_hint(WkbTypeHint::Geometry); | ||
| arrow_field | ||
| .try_with_extension_type(parquet_geospatial::WkbType::new_geometry(Some(md)))?; | ||
| } | ||
| #[cfg(feature = "geospatial")] | ||
| LogicalType::Geography { crs, algorithm } => { | ||
| use parquet_geospatial::WkbTypeHint; | ||
|
|
||
| let algorithm = algorithm.map(|a| a.to_string()); | ||
| let md = parquet_geospatial::WkbMetadata::new(crs, algorithm) | ||
| .with_type_hint(WkbTypeHint::Geography); | ||
| arrow_field | ||
| .try_with_extension_type(parquet_geospatial::WkbType::new_geography(Some(md)))?; | ||
| } | ||
| _ => {} | ||
| }; | ||
| Ok(arrow_field) | ||
|
|
@@ -75,6 +94,10 @@ pub(crate) fn has_extension_type(parquet_type: &Type) -> bool { | |
| LogicalType::Uuid => true, | ||
| #[cfg(feature = "arrow_canonical_extension_types")] | ||
| LogicalType::Json => true, | ||
| #[cfg(feature = "geospatial")] | ||
| LogicalType::Geometry { .. } => true, | ||
| #[cfg(feature = "geospatial")] | ||
| LogicalType::Geography { .. } => true, | ||
| _ => false, | ||
| } | ||
| } | ||
|
|
@@ -133,3 +156,46 @@ pub(crate) fn logical_type_for_string(field: &Field) -> Option<LogicalType> { | |
| pub(crate) fn logical_type_for_string(_field: &Field) -> Option<LogicalType> { | ||
| Some(LogicalType::String) | ||
| } | ||
|
|
||
| #[cfg(feature = "geospatial")] | ||
| pub(crate) fn logical_type_for_binary(field: &Field) -> Option<LogicalType> { | ||
| use parquet_geospatial::WkbType; | ||
| use parquet_geospatial::WkbTypeHint; | ||
|
|
||
| match field.extension_type_name() { | ||
| Some(n) if n == WkbType::NAME => match field.try_extension_type::<WkbType>() { | ||
| Ok(wkb_type) => match wkb_type.metadata().type_hint() { | ||
| Some(WkbTypeHint::Geometry) => Some(LogicalType::Geometry { | ||
| crs: wkb_type.metadata().crs.clone(), | ||
| }), | ||
| Some(WkbTypeHint::Geography) => Some(LogicalType::Geography { | ||
| crs: wkb_type.metadata().crs.clone(), | ||
| algorithm: wkb_type | ||
| .metadata() | ||
| .algorithm | ||
| .clone() | ||
| .and_then(|a| a.parse().ok()), | ||
| }), | ||
| _ => None, | ||
| }, | ||
| Err(_e) => None, | ||
| }, | ||
| _ => None, | ||
| } | ||
| } | ||
|
|
||
| #[cfg(not(feature = "geospatial"))] | ||
| pub(crate) fn logical_type_for_binary(field: &Field) -> Option<LogicalType> { | ||
| None | ||
| } | ||
|
|
||
| #[cfg(feature = "geospatial")] | ||
| pub(crate) fn logical_type_for_binary_view(field: &Field) -> Option<LogicalType> { | ||
| // TODO: make this better-er | ||
| logical_type_for_binary(field) | ||
| } | ||
|
|
||
| #[cfg(not(feature = "geospatial"))] | ||
| pub(crate) fn logical_type_for_binary_view(field: &Field) -> Option<LogicalType> { | ||
| None | ||
| } | ||
|
Comment on lines
+198
to
+201
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure anybody is using |
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to highlight this change because I think it merits discussion. I've introduced a
type_hintinto the metadata for this extension because I believe it's necessary to solve a round-trip for a specific ambiguous case.If my reading of the spec is correct, having a parquet
GEOGRAPHYlogical type with no metadata is a valid case (the default values ofcrs: OGC:CRS84andalgorithm: SPHERICALwould apply for readers of the column). If we want to support writing this case while preserving the (lack of) input metadata we need some additional information to inform us of the underlying type of the array data.This field is obviously not a part of the parquet metadata spec, nor the GeoArrow metadata spec. However, the
WkbArraythat has been introduced here doesn't claim to be either, so perhaps this is ok. If we want to ensure the metadata here strictly follows one spec or the other we will likely have to drop back to determining the parquet type of the array heuristically based on thealgorithmfield which could result in incorrect logical types for the ambiguous case. While I suspect this case is relatively rare, I think it's important to note it could happen.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely correct to spot this case! I am not personally worried about this one...I think you're safe to canonically export spherical edges as omitted if you'd like (this is what Arrow C++ will do today). Also if you're passionate about preserving this particular round trip, go for it!
I may have missed a case, although I think that all of the metadata you're proposing here is correct for both the Parquet and GeoArrow specs (there are just multiple ways to represent some things and it's often useful to roundtrip things where possible?).
Er, I suppose putting a
"type_hint"key in GeoArrow extension metadata is a bit fishy although I don't think it will cause any existing parsers to error.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the
type_hint(regardless of whether or not this would be tolerated by existing parsers) more or less makes this metadata neither Parquet nor GeoArrow. I think the more philosophical question of "should this metadata be Parquet or GeoArrow" is important here. I'll preface this by saying I don't really feel too strongly about this and my primary goal is to make sure this is usable for the community at large, however, I'll at least detail my thought process behind it.I think even without the
type_hintthe metadata we present here is already not really GeoArrow metadata, in spite of it generally looking compatible. Since cases like{"crs": "projjson:my_crs_metadata_key"}don't actually communicate the CRS without user (or other library, likelygeoarrow-rs) intervention there's already inherently some additional inspection of this metdata necessary to convert it to GeoArrow. If users of the metadata are going to need to inspect and potentially convert it anyway, we may as well allow the metadata to support all the cases the underlying parquet spec supports.Now, all that being said, I think we're in agreement that the case this supports isn't a huge concern. If it makes most sense from an ecosystem compatibility perspective to discard the
type_hintand accept that we technically have an unsupported case I think that's likely the strongest justification for how to move forward here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think it's important to output valid GeoArrow metadata, at least in the sense that it conforms to the spec and tries its best to pass on any information it has (arbitrary strings in GeoArrow are permitted for this purpose!). (This Metadata struct can definitely represent anything we'd like it to...I'm talking strictly about the JSON output we write to the output Field).
I see this more of an inability to roundtrip one of two possible representations of spherical edges from Parquet, through GeoArrow, then back to Parquet. I'm comfortable with that but let me know if there's something beyond this I didn't see!