3 changes: 3 additions & 0 deletions parquet-geospatial/Cargo.toml
@@ -31,8 +31,11 @@ edition = { workspace = true }
rust-version = { workspace = true }

[dependencies]
arrow = { workspace = true }
arrow-schema = { workspace = true }
geo-traits = { version = "0.3" }
serde = { version = "1.0", default-features = false, features = ["derive"]}
serde_json = { version = "1.0", default-features = false, features = ["std"]}
wkb = { version = "0.9.1" }

[dev-dependencies]
7 changes: 7 additions & 0 deletions parquet-geospatial/src/lib.rs
@@ -30,3 +30,10 @@
pub mod bounding;
pub mod interval;
pub mod testing;

mod types;

pub use types::Hint as WkbTypeHint;
pub use types::Metadata as WkbMetadata;
pub use types::WkbArray;
pub use types::WkbType;
210 changes: 210 additions & 0 deletions parquet-geospatial/src/types.rs
@@ -0,0 +1,210 @@
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, BinaryArray, BinaryViewArray, LargeBinaryArray, make_array};
use arrow::buffer::NullBuffer;
use arrow::error::Result;
use arrow_schema::Field;
use arrow_schema::{ArrowError, DataType, extension::ExtensionType};
use serde::{Deserialize, Serialize};

#[derive(Copy, Clone, Debug, Serialize, Deserialize)]
pub enum Hint {
    Geometry,
    Geography,
}

#[derive(Clone, Debug, Default, Serialize, Deserialize)]
pub struct Metadata {
    #[serde(skip_serializing_if = "Option::is_none")]
    pub crs: Option<String>, // TODO: explore when this is valid JSON to avoid double escaping
    #[serde(skip_serializing_if = "Option::is_none")]
    pub algorithm: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    type_hint: Option<Hint>,
}
Comment on lines +22 to +24
Contributor Author

I want to highlight this change because I think it merits discussion. I've introduced a type_hint into the metadata for this extension because I believe it's necessary to solve a round-trip problem for a specific ambiguous case.

If my reading of the spec is correct, having a Parquet GEOGRAPHY logical type with no metadata is a valid case (the default values of crs: OGC:CRS84 and algorithm: SPHERICAL would apply for readers of the column). If we want to support writing this case while preserving the (lack of) input metadata, we need some additional information to tell us the underlying type of the array data.

This field is obviously not part of the Parquet metadata spec, nor the GeoArrow metadata spec. However, the WkbArray introduced here doesn't claim to be either, so perhaps this is ok. If we want to ensure the metadata here strictly follows one spec or the other, we will likely have to fall back to determining the Parquet type of the array heuristically based on the algorithm field, which could result in incorrect logical types for the ambiguous case. While I suspect this case is relatively rare, I think it's important to note that it could happen.
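
To make the ambiguous case concrete, here is a minimal sketch using the `Metadata` and `Hint` types from this diff (via their `WkbMetadata` and `WkbTypeHint` re-exports); without the hint, a GEOGRAPHY column written with neither crs nor algorithm cannot be told apart from a GEOMETRY column:

```rust
use parquet_geospatial::{WkbMetadata, WkbTypeHint};

fn main() {
    // A GEOGRAPHY column relying on the spec defaults (OGC:CRS84, SPHERICAL)
    // arrives with neither crs nor algorithm set...
    let bare = WkbMetadata::new(None, None);
    // ...so the algorithm-based heuristic has nothing to go on.
    assert!(bare.type_hint().is_none());

    // Recording the original logical type explicitly disambiguates the column.
    let tagged = bare.with_type_hint(WkbTypeHint::Geography);
    assert!(matches!(tagged.type_hint(), Some(WkbTypeHint::Geography)));
}
```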

Member

> If we want to support writing this case while preserving the (lack of) input metadata

You're absolutely correct to spot this case! I am not personally worried about this one...I think you're safe to canonically export spherical edges as omitted if you'd like (this is what Arrow C++ will do today). Also if you're passionate about preserving this particular round trip, go for it!

I may have missed a case, although I think that all of the metadata you're proposing here is correct for both the Parquet and GeoArrow specs (there are just multiple ways to represent some things and it's often useful to roundtrip things where possible?).

Er, I suppose putting a "type_hint" key in GeoArrow extension metadata is a bit fishy, although I don't think it will cause any existing parsers to error.
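
For reference, here is a sketch of how the two representations of spherical edges being discussed would serialize with the `Metadata` struct from this diff; readers would apply the OGC:CRS84 and SPHERICAL defaults either way:

```rust
use parquet_geospatial::WkbMetadata;

fn main() {
    // Explicit spherical edges: serializes as {"algorithm":"spherical"}.
    let explicit = WkbMetadata::new(None, Some("spherical".to_string()));
    println!("{}", serde_json::to_string(&explicit).unwrap());

    // Spherical edges omitted (the canonical export discussed above):
    // serializes as {}.
    let omitted = WkbMetadata::new(None, None);
    println!("{}", serde_json::to_string(&omitted).unwrap());
}
```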

Contributor Author

Right, the type_hint (regardless of whether or not it would be tolerated by existing parsers) more or less makes this metadata neither Parquet nor GeoArrow. I think the more philosophical question of "should this metadata be Parquet or GeoArrow" is important here. I'll preface this by saying I don't feel too strongly about this and my primary goal is to make sure this is usable for the community at large; however, I'll at least detail my thought process behind it.

I think even without the type_hint, the metadata we present here is already not really GeoArrow metadata, in spite of it generally looking compatible. Since cases like {"crs": "projjson:my_crs_metadata_key"} don't actually communicate the CRS without user (or other library, likely geoarrow-rs) intervention, there's already inherently some additional inspection of this metadata necessary to convert it to GeoArrow. If users of the metadata are going to need to inspect and potentially convert it anyway, we may as well allow the metadata to support all the cases the underlying Parquet spec supports.

Now, all that being said, I think we're in agreement that the case this supports isn't a huge concern. If it makes the most sense from an ecosystem-compatibility perspective to discard the type_hint and accept that we technically have an unsupported case, I think that's likely the strongest justification for how to move forward here.
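
A sketch of the kind of extra inspection being described; the `resolve_crs` helper and the `projjson:` key lookup are illustrative here, not something this PR defines:

```rust
use std::collections::HashMap;

// Hypothetical helper: a crs value of the form "projjson:<key>" points at a
// Parquet file key/value metadata entry rather than containing the CRS
// itself, so a consumer must dereference it before building GeoArrow metadata.
fn resolve_crs(crs: &str, file_key_value_metadata: &HashMap<String, String>) -> Option<String> {
    match crs.strip_prefix("projjson:") {
        // Look up the PROJJSON document stored under the named key.
        Some(key) => file_key_value_metadata.get(key).cloned(),
        // Anything else is passed through unchanged.
        None => Some(crs.to_string()),
    }
}

fn main() {
    let mut kv = HashMap::new();
    kv.insert("my_crs_metadata_key".to_string(), r#"{"type":"GeographicCRS"}"#.to_string());
    assert_eq!(
        resolve_crs("projjson:my_crs_metadata_key", &kv).as_deref(),
        Some(r#"{"type":"GeographicCRS"}"#)
    );
}
```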

Member

> metadata we present here is already not really GeoArrow metadata

I do think it's important to output valid GeoArrow metadata, at least in the sense that it conforms to the spec and tries its best to pass on any information it has (arbitrary strings in GeoArrow are permitted for this purpose!). (This Metadata struct can definitely represent anything we'd like it to...I'm talking strictly about the JSON output we write to the output Field).

> accept that we technically have an unsupported case

I see this more as an inability to roundtrip one of two possible representations of spherical edges from Parquet, through GeoArrow, and back to Parquet. I'm comfortable with that, but let me know if there's something beyond this I didn't see!


impl Metadata {
    pub fn new(crs: Option<String>, algorithm: Option<String>) -> Self {
        Self {
            crs,
            algorithm,
            type_hint: None,
        }
    }

    pub fn with_type_hint(mut self, type_hint: Hint) -> Self {
        self.type_hint = Some(type_hint);
        self
    }

    pub fn set_type_hint(&mut self, type_hint: Hint) {
        self.type_hint = Some(type_hint)
    }

    pub fn type_hint(&self) -> Option<Hint> {
        if self.type_hint.is_some() {
            return self.type_hint;
        }

        match &self.algorithm {
            Some(s) if s.to_lowercase() == "planar" => Some(Hint::Geometry),
            Some(_) => Some(Hint::Geography),
            None => None,
        }
    }
}

#[derive(Debug, Default)]
pub struct WkbType(Metadata);

impl WkbType {
    pub fn new_geometry(metadata: Option<Metadata>) -> Self {
        Self(metadata.unwrap_or_default().with_type_hint(Hint::Geometry))
    }

    pub fn new_geography(metadata: Option<Metadata>) -> Self {
        Self(metadata.unwrap_or_default().with_type_hint(Hint::Geography))
    }
}

impl ExtensionType for WkbType {
    const NAME: &'static str = "geoarrow.wkb";

    type Metadata = Metadata;

    fn metadata(&self) -> &Self::Metadata {
        &self.0
    }

    fn serialize_metadata(&self) -> Option<String> {
        serde_json::to_string(&self.0).ok()
    }

    fn deserialize_metadata(metadata: Option<&str>) -> Result<Self::Metadata> {
        let Some(metadata) = metadata else {
            return Ok(Self::Metadata::default());
        };

        serde_json::from_str(metadata).map_err(|e| ArrowError::JsonError(e.to_string()))
    }

    fn supports_data_type(&self, data_type: &arrow_schema::DataType) -> Result<()> {
        match data_type {
            DataType::Binary | DataType::LargeBinary | DataType::BinaryView => Ok(()),
            dt => Err(ArrowError::InvalidArgumentError(format!(
                "Geometry data type mismatch, expected one of Binary, LargeBinary, BinaryView. Found {dt}"
            ))),
        }
    }

    fn try_new(data_type: &arrow_schema::DataType, metadata: Self::Metadata) -> Result<Self> {
        let wkb = Self(metadata);
        wkb.supports_data_type(data_type)?;
        Ok(wkb)
    }
}

#[derive(Debug)]
pub struct WkbArray {
    inner: ArrayRef,
    metadata: Metadata,
}

impl WkbArray {
Comment on lines +108 to +113
Contributor Author

This WkbArray is a new addition in this most recent commit.

@alamb do you have some time to quickly look over this new type (as it's still WIP) and provide any feedback regarding the general API? I've tried to mimic the work done for the VariantArray as closely as possible.

Member

(It seems reasonable from my end, although I don't really know how the extension array/extension type/metadata is supposed to work from the arrow-rs end of things, and Andrew is definitely the right person to answer that.)

    pub fn try_new_geometry(inner: &dyn Array, metadata: Metadata) -> Result<Self> {
        let inner = make_array(inner.to_data());
        let metadata = metadata.with_type_hint(Hint::Geometry);

        Ok(Self { inner, metadata })
    }

    pub fn try_new_geography(inner: &dyn Array, metadata: Metadata) -> Result<Self> {
        let inner = make_array(inner.to_data());
        let metadata = metadata.with_type_hint(Hint::Geography);

        Ok(Self { inner, metadata })
    }

    /// Returns a reference to the underlying [`ArrayRef`]
    pub fn inner(&self) -> &ArrayRef {
        &self.inner
    }

    /// Returns the inner [`ArrayRef`], consuming self
    pub fn into_inner(self) -> ArrayRef {
        self.inner
    }

    /// Returns a reference to the [`Metadata`] associated with this [`WkbArray`]
    pub fn metadata(&self) -> &Metadata {
        &self.metadata
    }

    /// Return the WKB (Well-Known-Binary) data stored at the given row
    pub fn value(&self, index: usize) -> &[u8] {
        if let Some(bv) = self.inner.as_any().downcast_ref::<BinaryViewArray>() {
            bv.value(index)
        } else if let Some(b) = self.inner.as_any().downcast_ref::<BinaryArray>() {
            b.value(index)
        } else if let Some(lb) = self.inner.as_any().downcast_ref::<LargeBinaryArray>() {
            lb.value(index)
        } else {
            // Panic safety: Our try_new method ensures our inner array is one of the expected
            // binary array types
            unreachable!()
        }
    }

    /// Return a [`Field`] to represent this [`WkbArray`] in an [`arrow_schema::Schema`] with a particular name
    pub fn field(&self, name: impl Into<String>) -> Field {
        Field::new(
            name.into(),
            self.data_type().clone(),
            self.inner.is_nullable(),
        )
        .with_extension_type(WkbType(self.metadata().clone()))
    }

    /// Returns a new [`DataType`] representing this [`WkbArray`]'s inner type
    pub fn data_type(&self) -> &DataType {
        self.inner.data_type()
    }

    pub fn slice(&self, offset: usize, length: usize) -> Self {
        let inner = self.inner.slice(offset, length);
        Self {
            inner,
            metadata: self.metadata().clone(),
        }
    }

    pub fn len(&self) -> usize {
        self.inner.len()
    }

    pub fn is_empty(&self) -> bool {
        self.inner.is_empty()
    }

    pub fn nulls(&self) -> Option<&NullBuffer> {
        self.inner.nulls()
    }

    /// Is the element at index null?
    pub fn is_null(&self, index: usize) -> bool {
        self.nulls().is_some_and(|n| n.is_null(index))
    }

    /// Is the element at index valid (not null)?
    pub fn is_valid(&self, index: usize) -> bool {
        !self.is_null(index)
    }

    // TODO: implement iter()
}

impl From<WkbArray> for ArrayRef {
    fn from(wkb_array: WkbArray) -> Self {
        Arc::new(wkb_array.into_inner())
    }
}
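
For context, a minimal usage sketch of the `WkbArray` API above; the bytes are placeholders rather than valid WKB, and building the input `BinaryArray` this way is just one convenient option:

```rust
use arrow::array::BinaryArray;
use parquet_geospatial::{WkbArray, WkbMetadata};

fn main() {
    // Wrap an existing binary array whose values are (notionally) WKB blobs.
    let wkb = BinaryArray::from_iter([Some(b"\x01\x01\x00\x00\x00".as_slice()), None]);
    let array = WkbArray::try_new_geometry(&wkb, WkbMetadata::default()).unwrap();

    assert_eq!(array.len(), 2);
    assert!(array.is_null(1));
    assert_eq!(array.value(0), b"\x01\x01\x00\x00\x00".as_slice());

    // Describe the column for an arrow_schema::Schema, carrying the
    // geoarrow.wkb extension type along with it.
    let _field = array.field("geometry");
}
```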
66 changes: 66 additions & 0 deletions parquet/src/arrow/schema/extension.rs
@@ -55,6 +55,25 @@ pub(crate) fn try_add_extension_type(
        LogicalType::Json => {
            arrow_field.try_with_extension_type(arrow_schema::extension::Json::default())?;
        }
        #[cfg(feature = "geospatial")]
        LogicalType::Geometry { crs } => {
            use parquet_geospatial::WkbTypeHint;

            let md = parquet_geospatial::WkbMetadata::new(crs, None)
                .with_type_hint(WkbTypeHint::Geometry);
            arrow_field
                .try_with_extension_type(parquet_geospatial::WkbType::new_geometry(Some(md)))?;
        }
        #[cfg(feature = "geospatial")]
        LogicalType::Geography { crs, algorithm } => {
            use parquet_geospatial::WkbTypeHint;

            let algorithm = algorithm.map(|a| a.to_string());
            let md = parquet_geospatial::WkbMetadata::new(crs, algorithm)
                .with_type_hint(WkbTypeHint::Geography);
            arrow_field
                .try_with_extension_type(parquet_geospatial::WkbType::new_geography(Some(md)))?;
        }
        _ => {}
    };
    Ok(arrow_field)
@@ -75,6 +94,10 @@ pub(crate) fn has_extension_type(parquet_type: &Type) -> bool {
        LogicalType::Uuid => true,
        #[cfg(feature = "arrow_canonical_extension_types")]
        LogicalType::Json => true,
        #[cfg(feature = "geospatial")]
        LogicalType::Geometry { .. } => true,
        #[cfg(feature = "geospatial")]
        LogicalType::Geography { .. } => true,
        _ => false,
    }
}
@@ -133,3 +156,46 @@ pub(crate) fn logical_type_for_string(field: &Field) -> Option<LogicalType> {
pub(crate) fn logical_type_for_string(_field: &Field) -> Option<LogicalType> {
    Some(LogicalType::String)
}

#[cfg(feature = "geospatial")]
pub(crate) fn logical_type_for_binary(field: &Field) -> Option<LogicalType> {
    use parquet_geospatial::WkbType;
    use parquet_geospatial::WkbTypeHint;

    match field.extension_type_name() {
        Some(n) if n == WkbType::NAME => match field.try_extension_type::<WkbType>() {
            Ok(wkb_type) => match wkb_type.metadata().type_hint() {
                Some(WkbTypeHint::Geometry) => Some(LogicalType::Geometry {
                    crs: wkb_type.metadata().crs.clone(),
                }),
                Some(WkbTypeHint::Geography) => Some(LogicalType::Geography {
                    crs: wkb_type.metadata().crs.clone(),
                    algorithm: wkb_type
                        .metadata()
                        .algorithm
                        .clone()
                        .and_then(|a| a.parse().ok()),
                }),
                _ => None,
            },
            Err(_e) => None,
        },
        _ => None,
    }
}

#[cfg(not(feature = "geospatial"))]
pub(crate) fn logical_type_for_binary(_field: &Field) -> Option<LogicalType> {
    None
}

#[cfg(feature = "geospatial")]
pub(crate) fn logical_type_for_binary_view(field: &Field) -> Option<LogicalType> {
    // TODO: make this better-er
    logical_type_for_binary(field)
}

#[cfg(not(feature = "geospatial"))]
pub(crate) fn logical_type_for_binary_view(field: &Field) -> Option<LogicalType> {
    None
}
Comment on lines +198 to +201
Member

I'm not sure anybody is using large_binary on purpose, but if it's easy to do it may be worth supporting that here.
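
If it does turn out to be worth supporting, one possible shape is the same delegation already used for the binary-view case; the `logical_type_for_large_binary` function below is a hypothetical sketch, not part of this diff:

```rust
#[cfg(feature = "geospatial")]
pub(crate) fn logical_type_for_large_binary(field: &Field) -> Option<LogicalType> {
    // LargeBinary columns carry the same geoarrow.wkb extension metadata on
    // the Field, so the Binary lookup can simply be reused.
    logical_type_for_binary(field)
}

#[cfg(not(feature = "geospatial"))]
pub(crate) fn logical_type_for_large_binary(_field: &Field) -> Option<LogicalType> {
    None
}
```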

7 changes: 5 additions & 2 deletions parquet/src/arrow/schema/mod.rs
@@ -39,8 +39,9 @@ mod primitive;
use super::PARQUET_FIELD_ID_META_KEY;
use crate::arrow::ProjectionMask;
use crate::arrow::schema::extension::{
    has_extension_type, logical_type_for_fixed_size_binary, logical_type_for_string,
    logical_type_for_struct, try_add_extension_type
    has_extension_type, logical_type_for_binary, logical_type_for_binary_view,
    logical_type_for_fixed_size_binary, logical_type_for_string, logical_type_for_struct,
    try_add_extension_type,
};
pub(crate) use complex::{ParquetField, ParquetFieldType};

@@ -604,6 +605,7 @@ fn arrow_to_parquet_type(field: &Field, coerce_types: bool) -> Result<Type> {
Type::primitive_type_builder(name, PhysicalType::BYTE_ARRAY)
.with_repetition(repetition)
.with_id(id)
.with_logical_type(logical_type_for_binary(field))
.build()
}
DataType::FixedSizeBinary(length) => {
@@ -617,6 +619,7 @@ fn arrow_to_parquet_type(field: &Field, coerce_types: bool) -> Result<Type> {
DataType::BinaryView => Type::primitive_type_builder(name, PhysicalType::BYTE_ARRAY)
.with_repetition(repetition)
.with_id(id)
.with_logical_type(logical_type_for_binary_view(field))
.build(),
DataType::Decimal32(precision, scale)
| DataType::Decimal64(precision, scale)
18 changes: 18 additions & 0 deletions parquet/src/basic.rs
@@ -1017,6 +1017,24 @@ impl fmt::Display for EdgeInterpolationAlgorithm {
}
}

impl FromStr for EdgeInterpolationAlgorithm {
    type Err = ParquetError;

    fn from_str(s: &str) -> Result<Self> {
        match s.to_ascii_uppercase().as_str() {
            "SPHERICAL" => Ok(EdgeInterpolationAlgorithm::SPHERICAL),
            "VINCENTY" => Ok(EdgeInterpolationAlgorithm::VINCENTY),
            "THOMAS" => Ok(EdgeInterpolationAlgorithm::THOMAS),
            "ANDOYER" => Ok(EdgeInterpolationAlgorithm::ANDOYER),
            "KARNEY" => Ok(EdgeInterpolationAlgorithm::KARNEY),
            unknown => Err(general_err!(
                "Unknown edge interpolation algorithm: {}",
                unknown
            )),
        }
    }
}

impl<'a, R: ThriftCompactInputProtocol<'a>> ReadThrift<'a, R> for EdgeInterpolationAlgorithm {
    fn read_thrift(prot: &mut R) -> Result<Self> {
        let val = prot.read_i32()?;