Skip to content

feat: propagate GEOMETRY CRS through PostGIS in both directions#418

Open
jatorre wants to merge 2 commits intoduckdb:mainfrom
jatorre:fix/geometry-srid-ewkb
Open

feat: propagate GEOMETRY CRS through PostGIS in both directions#418
jatorre wants to merge 2 commits intoduckdb:mainfrom
jatorre:fix/geometry-srid-ewkb

Conversation

@jatorre
Copy link
Copy Markdown

@jatorre jatorre commented Mar 20, 2026

Summary

Make DuckDB 1.5's GEOMETRY CRS metadata (GEOMETRY('EPSG:N')) survive a full round trip through PostgreSQL/PostGIS via postgres_scanner.

Write side (DuckDB → PostGIS)

PostgresBinaryWriter::WriteGeometry detects CRS on the LogicalType and writes EWKB instead of plain WKB so PostGIS receives the SRID:

WKB:  [byte_order:1] [type:4]                    [payload...]
EWKB: [byte_order:1] [type|0x20000000:4] [srid:4] [payload...]

Falls back to plain WKB when no CRS is set, preserving previous behaviour. Uses GeoType::HasCRS / GeoType::GetCRS (the earlier draft mistakenly referenced LogicalType::GetCRS, which does not exist on the public DuckDB 1.5 API).

Read side (PostGIS → DuckDB)

PostgresUtils::TypeToLogicalType decodes the SRID from PostGIS's column-level typmod for geometry(TYPE, SRID) columns and returns LogicalType::GEOMETRY("EPSG:N") instead of plain GEOMETRY(). The typmod layout (PostGIS source):

bits meaning
0 has-M
1 has-Z
2–7 geometry type code (POINT=1, LINESTRING=2, …)
8–29 SRID

Untyped geometry columns carry typmod -1 and fall back to plain GEOMETRY (no CRS), matching the previous behaviour.

Before / after

Before After
Write GEOMETRY('EPSG:4326') PostGIS geometry(SRID=0) PostGIS geometry(SRID=4326)
Read geometry(POINT, 4326) DuckDB GEOMETRY (no CRS) DuckDB GEOMETRY('EPSG:4326')

Test plan

  • Unit tests build (make release).
  • clang-format clean (was a CI failure on the previous draft).
  • End-to-end SRID survival via a DuckDB ATTACH against live PostGIS:
    • npm run test:adbc-srid -- postgresql --srid=3857 passes — remote_srid=3857, duckdb_crs=EPSG:3857, duckdb_type=GEOMETRY('EPSG:3857').
    • Multi-shape (point, line, polygon, NULL) byte-for-byte WKB equality round trip passes.

@jatorre jatorre marked this pull request as draft March 20, 2026 04:43
@jatorre jatorre force-pushed the fix/geometry-srid-ewkb branch 2 times, most recently from e3d8ea5 to 774ea01 Compare March 20, 2026 04:52
@jatorre
Copy link
Copy Markdown
Author

jatorre commented Mar 20, 2026

Hey @Maxxen — this PR adds EWKB writing to postgres_scanner so that GEOMETRY('EPSG:4326') columns arrive in PostGIS with the correct SRID instead of 0.

A couple of things I'd appreciate your input on:

  1. SRID extraction approach: Currently we parse the CRS identifier (e.g., "EPSG:4326"4326) per-row via string::find + stoi. It's cheap but not elegant. Would you prefer this to happen once per-column during init, maybe storing the resolved SRID in the copy state? Or is per-row fine given it's a short string?

  2. AuxInfo() guard: We check type.AuxInfo() before calling GetCRS() since it throws on bare GEOMETRY without CRS. Is there a more idiomatic way to check if a GEOMETRY type has CRS attached? Something like GeoType::HasCRS() that I might have missed?

  3. Internal format assumption: We treat the GEOMETRY binary representation as WKB and insert the SRID header directly. Verified byte-for-byte identical to ST_AsWKB() output for all 7 geometry types. Is this a safe assumption going forward, or should we go through an explicit conversion?

Also — maybe you're already looking at completing this work somewhere else, or have a different approach in mind. Let me know if for topics like this you'd prefer an issue first rather than a PR. I thought this was small enough that a PR would be more useful, but happy to adjust to whatever workflow you prefer.

@Maxxen
Copy link
Copy Markdown
Member

Maxxen commented Mar 20, 2026

Unfortunately this isnt safe to do, the postgres srid is not guaranteed to be the same as the numeric part of the auth code. Ig also doesnt work for non-integer auth codes, like OGC:CRS84.

We need to do a lookup in the postgres databases SPATIAL_REF_SYS table and find the best match (which I think we need PROJ for, which only spatial bundles) to get the SRID. This is not really easy to do in the type conversion code and we dont want to add proj as a dependency here right now.

@Maxxen
Copy link
Copy Markdown
Member

Maxxen commented Mar 20, 2026

I have some thought on how to eventually solve this in the future, that would involve breaking PROJ out from spatial into its own separate extension, however we dont want to do that yet until we have a stable extension c-api, which we are working on.

@jatorre
Copy link
Copy Markdown
Author

jatorre commented Mar 20, 2026

Thanks for the context @Maxxen — that makes total sense about the SRID ≠ EPSG code edge case and OGC:CRS84.

That said, I think there's a pragmatic middle ground here: only handle the EPSG:NNNN case (where SRID = integer part), and fall back to plain WKB (SRID=0) for everything else. The code already does this — it looks for a colon, tries stoi, and falls back to WriteRawBlob if parsing fails.

One of the key values of DuckDB in the geospatial world is being the intermediary between data warehouses — BigQuery, Snowflake, Databricks, Redshift — and PostGIS. Those cloud warehouses don't have a spatial_ref_sys table or the CRS flexibility that PostGIS has — they work with well-known EPSG codes directly. So in practice, the vast majority of data flowing through DuckDB into PostGIS will be GEOMETRY('EPSG:NNNN') where the PostGIS SRID matches the integer value.

Covering that 99% case with a few lines of code would already be extremely valuable — right now every postgres_scanner write loses the SRID, and users have to UPDATE ... SET geom = ST_SetSRID(geom, 4326) after every import.

For the remaining cases (OGC:CRS84, custom SRIDs, non-integer auth codes), falling back to SRID=0 preserves current behavior and ensures correctness — no worse than today, and the user can handle it with a post-import ST_SetSRID just as they do now.

Would you be open to that scoped approach? The contract would be:

  • GEOMETRY('EPSG:4326') → EWKB with SRID=4326 ✓
  • GEOMETRY('OGC:CRS84') → plain WKB, SRID=0 (same as today)
  • GEOMETRY (no CRS) → plain WKB, SRID=0 (same as today)

Happy to adjust the code or close this if you'd rather wait for the PROJ-based solution — just wanted to make the case that the simple version already covers almost all real-world usage.

@Maxxen
Copy link
Copy Markdown
Member

Maxxen commented Mar 20, 2026

Alright, maybe we can do it for EPSG only, I just worry that the SRIDs in PostGIS spatial_ref_sys table don't actually match with the EPSG codes, even for EPSG defined projections, not sure how easy that would be to verify... Either way we still lose the CRS on import - but I guess we can deal with that separately.

Let me get back to this later :)

@jatorre
Copy link
Copy Markdown
Author

jatorre commented Mar 20, 2026

Good news — I checked a default PostGIS spatial_ref_sys table and all 6,184 EPSG entries have srid = auth_srid with zero mismatches:

SELECT COUNT(*) as total,
       SUM(CASE WHEN srid = auth_srid::INT THEN 1 ELSE 0 END) as matching,
       SUM(CASE WHEN srid != auth_srid::INT THEN 1 ELSE 0 END) as mismatched
FROM spatial_ref_sys WHERE auth_name = 'EPSG';

-- total: 6184, matching: 6184, mismatched: 0

So for the default table, EPSG code = PostGIS SRID is a safe assumption. A user could theoretically insert custom rows that break this, but that's an edge case they'd need to handle themselves anyway.

Happy to wait until you have time to look at it properly — no rush.

@jatorre
Copy link
Copy Markdown
Author

jatorre commented Apr 22, 2026

Follow-up: the read path has the symmetric issue. postgres_scanner reads PostGIS geometry columns as plain WKB (type 0x01), not EWKB (0x01 | 0x20000000), so the SRID is stripped on the way in too. A column typed geometry(Point, 3857) in PostGIS arrives in DuckDB as GEOMETRY with empty ST_CRS().

Two pieces needed for a full fix:

  1. Bind time — read the PostGIS typmod from pg_attribute (or geometry_columns) and bind the column as GEOMETRY('EPSG:NNNN') instead of plain GEOMETRY.
  2. Per-value — in postgres_binary_reader.cpp (LogicalTypeId::GEOMETRY branch), strip the EWKB SRID header before passing bytes to Geometry::FromBinary.

Happy to follow up with a separate PR for the read side. Keeping this PR focused on the write path since the two changes touch different code and have different edge cases (untyped geometry columns, mixed-SRID columns, postgres_query() path — see #415).

DuckDB 1.5 introduced CRS metadata on the GEOMETRY type (e.g.,
GEOMETRY('EPSG:4326')). This change makes that CRS survive a full
round trip through PostgreSQL/PostGIS via postgres_scanner.

Write side (DuckDB → PostGIS):
- WriteGeometry detects CRS on the LogicalType and writes EWKB
  instead of plain WKB so PostGIS receives the SRID:
    WKB:  [byte_order:1] [type:4]                    [payload...]
    EWKB: [byte_order:1] [type|0x20000000:4] [srid:4] [payload...]
  Falls back to plain WKB when no CRS is set, preserving the
  previous behavior. (Uses GeoType::HasCRS / GeoType::GetCRS — the
  earlier draft mistakenly referenced LogicalType::GetCRS, which
  does not exist on DuckDB 1.5.)

Read side (PostGIS → DuckDB):
- TypeToLogicalType decodes the SRID from PostGIS's column-level
  typmod for `geometry(TYPE, SRID)` columns and returns
  LogicalType::GEOMETRY("EPSG:N") instead of plain GEOMETRY().
  The typmod layout is:
      bits  0    : has-M
      bits  1    : has-Z
      bits  2-7  : geometry type code (POINT=1, LINESTRING=2, ...)
      bits  8-29 : SRID
  Untyped `geometry` columns carry typmod -1 and fall back to plain
  GEOMETRY (no CRS), matching the previous behavior.

Before: DuckDB GEOMETRY('EPSG:4326') → PostGIS geometry(SRID=0)
        PostGIS geometry(POINT,4326)  → DuckDB GEOMETRY (no CRS)
After:  DuckDB GEOMETRY('EPSG:4326') → PostGIS geometry(SRID=4326)
        PostGIS geometry(POINT,4326)  → DuckDB GEOMETRY('EPSG:4326')

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jatorre jatorre force-pushed the fix/geometry-srid-ewkb branch from 774ea01 to 99cd2f8 Compare April 25, 2026 07:00
@jatorre jatorre changed the title feat: propagate GEOMETRY CRS as PostGIS SRID via EWKB feat: propagate GEOMETRY CRS through PostGIS in both directions Apr 25, 2026
@jatorre jatorre marked this pull request as ready for review April 25, 2026 07:00
The previous commit added the EWKB write path and typmod read decode,
but on CREATE TABLE AS the CRS was being stripped before reaching
either side, so the resulting PostGIS column landed as plain
`geometry` (typmod -1) and the read-back lost the SRID.

Two strip points in this file:

  1. ToPostgresType (called by AddCastToPostgresTypes during CTAS
     planning) returned LogicalType::GEOMETRY() unconditionally,
     dropping CRS before the projection that feeds the new physical
     plan.  Now returns the input unchanged for GEOMETRY so CRS
     survives the cast/projection.

  2. TypeToString returned plain "GEOMETRY" for any GEOMETRY type and
     would have been short-circuited anyway by the HasAlias() early
     return (GEOMETRY('EPSG:N') carries an alias of "GEOMETRY").  The
     new branch runs *before* HasAlias, detects CRS via
     GeoType::HasCRS, and emits `geometry(Geometry, N)` so PostGIS
     records the SRID in pg_attribute.atttypmod and the existing
     read-side typmod decode picks it up.

Verified end-to-end through the streamability harness in
duckdb-warehouse-transfer: postgres -> oracle and postgres -> snowflake
now round-trip GEOMETRY('EPSG:3857') with byte-perfect WKB and CRS
preserved at the destination, with no plan-level materialization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jatorre
Copy link
Copy Markdown
Author

jatorre commented Apr 25, 2026

Pushed 640b9fa to cover a second hole I hit while testing this end-to-end through DuckDB-mediated cross-warehouse transfers.

The previous commit (99cd2f8) handles the EWKB write path and typmod read decode for tables created with an explicit geometry(geometry, SRID) column. But for CREATE TABLE AS — which is how DuckDB-driven CREATE TABLE pg.public.t AS SELECT * FROM <source> lands geometry — the SRID was still being stripped before reaching either side, so the column landed as plain geometry (typmod -1) and the round trip lost the CRS.

There are two strip points in src/postgres_utils.cpp:

  1. ToPostgresType is called from AddCastToPostgresTypes during CTAS physical planning. It used to return LogicalType::GEOMETRY() unconditionally for any GEOMETRY input, dropping the CRS before the projection that feeds the insert. Changed to return the input unchanged so CRS survives the cast/projection.

  2. TypeToString used to return plain "GEOMETRY" for any GEOMETRY type, and would have been short-circuited anyway by the HasAlias() early return (GEOMETRY('EPSG:N') carries an alias of "GEOMETRY"). The new branch runs before HasAlias, detects CRS via GeoType::HasCRS, and emits geometry(Geometry, N) so PostGIS records the SRID in pg_attribute.atttypmod and the existing read-side typmod decode (already on this PR) picks it up correctly.

Verified end-to-end with a small streamability harness that chains adbc_insert(dest, ..., (SELECT * FROM <postgres>...)) through a single DuckDB connection:

Source → Dest Plan WKB CRS at dest
postgres → oracle streams (no materialization op) byte-match EPSG:3857
postgres → snowflake streams byte-match EPSG:3857

Both round trips were failing before this commit (CRS dropped to EPSG:4326 at the destination because the postgres source lost it). Same plan shape, no change in elapsed time, RSS stays flat.

This is the only change beyond the previous commit; the EWKB write and typmod read decode are unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants