From ddfe2aea72e6c8e7b74460303389b24387f2ee64 Mon Sep 17 00:00:00 2001 From: Aimee Barciauskas Date: Fri, 19 Jun 2026 17:00:30 -0700 Subject: [PATCH 1/8] docs: publish DSE architecture vision slide deck Add the DSE Architecture Vision reveal.js deck under docs/ so it is published (accessible, not linked in nav). Includes a new References slide and converts the zarr-python markdown link to a real anchor tag. Co-Authored-By: Claude Opus 4.8 --- .../diagrams_svg/07_reference.svg | 1 + docs/dse-architecture-vision/index.html | 379 ++++++++++++++++++ 2 files changed, 380 insertions(+) create mode 100644 docs/dse-architecture-vision/diagrams_svg/07_reference.svg create mode 100644 docs/dse-architecture-vision/index.html diff --git a/docs/dse-architecture-vision/diagrams_svg/07_reference.svg b/docs/dse-architecture-vision/diagrams_svg/07_reference.svg new file mode 100644 index 0000000..8ee69af --- /dev/null +++ b/docs/dse-architecture-vision/diagrams_svg/07_reference.svg @@ -0,0 +1 @@ +Reference Architecture — Tools & FlowsConcrete mapping of the vision onto NASA tools. 
Bold = implementation / tool · small = logical component.NASA Data System Evolution (DSE)Architecture Visionexistsin developmentData Product GenerationEarthdata Cloud Data LakeDiscoverabilityServicesMAAPdata processing systemSTACproduct metadataStaging bucketobject storage (staging)Cumulusingest / archiveEarthdata HarmonyGIBS image-pyramid generationObject storageGIBS tile pyramids — NOT the data lakeuse only via the GIBS servicemetadatadata productsmoves fromtriggers GIBS generationwrites pyramidsProducer roles & processing levelsAll NASA-funded data production — products span Level 0 to Level 4L0L1L2L3L4Mission-funded · standard product suiteofficial suite — often L0–L3, sometimes L4Project-funded · value-added productsderived — L2–L4Boundary set by each mission's Data Management Plan — going forward,DMPs should also require:CF metadata conventionsObject-store-optimized chunk structureIcechunk / Iceberg storage deliveryCMR (RDS)collection & granule metadata catalogFile-based data lake · object store on S3Archival filesHDF5 / netCDF (cloud-optimized)Apache Icebergtabular table storeIcechunk / Zarrn-D array store (virtual or native)TBD data pipelineevent-driven processingpublishes tomoves toS3 notificationappends rows / arraysregisters inCMR APIsearch for collectionsApache DataFusionquery engine:query & filter collection viewsZarrarray interfaceservespublishes viewstable providersaccessesarray access viaCustom Product GenerationSlideRuleusers select from existing registered algorithmspython client | HTTP API | web appMAAPauthorized users use MAAP data processing systemcloud-hosted notebooks | DPS APIAccess / Subsetting / ReformattingEarthdata Harmonylarge async requests (servers)python client | HTTP APIZarrsmall synchronous requests (serverless)python clientSQLsmall synchronous requests (serverless)VisualizationGIBSfast, pre-generated imagery (servers)TiTilerdynamic (user-driven) imagery (servers)deck.gldata directly in-browser(serverless, authentication 
barrier)search for collectionsquery + filteraccess viaSolid = exists today · dashed purple = in development. Harmony/GIBS tiling store is separate from the data lake. Tool choices are illustrative and open for discussion. \ No newline at end of file diff --git a/docs/dse-architecture-vision/index.html b/docs/dse-architecture-vision/index.html new file mode 100644 index 0000000..b9433c3 --- /dev/null +++ b/docs/dse-architecture-vision/index.html @@ -0,0 +1,379 @@ + + + + + +NASA DSE — Architecture Vision (editable) + + + + + +
+ + +
+

NASA EARTHDATA CLOUD DATA SYSTEM EVOLUTION (DSE)

+

Architecture Vision

+
+

From "lift and shift" granules to a unified, cloud-native data lake.

+

Working draft · June 2026  ·  JP Swinski, Sean Harkins & Aimee Barciauskas

+
+ + +
+

End-to-End Architecture Vision

+

From "lift and shift" granules to a unified, cloud-native data lake with integrated services.

+
+
1 · Mission Products
    +
  • Problem: Mission Products are produced in mission silos today, with little cross-mission consistency
  • +
  • Solution: Additional requirements for cloud-friendly chunking + storage and standards (CF)-compliant metadata
  • +
  • Assumption: Archival format for L0-L2; higher levels may be produced in cloud-native formats (with a requirement for a repeatable pipeline)
  • +
+
2 · Data Lake
    +
  • Archival data files will be archived in cloud object storage
  • +
  • Event-driven pipelines (S3 manifests and new object notifications → queue → processing) will build analysis-ready data products (ARD)
  • +
  • Tabular data stored in Apache Iceberg; multi-dimensional arrays stored in Icechunk (Zarr)
  • +
+
3 · Discoverability
    +
  • Collections will be discoverable via CMR or via query engine provider (see next box)
  • +
  • CMR collection records will store location references to "views" into the collection (entry points for query and access to entire collection)
  • +
  • Find & compare datasets, for example search for all precipitation datasets. Consistent metadata standards will help users pick one.
  • +
+
4 · Queryability
    +
  • A DataFusion query-engine provider over every collection or collection views
  • +
  • Users can query across collections and within a collection
  • +
  • Collection "views" open many files as one logical collection
  • +
+
5 · Data Services
    +
  • Subsetting & reformatting (Harmony, SlideRule)
  • +
  • Custom product generation via algorithm services (SlideRule, MAAP)
  • +
  • Access + Analytics (notebooks + sync/async HTTP)
  • +
  • Visualization
  • +
+
+
Access modes: cloud-hosted notebooks (co-located in AWS us-west-2) · HTTP endpoints · web browsers
+
Cross-cutting principle: uniform data models (Zarr/Icechunk for n-D, Iceberg for tabular) enable services to be used across many collections.
+
+ + +
+ Reference architecture: concrete tools and flows +
This slide stays an editable SVG: diagrams_svg/07_reference.svg
+
+ + +
+

Why Now — From Fragmentation to a Unified Vision

+

A 3-5 year transition: cloud-native production for new data, plus virtualization of existing collections.

+
+
TODAYmission-centric & fragmented
    +
  • Each mission picks its own formats & chunking — little cross-mission consistency
  • +
  • Suboptimal HDF5 granules; even cloud-optimized files fall short
  • +
  • Scattered metadata makes comparing similar products hard — data fusion is painful
  • +
  • No structural incentive for missions to invest in downstream usability
  • +
  • Discovery to use is manual: find a collection, then download & preprocess every file
  • +
  • SlideRule overloaded with subsetting; CMR strained by analytics & AI-agent traffic
  • +
+
VISIONunified, cloud-native data lake
    +
  • Mission data product requirements: consistent cloud-friendly chunking + storage / metadata / formatting standards
  • +
  • One data lake: Icechunk (n-D / Zarr) + Iceberg (tabular), leveraging virtual chunk-manifest indices
  • +
  • Federated query engine (DataFusion): discovery → cross-collection → within-collection
  • +
  • Uniform data models → build a service once, reuse it for every dataset
  • +
  • SlideRule and Harmony leverage chunk-manifest indices for product generation and subsetting
  • +
  • Distributed CMR data lake scales for analytics, consumer & AI-agent traffic
  • +
+
+
3-5 year transition: virtualize existing collections (via DAACs / external teams) while shifting new missions to cloud-native production.
+
+ + + + + + + + +
+

Roles & Responsibilities

+

Multiple teams, one shared interface.

+
+
Data Producer Teamsall NASA-funded data production (mission, science investigator & DAAC)
    +
  • All NASA-funded teams produce across Levels 0–4: mission-funded standard + project-funded value-added products
  • +
  • Could share one platform (e.g. MAAP) — today they use distinct tools
  • +
  • Scope & standards documented in each mission's Data Management Plan (DMP)
  • +
  • DMPs should require: CF conventions, object-store-optimized chunking, Icechunk / Iceberg delivery
  • +
  • Publish chunk indices (virtual Icechunk manifests) to the shared stores
  • +
  • Register collections + collection-level metadata so products are discoverable & queryable
  • +
+
publish →
+
Shared Data Lakethe contract / interface between the teams
    +
  • CF conventions — the common metadata standard
  • +
  • Apache Iceberg — tabular data + indices
  • +
  • Icechunk / Zarr — n-D array stores
  • +
  • Virtual chunk-manifest indices — published by producers → queryable
  • +
  • Collection & metadata registry — discovery across & within collections
  • +
+
→ consume
+
Cross-Product Services Teamdata lake consumers
    +
  • Operate the shared Iceberg + Icechunk stores and the query engine
  • +
  • Discoverability & queryability (CMR + DataFusion federation)
  • +
  • Subsetting & reformatting (Harmony); on-demand products (SlideRule)
  • +
  • Analytics & visualization services
  • +
  • Build once on uniform data models — reuse across every dataset
  • +
+
+
+ + +
+

Additions to reference architecture

+
AI tooling for mission product teams
+
Guidance on creating optimized archival formats and Icechunk / Iceberg stores.
+
IO layer choice
+
What matters for efficient (fast, cost-minimizing) access is the object-store IO-layer library: 2 high performance options are object_store used in zarr-python and h5coro used by SlideRule. Both offer advantages (next slide).
+
Caching
+
Within reason, caching should be a feature of every data service, but the implementation is out of scope for this document.
+
+ + +
+

Object-store IO layer — what drives efficient, low-cost access

+

The IO library sets the number and size of S3 GET requests and compute time — i.e. the cost of access. The two are complementary.

+
+

Rust object_store

cloud-native · general purpose
    +
  • Unified async API over S3 / GCS / Azure (Apache Arrow project)
  • +
  • Powers the Rust data ecosystem: DataFusion, Iceberg, Icechunk / Zarr
  • +
  • Concurrent range reads + connection pooling for chunk-level access
  • +
  • Best for cloud-native formats — Zarr, Parquet, Iceberg tables
  • +
+

h5coro

cloud-optimized HDF5 · file-level
    +
  • Reads HDF5 directly from S3 without the HDF5 library
  • +
  • Minimizes requests by smart caching of metadata / B-trees
  • +
  • Efficient access to existing archival granules — no reformatting
  • +
  • File-level and HDF5-specific by design
  • +
+
+

Bottom line: object_store for cloud-native stores; h5coro for legacy HDF5 in place — a cost-vs-migration tradeoff.

+
+ + +
+

Discussion questions

+
1 — Consolidate the interfaces?
+
Should investment be consolidated into accessing data via the Zarr / DataFusion interfaces, ultimately divesting from HDF-specific tooling?
+
2 — The future of HDF
+
Would the natural outcome be moving away from HDF entirely? Or does NASA see archival file formats as having a role indefinitely?
+
+ + +
+

For discussion & next steps

+
Align internally
+
Brief leadership on the discussion and confirm next steps.
+
Document the strategy (roadmap)
+
Outline the transition plan and long-term objectives behind this vision.
+
Socialize the vision
+
Share the architecture vision with leadership to align on goals.
+
Open questions
+
ARD storage format (archival + virtual Icechunk vs. native Zarr Icechunk) · which services NASA maintains for which datasets · funding & incentives for cloud-native mission product requirements.
+
+ + +
+

References

+ +
+ +
+ + + + + From 6c54037e465dadd4df1271bd74b25db8ed77ae07 Mon Sep 17 00:00:00 2001 From: Aimee Barciauskas Date: Sun, 28 Jun 2026 16:06:12 -0700 Subject: [PATCH 2/8] In-progress deck --- .../diagrams_svg/00_overview_10k.svg | 84 ++++++ .../diagrams_svg/07_reference.svg | 2 +- .../diagrams_svg/cloud-data-lake-simple.svg | 4 + .../diagrams_svg/cloud-native-data-lake.svg | 4 + .../diagrams_svg/nasa-esdis-evolution.svg | 4 + docs/dse-architecture-vision/index.html | 265 +++++++++++++----- docs/fy26-roadmap.md | 129 --------- docs/roadmap-phases.svg | 60 ++++ docs/roadmap.md | 189 +++++++++++++ 9 files changed, 539 insertions(+), 202 deletions(-) create mode 100644 docs/dse-architecture-vision/diagrams_svg/00_overview_10k.svg create mode 100644 docs/dse-architecture-vision/diagrams_svg/cloud-data-lake-simple.svg create mode 100644 docs/dse-architecture-vision/diagrams_svg/cloud-native-data-lake.svg create mode 100644 docs/dse-architecture-vision/diagrams_svg/nasa-esdis-evolution.svg delete mode 100644 docs/fy26-roadmap.md create mode 100644 docs/roadmap-phases.svg create mode 100644 docs/roadmap.md diff --git a/docs/dse-architecture-vision/diagrams_svg/00_overview_10k.svg b/docs/dse-architecture-vision/diagrams_svg/00_overview_10k.svg new file mode 100644 index 0000000..4f05a12 --- /dev/null +++ b/docs/dse-architecture-vision/diagrams_svg/00_overview_10k.svg @@ -0,0 +1,84 @@ + + + + + Architecture at 10,000 ft + From the data lake out to user interfaces — plus direct access via a cloud-hosted dev environment. 
+ NASA Data System Evolution (DSE) + Architecture Vision + + + + new data lake + + existing / neutral + + + + + + Archival Files + HDF5 / netCDF + object storage (S3) + + + + + Icechunk + Zarr arrays (n-D) + virtual or native + + + + + Query Engine + DataFusion + discover & filter collections + + + + + APIs + subset · reformat · tiles + timeseries · custom products + + + + + User Interfaces + web apps · notebooks + python / HTTP clients + + + + + + + + + + + + + + Cloud-hosted development environment + notebooks co-located with the data in AWS us-west-2 + power users go straight to the data, bypassing the API / query layer + + + + + direct Icechunk access + + + + direct archival access + + + + + + users log into + + Teal = new data lake (Icechunk / Zarr · query engine · direct access). Grey = existing / neutral. Solid = primary read path; dashed = direct-to-data shortcuts. + diff --git a/docs/dse-architecture-vision/diagrams_svg/07_reference.svg b/docs/dse-architecture-vision/diagrams_svg/07_reference.svg index 8ee69af..bca481b 100644 --- a/docs/dse-architecture-vision/diagrams_svg/07_reference.svg +++ b/docs/dse-architecture-vision/diagrams_svg/07_reference.svg @@ -1 +1 @@ -Reference Architecture — Tools & FlowsConcrete mapping of the vision onto NASA tools. 
Bold = implementation / tool · small = logical component.NASA Data System Evolution (DSE)Architecture Visionexistsin developmentData Product GenerationEarthdata Cloud Data LakeDiscoverabilityServicesMAAPdata processing systemSTACproduct metadataStaging bucketobject storage (staging)Cumulusingest / archiveEarthdata HarmonyGIBS image-pyramid generationObject storageGIBS tile pyramids — NOT the data lakeuse only via the GIBS servicemetadatadata productsmoves fromtriggers GIBS generationwrites pyramidsProducer roles & processing levelsAll NASA-funded data production — products span Level 0 to Level 4L0L1L2L3L4Mission-funded · standard product suiteofficial suite — often L0–L3, sometimes L4Project-funded · value-added productsderived — L2–L4Boundary set by each mission's Data Management Plan — going forward,DMPs should also require:CF metadata conventionsObject-store-optimized chunk structureIcechunk / Iceberg storage deliveryCMR (RDS)collection & granule metadata catalogFile-based data lake · object store on S3Archival filesHDF5 / netCDF (cloud-optimized)Apache Icebergtabular table storeIcechunk / Zarrn-D array store (virtual or native)TBD data pipelineevent-driven processingpublishes tomoves toS3 notificationappends rows / arraysregisters inCMR APIsearch for collectionsApache DataFusionquery engine:query & filter collection viewsZarrarray interfaceservespublishes viewstable providersaccessesarray access viaCustom Product GenerationSlideRuleusers select from existing registered algorithmspython client | HTTP API | web appMAAPauthorized users use MAAP data processing systemcloud-hosted notebooks | DPS APIAccess / Subsetting / ReformattingEarthdata Harmonylarge async requests (servers)python client | HTTP APIZarrsmall synchronous requests (serverless)python clientSQLsmall synchronous requests (serverless)VisualizationGIBSfast, pre-generated imagery (servers)TiTilerdynamic (user-driven) imagery (servers)deck.gldata directly in-browser(serverless, authentication 
barrier)search for collectionsquery + filteraccess viaSolid = exists today · dashed purple = in development. Harmony/GIBS tiling store is separate from the data lake. Tool choices are illustrative and open for discussion. \ No newline at end of file +Reference Architecture — Tools & FlowsConcrete mapping of the vision onto NASA tools. Bold = implementation / tool · small = logical component.NASA Data System Evolution (DSE)Architecture Visionexistsin developmentnew data lakeData Product GenerationEarthdata Cloud Data LakeDiscoverabilityServicesComplementary system — not the Icechunk data lake (ODD's purview).HySDSdata production systemMAAP = one exampleplatform on HySDSSTACproduct metadata (at publication)Staging bucketobject storage (staging)Cumulusingest / archiveEarthdata HarmonyGIBS image-pyramid generationObject storageGIBS tile pyramids — NOT the data lakeuse only via the GIBS servicemetadatadata productsmoves fromtriggers GIBS generationwrites pyramidsProducer roles & processing levelsAll NASA-funded data production — products span Level 0 to Level 4L0L1L2L3L4Mission-funded · standard product suiteofficial suite — often L0–L3, sometimes L4Project-funded · value-added productsderived — L2–L4Boundary set by each mission's Data Management Plan — going forward,DMPs should also require:CF metadata conventionsObject-store-optimized chunk structureIcechunk / Iceberg storage deliveryCMR (RDS)collection & granule metadata catalogFile-based data lake · object store on S3Archival filesHDF5 / netCDF (cloud-optimized)Apache Icebergtabular table storeIcechunk / Zarrn-D array store (virtual or native)TBD data pipelineevent-driven processingpublishes tomoves toS3 notificationappends rows / arraysregisters inCMR APIsearch for collectionsSTAC APIrepresent data to users (downstream view)— not how metadata is stored at restApache DataFusionquery engine:query & filter collection viewsZarrarray interfaceservespublishes viewstable providersaccessesarray access viaCustom Product 
GenerationSlideRuleusers select from existing registered algorithmspython client | HTTP API | web appMAAP (on HySDS)authorized users use MAAP data processing systemcloud-hosted notebooks | DPS APIAccess / Subsetting / ReformattingEarthdata Harmonylarge async requests (servers)python client | HTTP APIZarrsmall synchronous requests (serverless)python clientSQLsmall synchronous requests (serverless)VisualizationGIBSfast, pre-generated imagery (servers)TiTilerdynamic (user-driven) imagery (servers)deck.gldata directly in-browser(serverless, authentication barrier)search for collectionsquery + filteraccess viaDirect AccessDirect S3 / Zarr accessread Icechunk / Zarr arrays directly over S3 — no service layerZarr protocol (object_store · zarr-python)cloud-hosted notebooks (us-west-2) or any S3 clientarchival files also readable in place via h5corodirect accessTeal = new data lake components (Iceberg · Icechunk/Zarr · query engine · direct access). Grey = existing / neutral; dashed = in development. Mission data processing (HySDS / MAAP) is complementary, not the data lake. STAC API is a user-facing view, not storage-at-rest. Tool choices are illustrative. 
\ No newline at end of file diff --git a/docs/dse-architecture-vision/diagrams_svg/cloud-data-lake-simple.svg b/docs/dse-architecture-vision/diagrams_svg/cloud-data-lake-simple.svg new file mode 100644 index 0000000..126f78e --- /dev/null +++ b/docs/dse-architecture-vision/diagrams_svg/cloud-data-lake-simple.svg @@ -0,0 +1,4 @@ + + +Earthdata CloudData LakeNetCDF4NetCDF4NetCDF4Query EngineUser Interfaceswebsites, notebooksAPIsstandard APIs: tiles, STAC, EDR, WMScustom outputs: reproject, reformat, analysisdiscover and filterraw data chunks,chunk manifests and metadatacloud-hosted systems support direct access to the query enginecloud-hosted systems support direct access to data \ No newline at end of file diff --git a/docs/dse-architecture-vision/diagrams_svg/cloud-native-data-lake.svg b/docs/dse-architecture-vision/diagrams_svg/cloud-native-data-lake.svg new file mode 100644 index 0000000..0ff13cb --- /dev/null +++ b/docs/dse-architecture-vision/diagrams_svg/cloud-native-data-lake.svg @@ -0,0 +1,4 @@ + + +CMR APIDiscoverabilityfcCommon MetadataRepository(CMR) RDStable providersServicesEarthdata CloudData Lakepython client | HTTP API | web appNetCDF4TEMPOHDF5ICESat-2COGHLSHDF5GPM IMERGHDF5NISARpython client | HTTP APIMission Product TeamsData ProductGenerationCustom Product GenerationAccess / Subsetting / ReformattingVisualizationlarge async requests(servers)small synchronous requests(serverless)outputs topublishes toicechunk + iceberg datapipelinetriggers S3 notificationappends to icechunk(virtual or native)appends iceberg rowspython clientfast, pre-generated imagery(servers)dynamic (user-driven) imagery(servers)data directly in-browser(serverless, authentication barrier)pngGIBSusers select from existing registered algorithmsauthorized users use MAAPdata processing systemcloud-hosted notebooks + DPS APIsmall synchronous requests(serverless)Usersaccess viaaccessesquery + filtera collectiongenerates products on in developmentexistsNASA Data Systems Evolutionaccess 
viastagingpublishes todiscovers frommoves to \ No newline at end of file diff --git a/docs/dse-architecture-vision/diagrams_svg/nasa-esdis-evolution.svg b/docs/dse-architecture-vision/diagrams_svg/nasa-esdis-evolution.svg new file mode 100644 index 0000000..21f9780 --- /dev/null +++ b/docs/dse-architecture-vision/diagrams_svg/nasa-esdis-evolution.svg @@ -0,0 +1,4 @@ + + +CMR APIDiscovery and AccessfcCommon MetadataRepository(CMR) RDStable providersServicesEarthdata CloudData Lakepython client | HTTP API | web appNetCDF4TEMPOHDF5ICESat-2COGHLSHDF5GPM IMERGHDF5NISARpython client | HTTP APIMission Product TeamsData ProductGenerationCustom Product GenerationAccess / Subsetting / ReformattingVisualizationlarge async requests(servers)small synchronous requests(serverless)outputs topublishes toicechunk + iceberg datapipelinetriggers S3 notificationappends to icechunk(virtual or native)appends iceberg rowspython clientfast, pre-generated imagery(servers)dynamic (user-driven) imagery(servers)data directly in-browser(serverless, authentication barrier)pngGIBSusers select from existing registered algorithmsauthorized users use MAAPdata processing systemcloud-hosted notebooks + DPS APIsmall synchronous requests(serverless)Usersaccess viaaccessesquery + filtera collectiongenerates products on in developmentexistsaccess viastagingpublishes todiscovers frommoves to \ No newline at end of file diff --git a/docs/dse-architecture-vision/index.html b/docs/dse-architecture-vision/index.html index b9433c3..4cd9faf 100644 --- a/docs/dse-architecture-vision/index.html +++ b/docs/dse-architecture-vision/index.html @@ -37,9 +37,9 @@ /* panels with check/cross */ .panel{ border:2.5px solid var(--c); border-radius:14px; background:#fff; overflow:hidden; } .panel .ph{ background:var(--c); color:#fff; padding:7px 12px; } -.panel .ph b{ font-size:15px; } .panel .ph span{ font-size:11.5px; font-style:italic; opacity:.92; display:block; } +.panel .ph b{ font-size:13.5px; } .panel .ph span{ 
font-size:10.5px; font-style:italic; opacity:.92; display:block; } .panel ul{ margin:8px 12px; padding:0; list-style:none; } -.panel li{ font-size:12px; margin:7px 0; padding-left:22px; position:relative; line-height:1.3; } +.panel li{ font-size:10.5px; margin:5px 0; padding-left:20px; position:relative; line-height:1.28; } .panel.good li:before{ content:"✓"; position:absolute; left:0; color:var(--c); font-weight:700; } .panel.bad li:before{ content:"✕"; position:absolute; left:0; color:var(--c); font-weight:700; } @@ -62,8 +62,8 @@ .reveal section.dark .subt{ color:var(--ice); font-style:italic; font-size:.9em; } .reveal section.dark .meta{ color:#6f97b8; font-size:.5em; margin-top:1.3em; } .reveal section.dark .rule{ height:4px; width:160px; background:var(--teal); margin:.5em 0 .7em; border-radius:2px; } -.reveal section.dark .step{ color:var(--teal); font-weight:700; font-size:.74em; margin:.5em 0 .04em;} -.reveal section.dark .stepbody{ color:#cfe0ee; font-size:.6em; margin:0 0 .15em;} +.reveal section.dark .step{ color:var(--teal); font-weight:700; font-size:.62em; margin:.45em 0 .04em;} +.reveal section.dark .stepbody{ color:#cfe0ee; font-size:.5em; margin:0 0 .12em;} .reveal .cols{ display:flex; gap:26px; margin-top:16px; } .reveal .col{ flex:1; background:var(--navy2); border-radius:14px; padding:18px 22px; text-align:left; } .reveal .col.a{ border:1.5px solid #4f9de0; } .reveal .col.b{ border:1.5px solid #e08a3a; } @@ -75,12 +75,45 @@ .reveal .slide-number{ background:transparent; color:var(--mute); } .editnote{ position:absolute; bottom:6px; left:10px; font-size:9px; color:#cbd5e1; } +/* 10k architecture — HTML diagram (editable text) */ +.arch{ position:relative; width:1180px; height:470px; margin:14px auto 0; } +.arch .abox{ position:absolute; top:0; width:200px; height:88px; box-sizing:border-box; border:2px solid #64748b; border-radius:8px; background:#fff; padding:9px 12px; } +.arch .abox b{ display:block; font-size:15px; color:#0f172a; 
font-weight:700; line-height:1.1; } +.arch .abox small{ display:block; font-size:11px; color:#475569; line-height:1.4; margin-top:5px; } +.arch .abox.hl{ border:3px solid #0e7490; background:#0e7490; box-shadow:0 3px 14px rgba(14,116,144,.45); } +.arch .abox.hl b{ color:#fff; } +.arch .abox.hl small{ color:#d8eef3; } +.arch .aarrow{ position:absolute; top:24px; width:45px; text-align:center; color:#64748b; font-size:36px; font-weight:700; line-height:1; } +.arch .adev{ position:absolute; left:0; top:300px; width:980px; height:96px; box-sizing:border-box; border:2px solid #64748b; border-radius:12px; background:#f1f5f9; text-align:center; padding-top:14px; } +.arch .adev b{ font-size:16px; color:#0f172a; } +.arch .adev small{ display:block; font-size:12px; color:#475569; margin:5px 16px 0; line-height:1.45; } +.arch .vline{ position:absolute; border-left:2px solid #aab4c2; } +.arch .hline{ position:absolute; border-top:2px solid #aab4c2; } +.arch .dline{ position:absolute; border-left:2px dashed #64748b; } +.arch .dline.teal{ border-left:3px dashed #0e7490; } +.arch .tri{ position:absolute; width:0; height:0; } +.arch .tri.up{ border-left:6px solid transparent; border-right:6px solid transparent; border-bottom:9px solid #64748b; } +.arch .tri.up.teal{ border-bottom-color:#0e7490; } +.arch .tri.left{ border-top:6px solid transparent; border-bottom:6px solid transparent; border-right:9px solid #aab4c2; } +.arch .lbl{ position:absolute; font-size:12px; font-weight:700; white-space:nowrap; } +.arch .lbl.teal{ color:#0e7490; } +.arch .lbl.grey{ color:#64748b; } + /* references slide */ .reveal section.dark .reflist{ list-style:none; margin:.2em 0 0; padding:0; } .reveal section.dark .reflist > li{ margin:.55em 0; font-size:.7em; line-height:1.35; } .reveal section.dark .reflist .rname{ color:var(--ice); font-weight:700; } .reveal section.dark a{ color:#7ec8f2; text-decoration:none; } .reveal section.dark a:hover{ text-decoration:underline; } + +/* comparison table (extra 
slide) */ +.reveal .content table.cmp{ width:100%; border-collapse:collapse; font-size:15px; margin-top:8px; } +.reveal .content table.cmp th,.reveal .content table.cmp td{ border:1px solid #cbd5e1; padding:8px 12px; text-align:left; vertical-align:top; } +.reveal .content table.cmp thead th{ background:#0b2545; color:#fff; font-weight:700; } +.reveal .content table.cmp tbody th{ background:#f1f5f9; color:#0f172a; font-weight:700; width:24%; } +.reveal .content table.cmp tbody tr:nth-child(even) td{ background:#f8fafc; } +.reveal .content table.cmp td.tlake{ color:#0e7490; font-weight:700; } +.reveal .content table.cmp code{ font-family:"SFMono-Regular",Menlo,Consolas,monospace; font-size:13.5px; } @@ -89,13 +122,75 @@

NASA EARTHDATA CLOUD DATA SYSTEM EVOLUTION (DSE)

-

Architecture Vision

+

Cloud-Native Data Lake

From "lift and shift" granules to a unified, cloud-native data lake.

Working draft · June 2026  ·  JP Swinski, Sean Harkins & Aimee Barciauskas

- + +
+

Why Now?

+

NASA has successfully migrated many of its collections to cloud storage. However, it still lacks a unifying vision for a cloud-native architecture.

+
+
TODAYEarthdata is mission-centric & fragmented
    +
  • Each mission picks its own formats & chunking; there is little cross-mission consistency in data structure.
  • +
  • Even cloud-optimized HDF5 granules fall short of delivering the optimal performance from cloud object storage.
  • +
  • Metadata that differs across missions makes comparing similar products hard; data fusion is painful.
  • +
  • There is no structural incentive for missions to invest in downstream usability.
  • +
  • Metadata and data management are still distinct, risking inconsistency
  • +
  • Download + process is still a primary access pattern (e.g. download is the only access method available in the Earthdata Search user interface)
  • +
  • CMR is strained by analytics & AI-agent traffic
  • +
+
3-5 YEARS IN THE FUTUREEarthdata cloud is a unified, cloud-native data lake
    +
  • Organizing data around a single data model (i.e. Zarr) means a tool or service can be built once and used for many datasets.
  • +
  • Mission data product requirements are adopted: consistent cloud-friendly structure (sharding, chunking), consistent and standards-compliant metadata.
  • +
  • Chunk manifests create a bridge between archival file formats and a single data model.
  • +
  • Compliant data products are all available through a centralized data lake, utilizing a storage engine which supports ACID transactions for metadata and data.
  • +
  • SlideRule and Harmony leverage chunk-manifest indices for product generation and subsetting.
  • +
  • An integrated query engine enables a frictionless path from discovery to access.
  • +
  • A horizontally scalable query engine supports analytics and operational-scale consumers.
  • +
  • AI agents can use the query engine and suite of tools to more reliably discover, access and interpret data for consumers.
  • +
+
+
+ + + +
+

High-level Data Lake Architecture

+ High-level architecture +
SVG: diagrams_svg/cloud-data-lake-simple.svg
+
+ + +
+

Benefits

+
+ + +
+

The Data Lake supports direct, in-browser access

+

Put the effort into good data and metadata and the data does the work, not servers and not users.

+
Our job: instruct providers
+
Instruct NASA data providers on good metadata and cloud-friendly data delivery into the data lake, so users and services can get data out in an optimal way.
+
Optimal = direct in-browser
+
The optimal path is direct, in-browser access: users explore the data in the browser, with no files to download and no libraries to install.
+
Why it's optimal
+
With well-structured data, consistent metadata, and caching, in-browser access is fast and cost-effective compared with server-side processing (NASA must build and maintain services that do all the work) or data egress (users do all the work and pay to move data).
+
+ + +
+

The Data Lake complements existing systems

+
    +
  • The data lake will complement existing systems, not replace them.
  • +
  • Traditional access methods currently in use will keep working.
  • +
  • As datasets are integrated into Icechunk / Iceberg, users will start to see the benefits of more efficient and more powerful access.
  • +
+
+ +
- Reference architecture: concrete tools and flows -
This slide stays an editable SVG: diagrams_svg/07_reference.svg
-
- - -
-

Why Now — From Fragmentation to a Unified Vision

-

A 3-5 year transition: cloud-native production for new data, plus virtualization of existing collections.

-
-
TODAY: mission-centric & fragmented
    -
  • Each mission picks its own formats & chunking — little cross-mission consistency
  • -
  • Suboptimal HDF5 granules; even cloud-optimized files fall short
  • -
  • Scattered metadata makes comparing similar products hard — data fusion is painful
  • -
  • No structural incentive for missions to invest in downstream usability
  • -
  • Discovery to use is manual: find a collection, then download & preprocess every file
  • -
  • SlideRule overloaded with subsetting; CMR strained by analytics & AI-agent traffic
  • -
-
VISION: unified, cloud-native data lake
    -
  • Mission data product requirements: consistent cloud-friendly chunking + storage / metadata / formatting standards
  • -
  • One data lake: Icechunk (n-D / Zarr) + Iceberg (tabular), leveraging virtual chunk-manifest indices
  • -
  • Federated query engine (DataFusion): discovery → cross-collection → within-collection
  • -
  • Uniform data models → build a service once, reuse it for every dataset
  • -
  • SlideRule and Harmony leverage chunk-manifest indices for product generation and subsetting
  • -
  • Distributed CMR data lake scales for analytics, consumer & AI-agent traffic
  • -
-
-
3-5 year transition: virtualize existing collections (via DAACs / external teams) while shifting new missions to cloud-native production.
+

NASA ESDIS Architecture Cloud-Native Data Evolution

+ Reference architecture: tools and flows +
SVG: diagrams_svg/nasa-esdis-evolution.svg
-

Additions to reference architecture

-
AI tooling for mission product teams
-
Guidance on creating optimized archival formats and Icechunk / Iceberg stores.
-
IO layer choice
+

Additional recommendations

+
Mission product teams should be provided with AI tooling
+
Guidance on creating optimized archival formats and delivery to the Icechunk or Iceberg stores.
+
Enterprise tools accessing the data should pay special attention to the IO layer choice and configuration
What matters for efficient (fast, cost-minimizing) access is the object-store IO-layer library: two high-performance options are object_store (used in zarr-python) and h5coro (used by SlideRule). Both offer advantages (next slide).
-
Caching
-
Within reason, caching should be a feature of every data service, but the implementation is out of scope for this document.
+
Implement a shared data cache
+
Services with high demand and low-latency requirements should consider using a data cache. Such a data cache could be shared by multiple services.
+
+ + + +
+

Discussion questions

+
1 — Consolidate the interfaces?
+
Should investment be consolidated into accessing data via the Zarr and Query Engine interfaces, ultimately divesting from CMR and HDF-specific tooling?
+
2 — The future of HDF
+
Would the natural outcome be moving away from HDF entirely? Or does NASA see archival file formats as having a role indefinitely?
+
+ + +
+

References

+ +
+ + +
+

Extra

-

Object-store IO layer — what drives efficient, low-cost access

+

Object-store IO layer — what drives efficient, low-cost access

The IO library sets the number and size of S3 GET requests and compute time — i.e. the cost of access. The two are complementary.

Rust object_store

cloud-native · general purpose
    @@ -329,43 +434,59 @@

    Object-store IO layer — what drives efficient, lo

    Bottom line: object_store for cloud-native stores; h5coro for legacy HDF5 in place — a cost-vs-migration tradeoff.
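
To illustrate why the IO layer determines the number and size of GET requests, here is a minimal sketch in plain Python that merges nearby chunk byte ranges into fewer, larger reads. It is an assumption-laden illustration, not the API of object_store or h5coro: `coalesce_ranges`, `max_gap`, and the sample byte ranges are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start_byte, end_byte_exclusive)


def coalesce_ranges(ranges: List[Range], max_gap: int) -> List[Range]:
    """Merge byte ranges whose gap is at most max_gap bytes.

    Hypothetical helper: fewer, larger reads trade a few wasted bytes
    for fewer round trips to the object store.
    """
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Four chunk reads (illustrative offsets) become two requests.
chunk_ranges = [(0, 100), (120, 200), (5000, 5100), (5100, 5200)]
requests = coalesce_ranges(chunk_ranges, max_gap=64)
```

With these sample values, four chunk reads collapse into two requests at the cost of 20 extra bytes. A larger `max_gap` lowers the request count further but fetches more unused bytes, which is the kind of tradeoff an IO library's configuration controls.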

- +
-

Discussion questions

-
1 — Consolidate the interfaces?
-
Should investment be consolidated into accessing data via the Zarr / DataFusion interfaces, ultimately divesting from HDF-specific tooling?
-
2 — The future of HDF
-
Would the natural outcome be moving away from HDF entirely? Or does NASA see archival file formats as having a role indefinitely?
+

Caching — a data cache, not a tiling cache

+

Cache the data, not the rendered pixels — so every service benefits, not just visualization. (FY27 focus: demonstrate caching performance.)

+
Multiscales live in the Icechunk store
+
The Icechunk store includes multiscales (overview / pyramid resolutions of the arrays), so coarse-resolution reads don't require pulling full-resolution chunks.
+
Cache the multiscales as data
+
These multiscales are cached as a data cache — cached Zarr arrays / chunks — not a tiling cache. The same cached data serves tiles, timeseries, analytics, and AI workloads alike.
+
Why a data cache wins
+
A tiling cache accelerates only one visualization service. A data cache accelerates every consumer of the array, is reused across services, and avoids re-rendering the same pixels repeatedly.
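
The difference between a data cache and a tiling cache can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: `ChunkCache`, `fake_object_store_read`, and the `(array, level, chunk index)` key layout are hypothetical and not part of Icechunk or Zarr. The point is that the cache key identifies a chunk of data rather than a rendered tile, so different services can share the same entry.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class ChunkCache:
    """Tiny LRU cache keyed by chunk identity, not by rendered tile."""

    def __init__(self, fetch: Callable[[Hashable], bytes], max_items: int = 128):
        self._fetch = fetch
        self._max_items = max_items
        self._items: "OrderedDict[Hashable, bytes]" = OrderedDict()
        self.misses = 0  # number of reads that went to the backing store

    def get(self, key: Hashable) -> bytes:
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        self.misses += 1
        value = self._fetch(key)
        self._items[key] = value
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)  # evict least recently used
        return value


def fake_object_store_read(key: Hashable) -> bytes:
    # Stand-in for a real object-store GET; returns placeholder bytes.
    return repr(key).encode()


cache = ChunkCache(fake_object_store_read)
key = ("sst", 3, (0, 0))  # hypothetical (array name, multiscale level, chunk index)
tile_bytes = cache.get(key)    # a tile service asks first: cache miss
series_bytes = cache.get(key)  # a timeseries service asks next: cache hit
```

Both consumers receive the same chunk bytes and only one read reaches the backing store. A cache keyed by rendered tile could not serve the second request.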
- +
-

For discussion & next steps

-
Align internally
-
Brief leadership on the discussion and confirm next steps.
-
Document the strategy (roadmap)
-
Outline the transition plan and long-term objectives behind this vision.
-
Socialize the vision
-
Share the architecture vision with leadership to align on goals.
-
Open questions
-
ARD storage format (archival + virtual Icechunk vs. native Zarr Icechunk) · which services NASA maintains for which datasets · funding & incentives for cloud-native mission product requirements.
+

AI as a primary objective

+

Access is shifting from web / Python / in-house systems toward AI agents. The architecture must let LLMs discover, reason about, and ingest NASA Earth data.

+
Discover
+
Rich, consistent metadata makes datasets findable by agents — and discovery will broaden beyond STAC. In two years, semantic search over ATBDs may matter as much as a STAC endpoint; the metadata layer is designed for both structured (STAC / query) and semantic / LLM retrieval.
+
Reason
+
Consistent CF metadata plus a uniform query interface (DataFusion / SQL) let agents compare datasets and compose queries without per-dataset glue code.
+
Ingest
+
Cloud-friendly Zarr / Icechunk chunking and cached multiscales give agents efficient, low-cost array access at the resolution they need — direct over S3.
+
On the roadmap
+
Work with the AI/ML teams to demonstrate use of the data lake (e.g. Water Insight or EIE) in the first two quarters of FY27.
- +
-

References

- +

Long term — a simpler vision

+
Deprecate HDF tooling in favor of Zarr
+
Once the data lake is established, support for HDF-specific tooling is deprecated in favor of supporting only Zarr tooling — a single, cloud-native access stack to build and maintain.
+
Access only via the query engine + Zarr interfaces
+
A future architecture focuses on data lake access through the query engine (DataFusion) and Zarr interfaces only — not CMR and not archival files. Direct CMR and archival-file access are slow, costly, and error-prone by comparison.
+
Why this is feasible
+
The DataFusion query layer is stateless and horizontally scalable over partitioned metadata in object storage — replacing CMR's single-database (RDS) bottleneck — so discovery and query scale with object storage and absorb analytics- and AI-agent-scale traffic.
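
The scaling argument above can be sketched in plain Python. This is not DataFusion code: the partition keys, record fields, and the `prune` and `query` helpers are hypothetical, and the in-memory dictionary stands in for metadata files in object storage. It shows only the general idea that a query which depends on nothing but its arguments and the stored partitions can be run by any worker.

```python
from typing import Dict, List

# Hypothetical object-store layout: one metadata partition per collection
# and year. Keys and record fields are illustrative only.
PARTITIONS: Dict[str, List[dict]] = {
    "collection=demo-a/year=2024": [
        {"granule": "a-001", "size_mb": 12},
        {"granule": "a-002", "size_mb": 80},
    ],
    "collection=demo-a/year=2025": [
        {"granule": "a-003", "size_mb": 5},
    ],
    "collection=demo-b/year=2025": [
        {"granule": "b-001", "size_mb": 7},
    ],
}


def prune(keys: List[str], collection: str, year: int) -> List[str]:
    """Keep only the partition keys the query can possibly match."""
    wanted = f"collection={collection}/year={year}"
    return [key for key in keys if key == wanted]


def query(collection: str, year: int, max_size_mb: int) -> List[str]:
    """Answer a query from its arguments and the stored partitions alone.

    No per-worker state survives between calls, so any number of workers
    could run this function side by side.
    """
    matches: List[str] = []
    for key in prune(sorted(PARTITIONS), collection, year):
        for record in PARTITIONS[key]:
            if record["size_mb"] <= max_size_mb:
                matches.append(record["granule"])
    return matches


result = query("demo-a", 2024, max_size_mb=20)
```

Because pruning touches only the partitions a query can match, the work per query tracks the size of the relevant partitions rather than the size of the whole catalog.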
+
+ + +
+

Extra — the array stack vs. the tabular stack

+

Two parallel stacks, one query engine. Choose the store by data shape: dense arrays → Zarr / Icechunk; records (points, swaths, features) → Parquet / Iceberg.

+ + + + + + + + + + +
| Layer | Array / n-D world | Tabular world |
| --- | --- | --- |
| On-disk format | Zarr | Parquet (GeoParquet for vector) |
| Transactional store | Icechunk | Iceberg (+ pyiceberg for snapshots) |
| Format reader (decode → memory) | zarr-python | pyarrow |
| In-memory analysis | xarray | pandas · polars · GeoPandas |
| Query engine | DataFusion (via zarr-datafusion) | DataFusion (polars overlaps) |
| Example NASA products | gridded L3/L4: GPM IMERG · NLDAS · MUR SST · TEMPO L3 · HLS · model reanalysis | points / records: ICESat-2 photons · GEDI footprints · swath L1B/L2 · in-situ & vector · STAC catalog |
+
One query engine over both: DataFusion spans the two stacks. polars is a dataframe library that also acts as a mini query engine, so it overlaps DataFusion rather than matching zarr-python's decode-only role.
diff --git a/docs/fy26-roadmap.md b/docs/fy26-roadmap.md deleted file mode 100644 index 6c94442..0000000 --- a/docs/fy26-roadmap.md +++ /dev/null @@ -1,129 +0,0 @@ -# ODD Fiscal Year (FY) 2026 Roadmap - -If you are interested in a better understanding of the ODD service roadmap, and what datasets will be supported when, this document is for you. - -This document provides a roadmap for the VEDA Optimized Data Delivery Team (ODD), broken into 4 categories: -1. Services for granules in CMR -2. Services for datacubes -3. Services non-datacube -4. Foundational Work - -It is important to note that this roadmap is a reflection of the team's current plans, written as of November 2025. These are likely to evolve over time. We intend to update the roadmap quarterly. - -For a higher-level vision, see also: [Optimized Data Delivery Roadmap for NASA - July 2025](https://docs.google.com/presentation/d/1Ouo_9qJJuDBdrzDHpt2P-o1wGBPS1nvTjLRFAFGsYkU/edit?usp=sharing). - ---- - -## Legend - -- **✅ Complete** - Already delivered -- **🚧 In Progress** - Active development -- **🔄 Ongoing** - Ongoing work -- **📅 Planned** - Scheduled for specific quarter -- **🔮 Future** - Planned for future timeline - ---- - -![Services for CMR Granules](./category1-granules.svg) - -## Roadmap for Service Category 1: Services for CMR Granules - -### Access -*N/A* - -### Visualization -- **✅ Complete** titiler-cmr /tiles API + VEDA UI integration - -### Timeseries -- **✅ Complete** titiler-cmr /timeseries/statistics API + VEDA UI integration - -### Additional Features -- **🚧 26.1** Release /compatibility endpoint -- **📅 26.2+** Develop support for more datasets, informed by compatibility testing in 26.1. - -### Dataset Support -- **✅ Complete** Demonstrated with GPM IMERG, TROPESS O3 and MiCASA -- **🚧 26.1** Compile a list of compatible datasets -- **🚧 26.1** Develop support for EDL-based credential access, as an aternative to requester-pays and role-based access. 
To support NISAR (ASF) and GEDI L4B (ORNL DAAC) specifically. -- **📅 26.2+** Test integration of new datasets as requester-pays is enabled for more buckets. - -### Performance + Operations -- **🚧 26.1** Deploy monitoring + performance evaluation via service tracing (OpenTelemetry) -- **📅 26.1** MCP Production deployment -- **📅 26.2** Consolidated benchmarking utilities for advising users on zoom levels, AOIs and temporal parameters on a per-dataset basis - -### Ecosystem Development -- **📅 26.2** Share compatible dataset list with NASA product teams for potential integration (i.e. Worldview) -- **📅 26.2+** Continued documentation to support self-service use of titiler-cmr. - ---- - -![Services for Datacubes](./category2-datacubes.svg) - -## Roadmap for Service Category 2: Services for Datacubes - -### Access -- **✅ Complete** Lazy loading/intelligent subsetting/intelligent access for varied data formats (GRIB, COG, NetCDF-4, HDF5 via VirtualiZarr) -- **📅 26.1** Support adoption of Virtual Zarr through library maintenance, improved documentation, and user support -- **📅 26.2** Support for arbitrary [chunk-grids (variable chunking)](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#chunk-grids) -- **📅 26.2** Explore virtualization methods for alternate grid structures (i.e., healpix, cubegrid) - -### Visualization -- **📅 26.1** Virtual container (Icechunk) integration in titiler-multidim to support /tiles endpoints -- **📅 26.1** Identify additional I/O parameters to allow for per-dataset optimizations -- **📅 26.1** Test VEDA UI integration of /tiles for a virtual dataset (e.g. NLDAS) -- **📅 26.2** Additional performance improvements (e.g. obstore integration) - -### Timeseries -- **📅 26.1** Design the timeseries/statistics endpoint to support datacubes (i.e. 
could be an asynchronous API outside the titiler ecosystem) -- **📅 26.2** Develop the timeseries/statistics endpoint -- **📅 26.2** Integrate the timeseries/statistics endpoint into VEDA UI - -### Datasets -- **✅ Complete** Prototyped virtual (Icechunk) stores for NLDAS, RASI, HRRR, MUR SST -- **📅 26.1** Demonstrate publication and tiling of NLDAS virtual store (💧 Water Insight) -- **📅 26.1** Architecture + documentation for generalizing STAC publication and VEDA UI /tiles integration -- **📅 26.2** HydroGlobe 5km and 10km virtual stores (💧 Water Insight) -- **📅 26.2** CarbonTracker-CH₄, EPA Gridded CH₄ Emissions Inventory virtual stores (🏭 GHGCenter) -- **📅 26.3** Documentation for STAC publication and VEDA UI /timeseries/statistics integration -- **📅 26.3** CarbonTracker-CH₄, EPA Gridded CH₄ Emissions Inventory tiles and timeseries integrations (🏭 GHGCenter) -- **📅 26.3** TROPESS NOx, TROPESS O3, JPL MOMO Chem, GEOS CF virtual stores, tiles and timeseries integrations (💨 Air Quality) - -### Operations -- **📅 26.2** Monitoring + Performance evaluation via service tracing (OpenTelemetry) -- **📅 26.3** MCP deployment -- **📅 26.2** Consolidated benchmarking utilities for advising users on zoom levels, AOIs and temporal parameters on a per-dataset basis - -### Ecosystem Development -- **📅 26.1** Create template data ingestion pipeline for virtualizing datasets -- **📅 26.3+** Moving towards self-service integration - ---- - -![Services for Non-Datacubes](./category3-nondatacubes.svg) - -## Roadmap for Service Category 3: Services for Non-Datacubes - -### Access -- **🚧 26.1-26.3** Prototyping creating a query engine using a Zarr provider for data fusion - -### Visualization -- **🔮 26.4 or FY 27** Tiling endpoints in near-term, direct client approaches in long-term - -### Timeseries -- **🔮 26.4 or FY 27** Timeseries API - -### Datasets -- **📅 26.1** Prototype HLS store -- **📅 26.3+** Prototype NISAR and/or Opera stores - -### Operations -- **🔮 26.4 or FY 27** Operational 
deployment + documentation -- **🔮 26.4 or FY 27** Consolidated benchmarking utilities for advising users on zoom levels, AOIs and temporal parameters on a per-dataset basis - -### Ecosystem Development -- **🔮 26.4 or FY 27** Develop ecosystem, moving towards self-service adoption within VEDA and broader community - -## Roadmap for Service Category 4: Foundational Work (including Technical Debt) - -- **🔄 26.1+** Establish areas for consolidation in the TiTiler ecosystem. Similar features across applications should rely on shared upstream libraries. The ODD team continuously identifying similar features and proactively DRY up codebases. diff --git a/docs/roadmap-phases.svg b/docs/roadmap-phases.svg new file mode 100644 index 0000000..61479dc --- /dev/null +++ b/docs/roadmap-phases.svg @@ -0,0 +1,60 @@ + + + + + ODD phases — notional timeline + Timelines are notional, not concrete + + + + + + FY26.4 · Now + Demonstrate the data lake + • Icechunk data lake across varied data + types: HLS, NISAR, GPM IMERG, + NLDAS, TEMPO, … + • VEDA instances demonstrate the + data lake with scientists + • Migrate data services (TiTiler-CMR) + to the Data Services team + • Socialization + capacity building + (coworking group, integration guide) + + + + + + FY27 · Next + AI integration + caching + • Demonstrate AI integration with the + AI/ML teams (Water Insight / EIE) + • LLMs discover, reason about, ingest + data from the lake + • Demonstrate caching performance: + multiscale data cache in Icechunk + (a data cache, not a tiling cache) + + + + + + Long term + Simplify + • Deprecate HDF tooling; support + only Zarr tooling + • Access only via the query engine + + Zarr interfaces + • Not CMR, not archival files + (slow, costly, error-prone) + + + + + + + + + Foundational & ongoing — throughout all phases + Zarr · Icechunk · obstore (IO) · warp / resampling / projection · reading Zarr + COG directly in the browser · GeoZarr & data standards + diff --git a/docs/roadmap.md 
b/docs/roadmap.md new file mode 100644 index 0000000..9d7a53a --- /dev/null +++ b/docs/roadmap.md @@ -0,0 +1,189 @@ +# ODD roadmap + +This page exists to explain the motivations behind ODD's daily work. It connects what +we're building to why we're building it, and explains how work enters, moves through, +and may eventually leave our portfolio. The primary audience is the ODD team. +The secondary audience is peer ODSI teams who want to understand how our work fits the broader picture. + +## Vision: who we serve + +Our vision is expressed as the experiences users will have when we've succeeded: + +1. **Ask in plain language and reproduce response.** As an Earth enthusiast, I want to ask questions like "how did the Gifford fire evolve?" and get an animated visual response — with links to the source code that produced the analysis, so I can verify and reproduce it. +2. **Explore in the browser.** As an Earth enthusiast, I want to visually explore forest disturbance through NISAR data directly in my browser, with no specialized software or cloud account. +3. **Research at scale.** As a fire event researcher, I want to evaluate relationships between variables from different data products across many thousands of fires, with minimal data pre-processing for fusion and modeling. +4. **Operate in near-real time.** As an operational **application**, I need products like HLS for disaster response or sea surface temperature for maritime operations available in near-real time. 
+ +## The gap + +NASA already serves these users — but current services have limits that grow more acute as data volumes grow: + +| User story | Today's services | Where they fall short | +| ------------------------- | -------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | +| Ask in plain language | Earth Information Explorer | Limited dataset access; datasets must be curated into the system | +| Explore in the browser | Worldview / GIBS | Not configurable by users; pre-rendered layers don't scale to new datasets or rendering needs | +| Research at scale | Earthdata Cloud, Harmony, cloud-hosted JupyterHubs | Harmony offloads processing to servers — heavy compute cost rather than a structural fix; users struggle to find the best datasets for their needs | +| Operate in near-real time | LANCE + HLS | Hard to keep metadata and data in sync; no reliable notification system for new data landing in Earthdata Cloud buckets | +| All of the above | CMR | Under increasing pressure from rapid archive growth and analytics-scale query traffic | + +Across all of these: discovery is hard, and current systems are becoming unsustainable as data volumes grow. + +## Our pillars + +We address these gaps through four pillars: + +1. **Open standards & FAIR data.** NASA data and services are findable, accessible, interoperable, and reusable, built on community standards rather than bespoke systems. +2. **Performance, cost & scale.** Optimize performance while minimizing cost, with solutions that scale sustainably to new and growing data volumes. +3. **Empowered users.** Users — both data providers and data consumers — can use and apply the solutions we build without us. +4. **Trusted & reliable data.** The data products NASA generates are verifiable, consistent, and kept in sync with their metadata. 
+ +**Cross-cutting foundation: community developed + adopted.** Every item on this roadmap is built in the open, with and for the community. Open source is the license; community development and adoption is the practice — it's how solutions outlive our involvement, and it underpins all four pillars. + +## Roadmap + +| Pillar | Now · mature | Next · developing | Later · future | +| ------------------------------ | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | +| **Open standards & FAIR data** | ◆ Array format (Zarr) stewardship · ◆ Geospatial conventions (GeoZarr) | Ecosystem sustainability · Codec re-architecture · variable chunking | Convention + CRS utilities | +| **Performance, cost & scale** | Data virtualization · Object-store access · Dynamic tiling · In-browser rendering | Virtual stores + lazy array analytics · Analytics-scale metadata · Storage model evaluation | Resampling/warp tooling · Query at scale · Storage cost optimization | +| **Empowered users** | Cloud-native guidance · Science support · Format evaluation | In-browser rendering · Cloud-optimized decision framework · Improved access & auth libraries · Dataset + tooling coverage metrics · AI/ML data-lake demonstrations | AI-assisted optimization (skills + tooling) · ESRI / ArcGIS integration | +| **Trusted & reliable data** | ◆ Transactional Zarr (Icechunk) | Remote store access · Live virtual stores · Synchronized metadata + data | Event-driven (object store notifications) for near-real time (NRT) updates | + +**◆ Foundational** — a category of work that is ongoing. + +**Handed off:** nothing yet — see [How we work](#how-we-work). Building a working handoff path is a goal itself. 
+ +Every objective of this team should trace to at least one vision story and one pillar. Each item name links to deeper context below. + +## Phases + +While the grid above tracks *what* moves through our portfolio, the phases below sketch *when* — a notional sequence (timelines are notional, not concrete). Foundational work continues throughout. + +![ODD phases — notional timeline](./roadmap-phases.svg) + +**FY26.4 · Now — Demonstrate the data lake.** Demonstrate the utility and performance of Icechunk stores as a data lake platform across varied data types (HLS, NISAR, GPM IMERG, NLDAS, TEMPO, …). VEDA instances demonstrate the data lake in action with scientists. Migrate the data services component (starting with TiTiler-CMR) to the Data Services team so ODD can prototype other services. Continue socialization (virtual stores coworking group, feasibility study) and capacity building (a guide on integrating with the data lake). + +**FY27 · Next — AI integration + caching performance.** Work with the AI/ML teams to demonstrate use of the data lake by AI (e.g. Water Insight or EIE): LLMs discover, reason about, and ingest data from the lake. Demonstrate caching performance using multiscales held in the Icechunk store and cached as a *data cache* (cached Zarr arrays), not a per-service tiling cache. + +**Foundational & ongoing — throughout.** Continuation of foundational work in Zarr, Icechunk, and the underlying libraries for geospatial data processing: IO libraries (obstore), warp / resampling / projection, reading and handling Zarr + COG directly in the browser, and geospatial data standards (GeoZarr). + +**Long term — Simplify.** A future architecture that deprecates HDF-specific tooling in favor of supporting only Zarr tooling, focusing data lake access on the query engine (DataFusion) and Zarr interfaces — not CMR and not archival files, which are comparatively slow, costly, and error-prone. 
+ +## How we work + +> "ODD should not be responsible for virtualizing everything! We (and our partners) are responsible for making it easy for NASA to virtualize things though." — Henry + +ODD is a research and development team, not an operations or continued-maintenance team. Success for any item on this roadmap is *graduating off of it* — not staying on it indefinitely. + +### Lifecycle + +Work moves through four stages: **Later** (future, aspirational) → **Next** (developing) → **Now** (mature) → **Handed off** (owned by someone else). + +An item is ready to hand off when it passes three tests: + +1. **Someone else can do it.** Documentation, tooling, and skills exist so that a data provider or partner can reproduce the work without us. +2. **Someone else owns it.** A named owner — a DAAC, a mission team, community maintainers — has accepted responsibility. +3. **We've stopped learning.** Our remaining contribution is maintenance, not discovery. + +Virtual data stores are an example: today we generate stores ourselves (learning). Next, +we will ship developer docs and optimization skills (enabling). Then store generation +graduates to data providers. Only the underlying tooling remains ours. Several +roadmap items — virtual store authoring docs, decision tooling, the optimization +skill/CLI, ecosystem sustainability (maintainer onboarding) — are not just projects but +handoff mechanisms. + +We don't yet have a reliable handoff process. Naming that honestly is the first step; building it is on the roadmap. + +### Prioritization + +At each planning cycle (PI), we ask two questions of the grid: + +- **What promotes?** Which Next items are ready to become Now? Which Later items are ready to become Next? +- **What graduates?** Which Now items pass the three handoff tests? + +Objectives we take on must also balance "utopian" goals — like a unified Zarr model — +with the necessity of supporting legacy patterns and other formats. 
+ +When evaluating new candidate work, we apply these criteria: + +- **Traceability.** Does it serve at least one vision story and one pillar? +- **Adoption readiness.** How quickly can the ecosystem absorb it? Building on familiar interfaces lowers the barrier (VirtualiZarr adopting xarray's data model made it immediately accessible); very new technology carries adoption lag as a risk (zarr-datafusion-search is powerful but the ecosystem may take years to take it on). +- **Cost.** What does adoption cost — in compute, energy, money, and user capability? Solutions that require cloud compute in a specific region, for example, exclude most users. +- **Handoff path.** Can we articulate who would eventually own this, even roughly? + +## Deeper context + +What each roadmap item unlocks, and what success looks like. + +### Open standards & FAIR data + +**◆ Array format stewardship.** The foundational format for cloud-native array data — Zarr. Ongoing maintenance and stewardship, including convening the community — e.g. Zarr Summit '26/27 — to unblock progress on technical features and convention adoption. + +**◆ Geospatial conventions.** Zarr conventions for geospatial metadata (GeoZarr), essential for native and virtual Zarr collections to interoperate across GIS, visualization, and analysis libraries. Closing in on submission of the GeoZarr standard to the OGC architecture board. Success: trust and interoperability for Zarr data from all Earth data providers (NASA, NOAA, ESA), and a consistent, non-ambiguous platform to build client applications on. + +**Ecosystem sustainability.** A sustainable maintainer ecosystem for Zarr to support growing, complex use cases — the zarr-python roadmap plus maintainer onboarding. Success: adoption of the roadmap by maintainers and stakeholders, plus one or two new onboarded maintainers making significant contributions — reducing stagnation and broadening design perspectives. 
+ +**Codec re-architecture.** The Zarr v2→v3 transition exposed design issues in the codec model. Re-architecting it supports new codec development (vital for virtualization, where archival formats use less-standardized codecs) and alternative client implementations in Rust and TypeScript. Follow-ons: *CF codecs* — capturing CF-convention decoding logic as codecs rather than attribute dictionaries, so clients interacting directly with the Zarr API don't need to duplicate xarray's specialized decoding logic; and *concatenated arrays* — supporting variable compression to unlock virtualization of quirky datasets like MUR SST (pre-design). + +**Convention + CRS utilities.** Utilities and guidance for keeping virtual store metadata aligned with CF and GeoZarr conventions. Unblocks tools that rely on those conventions from using compliant virtual stores. + +### Performance, cost & scale + +**Data virtualization.** Access archival data through the Zarr API without duplicating it — VirtualiZarr. Includes parser improvements (virtual-tiff, obspec-utils, async-hdf5, GRIB) — or transitioning parser maintenance to partners, which is itself a handoff opportunity. This is also our current lever on storage cost (see *Storage cost optimization*). + +**Object-store access.** High-performance object storage access for the Python geospatial stack — obstore. + +**Dynamic tiling.** Tiling driven by CMR — TiTiler-CMR. Current work: regenerated compatibility report (with group support), OPERA integration into the disasters portal, a distributed cache for S3 credentials (~1s saved per cold-start request), and WMTS GetCapabilities so EGIS can surface HLS vegetation indices in ArcGIS. + +**Lazy array analytics.** Instantly materialize massive lazy 4-D arrays (time, band, x, y) from metadata stores — lazycogs, a scalable replacement for stackstac/odc-stac. Success: any collection stored as COGs can be analyzed through a collection-level xarray API. 
+ +**Variable chunking.** Variable chunk support in VirtualiZarr + xarray; unlocks virtualizing more datasets. Near-term delivery. + +**Analytics-scale metadata.** EOSDIS has identified pressure on CMR as a significant risk. Prototype collection-level stores using GeoParquet/Iceberg and zarr-datafusion to understand performance, cost, and scaling — and contribute to the relevant open-source libraries. Includes STAC in Iceberg: an object-storage-only STAC catalog giving providers API-less metadata access. + +**Storage model evaluation.** Understand emerging storage models and their trade-offs — currently the S3 Files synchronization model: compare performance to native S3 for common operations and understand its pricing. Potential to serve both durable shared storage and the low-latency block access that ML and massively parallel array workloads need. + +**Resampling/warp tooling.** A composable, Rust-based resampling/warp library reducing dependence on GDAL's monolithic toolchain. Usable from server-side tiling, distributed array frameworks (Dask, Cubed), and WASM in-browser rendering. Pre-design; builds on a full ecosystem assessment. + +**Query at scale.** Query and access data at scale through a single interface — zarr-datafusion-search. Paves the way for Zarr as a storage target for Level 0/1 and swath data, and moves EOSDIS toward an Arrow-native ecosystem. High potential, but very new — adoption lag is the known risk. + +**Storage cost optimization.** Addressing the growing cost of data volumes in Earthdata Cloud. We are not actively working on this beyond *data virtualization* (accessing archival data through the Zarr API without duplicating it). Avoiding duplication is the lever we pull today; broader storage cost strategies remain future work. + +### Empowered users + +**Cloud-native guidance.** The CNG guide: unblock people confused about which formats exist, why, and when to use each. 
Success: people use the guide to build cloud-native datasets, or to explain to stakeholders why a dataset was built a given way. + +**Science support.** Direct support for science users, including cloud-optimized data usage guidance (e.g., xarray arguments) in the guide and datacube guide. + +**Format evaluation.** Evaluate mission data formats and recommend improvements that enable optimized access patterns — currently NISAR: assess the NISAR HDF5 format and advise the Algorithm Development Team before the official release in summer 2026. Includes a virtualization + data fusion prototype showing a more user-friendly virtual representation. + +**In-browser rendering.** In-browser GPU rendering of COGs and Zarr via direct data access (deck.gl-raster + Lonboard) — users customize rendering without re-fetching data. Current work: demonstrations in documentation (band combinations, direct access), initial GeoZarr support in both libraries, and a TypeScript WKB→GeoArrow parser enabling DuckDB-Wasm integration. Current limitation: requires open data access. + +**Virtual store authoring.** How to build virtual stores, with or without agents — developer docs. Unblocks DAACs and science teams as virtual store developers — a primary handoff mechanism. + +**Cloud-optimized decision framework.** The cloud-optimized data decision tree: a diagram plus explanatory text with examples per branch, guiding format and chunking decisions. Foundation for AI-assisted optimization. + +**Improved access & auth libraries.** Libraries that get data and credentials into users' hands — earthaccess v1, notably a modular approach with refreshable credentials in a lightweight earth-auth package; finish opening Icechunk stores via earthaccess. + +**AI-assisted optimization (skills + tooling).** A CLI and agentic skill for data structure optimization, plus an agent that walks data providers through chunking and format decisions (CO data AI guidance) — usable across ESDS. 
Builds on the cloud-optimized decision framework, reducing engineering time to a balanced or optimized data structure. + +**Dataset + tooling coverage metrics.** Assess how many NASA datasets work with our tools (VirtualiZarr, datafusion, lazycogs) so we have metrics for improvement and impact. + +**AI/ML data-lake demonstrations.** Data access is shifting from web / Python / in-house systems toward AI agents, making AI a primary class of user. Work with the AI/ML teams to demonstrate use of the data lake by AI — e.g. in Water Insight or EIE — in the first two quarters of FY27 (27.1–27.2). Show that LLMs can discover, reason about, and ingest data from the lake: semantic discovery beyond STAC, query via DataFusion, and direct Zarr ingest. + +**ESRI / ArcGIS integration.** A large share of NASA data users work in ArcGIS, so our tools and data need to integrate with ESRI systems rather than require users to leave them. Ensure our cloud-native outputs are consumable there through the open standards ESRI already supports (COG, WMTS, OGC APIs, GeoZarr) — the EGIS/ArcGIS WMTS work in *Dynamic tiling* is the first concrete instance. Meeting users where they are, not requiring new software. + +### Trusted & reliable data + +**◆ Transactional Zarr.** Checksum verification and ACID transactions for Zarr stores (Icechunk) — the reliability layer. + +**Remote store access.** Bearer-token HTTP support unblocks NASA data users without cloud compute in us-west-2 from using virtual stores — PO.DAAC has identified this as the single blocker to rolling out their Icechunk stores. Also: parsing manifests back out of Icechunk (inspection and modification of virtual stores, plus risk mitigation) and prefix-changing utilities. + +**Live virtual stores.** Stores kept current as data lands — e.g. MUR SST as native Zarr, rechunked for time series, updated in near-real time as an AWS Public Dataset. 
Serves anyone doing historical or NRT sea surface temperature analysis, and demonstrates Icechunk's capabilities end to end. + +**Synchronized metadata + data.** Keep metadata in sync with data (via zarr-datafusion-search) — addressing the gap where metadata and data drift apart. + +**Event-driven NRT updates.** Icechunk makes all store updates trackable by listening to changes in object storage keys, enabling simple event-driven pipelines: dynamically updated pyramids (e.g., for Worldview), summary statistics, pre-computed time series. The path to keeping virtual stores current with incoming data streams — and to the near-real-time vision story. + +--- + +*Open questions for the team: verify the Earth Information Explorer claim in the gap table; align timelines with data services (when do they stop coggifying?) and front-end teams (will tile servers eventually go away?); define our first formal handoff.* \ No newline at end of file From dcb690a3760f92d7a2ec559c51d6016bac85e946 Mon Sep 17 00:00:00 2001 From: Aimee Barciauskas Date: Sun, 28 Jun 2026 16:06:46 -0700 Subject: [PATCH 3/8] Remove unused diagrams --- .../diagrams_svg/00_overview_10k.svg | 84 ------------------- .../diagrams_svg/cloud-native-data-lake.svg | 4 - 2 files changed, 88 deletions(-) delete mode 100644 docs/dse-architecture-vision/diagrams_svg/00_overview_10k.svg delete mode 100644 docs/dse-architecture-vision/diagrams_svg/cloud-native-data-lake.svg diff --git a/docs/dse-architecture-vision/diagrams_svg/00_overview_10k.svg b/docs/dse-architecture-vision/diagrams_svg/00_overview_10k.svg deleted file mode 100644 index 4f05a12..0000000 --- a/docs/dse-architecture-vision/diagrams_svg/00_overview_10k.svg +++ /dev/null @@ -1,84 +0,0 @@ - - - - - Architecture at 10,000 ft - From the data lake out to user interfaces — plus direct access via a cloud-hosted dev environment. 
- NASA Data System Evolution (DSE) - Architecture Vision - - - - new data lake - - existing / neutral - - - - - - Archival Files - HDF5 / netCDF - object storage (S3) - - - - - Icechunk - Zarr arrays (n-D) - virtual or native - - - - - Query Engine - DataFusion - discover & filter collections - - - - - APIs - subset · reformat · tiles - timeseries · custom products - - - - - User Interfaces - web apps · notebooks - python / HTTP clients - - - - - - - - - - - - - - Cloud-hosted development environment - notebooks co-located with the data in AWS us-west-2 - power users go straight to the data, bypassing the API / query layer - - - - - direct Icechunk access - - - - direct archival access - - - - - - users log into - - Teal = new data lake (Icechunk / Zarr · query engine · direct access). Grey = existing / neutral. Solid = primary read path; dashed = direct-to-data shortcuts. - diff --git a/docs/dse-architecture-vision/diagrams_svg/cloud-native-data-lake.svg b/docs/dse-architecture-vision/diagrams_svg/cloud-native-data-lake.svg deleted file mode 100644 index 0ff13cb..0000000 --- a/docs/dse-architecture-vision/diagrams_svg/cloud-native-data-lake.svg +++ /dev/null @@ -1,4 +0,0 @@ - - -CMR APIDiscoverabilityfcCommon MetadataRepository(CMR) RDStable providersServicesEarthdata CloudData Lakepython client | HTTP API | web appNetCDF4TEMPOHDF5ICESat-2COGHLSHDF5GPM IMERGHDF5NISARpython client | HTTP APIMission Product TeamsData ProductGenerationCustom Product GenerationAccess / Subsetting / ReformattingVisualizationlarge async requests(servers)small synchronous requests(serverless)outputs topublishes toicechunk + iceberg datapipelinetriggers S3 notificationappends to icechunk(virtual or native)appends iceberg rowspython clientfast, pre-generated imagery(servers)dynamic (user-driven) imagery(servers)data directly in-browser(serverless, authentication barrier)pngGIBSusers select from existing registered algorithmsauthorized users use MAAPdata processing systemcloud-hosted 
notebooks + DPS APIsmall synchronous requests(serverless)Usersaccess viaaccessesquery + filtera collectiongenerates products on in developmentexistsNASA Data Systems Evolutionaccess viastagingpublishes todiscovers frommoves to \ No newline at end of file From c7460cf92322ac587bc1741b0ce4b6372a8284c6 Mon Sep 17 00:00:00 2001 From: Aimee Barciauskas Date: Sun, 28 Jun 2026 16:55:59 -0700 Subject: [PATCH 4/8] Update roadmap --- .../diagrams_svg/nasa-esdis-evolution.svg | 4 +- docs/dse-architecture-vision/index.html | 239 ++++-------------- docs/dse-architecture-vision/todos.md | 3 + docs/roadmap-phases.svg | 60 ----- docs/roadmap.md | 50 +++- 5 files changed, 90 insertions(+), 266 deletions(-) create mode 100644 docs/dse-architecture-vision/todos.md delete mode 100644 docs/roadmap-phases.svg diff --git a/docs/dse-architecture-vision/diagrams_svg/nasa-esdis-evolution.svg b/docs/dse-architecture-vision/diagrams_svg/nasa-esdis-evolution.svg index 21f9780..1f7be87 100644 --- a/docs/dse-architecture-vision/diagrams_svg/nasa-esdis-evolution.svg +++ b/docs/dse-architecture-vision/diagrams_svg/nasa-esdis-evolution.svg @@ -1,4 +1,4 @@ -CMR APIDiscovery and AccessfcCommon MetadataRepository(CMR) RDStable providersServicesEarthdata CloudData Lakepython client | HTTP API | web appNetCDF4TEMPOHDF5ICESat-2COGHLSHDF5GPM IMERGHDF5NISARpython client | HTTP APIMission Product TeamsData ProductGenerationCustom Product GenerationAccess / Subsetting / ReformattingVisualizationlarge async requests(servers)small synchronous requests(serverless)outputs topublishes toicechunk + iceberg datapipelinetriggers S3 notificationappends to icechunk(virtual or native)appends iceberg rowspython clientfast, pre-generated imagery(servers)dynamic (user-driven) imagery(servers)data directly in-browser(serverless, authentication barrier)pngGIBSusers select from existing registered algorithmsauthorized users use MAAPdata processing systemcloud-hosted notebooks + DPS APIsmall synchronous 
requests(serverless)Usersaccess viaaccessesquery + filtera collectiongenerates products on in developmentexistsaccess viastagingpublishes todiscovers frommoves to \ No newline at end of file +CMR APIDiscovery and AccessfcCommon MetadataRepository(CMR) RDStable providersServicesEarthdata CloudData Lakepython client | HTTP API | web appNetCDF4TEMPOHDF5ICESat-2COGHLSHDF5GPM IMERGHDF5NISARpython client | HTTP APIMission Product TeamsData ProductGenerationOn-Demand + Custom Product GenerationAccess / Subsetting / ReformattingVisualizationlarge async requests(servers)small synchronous requests(serverless)outputs topublishes toicechunk + iceberg datapipelinetriggers S3 notificationappends to icechunk(virtual or native)appends iceberg rowspython clientfast, pre-generated imagery(servers)dynamic (user-driven) imagery(servers)data directly in-browser(serverless, authentication barrier)pngGIBSusers select from existing registered algorithmsauthorized users use MAAPdata processing systemcloud-hosted notebooks + DPS APIsmall synchronous requests(serverless)Usersaccess viaaccessesquery + filtera collectiongenerates products on in developmentexistsaccess viastagingpublishes todiscovers frommoves topython client \ No newline at end of file diff --git a/docs/dse-architecture-vision/index.html b/docs/dse-architecture-vision/index.html index 4cd9faf..2de71d6 100644 --- a/docs/dse-architecture-vision/index.html +++ b/docs/dse-architecture-vision/index.html @@ -133,16 +133,16 @@

Cloud-Native Data Lake

Why Now?

NASA has successfully migrated many of its collections to cloud storage. However, it still lacks a unifying vision for a cloud-native architecture.

-
TODAYEarthdata is mission-centric & fragmented
    -
  • Each mission picks its own formats & chunking; there is little cross-mission consistency in data structure.
  • +
    TODAY: Earthdata Cloud is mission-centric & fragmented
      +
    • Each mission picks its own formats and chunking. There is little cross-mission consistency in data structure.
    • Even cloud-optimized HDF5 granules fall short of delivering the optimal performance from cloud object storage.
    • Differentiated metadata makes comparing similar products hard; data fusion is painful.
    • There is no structural incentive for missions to invest in downstream usability.
    • -
    • Metadata and data management are still distinct; risking incnsistency
    • +
    • Metadata and data management are still distinct, risking inconsistency
    • Download + process is still a primary access pattern (e.g. Download is the only access method available in the Earthdata search user interface)
    • CMR is strained by analytics & AI-agent traffic
    -
    3-5 YEARS IN THE FUTUREEarthdata cloud is a unified, cloud-native data lake
      +
      3-5 YEARS IN THE FUTURE: Earthdata Cloud is anchored by a unified, cloud-native data lake
      • Organizing data around a single data model (i.e. Zarr), means building a tool or service once and it can be used for many datasets.
      • Mission data product requirements are adopted: consistent cloud-friendly structure (sharding, chunking), consistent and standards-compliant metadata.
      • Chunk manifests create a bridge between archival file formats and a single data model.
      • @@ -190,44 +190,6 @@

        The Data Lake complements existing systems

      - -

      NASA ESDIS Architecture Cloud-Native Data Evolution

      @@ -235,137 +197,29 @@

      NASA ESDIS Architecture Cloud-Native Data Evolution

      SVG: diagrams_svg/nasa-esdis-evolution.svg
      - - - - - -

      Roles & Responsibilities

      -

      Multiple teams, one shared interface.

      +

      Clear roles, responsibilities and well-defined interfaces will be required to transition from data silos and disparate systems to a shared cloud-native data lake.

      Data Producer Teams: all NASA-funded data production (mission, science investigator & DAAC)
        -
      • All NASA-funded teams produce across Levels 0–4: mission-funded standard + project-funded value-added products
      • -
      • Mission data processing (e.g. HySDS, with MAAP as one platform built on it) is a complementary system — not part of the data lake; missions choose whether to adopt it
      • -
      • Scope & standards documented in each mission's Data Management Plan (DMP)
      • -
      • DMPs should require: CF conventions, object-store-optimized chunking, Icechunk / Iceberg delivery
      • -
      • Publish chunk indices (virtual Icechunk manifests) to the shared stores
      • -
      • Register collections + collection-level metadata so products are discoverable & queryable
      • +
      • Who? All NASA-funded teams producing products at any processing level, from mission science teams to project-funded value-added product teams
      • +
      • Mission data processing is a complementary system which will feed into the data lake
      • +
      • Each mission's Data Management Plan (DMP) should require a detailed plan for adhering to data lake conventions, such as CF conventions, GeoZarr and object-store-optimized chunking.
      • +
      • DMPs should require a detailed plan for Icechunk or Iceberg delivery
      publish →
      -
      Shared Data Lakethe contract / interface between the teams
        -
      • CF conventions — the common metadata standard
      • -
      • Apache Iceberg — tabular data + indices
      • -
      • Icechunk / Zarr — n-D array stores
      • -
      • Virtual chunk-manifest indices — published by producers → queryable
      • -
      • Collection & metadata registry — discovery across & within collections
      • +
        Data System Team: maintains the data lake contract and infrastructure.
          +
        • Develop and maintain the standards for the data lake.
        • +
        • Develop and maintain interfaces for data producers to submit products to the data lake.
        • +
        • Maintain, monitor and secure the data storage and query engine infrastructure. Ensure durability and reliability. Validate incoming data.
        • +
        • Maintain supporting libraries for data integration.
        → consume
        -
        Cross-Product Services Teamdata lake consumers
          -
        • Operate the shared Iceberg + Icechunk stores and the query engine
        • -
        • Discoverability & queryability (CMR + DataFusion federation)
        • -
        • Subsetting & reformatting (Harmony); on-demand products (SlideRule)
        • -
        • Analytics & visualization services
        • -
        • Build once on uniform data models — reuse across every dataset
        • +
          Data Services Teams: data lake consumers
            +
          • E.g. Subsetting & reformatting (Harmony); on-demand products (SlideRule)
          • +
          • Analytics & visualization services (TiTiler-CMR, Worldview, VEDA)
          • +
          • Build once on shared data models to support extensibility to many datasets.
      @@ -395,7 +249,7 @@

      Discussion questions

      References

        -
      • Original reference architecture — +
      • Architecture Diagramsapp.excalidraw.com/s/66b8kXd4wid/3M1qblQjtJk
      • Harmonyharmony.earthdata.nasa.gov
      • @@ -410,28 +264,45 @@

        References

        -

        Extra

        +

        Extras

        +
        + + +
        +

        Extra — the array stack vs. the tabular stack

        +

        Two parallel stacks, one query engine. Choose the store by data shape: dense arrays → Zarr / Icechunk; records (points, swaths, features) → Parquet / Iceberg.

        + + + + + + + + + + +
        | Layer | Array / n-D world | Tabular world |
        | --- | --- | --- |
        | On-disk format | Zarr | Parquet (GeoParquet for vector) |
        | Transactional store | Icechunk | Iceberg (+ pyiceberg for snapshots) |
        | Format reader (decode → memory) | zarr-python | pyarrow |
        | In-memory analysis | xarray | pandas · polars · GeoPandas |
        | Query engine | DataFusion (via zarr-datafusion) | DataFusion (polars overlaps) |
        | Example NASA products | gridded L3/L4: GPM IMERG · NLDAS · MUR SST · TEMPO L3 · HLS · model reanalysis | points / records: ICESat-2 photons · GEDI footprints · swath L1B/L2 · in-situ & vector · STAC catalog |
        +
        One query engine over both: DataFusion spans the two stacks. polars is a dataframe library that also acts as a mini query engine, so it overlaps DataFusion rather than matching zarr-python's decode-only role.
        -

        Object-store IO layer — what drives efficient, low-cost access

        -

        The IO library sets the number and size of S3 GET requests and compute time — i.e. the cost of access. The two are complementary.

        +

        Object-store IO layer — what drives efficient, low-cost access

        +

        The IO library manages S3 GET requests (parallelism, synchronicity, request block size).

        -

        Rust object_store

        cloud-native · general purpose
          +

          object_store

          cloud-native · general purpose
          • Unified async API over S3 / GCS / Azure (Apache Arrow project)
          • Powers the Rust data ecosystem: DataFusion, Iceberg, Icechunk / Zarr
          • Concurrent range reads + connection pooling for chunk-level access
          • -
          • Best for cloud-native formats — Zarr, Parquet, Iceberg tables

          h5coro

          cloud-optimized HDF5 · file-level
          • Reads HDF5 directly from S3 without the HDF5 library
          • -
          • Minimizes requests by smart caching of metadata / B-trees
          • -
          • Efficient access to existing archival granules — no reformatting
          • +
          • Minimizes requests by smart caching of metadata and B-trees
          • +
          • Efficient access to existing archival granules, no reformatting
          • File-level and HDF5-specific by design
        -

        Bottom line: object_store for cloud-native stores; h5coro for legacy HDF5 in place — a cost-vs-migration tradeoff.

        +

        Bottom line: object_store for cloud-native stores; h5coro for legacy HDF5 in place.

        @@ -453,11 +324,11 @@

        AI as a primary objective

        Discover
        Rich, consistent metadata makes datasets findable by agents — and discovery will broaden beyond STAC. In two years, semantic search over ATBDs may matter as much as a STAC endpoint; the metadata layer is designed for both structured (STAC / query) and semantic / LLM retrieval.
        Reason
        -
        Consistent CF metadata plus a uniform query interface (DataFusion / SQL) let agents compare datasets and compose queries without per-dataset glue code.
        +
        Consistent metadata (CF, GeoZarr) plus a uniform query interface (DataFusion / SQL) let agents compare datasets and compose queries without per-dataset glue code.
        Ingest
        Cloud-friendly Zarr / Icechunk chunking and cached multiscales give agents efficient, low-cost array access at the resolution they need — direct over S3.
        On the roadmap
        -
        Work with the AI/ML teams to demonstrate use of the data lake (e.g. Water Insight or EIE) in the first two quarters of FY27.
        +
        Work with the AI/ML teams to demonstrate use of the data lake.
      @@ -471,24 +342,6 @@

      Long term — a simpler vision

      The DataFusion query layer is stateless and horizontally scalable over partitioned metadata in object storage — replacing CMR's single-database (RDS) bottleneck — so discovery and query scale with object storage and absorb analytics- and AI-agent-scale traffic.
      - -
      -

      Extra — the array stack vs. the tabular stack

      -

      Two parallel stacks, one query engine. Choose the store by data shape: dense arrays → Zarr / Icechunk; records (points, swaths, features) → Parquet / Iceberg.

      - - - - - - - - - - -
      LayerArray / n-D worldTabular world
      On-disk formatZarrParquet  (GeoParquet for vector)
      Transactional storeIcechunkIceberg  (+ pyiceberg for snapshots)
      Format reader (decode → memory)zarr-pythonpyarrow
      In-memory analysisxarraypandas · polars · GeoPandas
      Query engineDataFusion  (via zarr-datafusion)DataFusion  (polars overlaps)
      Example NASA productsgridded L3/L4 — GPM IMERG · NLDAS · MUR SST · TEMPO L3 · HLS · model reanalysispoints / records — ICESat-2 photons · GEDI footprints · swath L1B/L2 · in-situ & vector · STAC catalog
      -
      One query engine over both: DataFusion spans the two stacks. polars is a dataframe library that also acts as a mini query engine, so it overlaps DataFusion rather than matching zarr-python's decode-only role.
      -
      -
    diff --git a/docs/dse-architecture-vision/todos.md b/docs/dse-architecture-vision/todos.md new file mode 100644 index 0000000..a4224e7 --- /dev/null +++ b/docs/dse-architecture-vision/todos.md @@ -0,0 +1,3 @@ +- [ ] add stac, stac-geoparquet to services +- [x] revise roles and responsibilities +- [x] review roadmap \ No newline at end of file diff --git a/docs/roadmap-phases.svg b/docs/roadmap-phases.svg deleted file mode 100644 index 61479dc..0000000 --- a/docs/roadmap-phases.svg +++ /dev/null @@ -1,60 +0,0 @@ - - - - - ODD phases — notional timeline - Timelines are notional, not concrete - - - - - - FY26.4 · Now - Demonstrate the data lake - • Icechunk data lake across varied data - types: HLS, NISAR, GPM IMERG, - NLDAS, TEMPO, … - • VEDA instances demonstrate the - data lake with scientists - • Migrate data services (TiTiler-CMR) - to the Data Services team - • Socialization + capacity building - (coworking group, integration guide) - - - - - - FY27 · Next - AI integration + caching - • Demonstrate AI integration with the - AI/ML teams (Water Insight / EIE) - • LLMs discover, reason about, ingest - data from the lake - • Demonstrate caching performance: - multiscale data cache in Icechunk - (a data cache, not a tiling cache) - - - - - - Long term - Simplify - • Deprecate HDF tooling; support - only Zarr tooling - • Access only via the query engine - + Zarr interfaces - • Not CMR, not archival files - (slow, costly, error-prone) - - - - - - - - - Foundational & ongoing — throughout all phases - Zarr · Icechunk · obstore (IO) · warp / resampling / projection · reading Zarr + COG directly in the browser · GeoZarr & data standards - diff --git a/docs/roadmap.md b/docs/roadmap.md index 9d7a53a..ee08209 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -56,17 +56,45 @@ Every objective of this team should trace to at least one vision story and one p ## Phases -While the grid above tracks *what* moves through our portfolio, the phases below sketch *when* — a 
notional sequence (timelines are notional, not concrete). Foundational work continues throughout. - -![ODD phases — notional timeline](./roadmap-phases.svg) - -**FY26.4 · Now — Demonstrate the data lake.** Demonstrate the utility and performance of Icechunk stores as a data lake platform across varied data types (HLS, NISAR, GPM IMERG, NLDAS, TEMPO, …). VEDA instances demonstrate the data lake in action with scientists. Migrate the data services component (starting with TiTiler-CMR) to the Data Services team so ODD can prototype other services. Continue socialization (virtual stores coworking group, feasibility study) and capacity building (a guide on integrating with the data lake). - -**FY27 · Next — AI integration + caching performance.** Work with the AI/ML teams to demonstrate use of the data lake by AI (e.g. Water Insight or EIE): LLMs discover, reason about, and ingest data from the lake. Demonstrate caching performance using multiscales held in the Icechunk store and cached as a *data cache* (cached Zarr arrays), not a per-service tiling cache. - -**Foundational & ongoing — throughout.** Continuation of foundational work in Zarr, Icechunk, and the underlying libraries for geospatial data processing: IO libraries (obstore), warp / resampling / projection, reading and handling Zarr + COG directly in the browser, and geospatial data standards (GeoZarr). - -**Long term — Simplify.** A future architecture that deprecates HDF-specific tooling in favor of supporting only Zarr tooling, focusing data lake access on the query engine (DataFusion) and Zarr interfaces — not CMR and not archival files, which are comparatively slow, costly, and error-prone. +While the grid above tracks *what* moves through our portfolio, the phases below sketch *when* — a notional sequence (timelines are notional, not concrete). 
+ + + ODD phases — notional timeline + timelines are notional, not concrete + + FY26.4 + FY27.1 + FY27.2 + FY27.3 + FY27.4 + + + + + + + + + + + Demonstrate the data lake — varied datasets + + Demonstrate the query engine + service integration + + Demonstrate caching + AI use + + Throughout: socialization of the plan · external-team integration · iterating on the plan as we incorporate varied datasets + + Foundational libraries: Zarr · Icechunk · obstore (IO) · warp / resampling / projection performance · in-browser Zarr + COG · GeoZarr & standards + + +**FY26.4–27.1 — Demonstrate the data lake.** Demonstrate the utility and performance of Icechunk stores as a data lake platform across varied data types (HLS, NISAR, GPM IMERG, NLDAS, TEMPO, …). VEDA instances demonstrate the data lake in action with scientists, and we migrate the data services component (starting with TiTiler-CMR) to the Data Services team so ODD can prototype other services. + +**FY27.1–27.2 — Demonstrate the query engine + service integration.** Show discovery and query across the lake via the query engine (DataFusion), and integrate it with the data services so a single interface serves discovery, query, and access. + +**FY27.3–27.4 — Demonstrate caching + AI use.** Demonstrate caching performance using multiscales held in the Icechunk store and cached as a *data cache* (cached Zarr arrays), not a per-service tiling cache. Work with the AI/ML teams to demonstrate use of the data lake by AI (e.g. Water Insight or EIE): LLMs discover, reason about, and ingest data from the lake. + +**Throughout — alongside every phase.** Socialization of the plan, integration of external teams, and iterating on the plan as we work to incorporate varied datasets. 
Plus continuation of foundational work in Zarr, Icechunk, and the underlying geospatial libraries: IO (obstore), warp / resampling / projection **performance**, reading and handling Zarr + COG directly in the browser, and geospatial data standards (GeoZarr). ## How we work From 24e26477b287c240f536f25a3a1953905529e1a5 Mon Sep 17 00:00:00 2001 From: Aimee Barciauskas Date: Sun, 28 Jun 2026 17:00:59 -0700 Subject: [PATCH 5/8] Updated roadmap --- docs/roadmap.md | 199 +++++++++++++++++++++++++++++++----------------- 1 file changed, 129 insertions(+), 70 deletions(-) diff --git a/docs/roadmap.md b/docs/roadmap.md index a6c8988..ee08209 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -1,15 +1,18 @@ # ODD roadmap -This page explains the motivations behind ODD's daily work. It connects what we're building to why we're building it. The primary audience is the ODD team. The secondary audience is peer ODSI teams who want to understand how our work fits the broader picture. +This page exists to explain the motivations behind ODD's daily work. It connects what +we're building to why we're building it, and explains how work enters, moves through, +and may eventually leave our portfolio. The primary audience is the ODD team. +The secondary audience is peer ODSI teams who want to understand how our work fits the broader picture. -## Vision +## Vision: who we serve -If we are successful, we imagine users will be able to: +Our vision is expressed as the experiences users will have when we've succeeded: -1. **Ask questions in plain language and reproduce the response:** As an Earth enthusiast, I want to ask questions like "how did the Gifford fire evolve?" and get an animated visual. I want to be able to reproduce responses with links to the source code that produced the analysis, so I can verify and reproduce it. -2. 
**Explore in the browser:** As an Earth enthusiast, I want to visually explore forest disturbance through NISAR data directly in my browser, with no specialized software or cloud account. -3. **Research at scale:** As a fire event researcher, I want to evaluate relationships between variables from different data products across many thousands of fires, with minimal data pre-processing for fusion and modeling. -4. **Operate in near-real time:** As an operational application, I need products like HLS for disaster response, or sea surface temperature for maritime operations, available in near-real time. +1. **Ask in plain language and reproduce response.** As an Earth enthusiast, I want to ask questions like "how did the Gifford fire evolve?" and get an animated visual response — with links to the source code that produced the analysis, so I can verify and reproduce it. +2. **Explore in the browser.** As an Earth enthusiast, I want to visually explore forest disturbance through NISAR data directly in my browser, with no specialized software or cloud account. +3. **Research at scale.** As a fire event researcher, I want to evaluate relationships between variables from different data products across many thousands of fires, with minimal data pre-processing for fusion and modeling. +4. **Operate in near-real time.** As an operational **application**, I need products like HLS for disaster response or sea surface temperature for maritime operations available in near-real time. 
## The gap @@ -19,47 +22,89 @@ NASA already serves these users — but current services have limits that grow m | ------------------------- | -------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | | Ask in plain language | Earth Information Explorer | Limited dataset access; datasets must be curated into the system | | Explore in the browser | Worldview / GIBS | Not configurable by users; pre-rendered layers don't scale to new datasets or rendering needs | -| Research at scale | Earthdata Cloud, Harmony, cloud-hosted JupyterHubs | Harmony offloads processing to servers, requiring heavy compute cost rather than a structural fix; users struggle to find the best datasets for their needs | +| Research at scale | Earthdata Cloud, Harmony, cloud-hosted JupyterHubs | Harmony offloads processing to servers — heavy compute cost rather than a structural fix; users struggle to find the best datasets for their needs | | Operate in near-real time | LANCE + HLS | Hard to keep metadata and data in sync; no reliable notification system for new data landing in Earthdata Cloud buckets | -| Data discovery | CMR | Under increasing pressure from rapid archive growth and analytics-scale query traffic | +| All of the above | CMR | Under increasing pressure from rapid archive growth and analytics-scale query traffic | + +Across all of these: discovery is hard, and current systems are becoming unsustainable as data volumes grow. ## Our pillars We address these gaps through four pillars: -1. **Open standards & FAIR data:** NASA data and services are findable, accessible, interoperable, and reusable, built on community standards rather than bespoke systems. -2. **Performance, cost & scale:** Optimize performance while minimizing cost, with solutions that scale sustainably to new and growing data volumes. -3. 
**Empowered users:** Users — both data providers and data consumers — can use and apply the solutions we build without us. -4. **Trusted & reliable data:** The data products NASA generates are verifiable, consistent, and kept in sync with their metadata. +1. **Open standards & FAIR data.** NASA data and services are findable, accessible, interoperable, and reusable, built on community standards rather than bespoke systems. +2. **Performance, cost & scale.** Optimize performance while minimizing cost, with solutions that scale sustainably to new and growing data volumes. +3. **Empowered users.** Users — both data providers and data consumers — can use and apply the solutions we build without us. +4. **Trusted & reliable data.** The data products NASA generates are verifiable, consistent, and kept in sync with their metadata. -Further, we maintain high standards for the software we develop or reuse, while never intending to duplicate effort. All software we develop or use should be of high quality, under an open source license, and developed and adopted by a broad community. +**Cross-cutting foundation: community developed + adopted.** Every item on this roadmap is built in the open, with and for the community. Open source is the license; community development and adoption is the practice — it's how solutions outlive our involvement, and it underpins all four pillars. ## Roadmap -Listed in the table below are technologies and technical components this team plans or is contributing to. We believe these components will make progress towards the vision and pillars described above. - -Below, the **[Roadmap Items in Detail](#roadmap-items-in-detail)** section provides a brief description of each roadmap item. - -* **Now · mature** means this is a mature technology. We are currently working on it but it is ready for adoption. -* **Next · developing** means this is a developing technology. We are currently working on it so it will be ready for adoption. 
Timeline for maturity and adoption readiness varies. -* **Later · future** means this is a technology we are not actively developing. We would like to work on it but other technologies in active development take precedence. - -The **◆** designation represents a category of ongoing work. - -| Pillar | Now · mature | Next · developing| Later · future | +| Pillar | Now · mature | Next · developing | Later · future | | ------------------------------ | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | -| **Open standards & FAIR data** | ◆ Array format (Zarr) stewardship · ◆ Geospatial conventions (GeoZarr) | Zarr Ecosystem sustainability · Codec re-architecture · variable chunking | Conventions + CRS utilities | -| **Performance, cost & scale** | Data virtualization · Object-store access · Dynamic tiling · In-browser rendering | Virtual stores + lazy array analytics · Analytics-scale metadata · Storage model evaluation | Resampling/warp tooling · Query at scale · Storage cost optimization · Caching | -| **Empowered users** | Cloud-native guidance · Science support · Format evaluation | In-browser rendering · Cloud-optimized decision framework · Improved access & auth libraries · Dataset + tooling coverage metrics | AI-assisted optimization (skills + tooling) · ESRI / ArcGIS integration | -| **Trusted & reliable data** | ◆ Transactional Zarr (Icechunk) | Virtual stores for ongoing datasets · Synchronized metadata + data | Event-driven (object store notifications) for near-real time (NRT) updates | +| **Open standards & FAIR data** | ◆ Array format (Zarr) stewardship · ◆ Geospatial conventions (GeoZarr) | Ecosystem sustainability · Codec re-architecture · variable chunking | Convention + CRS utilities | +| **Performance, cost & 
scale** | Data virtualization · Object-store access · Dynamic tiling · In-browser rendering | Virtual stores + lazy array analytics · Analytics-scale metadata · Storage model evaluation | Resampling/warp tooling · Query at scale · Storage cost optimization | +| **Empowered users** | Cloud-native guidance · Science support · Format evaluation | In-browser rendering · Cloud-optimized decision framework · Improved access & auth libraries · Dataset + tooling coverage metrics · AI/ML data-lake demonstrations | AI-assisted optimization (skills + tooling) · ESRI / ArcGIS integration | +| **Trusted & reliable data** | ◆ Transactional Zarr (Icechunk) | Remote store access · Live virtual stores · Synchronized metadata + data | Event-driven (object store notifications) for near-real time (NRT) updates | + +**◆ Foundational** — a category of work that is ongoing. + +**Handed off:** nothing yet — see [How we work](#how-we-work). Building a working handoff path is a goal itself. + +Every objective of this team should trace to at least one vision story and one pillar. Each item name links to deeper context below. + +## Phases + +While the grid above tracks *what* moves through our portfolio, the phases below sketch *when* — a notional sequence (timelines are notional, not concrete). 
+ +[Figure: ODD phases, notional timeline (timelines are notional, not concrete). Timeline labels: FY26.4, FY27.1, FY27.2, FY27.3, FY27.4. Phases shown: Demonstrate the data lake (varied datasets); Demonstrate the query engine + service integration; Demonstrate caching + AI use. Throughout: socialization of the plan · external-team integration · iterating on the plan as we incorporate varied datasets. Foundational libraries: Zarr · Icechunk · obstore (IO) · warp / resampling / projection performance · in-browser Zarr + COG · GeoZarr & standards.] + +**FY26.4–27.1 — Demonstrate the data lake.** Demonstrate the utility and performance of Icechunk stores as a data lake platform across varied data types (HLS, NISAR, GPM IMERG, NLDAS, TEMPO, …). VEDA instances demonstrate the data lake in action with scientists, and we migrate the data services component (starting with TiTiler-CMR) to the Data Services team so ODD can prototype other services. + +**FY27.1–27.2 — Demonstrate the query engine + service integration.** Show discovery and query across the lake via the query engine (DataFusion), and integrate it with the data services so a single interface serves discovery, query, and access. + +**FY27.3–27.4 — Demonstrate caching + AI use.** Demonstrate caching performance using multiscales held in the Icechunk store and cached as a *data cache* (cached Zarr arrays), not a per-service tiling cache. Work with the AI/ML teams to demonstrate use of the data lake by AI (e.g. Water Insight or EIE): LLMs discover, reason about, and ingest data from the lake. + +**Throughout — alongside every phase.** Socialization of the plan, integration of external teams, and iterating on the plan as we work to incorporate varied datasets. 
Plus continuation of foundational work in Zarr, Icechunk, and the underlying geospatial libraries: IO (obstore), warp / resampling / projection **performance**, reading and handling Zarr + COG directly in the browser, and geospatial data standards (GeoZarr). ## How we work +> "ODD should not be responsible for virtualizing everything! We (and our partners) are responsible for making it easy for NASA to virtualize things though." — Henry + ODD is a research and development team, not an operations or continued-maintenance team. Success for any item on this roadmap is *graduating off of it* — not staying on it indefinitely. ### Lifecycle -We anticipate work to move through four stages: **Later** (future, aspirational) → **Next** (developing) → **Now** (mature) → **Handed off** (owned by someone else). +Work moves through four stages: **Later** (future, aspirational) → **Next** (developing) → **Now** (mature) → **Handed off** (owned by someone else). An item is ready to hand off when it passes three tests: @@ -67,92 +112,106 @@ An item is ready to hand off when it passes three tests: 2. **Someone else owns it.** A named owner — a DAAC, a mission team, community maintainers — has accepted responsibility. 3. **We've stopped learning.** Our remaining contribution is maintenance, not discovery. -Using virtual data stores as an example: today we generate stores ourselves (learning). Next, +Virtual data stores are an example: today we generate stores ourselves (learning). Next, we will ship developer docs and optimization skills (enabling). Then store generation -graduates to data providers. While we will continue to work on underlying tooling, several -roadmap items — documentation, decision tooling, and ecosystem sustainability — are not just projects but -handoff methods. +graduates to data providers. Only the underlying tooling remains ours. 
Several +roadmap items — virtual store authoring docs, decision tooling, the optimization +skill/CLI, ecosystem sustainability (maintainer onboarding) — are not just projects but +handoff mechanisms. -The above steps and example are notional and not established through practice. +We don't yet have a reliable handoff process. Naming that honestly is the first step; building it is on the roadmap. ### Prioritization +At each planning cycle (PI), we ask two questions of the grid: + +- **What promotes?** Which Next items are ready to become Now? Which Later items are ready to become Next? +- **What graduates?** Which Now items pass the three handoff tests? + Objectives we take on must also balance "utopian" goals — like a unified Zarr model — with the necessity of supporting legacy patterns and other formats. When evaluating new candidate work, we apply these criteria: -- **Vision alignment:** Does it serve at least one vision story and satisfy all appropriate pillars? -- **Adoption readiness:** How quickly can the ecosystem absorb it? Building on familiar interfaces lowers the barrier (VirtualiZarr adopting xarray's data model made it immediately accessible); very new technology carries adoption lag as a risk. -- **Cost:** What does adoption cost — in compute, energy and onboarding (users and systems)? -- **Handoff path:** Can we state who would eventually own this? +- **Traceability.** Does it serve at least one vision story and one pillar? +- **Adoption readiness.** How quickly can the ecosystem absorb it? Building on familiar interfaces lowers the barrier (VirtualiZarr adopting xarray's data model made it immediately accessible); very new technology carries adoption lag as a risk (zarr-datafusion-search is powerful but the ecosystem may take years to take it on). +- **Cost.** What does adoption cost — in compute, energy, money, and user capability? Solutions that require cloud compute in a specific region, for example, exclude most users. 
+- **Handoff path.** Can we articulate who would eventually own this, even roughly? -## Roadmap Items in Detail +## Deeper context -Below, each technical component is briefly explained. +What each roadmap item unlocks, and what success looks like. ### Open standards & FAIR data -**◆ Array format stewardship:** The foundational format for cloud-native array data is Zarr. This component comprises ongoing maintenance and stewardship, including convening the community — e.g. Zarr Summit '26/27 — to unblock progress on technical features and convention adoption. +**◆ Array format stewardship.** The foundational format for cloud-native array data — Zarr. Ongoing maintenance and stewardship, including convening the community — e.g. Zarr Summit '26/27 — to unblock progress on technical features and convention adoption. -**◆ Geospatial conventions:** Zarr conventions for geospatial metadata (GeoZarr) are essential for native and virtual Zarr collections to interoperate across GIS, visualization, and analysis libraries. Success is trust and interoperability for Zarr data from all Earth data providers (NASA, NOAA, ESA), and a consistent platform to build client applications on. +**◆ Geospatial conventions.** Zarr conventions for geospatial metadata (GeoZarr), essential for native and virtual Zarr collections to interoperate across GIS, visualization, and analysis libraries. Closing in on submission of the GeoZarr standard to the OGC architecture board. Success: trust and interoperability for Zarr data from all Earth data providers (NASA, NOAA, ESA), and a consistent, non-ambiguous platform to build client applications on. -**Ecosystem sustainability:** Zarr will support growing, complex use cases through a sustainable maintainer ecosystem. That ecosystem includes the work detailed in the zarr-python roadmap plus maintainer onboarding. 
+**Ecosystem sustainability.** A sustainable maintainer ecosystem for Zarr to support growing, complex use cases — the zarr-python roadmap plus maintainer onboarding. Success: adoption of the roadmap by maintainers and stakeholders, plus one or two new onboarded maintainers making significant contributions — reducing stagnation and broadening design perspectives. -**Codec re-architecture:** The Zarr v2 -> v3 transition exposed design issues in the codec model. Re-architecting it supports new codec development (vital for virtualization, where archival formats use less-standardized codecs), alternative client implementations in Rust and TypeScript and fixing quirky data (CF codecs and concatenating arrays with varied codecs). +**Codec re-architecture.** The Zarr v2→v3 transition exposed design issues in the codec model. Re-architecting it supports new codec development (vital for virtualization, where archival formats use less-standardized codecs) and alternative client implementations in Rust and TypeScript. Follow-ons: *CF codecs* — capturing CF-convention decoding logic as codecs rather than attribute dictionaries, so clients interacting directly with the Zarr API don't need to duplicate xarray's specialized decoding logic; and *concatenated arrays* — supporting variable compression to unlock virtualization of quirky datasets like MUR SST (pre-design). -**Conventions + CRS utilities:** Utilities and guidance on CF and GeoZarr conventions will keep virtual store metadata aligned with tooling. This work will unblock tools that rely on those conventions from using compliant virtual stores. +**Convention + CRS utilities.** Utilities and guidance for keeping virtual store metadata aligned with CF and GeoZarr conventions. Unblocks tools that rely on those conventions from using compliant virtual stores. ### Performance, cost & scale -**Data virtualization:** Data virtualization enables access to archival data through the Zarr API without duplicating it. 
Work includes VirtualiZarr parser improvements (virtual-tiff, obspec-utils, async-hdf5, GRIB) and transitioning maintenance to partners. +**Data virtualization.** Access archival data through the Zarr API without duplicating it — VirtualiZarr. Includes parser improvements (virtual-tiff, obspec-utils, async-hdf5, GRIB) — or transitioning parser maintenance to partners, which is itself a handoff opportunity. This is also our current lever on storage cost (see *Storage cost optimization*). -**Object-store access:** Libraries such as obstore provide high-performance object storage access for the Python geospatial stack. +**Object-store access.** High-performance object storage access for the Python geospatial stack — obstore. -**Dynamic tiling:** Dynamic tiling enables visualization without maintaining static image pyramids. Future work includes supporting additional datasets and integrations, for example WMTS GetCapabilities so EGIS can surface HLS vegetation indices in ArcGIS. +**Dynamic tiling.** Tiling driven by CMR — TiTiler-CMR. Current work: regenerated compatibility report (with group support), OPERA integration into the disasters portal, a distributed cache for S3 credentials (~1s saved per cold-start request), and WMTS GetCapabilities so EGIS can surface HLS vegetation indices in ArcGIS. -**Lazy array analytics:** Instantly materialize massive lazy multi-dimensional arrays (time, band, x, y) from metadata stores (e.g. lazycogs and lazymerge). These libraries provide a scalable replacement for stackstac/odc-stac. +**Lazy array analytics.** Instantly materialize massive lazy 4-D arrays (time, band, x, y) from metadata stores — lazycogs, a scalable replacement for stackstac/odc-stac. Success: any collection stored as COGs can be analyzed through a collection-level xarray API. -**Variable chunking:** Variable chunk support in VirtualiZarr + xarray will unlock virtualizing more datasets. 
+**Variable chunking.** Variable chunk support in VirtualiZarr + xarray; unlocks virtualizing more datasets. Near-term delivery. -**Analytics-scale metadata:** EOSDIS has identified pressure on CMR as a significant risk. We are prototyping collection-level stores using GeoParquet/Iceberg and DataFusion to understand performance, cost, and scaling — and contribute to the relevant open-source libraries. +**Analytics-scale metadata.** EOSDIS has identified pressure on CMR as a significant risk. Prototype collection-level stores using GeoParquet/Iceberg and zarr-datafusion to understand performance, cost, and scaling — and contribute to the relevant open-source libraries. Includes STAC in Iceberg: an object-storage-only STAC catalog giving providers API-less metadata access. -**Storage model evaluation:** We will evaluate emerging storage models and their trade-offs, such as the [S3 Files synchronization system](https://aws.amazon.com/s3/features/files/). +**Storage model evaluation.** Understand emerging storage models and their trade-offs — currently the S3 Files synchronization model: compare performance to native S3 for common operations and understand its pricing. Potential to serve both durable shared storage and the low-latency block access that ML and massively parallel array workloads need. -**Resampling/warp tooling:** A composable, Rust-based resampling/warp library will reduce dependence on GDAL's monolithic toolchain. Such a library would be useful for server-side tiling, distributed array frameworks (Dask, Cubed), and WASM in-browser rendering. This idea is stil in the design and ecosystem assessment phase. +**Resampling/warp tooling.** A composable, Rust-based resampling/warp library reducing dependence on GDAL's monolithic toolchain. Usable from server-side tiling, distributed array frameworks (Dask, Cubed), and WASM in-browser rendering. Pre-design; builds on a full ecosystem assessment. 
-**Query at scale:** We are demonstrating query and access at scale through a single interface (zarr-datafusion-search). This library demonstrates a Zarr interface for Level 0/1 and swath data, and moves EOSDIS toward an Arrow-native ecosystem. +**Query at scale.** Query and access data at scale through a single interface — zarr-datafusion-search. Paves the way for Zarr as a storage target for Level 0/1 and swath data, and moves EOSDIS toward an Arrow-native ecosystem. High potential, but very new — adoption lag is the known risk. -**Storage cost optimization:** Data virtualization addresses the growing cost of data volumes in Earthdata Cloud by accessing archival data through the Zarr API without duplicating it. Future work includes applying other storage cost strategies as evaluated in the work item listed above. +**Storage cost optimization.** Addressing the growing cost of data volumes in Earthdata Cloud. We are not actively working on this beyond *data virtualization* (accessing archival data through the Zarr API without duplicating it). Avoiding duplication is the lever we pull today; broader storage cost strategies remain future work. ### Empowered users -**Cloud-native guidance:** The CNG guide unblocks people confused about which formats exist, why, and when to use each. +**Cloud-native guidance.** The CNG guide: unblock people confused about which formats exist, why, and when to use each. Success: people use the guide to build cloud-native datasets, or to explain to stakeholders why a dataset was built a given way. + +**Science support.** Direct support for science users, including cloud-optimized data usage guidance (e.g., xarray arguments) in the guide and datacube guide. -**Science support:** We continue to work with the dedicated science support team to provide cloud-optimized data guidance. 
+**Format evaluation.** Evaluate mission data formats and recommend improvements that enable optimized access patterns — currently NISAR: assess the NISAR HDF5 format and advise the Algorithm Development Team before the official release in summer 2026. Includes a virtualization + data fusion prototype showing a more user-friendly virtual representation. -**Format evaluation:** We continue to evaluate mission data formats and recommend improvements that enable optimized access patterns. +**In-browser rendering.** In-browser GPU rendering of COGs and Zarr via direct data access (deck.gl-raster + Lonboard) — users customize rendering without re-fetching data. Current work: demonstrations in documentation (band combinations, direct access), initial GeoZarr support in both libraries, and a TypeScript WKB→GeoArrow parser enabling DuckDB-Wasm integration. Current limitation: requires open data access. -**In-browser rendering:** We are developing in-browser GPU rendering of COGs and Zarr via direct data access (e.g. deck.gl-raster + Lonboard). Users customize rendering without re-fetching data. +**Virtual store authoring.** How to build virtual stores, with or without agents — developer docs. Unblocks DAACs and science teams as virtual store developers — a primary handoff mechanism. -**Virtual store documentation:** Virtual store documentation (how to build virtual stores, with or without agents) will unblock DAACs and science teams as virtual store developers. +**Cloud-optimized decision framework.** The cloud-optimized data decision tree: a diagram plus explanatory text with examples per branch, guiding format and chunking decisions. Foundation for AI-assisted optimization. -**Cloud-optimized decision framework:** The cloud-optimized data decision tree will guide format and chunking decisions. This will also serve as the foundation for AI-assisted optimization. 
+**Improved access & auth libraries.** Libraries that get data and credentials into users' hands — earthaccess v1, notably a modular approach with refreshable credentials in a lightweight earth-auth package; finish opening Icechunk stores via earthaccess. -**Improved access & auth libraries:** We provide development support to libraries that get data and credentials into users' hands (e.g. earthaccess). +**AI-assisted optimization (skills + tooling).** A CLI and agentic skill for data structure optimization, plus an agent that walks data providers through chunking and format decisions (CO data AI guidance) — usable across ESDS. Builds on the cloud-optimized decision framework, reducing engineering time to a balanced or optimized data structure. -**AI-assisted optimization (skills + tooling):** A CLI and agentic skill for data structure optimization will build on the cloud-optimized decision framework, reducing engineering time to a balanced or optimized data structure. +**Dataset + tooling coverage metrics.** Assess how many NASA datasets work with our tools (VirtualiZarr, datafusion, lazycogs) so we have metrics for improvement and impact. -**Dataset + tooling coverage metrics:** An assessment of how many NASA datasets work with our tools (VirtualiZarr, datafusion, lazycogs) will provide metrics for improvement and impact. +**AI/ML data-lake demonstrations.** Data access is shifting from web / Python / in-house systems toward AI agents, making AI a primary class of user. Work with the AI/ML teams to demonstrate use of the data lake by AI — e.g. in Water Insight or EIE — in the first two quarters of FY27 (27.1–27.2). Show that LLMs can discover, reason about, and ingest data from the lake: semantic discovery beyond STAC, query via DataFusion, and direct Zarr ingest. -**ESRI / ArcGIS integration:** A large share of NASA data users work in ArcGIS, so our tools and data need to integrate with ESRI systems. 
We need to ensure our cloud-native outputs are consumable through the open standards ESRI already supports (COG, WMTS, OGC APIs, GeoZarr). +**ESRI / ArcGIS integration.** A large share of NASA data users work in ArcGIS, so our tools and data need to integrate with ESRI systems rather than require users to leave them. Ensure our cloud-native outputs are consumable there through the open standards ESRI already supports (COG, WMTS, OGC APIs, GeoZarr) — the EGIS/ArcGIS WMTS work in *Dynamic tiling* is the first concrete instance. Meeting users where they are, not requiring new software. ### Trusted & reliable data -**◆ Transactional Zarr:** Checksum verification and ACID transactions for Zarr stores, via Icechunk, provides reliability. +**◆ Transactional Zarr.** Checksum verification and ACID transactions for Zarr stores (Icechunk) — the reliability layer. + +**Remote store access.** Bearer-token HTTP support unblocks NASA data users without cloud compute in us-west-2 from using virtual stores — PO.DAAC has identified this as the single blocker to rolling out their Icechunk stores. Also: parsing manifests back out of Icechunk (inspection and modification of virtual stores, plus risk mitigation) and prefix-changing utilities. + +**Live virtual stores.** Stores kept current as data lands — e.g. MUR SST as native Zarr, rechunked for time series, updated in near-real time as an AWS Public Dataset. Serves anyone doing historical or NRT sea surface temperature analysis, and demonstrates Icechunk's capabilities end to end. + +**Synchronized metadata + data.** Keep metadata in sync with data (via zarr-datafusion-search) — addressing the gap where metadata and data drift apart. -**Near-real time virtual stores:** We will keep stores current as data arrives. This work will serve anyone doing historical or NRT sea surface temperature analysis. 
+**Event-driven NRT updates.** Icechunk makes all store updates trackable by listening to changes in object storage keys, enabling simple event-driven pipelines: dynamically updated pyramids (e.g., for Worldview), summary statistics, pre-computed time series. The path to keeping virtual stores current with incoming data streams — and to the near-real-time vision story. -**Synchronized metadata + data:** Keep metadata in sync with data to ensure analyses are valid. +--- -**Event-driven NRT updates:** Stores such as Icechunk make all store updates trackable by listening to changes in object storage keys. Simple event-driven pipelines will enable dynamically updated pyramids (e.g., for Worldview), summary statistics, and pre-computed time series. This is the path to keeping virtual stores current with incoming data streams. +*Open questions for the team: verify the Earth Information Explorer claim in the gap table; align timelines with data services (when do they stop coggifying?) and front-end teams (will tile servers eventually go away?); define our first formal handoff.* \ No newline at end of file From 7617c6b318981f7344c099f8a19982e3d6fdfee8 Mon Sep 17 00:00:00 2001 From: Aimee Barciauskas Date: Sun, 28 Jun 2026 17:04:38 -0700 Subject: [PATCH 6/8] Add phases --- docs/roadmap.md | 154 ++++++++++++++++++++++-------------------------- 1 file changed, 69 insertions(+), 85 deletions(-) diff --git a/docs/roadmap.md b/docs/roadmap.md index ee08209..6148d0f 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -1,18 +1,15 @@ # ODD roadmap -This page exists to explain the motivations behind ODD's daily work. It connects what -we're building to why we're building it, and explains how work enters, moves through, -and may eventually leave our portfolio. The primary audience is the ODD team. -The secondary audience is peer ODSI teams who want to understand how our work fits the broader picture. +This page explains the motivations behind ODD's daily work. 
It connects what we're building to why we're building it. The primary audience is the ODD team. The secondary audience is peer ODSI teams who want to understand how our work fits the broader picture. -## Vision: who we serve +## Vision -Our vision is expressed as the experiences users will have when we've succeeded: +If we are successful, we imagine users will be able to: -1. **Ask in plain language and reproduce response.** As an Earth enthusiast, I want to ask questions like "how did the Gifford fire evolve?" and get an animated visual response — with links to the source code that produced the analysis, so I can verify and reproduce it. -2. **Explore in the browser.** As an Earth enthusiast, I want to visually explore forest disturbance through NISAR data directly in my browser, with no specialized software or cloud account. -3. **Research at scale.** As a fire event researcher, I want to evaluate relationships between variables from different data products across many thousands of fires, with minimal data pre-processing for fusion and modeling. -4. **Operate in near-real time.** As an operational **application**, I need products like HLS for disaster response or sea surface temperature for maritime operations available in near-real time. +1. **Ask questions in plain language and reproduce the response:** As an Earth enthusiast, I want to ask questions like "how did the Gifford fire evolve?" and get an animated visual. I also want links to the source code that produced the analysis, so I can verify and reproduce it. +2. **Explore in the browser:** As an Earth enthusiast, I want to visually explore forest disturbance through NISAR data directly in my browser, with no specialized software or cloud account. +3. **Research at scale:** As a fire event researcher, I want to evaluate relationships between variables from different data products across many thousands of fires, with minimal data pre-processing for fusion and modeling. +4. 
**Operate in near-real time:** As an operational application, I need products like HLS for disaster response, or sea surface temperature for maritime operations, available in near-real time. ## The gap @@ -22,37 +19,39 @@ NASA already serves these users — but current services have limits that grow m | ------------------------- | -------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | | Ask in plain language | Earth Information Explorer | Limited dataset access; datasets must be curated into the system | | Explore in the browser | Worldview / GIBS | Not configurable by users; pre-rendered layers don't scale to new datasets or rendering needs | -| Research at scale | Earthdata Cloud, Harmony, cloud-hosted JupyterHubs | Harmony offloads processing to servers — heavy compute cost rather than a structural fix; users struggle to find the best datasets for their needs | +| Research at scale | Earthdata Cloud, Harmony, cloud-hosted JupyterHubs | Harmony offloads processing to servers, requiring heavy compute cost rather than a structural fix; users struggle to find the best datasets for their needs | | Operate in near-real time | LANCE + HLS | Hard to keep metadata and data in sync; no reliable notification system for new data landing in Earthdata Cloud buckets | -| All of the above | CMR | Under increasing pressure from rapid archive growth and analytics-scale query traffic | - -Across all of these: discovery is hard, and current systems are becoming unsustainable as data volumes grow. +| Data discovery | CMR | Under increasing pressure from rapid archive growth and analytics-scale query traffic | ## Our pillars We address these gaps through four pillars: -1. **Open standards & FAIR data.** NASA data and services are findable, accessible, interoperable, and reusable, built on community standards rather than bespoke systems. -2. 
**Performance, cost & scale.** Optimize performance while minimizing cost, with solutions that scale sustainably to new and growing data volumes. -3. **Empowered users.** Users — both data providers and data consumers — can use and apply the solutions we build without us. -4. **Trusted & reliable data.** The data products NASA generates are verifiable, consistent, and kept in sync with their metadata. +1. **Open standards & FAIR data:** NASA data and services are findable, accessible, interoperable, and reusable, built on community standards rather than bespoke systems. +2. **Performance, cost & scale:** Optimize performance while minimizing cost, with solutions that scale sustainably to new and growing data volumes. +3. **Empowered users:** Users — both data providers and data consumers — can use and apply the solutions we build without us. +4. **Trusted & reliable data:** The data products NASA generates are verifiable, consistent, and kept in sync with their metadata. -**Cross-cutting foundation: community developed + adopted.** Every item on this roadmap is built in the open, with and for the community. Open source is the license; community development and adoption is the practice — it's how solutions outlive our involvement, and it underpins all four pillars. +Further, we maintain high standards for the software we develop or reuse, while never intending to duplicate effort. All software we develop or use should be of high quality, under an open source license, and developed and adopted by a broad community. 
## Roadmap -| Pillar | Now · mature | Next · developing | Later · future | -| ------------------------------ | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | -| **Open standards & FAIR data** | ◆ Array format (Zarr) stewardship · ◆ Geospatial conventions (GeoZarr) | Ecosystem sustainability · Codec re-architecture · variable chunking | Convention + CRS utilities | -| **Performance, cost & scale** | Data virtualization · Object-store access · Dynamic tiling · In-browser rendering | Virtual stores + lazy array analytics · Analytics-scale metadata · Storage model evaluation | Resampling/warp tooling · Query at scale · Storage cost optimization | -| **Empowered users** | Cloud-native guidance · Science support · Format evaluation | In-browser rendering · Cloud-optimized decision framework · Improved access & auth libraries · Dataset + tooling coverage metrics · AI/ML data-lake demonstrations | AI-assisted optimization (skills + tooling) · ESRI / ArcGIS integration | -| **Trusted & reliable data** | ◆ Transactional Zarr (Icechunk) | Remote store access · Live virtual stores · Synchronized metadata + data | Event-driven (object store notifications) for near-real time (NRT) updates | +The table below lists technologies and technical components that this team is contributing to or plans to contribute to. We believe these components will advance the vision and pillars described above. + +Below, the **[Roadmap Items in Detail](#roadmap-items-in-detail)** section provides a brief description of each roadmap item. -**◆ Foundational** — a category of work that is ongoing. +* **Now · mature** means this is a mature technology. We are still working on it, but it is ready for adoption. 
+* **Next · developing** means this is a developing technology. We are currently working on it so it will be ready for adoption. Timeline for maturity and adoption readiness varies. +* **Later · future** means this is a technology we are not actively developing. We would like to work on it but other technologies in active development take precedence. -**Handed off:** nothing yet — see [How we work](#how-we-work). Building a working handoff path is a goal itself. +The **◆** designation represents a category of ongoing work. -Every objective of this team should trace to at least one vision story and one pillar. Each item name links to deeper context below. +| Pillar | Now · mature | Next · developing| Later · future | +| ------------------------------ | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | +| **Open standards & FAIR data** | ◆ Array format (Zarr) stewardship · ◆ Geospatial conventions (GeoZarr) | Zarr Ecosystem sustainability · Codec re-architecture · variable chunking | Conventions + CRS utilities | +| **Performance, cost & scale** | Data virtualization · Object-store access · Dynamic tiling · In-browser rendering | Virtual stores + lazy array analytics · Analytics-scale metadata · Storage model evaluation | Resampling/warp tooling · Query at scale · Storage cost optimization · Caching | +| **Empowered users** | Cloud-native guidance · Science support · Format evaluation | In-browser rendering · Cloud-optimized decision framework · Improved access & auth libraries · Dataset + tooling coverage metrics | AI-assisted optimization (skills + tooling) · ESRI / ArcGIS integration | +| **Trusted & reliable data** | ◆ Transactional Zarr (Icechunk) | Virtual stores for ongoing datasets · Synchronized metadata + data | 
Event-driven (object store notifications) for near-real time (NRT) updates | ## Phases @@ -96,15 +95,14 @@ While the grid above tracks *what* moves through our portfolio, the phases below **Throughout — alongside every phase.** Socialization of the plan, integration of external teams, and iterating on the plan as we work to incorporate varied datasets. Plus continuation of foundational work in Zarr, Icechunk, and the underlying geospatial libraries: IO (obstore), warp / resampling / projection **performance**, reading and handling Zarr + COG directly in the browser, and geospatial data standards (GeoZarr). -## How we work -> "ODD should not be responsible for virtualizing everything! We (and our partners) are responsible for making it easy for NASA to virtualize things though." — Henry +## How we work ODD is a research and development team, not an operations or continued-maintenance team. Success for any item on this roadmap is *graduating off of it* — not staying on it indefinitely. ### Lifecycle -Work moves through four stages: **Later** (future, aspirational) → **Next** (developing) → **Now** (mature) → **Handed off** (owned by someone else). +We anticipate work to move through four stages: **Later** (future, aspirational) → **Next** (developing) → **Now** (mature) → **Handed off** (owned by someone else). An item is ready to hand off when it passes three tests: @@ -112,106 +110,92 @@ An item is ready to hand off when it passes three tests: 2. **Someone else owns it.** A named owner — a DAAC, a mission team, community maintainers — has accepted responsibility. 3. **We've stopped learning.** Our remaining contribution is maintenance, not discovery. -Virtual data stores are an example: today we generate stores ourselves (learning). Next, +Using virtual data stores as an example: today we generate stores ourselves (learning). Next, we will ship developer docs and optimization skills (enabling). Then store generation -graduates to data providers. 
Only the underlying tooling remains ours. Several -roadmap items — virtual store authoring docs, decision tooling, the optimization -skill/CLI, ecosystem sustainability (maintainer onboarding) — are not just projects but -handoff mechanisms. +graduates to data providers. While we will continue to work on underlying tooling, several +roadmap items — documentation, decision tooling, and ecosystem sustainability — are not just projects but +handoff methods. -We don't yet have a reliable handoff process. Naming that honestly is the first step; building it is on the roadmap. +The above steps and example are notional and not established through practice. ### Prioritization -At each planning cycle (PI), we ask two questions of the grid: - -- **What promotes?** Which Next items are ready to become Now? Which Later items are ready to become Next? -- **What graduates?** Which Now items pass the three handoff tests? - Objectives we take on must also balance "utopian" goals — like a unified Zarr model — with the necessity of supporting legacy patterns and other formats. When evaluating new candidate work, we apply these criteria: -- **Traceability.** Does it serve at least one vision story and one pillar? -- **Adoption readiness.** How quickly can the ecosystem absorb it? Building on familiar interfaces lowers the barrier (VirtualiZarr adopting xarray's data model made it immediately accessible); very new technology carries adoption lag as a risk (zarr-datafusion-search is powerful but the ecosystem may take years to take it on). -- **Cost.** What does adoption cost — in compute, energy, money, and user capability? Solutions that require cloud compute in a specific region, for example, exclude most users. -- **Handoff path.** Can we articulate who would eventually own this, even roughly? +- **Vision alignment:** Does it serve at least one vision story and satisfy all appropriate pillars? +- **Adoption readiness:** How quickly can the ecosystem absorb it? 
Building on familiar interfaces lowers the barrier (VirtualiZarr adopting xarray's data model made it immediately accessible); very new technology carries adoption lag as a risk. +- **Cost:** What does adoption cost — in compute, energy and onboarding (users and systems)? +- **Handoff path:** Can we state who would eventually own this? -## Deeper context +## Roadmap Items in Detail -What each roadmap item unlocks, and what success looks like. +Below, each technical component is briefly explained. ### Open standards & FAIR data -**◆ Array format stewardship.** The foundational format for cloud-native array data — Zarr. Ongoing maintenance and stewardship, including convening the community — e.g. Zarr Summit '26/27 — to unblock progress on technical features and convention adoption. +**◆ Array format stewardship:** The foundational format for cloud-native array data is Zarr. This component comprises ongoing maintenance and stewardship, including convening the community — e.g. Zarr Summit '26/27 — to unblock progress on technical features and convention adoption. -**◆ Geospatial conventions.** Zarr conventions for geospatial metadata (GeoZarr), essential for native and virtual Zarr collections to interoperate across GIS, visualization, and analysis libraries. Closing in on submission of the GeoZarr standard to the OGC architecture board. Success: trust and interoperability for Zarr data from all Earth data providers (NASA, NOAA, ESA), and a consistent, non-ambiguous platform to build client applications on. +**◆ Geospatial conventions:** Zarr conventions for geospatial metadata (GeoZarr) are essential for native and virtual Zarr collections to interoperate across GIS, visualization, and analysis libraries. Success is trust and interoperability for Zarr data from all Earth data providers (NASA, NOAA, ESA), and a consistent platform to build client applications on. 
-**Ecosystem sustainability.** A sustainable maintainer ecosystem for Zarr to support growing, complex use cases — the zarr-python roadmap plus maintainer onboarding. Success: adoption of the roadmap by maintainers and stakeholders, plus one or two new onboarded maintainers making significant contributions — reducing stagnation and broadening design perspectives. +**Ecosystem sustainability:** Zarr will support growing, complex use cases through a sustainable maintainer ecosystem. That ecosystem includes the work detailed in the zarr-python roadmap plus maintainer onboarding. -**Codec re-architecture.** The Zarr v2→v3 transition exposed design issues in the codec model. Re-architecting it supports new codec development (vital for virtualization, where archival formats use less-standardized codecs) and alternative client implementations in Rust and TypeScript. Follow-ons: *CF codecs* — capturing CF-convention decoding logic as codecs rather than attribute dictionaries, so clients interacting directly with the Zarr API don't need to duplicate xarray's specialized decoding logic; and *concatenated arrays* — supporting variable compression to unlock virtualization of quirky datasets like MUR SST (pre-design). +**Codec re-architecture:** The Zarr v2 -> v3 transition exposed design issues in the codec model. Re-architecting it supports new codec development (vital for virtualization, where archival formats use less-standardized codecs), alternative client implementations in Rust and TypeScript and fixing quirky data (CF codecs and concatenating arrays with varied codecs). -**Convention + CRS utilities.** Utilities and guidance for keeping virtual store metadata aligned with CF and GeoZarr conventions. Unblocks tools that rely on those conventions from using compliant virtual stores. +**Conventions + CRS utilities:** Utilities and guidance on CF and GeoZarr conventions will keep virtual store metadata aligned with tooling. 
This work will unblock tools that rely on those conventions from using compliant virtual stores. ### Performance, cost & scale -**Data virtualization.** Access archival data through the Zarr API without duplicating it — VirtualiZarr. Includes parser improvements (virtual-tiff, obspec-utils, async-hdf5, GRIB) — or transitioning parser maintenance to partners, which is itself a handoff opportunity. This is also our current lever on storage cost (see *Storage cost optimization*). +**Data virtualization:** Data virtualization enables access to archival data through the Zarr API without duplicating it. Work includes VirtualiZarr parser improvements (virtual-tiff, obspec-utils, async-hdf5, GRIB) and transitioning maintenance to partners. -**Object-store access.** High-performance object storage access for the Python geospatial stack — obstore. +**Object-store access:** Libraries such as obstore provide high-performance object storage access for the Python geospatial stack. -**Dynamic tiling.** Tiling driven by CMR — TiTiler-CMR. Current work: regenerated compatibility report (with group support), OPERA integration into the disasters portal, a distributed cache for S3 credentials (~1s saved per cold-start request), and WMTS GetCapabilities so EGIS can surface HLS vegetation indices in ArcGIS. +**Dynamic tiling:** Dynamic tiling enables visualization without maintaining static image pyramids. Future work includes supporting additional datasets and integrations, for example WMTS GetCapabilities so EGIS can surface HLS vegetation indices in ArcGIS. -**Lazy array analytics.** Instantly materialize massive lazy 4-D arrays (time, band, x, y) from metadata stores — lazycogs, a scalable replacement for stackstac/odc-stac. Success: any collection stored as COGs can be analyzed through a collection-level xarray API. +**Lazy array analytics:** Instantly materialize massive lazy multi-dimensional arrays (time, band, x, y) from metadata stores (e.g. lazycogs and lazymerge). 
These libraries provide a scalable replacement for stackstac/odc-stac. -**Variable chunking.** Variable chunk support in VirtualiZarr + xarray; unlocks virtualizing more datasets. Near-term delivery. +**Variable chunking:** Variable chunk support in VirtualiZarr + xarray will unlock virtualizing more datasets. -**Analytics-scale metadata.** EOSDIS has identified pressure on CMR as a significant risk. Prototype collection-level stores using GeoParquet/Iceberg and zarr-datafusion to understand performance, cost, and scaling — and contribute to the relevant open-source libraries. Includes STAC in Iceberg: an object-storage-only STAC catalog giving providers API-less metadata access. +**Analytics-scale metadata:** EOSDIS has identified pressure on CMR as a significant risk. We are prototyping collection-level stores using GeoParquet/Iceberg and DataFusion to understand performance, cost, and scaling — and contribute to the relevant open-source libraries. -**Storage model evaluation.** Understand emerging storage models and their trade-offs — currently the S3 Files synchronization model: compare performance to native S3 for common operations and understand its pricing. Potential to serve both durable shared storage and the low-latency block access that ML and massively parallel array workloads need. +**Storage model evaluation:** We will evaluate emerging storage models and their trade-offs, such as the [S3 Files synchronization system](https://aws.amazon.com/s3/features/files/). -**Resampling/warp tooling.** A composable, Rust-based resampling/warp library reducing dependence on GDAL's monolithic toolchain. Usable from server-side tiling, distributed array frameworks (Dask, Cubed), and WASM in-browser rendering. Pre-design; builds on a full ecosystem assessment. +**Resampling/warp tooling:** A composable, Rust-based resampling/warp library will reduce dependence on GDAL's monolithic toolchain. 
Such a library would be useful for server-side tiling, distributed array frameworks (Dask, Cubed), and WASM in-browser rendering. This idea is still in the design and ecosystem assessment phase. -**Query at scale.** Query and access data at scale through a single interface — zarr-datafusion-search. Paves the way for Zarr as a storage target for Level 0/1 and swath data, and moves EOSDIS toward an Arrow-native ecosystem. High potential, but very new — adoption lag is the known risk. +**Query at scale:** We are demonstrating query and access at scale through a single interface (zarr-datafusion-search). This library demonstrates a Zarr interface for Level 0/1 and swath data, and moves EOSDIS toward an Arrow-native ecosystem. -**Storage cost optimization.** Addressing the growing cost of data volumes in Earthdata Cloud. We are not actively working on this beyond *data virtualization* (accessing archival data through the Zarr API without duplicating it). Avoiding duplication is the lever we pull today; broader storage cost strategies remain future work. +**Storage cost optimization:** Data virtualization addresses the growing cost of data volumes in Earthdata Cloud by accessing archival data through the Zarr API without duplicating it. Future work includes applying other storage cost strategies as evaluated in the work item listed above. ### Empowered users -**Cloud-native guidance.** The CNG guide: unblock people confused about which formats exist, why, and when to use each. Success: people use the guide to build cloud-native datasets, or to explain to stakeholders why a dataset was built a given way. - -**Science support.** Direct support for science users, including cloud-optimized data usage guidance (e.g., xarray arguments) in the guide and datacube guide. +**Cloud-native guidance:** The CNG guide unblocks people confused about which formats exist, why, and when to use each.
-**Format evaluation.** Evaluate mission data formats and recommend improvements that enable optimized access patterns — currently NISAR: assess the NISAR HDF5 format and advise the Algorithm Development Team before the official release in summer 2026. Includes a virtualization + data fusion prototype showing a more user-friendly virtual representation. +**Science support:** We continue to work with the dedicated science support team to provide cloud-optimized data guidance. -**In-browser rendering.** In-browser GPU rendering of COGs and Zarr via direct data access (deck.gl-raster + Lonboard) — users customize rendering without re-fetching data. Current work: demonstrations in documentation (band combinations, direct access), initial GeoZarr support in both libraries, and a TypeScript WKB→GeoArrow parser enabling DuckDB-Wasm integration. Current limitation: requires open data access. +**Format evaluation:** We continue to evaluate mission data formats and recommend improvements that enable optimized access patterns. -**Virtual store authoring.** How to build virtual stores, with or without agents — developer docs. Unblocks DAACs and science teams as virtual store developers — a primary handoff mechanism. +**In-browser rendering:** We are developing in-browser GPU rendering of COGs and Zarr via direct data access (e.g. deck.gl-raster + Lonboard). Users customize rendering without re-fetching data. -**Cloud-optimized decision framework.** The cloud-optimized data decision tree: a diagram plus explanatory text with examples per branch, guiding format and chunking decisions. Foundation for AI-assisted optimization. +**Virtual store documentation:** Virtual store documentation (how to build virtual stores, with or without agents) will unblock DAACs and science teams as virtual store developers. 
-**Improved access & auth libraries.** Libraries that get data and credentials into users' hands — earthaccess v1, notably a modular approach with refreshable credentials in a lightweight earth-auth package; finish opening Icechunk stores via earthaccess. +**Cloud-optimized decision framework:** The cloud-optimized data decision tree will guide format and chunking decisions. This will also serve as the foundation for AI-assisted optimization. -**AI-assisted optimization (skills + tooling).** A CLI and agentic skill for data structure optimization, plus an agent that walks data providers through chunking and format decisions (CO data AI guidance) — usable across ESDS. Builds on the cloud-optimized decision framework, reducing engineering time to a balanced or optimized data structure. +**Improved access & auth libraries:** We provide development support to libraries that get data and credentials into users' hands (e.g. earthaccess). -**Dataset + tooling coverage metrics.** Assess how many NASA datasets work with our tools (VirtualiZarr, datafusion, lazycogs) so we have metrics for improvement and impact. +**AI-assisted optimization (skills + tooling):** A CLI and agentic skill for data structure optimization will build on the cloud-optimized decision framework, reducing engineering time to a balanced or optimized data structure. -**AI/ML data-lake demonstrations.** Data access is shifting from web / Python / in-house systems toward AI agents, making AI a primary class of user. Work with the AI/ML teams to demonstrate use of the data lake by AI — e.g. in Water Insight or EIE — in the first two quarters of FY27 (27.1–27.2). Show that LLMs can discover, reason about, and ingest data from the lake: semantic discovery beyond STAC, query via DataFusion, and direct Zarr ingest. +**Dataset + tooling coverage metrics:** An assessment of how many NASA datasets work with our tools (VirtualiZarr, datafusion, lazycogs) will provide metrics for improvement and impact. 
-**ESRI / ArcGIS integration.** A large share of NASA data users work in ArcGIS, so our tools and data need to integrate with ESRI systems rather than require users to leave them. Ensure our cloud-native outputs are consumable there through the open standards ESRI already supports (COG, WMTS, OGC APIs, GeoZarr) — the EGIS/ArcGIS WMTS work in *Dynamic tiling* is the first concrete instance. Meeting users where they are, not requiring new software. +**ESRI / ArcGIS integration:** A large share of NASA data users work in ArcGIS, so our tools and data need to integrate with ESRI systems. We need to ensure our cloud-native outputs are consumable through the open standards ESRI already supports (COG, WMTS, OGC APIs, GeoZarr). ### Trusted & reliable data -**◆ Transactional Zarr.** Checksum verification and ACID transactions for Zarr stores (Icechunk) — the reliability layer. - -**Remote store access.** Bearer-token HTTP support unblocks NASA data users without cloud compute in us-west-2 from using virtual stores — PO.DAAC has identified this as the single blocker to rolling out their Icechunk stores. Also: parsing manifests back out of Icechunk (inspection and modification of virtual stores, plus risk mitigation) and prefix-changing utilities. - -**Live virtual stores.** Stores kept current as data lands — e.g. MUR SST as native Zarr, rechunked for time series, updated in near-real time as an AWS Public Dataset. Serves anyone doing historical or NRT sea surface temperature analysis, and demonstrates Icechunk's capabilities end to end. - -**Synchronized metadata + data.** Keep metadata in sync with data (via zarr-datafusion-search) — addressing the gap where metadata and data drift apart. +**◆ Transactional Zarr:** Checksum verification and ACID transactions for Zarr stores, via Icechunk, provide reliability.
-**Event-driven NRT updates.** Icechunk makes all store updates trackable by listening to changes in object storage keys, enabling simple event-driven pipelines: dynamically updated pyramids (e.g., for Worldview), summary statistics, pre-computed time series. The path to keeping virtual stores current with incoming data streams — and to the near-real-time vision story. +**Near-real time virtual stores:** We will keep stores current as data arrives. This work will serve anyone doing historical or NRT sea surface temperature analysis. ---- +**Synchronized metadata + data:** Keep metadata in sync with data to ensure analyses are valid. -*Open questions for the team: verify the Earth Information Explorer claim in the gap table; align timelines with data services (when do they stop coggifying?) and front-end teams (will tile servers eventually go away?); define our first formal handoff.* \ No newline at end of file +**Event-driven NRT updates:** Stores such as Icechunk make all store updates trackable by listening to changes in object storage keys. Simple event-driven pipelines will enable dynamically updated pyramids (e.g., for Worldview), summary statistics, and pre-computed time series. This is the path to keeping virtual stores current with incoming data streams. 
From b75d5bb45d4da156a2e43a00c5ecad88a9bf35ec Mon Sep 17 00:00:00 2001 From: Aimee Barciauskas Date: Sun, 28 Jun 2026 17:06:31 -0700 Subject: [PATCH 7/8] Build pages --- .github/workflows/deploy.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml index 689fb9f..0ecb8e2 100644 --- a/.github/workflows/deploy.yml +++ b/.github/workflows/deploy.yml @@ -3,7 +3,7 @@ name: Publish docs via GitHub Pages on: push: branches: - - main + - add-dse-architecture-vision-deck pull_request: branches: - main From f60eb550a21129c518d604d79b805ddd700f43f3 Mon Sep 17 00:00:00 2001 From: Aimee Barciauskas Date: Sun, 28 Jun 2026 21:38:14 -0700 Subject: [PATCH 8/8] minor updates --- docs/dse-architecture-vision/index.html | 32 +++++++++++++------------ docs/roadmap.md | 8 +++---- 2 files changed, 21 insertions(+), 19 deletions(-) diff --git a/docs/dse-architecture-vision/index.html b/docs/dse-architecture-vision/index.html index 2de71d6..bbbf8e1 100644 --- a/docs/dse-architecture-vision/index.html +++ b/docs/dse-architecture-vision/index.html @@ -36,8 +36,9 @@ /* panels with check/cross */ .panel{ border:2.5px solid var(--c); border-radius:14px; background:#fff; overflow:hidden; } -.panel .ph{ background:var(--c); color:#fff; padding:7px 12px; } +.panel .ph{ background:var(--c); color:#fff; padding:5px 12px; line-height:1.2; } .panel .ph b{ font-size:13.5px; } .panel .ph span{ font-size:10.5px; font-style:italic; opacity:.92; display:block; } +.panel .psub{ font-size:10.5px; font-style:italic; color:#475569; margin:7px 12px 0; line-height:1.3; } .panel ul{ margin:8px 12px; padding:0; list-style:none; } .panel li{ font-size:10.5px; margin:5px 0; padding-left:20px; position:relative; line-height:1.28; } .panel.good li:before{ content:"✓"; position:absolute; left:0; color:var(--c); font-weight:700; } @@ -170,23 +171,24 @@

    Benefits

    -

    The Data Lake supports direct, in-browser access

    -

    Put the effort into good data and metadata and the data does the work, not servers and not users.

    -
    Our job: instruct providers
    -
    Instruct NASA data providers on good metadata and cloud-friendly data delivery into the data lake, so users and services can get data out in an optimal way.
    -
    Optimal = direct in-browser
    -
    The optimal path is direct, in-browser access: users explore the data in the browser: no files to download, no libraries to install.
    +

    Put effort into data and metadata and the data does the work, not servers or users.

    +
    +
    Optimal access is direct in-browser
    +
    The optimal path is direct, in-browser access: users explore the data in the browser: no files to download, no libraries to install.
    Why it's optimal
    -
    With well-structured data, consistent metadata, and caching, in-browser access is fast and cost-effective: versus server-side processing (NASA must build and maintain services that do all the work) or data egress (users do all the work and pay to move data).
    +
    With well-structured data, consistent metadata, and caching, in-browser access is fast and cost-effective compared with server-side processing (NASA must build and maintain services that do all the work) or data egress (users do all the work and NASA pays to move data).
    +
    Instruct providers
    +
    Instruct NASA data providers on good metadata and cloud-friendly data delivery into the data lake, so users and services can get data out in + an optimal way.
    -

    The Data Lake compliments existing systems

    +

    The data lake complements existing systems

      -
    • The data lake will compliment existing systems, not replace them.
    • -
    • Traditional access methods currently in use will keep working.
    • -
    • As datasets get integrated with Icechunk / Iceberg, users start seeing the benefits of more efficient and more powerful access.
    • +
    • The data lake will complement existing systems, not replace them.
    • +
    • Traditional access methods currently in use will keep working.
    • +
    • As datasets get integrated with Icechunk / Iceberg, users start seeing the benefits of more efficient and more powerful access.
    @@ -202,21 +204,21 @@

    NASA ESDIS Architecture Cloud-Native Data Evolution

    Roles & Responsibilities

    Clear roles, responsibilities and well-defined interfaces will be required to transition from data silos and disparate systems to a shared cloud-native data lake.

    -
    Data Producer Teamsall NASA-funded data production (mission, science investigator & DAAC)
      +
      Data Producer Teams
      all NASA-funded data production (mission, science investigator & DAAC)
      • Who? All NASA-funded teams who produce all levels of products; mission science teams to project-funded value-added products
      • Mission data processing is a complementary system which will feed into the data lake
      • Each mission's Data Management Plan (DMP) should require a detailed plan for adhering to data lake conventions, such as CF conventions, GeoZarr and object-store-optimized chunking.
      • DMPs should require a detailed plan for Icechunk or Iceberg delivery
      publish →
      -
      Data System TeamMaintains the data lake contract and infrastructure.
        +
        Data System Team
        Maintains the data lake contract and infrastructure.
        • Develop and maintain the standards for the data lake.
        • Develop and maintain interfaces for data producers to submit products to the data lake.
        • Maintain, monitor and secure the data storage and query engine infrastructure. Ensure durability and reliability. Validate incoming data.
        • Maintain supporting libraries for data integration.
        → consume
        -
        Data Services Teamsdata lake consumers
          +
          Data Services Teams
          data lake consumers
          • E.g. Subsetting & reformatting (Harmony); on-demand products (SlideRule)
          • Analytics & visualization services (TiTiler-CMR, Worldview, VEDA)
          • Build once on shared data models to support extensibility to many datasets.
          • diff --git a/docs/roadmap.md b/docs/roadmap.md index 6148d0f..f224254 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -87,13 +87,13 @@ While the grid above tracks *what* moves through our portfolio, the phases below Foundational libraries: Zarr · Icechunk · obstore (IO) · warp / resampling / projection performance · in-browser Zarr + COG · GeoZarr & standards -**FY26.4–27.1 — Demonstrate the data lake.** Demonstrate the utility and performance of Icechunk stores as a data lake platform across varied data types (HLS, NISAR, GPM IMERG, NLDAS, TEMPO, …). VEDA instances demonstrate the data lake in action with scientists, and we migrate the data services component (starting with TiTiler-CMR) to the Data Services team so ODD can prototype other services. +**FY26.4–27.1 — Demonstrate the data lake.** Demonstrate the utility and performance of Icechunk stores as a data lake platform across varied data types (HLS, NISAR, GPM IMERG, NLDAS, TEMPO, ...). Leverage VEDA instances to demonstrate the value of the data lake through services, and direct access the value to scientists. Simultaneously, we will migrate the data services components, specifically TiTiler-CMR, to the Data Services team. -**FY27.1–27.2 — Demonstrate the query engine + service integration.** Show discovery and query across the lake via the query engine (DataFusion), and integrate it with the data services so a single interface serves discovery, query, and access. +**FY27.1–27.2 — Demonstrate the query engine + service integration.** Showcase integrated discovery, query and access via the query engine. Integrate the query engine with data services so a single interface serves discovery, query, and access. -**FY27.3–27.4 — Demonstrate caching + AI use.** Demonstrate caching performance using multiscales held in the Icechunk store and cached as a *data cache* (cached Zarr arrays), not a per-service tiling cache. Work with the AI/ML teams to demonstrate use of the data lake by AI (e.g. 
Water Insight or EIE): LLMs discover, reason about, and ingest data from the lake. +**FY27.3–27.4 — Demonstrate caching + AI use.** Demonstrate performance using multiscales and a *data cache* (i.e. a distributed in-memory store). Work with the AI/ML teams to demonstrate use of the data lake by AI (e.g. Water Insight or EIE); LLMs discover, reason about, and ingest data from the lake. -**Throughout — alongside every phase.** Socialization of the plan, integration of external teams, and iterating on the plan as we work to incorporate varied datasets. Plus continuation of foundational work in Zarr, Icechunk, and the underlying geospatial libraries: IO (obstore), warp / resampling / projection **performance**, reading and handling Zarr + COG directly in the browser, and geospatial data standards (GeoZarr). +**Throughout — alongside every phase.** Socialize the vision with other teams and incorporate feedback. Iterate on the plan as we work to incorporate varied datasets. Continue foundational work in Zarr, Icechunk, and other underlying geospatial libraries: IO (obstore), warp / resampling / projection performance, reading and handling Zarr + COG directly in the browser, and geospatial data standards (GeoZarr). ## How we work