diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml index 689fb9f..0ecb8e2 100644 --- a/.github/workflows/deploy.yml +++ b/.github/workflows/deploy.yml @@ -3,7 +3,7 @@ name: Publish docs via GitHub Pages on: push: branches: - - main + - add-dse-architecture-vision-deck pull_request: branches: - main diff --git a/docs/dse-architecture-vision/diagrams_svg/07_reference.svg b/docs/dse-architecture-vision/diagrams_svg/07_reference.svg new file mode 100644 index 0000000..bca481b --- /dev/null +++ b/docs/dse-architecture-vision/diagrams_svg/07_reference.svg @@ -0,0 +1 @@ +Reference Architecture — Tools & FlowsConcrete mapping of the vision onto NASA tools. Bold = implementation / tool · small = logical component.NASA Data System Evolution (DSE)Architecture Visionexistsin developmentnew data lakeData Product GenerationEarthdata Cloud Data LakeDiscoverabilityServicesComplementary system — not the Icechunk data lake (ODD's purview).HySDSdata production systemMAAP = one exampleplatform on HySDSSTACproduct metadata (at publication)Staging bucketobject storage (staging)Cumulusingest / archiveEarthdata HarmonyGIBS image-pyramid generationObject storageGIBS tile pyramids — NOT the data lakeuse only via the GIBS servicemetadatadata productsmoves fromtriggers GIBS generationwrites pyramidsProducer roles & processing levelsAll NASA-funded data production — products span Level 0 to Level 4L0L1L2L3L4Mission-funded · standard product suiteofficial suite — often L0–L3, sometimes L4Project-funded · value-added productsderived — L2–L4Boundary set by each mission's Data Management Plan — going forward,DMPs should also require:CF metadata conventionsObject-store-optimized chunk structureIcechunk / Iceberg storage deliveryCMR (RDS)collection & granule metadata catalogFile-based data lake · object store on S3Archival filesHDF5 / netCDF (cloud-optimized)Apache Icebergtabular table storeIcechunk / Zarrn-D array store (virtual or native)TBD data pipelineevent-driven 
processingpublishes tomoves toS3 notificationappends rows / arraysregisters inCMR APIsearch for collectionsSTAC APIrepresent data to users (downstream view)— not how metadata is stored at restApache DataFusionquery engine:query & filter collection viewsZarrarray interfaceservespublishes viewstable providersaccessesarray access viaCustom Product GenerationSlideRuleusers select from existing registered algorithmspython client | HTTP API | web appMAAP (on HySDS)authorized users use MAAP data processing systemcloud-hosted notebooks | DPS APIAccess / Subsetting / ReformattingEarthdata Harmonylarge async requests (servers)python client | HTTP APIZarrsmall synchronous requests (serverless)python clientSQLsmall synchronous requests (serverless)VisualizationGIBSfast, pre-generated imagery (servers)TiTilerdynamic (user-driven) imagery (servers)deck.gldata directly in-browser(serverless, authentication barrier)search for collectionsquery + filteraccess viaDirect AccessDirect S3 / Zarr accessread Icechunk / Zarr arrays directly over S3 — no service layerZarr protocol (object_store · zarr-python)cloud-hosted notebooks (us-west-2) or any S3 clientarchival files also readable in place via h5corodirect accessTeal = new data lake components (Iceberg · Icechunk/Zarr · query engine · direct access). Grey = existing / neutral; dashed = in development. Mission data processing (HySDS / MAAP) is complementary, not the data lake. STAC API is a user-facing view, not storage-at-rest. Tool choices are illustrative. 
\ No newline at end of file diff --git a/docs/dse-architecture-vision/diagrams_svg/cloud-data-lake-simple.svg b/docs/dse-architecture-vision/diagrams_svg/cloud-data-lake-simple.svg new file mode 100644 index 0000000..126f78e --- /dev/null +++ b/docs/dse-architecture-vision/diagrams_svg/cloud-data-lake-simple.svg @@ -0,0 +1,4 @@ + + +Earthdata CloudData LakeNetCDF4NetCDF4NetCDF4Query EngineUser Interfaceswebsites, notebooksAPIsstandard APIs: tiles, STAC, EDR, WMScustom outputs: reproject, reformat, analysisdiscover and filterraw data chunks,chunk manifests and metadatacloud-hosted systems support direct access to the query enginecloud-hosted systems support direct access to data \ No newline at end of file diff --git a/docs/dse-architecture-vision/diagrams_svg/nasa-esdis-evolution.svg b/docs/dse-architecture-vision/diagrams_svg/nasa-esdis-evolution.svg new file mode 100644 index 0000000..1f7be87 --- /dev/null +++ b/docs/dse-architecture-vision/diagrams_svg/nasa-esdis-evolution.svg @@ -0,0 +1,4 @@ + + +CMR APIDiscovery and AccessfcCommon MetadataRepository(CMR) RDStable providersServicesEarthdata CloudData Lakepython client | HTTP API | web appNetCDF4TEMPOHDF5ICESat-2COGHLSHDF5GPM IMERGHDF5NISARpython client | HTTP APIMission Product TeamsData ProductGenerationOn-Demand + Custom Product GenerationAccess / Subsetting / ReformattingVisualizationlarge async requests(servers)small synchronous requests(serverless)outputs topublishes toicechunk + iceberg datapipelinetriggers S3 notificationappends to icechunk(virtual or native)appends iceberg rowspython clientfast, pre-generated imagery(servers)dynamic (user-driven) imagery(servers)data directly in-browser(serverless, authentication barrier)pngGIBSusers select from existing registered algorithmsauthorized users use MAAPdata processing systemcloud-hosted notebooks + DPS APIsmall synchronous requests(serverless)Usersaccess viaaccessesquery + filtera collectiongenerates products on in developmentexistsaccess 
viastagingpublishes todiscovers frommoves topython client \ No newline at end of file diff --git a/docs/dse-architecture-vision/index.html b/docs/dse-architecture-vision/index.html new file mode 100644 index 0000000..bbbf8e1 --- /dev/null +++ b/docs/dse-architecture-vision/index.html @@ -0,0 +1,355 @@ + + + + + +NASA DSE — Architecture Vision (editable) + + + + + +
+ + +
+

NASA EARTHDATA CLOUD DATA SYSTEM EVOLUTION (DSE)

+

Cloud-Native Data Lake

+
+

From "lift and shift" granules to a unified, cloud-native data lake.

+

Working draft · June 2026  ·  JP Swinski, Sean Harkins & Aimee Barciauskas

+
+ + +
+

Why Now?

+

NASA has successfully migrated many of its collections to cloud storage. However, it still lacks a unifying vision for a cloud-native architecture.

+
+
TODAY: Earthdata Cloud is mission-centric & fragmented
    +
  • Each mission picks its own formats and chunking. There is little cross-mission consistency in data structure.
  • +
  • Even cloud-optimized HDF5 granules fall short of delivering the optimal performance from cloud object storage.
  • +
  • Metadata differs from mission to mission, which makes comparing similar products hard and data fusion painful.
  • +
  • There is no structural incentive for missions to invest in downstream usability.
  • +
  • Metadata and data are still managed separately, risking inconsistency.
  • +
  • Download + process is still a primary access pattern (e.g. download is the only access method available in the Earthdata Search user interface).
  • +
  • CMR is strained by analytics & AI-agent traffic
  • +
+
3-5 YEARS IN THE FUTURE: Earthdata Cloud is anchored by a unified, cloud-native data lake
    +
  • Organizing data around a single data model (i.e. Zarr) means a tool or service can be built once and used for many datasets.
  • +
  • Mission data product requirements are adopted: consistent cloud-friendly structure (sharding, chunking), consistent and standards-compliant metadata.
  • +
  • Chunk manifests create a bridge between archival file formats and a single data model.
  • +
  • Compliant data products are all available through a centralized data lake, utilizing a storage engine which supports ACID transactions for metadata and data.
  • +
  • SlideRule and Harmony leverage chunk-manifest indices for product generation and subsetting.
  • +
  • An integrated query engine enables a frictionless path from discovery to access.
  • +
  • A horizontally scalable query engine supports analytics and operational-scale consumers.
  • +
  • AI agents can use the query engine and suite of tools to more reliably discover, access, and interpret data for consumers.
  • +
+
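The chunk-manifest idea from the list above can be sketched in a few lines. This is a minimal illustration, not the Icechunk or VirtualiZarr format: the `ChunkRef` class, the bucket and file names, the byte offsets, and the 100-element chunk size are all hypothetical. It shows how a manifest lets a reader resolve an array slice to byte ranges inside existing archival files without copying any data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Where one chunk's bytes live inside an archival file (hypothetical layout)."""
    path: str    # object key of the archival granule (made-up name)
    offset: int  # byte offset of the chunk within that file
    length: int  # number of bytes in the chunk


# Hypothetical manifest for a 1-D array split into chunks of 100 elements.
# Keys are chunk indices; values point into existing files, so no data is copied.
CHUNK_SIZE = 100
manifest = {
    0: ChunkRef("s3://example-bucket/granule_a.h5", offset=4096, length=800),
    1: ChunkRef("s3://example-bucket/granule_a.h5", offset=4896, length=800),
    2: ChunkRef("s3://example-bucket/granule_b.h5", offset=4096, length=800),
}


def refs_for_slice(start: int, stop: int) -> list[ChunkRef]:
    """Return the chunk references needed to read array elements [start, stop)."""
    if stop <= start:
        return []
    first = start // CHUNK_SIZE
    last = (stop - 1) // CHUNK_SIZE
    return [manifest[i] for i in range(first, last + 1)]


# Elements 150..249 span chunks 1 and 2, which live in two different files.
needed = refs_for_slice(150, 250)
```

The reader sees one logical array; the manifest hides the fact that the bytes are spread across separate archival granules.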
+
+ + + +
+

High-level Data Lake Architecture

+ High-level architecture +
SVG: diagrams_svg/cloud-data-lake-simple.svg
+
+ + +
+

Benefits

+
+ + +
+

Put effort into data and metadata and the data does the work, not servers or users.

+
+
Optimal access is direct in-browser
+
The optimal path is direct, in-browser access: users explore the data in the browser, with no files to download and no libraries to install.
+
Why it's optimal
+
With well-structured data, consistent metadata, and caching, in-browser access is fast and cost-effective compared with server-side processing (where NASA must build and maintain services that do all the work) or data egress (where users do all the work and NASA pays to move data).
+
Instruct providers
+
Instruct NASA data providers on good metadata and cloud-friendly data delivery into the data lake, so users and services can get data out in an optimal way.
+
+ + +
+

The data lake complements existing systems

+
    +
  • The data lake will complement existing systems, not replace them.
  • +
  • Traditional access methods currently in use will keep working.
  • +
  • As datasets get integrated with Icechunk / Iceberg, users start seeing the benefits of more efficient and more powerful access.
  • +
+
+ + +
+

NASA ESDIS Architecture Cloud-Native Data Evolution

+ Reference architecture: tools and flows +
SVG: diagrams_svg/nasa-esdis-evolution.svg
+
+ + +
+

Roles & Responsibilities

+

Clear roles, responsibilities and well-defined interfaces will be required to transition from data silos and disparate systems to a shared cloud-native data lake.

+
+
Data Producer Teams
all NASA-funded data production (mission, science investigator & DAAC)
    +
  • Who? All NASA-funded teams producing products at any level, from mission science teams to project-funded teams producing value-added products.
  • +
  • Mission data processing is a complementary system which will feed into the data lake
  • +
  • Each mission's Data Management Plan (DMP) should require a detailed plan for adhering to data lake conventions, such as CF conventions, GeoZarr and object-store-optimized chunking.
  • +
  • DMPs should require a detailed plan for Icechunk or Iceberg delivery.
  • +
+
publish →
+
Data System Team
Maintains the data lake contract and infrastructure.
    +
  • Develop and maintain the standards for the data lake.
  • +
  • Develop and maintain interfaces for data producers to submit products to the data lake.
  • +
  • Maintain, monitor and secure the data storage and query engine infrastructure. Ensure durability and reliability. Validate incoming data.
  • +
  • Maintain supporting libraries for data integration.
  • +
+
→ consume
+
Data Services Teams
data lake consumers
    +
  • E.g. Subsetting & reformatting (Harmony); on-demand products (SlideRule)
  • +
  • Analytics & visualization services (TiTiler-CMR, Worldview, VEDA)
  • +
  • Build once on shared data models to support extensibility to many datasets.
  • +
+
+
+ + +
+

Additional recommendations

+
Mission product teams should be provided with AI tooling
+
Guidance on creating optimized archival formats and delivery to the Icechunk or Iceberg stores.
+
Enterprise tools accessing the data should pay special attention to the IO layer choice and configuration
+
What matters for efficient (fast, cost-minimizing) access is the object-store IO-layer library. Two high-performance options are object_store, used in zarr-python, and h5coro, used by SlideRule. Both offer advantages (next slide).
+
Implement a shared data cache
+
Services with high demand and low-latency requirements should consider using a data cache. Such a data cache could be shared by multiple services.
+
+ + + +
+

Discussion questions

+
1 — Consolidate the interfaces?
+
Should investment be consolidated into accessing data via the Zarr and Query Engine interfaces, ultimately divesting from CMR and HDF-specific tooling?
+
2 — The future of HDF
+
Would the natural outcome be moving away from HDF entirely? Or does NASA see archival file formats as having a role indefinitely?
+
+ + +
+

References

+ +
+ + +
+

Extras

+
+ + +
+

Extra — the array stack vs. the tabular stack

+

Two parallel stacks, one query engine. Choose the store by data shape: dense arrays → Zarr / Icechunk; records (points, swaths, features) → Parquet / Iceberg.

+ + + + + + + + + + +
| Layer | Array / n-D world | Tabular world |
| --- | --- | --- |
| On-disk format | Zarr | Parquet (GeoParquet for vector) |
| Transactional store | Icechunk | Iceberg (+ pyiceberg for snapshots) |
| Format reader (decode → memory) | zarr-python | pyarrow |
| In-memory analysis | xarray | pandas · polars · GeoPandas |
| Query engine | DataFusion (via zarr-datafusion) | DataFusion (polars overlaps) |
| Example NASA products | gridded L3/L4: GPM IMERG · NLDAS · MUR SST · TEMPO L3 · HLS · model reanalysis | points / records: ICESat-2 photons · GEDI footprints · swath L1B/L2 · in-situ & vector · STAC catalog |
+
One query engine over both: DataFusion spans the two stacks. polars is a dataframe library that also acts as a mini query engine, so it overlaps DataFusion rather than matching zarr-python's decode-only role.
+
+ + +
+

Object-store IO layer — what drives efficient, low-cost access

+

The IO library manages S3 GET requests (parallelism, synchronicity, request block size).

+
+

object_store

cloud-native · general purpose
    +
  • Unified async API over S3 / GCS / Azure (Apache Arrow project)
  • +
  • Powers the Rust data ecosystem: DataFusion, Iceberg, Icechunk / Zarr
  • +
  • Concurrent range reads + connection pooling for chunk-level access
  • +
+

h5coro

cloud-optimized HDF5 · file-level
    +
  • Reads HDF5 directly from S3 without the HDF5 library
  • +
  • Minimizes requests by smart caching of metadata and B-trees
  • +
  • Efficient access to existing archival granules, no reformatting
  • +
  • File-level and HDF5-specific by design
  • +
+
+

Bottom line: object_store for cloud-native stores; h5coro for legacy HDF5 in place.
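One of the levers an IO layer has over request count and block size is merging nearby byte ranges into a single GET. The sketch below is a generic illustration of that idea in plain Python; it is not taken from object_store or h5coro, and the function name, the `max_gap` parameter, and the sample ranges are assumptions made for the example.

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge (offset, length) byte ranges whose gaps are at most max_gap bytes."""
    merged = []
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Four chunk reads; the first three are nearly contiguous (made-up offsets).
chunk_ranges = [(0, 100), (100, 100), (210, 90), (5000, 100)]

# Tolerating a 16-byte gap turns four requests into two.
requests = coalesce_ranges(chunk_ranges, max_gap=16)
```

A larger `max_gap` trades a few wasted bytes for fewer round trips, which is usually the right trade on object storage where per-request latency dominates small reads.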

+
+ + +
+

Caching — a data cache, not a tiling cache

+

Cache the data, not the rendered pixels — so every service benefits, not just visualization. (FY27 focus: demonstrate caching performance.)

+
Multiscales live in the Icechunk store
+
The Icechunk store includes multiscales (overview / pyramid resolutions of the arrays), so coarse-resolution reads don't require pulling full-resolution chunks.
+
Cache the multiscales as data
+
These multiscales are cached as a data cache — cached Zarr arrays / chunks — not a tiling cache. The same cached data serves tiles, timeseries, analytics, and AI workloads alike.
+
Why a data cache wins
+
A tiling cache accelerates only one visualization service. A data cache accelerates every consumer of the array, is reused across services, and avoids re-rendering the same pixels repeatedly.
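The difference between a data cache and a tiling cache can be shown with a small sketch. Everything here is illustrative: the pyramid resolutions, the `pick_level` and `get_chunk` names, and the use of an in-process `functools.lru_cache` as a stand-in for a shared cache are assumptions, not a description of the planned implementation.

```python
from functools import lru_cache

# Hypothetical pyramid: level 0 is full resolution; each level halves it.
LEVEL_RESOLUTIONS_M = [30, 60, 120, 240]  # metres per pixel, assumed values


def pick_level(requested_resolution_m: float) -> int:
    """Pick the coarsest level that is still at least as fine as the request."""
    chosen = 0
    for level, res in enumerate(LEVEL_RESOLUTIONS_M):
        if res <= requested_resolution_m:
            chosen = level
    return chosen


fetch_count = 0  # counts how often the underlying store is actually read


@lru_cache(maxsize=1024)
def get_chunk(level: int, row: int, col: int) -> bytes:
    """Stand-in for a chunk read from the store, cached by chunk key."""
    global fetch_count
    fetch_count += 1
    return f"chunk-{level}-{row}-{col}".encode()


# Two different consumers end up needing the same coarse chunk.
tile_data = get_chunk(pick_level(130), 0, 0)    # e.g. a map tile request
series_data = get_chunk(pick_level(120), 0, 0)  # e.g. a time-series request
```

Because the cache key is the chunk rather than a rendered tile, the second consumer is served from the cache even though it is a different kind of service; the store is read only once.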
+
+ + +
+

AI as a primary objective

+

Access is shifting from web / Python / in-house systems toward AI agents. The architecture must let LLMs discover, reason about, and ingest NASA Earth data.

+
Discover
+
Rich, consistent metadata makes datasets findable by agents — and discovery will broaden beyond STAC. In two years, semantic search over ATBDs may matter as much as a STAC endpoint; the metadata layer is designed for both structured (STAC / query) and semantic / LLM retrieval.
+
Reason
+
Consistent metadata (CF, GeoZarr) plus a uniform query interface (DataFusion / SQL) let agents compare datasets and compose queries without per-dataset glue code.
+
Ingest
+
Cloud-friendly Zarr / Icechunk chunking and cached multiscales give agents efficient, low-cost array access at the resolution they need — direct over S3.
+
On the roadmap
+
Work with the AI/ML teams to demonstrate use of the data lake.
+
+ + +
+

Long term — a simpler vision

+
Deprecate HDF tooling in favor of Zarr
+
Once the data lake is established, support for HDF-specific tooling is deprecated in favor of supporting only Zarr tooling — a single, cloud-native access stack to build and maintain.
+
Access only via the query engine + Zarr interfaces
+
A future architecture focuses on data lake access through the query engine (DataFusion) and Zarr interfaces only — not CMR and not archival files. Direct CMR and archival-file access are slow, costly, and error-prone by comparison.
+
Why this is feasible
+
The DataFusion query layer is stateless and horizontally scalable over partitioned metadata in object storage — replacing CMR's single-database (RDS) bottleneck — so discovery and query scale with object storage and absorb analytics- and AI-agent-scale traffic.
+
+ +
+ + + + + diff --git a/docs/dse-architecture-vision/todos.md b/docs/dse-architecture-vision/todos.md new file mode 100644 index 0000000..a4224e7 --- /dev/null +++ b/docs/dse-architecture-vision/todos.md @@ -0,0 +1,3 @@ +- [ ] add stac, stac-geoparquet to services +- [x] revise roles and responsibilities +- [x] review roadmap \ No newline at end of file diff --git a/docs/roadmap.md b/docs/roadmap.md index a6c8988..f224254 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -53,6 +53,49 @@ The **◆** designation represents a category of ongoing work. | **Empowered users** | Cloud-native guidance · Science support · Format evaluation | In-browser rendering · Cloud-optimized decision framework · Improved access & auth libraries · Dataset + tooling coverage metrics | AI-assisted optimization (skills + tooling) · ESRI / ArcGIS integration | | **Trusted & reliable data** | ◆ Transactional Zarr (Icechunk) | Virtual stores for ongoing datasets · Synchronized metadata + data | Event-driven (object store notifications) for near-real time (NRT) updates | +## Phases + +While the grid above tracks *what* moves through our portfolio, the phases below sketch *when* — a notional sequence (timelines are notional, not concrete). + + + ODD phases — notional timeline + timelines are notional, not concrete + + FY26.4 + FY27.1 + FY27.2 + FY27.3 + FY27.4 + + + + + + + + + + + Demonstrate the data lake — varied datasets + + Demonstrate the query engine + service integration + + Demonstrate caching + AI use + + Throughout: socialization of the plan · external-team integration · iterating on the plan as we incorporate varied datasets + + Foundational libraries: Zarr · Icechunk · obstore (IO) · warp / resampling / projection performance · in-browser Zarr + COG · GeoZarr & standards + + +**FY26.4–27.1 — Demonstrate the data lake.** Demonstrate the utility and performance of Icechunk stores as a data lake platform across varied data types (HLS, NISAR, GPM IMERG, NLDAS, TEMPO, ...). 
Leverage VEDA instances to demonstrate the value of the data lake through services and, through direct access, its value to scientists. Simultaneously, we will migrate the data services components, specifically TiTiler-CMR, to the Data Services team. + +**FY27.1–27.2 — Demonstrate the query engine + service integration.** Showcase integrated discovery, query and access via the query engine. Integrate the query engine with data services so a single interface serves discovery, query, and access. + +**FY27.3–27.4 — Demonstrate caching + AI use.** Demonstrate performance using multiscales and a *data cache* (i.e. a distributed in-memory store). Work with the AI/ML teams to demonstrate use of the data lake by AI (e.g. Water Insight or EIE); LLMs discover, reason about, and ingest data from the lake. + +**Throughout — alongside every phase.** Socialize the vision with other teams and incorporate feedback. Iterate on the plan as we work to incorporate varied datasets. Continue foundational work in Zarr, Icechunk, and other underlying geospatial libraries: IO (obstore), warp / resampling / projection performance, reading and handling Zarr + COG directly in the browser, and geospatial data standards (GeoZarr). + + ## How we work ODD is a research and development team, not an operations or continued-maintenance team. Success for any item on this roadmap is *graduating off of it* — not staying on it indefinitely.