+
+
+
+ NASA EARTHDATA CLOUD DATA SYSTEM EVOLUTION (DSE)
+ Cloud-Native Data Lake
+
+ From "lift and shift" granules to a unified, cloud-native data lake.
+ Working draft · June 2026 · JP Swinski, Sean Harkins & Aimee Barciauskas
+
+
+
+
+ Why Now?
+ NASA has successfully migrated many of its collections to cloud storage. However, it still lacks a unifying vision for a cloud-native architecture.
+
+
TODAY: Earthdata Cloud is mission-centric & fragmented
+ - Each mission picks its own formats and chunking. There is little cross-mission consistency in data structure.
+ - Even cloud-optimized HDF5 granules fall short of delivering the optimal performance from cloud object storage.
+ - Differentiated metadata makes comparing similar products hard; data fusion is painful.
+ - There is no structural incentive for missions to invest in downstream usability.
+ - Metadata and data management are still distinct, risking inconsistency.
+ - Download-then-process is still a primary access pattern (e.g. download is the only access method available in the Earthdata Search user interface).
+ - CMR is strained by analytics and AI-agent traffic.
+
+
3-5 YEARS IN THE FUTURE: Earthdata Cloud is anchored by a unified, cloud-native data lake
+ - Organizing data around a single data model (i.e. Zarr) means a tool or service is built once and can be used for many datasets.
+ - Mission data product requirements are adopted: consistent cloud-friendly structure (sharding, chunking), consistent and standards-compliant metadata.
+ - Chunk manifests create a bridge between archival file formats and a single data model.
+ - Compliant data products are all available through a centralized data lake, utilizing a storage engine which supports ACID transactions for metadata and data.
+ - SlideRule and Harmony leverage chunk-manifest indices for product generation and subsetting.
+ - An integrated query engine enables a frictionless path from discovery to access.
+ - A horizontally scalable query engine supports analytics and operational-scale consumers.
+ - AI agents would be able to utilize the query engine and suite of tools to more reliably discover, access and interpret data for consumers.
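A chunk manifest can be pictured as a lookup table from Zarr-style chunk keys to byte ranges inside existing archival files. The sketch below is a minimal illustration of that idea only, not the actual Icechunk manifest format. The `ChunkRef` type, the bucket path, and every offset are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical manifest entry: where one chunk's bytes live in an archival file."""
    path: str    # object-store URL of the archival granule (made-up example)
    offset: int  # byte offset of the chunk within that file
    length: int  # number of bytes in the chunk


# Zarr-style chunk key -> byte range in an existing HDF5 granule.
# All values are illustrative, not taken from a real product.
manifest = {
    "temperature/0.0": ChunkRef("s3://example-bucket/granule_001.h5", 4096, 1_048_576),
    "temperature/0.1": ChunkRef("s3://example-bucket/granule_001.h5", 1_052_672, 1_048_576),
}


def range_header(ref: ChunkRef) -> str:
    """Build the HTTP Range header that would fetch exactly this chunk."""
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={ref.offset}-{ref.offset + ref.length - 1}"


header = range_header(manifest["temperature/0.1"])
```

Because the manifest only points at bytes, the archival file stays untouched while readers see a single array data model.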
+
+
+
+
+
+
+
+ High-level Data Lake Architecture
+
+
SVG: diagrams_svg/cloud-data-lake-simple.svg
+
+
+
+
+
+
+
+ Put effort into data and metadata and the data does the work, not servers or users.
+
+ Optimal access is direct in-browser
+ The optimal path is direct, in-browser access: users explore the data in the browser, with no files to download and no libraries to install.
+ Why it's optimal
+ With well-structured data, consistent metadata, and caching, in-browser access is fast and cost-effective. The alternatives are server-side processing (NASA must build and maintain services that do all the work) and data egress (users do all the work and NASA pays to move data).
+ Instruct providers
+ Instruct NASA data providers on good metadata and cloud-friendly data delivery into the data lake, so users and services can get data out in an optimal way.
+
+
+
+
+ The data lake complements existing systems
+
+ - The data lake will complement existing systems, not replace them.
+ - Traditional access methods currently in use will keep working.
+ - As datasets get integrated with Icechunk / Iceberg, users start seeing the benefits of more efficient and more powerful access.
+
+
+
+
+
+ NASA ESDIS Architecture Cloud-Native Data Evolution
+
+ SVG: diagrams_svg/nasa-esdis-evolution.svg
+
+
+
+
+ Roles & Responsibilities
+ Clear roles, responsibilities and well-defined interfaces will be required to transition from data silos and disparate systems to a shared cloud-native data lake.
+
+
Data Producer Teams
all NASA-funded data production (mission, science investigator & DAAC)
+ - Who? All NASA-funded teams that produce data products at any level, from mission science teams to project-funded value-added product teams.
+ - Mission data processing is a complementary system that feeds into the data lake.
+ - Each mission's Data Management Plan (DMP) should require a detailed plan for adhering to data lake conventions, such as CF conventions, GeoZarr and object-store-optimized chunking.
+ - DMPs should require a detailed plan for Icechunk or Iceberg delivery.
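One way a DMP could make "object-store-optimized chunking" checkable is to state a target chunk size and test proposed chunk shapes against it. The sketch below assumes a size window of 1 to 16 MiB purely for illustration. The thresholds and function names are hypothetical and are not a stated data lake requirement.

```python
from math import prod

# Hypothetical target window for uncompressed chunk size; a real convention
# would set these numbers, they are not taken from the source.
MIN_BYTES = 1 * 1024 * 1024
MAX_BYTES = 16 * 1024 * 1024


def chunk_nbytes(chunk_shape: tuple[int, ...], itemsize: int) -> int:
    """Uncompressed size of one chunk: number of elements times bytes per element."""
    return prod(chunk_shape) * itemsize


def within_target(chunk_shape: tuple[int, ...], itemsize: int) -> bool:
    """True when the chunk falls inside the assumed size window."""
    return MIN_BYTES <= chunk_nbytes(chunk_shape, itemsize) <= MAX_BYTES


# A float32 (4-byte) array chunked as 1 time step x 1024 x 1024 pixels.
size = chunk_nbytes((1, 1024, 1024), 4)
ok = within_target((1, 1024, 1024), 4)
# A much smaller chunk would mean many tiny object-store requests.
too_small = within_target((1, 64, 64), 4)
```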
+
+
publish →
+
Data System Team
Maintains the data lake contract and infrastructure.
+ - Develop and maintain the standards for the data lake.
+ - Develop and maintain interfaces for data producers to submit products to the data lake.
+ - Maintain, monitor and secure the data storage and query engine infrastructure. Ensure durability and reliability. Validate incoming data.
+ - Maintain supporting libraries for data integration.
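Validating incoming data could start with simple metadata checks at submission time. The sketch below assumes a rule that every variable must carry `units` and `standard_name`. Both are real CF attribute names, but making both mandatory, and the `missing_attrs` helper itself, are assumptions for illustration.

```python
# Hypothetical rule: every variable must carry these attributes.
# `units` and `standard_name` are real CF attribute names, but requiring
# both on every variable is an assumption for illustration.
REQUIRED_ATTRS = ("units", "standard_name")


def missing_attrs(variables: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Return, per variable, the required attributes that are absent or empty."""
    problems = {}
    for name, attrs in variables.items():
        missing = [key for key in REQUIRED_ATTRS if not attrs.get(key)]
        if missing:
            problems[name] = missing
    return problems


# Made-up submission: one compliant variable, one missing an attribute.
submission = {
    "sst": {"units": "kelvin", "standard_name": "sea_surface_temperature"},
    "quality": {"units": "1"},
}
report = missing_attrs(submission)
```

An empty report would mean the submission passes this particular check; a real validator would cover far more than attribute presence.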
+
+
→ consume
+
Data Services Teams
data lake consumers
+ - E.g. Subsetting & reformatting (Harmony); on-demand products (SlideRule)
+ - Analytics & visualization services (TiTiler-CMR, Worldview, VEDA)
+ - Build once on shared data models to support extensibility to many datasets.
+
+
+
+
+
+
+ Additional recommendations
+ Mission product teams should be provided with AI tooling
+ Guidance on creating optimized archival formats and delivery to the Icechunk or Iceberg stores.
+ Enterprise tools accessing the data should pay special attention to the IO layer choice and configuration
+ What matters for efficient (fast, cost-minimizing) access is the object-store IO-layer library. Two high-performance options are object_store, used in zarr-python, and h5coro, used by SlideRule. Both offer advantages (see the object-store IO layer slide).
+ Implement a shared data cache
+ Services with high demand and low-latency requirements should consider using a data cache. Such a data cache could be shared by multiple services.
+
+
+
+
+
+ Discussion questions
+ 1 — Consolidate the interfaces?
+ Should investment be consolidated into accessing data via the Zarr and Query Engine interfaces, ultimately divesting from CMR and HDF-specific tooling?
+ 2 — The future of HDF
+ Would the natural outcome be moving away from HDF entirely? Or does NASA see archival file formats as having a role indefinitely?
+
+
+
+
+
+
+
+
+
+
+ Extra — the array stack vs. the tabular stack
+ Two parallel stacks, one query engine. Choose the store by data shape: dense arrays → Zarr / Icechunk; records (points, swaths, features) → Parquet / Iceberg.
+
+ | Layer | Array / n-D world | Tabular world |
+ | --- | --- | --- |
+ | On-disk format | Zarr | Parquet (GeoParquet for vector) |
+ | Transactional store | Icechunk | Iceberg (+ pyiceberg for snapshots) |
+ | Format reader (decode → memory) | zarr-python | pyarrow |
+ | In-memory analysis | xarray | pandas · polars · GeoPandas |
+ | Query engine | DataFusion (via zarr-datafusion) | DataFusion (polars overlaps) |
+ | Example NASA products | gridded L3/L4 — GPM IMERG · NLDAS · MUR SST · TEMPO L3 · HLS · model reanalysis | points / records — ICESat-2 photons · GEDI footprints · swath L1B/L2 · in-situ & vector · STAC catalog |
+
+
+ One query engine over both: DataFusion spans the two stacks. polars is a dataframe library that also acts as a mini query engine, so it overlaps DataFusion rather than matching zarr-python's decode-only role.
+
+
+
+
+ Object-store IO layer — what drives efficient, low-cost access
+ The IO library manages S3 GET requests (parallelism, synchronicity, request block size).
+
+
object_store
cloud-native · general purpose
+ - Unified async API over S3 / GCS / Azure (Apache Arrow project)
+ - Powers the Rust data ecosystem: DataFusion, Iceberg, Icechunk / Zarr
+ - Concurrent range reads + connection pooling for chunk-level access
+
+
h5coro
cloud-optimized HDF5 · file-level
+ - Reads HDF5 directly from S3 without the HDF5 library
+ - Minimizes requests by smart caching of metadata and B-trees
+ - Efficient access to existing archival granules, no reformatting
+ - File-level and HDF5-specific by design
+
+
+ Bottom line: object_store for cloud-native stores; h5coro for legacy HDF5 in place.
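One common way an IO layer controls request count and block size is to merge nearby byte ranges into a single GET. The sketch below shows that technique in plain Python under stated assumptions. It is not the code of object_store or h5coro, and the `max_gap` threshold is a hypothetical tuning parameter.

```python
def coalesce(ranges: list[tuple[int, int]], max_gap: int) -> list[tuple[int, int]]:
    """Merge (start, end) byte ranges, end exclusive, whose gaps are at most max_gap bytes."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing another request.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Four chunk reads; the first three sit close together in the file.
requests = [(0, 100), (120, 200), (210, 300), (10_000, 10_100)]
plan = coalesce(requests, max_gap=64)
```

Here four reads become two requests, at the cost of fetching a few unneeded bytes in the gaps. The merged requests can then be issued concurrently.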
+
+
+
+
+ Caching — a data cache, not a tiling cache
+ Cache the data, not the rendered pixels — so every service benefits, not just visualization. (FY27 focus: demonstrate caching performance.)
+ Multiscales live in the Icechunk store
+ The Icechunk store includes multiscales (overview / pyramid resolutions of the arrays), so coarse-resolution reads don't require pulling full-resolution chunks.
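Choosing which multiscale level to read can be sketched as picking the coarsest level that still meets the requested resolution. The pyramid below is hypothetical (each level doubles the pixel size), and `pick_level` is an illustrative helper, not part of Icechunk or Zarr.

```python
def pick_level(level_resolutions: list[float], requested: float) -> int:
    """Index of the coarsest level that is still at least as fine as requested.

    level_resolutions lists the pixel size of each level (same units as
    `requested`), ordered from finest (index 0) to coarsest.
    """
    chosen = 0  # fall back to full resolution
    for index, resolution in enumerate(level_resolutions):
        if resolution <= requested:
            chosen = index
    return chosen


# Hypothetical pyramid: each level doubles the pixel size, in degrees.
levels = [0.01, 0.02, 0.04, 0.08]
overview = pick_level(levels, requested=0.05)   # a zoomed-out map view
detail = pick_level(levels, requested=0.005)    # finer than anything stored
```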
+ Cache the multiscales as data
+ These multiscales are cached as a data cache — cached Zarr arrays / chunks — not a tiling cache. The same cached data serves tiles, timeseries, analytics, and AI workloads alike.
+ Why a data cache wins
+ A tiling cache accelerates only one visualization service. A data cache accelerates every consumer of the array, is reused across services, and avoids re-rendering the same pixels repeatedly.
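The difference from a tiling cache can be sketched with a small least-recently-used cache that stores raw chunk bytes rather than rendered pixels. Everything here is illustrative: `ChunkCache`, the chunk key layout, and `fetch_from_store` (a stand-in for an object-store read) are hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class ChunkCache:
    """Tiny LRU cache of raw chunk bytes, keyed by chunk identifier."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: OrderedDict[str, bytes] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: str, fetch: Callable[[str], bytes]) -> bytes:
        if key in self._items:
            self.hits += 1
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        self.misses += 1
        data = fetch(key)
        self._items[key] = data
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
        return data


def fetch_from_store(key: str) -> bytes:
    """Stand-in for an object-store read; returns fake bytes."""
    return key.encode()


cache = ChunkCache(capacity=2)
# Two hypothetical services (tiling and timeseries) ask for the same chunk.
tile_bytes = cache.get("sst/level2/0.0", fetch_from_store)
series_bytes = cache.get("sst/level2/0.0", fetch_from_store)
```

The second request is served from the cache even though it comes from a different kind of consumer, which is the property a cache of rendered tiles cannot offer.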
+
+
+
+
+ AI as a primary objective
+ Access is shifting from web / Python / in-house systems toward AI agents. The architecture must let LLMs discover, reason about, and ingest NASA Earth data.
+ Discover
+ Rich, consistent metadata makes datasets findable by agents — and discovery will broaden beyond STAC. In two years, semantic search over ATBDs may matter as much as a STAC endpoint; the metadata layer is designed for both structured (STAC / query) and semantic / LLM retrieval.
+ Reason
+ Consistent metadata (CF, GeoZarr) plus a uniform query interface (DataFusion / SQL) let agents compare datasets and compose queries without per-dataset glue code.
+ Ingest
+ Cloud-friendly Zarr / Icechunk chunking and cached multiscales give agents efficient, low-cost array access at the resolution they need — direct over S3.
+ On the roadmap
+ Work with the AI/ML teams to demonstrate use of the data lake.
+
+
+
+
+ Long term — a simpler vision
+ Deprecate HDF tooling in favor of Zarr
+ Once the data lake is established, support for HDF-specific tooling is deprecated in favor of supporting only Zarr tooling — a single, cloud-native access stack to build and maintain.
+ Access only via the query engine + Zarr interfaces
+ A future architecture focuses on data lake access through the query engine (DataFusion) and Zarr interfaces only — not CMR and not archival files. Direct CMR and archival-file access are slow, costly, and error-prone by comparison.
+ Why this is feasible
+ The DataFusion query layer is stateless and horizontally scalable over partitioned metadata in object storage — replacing CMR's single-database (RDS) bottleneck — so discovery and query scale with object storage and absorb analytics- and AI-agent-scale traffic.
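The scaling argument rests on partition pruning: a query reads only the metadata partitions it needs, and the worker answering it keeps no state. The sketch below illustrates that idea in plain Python rather than DataFusion. The partition layout, collection identifiers, and paths are hypothetical.

```python
# Hypothetical layout: one metadata file per (collection, year) partition.
partitions = [
    {"collection": "MUR_SST", "year": 2023, "path": "meta/collection=MUR_SST/year=2023/part.parquet"},
    {"collection": "MUR_SST", "year": 2024, "path": "meta/collection=MUR_SST/year=2024/part.parquet"},
    {"collection": "GPM_IMERG", "year": 2024, "path": "meta/collection=GPM_IMERG/year=2024/part.parquet"},
]


def prune(collection: str, start_year: int, end_year: int) -> list[str]:
    """Return only the partition files a query needs to read.

    The function holds no state between calls, so any number of identical
    workers could answer queries in parallel.
    """
    return [
        p["path"]
        for p in partitions
        if p["collection"] == collection and start_year <= p["year"] <= end_year
    ]


to_read = prune("MUR_SST", 2024, 2024)
```

Adding capacity then means adding workers, since the partition files in object storage are the only shared resource.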
+
+
+