null-jones/openspeclib

OpenSpecLib

An open-source amalgamated spectral library and processing toolkit that combines spectral measurements from multiple authoritative sources into a unified, schema-validated data structure.

Overview

Spectral libraries are essential reference datasets for material identification, remote sensing, and geochemical analysis. However, the major publicly available libraries each employ different file formats, metadata schemas, and organizational conventions, creating barriers to cross-library search, comparison, and interoperability.

OpenSpecLib addresses this fragmentation by ingesting spectral data from multiple sources, normalizing it into a standard data structure defined by a formal JSON Schema, and producing a versioned master library suitable for downstream analysis and tool development.

Source Libraries

Currently Included

The following sources are ingested into the published master library (v0.0.5):

| Source | Materials | Wavelength Range | Spectra |
| --- | --- | --- | --- |
| USGS Spectral Library Version 7 | Minerals, rocks, soils, vegetation, water, man-made | 0.2 -- 200 um | ~2,500 |
| ECOSTRESS Spectral Library | Minerals, rocks, soils, vegetation, man-made, meteorites | 0.35 -- 15.4 um | ~3,400 |
| EcoSIS | Vegetation, canopy, soil, water, urban materials (curated subset) | 350 -- 2500 nm | ~17,000 |

Total: ~22,900 spectra across 3 source libraries.

Planned for Future Releases

The following sources have loaders implemented but require manual data placement and are not yet bundled in releases:

| Source | Materials | Wavelength Range | Spectra |
| --- | --- | --- | --- |
| RELAB Spectral Database | Minerals, meteorites, lunar samples | 0.3 -- 26 um | ~3,000 |
| ASU Thermal Emission Spectral Library | Rock-forming minerals (thermal IR) | 5 -- 45 um (2000 -- 220 cm-1) | ~800 |
| Bishop Spectral Library | Carbonates, hydrated minerals, phyllosilicates | 0.3 -- 25 um | ~500 |

See docs/adding-sources.md for guidance on integrating additional libraries.

Quick Start

Use the Web Viewer (no installation required)

The fastest way to explore OpenSpecLib is the browser-based viewer:

Open the OpenSpecLib Viewer

Search, filter, plot, and export spectra directly in your browser. The viewer runs entirely client-side via DuckDB-WASM — no server required, no account needed. Features:

  • Full-text search across ~22,900 spectra from 3 source libraries
  • Filter by material category, source library, measurement technique, and wavelength range
  • Plot individual spectra or build a custom library
  • Simulate satellite-sensor downsampling (Sentinel-2, Landsat, WorldView-3, SuperDove, Wyvern, EnMAP, PRISMA, Tanager)
  • Export to CSV or ENVI .sli/.hdr format
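
How the viewer resamples spectra is not described here. As a rough illustration of what sensor downsampling involves, the sketch below averages a spectrum over each band's wavelength interval (a simple boxcar response). The band names and edges are hypothetical placeholders, not real sensor specifications; real sensors are normally modelled with measured spectral response functions.

```python
from statistics import fmean

# Hypothetical band edges in nanometres (illustrative only, not real sensor specs).
BANDS = {
    "blue": (450, 520),
    "green": (520, 600),
    "red": (630, 690),
}

def downsample(wavelengths, values, bands):
    """Average the spectrum inside each half-open band interval [lo, hi).

    Returns a dict of band name -> mean value, or None when no sample
    falls inside the band.
    """
    result = {}
    for name, (lo, hi) in bands.items():
        inside = [v for w, v in zip(wavelengths, values) if lo <= w < hi]
        result[name] = fmean(inside) if inside else None
    return result

# Synthetic spectrum: 400-700 nm in 10 nm steps, values rising linearly.
wavelengths = list(range(400, 701, 10))
values = [0.10 + 0.02 * i for i in range(len(wavelengths))]
band_means = downsample(wavelengths, values, BANDS)
print(band_means)
```

Returning None for an uncovered band keeps a missing measurement distinct from a measured value of zero, which matters when a spectrum does not span the sensor's full range.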

Download a Release

For programmatic use, download the latest release from GitHub Releases:

import json
import pyarrow.parquet as pq

# Load the catalog (JSON; metadata index, no spectral arrays)
with open("catalog.json") as f:
    catalog = json.load(f)

print(f"Total spectra: {catalog['statistics']['total_spectra']}")
print(f"Sources: {list(catalog['sources'].keys())}")

# Find all olivine spectra
olivine = [s for s in catalog["spectra"] if "olivine" in s["name"].lower()]
print(f"Found {len(olivine)} olivine spectra")

# Per-source Parquet files: read the file the catalog points at, keeping only the matching row
first = olivine[0]
table = pq.read_table(first["chunk_file"], filters=[("id", "==", first["id"])])
row = table.to_pylist()[0]
print(f"Wavelengths: {row['spectral_data.wavelengths'][:5]}...")
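
When a search returns many spectra, it may help to group the catalog hits by Parquet file first so that each file is read only once. The sketch below is one possible way to do that. It relies only on the `id` and `chunk_file` fields used in the example above; the sample entries are made up for illustration.

```python
from collections import defaultdict

def group_ids_by_chunk(entries):
    """Map each chunk file path to the list of spectrum IDs requested from it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["chunk_file"]].append(entry["id"])
    return dict(grouped)

# Stand-in catalog entries; real ones would come from catalog["spectra"].
# These IDs and paths are invented for the example.
hits = [
    {"id": "a-001", "name": "Olivine A", "chunk_file": "spectra/source_a.parquet"},
    {"id": "b-007", "name": "Olivine B", "chunk_file": "spectra/source_b.parquet"},
    {"id": "a-002", "name": "Olivine C", "chunk_file": "spectra/source_a.parquet"},
]
plan = group_ids_by_chunk(hits)

for path, ids in plan.items():
    # Each file could then be read once, for example with
    # pq.read_table(path, filters=[("id", "in", ids)])
    print(path, ids)
```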

For ad-hoc analytics, query the Parquet files directly with DuckDB without loading them into Python. Category filtering is a column predicate — no file-partitioning needed:

SELECT id, name, "material.formula"
FROM 'spectra/usgs_splib07.parquet'
WHERE "material.category" = 'mineral'
  AND "spectral_data.wavelength_min" < 0.4;

See schemas/library.parquet-schema.md for the full column reference.

Installation

pip install -e .

# With development tools
pip install -e ".[dev]"

Building the Library Locally

USGS, ECOSTRESS, and EcoSIS download automatically via the CLI. RELAB, ASU TES, and Bishop require manual data placement into the target directory before ingest.

# Download source data (auto-download for bundled sources)
openspeclib download --source usgs --target ./raw/usgs
openspeclib download --source ecostress --target ./raw/ecostress
openspeclib download --source ecosis --target ./raw/ecosis

# Ingest each source
openspeclib ingest --source usgs --input ./raw/usgs --output ./processed/
openspeclib ingest --source ecostress --input ./raw/ecostress --output ./processed/
openspeclib ingest --source ecosis --input ./raw/ecosis --output ./processed/

# Combine into master library
openspeclib combine --input ./processed/ --output ./library/

# Validate
openspeclib validate ./library/
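
The same sequence can be scripted. The sketch below only assembles the command lines shown above for the three auto-downloading sources and prints them. The helper name `build_commands` and its directory defaults are illustrative and not part of OpenSpecLib.

```python
import shlex

# Sources that the CLI downloads automatically, per the section above.
BUNDLED_SOURCES = ["usgs", "ecostress", "ecosis"]

def build_commands(sources, raw_dir="./raw", processed_dir="./processed/",
                   library_dir="./library/"):
    """Return the download, ingest, combine, and validate commands as argv lists."""
    commands = []
    for source in sources:
        commands.append(["openspeclib", "download", "--source", source,
                         "--target", f"{raw_dir}/{source}"])
    for source in sources:
        commands.append(["openspeclib", "ingest", "--source", source,
                         "--input", f"{raw_dir}/{source}",
                         "--output", processed_dir])
    commands.append(["openspeclib", "combine", "--input", processed_dir,
                     "--output", library_dir])
    commands.append(["openspeclib", "validate", library_dir])
    return commands

commands = build_commands(BUNDLED_SOURCES)
for argv in commands:
    print(shlex.join(argv))
```

To execute the steps rather than print them, each argv list could be passed to `subprocess.run(argv, check=True)` so that the script stops at the first failing step.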

Data Structure

The master library uses a two-tier architecture:

  • Catalog (catalog.json) — Complete metadata index for every spectrum. No spectral arrays. Small enough to load in memory for search and discovery.
  • Library chunks (spectra/{source}.parquet) — Full spectrum records including wavelength and value arrays, stored as one file per source. Material category is an ordinary column, not a file partition. Stored as Apache Parquet (zstd-compressed) for fast columnar queries via DuckDB / Polars / pandas.

Each spectrum record contains:

  • Source provenance — library, version, DOI, license, citation
  • Material classification — name, category, subcategory, chemical formula, searchable keywords
  • Sample information — ID, description, particle size, origin, preparation
  • Measurement conditions — instrument, technique, laboratory, geometry
  • Spectral data — wavelength axis, values, bandpass, unit information
  • Quality indicators — bad band detection, coverage fraction

See docs/data-structure.md for the full schema specification.
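
The Parquet columns used in the Quick Start have dotted names such as `material.category` and `spectral_data.wavelengths`, which suggests that nested record groups are flattened into columns. The sketch below shows that flattening idea on a made-up record. It is not the project's actual serializer, and any field name that does not appear elsewhere in this README (for example `spectral_data.values`) is an assumption; the authoritative names are in the schema specification.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one dict whose keys are dotted paths."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

# Illustrative record; the ID, values, and some field names are invented.
record = {
    "id": "example-0001",
    "name": "Olivine (example)",
    "source": {"library": "usgs_splib07"},
    "material": {"category": "mineral", "formula": "(Mg,Fe)2SiO4"},
    "spectral_data": {
        "wavelengths": [0.35, 0.36, 0.37],
        "values": [0.12, 0.13, 0.15],  # hypothetical column name
    },
}
columns = flatten(record)
print(sorted(columns))
```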

Licensing

Important: Licensing terms differ between the source spectral libraries included in OpenSpecLib. Users must review and comply with the specific license for each source from which they use data.

The OpenSpecLib code is released under the MIT License. However, the spectral data retains the original licensing terms of each source library. While most sources are designated Public Domain, terms vary -- for example, the Bishop Spectral Library restricts use to non-commercial purposes with mandatory citation.

Every release includes a licenses.json file alongside catalog.json, providing a machine-readable index of licensing and citation information keyed by source identifier (matching the source.library field in catalog records and Parquet files). Use it to programmatically look up the license and required citation for any spectrum.
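
A lookup could be sketched as follows. The layout assumed here is a guess: licenses.json is treated as a mapping from source identifier to an object with `license` and `citation` keys, and the catalog entry is assumed to store its source identifier under `source` then `library`. Check both files in an actual release before relying on these names.

```python
import json

def license_for(entry, licenses):
    """Return the licenses.json record for a catalog entry's source library."""
    source_id = entry["source"]["library"]  # assumed nesting of source.library
    if source_id not in licenses:
        raise KeyError(f"no license record for source {source_id!r}")
    return licenses[source_id]

# Stand-in for the parsed contents of licenses.json; keys and values are invented.
licenses = json.loads(
    '{"example_source": {"license": "Public Domain", "citation": "Example citation"}}'
)
entry = {"id": "example-0001", "source": {"library": "example_source"}}

info = license_for(entry, licenses)
print(f"{entry['id']}: {info['license']} (cite: {info['citation']})")
```

Raising on an unknown source, instead of returning a default, avoids silently treating unlicensed data as unrestricted.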

See docs/licensing.md for full details and docs/provenance.md for complete source attributions.

Documentation

GitHub Actions

The library is built and released via a manually triggered GitHub Actions workflow:

  1. Go to Actions > Build and Release Spectral Library
  2. Click Run workflow
  3. Enter a semver version string (e.g., 1.0.0)
  4. Optionally select specific sources to include

The workflow downloads the data for the included sources, processes and combines it, validates the output, and creates a GitHub Release with the packaged library.

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Lint
ruff check src/ tests/

# Regenerate schemas (catalog/spectrum JSON Schemas + chunk Arrow schema dump)
python scripts/generate_schemas.py

License

MIT License. See LICENSE for details.

Source spectral libraries retain their original licenses. See docs/provenance.md for individual source library terms.
