Explore faceted metadata query optimizations via pre-computation #18

@rdhyee

Description

Goal

Complement geospatial optimizations (#17) with fast faceted metadata queries: the kind that power dashboards, filter dropdowns, and exploratory UI.

User experience we're targeting:

  • Instant facet counts on page load ("1.2M samples from OPENCONTEXT, 3.1M from SESAR...")
  • Responsive filter refinement (select material type → see updated counts)
  • Fast cross-entity label lookups (show material names without slow joins)

The Problem

Even though parquet has good native optimizations (dictionary encoding, column stats), some queries still require full scans:

| Query type | Current behavior | Pain point |
|------------|------------------|------------|
| Facet counts | `GROUP BY n` scans all rows | Slow on 20M rows |
| Filter + count | `WHERE material='Rock' GROUP BY source` | Compounds the cost |
| Label lookups | Join through `p__*` arrays | Array membership is slow |
| Dashboard totals | Multiple aggregations | Repeated scans |

What Parquet Gives Us (Free)

| Feature | Benefit |
|---------|---------|
| Dictionary encoding | `n`, `otype` filters are fast (int comparison) |
| Min/max stats | Range predicates skip row groups |
| Column pruning | Only reads columns in `SELECT` |

Already fast: WHERE n = 'OPENCONTEXT' (equality on dictionary column)

Still slow: SELECT n, COUNT(*) GROUP BY n (must scan to count)


Pre-computation Strategies

1. Facet Count Summary Table (Recommended First)

Pre-compute common aggregations as a tiny parquet:

-- facet_summaries.parquet (~1KB)
| facet_type  | facet_value | count     |
|-------------|-------------|-----------|
| source      | OPENCONTEXT | 1,200,000 |
| source      | SESAR       | 3,100,000 |
| source      | GEOME       | 1,500,000 |
| material    | Rock        | 2,500,000 |
| material    | …           | 800,000   |
| material    | …           | 600,000   |
| object_type | Specimen    | 4,000,000 |
| ...         | ...         | ...       |

Benefits:

  • Instant dashboard load (fetch 1KB instead of scanning 280MB)
  • Can include combinations: source + material cross-tab
  • Easy to regenerate when data updates

Query pattern:

-- Instead of slow:
SELECT n, COUNT(*) FROM wide WHERE otype='MaterialSampleRecord' GROUP BY n

-- Fast lookup:
SELECT * FROM facet_summaries WHERE facet_type = 'source'
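The generation step can be sketched in plain Python (a production pipeline would run this as a `GROUP BY` over the parquet file, e.g. via DuckDB, and write the result as `facet_summaries.parquet`; the miniature dataset and column values below are purely illustrative):

```python
from collections import Counter

def build_facet_summaries(rows, facet_columns):
    """One pass over the data per facet column, emitting the
    (facet_type, facet_value, count) rows of the summary table."""
    summaries = []
    for col in facet_columns:
        counts = Counter(row[col] for row in rows if row.get(col) is not None)
        for value, count in counts.most_common():
            summaries.append(
                {"facet_type": col, "facet_value": value, "count": count}
            )
    return summaries

# Hypothetical 3-row stand-in for the 20M-row wide table.
rows = [
    {"source": "SESAR", "material": "Rock"},
    {"source": "SESAR", "material": "Rock"},
    {"source": "OPENCONTEXT", "material": "Rock"},
]
summaries = build_facet_summaries(rows, ["source", "material"])
```

The dashboard then reads this tiny table directly instead of aggregating the full parquet on every page load.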

2. Cross-Facet Intersection Counts

For "how many Rock samples from OPENCONTEXT?":

-- facet_intersections.parquet
| source      | material | count   |
|-------------|----------|---------|
| OPENCONTEXT | Rock     | 50,000  |
| OPENCONTEXT | …        | 30,000  |
| SESAR       | Rock     | 800,000 |

Trade-off: Combinatorial explosion if too many facets. Limit to top N values per facet.
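The top-N cap keeps the cross-tab bounded at N² rows per facet pair. A plain-Python sketch of the generation step (illustrative function and column names; the real pipeline would express this as a filtered `GROUP BY` over the parquet):

```python
from collections import Counter

def build_intersections(rows, facet_a, facet_b, top_n=50):
    """Cross-tab counts for two facets, restricted to each facet's
    top-N values to avoid combinatorial explosion."""
    top_a = {v for v, _ in Counter(r[facet_a] for r in rows).most_common(top_n)}
    top_b = {v for v, _ in Counter(r[facet_b] for r in rows).most_common(top_n)}
    pairs = Counter(
        (r[facet_a], r[facet_b])
        for r in rows
        if r[facet_a] in top_a and r[facet_b] in top_b
    )
    return [{facet_a: a, facet_b: b, "count": c} for (a, b), c in pairs.items()]

# Hypothetical miniature dataset.
rows = [
    {"source": "SESAR", "material": "Rock"},
    {"source": "SESAR", "material": "Rock"},
    {"source": "OPENCONTEXT", "material": "Soil"},  # illustrative value
]
xtab = build_intersections(rows, "source", "material", top_n=10)
```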

3. Denormalized Label Columns

Avoid joins for display by flattening lookups:

-- Current: requires join through p__has_material_category array
SELECT msr.label, ic.label as material
FROM wide msr, wide ic  
WHERE ic.row_id = ANY(msr.p__has_material_category)

-- Pre-computed: labels already present
| row_id | label           | material_labels | context_labels    |
|--------|-----------------|-----------------|-------------------|
| 123    | "Pottery sherd" | ["Rock", …]     | ["Earth surface"] |

Trade-off: Increases file size, duplicates data. Best for frequently-displayed fields only.
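The flattening step is a single pass that resolves row-id references to display labels up front, so readers never pay for the array-membership join. A sketch, assuming (for illustration) each row carries `row_id`, `label`, and an optional array of referenced ids:

```python
def denormalize_labels(rows, ref_col="p__has_material_category"):
    """Add a material_labels column containing the labels of the rows
    referenced by ref_col, replacing the join-through-array lookup."""
    labels_by_id = {r["row_id"]: r["label"] for r in rows}
    out = []
    for r in rows:
        refs = r.get(ref_col) or []
        labels = [labels_by_id[i] for i in refs if i in labels_by_id]
        out.append({**r, "material_labels": labels})
    return out

# Hypothetical two-row example: row 123 references row 1 for its material.
rows = [
    {"row_id": 1, "label": "Rock", "p__has_material_category": None},
    {"row_id": 123, "label": "Pottery sherd", "p__has_material_category": [1]},
]
flat = denormalize_labels(rows)
```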

4. Bloom Filters on High-Cardinality Columns

For "does this parquet contain samples from project X?":

# Write column statistics when producing the parquet. Note: pyarrow's
# `column_encoding` parameter selects value encodings (e.g. 'PLAIN',
# 'DELTA_BINARY_PACKED'), not bloom filters; bloom-filter writing, where
# supported, is a separate writer option -- check the docs for the
# pyarrow version in use.
import pyarrow.parquet as pq

pq.write_table(table, 'file.parquet',
    write_statistics=True)

Benefits: Fast negative lookups ("this row group definitely doesn't have X")
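The guarantee can be illustrated with a toy filter (hand-rolled for the sketch, not the Parquet implementation, which lives in the file metadata per row group): membership answers are probabilistic, but "absent" answers are always correct.

```python
import hashlib

class ToyBloom:
    """Minimal Bloom filter: k hash positions per item over a fixed
    bit array. 'might_contain' can false-positive but never
    false-negative, which is what makes row-group skipping safe."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size, self.k = size_bits, hashes
        self.bits = 0  # bit array packed into one int

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = ToyBloom()
bf.add("OPENCONTEXT")
# bf.might_contain("OPENCONTEXT") is always True; a value never added is
# almost always reported absent, and an "absent" answer is guaranteed correct.
```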


Interaction with PQG Schema

Same principle as #17: the core spec stays clean; enhancements are optional.

| File | Purpose | Size |
|------|---------|------|
| isamples_wide.parquet | Core data | 280MB |
| isamples_facet_summaries.parquet | Pre-computed counts | ~10KB |
| isamples_facet_intersections.parquet | Cross-tab counts | ~100KB |
| isamples_wide_denormalized.parquet | With label columns | ~350MB |

Clients choose which files to use based on their needs.


Exploration Strategy

Phase 1: Identify Hot Queries

  • What facets does the UI actually need?
  • What's the current query time for each?
  • Which cause the most pain?

Phase 2: Build Facet Summary Table

  • Generate facet_summaries.parquet
  • Measure dashboard load improvement
  • Decide which cross-facet intersections matter

Phase 3: Evaluate Denormalization

  • Prototype material_labels column
  • Measure size increase vs query speedup
  • Decide if worth the trade-off

Phase 4: Document & Ship

  • Add to PQG spec as optional enhancement files
  • Update data pipeline to regenerate on refresh
  • Publish alongside core parquet

Benchmark Matrix (To Fill In)

| Query | Current (scan) | With summary table | Notes |
|-------|----------------|--------------------|-------|
| Source facet counts | ? ms | ? ms | |
| Material facet counts | ? ms | ? ms | |
| Source × Material cross-tab | ? ms | ? ms | |
| Sample with material labels | ? ms | ? ms | With denorm |

Success Criteria

  • Dashboard facet counts load in <100ms
  • Summary tables <1% size of main parquet
  • Can regenerate summaries in <5 min
  • Documented in PQG spec as optional enhancement files

Related

cc @smrgeoinfo @datadavev
