Explore faceted metadata query optimizations via pre-computation #18
Description
Goal
Complement geospatial optimizations (#17) with fast faceted metadata queries - the kind that power dashboards, filter dropdowns, and exploratory UI.
User experience we're targeting:
- Instant facet counts on page load ("1.2M samples from OPENCONTEXT, 3.1M from SESAR...")
- Responsive filter refinement (select material type → see updated counts)
- Fast cross-entity label lookups (show material names without slow joins)
The Problem
Even though parquet has good native optimizations (dictionary encoding, column stats), some queries still require full scans:
| Query Type | Current Behavior | Pain Point |
|---|---|---|
| Facet counts | `GROUP BY n` scans all rows | Slow on 20M rows |
| Filter + count | `WHERE material='Rock' GROUP BY source` | Compounds the cost |
| Label lookups | Join through `p__*` arrays | Array membership is slow |
| Dashboard totals | Multiple aggregations | Repeated scans |
What Parquet Gives Us (Free)
| Feature | Benefit |
|---|---|
| Dictionary encoding | `n`, `otype` filters are fast (int comparison) |
| Min/max stats | Range predicates skip row groups |
| Column pruning | Only reads columns in SELECT |
Already fast: `WHERE n = 'OPENCONTEXT'` (equality on a dictionary-encoded column)
Still slow: `SELECT n, COUNT(*) GROUP BY n` (must scan to count)
Pre-computation Strategies
1. Facet Count Summary Table (Recommended First)
Pre-compute common aggregations as a tiny parquet:
`facet_summaries.parquet` (~1KB):

| facet_type | facet_value | count |
|-------------|--------------|-----------|
| source | OPENCONTEXT | 1,200,000 |
| source | SESAR | 3,100,000 |
| source | GEOME | 1,500,000 |
| material | Rock | 2,500,000 |
| material | Ite | 800,000 |
| material | Ite | 600,000 |
| object_type | Specimen | 4,000,000 |
| ... | ... | ... |

Benefits:
- Instant dashboard load (fetch ~1KB instead of scanning 280MB)
- Can include combinations: a `source + material` cross-tab
- Easy to regenerate when data updates
Query pattern:

```sql
-- Instead of the slow scan:
SELECT n, COUNT(*) FROM wide WHERE otype='MaterialSampleRecord' GROUP BY n;

-- Fast lookup:
SELECT * FROM facet_summaries WHERE facet_type = 'source';
```

2. Cross-Facet Intersection Counts
For "how many Rock samples from OPENCONTEXT?":
`facet_intersections.parquet`:

| source | material | count |
|-------------|----------|---------|
| OPENCONTEXT | Rock | 50,000 |
| OPENCONTEXT | Ite | 30,000 |
| SESAR | Rock | 800,000 |

Trade-off: Combinatorial explosion if too many facets. Limit to top N values per facet.
3. Denormalized Label Columns
Avoid joins for display by flattening lookups:
```sql
-- Current: requires a join through the p__has_material_category array
SELECT msr.label, ic.label AS material
FROM wide msr, wide ic
WHERE ic.row_id = ANY(msr.p__has_material_category);
```

Pre-computed: labels already present:

| row_id | label | material_labels | context_labels |
|--------|-----------------|-----------------|------------------|
| 123 | "Pottery sherd" | ["Rock", "ite"] | ["Earth surface"] |

Trade-off: Increases file size and duplicates data. Best for frequently-displayed fields only.
4. Bloom Filters on High-Cardinality Columns
For "does this parquet contain samples from project X?":
```python
import pyarrow.parquet as pq

# Illustrative only: bloom-filter support depends on the writer.
# Note that pyarrow's column_encoding does NOT accept 'BLOOM_FILTER'
# (it is not an encoding), so check whether your pyarrow version or
# query engine exposes a dedicated bloom-filter option before relying
# on this.
pq.write_table(table, 'file.parquet',
               write_statistics=True)
```

Benefits: Fast negative lookups ("this row group definitely doesn't have X")
Interaction with PQG Schema
Same principle as #17: core spec stays clean, enhancements are optional
| File | Purpose | Size |
|---|---|---|
| `isamples_wide.parquet` | Core data | 280MB |
| `isamples_facet_summaries.parquet` | Pre-computed counts | ~10KB |
| `isamples_facet_intersections.parquet` | Cross-tab counts | ~100KB |
| `isamples_wide_denormalized.parquet` | With label columns | ~350MB |
Clients choose which files to use based on their needs.
Exploration Strategy
Phase 1: Identify Hot Queries
- What facets does the UI actually need?
- What's the current query time for each?
- Which cause the most pain?
Phase 2: Build Facet Summary Table
- Generate `facet_summaries.parquet`
- Measure dashboard load improvement
- Decide which cross-facet intersections matter
Phase 3: Evaluate Denormalization
- Prototype a `material_labels` column
- Measure size increase vs query speedup
- Decide if worth the trade-off
Phase 4: Document & Ship
- Add to PQG spec as optional enhancement files
- Update data pipeline to regenerate on refresh
- Publish alongside core parquet
Benchmark Matrix (To Fill In)
| Query | Current (scan) | With Summary Table | Notes |
|---|---|---|---|
| Source facet counts | ? ms | ? ms | |
| Material facet counts | ? ms | ? ms | |
| Source × Material cross-tab | ? ms | ? ms | |
| Sample with material labels | ? ms | ? ms | With denorm |
Success Criteria
- Dashboard facet counts load in <100ms
- Summary tables <1% size of main parquet
- Can regenerate summaries in <5 min
- Documented in PQG spec as optional enhancement files
Related
- #17 Explore geospatial query optimizations via H3 pre-computation (complementary work)
- #16 Define formal PQG Parquet Schema specification (where to document enhancement files)
- Parquet bloom filters