Explore faceted metadata query optimizations via pre-computation #18
Description
Goal
Complement geospatial optimizations (#17) with fast faceted metadata queries - the kind that power dashboards, filter dropdowns, and exploratory UI.
User experience we're targeting:
- Instant facet counts on page load ("1.2M samples from OPENCONTEXT, 3.1M from SESAR...")
- Responsive filter refinement (select material type → see updated counts)
- Fast cross-entity label lookups (show material names without slow joins)
The Problem
Even though parquet has good native optimizations (dictionary encoding, column stats), some queries still require full scans:
| Query Type | Current Behavior | Pain Point |
|---|---|---|
| Facet counts | `GROUP BY n` scans all rows | Slow on 20M rows |
| Filter + count | `WHERE material='Rock' GROUP BY source` | Compounds the cost |
| Label lookups | Join through `p__*` arrays | Array membership is slow |
| Dashboard totals | Multiple aggregations | Repeated scans |
What Parquet Gives Us (Free)
| Feature | Benefit |
|---|---|
| Dictionary encoding | `n`, `otype` filters are fast (int comparison) |
| Min/max stats | Range predicates skip row groups |
| Column pruning | Only reads columns in SELECT |
Already fast: `WHERE n = 'OPENCONTEXT'` (equality on a dictionary-encoded column)
Still slow: `SELECT n, COUNT(*) GROUP BY n` (must scan to count)
Pre-computation Strategies
1. Facet Count Summary Table (Recommended First)
Pre-compute common aggregations as a tiny parquet:
`facet_summaries.parquet` (~1KB):

| facet_type | facet_value | count |
|-------------|--------------|-----------|
| source | OPENCONTEXT | 1,200,000 |
| source | SESAR | 3,100,000 |
| source | GEOME | 1,500,000 |
| material | Rock | 2,500,000 |
| material | Ite | 800,000 |
| material | Ite | 600,000 |
| object_type | Specimen | 4,000,000 |
| ... | ... | ... |

Benefits:
- Instant dashboard load (fetch ~1KB instead of scanning 280MB)
- Can include combinations: a `source + material` cross-tab
- Easy to regenerate when data updates
Query pattern:

```sql
-- Instead of the slow scan:
SELECT n, COUNT(*) FROM wide WHERE otype='MaterialSampleRecord' GROUP BY n;

-- Fast lookup:
SELECT * FROM facet_summaries WHERE facet_type = 'source';
```

2. Cross-Facet Intersection Counts
For "how many Rock samples from OPENCONTEXT?":
`facet_intersections.parquet`:

| source | material | count |
|-------------|----------|---------|
| OPENCONTEXT | Rock | 50,000 |
| OPENCONTEXT | Ite | 30,000 |
| SESAR | Rock | 800,000 |

Trade-off: Combinatorial explosion if too many facets. Limit to top N values per facet.
3. Denormalized Label Columns
Avoid joins for display by flattening lookups:
```sql
-- Current: requires a join through the p__has_material_category array
SELECT msr.label, ic.label AS material
FROM wide msr, wide ic
WHERE ic.row_id = ANY(msr.p__has_material_category);
```

Pre-computed: labels already present:

| row_id | label | material_labels | context_labels |
|--------|-----------------|-----------------|------------------|
| 123 | "Pottery sherd" | ["Rock", "ite"] | ["Earth surface"] |

Trade-off: Increases file size and duplicates data. Best for frequently-displayed fields only.
4. Bloom Filters on High-Cardinality Columns
For "does this parquet contain samples from project X?":
```python
import pyarrow.parquet as pq

# Illustrative only: bloom-filter support depends on the writer.
# Note that pyarrow's column_encoding does NOT accept 'BLOOM_FILTER'
# (it is not an encoding), so check whether your pyarrow version or
# query engine exposes a dedicated bloom-filter option before relying
# on this.
pq.write_table(table, 'file.parquet',
               write_statistics=True)
```

Benefits: Fast negative lookups ("this row group definitely doesn't have X")
Interaction with PQG Schema
Same principle as #17: core spec stays clean, enhancements are optional
| File | Purpose | Size |
|---|---|---|
| `isamples_wide.parquet` | Core data | 280MB |
| `isamples_facet_summaries.parquet` | Pre-computed counts | ~10KB |
| `isamples_facet_intersections.parquet` | Cross-tab counts | ~100KB |
| `isamples_wide_denormalized.parquet` | With label columns | ~350MB |
Clients choose which files to use based on their needs.
Exploration Strategy
Phase 1: Identify Hot Queries
- What facets does the UI actually need?
- What's the current query time for each?
- Which cause the most pain?
Phase 2: Build Facet Summary Table
- Generate `facet_summaries.parquet`
- Measure dashboard load improvement
- Decide which cross-facet intersections matter
Phase 3: Evaluate Denormalization
- Prototype a `material_labels` column
- Measure size increase vs query speedup
- Decide if worth the trade-off
Phase 4: Document & Ship
- Add to PQG spec as optional enhancement files
- Update data pipeline to regenerate on refresh
- Publish alongside core parquet
Benchmark Matrix (To Fill In)
| Query | Current (scan) | With Summary Table | Notes |
|---|---|---|---|
| Source facet counts | ? ms | ? ms | |
| Material facet counts | ? ms | ? ms | |
| Source × Material cross-tab | ? ms | ? ms | |
| Sample with material labels | ? ms | ? ms | With denorm |
Success Criteria
- Dashboard facet counts load in <100ms
- Summary tables <1% size of main parquet
- Can regenerate summaries in <5 min
- Documented in PQG spec as optional enhancement files
Related
- #17 Explore geospatial query optimizations via H3 pre-computation (complementary work)
- #16 Define formal PQG Parquet Schema specification (where to document enhancement files)
- Parquet bloom filters