67 changes: 67 additions & 0 deletions .claude/agents/archivist.md
@@ -0,0 +1,67 @@
---
name: archivist
description: Metadata enrichment and curation. Use to enrich track metadata, review flagged conflicts, and curate album art quality.
model: sonnet
tools:
- Read
- Write
- Edit
- Glob
- Grep
- Bash
permissionMode: acceptEdits
skills:
- enrich
color: amber
maxTurns: 50
memory: project
---
# Archivist

Metadata enrichment and curation agent for the Crate music library.

## Role

Curate and enrich track metadata using external APIs. Run the enrichment pipeline, review flagged conflicts, and help resolve uncertain matches interactively.

## Capabilities

- **Run enrichment**: Invoke `python tools/enrich_metadata.py` with appropriate flags
- **Review queue**: Read and walk through `review_queue.json` flagged tracks with the user
- **Apply corrections**: Edit `metadata_enriched.json` to apply chosen corrections
- **Re-enrich**: Re-run enrichment after manual corrections using `--resume`

## Understanding the Pipeline

### Confidence Scoring
- **>= 0.85**: Auto-accepted — fields updated directly
- **0.50 to < 0.85**: Flagged for manual review
- **< 0.50**: Skipped — original metadata kept
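The thresholding above can be sketched as a small helper. This is an illustrative sketch, not the actual internals of `enrich_metadata.py`; the function and label names are assumptions:

```python
def classify_match(confidence: float) -> str:
    """Map a 0.0-1.0 match confidence to the pipeline's action."""
    if confidence >= 0.85:
        return "auto_accept"  # fields updated directly
    if confidence >= 0.50:
        return "review"       # added to review_queue.json
    return "skip"             # original metadata kept
```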

### Conflict Classifications
- `confirmed`: External data matches existing tags — no action needed
- `supplement`: Existing field was empty, external has data — auto-filled if confidence >= 0.50
- `likely_correction`: Multiple sources disagree with existing tag — flagged with suggested correction
- `alternative`: One source disagrees — noted but existing kept

### Artwork Selection
Album art scored 0–100 on resolution, source, type, and format. Only upgrades when new score exceeds old by > 10 points.
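A minimal sketch of that rule. Only the four criteria and the > 10-point upgrade margin come from the pipeline's documented behavior; the specific weights below are assumptions for illustration:

```python
def score_artwork(width: int, source: str, art_type: str, fmt: str) -> int:
    """Score album art 0-100. Weights are assumed, not the script's actual values."""
    score = min(width, 1500) / 1500 * 50                           # resolution
    score += {"coverartarchive": 25, "itunes": 15}.get(source, 5)  # source
    score += 15 if art_type == "front" else 5                      # type
    score += {"png": 10, "jpg": 8}.get(fmt, 3)                     # format
    return round(score)

def should_upgrade(old_score: int, new_score: int) -> bool:
    """Upgrade only when the new art beats the old by more than 10 points."""
    return new_score - old_score > 10
```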

## Process

1. Check if `metadata_base.json` exists in the metadata directory
2. Run enrichment: `python tools/enrich_metadata.py --input metadata/metadata_base.json --output metadata/`
3. Review `metadata/review_queue.json` — present each flagged item to the user
4. For each flagged track, show existing vs suggested values and let the user choose
5. Apply corrections to `metadata/metadata_enriched.json`
6. If corrections were made, offer to re-run with `--resume` to fill remaining gaps

## On Blockers

If the MusicBrainz API is unreachable, the script falls back to offline mode (copies base metadata as-is with `status: skipped`). Report this and suggest retrying later.

## Constraints

- **Respect rate limits**: Never bypass the 1 req/sec MusicBrainz limit
- **Don't auto-apply review items**: Always present flagged tracks to the user for decision
- **Keep originals**: Never delete or overwrite `metadata_base.json`
140 changes: 140 additions & 0 deletions .claude/skills/enrich.md
@@ -0,0 +1,140 @@
# Enrich — Metadata Enrichment Pipeline

## When to Use

- After adding new music to the library
- When tracks have incomplete or incorrect metadata
- To fetch album art for tracks missing artwork
- Periodically to re-enrich with improved matching

## Pipeline

```
./tools/pipeline.sh [/path/to/new/music]
```

That single command handles everything:

```
[extract] → [upload] → [enrich] → [publish]
```

| Step | What it does | When it runs |
|------|-------------|--------------|
| Extract | Scans audio files for ID3/Vorbis tags | Only with a path argument |
| Upload | Uploads new audio to S3 | Only with a path argument |
| Enrich | Queries MusicBrainz + Cover Art Archive | Always |
| Publish | Uploads artwork to S3, pushes manifest | Always (unless `--skip-publish`) |

## Common Usage

```bash
# Re-enrich entire library (idempotent — skips already-processed tracks)
./tools/pipeline.sh

# Add new music and enrich everything
./tools/pipeline.sh /path/to/new/tracks

# Preview what enrichment would do (writes dry_run_report.json)
./tools/pipeline.sh --dry-run

# Apply a previous dry run (reads cached results, no re-querying)
./tools/pipeline.sh

# Re-process everything from scratch
./tools/pipeline.sh --no-resume

# Limit to first N tracks (useful for testing)
./tools/pipeline.sh --limit 10
```

## Options

| Flag | Effect |
|------|--------|
| `--dry-run` | Preview matches, write `dry_run_report.json`, don't modify anything |
| `--skip-publish` | Enrich locally but don't push to S3 |
| `--skip-upload` | Skip uploading new audio files |
| `--skip-artwork` | Skip album art fetching |
| `--no-resume` | Re-process all tracks from scratch |
| `--limit N` | Only process first N tracks |

## How It Works

### Matching
1. Searches MusicBrainz by `artist + title`, then `artist + album`, then `title only`
2. Scores candidates (0.0–1.0) using weighted field similarity
3. Thresholds: **>= 0.85** auto-accept, **0.50 to < 0.85** flag for review, **< 0.50** skip
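The candidate scoring in step 2 could look like the sketch below. The field weights and the use of `difflib` are assumptions for illustration, not the script's actual implementation:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range 0.0-1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_candidate(track: dict, candidate: dict) -> float:
    """Weighted field similarity; fields absent on either side score zero."""
    weights = {"artist": 0.4, "title": 0.4, "album": 0.2}  # assumed weights
    score = 0.0
    for field, weight in weights.items():
        if track.get(field) and candidate.get(field):
            score += weight * similarity(track[field], candidate[field])
    return score
```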

### Dry-Run → Real Run
- `--dry-run` saves all match results to `dry_run_report.json`
- A subsequent real run loads cached results — zero API re-queries
- After applying, the report is deleted
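The handoff can be sketched as follows; the function name is hypothetical, but the consume-and-delete behavior mirrors the description above:

```python
import json
import os

def load_cached_matches(report_path):
    """Return matches from a prior --dry-run, or None if no report exists.

    Deleting the report here mirrors the documented behavior: a real run
    consumes the cached results instead of re-querying the APIs.
    """
    if not os.path.exists(report_path):
        return None
    with open(report_path) as f:
        cached = json.load(f)
    os.remove(report_path)
    return cached
```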

### Resume
- `.enrichment_state.json` tracks processed track IDs
- `--resume` (on by default) skips already-processed tracks
- When resuming, reads from `metadata_enriched.json` to preserve prior work
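A sketch of the checkpoint mechanics. The `"processed"` key is an assumed shape for `.enrichment_state.json`; verify the real file before relying on it:

```python
import json
from pathlib import Path

def load_processed_ids(state_file: Path) -> set:
    """Read the resume checkpoint; a missing file means a fresh run."""
    if state_file.exists():
        return set(json.loads(state_file.read_text()).get("processed", []))
    return set()

def mark_processed(state_file: Path, track_id: str) -> None:
    """Record a track ID so a resumed run will skip it."""
    done = load_processed_ids(state_file)
    done.add(track_id)
    state_file.write_text(json.dumps({"processed": sorted(done)}))
```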

## Output Files

| File | Purpose |
|------|---------|
| `metadata/metadata_enriched.json` | Full metadata with enrichment data per track |
| `metadata/review_queue.json` | Tracks needing human review |
| `metadata/dry_run_report.json` | Dry-run results (consumed by next real run) |
| `metadata/.enrichment_state.json` | Resume checkpoint |
| `metadata/manifest_enriched.json` | Clean manifest built during publish |
| `metadata/artwork/*_enriched.jpg` | Downloaded album art |

## Review Queue

Tracks are flagged for review when:
- Match confidence is between 0.50 and 0.85
- Multiple sources disagree with existing tags (`likely_correction`)
- Multiple high-confidence candidates disagree with each other
- Album art upgrade available when existing art is present
- Track has neither artist nor title

Use the **archivist agent** to walk through flagged tracks interactively.

## Conflict Classifications

| Classification | Meaning | Action |
|---------------|---------|--------|
| `confirmed` | External data matches existing | No change |
| `supplement` | Empty field filled | Auto-filled |
| `likely_correction` | Multiple sources disagree with tag | Flagged |
| `alternative` | One source offers different value | Noted, kept existing |
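The table above suggests a classification driven by how many sources agree. The sketch below is one plausible shape (names and inputs assumed; it also omits the confidence check that gates `supplement` auto-fill):

```python
def classify_conflict(existing, votes):
    """Classify a field conflict.

    `votes` maps each candidate value to how many sources proposed it,
    and is assumed non-empty.
    """
    top = max(votes, key=votes.get)
    if not existing:
        return "supplement"         # empty field, external has data
    if top == existing:
        return "confirmed"          # external agrees with the tag
    if votes[top] > 1:
        return "likely_correction"  # multiple sources disagree with the tag
    return "alternative"            # a single dissenting source
```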

## Individual Scripts

For fine-grained control, run scripts directly:

```bash
# Enrich only
python tools/enrich_metadata.py --input metadata/manifest.json --output metadata/ --resume

# Publish only (after manual edits to metadata_enriched.json)
python tools/publish_manifest.py --metadata-dir metadata/

# Extract only
python tools/extract_metadata.py /path/to/audio --output metadata/
```

## Rate Limits

- MusicBrainz: 1 req/sec (enforced)
- Cover Art Archive: 1 req/sec (enforced)
- Full run: ~2-3 seconds per track
- 118 tracks ≈ 4-6 minutes
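A minimal client-side throttle that enforces a 1 req/sec spacing looks like this; it is a sketch of the idea, not the pipeline's actual limiter:

```python
import time

class RateLimiter:
    """Keep successive calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        """Sleep just long enough to honor the interval, then stamp the time."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```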

## Troubleshooting

| Issue | Solution |
|-------|----------|
| "MusicBrainz API is unreachable" | Check internet; falls back to offline mode |
| Many "no match" results | Tracks may have poor/missing metadata |
| Interrupted mid-run | Just re-run — `--resume` is default |
| Want to re-process one track | Remove its ID from `.enrichment_state.json` |
| Artwork not showing in app | Check CloudFront invalidation completed |
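For the "re-process one track" row, a one-off helper like this works, assuming the state file stores processed IDs under a `"processed"` key (verify the actual shape first):

```python
import json
from pathlib import Path

def forget_track(state_file: Path, track_id: str) -> None:
    """Drop one ID from the checkpoint so the next run re-enriches it."""
    state = json.loads(state_file.read_text())
    state["processed"] = [t for t in state.get("processed", []) if t != track_id]
    state_file.write_text(json.dumps(state, indent=2))
```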
8 changes: 8 additions & 0 deletions .gitignore
@@ -159,6 +159,14 @@ fffff.at-archive/
# Artwork is in S3, not git
metadata/artwork/

# Enrichment pipeline output (regenerated by pipeline.sh)
metadata/manifest.json
metadata/metadata_enriched.json
metadata/manifest_enriched.json
metadata/review_queue.json
metadata/.enrichment_state.json
metadata/dry_run_report.json

# Local dev manifest (copy from production for testing)
www/manifest.json

14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,19 @@
# Changelog

## [0.2.0] - 2026-02-15T20:41:35-05:00

### Added
- Metadata enrichment pipeline via MusicBrainz, Cover Art Archive, and iTunes Search API
- Single entrypoint `pipeline.sh` for extract, upload, enrich, and publish steps
- Confidence-based matching with auto-accept, review, and skip thresholds
- Resume and dry-run support for idempotent re-runs
- Publish step uploads artwork to S3 and pushes enriched manifest
- Generative CSS gradient backgrounds for tracks without album artwork
- Archivist agent and enrich skill for future curation workflows

### Changed
- `batch_upload.py` accepts enriched metadata format with `--enriched` flag

## [0.1.0] - 2026-02-14T01:12:36+00:00

### Added
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.1.0
0.2.0
38 changes: 34 additions & 4 deletions tools/batch_upload.py
@@ -20,6 +20,7 @@
AWS_PROFILE = os.environ.get('AWS_PROFILE', 'default')
TRACKS_BUCKET = os.environ.get('TRACKS_BUCKET', '')
METADATA_FILE = 'metadata_base.json'
ENRICHED_METADATA_FILE = 'metadata_enriched.json'
MANIFEST_FILE = 'manifest.json'


@@ -59,11 +60,35 @@ def get_content_type(filepath: Path) -> str:
return types.get(ext, 'application/octet-stream')


def load_metadata(metadata_dir: Path) -> dict:
"""Load metadata_base.json."""
def load_metadata(metadata_dir: Path, enriched: bool = False) -> dict:
"""Load metadata JSON. Prefers enriched if --enriched flag is set.

Normalizes tracks to dict format regardless of input shape (list or dict).
"""
if enriched:
enriched_file = metadata_dir / ENRICHED_METADATA_FILE
if enriched_file.exists():
print(f"Using enriched metadata: {enriched_file}")
with open(enriched_file) as f:
data = json.load(f)
return _normalize_tracks(data)
print("Enriched metadata not found, falling back to base")
metadata_file = metadata_dir / METADATA_FILE
with open(metadata_file) as f:
return json.load(f)
data = json.load(f)
return _normalize_tracks(data)


def _normalize_tracks(data: dict) -> dict:
"""Ensure tracks is a dict keyed by path/id (handles manifest list format)."""
tracks = data.get('tracks', {})
if isinstance(tracks, list):
tracks_dict = {}
for track in tracks:
key = track.get('path') or track.get('original_path') or track['id']
tracks_dict[key] = track
data['tracks'] = tracks_dict
return data


def save_metadata(metadata_dir: Path, metadata: dict):
@@ -145,12 +170,17 @@ def main():
action='store_true',
help='Skip uploading artwork files'
)
parser.add_argument(
'--enriched',
action='store_true',
help='Use metadata_enriched.json instead of metadata_base.json'
)

args = parser.parse_args()

# Load metadata
print(f"Loading metadata from {args.metadata_dir}...")
metadata = load_metadata(args.metadata_dir)
metadata = load_metadata(args.metadata_dir, enriched=args.enriched)

total_tracks = len(metadata['tracks'])
print(f"Found {total_tracks} tracks in metadata")