Merged
23 commits
fd66908
docs: add CLAUDE.md — AI-assistant guide for this repo
sanaro99 May 24, 2026
4c6f0dd
feat(audio): backbone — extractor + ASR + prosody + emotion + analyzer
sanaro99 May 24, 2026
8331b83
feat(pipeline): wire AudioIngestStage + AudioAnalyzeStage
sanaro99 May 24, 2026
c492af2
test(audio): coverage for Phase 2 backbone
sanaro99 May 24, 2026
91cd642
docs: mark Phase 2 — Audio backbone as done
sanaro99 May 24, 2026
e005618
feat(interpreter): semantic chunker (VAD + clause boundaries)
sanaro99 May 24, 2026
6992519
feat(interpreter): persona prompt + few-shots (PROMPT v1)
sanaro99 May 24, 2026
a48bc36
feat(interpreter): planner with JSON parsing + validation
sanaro99 May 24, 2026
024a9d6
feat(pipeline): wire SemanticChunkStage + InterpreterPlanStage
sanaro99 May 24, 2026
520f3a6
test(interpreter): coverage for chunker + planner
sanaro99 May 24, 2026
9863d0a
docs: mark Phase 3 — Interpreter brain as done
sanaro99 May 24, 2026
05d13fd
docs(plan): pivot Phase 4/5 to phrase-level corpus retrieval
sanaro99 May 25, 2026
28bac89
docs: tighten retrieval invariant for the phrase-level pivot
sanaro99 May 25, 2026
d4f82f8
feat(schema): bump AvatarRenderPlan to v5.1 — retrieval metadata
sanaro99 May 25, 2026
c1aff00
chore(env): Phase 4 prerequisites — deps, config, paths, script logging
sanaro99 May 26, 2026
665b2f4
feat(avatar): mediapipe pose extractor + vrm retargeter
sanaro99 May 26, 2026
be1a969
feat(avatar): RetrievalIndex (FAISS + sentence-transformers) + WLASL …
sanaro99 May 26, 2026
20e8061
feat(scripts): fetch_openasl.py — TSV manifest → trimmed clips + JSON…
sanaro99 May 26, 2026
0d356d6
feat(scripts): build_corpus_index.py — embeddings + per-clip poses
sanaro99 May 26, 2026
123bd98
feat(scripts): retrieval_eval.py + 10-chunk fixture — week-2 quality …
sanaro99 May 26, 2026
15e9b77
feat(scripts): build_pose_library.py — WLASL top-500 fallback
sanaro99 May 26, 2026
7a8f9b8
test(avatar): vrm_retarget + pose_library + retrieval coverage
sanaro99 May 26, 2026
7a31b2e
docs(business): consolidate into single plan around committed ASL app…
sanaro99 May 27, 2026
11 changes: 11 additions & 0 deletions .gitignore
@@ -64,6 +64,17 @@ assets/final/*.mp4
assets/words/
assets/chained/

# Phase 4 corpus — video bytes + per-clip pose JSON are huge; we only
# track the manifest JSON and the FAISS index file.
assets/corpus/openasl/
assets/corpus/openasl_poses/
assets/corpus/openasl_embeddings.npy
assets/corpus/aslcitizen/
assets/corpus/aslcitizen_poses/
assets/corpus/aslcitizen_embeddings.npy
# Phase 4 WLASL pose-library fallback output
assets/pose_library/

# Pipeline stage disk cache (regenerated on first run)
data/cache/

196 changes: 196 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,196 @@
# CLAUDE.md — Working in this repository

This file gives any AI assistant (Claude Code, Cursor, etc.) the
minimum context needed to make good edits here. **Read it once per
session, then defer to the canonical docs it points to.**

---

## What this project is

GenASL is an AI pipeline that produces a **3D ASL interpreter avatar**
overlay for YouTube videos. It mimics how a human interpreter works:
listen → analyse emotion + prosody → decide signing strategy with an
LLM → drive a Ready Player Me VRM avatar in the browser via three.js.

> **Status:** prototype in build-out. Phases 1–3 (bootstrap, audio
> backbone, interpreter brain) are shipped; Phases 4–7 are pending and
> have detailed plans under `docs/plan/`.

---

## Read these before editing

In this order:

1. **[`README.md`](README.md)** — what the project does and how to run it.
2. **[`docs/architecture-overview.md`](docs/architecture-overview.md)** — canonical technical reference.
3. **[`docs/plan/README.md`](docs/plan/README.md)** — implementation roadmap; if you're working a specific phase, also read the matching `docs/plan/phase-N-*.md`.
4. **[`business/feasibility-study/01-technology-feasibility.md`](business/feasibility-study/01-technology-feasibility.md)** — why this architecture and not the others.

If those four contradict this file, **the docs win**; flag the
inconsistency and ask before reconciling.

---

## Non-negotiable invariants

These come from the feasibility study and the user's explicit instructions.
Violating them invalidates the work.

1. **No word-level ASL output.** Word-level gloss is a *valid internal
representation* inside `AslPlanSegment.sign_sequence`, but it is
**never surfaced to the user**. The Chrome extension never shows
gloss text. We do not ship the old WLASL clip-stitching pipeline.

2. **Phrase-level retrieval-augmented, not pure generative.** Tightened
on 2026-05-24: every output segment's motion comes from a Deaf-signer
recording, *and the default tier is a continuous clip retrieved at
phrase level* from `assets/corpus/openasl/` (with ASL Citizen as a
lexical secondary). Per-gloss WLASL stitching from
`assets/pose_library/` is the last-resort fallback, always tagged
`fidelity="stitched"` (or `"degraded"` if > 50% of glosses miss).
AI orchestrates known-good primitives; generative steps only fill
*transitions* and *NMM augmentation on top of* the retrieved face.
If a phase implementation makes this invariant un-verifiable after
the fact, the phase plan is wrong — flag it before shipping.

3. **Platform-agnostic and platform-pays.** The B2B monetization model
is platforms paying for the SDK, not end users paying for access.
Do not add consumer paywalls or restrict accessibility behind a
user-tier gate. Free for Deaf-led orgs is non-negotiable.

4. **Per-stage disk cache or it doesn't ship.** Every pipeline stage
subclasses `Stage[InT, OutT]` from `src/pipeline/stages/base.py`
and implements a deterministic `fingerprint()`. Reruns must be
JSON-read fast.

5. **Pydantic models, not dicts, between stages.** The schema in
`src/pipeline/models.py` is authoritative; new fields land there.
Bump `schema_version` only on a breaking change to `AvatarRenderPlan`
(current target: `5.1` once Phase 5 lands with the retrieval
metadata fields).

6. **Market expansion, not substitution.** GenASL serves the underserved — content that today has no ASL at all because human interpretation isn't economically viable for it. Human interpreters remain the gold standard for live, high-stakes, nuanced settings, and broader ambient ASL exposure created by GenASL increases demand and visibility for their work. Public-facing copy must reflect this: we expand the pie, we don't take a slice from interpreters.
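
The fidelity rule in invariant 2 can be sketched as a small helper. This is an illustration only: the function name, the argument shapes, and the choice to reject an empty gloss list are assumptions, not code from this repository. Only the `"stitched"` / `"degraded"` tags and the 50% threshold come from the text above.

```python
from typing import Collection, Sequence


def stitched_fidelity(glosses: Sequence[str], pose_library: Collection[str]) -> str:
    """Tag a last-resort stitched segment (hypothetical helper).

    Returns "degraded" when more than 50% of the glosses are missing
    from the pose library, otherwise "stitched".
    """
    if not glosses:
        # Assumption: an empty gloss sequence is a caller error.
        raise ValueError("glosses must not be empty")
    missing = sum(1 for gloss in glosses if gloss not in pose_library)
    # Integer comparison avoids float rounding at the exact 50% boundary.
    return "degraded" if 2 * missing > len(glosses) else "stitched"
```

Exactly half of the glosses missing still counts as `"stitched"`, because the invariant says "> 50%", not "at least 50%".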

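Invariant 4 can be pictured with a minimal sketch of a cached stage. The real base class lives in `src/pipeline/stages/base.py` and may look quite different; `run()`, `compute()`, `cache_dir`, and `WordCountStage` below are hypothetical names chosen for illustration. Only `Stage[InT, OutT]`, `fingerprint()`, and the JSON-on-disk cache idea come from the invariant itself.

```python
import hashlib
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Generic, TypeVar

InT = TypeVar("InT")
OutT = TypeVar("OutT")


class Stage(ABC, Generic[InT, OutT]):
    """Hypothetical stand-in for the base class in src/pipeline/stages/base.py."""

    name: str = "stage"

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir

    @abstractmethod
    def fingerprint(self, inp: InT) -> str:
        """Return a deterministic key: same input and config, same string."""

    @abstractmethod
    def compute(self, inp: InT) -> OutT:
        """Do the expensive work; must return JSON-serialisable data here."""

    def run(self, inp: InT) -> OutT:
        path = self.cache_dir / self.name / f"{self.fingerprint(inp)}.json"
        if path.exists():
            # Cache hit: a rerun costs one JSON read.
            return json.loads(path.read_text())
        out = self.compute(inp)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(out))
        return out


class WordCountStage(Stage[str, dict]):
    """Toy stage used only to exercise the cache."""

    name = "word_count"
    calls = 0  # counts how often compute() actually ran

    def fingerprint(self, inp: str) -> str:
        payload = json.dumps({"stage": self.name, "input": inp}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

    def compute(self, inp: str) -> dict:
        self.calls += 1
        return {"words": len(inp.split())}


with tempfile.TemporaryDirectory() as tmp:
    stage = WordCountStage(Path(tmp))
    first = stage.run("listen then sign")
    second = stage.run("listen then sign")  # served from the JSON cache
```

The fingerprint hashes a sorted JSON payload rather than calling `hash()`, since Python randomises string hashes per process and the key has to be stable across runs.
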
---

## Repository layout (essential bits only)

```
src/
├── api/server.py # /health, /asl/avatar
├── audio/
│ ├── source_video.py # yt-dlp source MP4 (Stage 1 input)
│ └── ... # Phase 2 lands extractor, asr, prosody, emotion, analyzer
├── interpreter/ # Phase 3 — chunker, prompt, planner
├── avatar/ # Phase 4–5 — retrieval, pose extractor, vrm retarget,
│ # motion synth, NMM, vrm schema
├── core/
│ ├── config.py # Pydantic Settings; get_settings() singleton
│ ├── paths.py # all filesystem paths
│ ├── ffmpeg.py # find_ffmpeg / find_ffprobe
│ └── logging.py
├── llm/providers/ # Ollama / Gemini / OpenAI; one chat() method
├── pipeline/
│ ├── models.py # v5.0 Pydantic schema (authoritative)
│ ├── pipeline_avatar.py # InterpreterAvatarPipeline orchestrator
│ ├── run_pipeline.py # CLI entry
│ ├── io.py # save_avatar_plan + print_summary
│ └── stages/
│ ├── base.py # Stage[InT, OutT] ABC + cache
│ └── ... # concrete stages land per phase plans
chrome-extension/ # MV3; Phase 6 wires three.js + VRM
docs/{architecture-overview, plan/, ...}
business/{README, feasibility-study/}
```

---

## Common commands

```bash
# Tests
pytest tests/ -v

# Run the pipeline CLI on a YouTube video ID
python -m src.pipeline.run_pipeline 31y2Bq1RYQA

# Run the local API server
python -m src.api.server # http://127.0.0.1:8794
curl http://127.0.0.1:8794/health
```

`config.yaml` (root) overrides Pydantic defaults from `src/core/config.py`.
API keys (`GEMINI_API_KEY`, `OPENAI_API_KEY`) come from the environment,
never from config.

---

## Conventions

- **Stages live in `src/pipeline/stages/<name>.py`**, one class per
file, `name` class-var = snake_case matching the filename.
- **Domain logic** (the heavy lifting a stage delegates to) goes under
`src/{audio,interpreter,avatar}/` so stages stay thin and testable.
- **Tests** mirror module paths: `tests/test_<module>.py`. New stage
tests follow `tests/test_stage_cache.py`. Integration smoke tests
follow `tests/test_avatar_pipeline_bootstrap.py`.
- **LLM access** goes through `src.llm.providers.make_provider`.
Never import `openai` directly outside the providers dir.
- **Paths** import from `src.core.paths`, never re-derive with
`Path(__file__).parents[N]`.
- **Heavy library imports** (faster-whisper, librosa, mediapipe) are
lazy — inside functions, not at module top-level — so importing a
module is free for tests that don't exercise it.
- **One-line module docstrings** on the first line stating purpose
and phase of origin.
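
The lazy-import convention above might look like the sketch below. To keep the example runnable without extra installs, the standard-library `statistics` module stands in for a heavy dependency such as librosa; the function name and its argument are hypothetical.

```python
"""Prosody helpers: illustrative module, not part of this repository."""

from typing import Sequence


def mean_pitch_hz(pitch_track: Sequence[float]) -> float:
    """Average a pitch track given in Hz (hypothetical helper)."""
    # Lazy import: the dependency loads on first call, not when this
    # module is imported, so unrelated tests never pay for it.
    import statistics  # stand-in for a heavy library such as librosa

    return statistics.fmean(pitch_track)
```

Python caches modules in `sys.modules`, so repeated calls pay the import cost only once.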

---

## What NOT to do

- ❌ Resurrect the gloss pipeline. v4.0 schema, `Pipeline` class,
`compose_pip`, `transcript_ingestion`, and the WLASL clip-chaining
code are gone deliberately. Git history preserves them; don't
cherry-pick back into the active tree.
- ❌ Build a consumer payment tier or premium toggle. Platforms pay.
- ❌ Add a `mode` toggle returning to word-level output. There is one
pipeline mode now.
- ❌ Ship a pure-neural sign synthesiser (SignDiff/T2S-GPT style)
without the retrieval anchor. The corpus is the moat.
- ❌ Auto-install dependencies, modify `cookies.txt`, or commit secrets.
`cookies.txt` is tracked but session-refresh diffs to it should be
reverted, not pushed.
- ❌ Edit `src/pipeline/models.py` shapes without bumping
`schema_version` if it would break the extension's JSON consumer.
- ❌ Skip the `fingerprint()` on a new stage. "It's just a prototype"
is not an excuse; cache invariants are load-bearing.

---

## When something is unclear

1. Check `docs/architecture-overview.md` — it's the canonical reference.
2. Check the matching `docs/plan/phase-N-*.md` for the phase you're in.
3. Check the feasibility study under `business/feasibility-study/`
for the *why*.
4. If still unclear, leave a `# TODO(phaseN-clarify):` comment and a
brief note in the phase doc's **Open questions** section. Ship the
rest; don't block.

---

## Phase status (mirror of `docs/plan/README.md`)

| Phase | Status |
|-------|--------|
| 1 — Bootstrap | **Done** |
| 2 — Audio backbone | **Done** |
| 3 — Interpreter brain | **Done** |
| 4 — Corpus retrieval (OpenASL + ASL Citizen; WLASL fallback) | Pending |
| 5 — Motion synthesis (retrieval-driven) + NMM | Pending |
| 6 — Chrome extension VRM | Pending |
| 7 — API + end-to-end | Pending |

When you ship a phase, update **both** this table and
`docs/plan/README.md`.
4 changes: 2 additions & 2 deletions README.md
@@ -153,8 +153,8 @@ that any contributor (human or AI) can pick up a phase cold:
| Phase | What it delivers | Status |
|---|---|---|
| [1 — Bootstrap](docs/plan/phase-1-bootstrap.md) | Config sections, v5.0 schema, skeleton, mode toggle | **Done** |
-| [2 — Audio backbone](docs/plan/phase-2-audio-backbone.md) | Whisper + librosa + emotion → `AudioAnalysis` | Pending |
-| [3 — Interpreter brain](docs/plan/phase-3-interpreter-brain.md) | LLM persona producing `AslPlanSegment` | Pending |
+| [2 — Audio backbone](docs/plan/phase-2-audio-backbone.md) | Whisper + librosa + emotion → `AudioAnalysis` | **Done** |
+| [3 — Interpreter brain](docs/plan/phase-3-interpreter-brain.md) | LLM persona producing `AslPlanSegment` | **Done** |
| [4 — Pose library](docs/plan/phase-4-pose-library.md) | Mediapipe → per-gloss joint-angle JSON | Pending |
| [5 — Motion synthesis + NMM](docs/plan/phase-5-motion-synthesis.md) | Retrieve + spline + prosody-driven NMM | Pending |
| [6 — Chrome extension VRM](docs/plan/phase-6-chrome-extension-vrm.md) | three.js + @pixiv/three-vrm in PiP | Pending |