
Refactor/pipeline stages #2

Merged
sanaro99 merged 23 commits into master from refactor/pipeline-stages
Jun 7, 2026

sanaro99 (Owner) commented Jun 7, 2026

No description provided.

sanaro99 and others added 23 commits May 23, 2026 20:12
Concise per-session brief for AI assistants (Claude Code, Cursor):
canonical-docs pointer, non-negotiable invariants (no word-level
output, retrieval-augmented, platform-pays, per-stage cache,
augmentation-not-replacement), repo layout, conventions, and a phase
status table that mirrors docs/plan/README.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase 2 of docs/plan/. Five domain modules under src/audio/, each
self-contained and lazy-importing its heavy dep so importing the
module is free in tests that don't use that path:

* extractor.py: ffmpeg rip to 16 kHz mono WAV with mtime-aware caching
  under data/audio_cache/<video_id>.wav.
* asr.py: faster-whisper wrapper with thread-safe model singleton +
  word-level WordTiming output. VAD filter on; lazy import.
* prosody.py: librosa pyin + RMS at 50 ms stride → ProsodyFrame list
  with normalized RMS (99th-percentile reference) and voiced flag.
* emotion.py: LLM-from-text-and-prosody classifier (no second model
  on CPU). 7 labels (neutral|happy|sad|angry|anxious|questioning|
  emphatic), code-fence-tolerant JSON parsing, intensity clamped 0..1,
  defaults to neutral on malformed/empty.
* analyzer.py: ThreadPoolExecutor fuses ASR + prosody in parallel
  (CPU vs light work), then emotion (depends on both) into one
  AudioAnalysis.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
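
The emotion.py bullet above describes code-fence-tolerant JSON parsing, an intensity clamp to 0..1, and a neutral default on malformed or empty replies. A minimal sketch of that parsing step follows. The JSON keys `label` and `intensity`, the function name `parse_emotion`, and the 0.0 intensity used for the fallback are assumptions for illustration, not taken from the repository.

```python
import json
import re

# The seven labels listed in the commit message.
LABELS = {"neutral", "happy", "sad", "angry", "anxious", "questioning", "emphatic"}

# Matches a reply wrapped in a Markdown code fence, with or without a language tag.
_FENCE = re.compile(r"^```[a-zA-Z]*\s*\n?(.*?)\n?```$", re.DOTALL)


def parse_emotion(raw: str) -> tuple[str, float]:
    """Parse an LLM reply into (label, intensity); hypothetical helper."""
    text = raw.strip()
    match = _FENCE.match(text)
    if match:
        text = match.group(1).strip()
    try:
        data = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return ("neutral", 0.0)  # malformed or empty reply
    if not isinstance(data, dict):
        return ("neutral", 0.0)
    label = str(data.get("label", "")).lower()
    if label not in LABELS:
        label = "neutral"  # unknown label falls back to neutral
    try:
        intensity = float(data.get("intensity", 0.0))
    except (TypeError, ValueError):
        intensity = 0.0
    return (label, min(1.0, max(0.0, intensity)))  # clamp to 0..1
```

With these assumptions, an unknown label with an oversized intensity comes back as neutral at 1.0, which matches the "out-of-range clamp to neutral/1.0" case named in the test commit.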
* AudioIngestStage: download source video via src.audio.source_video,
  rip audio via src.audio.extractor, emit AudioIngestOutput with a
  repo-relative WAV path. Fingerprint covers video_id + sample rate.
* AudioAnalyzeStage: delegate to src.audio.analyzer.analyze;
  fingerprint covers audio_path + duration + every relevant audio setting
  (asr_model, compute_type, language, frame strides) + the LLM
  provider/model — flipping any of those invalidates this stage's
  cache without disturbing the upstream ingest cache.
* pipeline_avatar.py: instantiate both stages; add run_audio_only()
  helper that returns the typed AudioAnalysis so Phase 3 work can
  build on top without depending on later phases. Full run() still
  raises NotImplementedError until Phase 5 lands motion synthesis.
* stages/__init__.py: re-export the two new stages.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
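
The fingerprint behaviour described above (changing an analyze setting invalidates the analyze cache but leaves the ingest cache alone) can be sketched as below. `stable_hash_sketch` is a hypothetical stand-in for the repository's `stable_hash`, whose real implementation is not shown in this PR page; the settings keys are likewise illustrative.

```python
import hashlib
import json


def stable_hash_sketch(parts: list) -> str:
    """Deterministic short hash of a JSON-serialisable list (assumed shape)."""
    payload = json.dumps(parts, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def ingest_fingerprint(video_id: str, sample_rate: int) -> str:
    # Ingest depends only on the video and the sample rate.
    return stable_hash_sketch(["audio_ingest", video_id, sample_rate])


def analyze_fingerprint(audio_path: str, duration_ms: int,
                        audio_settings: dict, llm_provider: str, llm_model: str) -> str:
    # Analyze folds in every setting that can change its output.
    return stable_hash_sketch(
        ["audio_analyze", audio_path, duration_ms, audio_settings, llm_provider, llm_model]
    )


settings_a = {"asr_model": "small", "compute_type": "int8", "language": "en"}
settings_b = {**settings_a, "asr_model": "medium"}

ingest_key = ingest_fingerprint("abc123", 16000)
key_a = analyze_fingerprint("data/audio_cache/abc123.wav", 60000, settings_a, "p", "m")
key_b = analyze_fingerprint("data/audio_cache/abc123.wav", 60000, settings_b, "p", "m")
```

Here `key_a` and `key_b` differ because `asr_model` changed, while `ingest_key` is computed from inputs that did not change and so stays valid.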
10 new tests covering:
* AudioIngestStage cache hit/miss behaviour with mocked download +
  extract (no network, no ffmpeg required to run the test).
* AudioAnalyzeStage fingerprint stability + asr_model-changes-cache-key
  invariant.
* Emotion classifier with FakeProvider: valid response, out-of-range
  clamp to neutral/1.0, malformed JSON falls back to neutral,
  code-fenced JSON parses, silent windows skip the provider call.
* Prosody extractor on a synthetic 440 Hz sine (skipped when librosa
  isn't installed; passes on environments that have it).
* faster-whisper smoke test (skipped when the dep isn't installed;
  marked slow).

requirements.txt: promote Phase 2 deps from commented placeholders to
real entries (faster-whisper, librosa, soundfile, numpy).

pytest.ini: register the 'slow' marker so the suite runs clean with
no warnings.

29 passing + 2 skipped (correctly guarded behind importorskip).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase status boards in README.md, docs/plan/README.md, and CLAUDE.md
now reflect Phase 2 completion. Phase 3 (interpreter brain) is next
and consumes AudioAnalysis via run_audio_only() on the orchestrator.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Walks AudioAnalysis.asr_words and emits InterpreterChunks on either a
hard boundary (silence >= vad_min_silence_ms) or a soft boundary
(sentence punctuation) once the running text exceeds max_chunk_chars.
Each chunk carries dominant emotion, F0 range, RMS mean, speaking rate,
and an end-of-chunk pause flag for the interpreter LLM.
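
One possible reading of the boundary rule above, as a sketch: a hard boundary always closes the chunk, and a soft boundary closes it only when the running text is already longer than max_chunk_chars. The `Word` dataclass, its `start_ms`/`end_ms` fields, and the default thresholds are assumptions; min_chunk_chars and the per-chunk prosody summaries are left out.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for the ASR word timing type."""
    word: str
    start_ms: int
    end_ms: int


def chunk_words(words: list[Word], vad_min_silence_ms: int = 400,
                max_chunk_chars: int = 40) -> list[str]:
    chunks: list[str] = []
    current: list[Word] = []
    for i, w in enumerate(words):
        current.append(w)
        text = " ".join(x.word for x in current)
        nxt = words[i + 1] if i + 1 < len(words) else None
        # Hard boundary: the gap before the next word is a long enough silence.
        hard = nxt is not None and nxt.start_ms - w.end_ms >= vad_min_silence_ms
        # Soft boundary: sentence punctuation, once the text is over the limit.
        soft = w.word.endswith((".", "?", "!")) and len(text) > max_chunk_chars
        if hard or soft or nxt is None:
            chunks.append(text)
            current = []
    return chunks


paused = [Word("Hello", 0, 300), Word("there.", 320, 600),
          Word("How", 1100, 1300), Word("are", 1320, 1500), Word("you?", 1520, 1800)]
print(chunk_words(paused))  # ['Hello there.', 'How are you?']
```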
System prompt fixes JSON-only output, the seven NMM keys, and the
yes/no vs wh-question vs negation NMM rules. Few-shots cover wh-Q,
yes/no Q, negation, emphasis, neutral declarative, and a role-shift
quote. PROMPT_VERSION participates in the interpreter stage cache
fingerprint so prompt edits invalidate just that stage.

plan_chunks() calls the configured LLMProvider once per chunk, strips
any ```json fences, retries once on parse failure, and falls back to a
minimal AslPlanSegment if the model still returns junk. Sign tokens
are normalised (UPPERCASE ASCII alnum/underscore), NMM intents clamped
to [0, 1], role shifts validated. Gloss filtering against the pose
library is deferred to Phase 5.
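
The parse, retry, and fallback loop described above could look roughly like this. It is a sketch under assumptions: the reply keys `signs` and `nmm`, the dict-shaped result standing in for AslPlanSegment, and the choice to drop (rather than replace) characters outside A-Z, 0-9, and underscore are all illustrative.

```python
import json
import re
from typing import Callable


def normalise_gloss(token: str) -> str:
    # Uppercase, then keep only ASCII letters, digits, and underscores.
    return re.sub(r"[^A-Z0-9_]", "", token.upper())


def clamp01(value) -> float:
    try:
        return min(1.0, max(0.0, float(value)))
    except (TypeError, ValueError):
        return 0.0


def strip_fences(raw: str) -> str:
    text = raw.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit("```", 1)[0]
    return text.strip()


def plan_chunk(chunk_text: str, call_llm: Callable[[str], str],
               max_attempts: int = 2) -> dict:
    """One call plus one retry, then a minimal fallback segment."""
    for _ in range(max_attempts):
        try:
            data = json.loads(strip_fences(call_llm(chunk_text)))
            signs = [normalise_gloss(s) for s in data["signs"]]
            nmm = {k: clamp01(v) for k, v in data.get("nmm", {}).items()}
            return {"signs": [s for s in signs if s], "nmm": nmm, "fallback": False}
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
            continue  # malformed reply: try again if attempts remain
    return {"signs": [], "nmm": {}, "fallback": True}
```

Passing the provider as a plain callable keeps the sketch testable with a fake, in the same spirit as the FakeProvider mentioned in the test commits.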
Two cacheable stages around the Phase 3 domain modules. The semantic
chunk fingerprint covers max/min chunk chars and the VAD silence
threshold; the interpreter fingerprint folds in PROMPT_VERSION,
provider+model, and chunk text — so re-runs are JSON reads and prompt
iteration invalidates exactly the interpreter cache. Pipeline.run()
still raises until Phase 5 ships motion synthesis.

Chunker: respects max_chunk_chars on long pause-less runs, splits on a
hard silence boundary. Planner: one provider call per chunk, retry on
malformed JSON, fallback when both attempts fail, NMM clamped to
[0, 1], code-fence stripping. InterpreterPlanStage: fingerprint folds
in PROMPT_VERSION and chunk text; second .run() with the same input
hits the disk cache and skips the provider entirely.
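
The cache-hit behaviour tested above (a second run with the same input reads from disk and never reaches the provider) can be illustrated with a small stand-in stage. `CachedStageSketch`, its JSON-file cache layout, and the input keys are hypothetical; the repository's stage base class is not shown here.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class CachedStageSketch:
    """Hypothetical stage: fingerprint the input, reuse the JSON on disk if present."""

    def __init__(self, cache_dir: Path, process: Callable[[dict], dict]):
        self.cache_dir = cache_dir
        self.process = process

    def fingerprint(self, inp: dict) -> str:
        payload = json.dumps(inp, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

    def run(self, inp: dict) -> dict:
        path = self.cache_dir / f"{self.fingerprint(inp)}.json"
        if path.is_file():
            # Cache hit: a plain JSON read, the expensive call is skipped.
            return json.loads(path.read_text(encoding="utf-8"))
        out = self.process(inp)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(out), encoding="utf-8")
        return out


provider_calls = []


def fake_plan(inp: dict) -> dict:
    provider_calls.append(inp["chunk_text"])
    return {"signs": ["HELLO"]}


stage = CachedStageSketch(Path(tempfile.mkdtemp()) / "interpreter_plan", fake_plan)
first = stage.run({"prompt_version": "v1", "chunk_text": "hello"})
second = stage.run({"prompt_version": "v1", "chunk_text": "hello"})   # disk hit
third = stage.run({"prompt_version": "v2", "chunk_text": "hello"})    # new key
```

Because the prompt version is part of the hashed input, bumping it produces a new key and a fresh provider call, while the unchanged input is served from disk.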
The per-gloss WLASL stitching path is structurally Signed English with
NMM dressing, not ASL. The unit of retrieval moves from a gloss
keyframe to a continuous Deaf-signed clip — OpenASL as primary index,
ASL Citizen as a lexical secondary, WLASL kept only as the last-resort
stitching fallback. Each output segment is tagged with a fidelity tier
so the consumer can render a badge in dev mode.

Adds docs/plan/phase-4-corpus-retrieval.md as the new spec; the old
phase-4-pose-library.md is retained with a superseded banner because
its content still describes the fallback path correctly. Phase 5 is
rewritten end-to-end for the tiered retrieval + retrieved-face NMM
behavior.

The previous wording allowed per-gloss WLASL stitching to satisfy the
retrieval invariant, which is exactly the loophole that produced
Signed English at Phase 5. Default tier is now a continuous Deaf-signed
clip retrieved at phrase level; WLASL stitching is permitted only as
the tagged fallback. Adds a retrieval config section, OpenASL/ASL
Citizen/WLASL tier descriptions, and updates the v5.1 schema sketch
+ flow diagram.

AslPlanSegment gains four optional fields the Phase 5 motion-synth
stage will populate: query_text (phrase-level retrieval query, can be
emitted by the interpreter brain or composed at synth time),
retrieved_clip_id, retrieval_similarity, and a fidelity tier
("retrieval" | "lexical" | "stitched" | "degraded"). MotionSynthOutput
gains annotated_segments so the timeline stage can carry the tier into
the final AvatarRenderPlan.

All Phase-3-and-earlier code keeps working because the new fields are
optional with safe defaults. Test coverage: the bootstrap roundtrip
test now asserts the v5.1 string and the retrieval fields, plus a new
back-compat test confirms a pre-Phase-5 segment still parses.
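
To make the back-compat claim concrete, here is a sketch of optional retrieval fields with safe defaults. The repository models use pydantic; this version uses a standard-library dataclass to stay dependency-free. The field names `signs` and `fidelity_tier` and the `None` defaults are assumptions, since the PR page names the tier values but not the exact attribute.

```python
from dataclasses import dataclass, field, fields
from typing import Literal, Optional

FidelityTier = Literal["retrieval", "lexical", "stitched", "degraded"]


@dataclass
class AslPlanSegmentSketch:
    """Hypothetical stand-in for AslPlanSegment with the v5.1 additions."""
    signs: list[str] = field(default_factory=list)
    # New in v5.1: all optional, so older payloads still load.
    query_text: Optional[str] = None
    retrieved_clip_id: Optional[str] = None
    retrieval_similarity: Optional[float] = None
    fidelity_tier: Optional[FidelityTier] = None

    @classmethod
    def from_dict(cls, data: dict) -> "AslPlanSegmentSketch":
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


# A pre-Phase-5 payload has none of the retrieval fields.
old = AslPlanSegmentSketch.from_dict({"signs": ["HELLO"]})
new = AslPlanSegmentSketch.from_dict({
    "signs": ["HELLO"],
    "query_text": "hello there",
    "retrieved_clip_id": "clip_001",
    "retrieval_similarity": 0.82,
    "fidelity_tier": "retrieval",
})
```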
* requirements.txt — uncomment mediapipe/opencv and add sentence-
  transformers, faiss-cpu, scipy, tqdm. These are needed by the corpus
  fetch + index build scripts and the runtime retrieval API.
* .gitignore — exclude corpus video bytes, per-clip pose JSON, and the
  WLASL pose-library output. Manifests + the FAISS index file remain
  tracked so a fresh clone gets the index for free.
* src/core/config.py — RetrievalSettings (embedding model, phrase /
  lexical similarity thresholds, max clip duration, primary/secondary
  corpus names) and a corpus_root path entry.
* src/core/paths.py — corpus_{clip,pose}_dir, corpus_manifest_path,
  corpus_index_path, corpus_embeddings_path helpers so every Phase 4
  module agrees on layout.
* src/core/logging.py — setup_script_logging(name) for the long-running
  offline scripts: each invocation writes a timestamped log file under
  logs/ so the user can tail it during a multi-hour run without one
  script clobbering another's output.
* config.yaml — exposes the new retrieval section with documented
  defaults.
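
The setup_script_logging(name) bullet above describes one timestamped log file per invocation under logs/. A minimal sketch using only the standard library is shown below; the function name `setup_script_logging_sketch`, its parameters, the log format, and the returned path are assumptions rather than the repository's actual signature.

```python
import logging
from datetime import datetime
from pathlib import Path


def setup_script_logging_sketch(name: str, log_dir: Path = Path("logs"),
                                console_level: str = "INFO") -> Path:
    """Attach a per-invocation file handler plus a console handler."""
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    log_path = log_dir / f"{name}-{stamp}.log"  # e.g. fetch_openasl-20260607-093500.log

    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)   # the file keeps per-row detail
    file_handler.setFormatter(fmt)

    console = logging.StreamHandler()
    console.setLevel(console_level)        # the console stays at INFO by default
    console.setFormatter(fmt)

    root.addHandler(file_handler)
    root.addHandler(console)
    return log_path
```

Because the timestamp is part of the file name, two scripts started at different times write to different files, which is the "no clobbering" property the commit message asks for.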
vrm_retarget.py — direct-mapping landmarks → VRM humanoid bone
quaternions. Mediapipe pose_world_landmarks (Y-down, hip-centered)
get flipped to VRM's Y-up frame, then each bone's quaternion is the
shortest-arc rotation aligning its rest-pose direction with the
relevant landmark-pair vector. Includes the full VRM humanoid bone
list (core + 30 finger bones) so the dict is gap-free; missing inputs
fall through to identity. Finger joints get segment-to-segment
alignments rather than full IK — adequate for the prototype's
visible hand articulation; library-based retargeter is a v1.1 task.
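
The shortest-arc rotation mentioned above has a compact closed form: for unit vectors a and b, the quaternion (a × b, 1 + a · b), normalised, rotates a onto b. The sketch below implements that formula with the two edge cases the tests in this PR name (equal vectors and a 180-degree flip). The (x, y, z, w) component order and returning identity for zero-length input are assumptions.

```python
import math

Vec3 = tuple[float, float, float]
Quat = tuple[float, float, float, float]  # (x, y, z, w), assumed order
IDENTITY: Quat = (0.0, 0.0, 0.0, 1.0)


def _dot(u: Vec3, v: Vec3) -> float:
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u: Vec3, v: Vec3) -> Vec3:
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _unit(v: Vec3):
    n = math.sqrt(_dot(v, v))
    return None if n < 1e-9 else (v[0] / n, v[1] / n, v[2] / n)


def shortest_arc(a: Vec3, b: Vec3) -> Quat:
    """Unit quaternion rotating direction a onto direction b."""
    ua, ub = _unit(a), _unit(b)
    if ua is None or ub is None:
        return IDENTITY  # degenerate input falls through to identity
    d = _dot(ua, ub)
    if d > 1.0 - 1e-9:
        return IDENTITY  # already aligned
    if d < -1.0 + 1e-9:
        # Opposite directions: rotate 180 degrees about any perpendicular axis.
        helper = (1.0, 0.0, 0.0) if abs(ua[0]) < 0.9 else (0.0, 1.0, 0.0)
        axis = _unit(_cross(ua, helper))
        return (axis[0], axis[1], axis[2], 0.0)
    c = _cross(ua, ub)
    w = 1.0 + d
    n = math.sqrt(_dot(c, c) + w * w)
    return (c[0] / n, c[1] / n, c[2] / n, w / n)


def rotate(q: Quat, v: Vec3) -> Vec3:
    """Apply unit quaternion q to vector v."""
    qv = (q[0], q[1], q[2])
    t = _cross(qv, v)
    t = (2.0 * t[0], 2.0 * t[1], 2.0 * t[2])
    c = _cross(qv, t)
    return (v[0] + q[3] * t[0] + c[0],
            v[1] + q[3] * t[1] + c[1],
            v[2] + q[3] * t[2] + c[2])
```

The `rotate` helper is only there to check results: applying `shortest_arc(a, b)` to a should land on b.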

pose_extractor.py — extract_pose_stream(clip_path, target_fps=30)
opens a clip via OpenCV, samples at the target fps, runs Mediapipe
Holistic per frame, and yields paired MotionFrame + NmmFrame tracks.
NMM frames carry coarse geometric approximations of ARKit blendshapes
(jawOpen, brow direction, mouth width, eye openness) so Phase 5 can
keep the retrieved signer's natural facial expressions when present.
Heavy deps imported lazily; rest_motion_frame helper for the idle
pose between segments.
…loader

retrieval.py — RetrievalIndex(name=...) lazily loads a FAISS index
and the matching sentence-transformer (configured in
settings.retrieval.embedding_model). query(text, k) returns ranked
RetrievalHits with cosine similarity normalised into [0, 1]; the
threshold check is left to the caller (MotionSynthStage). load_poses
reads the per-clip pose JSON written by build_corpus_index. An
index_signature property gives Phase 5 a cheap cache key. The
from_memory classmethod is the test seam — tests/test_retrieval.py
exercises the full code path without touching FAISS or the model.

pose_library.py — file-backed PoseLibrary keyed by uppercase gloss
for the WLASL fallback tier. Lazy: construction touches no disk, only
has/get/glosses do; get() is cached after first read. Case-insensitive
lookup so callers can pass either "HELLO" or "hello".
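
The lazy, cached, case-insensitive lookup described for pose_library.py can be sketched as follows. `PoseLibrarySketch` is a hypothetical stand-in: it returns raw dicts instead of PoseLibraryEntry models, and returning `None` for an unknown gloss is an assumption. The `<GLOSS>.json` file naming follows the build script description elsewhere in this PR.

```python
import json
from pathlib import Path
from typing import Optional


class PoseLibrarySketch:
    """File-backed gloss lookup; construction performs no disk access."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self._cache: dict[str, dict] = {}

    def _path(self, gloss: str) -> Path:
        return self.root / f"{gloss.upper()}.json"

    def has(self, gloss: str) -> bool:
        return gloss.upper() in self._cache or self._path(gloss).is_file()

    def get(self, gloss: str) -> Optional[dict]:
        key = gloss.upper()
        if key not in self._cache:
            path = self._path(key)
            if not path.is_file():
                return None
            # First read hits disk; later calls reuse the cached object.
            self._cache[key] = json.loads(path.read_text(encoding="utf-8"))
        return self._cache[key]

    @property
    def glosses(self) -> list[str]:
        return sorted(p.stem for p in self.root.glob("*.json"))
```

Because nothing is read in `__init__`, a file added after construction is still found by `has()`, which is the property the lazy-init test in this PR checks.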
… manifest

Reads an upstream OpenASL release manifest (TSV/CSV with clip_id,
youtube_id, start_seconds, end_seconds, caption_en, optional
signer_id) via --source PATH or --source URL. For each row, downloads
the source YouTube video once (cached per youtube_id under
assets/corpus/openasl/_sources/), then ffmpeg-trims [start, end] to
assets/corpus/openasl/<clip_id>.mp4. Probes the trim for actual
duration and writes/merges assets/corpus/openasl_manifest.json.

Resumable (--no-resume to force re-fetch), parallel (--workers K),
manifest flushed every N rows so a Ctrl-C doesn't lose progress.
Clips exceeding settings.retrieval.max_clip_duration_ms are
skipped at fetch time so we don't waste disk on full lectures.

Logging: each invocation writes
logs/fetch_openasl-<YYYYMMDD-HHMMSS>.log via setup_script_logging.
Console defaults to INFO; pass --log-level DEBUG for per-row detail.
The log path is printed at start and end so the user can tail -F it
during a multi-hour run.

Two-phase offline build over a corpus manifest:

1. Embedding + FAISS index. Loads the sentence-transformer named in
   settings.retrieval.embedding_model, encodes every caption with
   batch=128 and L2 normalization (so inner-product = cosine), and
   writes assets/corpus/<name>.faiss + <name>_embeddings.npy.

2. Pose extraction. For each clip, runs extract_pose_stream in a
   child process (ProcessPoolExecutor) so mediapipe state stays
   isolated and a single crash doesn't poison the whole batch.
   Output: assets/corpus/<name>_poses/<clip_id>.json — the runtime
   RetrievalIndex.load_poses() consumes this shape directly.
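
Step 1 relies on the fact that, for L2-normalised vectors, the inner product equals cosine similarity, so an inner-product index ranks by cosine. The dependency-free sketch below checks that identity on toy vectors; it does not use FAISS or a sentence-transformer, and the helper names are illustrative.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def inner(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))


def top_k(query: list[float], corpus: dict[str, list[float]], k: int) -> list[str]:
    """Rank corpus ids by inner product of normalised vectors."""
    q = l2_normalize(query)
    scored = [(inner(q, l2_normalize(vec)), cid) for cid, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:k]]


a, b = [3.0, 4.0], [4.0, 3.0]
# Inner product after normalisation matches cosine on the raw vectors.
same = abs(inner(l2_normalize(a), l2_normalize(b)) - cosine(a, b)) < 1e-12
```

Normalising once at build time means the runtime query only needs a normalised query vector and a dot product per candidate.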

CLI flags --skip-poses / --skip-index let the user re-do just one
half (e.g. after changing the embedding model). --limit N for smoke
runs. --workers K (default 2) for pose extraction parallelism.
Progress lines emit every 25 clips with rate + ETA so the user can
tell whether a multi-hour run is on track.

Logging mirrors fetch_openasl: timestamped log file under logs/.
…gate

Loads tests/fixtures/retrieval_eval.json (10 hand-curated English
chunks across wh-Q, yes/no Q, negation, topic-comment, classifier,
role shift, time anchor, numeric, and two neutral declaratives),
queries the OpenASL index for top-k hits per chunk, and prints each
hit's caption + clip id to console plus a side-by-side markdown
table to logs/retrieval_eval-<YYYYMMDD-HHMMSS>.md.

This is the human-in-the-loop gate documented in
docs/plan/phase-4-corpus-retrieval.md Verification: at least 7 of 10
chunks must have an on-target top-3 to proceed to Phase 5. Automated
pass/fail would be wrong here — ASL semantic match is too subjective
for a regex test.

Defaults to k=3 to match the gate criterion; --k 5 for wider exploration.

Builds the per-gloss WLASL pose library used by Phase 5 as the
last-resort fallback tier. Walks assets/word_manifest.json, picks the
best clip per gloss honoring preferred_signer_ids from the manifest,
runs extract_pose_stream per clip, and writes a PoseLibraryEntry JSON
to assets/pose_library/<GLOSS>.json.

Defaults to --limit 500 per the corpus-retrieval pivot — the
full 2000-entry build is no longer the primary asset. Use --all to
build everything (several hours on CPU), --gloss HELLO to debug a
single sign, --force to re-extract over an existing JSON.

Logging is the same setup_script_logging shape used by the corpus
scripts; progress lines every 25 clips with clips/s rate.
vrm_retarget — quaternion unit-norm + identity-for-equal-vectors +
180-degree-flip + synthetic T-pose landmarks producing near-identity
arm quats + bent-arm producing a non-identity lower-arm quat. The
tests use plain (x, y, z) tuples / tiny objects with .x/.y/.z so
mediapipe isn't imported.

pose_library — known/unknown gloss lookup, case-insensitive lookup,
glosses-property listing, get()-caching, and a lazy-no-disk-touch-at-
init test that constructs the library then adds a file and confirms
has() picks it up.

retrieval — RetrievalIndex.from_memory test seam used to bypass
FAISS / sentence-transformers. Exact-caption query returns top-1 at
sim ~1.0; lexical-overlap query soft-matches the closest caption;
empty / whitespace query returns []; load_poses reads disk lazily and
raises FileNotFoundError on a missing clip id; index_signature
changes when the manifest grows.

Total: 16 new tests, all green alongside the existing 43.
…roach

Rewrite the business docs from two competing layers (v1 word-level-learner
plan + v2 feasibility study) into one coherent plan aligned with the committed
approach: retrieval-augmented, grammar-aware, phrase-level ASL; platform-pays;
no word-level output. Refresh all market data to May 2026.

- Promote the six numbered docs to the canonical plan; reframe
  feasibility-study/ as the technical & feasibility appendix.
- Refresh regulatory drivers: ADA Title II deadline extended to 2027/2028;
  EAA live since June 2025; 2025 litigation rebound (~3,900, +24%).
- Add Sorenson (Hand Talk + OmniBridge acquisition, April 2026 avatar POCs)
  as the now-live incumbent threat across competitive sections.
- Re-derive TAM/SAM/SOM; fold induced-demand model into market analysis.
- Align F1 Stage 3 to phrase-level-retrieval-first; note Phases 1-3 shipped.
- Map the product roadmap to actual pipeline phases 4-7.
- Scrub dead "v1 plan" references; cite real corpora (OpenASL, ASL Citizen).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 7, 2026 09:35
sanaro99 merged commit 7dfdc04 into master on Jun 7, 2026
1 check passed
sanaro99 deleted the refactor/pipeline-stages branch on June 7, 2026 09:37

Copilot AI left a comment

Pull request overview

This PR advances the pipeline from a “skeleton” toward an operational Phase 2–4 implementation by adding concrete cached stages for the audio backbone and interpreter planning, introducing Phase 4 corpus retrieval + pose extraction infrastructure (OpenASL ingestion, FAISS index build, pose extraction), and updating the v5 schema/documents accordingly.

Changes:

  • Add cached pipeline stages for audio ingest/analyze and semantic chunking/interpreter planning, plus a run_audio_only() pipeline helper.
  • Introduce Phase 4 corpus retrieval layer (RetrievalIndex runtime API, Mediapipe pose extraction, offline scripts, fixtures) and corresponding settings/config/docs.
  • Bump schema to v5.1 and add retrieval metadata fields to plan segments; add extensive unit tests for audio/interpreter/retrieval/pose library.

Reviewed changes

Copilot reviewed 58 out of 60 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `tests/test_vrm_retarget.py` | Adds unit tests for VRM retarget quaternion math and landmark→bone mapping behavior. |
| `tests/test_retrieval.py` | Adds RetrievalIndex tests using in-memory FAISS/embedder doubles and on-disk pose JSON fixtures. |
| `tests/test_pose_library.py` | Adds tests for the lazy, cached WLASL pose-library loader. |
| `tests/test_interpreter_planner.py` | Adds tests for chunker + interpreter planner parsing/retry/fallback behavior and stage cache/fingerprint stability. |
| `tests/test_avatar_pipeline_bootstrap.py` | Updates schema round-trip test to v5.1 and adds back-compat coverage for new segment fields. |
| `tests/test_audio_analyzer.py` | Adds tests for audio ingest/analyze caching and emotion/prosody/ASR behaviors (with optional heavy deps). |
| `tests/fixtures/retrieval_eval.json` | Adds a hand-curated retrieval evaluation fixture for the Phase 4 quality gate script. |
| `src/pipeline/stages/semantic_chunk.py` | Introduces Stage 3 wrapper around chunker with disk caching + fingerprinting. |
| `src/pipeline/stages/interpreter_plan.py` | Introduces Stage 4 wrapper around planner with disk caching + fingerprinting. |
| `src/pipeline/stages/audio_ingest.py` | Introduces Stage 1 (download + extract WAV) with disk caching + fingerprinting. |
| `src/pipeline/stages/audio_analyze.py` | Introduces Stage 2 (ASR+prosody+emotion fusion) with disk caching + fingerprinting. |
| `src/pipeline/stages/__init__.py` | Exposes concrete stage classes via package exports. |
| `src/pipeline/pipeline_avatar.py` | Wires Phase 2–3 stages and adds run_audio_only(); keeps full run() as NotImplemented until Phase 5. |
| `src/pipeline/models.py` | Bumps schema_version to 5.1; adds retrieval-related metadata fields to AslPlanSegment and annotated_segments to MotionSynthOutput. |
| `src/interpreter/prompt.py` | Adds interpreter persona system prompt + few-shot examples and message builders; introduces PROMPT_VERSION. |
| `src/interpreter/planner.py` | Adds robust planner implementation (JSON extraction, retry, fallback, gloss/NMM normalization). |
| `src/interpreter/chunker.py` | Adds chunker implementation producing InterpreterChunks with emotion/prosody summaries and boundary logic. |
| `src/interpreter/__init__.py` | Adds interpreter package marker docstring. |
| `src/core/paths.py` | Adds corpus path helpers for Phase 4 assets layout. |
| `src/core/logging.py` | Adds setup_script_logging() for per-script timestamped logs; improves module-level docs. |
| `src/core/config.py` | Adds RetrievalSettings and paths.corpus_root; wires retrieval into Settings. |
| `src/avatar/retrieval.py` | Adds RetrievalIndex runtime API with lazy FAISS/embedder loading, query, and pose loading. |
| `src/avatar/pose_library.py` | Adds lazy WLASL pose-library loader used as last-resort fallback tier. |
| `src/avatar/pose_extractor.py` | Adds Mediapipe Holistic pose extraction producing MotionFrame + NmmFrame streams with rest-pose fallbacks. |
| `src/avatar/__init__.py` | Adds avatar package marker docstring. |
| `src/audio/prosody.py` | Adds librosa-based prosody extraction with lazy heavy imports. |
| `src/audio/extractor.py` | Adds ffmpeg-based audio extraction with caching and ffprobe duration probing. |
| `src/audio/emotion.py` | Adds LLM-based emotion classification over text+prosody summary with robust parsing/clamping. |
| `src/audio/asr.py` | Adds faster-whisper wrapper with lazy model load + thread-safe cache. |
| `src/audio/analyzer.py` | Adds parallel ASR/prosody execution and fuses results with emotion into AudioAnalysis. |
| `scripts/retrieval_eval.py` | Adds human-in-the-loop retrieval quality gate script that emits console logs + markdown report. |
| `scripts/fetch_openasl.py` | Adds OpenASL corpus fetch/trim/manifest builder with concurrency and resumability. |
| `scripts/build_pose_library.py` | Adds WLASL fallback pose-library builder producing per-gloss PoseLibraryEntry JSON. |
| `scripts/build_corpus_index.py` | Adds corpus embedding + FAISS index build and per-clip pose extraction (multi-process). |
| `requirements.txt` | Promotes Phase 2/4 dependencies (whisper/librosa/mediapipe/faiss/sentence-transformers, etc.) into requirements. |
| `README.md` | Updates phase status table to mark Phases 2–3 as done. |
| `pytest.ini` | Adds slow marker definition and guidance for skipping heavy/slow tests. |
| `docs/plan/README.md` | Updates phase roadmap and introduces Phase 4 corpus retrieval plan entry. |
| `docs/plan/phase-4-pose-library.md` | Archives the original Phase 4 pose-library plan and links to the new Phase 4 retrieval plan. |
| `docs/plan/phase-4-corpus-retrieval.md` | Adds the new Phase 4 corpus ingestion + phrase-level retrieval index plan document. |
| `docs/plan/00-architecture.md` | Updates architecture diagram/text for phrase-level retrieval and schema v5.1. |
| `docs/architecture-overview.md` | Updates canonical architecture overview to phrase-level retrieval and fidelity tagging. |
| `config.yaml` | Adds retrieval settings defaults (embedding model, thresholds, corpus names). |
| `CLAUDE.md` | Adds repository “working agreement” with invariants and updated phase status. |
| `business/README.md` | Rewrites business plan framing to the committed phrase-level retrieval approach. |
| `business/feasibility-study/README.md` | Reframes feasibility docs as a technical appendix supporting the unified plan. |
| `business/feasibility-study/05-feasibility-verdict.md` | Updates verdict language to align with the unified plan and incumbent threat framing. |
| `business/feasibility-study/04-pricing-strategy-comparison.md` | Updates terminology and clarifies hybrid model references. |
| `business/feasibility-study/03-market-expansion.md` | Updates induced-demand framing and references to the old consumer-learner thesis. |
| `business/feasibility-study/02-competitive-tech-comparison.md` | Updates competitor section (Sorenson/Hand Talk) and threat framing. |
| `business/feasibility-study/01-technology-feasibility.md` | Updates feasibility framing and Stage 3 (retrieval-first) architecture details. |
| `business/04-value-proposition.md` | Updates value proposition to platform-pays + phrase-level retrieval approach. |
| `business/03-competitive-landscape.md` | Updates competitive landscape to the five-family taxonomy and corpus-as-moat framing. |
| `business/01-executive-summary.md` | Updates executive summary to reflect committed architecture and updated regulatory runway. |
| `.gitignore` | Ignores large Phase 4 corpus artifacts (clips/poses/embeddings) and pose_library outputs. |

Comment on lines +25 to +38
def fingerprint(self, inp: SemanticChunkInput) -> str:
    s = self.settings
    analysis = inp.analysis
    return stable_hash([
        "semantic_chunk",
        analysis.duration_ms,
        len(analysis.asr_words),
        # Include first/last word to detect content drift cheaply.
        analysis.asr_words[0].word if analysis.asr_words else "",
        analysis.asr_words[-1].word if analysis.asr_words else "",
        s.interpreter.max_chunk_chars,
        s.interpreter.min_chunk_chars,
        s.audio.vad_min_silence_ms,
    ])
Comment on lines +40 to +50
def process(self, inp: InterpreterPlanInput) -> InterpreterPlanOutput:
    segments, provider, model = plan_chunks(
        inp.chunks, settings=self.settings.interpreter
    )
    logger.info(
        "InterpreterPlanStage: %d segments via %s/%s",
        len(segments), provider, model,
    )
    return InterpreterPlanOutput(
        segments=segments, provider=provider, model=model
    )
Comment on lines +44 to +46
def process(self, inp: AudioAnalyzeInput) -> AudioAnalyzeOutput:
    wav_path = PROJECT_ROOT / inp.audio_path
    analysis = analyze(wav_path, inp.duration_ms)
Comment on lines +20 to +34
from pydantic import BaseModel

from src.core.paths import PROJECT_ROOT
from src.pipeline.models import MotionFrame, NmmFrame

logger = logging.getLogger(__name__)


class PoseLibraryEntry(BaseModel):
    gloss: str
    duration_ms: int
    fps: int = 30
    source_clip: str = ""
    keyframes: list[MotionFrame]
    nmm: list[NmmFrame] = []
Comment thread scripts/fetch_openasl.py
Comment on lines +31 to +33
# Full pull, 4 parallel workers, resume previous run
python -m scripts.fetch_openasl --source path/to/openasl.tsv \
    --workers 4 --resume
Comment thread scripts/fetch_openasl.py
Comment on lines +92 to +96
path = Path(source).expanduser().resolve()
if not path.is_file():
    raise FileNotFoundError(f"Source manifest not found: {path}")
logger.info("Reading source manifest %s", path)
handle = path.open("r", encoding="utf-8")
Comment thread src/avatar/retrieval.py
Comment on lines +88 to +96
@property
def index_signature(self) -> str:
    """Cheap stable hash of the index file's mtime + manifest length."""
    manifest_n = len(self._load_manifest())
    try:
        mtime = int(self.index_path.stat().st_mtime)
    except FileNotFoundError:
        mtime = 0
    return f"{self.name}:{manifest_n}:{mtime}"

2 participants