Clean up data pipeline + docs, add batch local YOLO import + FP review#120
Merged
MateoLostanlen merged 13 commits intomainfrom May 1, 2026
Merged
Clean up data pipeline + docs, add batch local YOLO import + FP review#120MateoLostanlen merged 13 commits intomainfrom
MateoLostanlen merged 13 commits intomainfrom
Conversation
…review tools - Add batch_import_local_yolo.py + Makefile target for parallel import of local YOLO sequence folders paired with sequences.csv. - Add cleanup_duplicate_images.py to drop ready sequences whose first frames are duplicates. - Add import_all_platforms.sh wrapper iterating creds.json for per-org platform import. - Add review_split_app.py Streamlit viewer to compare model fp/wf split against human annotations. - pull_sequence_annotations: --smoke-type empty to pull FP sequences (empty smoke_types). - export_dataset: derive a single category per sequence from sequence-level smoke_types/false_positive_types and add --category filter; one folder per (category, seq). - import_yolo_sequence: --labels-dir-name to support labels_predictions and similar. - Makefile: switch export-dataset to MAIN_ANNOTATION_LOGIN/PASSWORD env, add CATEGORY pass-through, document FP review and batch import targets. - README: document FP review pipeline and category filter. - Add top-level CLAUDE.md.
- Add annotation_api/.env.example documenting all expected variables. - Makefile auto-loads annotation_api/.env (include + export) so all targets see the variables without manual `source` or `export`. - Allow .env.example through .gitignore. - Update README, top-level CLAUDE.md, and annotation_api/docs/data-ingestion-guide.md to point at .env instead of inline `export VAR=...` instructions.
- Drop scripts/setup.sh (was only a docker prerequisite check) and remove `make setup` target + its call from `make start`. - Drop `make start-prod` / `make stop-prod` references from annotation_api/README.md and annotation_api/CLAUDE.md (those targets do not exist). - Fix Makefile help: clean FP sequences become 'annotated' (not 'managed'); align apply_fiftyone_review.py --fp-mode help with the actual stage. - Fix import-local-yolo Usage hint to use real var names (LIMIT_FOLDERS / SKIP_FOLDERS). - Note that `make pull-fp` hardcodes smoke-type=empty (the SMOKE_TYPE override has no effect there). - Update Development Setup steps in annotation_api/CLAUDE.md to point at .env.example. - Add creds.json to .gitignore. - Remove unused review_split_app.py.
- Drop "Migration Notes (Poetry → uv)" section: dated history, no longer relevant.
- Drop "Recent Enhancements (2024)" changelog section.
- Drop the duplicated "Data Import Scripts" section (kept the "Data Transfer Scripts" one above) — the duplicate referenced .envrc and contradicted the new .env flow.
- Replace ".envrc" / docker-compose example in "Authentication Setup" with a pointer to .env / .env.example.
- Drop test-count claims ("192 tests") that drift on every PR; condense the Testing block.
frontend/README.md and frontend/CLAUDE.md described pages and components that no longer exist (AnnotationPage, AnnotationsPage, SequenceDetailPage, SequenceBboxCard, etc.) and pointed at dead links (CONTRIBUTING.md, IMPLEMENTATION_PLAN.md, ../LICENSE). Rewrite both from the actual src/ tree: - Real page list and route paths from App.tsx (dev port 3000, not 5173). - Real component layout (annotation/, detection-annotation/, detection-sequence/, filters/, sequence/, sequence-annotation/, sequences/, ui/) and the actual hooks/ + utils/ modules. - Real stores (useAuthStore + useSequenceStore — useAnnotationStore was removed long ago). - Drop dated "Refactored Architecture (2024)" / "Recent Fixes (2024)" sections, the "replacing legacy Streamlit" framing, and "13 utilities" / line-count claims that drift on every PR. - Drop dead links to CONTRIBUTING.md, IMPLEMENTATION_PLAN.md, and ../LICENSE. - Delete DetectionAnnotatePage.tsx.backup.
1. pull_sequence_annotations: page until N accepted, not N raw The previous fetch_sequences applied --max-sequences before the --smoke-type filter, so `make pull-fp MAX_SEQUENCES=20` could grab the first 20 seq_annotation_done sequences (almost all of which have smoke) and then drop them all in the worker, leaving 0 sequences to review. Move the smoke-type filter into the page loop (using the embedded annotation from include_annotation=True) and keep paging until we have N accepted sequences. 2. export_dataset: include confirmed clean FPs The /export/detections endpoint already filters to processing_stage=annotated, so a row reaching the script is from an annotated sequence by construction. Sequences with empty smoke_types AND empty false_positive_types are confirmed false positives produced by the new FP review workflow — they should be exported under the fp/ category, not silently dropped. sequence_category_from_row now returns CATEGORY_FP in that case instead of None. 3. export-dataset: stop string-interpolating secrets in the recipe $(MAIN_ANNOTATION_LOGIN) / $(MAIN_ANNOTATION_PASSWORD) substitution happens at Make parse time, so values containing $, spaces, or shell metacharacters could be split or expanded by the shell before reaching the script. Drop the --username/--password CLI flags, let the values propagate through the subprocess environment (Makefile already has `include .env; export`), and have export_dataset.py default to MAIN_ANNOTATION_LOGIN/PASSWORD env vars (with ANNOTATOR_* fallback). The empty-check now uses shell-time $$VAR, not Make-time $(VAR).
Make's `include .env` parses dotenv files as Make syntax, which mangles values containing $, spaces, or quotes — e.g. MAIN_ANNOTATION_PASSWORD set to `pa$word` ends up as `paord` in the subprocess environment. Even after dropping the recipe-side $(VAR) interpolation, the mangled value still poisoned the env, and python-dotenv's load_dotenv() does not override existing env vars by default. Fix: stop parsing .env in the Makefile entirely. Every data-transfer script already calls load_dotenv() at startup (verified across the ingestion/platform/ scripts), and python-dotenv handles dotenv quoting correctly. This means: - Make never sees the secret value, so it can't mangle it. - Shell-level vars still take priority (load_dotenv does not override), so `MAIN_ANNOTATION_LOGIN=foo make ...` still works as an override. - Recipes no longer pre-validate creds; auth fails with a clear HTTP 401 if creds are missing/wrong. Drop the .env include + the now-empty $$VAR check in `export-dataset`, and update the docs (README, .env.example) to explain the new flow. Also reformat pull_sequence_annotations.py and export_dataset.py with `ruff format` so `make lint` passes.
P2 followup: cleanup_duplicate_images.py reads credentials via shared.get_annotation_credentials() but neither file called load_dotenv(), so once Make stopped pre-loading .env this script silently fell back to admin/admin12345 on every invocation. Add load_dotenv() at module level in shared.py — that way every importer (current and future) picks up .env even if it forgets to call load_dotenv() itself — and add a defensive load_dotenv() in cleanup_duplicate_images.py for clarity. P3 followup: clean up four spots that still claimed Make auto-loads .env: - README.md (workflow intro) - annotation_api/docs/data-ingestion-guide.md - root CLAUDE.md (Platform Import section) - annotation_api/Makefile (data-workflow header comment) All now describe the actual behaviour: each script loads .env via python-dotenv at startup, Make does not parse .env. Run ruff format on the two changed Python files.
These two ingestion scripts read credentials from env vars but never called load_dotenv(), so once Make stopped pre-loading .env they fell back to admin/admin12345 by default. Add load_dotenv() at module level and switch export_annotations.py's --username/--password defaults to ANNOTATOR_LOGIN/PASSWORD env vars (matching the pattern used in import_annotations.py and the rest of the platform scripts). ruff format reflowed the long argparse lines.
The page-level _passes_smoke_filter dropped rows whose embedded annotation didn't populate, even though process_sequence would re-fetch and decide. Return True for missing annotations so the worker gets to make the call, and document the prefilter/recheck split.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two threads bundled because they touched overlapping files:
1. Pipeline features & UX
sequences.csv(newmake import-local-yolotarget +batch_import_local_yolo.py).seq_annotation_donesequences with emptysmoke_types, FiftyOne check, push back asannotated(no labels) orneeds_manualif a fire was missed.--categoryfilter; one folder per (category, seq).import_yolo_sequence:--labels-dir-nameso non-default label subdirs (e.g.labels_predictions) work.cleanup_duplicate_images.pyhelper to drop ready sequences whose first frames are dupes.import_all_platforms.shwrapper that loopscreds.jsonfor per-org platform import.2.
.env-based credentials (no more shellexport)annotation_api/.env.examplewith all expected variables.annotation_api/.env(include+export), so recipes and their subprocesses see the variables without manualsource/export..gitignore: keep.envignored, allow.env.example, ignorecreds.json.CLAUDE.md, andannotation_api/docs/data-ingestion-guide.mdupdated to point at.envinstead of inlineexport VAR=....3. Stale-reference & doc cleanup
make start-prod/make stop-prodreferences — those targets don't exist.scripts/setup.sh(was just a Docker prereq check) and themake setuptarget / its call frommake start.'annotated'(not'managed'); alignapply_fiftyone_review.py --fp-modehelp with the actual stage.import-local-yoloUsage hint to use real var names (LIMIT_FOLDERS/SKIP_FOLDERS).make pull-fphardcodessmoke-type=empty.annotation_api/CLAUDE.md: drop the Poetry→uv migration notes, the dated "Recent Enhancements (2024)" changelog, the duplicated "Data Import Scripts" block (which referenced.envrc), and the drifting "192 tests" claims (-130 / +7 lines).frontend/README.mdandfrontend/CLAUDE.mdagainst the actualsrc/tree — old docs listed pages and components that no longer exist (AnnotationPage,AnnotationsPage,SequenceDetailPage,SequenceBboxCard, etc.) and pointed at dead links (CONTRIBUTING.md,IMPLEMENTATION_PLAN.md,../LICENSE). Dev port is 3000, not 5173. (-927 / +195 across the two files.)frontend/src/pages/DetectionAnnotatePage.tsx.backup.review_split_app.py(no longer used).Test plan
cd annotation_api && make helplists targets and matches reality.cp annotation_api/.env.example annotation_api/.env, fill in, then runmake export-dataset OUTPUT_DIR=...(no shellexport).make pull-fp MAX_SEQUENCES=1 && make visual-check-fp && make apply-review-fp.make import-local-yolo ROOT_DIR=...against a small fixture.cd frontend && npm install && npm run dev— UI loads at http://localhost:3000 against a running API.start-prod,make setup,.envrc,IMPLEMENTATION_PLAN).Notes
make_it_fasterandclean_pipelinebecause the work happened across multiple branches. Squash-merging this PR is recommended to givemaina single clean commit.git checkout main && git reset --hard origin/mainto drop the now-redundant local merge commits.