Clean up data pipeline + docs, add batch local YOLO import + FP review by MateoLostanlen · Pull Request #120 · pyronear/pyro-annotator

MateoLostanlen · 2026-05-01T05:31:58Z

Summary

Two threads bundled because they touched overlapping files:

1. Pipeline features & UX

Batch import of local YOLO sequence folders paired with sequences.csv (new make import-local-yolo target + batch_import_local_yolo.py).
FP review pipeline: pull seq_annotation_done sequences with empty smoke_types, FiftyOne check, push back as annotated (no labels) or needs_manual if a fire was missed.
Dataset export: derive a single category per sequence from sequence-level fields and add --category filter; one folder per (category, seq).
import_yolo_sequence: --labels-dir-name so non-default label subdirs (e.g. labels_predictions) work.
cleanup_duplicate_images.py helper to drop ready sequences whose first frames are dupes.
import_all_platforms.sh wrapper that loops creds.json for per-org platform import.

2. .env-based credentials (no more shell export)

New annotation_api/.env.example with all expected variables.
Makefile auto-loads annotation_api/.env (include + export), so recipes and their subprocesses see the variables without manual source / export.
.gitignore: keep .env ignored, allow .env.example, ignore creds.json.
README, root CLAUDE.md, and annotation_api/docs/data-ingestion-guide.md updated to point at .env instead of inline export VAR=....

3. Stale-reference & doc cleanup

Drop make start-prod / make stop-prod references — those targets don't exist.
Drop scripts/setup.sh (was just a Docker prereq check) and the make setup target / its call from make start.
Fix Makefile help: clean FP sequences become 'annotated' (not 'managed'); align apply_fiftyone_review.py --fp-mode help with the actual stage.
Fix import-local-yolo Usage hint to use real var names (LIMIT_FOLDERS/SKIP_FOLDERS).
Note that make pull-fp hardcodes smoke-type=empty.
Trim annotation_api/CLAUDE.md: drop the Poetry→uv migration notes, the dated "Recent Enhancements (2024)" changelog, the duplicated "Data Import Scripts" block (which referenced .envrc), and the drifting "192 tests" claims (-130 / +7 lines).
Rewrite frontend/README.md and frontend/CLAUDE.md against the actual src/ tree — old docs listed pages and components that no longer exist (AnnotationPage, AnnotationsPage, SequenceDetailPage, SequenceBboxCard, etc.) and pointed at dead links (CONTRIBUTING.md, IMPLEMENTATION_PLAN.md, ../LICENSE). Dev port is 3000, not 5173. (-927 / +195 across the two files.)
Delete frontend/src/pages/DetectionAnnotatePage.tsx.backup.
Delete review_split_app.py (no longer used).

Test plan

cd annotation_api && make help lists targets and matches reality.
cp annotation_api/.env.example annotation_api/.env, fill in, then run make export-dataset OUTPUT_DIR=... (no shell export).
make pull-fp MAX_SEQUENCES=1 && make visual-check-fp && make apply-review-fp.
make import-local-yolo ROOT_DIR=... against a small fixture.
Frontend: cd frontend && npm install && npm run dev — UI loads at http://localhost:3000 against a running API.
Spot-check that no doc references a removed file/target (search start-prod, make setup, .envrc, IMPLEMENTATION_PLAN).

Notes

Branch history includes local merge commits from make_it_faster and clean_pipeline because the work happened across multiple branches. Squash-merging this PR is recommended to give main a single clean commit.
After merge, recommend git checkout main && git reset --hard origin/main to drop the now-redundant local merge commits.

…review tools - Add batch_import_local_yolo.py + Makefile target for parallel import of local YOLO sequence folders paired with sequences.csv. - Add cleanup_duplicate_images.py to drop ready sequences whose first frames are duplicates. - Add import_all_platforms.sh wrapper iterating creds.json for per-org platform import. - Add review_split_app.py Streamlit viewer to compare model fp/wf split against human annotations. - pull_sequence_annotations: --smoke-type empty to pull FP sequences (empty smoke_types). - export_dataset: derive a single category per sequence from sequence-level smoke_types/false_positive_types and add --category filter; one folder per (category, seq). - import_yolo_sequence: --labels-dir-name to support labels_predictions and similar. - Makefile: switch export-dataset to MAIN_ANNOTATION_LOGIN/PASSWORD env, add CATEGORY pass-through, document FP review and batch import targets. - README: document FP review pipeline and category filter. - Add top-level CLAUDE.md.

- Add annotation_api/.env.example documenting all expected variables. - Makefile auto-loads annotation_api/.env (include + export) so all targets see the variables without manual `source` or `export`. - Allow .env.example through .gitignore. - Update README, top-level CLAUDE.md, and annotation_api/docs/data-ingestion-guide.md to point at .env instead of inline `export VAR=...` instructions.

- Drop scripts/setup.sh (was only a docker prerequisite check) and remove `make setup` target + its call from `make start`. - Drop `make start-prod` / `make stop-prod` references from annotation_api/README.md and annotation_api/CLAUDE.md (those targets do not exist). - Fix Makefile help: clean FP sequences become 'annotated' (not 'managed'); align apply_fiftyone_review.py --fp-mode help with the actual stage. - Fix import-local-yolo Usage hint to use real var names (LIMIT_FOLDERS / SKIP_FOLDERS). - Note that `make pull-fp` hardcodes smoke-type=empty (the SMOKE_TYPE override has no effect there). - Update Development Setup steps in annotation_api/CLAUDE.md to point at .env.example. - Add creds.json to .gitignore. - Remove unused review_split_app.py.

- Drop "Migration Notes (Poetry → uv)" section: dated history, no longer relevant. - Drop "Recent Enhancements (2024)" changelog section. - Drop the duplicated "Data Import Scripts" section (kept the "Data Transfer Scripts" one above) — the duplicate referenced .envrc and contradicted the new .env flow. - Replace ".envrc" / docker-compose example in "Authentication Setup" with a pointer to .env / .env.example. - Drop test-count claims ("192 tests") that drift on every PR; condense the Testing block.

frontend/README.md and frontend/CLAUDE.md described pages and components that no longer exist (AnnotationPage, AnnotationsPage, SequenceDetailPage, SequenceBboxCard, etc.) and pointed at dead links (CONTRIBUTING.md, IMPLEMENTATION_PLAN.md, ../LICENSE). Rewrite both from the actual src/ tree: - Real page list and route paths from App.tsx (dev port 3000, not 5173). - Real component layout (annotation/, detection-annotation/, detection-sequence/, filters/, sequence/, sequence-annotation/, sequences/, ui/) and the actual hooks/ + utils/ modules. - Real stores (useAuthStore + useSequenceStore — useAnnotationStore was removed long ago). - Drop dated "Refactored Architecture (2024)" / "Recent Fixes (2024)" sections, the "replacing legacy Streamlit" framing, and "13 utilities" / line-count claims that drift on every PR. - Drop dead links to CONTRIBUTING.md, IMPLEMENTATION_PLAN.md, and ../LICENSE. - Delete DetectionAnnotatePage.tsx.backup.

1. pull_sequence_annotations: page until N accepted, not N raw The previous fetch_sequences applied --max-sequences before the --smoke-type filter, so `make pull-fp MAX_SEQUENCES=20` could grab the first 20 seq_annotation_done sequences (almost all of which have smoke) and then drop them all in the worker, leaving 0 sequences to review. Move the smoke-type filter into the page loop (using the embedded annotation from include_annotation=True) and keep paging until we have N accepted sequences. 2. export_dataset: include confirmed clean FPs The /export/detections endpoint already filters to processing_stage=annotated, so a row reaching the script is from an annotated sequence by construction. Sequences with empty smoke_types AND empty false_positive_types are confirmed false positives produced by the new FP review workflow — they should be exported under the fp/ category, not silently dropped. sequence_category_from_row now returns CATEGORY_FP in that case instead of None. 3. export-dataset: stop string-interpolating secrets in the recipe $(MAIN_ANNOTATION_LOGIN) / $(MAIN_ANNOTATION_PASSWORD) substitution happens at Make parse time, so values containing $, spaces, or shell metacharacters could be split or expanded by the shell before reaching the script. Drop the --username/--password CLI flags, let the values propagate through the subprocess environment (Makefile already has `include .env; export`), and have export_dataset.py default to MAIN_ANNOTATION_LOGIN/PASSWORD env vars (with ANNOTATOR_* fallback). The empty-check now uses shell-time $$VAR, not Make-time $(VAR).

Make's `include .env` parses dotenv files as Make syntax, which mangles values containing $, spaces, or quotes — e.g. MAIN_ANNOTATION_PASSWORD set to `pa$word` ends up as `paord` in the subprocess environment. Even after dropping the recipe-side $(VAR) interpolation, the mangled value still poisoned the env, and python-dotenv's load_dotenv() does not override existing env vars by default. Fix: stop parsing .env in the Makefile entirely. Every data-transfer script already calls load_dotenv() at startup (verified across the ingestion/platform/ scripts), and python-dotenv handles dotenv quoting correctly. This means: - Make never sees the secret value, so it can't mangle it. - Shell-level vars still take priority (load_dotenv does not override), so `MAIN_ANNOTATION_LOGIN=foo make ...` still works as an override. - Recipes no longer pre-validate creds; auth fails with a clear HTTP 401 if creds are missing/wrong. Drop the .env include + the now-empty $$VAR check in `export-dataset`, and update the docs (README, .env.example) to explain the new flow. Also reformat pull_sequence_annotations.py and export_dataset.py with `ruff format` so `make lint` passes.

P2 followup: cleanup_duplicate_images.py reads credentials via shared.get_annotation_credentials() but neither file called load_dotenv(), so once Make stopped pre-loading .env this script silently fell back to admin/admin12345 on every invocation. Add load_dotenv() at module level in shared.py — that way every importer (current and future) picks up .env even if it forgets to call load_dotenv() itself — and add a defensive load_dotenv() in cleanup_duplicate_images.py for clarity. P3 followup: clean up four spots that still claimed Make auto-loads .env: - README.md (workflow intro) - annotation_api/docs/data-ingestion-guide.md - root CLAUDE.md (Platform Import section) - annotation_api/Makefile (data-workflow header comment) All now describe the actual behaviour: each script loads .env via python-dotenv at startup, Make does not parse .env. Run ruff format on the two changed Python files.

These two ingestion scripts read credentials from env vars but never called load_dotenv(), so once Make stopped pre-loading .env they fell back to admin/admin12345 by default. Add load_dotenv() at module level and switch export_annotations.py's --username/--password defaults to ANNOTATOR_LOGIN/PASSWORD env vars (matching the pattern used in import_annotations.py and the rest of the platform scripts). ruff format reflowed the long argparse lines.

The page-level _passes_smoke_filter dropped rows whose embedded annotation didn't populate, even though process_sequence would re-fetch and decide. Return True for missing annotations so the worker gets to make the call, and document the prefilter/recheck split.

MateoLostanlen added 13 commits May 1, 2026 06:46

Merge branch 'make_it_faster'

430b93a

Merge origin/main

9df170d

Merge branch 'clean_pipeline'

34d2ff2

MateoLostanlen merged commit 9038267 into main May 1, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean up data pipeline + docs, add batch local YOLO import + FP review#120

Clean up data pipeline + docs, add batch local YOLO import + FP review#120
MateoLostanlen merged 13 commits intomainfrom
cleanup_pipeline_and_docs

MateoLostanlen commented May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MateoLostanlen commented May 1, 2026

Summary

Test plan

Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant