Skip to content

Clean up data pipeline + docs, add batch local YOLO import + FP review#120

Merged
MateoLostanlen merged 13 commits intomainfrom
cleanup_pipeline_and_docs
May 1, 2026
Merged

Clean up data pipeline + docs, add batch local YOLO import + FP review#120
MateoLostanlen merged 13 commits intomainfrom
cleanup_pipeline_and_docs

Conversation

@MateoLostanlen
Copy link
Copy Markdown
Member

Summary

Two threads bundled because they touched overlapping files:

1. Pipeline features & UX

  • Batch import of local YOLO sequence folders paired with sequences.csv (new make import-local-yolo target + batch_import_local_yolo.py).
  • FP review pipeline: pull seq_annotation_done sequences with empty smoke_types, FiftyOne check, push back as annotated (no labels) or needs_manual if a fire was missed.
  • Dataset export: derive a single category per sequence from sequence-level fields and add --category filter; one folder per (category, seq).
  • import_yolo_sequence: --labels-dir-name so non-default label subdirs (e.g. labels_predictions) work.
  • cleanup_duplicate_images.py helper to drop ready sequences whose first frames are dupes.
  • import_all_platforms.sh wrapper that loops creds.json for per-org platform import.

2. .env-based credentials (no more shell export)

  • New annotation_api/.env.example with all expected variables.
  • Makefile auto-loads annotation_api/.env (include + export), so recipes and their subprocesses see the variables without manual source / export.
  • .gitignore: keep .env ignored, allow .env.example, ignore creds.json.
  • README, root CLAUDE.md, and annotation_api/docs/data-ingestion-guide.md updated to point at .env instead of inline export VAR=....

3. Stale-reference & doc cleanup

  • Drop make start-prod / make stop-prod references — those targets don't exist.
  • Drop scripts/setup.sh (was just a Docker prereq check) and the make setup target / its call from make start.
  • Fix Makefile help: clean FP sequences become 'annotated' (not 'managed'); align apply_fiftyone_review.py --fp-mode help with the actual stage.
  • Fix import-local-yolo Usage hint to use real var names (LIMIT_FOLDERS/SKIP_FOLDERS).
  • Note that make pull-fp hardcodes smoke-type=empty.
  • Trim annotation_api/CLAUDE.md: drop the Poetry→uv migration notes, the dated "Recent Enhancements (2024)" changelog, the duplicated "Data Import Scripts" block (which referenced .envrc), and the drifting "192 tests" claims (-130 / +7 lines).
  • Rewrite frontend/README.md and frontend/CLAUDE.md against the actual src/ tree — old docs listed pages and components that no longer exist (AnnotationPage, AnnotationsPage, SequenceDetailPage, SequenceBboxCard, etc.) and pointed at dead links (CONTRIBUTING.md, IMPLEMENTATION_PLAN.md, ../LICENSE). Dev port is 3000, not 5173. (-927 / +195 across the two files.)
  • Delete frontend/src/pages/DetectionAnnotatePage.tsx.backup.
  • Delete review_split_app.py (no longer used).

Test plan

  • cd annotation_api && make help lists targets and matches reality.
  • cp annotation_api/.env.example annotation_api/.env, fill in, then run make export-dataset OUTPUT_DIR=... (no shell export).
  • make pull-fp MAX_SEQUENCES=1 && make visual-check-fp && make apply-review-fp.
  • make import-local-yolo ROOT_DIR=... against a small fixture.
  • Frontend: cd frontend && npm install && npm run dev — UI loads at http://localhost:3000 against a running API.
  • Spot-check that no doc references a removed file/target (search start-prod, make setup, .envrc, IMPLEMENTATION_PLAN).

Notes

  • Branch history includes local merge commits from make_it_faster and clean_pipeline because the work happened across multiple branches. Squash-merging this PR is recommended to give main a single clean commit.
  • After merge, recommend git checkout main && git reset --hard origin/main to drop the now-redundant local merge commits.

…review tools

- Add batch_import_local_yolo.py + Makefile target for parallel import of local YOLO sequence folders paired with sequences.csv.
- Add cleanup_duplicate_images.py to drop ready sequences whose first frames are duplicates.
- Add import_all_platforms.sh wrapper iterating creds.json for per-org platform import.
- Add review_split_app.py Streamlit viewer to compare model fp/wf split against human annotations.
- pull_sequence_annotations: --smoke-type empty to pull FP sequences (empty smoke_types).
- export_dataset: derive a single category per sequence from sequence-level smoke_types/false_positive_types and add --category filter; one folder per (category, seq).
- import_yolo_sequence: --labels-dir-name to support labels_predictions and similar.
- Makefile: switch export-dataset to MAIN_ANNOTATION_LOGIN/PASSWORD env, add CATEGORY pass-through, document FP review and batch import targets.
- README: document FP review pipeline and category filter.
- Add top-level CLAUDE.md.
- Add annotation_api/.env.example documenting all expected variables.
- Makefile auto-loads annotation_api/.env (include + export) so all targets see the variables without manual `source` or `export`.
- Allow .env.example through .gitignore.
- Update README, top-level CLAUDE.md, and annotation_api/docs/data-ingestion-guide.md to point at .env instead of inline `export VAR=...` instructions.
- Drop scripts/setup.sh (was only a docker prerequisite check) and remove `make setup` target + its call from `make start`.
- Drop `make start-prod` / `make stop-prod` references from annotation_api/README.md and annotation_api/CLAUDE.md (those targets do not exist).
- Fix Makefile help: clean FP sequences become 'annotated' (not 'managed'); align apply_fiftyone_review.py --fp-mode help with the actual stage.
- Fix import-local-yolo Usage hint to use real var names (LIMIT_FOLDERS / SKIP_FOLDERS).
- Note that `make pull-fp` hardcodes smoke-type=empty (the SMOKE_TYPE override has no effect there).
- Update Development Setup steps in annotation_api/CLAUDE.md to point at .env.example.
- Add creds.json to .gitignore.
- Remove unused review_split_app.py.
- Drop "Migration Notes (Poetry → uv)" section: dated history, no longer relevant.
- Drop "Recent Enhancements (2024)" changelog section.
- Drop the duplicated "Data Import Scripts" section (kept the "Data Transfer Scripts" one above) — the duplicate referenced .envrc and contradicted the new .env flow.
- Replace ".envrc" / docker-compose example in "Authentication Setup" with a pointer to .env / .env.example.
- Drop test-count claims ("192 tests") that drift on every PR; condense the Testing block.
frontend/README.md and frontend/CLAUDE.md described pages and
components that no longer exist (AnnotationPage, AnnotationsPage,
SequenceDetailPage, SequenceBboxCard, etc.) and pointed at dead
links (CONTRIBUTING.md, IMPLEMENTATION_PLAN.md, ../LICENSE).

Rewrite both from the actual src/ tree:
- Real page list and route paths from App.tsx (dev port 3000, not 5173).
- Real component layout (annotation/, detection-annotation/,
  detection-sequence/, filters/, sequence/, sequence-annotation/,
  sequences/, ui/) and the actual hooks/ + utils/ modules.
- Real stores (useAuthStore + useSequenceStore — useAnnotationStore
  was removed long ago).
- Drop dated "Refactored Architecture (2024)" / "Recent Fixes (2024)"
  sections, the "replacing legacy Streamlit" framing, and "13
  utilities" / line-count claims that drift on every PR.
- Drop dead links to CONTRIBUTING.md, IMPLEMENTATION_PLAN.md, and
  ../LICENSE.
- Delete DetectionAnnotatePage.tsx.backup.
1. pull_sequence_annotations: page until N accepted, not N raw

   The previous fetch_sequences applied --max-sequences before the
   --smoke-type filter, so `make pull-fp MAX_SEQUENCES=20` could grab
   the first 20 seq_annotation_done sequences (almost all of which
   have smoke) and then drop them all in the worker, leaving 0
   sequences to review. Move the smoke-type filter into the page loop
   (using the embedded annotation from include_annotation=True) and
   keep paging until we have N accepted sequences.

2. export_dataset: include confirmed clean FPs

   The /export/detections endpoint already filters to
   processing_stage=annotated, so a row reaching the script is from
   an annotated sequence by construction. Sequences with empty
   smoke_types AND empty false_positive_types are confirmed false
   positives produced by the new FP review workflow — they should be
   exported under the fp/ category, not silently dropped.
   sequence_category_from_row now returns CATEGORY_FP in that case
   instead of None.

3. export-dataset: stop string-interpolating secrets in the recipe

   $(MAIN_ANNOTATION_LOGIN) / $(MAIN_ANNOTATION_PASSWORD) substitution
   happens at Make parse time, so values containing $, spaces, or
   shell metacharacters could be split or expanded by the shell
   before reaching the script. Drop the --username/--password CLI
   flags, let the values propagate through the subprocess environment
   (Makefile already has `include .env; export`), and have
   export_dataset.py default to MAIN_ANNOTATION_LOGIN/PASSWORD env
   vars (with ANNOTATOR_* fallback). The empty-check now uses
   shell-time $$VAR, not Make-time $(VAR).
Make's `include .env` parses dotenv files as Make syntax, which mangles
values containing $, spaces, or quotes — e.g. MAIN_ANNOTATION_PASSWORD
set to `pa$word` ends up as `paord` in the subprocess environment. Even
after dropping the recipe-side $(VAR) interpolation, the mangled value
still poisoned the env, and python-dotenv's load_dotenv() does not
override existing env vars by default.

Fix: stop parsing .env in the Makefile entirely. Every data-transfer
script already calls load_dotenv() at startup (verified across the
ingestion/platform/ scripts), and python-dotenv handles dotenv quoting
correctly. This means:

- Make never sees the secret value, so it can't mangle it.
- Shell-level vars still take priority (load_dotenv does not override),
  so `MAIN_ANNOTATION_LOGIN=foo make ...` still works as an override.
- Recipes no longer pre-validate creds; auth fails with a clear HTTP
  401 if creds are missing/wrong.

Drop the .env include + the now-empty $$VAR check in `export-dataset`,
and update the docs (README, .env.example) to explain the new flow.

Also reformat pull_sequence_annotations.py and export_dataset.py with
`ruff format` so `make lint` passes.
P2 followup: cleanup_duplicate_images.py reads credentials via
shared.get_annotation_credentials() but neither file called load_dotenv(),
so once Make stopped pre-loading .env this script silently fell back to
admin/admin12345 on every invocation. Add load_dotenv() at module level
in shared.py — that way every importer (current and future) picks up
.env even if it forgets to call load_dotenv() itself — and add a
defensive load_dotenv() in cleanup_duplicate_images.py for clarity.

P3 followup: clean up four spots that still claimed Make auto-loads .env:
- README.md (workflow intro)
- annotation_api/docs/data-ingestion-guide.md
- root CLAUDE.md (Platform Import section)
- annotation_api/Makefile (data-workflow header comment)
All now describe the actual behaviour: each script loads .env via
python-dotenv at startup, Make does not parse .env.

Run ruff format on the two changed Python files.
These two ingestion scripts read credentials from env vars but never
called load_dotenv(), so once Make stopped pre-loading .env they fell
back to admin/admin12345 by default. Add load_dotenv() at module level
and switch export_annotations.py's --username/--password defaults to
ANNOTATOR_LOGIN/PASSWORD env vars (matching the pattern used in
import_annotations.py and the rest of the platform scripts).

ruff format reflowed the long argparse lines.
The page-level _passes_smoke_filter dropped rows whose embedded
annotation didn't populate, even though process_sequence would re-fetch
and decide. Return True for missing annotations so the worker gets to
make the call, and document the prefilter/recheck split.
@MateoLostanlen MateoLostanlen merged commit 9038267 into main May 1, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant