Rework analysis pipeline: modular corpus tools & regression tests #257

Merged
MinhHaDuong merged 16 commits into main from feat/corpus
Feb 10, 2026

@MinhHaDuong
Collaborator

Summary

  • Modularized analysis: replaced monolithic analyze_monitor_logs.py and dashboard code with focused modules — classifier.py, tabulate_articles.py, tabulate_tokens.py, tabulate_queries.py, describe_sessions.py
  • New Makefile: builds figures, CSVs, and stats via make analysis; includes make test for byte-level regression checks against published reference figures
  • Infra cleanup: relocated data/ out of repo (symlink), added anonymized monitor-logs archive, removed unused dashboard spec and config

Test plan

  • make clean-figures clean-csv clean-stats && make analysis — all outputs regenerate without errors
  • make test — all 4 figures byte-identical to reference (~/CNRS/papiers/sent/CIRED.digital final report/fig/)
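
For reviewers who want to reproduce the byte-level check outside of make, here is a minimal Python sketch of the idea. It is not the Makefile's actual recipe, which this description does not show: the function name `compare_figures` and its arguments are hypothetical, and the directories and file names are placeholders to fill in.

```python
import filecmp
from pathlib import Path


def compare_figures(generated_dir, reference_dir, names):
    """Return the names that are missing or not byte-identical.

    Hypothetical helper: the real `make test` target may work differently.
    """
    mismatched = []
    for name in names:
        generated = Path(generated_dir) / name
        reference = Path(reference_dir) / name
        if not generated.is_file() or not reference.is_file():
            mismatched.append(name)
            continue
        # shallow=False compares file contents rather than
        # trusting size and modification time from os.stat().
        if not filecmp.cmp(generated, reference, shallow=False):
            mismatched.append(name)
    return mismatched
```

An empty return value means every listed figure matches its reference exactly. Any rendering change, including embedded metadata such as timestamps, would show up as a mismatch, which is the point of a byte-level regression check.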

🤖 Generated with Claude Code

MinhHaDuong and others added 16 commits November 5, 2025 21:20
- Implement anonymize_monitor_logs.py CLI to produce privacy-safe dataset
  - Removes PII: server_context (IPs), sessionId, userAgent, headers, profile
  - Skips userProfile events entirely
  - Preserves timestamps, queries, responses for analysis
  - Outputs to reports/monitor-logs-anon/ with metadata and checksums
  - Supports optional zip packaging for redistribution

- Include sample anonymized archive (6.7M) with CC BY 4.0 license
  - Track via Git LFS for large files

- Update .gitignore to whitelist monitor-logs-anon*.zip
- Add .gitattributes for LFS zip tracking

- Update analysis README with anonymization usage docs
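
The anonymization rules listed in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code of anonymize_monitor_logs.py: the field names come from the message above, but the record layout (a dict per event, with the event type stored under a `type` key) and the function names `strip_pii` and `anonymize_events` are hypothetical.

```python
# Field names are taken from the commit message; where they appear
# inside a record is an assumption, so they are removed at any depth.
PII_KEYS = {"server_context", "sessionId", "userAgent", "headers", "profile"}

# Event types dropped entirely, per the commit message.
SKIPPED_EVENT_TYPES = {"userProfile"}


def strip_pii(value):
    """Return a copy of value with PII keys removed from nested dicts."""
    if isinstance(value, dict):
        return {k: strip_pii(v) for k, v in value.items() if k not in PII_KEYS}
    if isinstance(value, list):
        return [strip_pii(item) for item in value]
    return value


def anonymize_events(events, type_key="type"):
    """Yield anonymized copies of events.

    `type_key` is a hypothetical field name for the event type.
    """
    for event in events:
        if event.get(type_key) in SKIPPED_EVENT_TYPES:
            continue  # skip userProfile events entirely
        yield strip_pii(event)
```

Timestamps, queries, and responses pass through untouched because only the listed keys are filtered. The input records are not modified, so the raw logs stay intact for anyone with access to them.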
The exploratory notebook has been fully superseded by the
modular analysis scripts in src/analysis/. It also had
undefined references that broke the ruff CI check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MinhHaDuong MinhHaDuong merged commit 0b0c035 into main Feb 10, 2026
5 checks passed
@MinhHaDuong MinhHaDuong deleted the feat/corpus branch February 10, 2026 15:11