Rework analysis pipeline: modular corpus tools & regression tests #257
Merged
MinhHaDuong merged 16 commits into main on Feb 10, 2026
- Implement anonymize_monitor_logs.py CLI to produce privacy-safe dataset
  - Removes PII: server_context (IPs), sessionId, userAgent, headers, profile
  - Skips userProfile events entirely
  - Preserves timestamps, queries, responses for analysis
  - Outputs to reports/monitor-logs-anon/ with metadata and checksums
  - Supports optional zip packaging for redistribution
- Include sample anonymized archive (6.7M) with CC BY 4.0 license
- Track via Git LFS for large files
- Update .gitignore to whitelist monitor-logs-anon*.zip
- Add .gitattributes for LFS zip tracking
- Update analysis README with anonymization usage docs
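The scrubbing rules listed in this commit message can be illustrated with a minimal sketch. It is not the code of anonymize_monitor_logs.py: the PII field names and the userProfile event name come from the commit message, while the event layout (dicts with a "type" key, PII possibly nested) and the function names `scrub` and `anonymize_events` are assumptions made for illustration.

```python
# Field names are taken from the commit message; the event layout is assumed.
PII_KEYS = {"server_context", "sessionId", "userAgent", "headers", "profile"}
SKIPPED_EVENT_TYPES = {"userProfile"}


def scrub(value):
    """Recursively drop PII keys from nested dicts and lists (returns a copy)."""
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items() if k not in PII_KEYS}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    return value


def anonymize_events(events):
    """Yield scrubbed copies of events, skipping userProfile events entirely."""
    for event in events:
        # Assumption: the event kind is stored under a "type" key.
        if event.get("type") in SKIPPED_EVENT_TYPES:
            continue
        yield scrub(event)
```

Because `scrub` builds new dicts and lists rather than editing in place, the raw log records stay untouched and everything not listed as PII, such as timestamps, queries and responses, passes through unchanged.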
The exploratory notebook has been fully superseded by the modular analysis scripts in src/analysis/. It also had undefined references that broke the ruff CI check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Replaced `analyze_monitor_logs.py` and dashboard code with focused modules: `classifier.py`, `tabulate_articles.py`, `tabulate_tokens.py`, `tabulate_queries.py`, `describe_sessions.py`
- `make analysis`; includes `make test` for byte-level regression checks against published reference figures
- Moved `data/` out of repo (symlink), added anonymized monitor-logs archive, removed unused dashboard spec and config

Test plan
- `make clean-figures clean-csv clean-stats && make analysis`: all outputs regenerate without errors
- `make test`: all 4 figures byte-identical to reference (~/CNRS/papiers/sent/CIRED.digital final report/fig/)

🤖 Generated with Claude Code
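A byte-level regression check of the kind described in the test plan could look roughly like the sketch below. It is not the repository's `make test` target: the helper names `sha256_of` and `find_mismatches`, the `*.png` pattern, and the directory arguments are all hypothetical, chosen only to show the idea of comparing regenerated figures against reference copies by content hash.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(generated_dir, reference_dir, pattern="*.png"):
    """Return names of reference files that are missing or differ byte-for-byte.

    The "*.png" default is an assumption; use whatever format the figures have.
    """
    mismatches = []
    for reference in sorted(Path(reference_dir).glob(pattern)):
        candidate = Path(generated_dir) / reference.name
        if not candidate.is_file() or sha256_of(candidate) != sha256_of(reference):
            mismatches.append(reference.name)
    return mismatches
```

An empty list means every reference figure was reproduced exactly. A check this strict only passes if figure generation is deterministic, so any timestamp or random seed embedded in the output files would need to be pinned.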