Rework analysis pipeline: modular corpus tools & regression tests by MinhHaDuong · Pull Request #257 · CIRED/cired.digital

MinhHaDuong · 2026-02-10T15:04:29Z

Summary

Modularized analysis: replaced monolithic analyze_monitor_logs.py and dashboard code with focused modules — classifier.py, tabulate_articles.py, tabulate_tokens.py, tabulate_queries.py, describe_sessions.py
New Makefile: builds figures, CSVs, and stats via make analysis; includes make test for byte-level regression checks against published reference figures
Infra cleanup: relocated data/ out of repo (symlink), added anonymized monitor-logs archive, removed unused dashboard spec and config

Test plan

make clean-figures clean-csv clean-stats && make analysis — all outputs regenerate without errors
make test — all 4 figures byte-identical to reference (~/CNRS/papiers/sent/CIRED.digital final report/fig/)
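The byte-level check behind make test could look roughly like the sketch below. It is not the Makefile's actual rule, which is not shown here. The function name, the two directory arguments, and the `*.png` pattern are assumptions made for illustration.

```python
import filecmp
from pathlib import Path


def find_mismatches(generated_dir: Path, reference_dir: Path, pattern: str = "*.png") -> list[str]:
    """Return names of reference figures that are missing or differ byte-for-byte.

    Hypothetical helper: the directory layout and the file pattern are assumptions.
    """
    mismatches = []
    for reference in sorted(reference_dir.glob(pattern)):
        candidate = generated_dir / reference.name
        # shallow=False forces a content comparison instead of trusting os.stat() metadata.
        if not candidate.is_file() or not filecmp.cmp(reference, candidate, shallow=False):
            mismatches.append(reference.name)
    return mismatches
```

An empty list means every reference figure was reproduced exactly. A byte-level comparison is strict by design: any change in the plotting library's output or in embedded metadata would show up as a mismatch, even if the figure looks the same.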

🤖 Generated with Claude Code

- Implement anonymize_monitor_logs.py CLI to produce privacy-safe dataset
- Removes PII: server_context (IPs), sessionId, userAgent, headers, profile
- Skips userProfile events entirely
- Preserves timestamps, queries, responses for analysis
- Outputs to reports/monitor-logs-anon/ with metadata and checksums
- Supports optional zip packaging for redistribution
- Include sample anonymized archive (6.7M) with CC BY 4.0 license
- Track via Git LFS for large files
- Update .gitignore to whitelist monitor-logs-anon*.zip
- Add .gitattributes for LFS zip tracking
- Update analysis README with anonymization usage docs
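As a rough illustration of the scrubbing rules in this commit message, here is a minimal sketch. It is not the code of anonymize_monitor_logs.py. The field names are taken from the message, but the event layout, the `eventType` key, and both function names are assumptions.

```python
# Field names come from the commit message; where they sit inside an event is an assumption.
PII_KEYS = {"server_context", "sessionId", "userAgent", "headers", "profile"}
SKIPPED_EVENT_TYPES = {"userProfile"}


def strip_pii(value):
    """Recursively rebuild dicts and lists without any PII key."""
    if isinstance(value, dict):
        return {k: strip_pii(v) for k, v in value.items() if k not in PII_KEYS}
    if isinstance(value, list):
        return [strip_pii(v) for v in value]
    return value


def anonymize_event(event: dict, type_key: str = "eventType"):
    """Return a scrubbed copy of the event, or None when the event must be skipped.

    type_key is a hypothetical name for the field that holds the event type.
    """
    if event.get(type_key) in SKIPPED_EVENT_TYPES:
        return None
    return strip_pii(event)
```

Removing keys at every nesting level is a conservative choice: it still catches a field such as userAgent if it moves inside a payload. Timestamps, queries, and responses are untouched because they are not in the key set.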

MinhHaDuong and others added 16 commits November 5, 2025 21:20

Modularize the classifier and use it to augment the dataframe from the start.

077e013

Add session length and duration distribution analysis

38a53f4

Implement IP and user agent classification in logloader

9530f95

Add tabulation of user queries with CSV output

32e4000

Report monthly usage.

8c247a5

feat: enhance augment_dataframe to improve data extraction and normalization

6f28c58

Save figures under reports/analysis/

fc4c07e

feat: implement persistent PTR cache for IP resolution

80e84dd

Output to reports/analysis

b936ce8

Delete unused.

a2e9210

feat: add statistics generation for event and session summaries

d078470

Update after chore:

ccc656c

Relocate data/ out of repo

c019738

feat: add regression tests for reference figures in Makefile

22683ea

Remove obsolete EDA notebook

694bfc6

The exploratory notebook has been fully superseded by the modular analysis scripts in src/analysis/. It also had undefined references that broke the ruff CI check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
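Commit 80e84dd in the list above adds a persistent PTR cache for IP resolution. The sketch below shows one way such a cache can work, assuming a JSON file as the store. The class name, the file format, and the injectable lookup function are illustrative assumptions, not the repository's implementation.

```python
import json
import socket
from pathlib import Path


def system_ptr_lookup(ip: str) -> str:
    """Reverse-resolve an IP with the system resolver; empty string when it fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are subclasses of OSError
        return ""


class PtrCache:
    """Hypothetical on-disk cache mapping IP addresses to PTR host names."""

    def __init__(self, path: Path, lookup=system_ptr_lookup):
        self.path = path
        self.lookup = lookup
        # Reload earlier results so that repeated runs skip the DNS round trip.
        self.entries = json.loads(path.read_text()) if path.exists() else {}

    def resolve(self, ip: str) -> str:
        if ip not in self.entries:
            self.entries[ip] = self.lookup(ip)
            # Write after every new lookup so an interrupted run keeps its progress.
            self.path.write_text(json.dumps(self.entries, indent=2, sort_keys=True))
        return self.entries[ip]
```

Failed lookups are cached as empty strings, so unresolvable addresses are not retried on every run. Passing the lookup function as a parameter keeps the cache testable without network access.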

MinhHaDuong merged commit 0b0c035 into main Feb 10, 2026
5 checks passed

MinhHaDuong deleted the feat/corpus branch February 10, 2026 15:11

MinhHaDuong merged 16 commits into main from feat/corpus