Prepare anonymized dataset for Zenodo deposit #259

Merged
MinhHaDuong merged 2 commits into main from feat/dataset-release
Feb 10, 2026
Conversation

@MinhHaDuong
Collaborator

Summary

  • Enriched anonymization pipeline: replaces PII with analytically useful derived fields (anonymous session IDs, geographic origin classification, device class) instead of simply stripping them
  • All Makefile recipes now pass the `LOGS_ROOT` env var, so `make LOGS_ROOT=reports/monitor-logs-anon analysis` reproduces the figures and tables from the anonymized archive alone
  • Widened glob patterns in `tabulate_tokens.py` and `tabulate_articles.py` to match anonymized filenames
  • Regenerated the anonymized ZIP archive with enriched data
  • Added CITATION.cff and dataset README generation
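
The enrichment described in the first bullet could look roughly like the sketch below. It is not the code in `anonymize_monitor_logs.py`: the class and function names, the sequential numbering scheme, and the three device categories are all assumptions made for illustration. The `server_context` origin classification is left out because its rules are not described in this PR.

```python
import re
from typing import Any


class SessionAnonymizer:
    """Map raw session IDs to stable anon_NNNN labels (hypothetical helper)."""

    def __init__(self) -> None:
        self._mapping: dict[str, str] = {}

    def anonymize(self, session_id: str) -> str:
        # The first sighting gets the next sequential label; repeats reuse it,
        # so events from one session stay grouped after anonymization.
        if session_id not in self._mapping:
            self._mapping[session_id] = f"anon_{len(self._mapping) + 1:04d}"
        return self._mapping[session_id]


def device_class(user_agent: str) -> str:
    """Collapse a user-agent string into a coarse class (assumed categories)."""
    ua = user_agent.lower()
    # Tablets are checked first because iPad strings also contain "Mobile".
    if re.search(r"ipad|tablet", ua):
        return "tablet"
    if re.search(r"mobi|iphone|android", ua):
        return "mobile"
    return "desktop"


def anonymize_record(record: dict[str, Any], sessions: SessionAnonymizer) -> dict[str, Any]:
    """Return a copy of the record with PII fields replaced by derived values."""
    out = dict(record)
    if "sessionId" in out:
        out["sessionId"] = sessions.anonymize(out["sessionId"])
    if "userAgent" in out:
        out["userAgent"] = device_class(out["userAgent"])
    return out
```

Keeping a derived value in place of each PII field, rather than deleting the field, is what lets per-session and per-device analyses run unchanged on the anonymized archive.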

Test plan

  • `make test`: 4/4 regression figures byte-identical to reference (raw logs)
  • `make LOGS_ROOT=reports/monitor-logs-anon analysis`: full pipeline succeeds on anonymized data (259 sessions, 394 responses, 411 articles)
  • Pre-commit hooks pass (ruff, mypy)
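
A byte-identical check like the one in the first test item can be sketched by hashing each reference figure and its regenerated counterpart. This is only an illustration of the idea, not the project's `make test` implementation; the function names and the directory layout (one flat folder per figure set) are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large figures are not loaded in one read."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_figures(generated: Path, reference: Path) -> list[str]:
    """Return names of reference files that are missing or differ in `generated`."""
    diffs = []
    for ref in sorted(reference.iterdir()):
        if not ref.is_file():
            continue
        candidate = generated / ref.name
        if not candidate.is_file() or sha256_of(candidate) != sha256_of(ref):
            diffs.append(ref.name)
    return diffs
```

An empty list from `changed_figures` corresponds to the "4/4 byte-identical" outcome reported above.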

🤖 Generated with Claude Code

MinhHaDuong and others added 2 commits February 10, 2026 16:32
- Add write_readme() to anonymize_monitor_logs.py so the ZIP
  includes a self-contained README with coverage, schema, event
  types, anonymization details, license, and citation info
- Add CITATION.cff for software/dataset citation
- Add `make dataset` target to regenerate the archive
- Regenerate ZIP (now dated 20260210, replaces 20251218)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enrich anonymization to preserve analytical shape: sessionId → anon_NNNN,
server_context → {origin: classification}, userAgent → device class.
Pass LOGS_ROOT env var through all Makefile recipes so the pipeline
works identically on both raw logs and the anonymized archive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
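
On the script side, honoring `LOGS_ROOT` might look like the sketch below: read the variable with a fallback, then search the chosen directory with a pattern loose enough to match both raw and anonymized filenames. The fallback directory, the default glob pattern, and both function names are assumptions; the actual logic in `tabulate_tokens.py` and `tabulate_articles.py` is not shown in this PR.

```python
import os
from pathlib import Path


def logs_root(default: str = "logs") -> Path:
    """Resolve the log directory from LOGS_ROOT (the default is an assumption)."""
    return Path(os.environ.get("LOGS_ROOT", default))


def find_logs(root: Path, pattern: str = "*.json*") -> list[Path]:
    """Recursively collect log files under root, sorted for reproducible order."""
    return sorted(root.rglob(pattern))
```

Because the Makefile passes the variable through to every recipe, the same scripts can be pointed at either the raw logs or `reports/monitor-logs-anon` without code changes.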
@MinhHaDuong MinhHaDuong merged commit 6be4734 into main Feb 10, 2026
5 checks passed
@MinhHaDuong MinhHaDuong deleted the feat/dataset-release branch February 10, 2026 16:03