RSS aggregator that fetches feeds, deduplicates them using semantic similarity, and generates a markdown digest posted daily to r/Guadeloupe.
mise is used for tool version management (Python + PDM).
mise install # install Python and PDM
pdm install # install Python dependencies
src/rss_summary/ # main package
aggregate.py # CLI: pdm run aggregate-rss
weekly.py # CLI: pdm run weekly-digest
post_to_reddit.py # CLI: pdm run post-to-reddit
similarity.py # semantic deduplication via sentence-transformers
classification.py # thematic classification via trained LinearSVC head
formatting.py # markdown table generation
parsing.py # HTML parsing and image extraction
last_run.py # .last-run timestamp persistence
classifier/
train.py # offline: train LinearSVC head on data/themes.json
infer.py # offline: batch classify a daily feed file for evaluation
tests/
data/
rss_list.txt # RSS feed URLs (one per line)
taxonomy.toml # ordered theme names (10 themes) used for digest section ordering
themes.json # labeled training examples (one entry per theme)
classifier_head.joblib # trained classifier head (committed, ~800KB)
classifier_eval.json # last cross-validation evaluation results
feed.md # latest daily digest
feed-YYYY-MM-DD.md # dated archive copies
weekly-wXX-prose.md # weekly prose digest (Mistral-generated)
weekly-wXX-review.md # taxonomy review report
.last-run # last successful run timestamp (committed)
Fetches RSS feeds and writes a deduplicated markdown digest.
pdm run aggregate-rss [RSS_FILE] [OUTPUT_FILE] [OPTIONS]
Arguments (optional):
RSS_FILE RSS feed list [default: data/rss_list.txt]
OUTPUT_FILE Output markdown file [default: data/feed.md]
Options:
--with-images Add image preview column to the table
--classify Group output by thematic taxonomy
--taxonomy PATH Taxonomy TOML config [default: data/taxonomy.toml]
--until DATETIME Upper date bound for articles (YYYY-MM-DD or YYYY-MM-DD HH:MM:SS)
--dry-run Run without updating .last-run
--restore Restore .last-run from backup and exit
--summarize Prepend a Mistral-generated prose summary (requires MISTRAL_API_KEY)
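The two formats accepted by --until can be handled with a small parser. This is a minimal sketch, not the project's actual code: the function name `parse_until` is hypothetical, and reading a bare date as midnight of that day is an assumption (the real tool may treat it as end of day).

```python
from datetime import datetime

# Formats listed for --until in the option description above.
UNTIL_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_until(value: str) -> datetime:
    """Parse an --until value (hypothetical helper).

    Assumption: a bare date is interpreted as 00:00:00 on that day.
    """
    for fmt in UNTIL_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next accepted format
    raise ValueError(f"unsupported --until value: {value!r}")
```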
Deduplication uses a two-stage pipeline:
- Fuzzy title match via difflib.SequenceMatcher (threshold 0.85)
- Semantic similarity via BAAI/bge-m3 (threshold 0.75)
The model is downloaded automatically on first run and cached in ~/.cache/huggingface.
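The two-stage pipeline can be sketched as follows. This is an illustration under stated assumptions rather than the code in similarity.py: the function names are hypothetical, titles are lowercased before comparison, a score equal to the threshold counts as a duplicate, and the embedding model is passed in as a plain `embed` callable so the sketch runs without downloading BAAI/bge-m3.

```python
from difflib import SequenceMatcher
from math import sqrt
from typing import Callable, List, Sequence

TITLE_THRESHOLD = 0.85     # fuzzy title ratio, from the text above
SEMANTIC_THRESHOLD = 0.75  # cosine similarity, from the text above


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(
    titles: Sequence[str],
    embed: Callable[[str], Sequence[float]],
) -> List[str]:
    """Keep the first occurrence of each story (hypothetical helper)."""
    kept: List[str] = []
    kept_vectors: List[Sequence[float]] = []
    for title in titles:
        # Stage 1: cheap fuzzy match on the title text.
        if any(
            SequenceMatcher(None, title.lower(), k.lower()).ratio() >= TITLE_THRESHOLD
            for k in kept
        ):
            continue
        # Stage 2: semantic match on embeddings, only for stage-1 survivors.
        vector = embed(title)
        if any(cosine(vector, kv) >= SEMANTIC_THRESHOLD for kv in kept_vectors):
            continue
        kept.append(title)
        kept_vectors.append(vector)
    return kept
```

In real use, `embed` would presumably wrap the sentence-transformers model, for example `SentenceTransformer("BAAI/bge-m3").encode`. Running the cheap string comparison first means the encoder is only called for titles that are not already near-identical to a kept one.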
Classification (enabled with --classify): uses a trained LinearSVC head on concatenated BAAI/bge-m3 + multilingual-e5-large-instruct embeddings (2048-dim, ~80% accuracy / 0.82 macro F1, 10 themes). The head is stored in data/classifier_head.joblib and committed — no retraining needed on first clone.
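To make the "linear head on concatenated embeddings" idea concrete, here is a dependency-free sketch of what inference amounts to. It is not the code in classification.py: the function names are hypothetical, the vectors are toy-sized (the real features would be 1024 + 1024 = 2048 dimensions), and it assumes the usual one-vs-rest form of a multi-class linear classifier, where the predicted theme is the one with the highest score w·x + b.

```python
from typing import List, Sequence


def concat_embeddings(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Join the two encoder outputs into one feature vector."""
    return list(a) + list(b)


def predict_theme(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],  # one weight row per theme
    biases: Sequence[float],             # one bias per theme
    themes: Sequence[str],
) -> str:
    """Return the theme whose linear score w.x + b is highest."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + bias
        for row, bias in zip(weights, biases)
    ]
    return themes[scores.index(max(scores))]
```

Because both encoders are frozen, only the weight rows and biases would change when the head is retrained.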
Last-run tracking: the timestamp of the last successful run is stored in .last-run. Only entries published after that timestamp are included in the next digest.
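The last-run mechanism can be sketched as a small read/write/filter trio. This is a sketch under assumptions, not the contents of last_run.py: the helper names are hypothetical, the timestamp is assumed to be stored as an ISO 8601 string, and entries are modelled as (title, published) tuples.

```python
from datetime import datetime
from pathlib import Path
from typing import Iterable, List, Optional, Tuple

Entry = Tuple[str, datetime]  # (title, published), illustrative shape only


def read_last_run(path: Path) -> Optional[datetime]:
    """Return the stored timestamp, or None if there is no previous run."""
    if not path.exists():
        return None
    return datetime.fromisoformat(path.read_text().strip())


def write_last_run(path: Path, when: datetime) -> None:
    """Persist the timestamp (assumed ISO 8601 storage format)."""
    path.write_text(when.isoformat())


def select_new(
    entries: Iterable[Entry],
    last_run: Optional[datetime],
    until: Optional[datetime] = None,
) -> List[Entry]:
    """Keep entries published after last_run and, if given, up to `until`."""
    selected = []
    for title, published in entries:
        if last_run is not None and published <= last_run:
            continue  # already covered by a previous run
        if until is not None and published > until:
            continue  # beyond the --until upper bound
        selected.append((title, published))
    return selected
```

Under this sketch, a --dry-run would simply skip the call to `write_last_run`.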
Summary (enabled with --summarize): calls Mistral once to generate a 100–150 word neutral prose overview of the day's articles. The output is structured as ## En bref (prose) followed by ## Plus en détails (the article table). Enabled by default in the CI daily workflow via MISTRAL_API_KEY.
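The output layout described above can be assembled with a few lines of string handling. This is a minimal sketch: `assemble_digest` and `within_word_budget` are hypothetical helpers, and checking the 100–150 word range after generation is an assumption about how one might validate the model's answer, not something the source states.

```python
def within_word_budget(summary: str, low: int = 100, high: int = 150) -> bool:
    """Check the whitespace-separated word count against the target range."""
    return low <= len(summary.split()) <= high


def assemble_digest(summary: str, table: str) -> str:
    """Place the prose summary first, then the article table."""
    return (
        "## En bref\n\n"
        f"{summary.strip()}\n\n"
        "## Plus en détails\n\n"
        f"{table.strip()}\n"
    )
```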
Clusters and classifies a full week of daily digests into a Mistral-generated prose digest.
pdm run weekly-digest [OPTIONS]
Options:
--week INT ISO week number [default: current week]
--year INT Year for --week [default: current year]
--data-dir PATH Directory with daily files [default: data]
--output-dir PATH Output directory [default: data]
--taxonomy PATH Taxonomy TOML config [default: data/taxonomy.toml]
--top-per-theme INT Max clusters per theme used [default: 2]
--min-days INT Min daily files required [default: 7]
--suggest Also write taxonomy review report
--enrich-review With --suggest: append Mistral theme suggestions for problematic clusters
--apply-suggestions With --enrich-review: write suggestions into themes.json automatically
By default, requires all 7 daily feed-YYYY-MM-DD.md files for the target week; the minimum is set by --min-days. Always outputs data/weekly-wXX-prose.md, a flowing editorial text generated by Mistral. With --suggest, also writes data/weekly-wXX-review.md.
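The daily-file requirement can be checked by enumerating the seven dates of the ISO week and looking for the matching files. This is a sketch with hypothetical helper names; it relies on `date.fromisocalendar` (Python 3.8+) and on the feed-YYYY-MM-DD.md naming shown above.

```python
from datetime import date, timedelta
from pathlib import Path
from typing import List


def week_dates(year: int, week: int) -> List[date]:
    """Return Monday through Sunday of the given ISO week."""
    monday = date.fromisocalendar(year, week, 1)
    return [monday + timedelta(days=offset) for offset in range(7)]


def find_daily_files(data_dir: Path, year: int, week: int) -> List[Path]:
    """List the dated digest files that exist for that week."""
    candidates = [data_dir / f"feed-{d.isoformat()}.md" for d in week_dates(year, week)]
    return [path for path in candidates if path.exists()]


def has_enough_days(data_dir: Path, year: int, week: int, min_days: int = 7) -> bool:
    """Mirror of the --min-days check: enough daily files to build a digest?"""
    return len(find_daily_files(data_dir, year, week)) >= min_days
```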
--enrich-review appends a ## Suggestions Mistral section to the review file with a theme recommendation, a paste-ready themes.json example string, and a one-sentence justification per problematic cluster. --apply-suggestions writes those suggestions directly into data/themes.json; if Mistral suggests a brand-new theme (not in the taxonomy), a GitHub issue is opened automatically.
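The branching behind --apply-suggestions can be sketched as below. Everything about the data shapes here is an assumption: the suggestion is modelled as a dict with hypothetical `label` and `example` keys, and "brand-new theme" is approximated as "label not present in the loaded themes.json entries". The function only reports that a theme is new; opening the GitHub issue is left to the caller.

```python
from typing import Dict, List


def apply_suggestion(themes: List[dict], suggestion: Dict[str, str]) -> bool:
    """Append the suggested example to its theme, in place.

    Returns True if an existing theme was matched, False if the suggested
    label is unknown (the caller would then open an issue instead).
    """
    for entry in themes:
        if entry["label"] == suggestion["label"]:
            # Avoid adding the same training example twice.
            if suggestion["example"] not in entry["examples"]:
                entry["examples"].append(suggestion["example"])
            return True
    return False
```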
Requires MISTRAL_API_KEY to be set. The CI workflow passes it via the MISTRAL_API_KEY repository secret.
Posts a markdown feed to r/Guadeloupe via Playwright (Firefox). Runs locally only — not suitable for CI due to IP/fingerprint detection.
pdm run post-to-reddit [OPTIONS]
Options:
--feed-file PATH Local markdown file to post [default: data/feed.md]
--feed-url TEXT URL to fetch the markdown from (e.g. raw GitHub URL)
--feed-file and --feed-url are mutually exclusive.
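Mutual exclusivity of the two options can be expressed directly in the argument parser. The sketch below uses the standard-library argparse for illustration; the project's CLI may well be built with a different library, and `build_parser` and `resolve_source` are hypothetical names.

```python
import argparse
from typing import Tuple

DEFAULT_FEED_FILE = "data/feed.md"  # default from the option list above


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="post-to-reddit")
    # argparse rejects the command line if both options in the group are given.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--feed-file", default=None, help="local markdown file to post")
    source.add_argument("--feed-url", default=None, help="URL to fetch the markdown from")
    return parser


def resolve_source(args: argparse.Namespace) -> Tuple[str, str]:
    """Return ("url", value) or ("file", value), falling back to the default file."""
    if args.feed_url:
        return ("url", args.feed_url)
    return ("file", args.feed_file or DEFAULT_FEED_FILE)
```

The default is applied in `resolve_source` rather than in `add_argument`, so the parser sees only what the user actually typed.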
Requires a .env file with Reddit credentials (see .env.example).
The trained head is committed and ready to use. Retrain only when you update data/themes.json.
data/taxonomy.toml controls the order of sections in the weekly digest output — changing it does not require retraining.
1. Update training examples (data/themes.json) — add labeled examples for any new or changed theme. Format:
[
{
"theme": "Theme display name",
"label": "theme_slug",
"description": "...",
"examples": ["Article title | Summary excerpt", ...]
}
]
2. Train:
pdm run python classifier/train.py
# outputs: data/classifier_head.joblib, data/classifier_eval.json
3. Evaluate on a real feed file:
pdm run python classifier/infer.py data/feed-YYYY-MM-DD.md
4. Commit the updated head:
git add data/classifier_head.joblib data/classifier_eval.json data/themes.json
git commit -m "feat(classifier): retrain with updated examples"
The head is ~800KB and safe to commit. Both BAAI/bge-m3 and multilingual-e5-large-instruct are frozen encoders; only the LinearSVC head is trained. Both models are cached in CI.
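Before retraining, it can help to sanity-check data/themes.json against the format from step 1. The sketch below validates the four keys shown there and flattens the file into (text, label) pairs. `load_training_pairs` is a hypothetical helper, and whether classifier/train.py consumes the data this way is an assumption.

```python
import json
from typing import List, Tuple

REQUIRED_KEYS = {"theme", "label", "description", "examples"}


def load_training_pairs(raw_json: str) -> List[Tuple[str, str]]:
    """Turn themes.json content into (example text, label) pairs."""
    pairs: List[Tuple[str, str]] = []
    for entry in json.loads(raw_json):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"theme entry is missing keys: {sorted(missing)}")
        # Each example is an "Article title | Summary excerpt" string.
        pairs.extend((example, entry["label"]) for example in entry["examples"])
    return pairs
```

Typical use would be `load_training_pairs(Path("data/themes.json").read_text())`, with `Path` imported from pathlib.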
pdm run pytest
pdm run pytest --cov=rss_summary # with coverage report