Changes from all commits (46 commits)
a87e641
deps(ci): Bump actions/setup-python from 5 to 6 (#1)
dependabot[bot] Apr 17, 2026
84dd0ab
deps(python): Update playwright requirement from >=1.40 to >=1.58.0 (#4)
dependabot[bot] Apr 17, 2026
1b4184d
deps(python): Update imagehash requirement from >=4.3 to >=4.3.2 (#5)
dependabot[bot] Apr 17, 2026
6489189
deps(ci): bump actions/checkout v4→v6, actions/upload-artifact v4→v7
prezis Apr 17, 2026
4eaa07a
deps(python): Update faster-whisper requirement from >=1.0 to >=1.2.1…
dependabot[bot] Apr 17, 2026
5119b18
deps(python): Update pytest requirement from >=7.0 to >=9.0.3 (#7)
dependabot[bot] Apr 17, 2026
a9db48f
deps(python): bump Pillow >=10.0 → >=12.2.0
prezis Apr 17, 2026
7a46d33
fix: 3 CRITICAL + 4 HIGH security/correctness bugs found by code revi…
prezis Apr 17, 2026
831b62b
docs(next-session): GitHub deep-analyzer spec — 20-source rubric, 16-…
prezis Apr 18, 2026
6aca279
docs(reports): add 2026-04-18 dependency vulnerability audit
prezis Apr 18, 2026
740d61a
feat: Add initial GitHub repo trust analysis implementation
prezis Apr 18, 2026
d1b9e51
chore(auto): update social_db.py,test_github_analyzer_cache.py
prezis Apr 18, 2026
a0421e0
feat: Add initial GitHub API client implementation
prezis Apr 18, 2026
2667e42
feat: Add initial scoring heuristics for GitHub repo analysis
prezis Apr 18, 2026
500d180
feat: Add support for multiple external platforms in github_analyzer men
prezis Apr 18, 2026
8f8e79a
chore(auto): update semantic.py,test_github_semantic.py
prezis Apr 18, 2026
3158e70
chore(auto): update trending.py,test_github_trending.py
prezis Apr 18, 2026
b96f38e
feat: Synthesize verdict and rationale for GitHub repo trust assessment
prezis Apr 18, 2026
0753af8
chore(auto): update __main__.py,cli.py,core.py
prezis Apr 18, 2026
22c19c0
chore(auto): update test_github_pipeline_e2e.py
prezis Apr 18, 2026
e865685
chore(auto): update CHANGELOG.md,README.md,pyproject.toml
prezis Apr 18, 2026
be1204e
checkpoint (00:18): session crash protection — devto.py,hn.py,reddit.…
prezis Apr 18, 2026
fb9fbe2
feat: Add safe_float to handle numeric coercion quirks in Reddit respons
prezis Apr 18, 2026
515ebe5
feat(telemetry): add --log-verdict flag + verdicts.jsonl corpus build…
prezis Apr 22, 2026
ee94e20
The provided text appears to be a large list of URLs (web addresses) for
prezis Apr 22, 2026
3040fbf
chore: add .mailmap for canonical author attribution
prezis Apr 22, 2026
ce809a2
docs: add API endpoint discovery runbook
prezis Apr 24, 2026
217e7ea
chore(auto): update 2026-04.jsonl,__init__.py,_output.py
prezis Apr 25, 2026
6e846ec
chore(auto): update 2026-04.jsonl,nhtsa.py
prezis Apr 25, 2026
010c269
chore(auto): update 2026-04.jsonl,__init__.py,_http.py
prezis Apr 25, 2026
62af291
docs(bmw_corpus): README skill doc + initial JSONL outputs
prezis Apr 25, 2026
2a11eb1
docs(bmw_corpus): add Future Paths section to README
prezis Apr 25, 2026
f6df0e9
chore(auto): update 2026-04.jsonl,2026-04.jsonl.processed,2026-04.jsonl
prezis Apr 25, 2026
19c1bde
chore(auto): update 2026-04.jsonl.processed
prezis Apr 25, 2026
453ad8e
fix(sqlite): apply full WAL hygiene PRAGMA stack to all 3 connection …
prezis Apr 25, 2026
ab5b1eb
checkpoint (12:08): session crash protection — 2026-04.jsonl,2026-04.…
prezis Apr 25, 2026
7523021
chore(release): 1.4.3 — SQLite WAL hygiene fix
prezis Apr 25, 2026
96bb1c4
chore(auto): update 2026-04.jsonl,2026-04.jsonl.processed,dvsa.py
prezis Apr 25, 2026
ecad2d4
chore(auto): update 2026-04.jsonl,2026-04.jsonl.processed,2026-04.jso…
prezis Apr 25, 2026
03f7b9b
security: add .gitleaksignore (Solana mint addresses, false positives)
prezis Apr 26, 2026
b3ae606
chore(auto): update 2026-04.jsonl,2026-04.jsonl.processed,2026-04.jsonl
prezis Apr 26, 2026
426d472
feat(fetch): add smart_fetch thin-client (Jina → urllib → Playwright …
prezis Apr 26, 2026
42f6182
feat(gh-discover): topic-first GitHub repo discovery (Phase 2.2 P3)
prezis Apr 26, 2026
fec1798
feat(tv-resolve): TradingView symbol/exchange resolver with negative …
prezis Apr 26, 2026
6685b5f
chore(auto): update 2026-04.jsonl,2026-04.jsonl.processed
prezis Apr 26, 2026
6bcf4ab
deps(python): Update beautifulsoup4 requirement from >=4.12 to >=4.14.3
dependabot[bot] Apr 26, 2026
4 changes: 2 additions & 2 deletions .github/workflows/ci.yml
@@ -18,10 +18,10 @@ jobs:
         python-version: ["3.10", "3.11", "3.12"]

     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6

       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@v6
         with:
           python-version: ${{ matrix.python-version }}
           cache: pip
6 changes: 3 additions & 3 deletions .github/workflows/publish.yml
@@ -13,10 +13,10 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6

       - name: Set up Python 3.11
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@v6
         with:
           python-version: "3.11"

@@ -31,7 +31,7 @@ jobs:
       - name: Check dist
         run: twine check dist/*

-      - uses: actions/upload-artifact@v4
+      - uses: actions/upload-artifact@v7
         with:
           name: dist
           path: dist/
6 changes: 6 additions & 0 deletions .gitleaksignore
@@ -0,0 +1,6 @@
# .gitleaksignore — confirmed false positives (auto-generated 2026-04-26)
# Triaged: test fixtures (0xabcdef placeholders), upstream fork code, public on-chain addresses, dev tokens.
# Format: <commit>:<file>:<rule>:<startline> (gitleaks fingerprint)

7a396da7d18004bf61a0ed0b69145e2bbeb62098:tests/test_token_extractor.py:generic-api-key:105
7a396da7d18004bf61a0ed0b69145e2bbeb62098:tests/test_token_extractor.py:generic-api-key:119
4 changes: 4 additions & 0 deletions .mailmap
@@ -0,0 +1,4 @@
# Canonical author mapping — GitHub honours this for attribution display.
# Rewrites the 5 early commits (scraperx v1.0.0/v1.1.0/thread-walk/search/errors)
# that were made before global git config was corrected, without force-push.
Przemyslaw Palyska <przemyslaw.palyska@gmail.com> <noreply@users.noreply.github.com>
79 changes: 79 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,85 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [1.4.3] — 2026-04-25

Bug-fix release: production-grade SQLite WAL hygiene across all storage callsites. Important for anyone running scraperx components as long-lived daemons (BMW corpus ingester, Reddit/KBA/forum scrapers, GitHub analyzer in batch mode) — closes the unbounded-WAL disaster vector.

### Added

- **`scraperx/_sqlite_pragmas.py`** — shared `apply_pragmas(conn)` helper that applies the production-grade WAL hygiene stack (`journal_mode=WAL`, `journal_size_limit=64MB`, `synchronous=NORMAL`, `busy_timeout=5000`, `foreign_keys=ON`, `mmap_size=256MB`, `temp_store=MEMORY`). Idempotent. Per-connection PRAGMAs are LOST on close, so the helper MUST run on every new connection — that's why it's a function, not a one-time DB header write. A sketch follows this list.
- **`tests/test_sqlite_pragmas.py`** — 7 tests covering the helper itself + end-to-end PRAGMA verification on `SocialDB`, `AvatarMatcher`, and `VerifiedAvatarRegistry`.
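
A minimal sketch of what the helper can look like, reconstructed from the PRAGMA list above (illustrative, not the shipped `_sqlite_pragmas.py`):

```python
# Illustrative reconstruction of apply_pragmas() from the PRAGMA stack named
# above, not the actual scraperx/_sqlite_pragmas.py source.
import sqlite3

def apply_pragmas(conn: sqlite3.Connection) -> None:
    """Apply the WAL hygiene stack. PRAGMAs are per-connection, so this
    must run after every connect(); calling it twice is harmless."""
    conn.execute("PRAGMA journal_mode=WAL")             # readers don't block the writer
    conn.execute("PRAGMA journal_size_limit=67108864")  # truncate WAL to <=64 MB at checkpoint
    conn.execute("PRAGMA synchronous=NORMAL")           # safe under WAL, far cheaper than FULL
    conn.execute("PRAGMA busy_timeout=5000")            # wait 5 s on lock contention
    conn.execute("PRAGMA foreign_keys=ON")              # FK enforcement is off by default
    conn.execute("PRAGMA mmap_size=268435456")          # 256 MB memory-mapped I/O
    conn.execute("PRAGMA temp_store=MEMORY")            # temp tables/indices in RAM

conn = sqlite3.connect("social.db")
apply_pragmas(conn)
```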

### Fixed

- **Closes the unbounded-WAL disaster vector** for long-running scraperx daemons (BMW corpus ingester, Reddit/KBA/forum scrapers running 24/7). Pre-1.4.3, `~/.scraperx/social.db` had `journal_size_limit=-1` (uncapped) — same root cause that produced an 87 GB WAL on a sister project. With `journal_size_limit=64MB` + `wal_autocheckpoint` defaults, the WAL is bounded by design.
- **`scraperx/social_db.py`** `SocialDB.__init__`: was setting only `PRAGMA journal_mode=WAL`. Now applies the full 7-PRAGMA stack via `apply_pragmas()`.
- **`scraperx/avatar_matcher.py`** `AvatarMatcher.__init__`: was setting only `PRAGMA journal_mode=WAL`. Now applies the full stack.
- **`scraperx/avatar_matcher.py`** `VerifiedAvatarRegistry.__init__`: was setting **NO PRAGMAs at all** — implicitly assumed another consumer (`SocialDB` / `AvatarMatcher`) had opened the shared `~/.scraperx/social.db` first. That assumption broke whenever a fresh process imported `VerifiedAvatarRegistry` standalone (e.g. the GitHub analyzer telemetry path). Now applies the full stack on every connect.

### Notes

The fix is fully additive — no schema migration, no behaviour change beyond performance + safety. Existing DB files keep working; the WAL bound only takes effect on the next checkpoint after upgrade.

Research grounding (2026): loke.dev "20GB WAL File That Shouldn't Exist", oneuptime "How to Set Up SQLite for Production Use", powersync "SQLite Optimizations For Ultra High-Performance", phiresky tune.md, sqlite.org/pragma.html.

## [1.4.2] — 2026-04-18

Telemetry: `--log-verdict` flag + agree/disagree corpus builder for calibrating v1.5.0.

### Added

- **`scraperx/github_analyzer/telemetry.py`** — `log_verdict(report, feedback=None)` appends one JSONL event to `~/.scraperx/verdicts.jsonl`. Fields: `timestamp`, `repo`, `url`, `overall`, `sub_scores` (all 4), `mentions_count`, `warnings_count`, `warnings[:5]`, `scraperx_version`, `feedback`. Returns `True/False` — never raises. Creates `~/.scraperx/` automatically. A usage sketch follows this list.
- **`prompt_and_log_verdict(report)`** — interactive wrapper for CLI use. Logs the scoring event first (feedback-free), then prompts `Agree? [y/n/<reason>] (Enter to skip)` on stderr (safe for `--json` mode). User response coerced: `y/yes/agree/tak → "agree"`, `n/no/disagree/nie → "disagree"`, anything else stored as free-text. Non-TTY stdin (pipes) is detected and silently skipped.
- **`scraperx github --log-verdict`** — new CLI flag. Fires `prompt_and_log_verdict` after output so it never delays the report rendering.
- **`_normalise_feedback(raw)`** — canonical alias coercion. Handles Polish (`tak`/`nie`) and common informal aliases (`ok`, `yep`, `nope`).
- **44 new tests** in `tests/test_github_telemetry.py` covering all feedback aliases, JSONL field correctness, multi-event append, warning cap, permission-error graceful return, non-TTY auto-skip, and timestamp ISO-8601-Z round-trip.
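
A hedged usage sketch (import path inferred from the file path above; the report object comes from the analyzer):

```python
# Illustrative call shape only: log_verdict(report, feedback=None) returns
# True/False and never raises, per the contract described above.
from scraperx import analyze_github_repo
from scraperx.github_analyzer.telemetry import log_verdict

report = analyze_github_repo("yt-dlp/yt-dlp")
ok = log_verdict(report, feedback="agree")  # appends one event to ~/.scraperx/verdicts.jsonl
print("logged" if ok else "skipped")
```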

### Changed

- **`__version__` bumped to `1.4.2`** (1.4.1 was the metadata-enrichment commit; 1.4.2 adds telemetry).
- **`cli.py`** imports `prompt_and_log_verdict` from `telemetry`; `log_verdict` import removed (unused at CLI level — CLI always uses the interactive wrapper).

## [1.4.0] — 2026-04-18

Major feature release: deep GitHub repository trust analysis with scored verdicts, community mention aggregation across 6 dedicated platforms + 6 generic sites, a GitHub Trending scraper, and optional GPU-backed LLM synthesis with graceful heuristic fallback.

### Added — `scraperx.github_analyzer` module

- **`analyze_github_repo(url)` / `GithubAnalyzer`** — end-to-end pipeline: REST metadata → scoring → community mentions → optional web-search layer → LLM-synthesized 3-bullet verdict with inline citations + 0-100 overall score. Dependency-injected at every external call (GitHub token, SQLite cache, web-search fn, LLM fn) so the whole thing is unit-testable without a network.
- **`github_api.py`** — stdlib-only GitHub REST client. 8 endpoints: `get_repo`, `get_contributors`, `get_recent_commits`, `get_releases`, `get_top_forks`, `get_readme`, `get_workflows`, `get_advisories` (GHSA). Rate-limit header absorption + fail-fast pre-flight when the window is exhausted. Exceptions: `GithubAPIError`, `RepoNotFoundError`, `RateLimitExceededError(reset_at)`.
- **`scoring.py`** — 4 pure heuristics (0-100 int each): `bus_factor_score` (k-at-50% contribution share), `momentum_score` (commits + star delta over 90 days), `health_score` (archived / license / issue & fork ratios), `readme_quality_score` (length + heading + code + link + install keyword). Graceful on malformed input — never raises.
- **`mentions/`** — 6 Tier-A platform adapters: `hn` (Algolia HN Search), `reddit` (`/search.json`), `stackoverflow` (StackExchange API 2.3), `devto` (dev.to articles + client-side slug filter), `arxiv` (Atom XML, `xml.etree`), `pwc` (Papers With Code). Every adapter: common contract (never raise, return `[]` on any error, normalize to `ExternalMention`, cache hit/miss via SQLite). `ALL_SOURCES` registry for iteration.
- **`semantic.py`** — Tier-B generic web search. Takes an injected `web_search_fn` (matches `local_web_search` MCP signature), composes `(site:lobste.rs OR site:substack.com …) "owner/repo"` queries, filters hits to an allowlist of hosts (Lobsters, Substack, Medium, Product Hunt, Bluesky, LinkedIn). Graceful degradation when `web_search_fn` is None.
- **`trending.py`** — `fetch_trending(since, language, spoken_language_code)` scrapes github.com/trending (no public API). Dual parser: BeautifulSoup preferred, regex fallback when bs4 unavailable (same optional-bs4 pattern as `video_discovery.py`). Returns `list[TrendingRepo]`. Browser User-Agent required — GitHub blocks naked urllib.
- **`synthesis.py`** — populated report → `trust.overall` + `trust.rationale` + `verdict_markdown`. Dependency-injected `local_llm_fn` (qwen3:4b fast, qwen3.5:27b on `deep=True`). Robust JSON extraction via brace-counter (qwen sometimes wraps its output in prose or code fences). Heuristic fallback (sub-score weighted average) when the LLM is unreachable or returns unparseable output. An illustrative sketch of the brace-counter follows this list.
- **`schemas.py`** — 7 stdlib dataclasses: `GithubReport`, `RepoTrustScore`, `ContributorInfo`, `ForkInfo`, `ExternalMention`, `SecurityAdvisory` (GHSA), `TrendingRepo`. No Pydantic — matches scraperx core discipline. Full JSON serialization via `to_dict()`.
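
The brace-counter is simple enough to sketch. This is an illustrative reconstruction, not the shipped `synthesis.py`; exact behaviour may differ:

```python
# Illustrative brace-counter: find the first balanced {...} region so JSON
# survives being wrapped in prose or code fences. Naive about braces inside
# string literals; the real synthesis.py may handle more cases.
import json

def extract_first_json_object(text: str) -> dict | None:
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but not valid JSON, try the next "{"
        start = text.find("{", start + 1)
    return None  # caller falls back to the sub-score weighted average

print(extract_first_json_object('Sure! Here you go: {"overall": 82}.'))  # {'overall': 82}
```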

### Added — CLI

- **`scraperx github OWNER/REPO [--json] [--deep] [--no-mentions] [--no-cache]`** — produces markdown trust report (or JSON dump with `--json`). Accepts shorthand `owner/repo`, full URL, `.git` suffix, SSH form, or sub-path URLs. Invalid URL → exit 2 with stderr message.
- **`scraperx trending [--since daily|weekly|monthly] [--lang python] [--spoken en] [--limit 25] [--json]`** — lists github.com/trending. Defaults to daily + all languages (per Q2 handoff decision).

### Added — SQLite cache

- **3 new tables** in `social_db.py` (share the existing `~/.scraperx/social.db`): `github_repo_cache` (composite key `(full_name, kind)`, per-kind TTL: repo 24h, commits 6h, etc.), `github_fork_cache` (6h TTL), `github_mentions_cache` (4h TTL). Composite-kind design means one table covers repo / contributors / commits / releases / readme / workflows / issues / advisories without schema churn.
- **New SocialDB methods**: `save_repo_cache`/`get_repo_cache`, `save_fork_cache`/`get_fork_cache`, `save_mentions_cache`/`get_mentions_cache`, `purge_expired_github_cache`. Query-hash normalisation so `"Yt-Dlp"` and `" yt-dlp "` collide. Empty results NOT cached — lets transient errors retry next call.
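
A sketch of the intended round-trip; the method names are from the list above, but exact signatures are assumptions:

```python
# Illustrative cache round-trip. save_repo_cache/get_repo_cache exist per the
# entry above; their parameter shapes here are guesses, not the real API.
from scraperx import SocialDB

db = SocialDB()
cached = db.get_repo_cache("yt-dlp/yt-dlp", "repo")  # None on miss or expired TTL
if cached is None:
    payload = fetch_repo_from_github("yt-dlp/yt-dlp")  # hypothetical fetch step
    db.save_repo_cache("yt-dlp/yt-dlp", "repo", payload)
```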

### Added — top-level exports

- **`scraperx` package** re-exports: `GithubAnalyzer`, `GithubReport`, `InvalidRepoUrlError`, `analyze_github_repo`, `parse_github_repo_url`.

### Added — Tests

- **236 new tests** covering: URL parsing across 6 shapes, schema round-trip, SQLite cache (hit/miss/TTL/purge/case-insensitivity), GitHub REST (auth/404/403-rate-limit/URLError/invalid-JSON/pre-flight), scoring (34 parametrized heuristic cases), mention adapters (happy + error + cache per platform), semantic layer (graceful degradation + site filter + subdomain acceptance), trending (dual-parser + URL building), synthesis (JSON extraction + heuristic fallback + LLM happy + 5 error paths), CLI end-to-end (argv dispatch + flags + `__main__` routing), full-pipeline e2e integration (happy + partial-failure + 404-short-circuit + skip-mentions). Total suite: **441 passing, 0 ruff warnings**.

### Changed

- **`pyproject.toml`**: `description` extended to mention GitHub analyzer; `keywords` +5 entries.
- **`README.md`**: new top-level feature section (see below).

## [Unreleased-prior-to-1.4.0]

### Fixed
- **`VimeoScraper.get_metadata()` — fallback to player config when oEmbed 404s.** Vimeo's oEmbed endpoint has been unreliable since late 2025 (returns 404 on live queries even for public videos). `get_metadata` now tries oEmbed first and transparently falls back to `player.vimeo.com/video/{id}/config` for durable metadata (title, author, duration, thumbnail). Result dict now includes a `source` field (`"oembed"` | `"player_config"`). Only raises if BOTH endpoints fail.
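
A minimal sketch of the fallback shape (endpoints as named above; error handling simplified, not the actual `VimeoScraper` code):

```python
# Sketch of the oEmbed -> player-config fallback described above. The two
# endpoints are the public URLs named in this entry; everything else is
# simplified for illustration.
import json
import urllib.request

def vimeo_metadata(video_id: str) -> dict:
    endpoints = [
        ("oembed", f"https://vimeo.com/api/oembed.json?url=https://vimeo.com/{video_id}"),
        ("player_config", f"https://player.vimeo.com/video/{video_id}/config"),
    ]
    for source, url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                data = json.loads(resp.read())
        except Exception:
            continue  # fall through to the next endpoint
        data["source"] = source  # matches the result-dict field described above
        return data
    raise RuntimeError("both oEmbed and player config failed")
```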

85 changes: 84 additions & 1 deletion README.md
@@ -23,7 +23,9 @@ ScraperX fetches social-media posts, transcribes videos, and verifies authenticity
 - **Impersonation detection** (NEW) — perceptual-hash avatar matcher (pHash 8×8) with SQLite cache + rolling-window registry. Catches scammers who re-upload a victim's avatar under a typosquat handle.
 - **Scam content detection** — crypto-giveaway phrases, wallet addresses, shortener domains, emoji spam.
 - **Token extraction** — `$CASHTAG` mentions + known Solana tokens.
-- **SQLite persistence** — tweets, profiles, mentions, avatar hashes, search cache.
+- **GitHub deep analyzer** (NEW in 1.4.0) — paste any `owner/repo` URL, get a 0–100 trust verdict with 3-bullet rationale, community mention aggregation across HN / Reddit / StackOverflow / dev.to / arXiv / Papers With Code, notable forks, security advisories (GHSA), and sub-scores for bus factor / momentum / health / README quality. Optional LLM synthesis via local GPU.
+- **GitHub trending** (NEW in 1.4.0) — `scraperx trending` lists github.com/trending for daily / weekly / monthly windows with language filters.
+- **SQLite persistence** — tweets, profiles, mentions, avatar hashes, search cache, GitHub repo/fork/mention caches with per-kind TTL.

 Why no API keys? The official APIs are expensive, rate-limited, and unstable. ScraperX leans on public endpoints (oEmbed, FxTwitter, vxTwitter, syndication, yt-dlp) with no auth wall.

Expand Down Expand Up @@ -302,6 +304,87 @@ result = fetch_any_video_transcript("https://some-blog.com/post-with-vimeo-embed

Deduplicates by `(provider, id)`. Works without `beautifulsoup4` (regex fallback). Returns `VideoRef` objects with `page_url` + `referer` for embed-locked downstream calls.

### 7. GitHub deep analyzer — trust verdicts + community mentions (NEW in 1.4.0)

One command, one verdict. Paste a repo URL, get back:

- 0–100 overall trust score with a one-line rationale
- 4 sub-scores: bus factor (k-at-50%, sketched after this list), momentum, health, README quality
- Community mentions across 6 dedicated platforms (HN, Reddit, StackOverflow, dev.to, arXiv, Papers With Code) + 6 generic sites via the Tier-B semantic layer (Lobsters, Medium, Bluesky, Product Hunt, Substack, LinkedIn)
- Notable forks (catches "community took over" signals)
- Security advisories (GHSA)
- 3-bullet verdict with inline `[n]` citations to the mentions list
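
The bus-factor sub-score is the least self-explanatory of the four; here is a hedged sketch of the k-at-50% idea (the shipped `scoring.py` may weight differently):

```python
# Illustrative k-at-50%: how many top contributors it takes to cover half of
# all commits. The k -> score mapping below is an assumption for illustration.
def bus_factor_score(commit_counts: list[int]) -> int:
    total = sum(commit_counts)
    if total == 0:
        return 0  # graceful on empty/malformed input, per the module contract
    running, k = 0, 0
    for count in sorted(commit_counts, reverse=True):
        running += count
        k += 1
        if running * 2 >= total:
            break
    return min(100, k * 20)  # assumed: 1 person = 20 ... 5+ people = 100

print(bus_factor_score([900, 50, 30, 20]))  # 20: one maintainer dominates
```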

#### CLI

```bash
# Markdown report
scraperx github yt-dlp/yt-dlp

# Full URL, JSON output
scraperx github https://github.com/rust-lang/rust --json

# Deep mode — qwen3.5:27b synthesis instead of qwen3:4b (slower, higher quality)
scraperx github yt-dlp/yt-dlp --deep

# Skip community mentions for a quick metadata-only check
scraperx github yt-dlp/yt-dlp --no-mentions

# Disable SQLite cache for this run
scraperx github yt-dlp/yt-dlp --no-cache

# Also: trending
scraperx trending # daily, all languages
scraperx trending --since weekly --lang python --limit 10
scraperx trending --json
```

#### Python

```python
from scraperx import GithubAnalyzer, analyze_github_repo

# One-shot — heuristic verdict (no LLM, no cache)
report = analyze_github_repo("yt-dlp/yt-dlp")
print(f"Trust: {report.trust.overall}/100 — {report.trust.rationale}")

# With full wiring: cache + web-search + LLM synthesis
from scraperx import SocialDB

analyzer = GithubAnalyzer(
    github_token=None,            # or os.environ["GITHUB_TOKEN"] for 5000/h
    db=SocialDB(),                # SQLite cache, 4-24h TTL per kind
    web_search_fn=my_web_search,  # Tier B — any local_web_search-compatible callable
    local_llm_fn=my_local_llm,    # qwen3:4b fast / qwen3.5:27b deep
)
report = analyzer.analyze_repo("https://github.com/rust-lang/rust", deep=True)

print(report.verdict_markdown)
for m in report.mentions:
    print(f"[{m.source}] {m.title} — {m.url}")
```

#### Authentication (Q1 handoff decision)

Unauth by default — 60 requests per hour, enough for personal use. Set `GITHUB_TOKEN` env var to upgrade to 5000/h; the analyzer picks it up automatically. No config file, no prompt.

#### How it avoids API-key lock-in

- HN: Algolia HN Search (free, unauthed)
- Reddit: `/search.json` (free, unauthed, UA required)
- StackOverflow: StackExchange API 2.3 (free, unauthed, 300/day)
- dev.to: public `/api/articles` (free)
- arXiv: Atom XML export (free)
- Papers With Code: public v1 API (free)
- Trending: HTML scrape of github.com/trending (no API exists)
- GitHub REST: works unauthed at 60/h
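
As an example of how thin these adapters can be, a sketch of the HN call (the real `mentions/hn.py` adds SQLite caching and `ExternalMention` normalisation; the query shape here is an assumption):

```python
# Illustrative Algolia HN Search call. Adapter contract from above: never
# raise, return [] on any error.
import json
import urllib.parse
import urllib.request

def hn_mentions(full_name: str) -> list[dict]:
    query = urllib.parse.quote(f'"{full_name}"')
    url = f"https://hn.algolia.com/api/v1/search?query={query}&tags=story"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            hits = json.loads(resp.read()).get("hits", [])
    except Exception:
        return []
    return [{"title": h.get("title"), "url": h.get("url")} for h in hits]

for m in hn_mentions("yt-dlp/yt-dlp"):
    print(m["title"], m["url"])
```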

#### Cache discipline

`SocialDB` caches repo metadata for 24h, commits/issues for 6h, mentions for 4h. Empty results are NOT cached — transient network errors can retry next call. All new tables share the existing `~/.scraperx/social.db` file.

---

### 6. Profile, search, token extraction
