Changes from all commits (46 commits)
a87e641
deps(ci): Bump actions/setup-python from 5 to 6 (#1)
dependabot[bot] Apr 17, 2026
84dd0ab
deps(python): Update playwright requirement from >=1.40 to >=1.58.0 (#4)
dependabot[bot] Apr 17, 2026
1b4184d
deps(python): Update imagehash requirement from >=4.3 to >=4.3.2 (#5)
dependabot[bot] Apr 17, 2026
6489189
deps(ci): bump actions/checkout v4→v6, actions/upload-artifact v4→v7
prezis Apr 17, 2026
4eaa07a
deps(python): Update faster-whisper requirement from >=1.0 to >=1.2.1…
dependabot[bot] Apr 17, 2026
5119b18
deps(python): Update pytest requirement from >=7.0 to >=9.0.3 (#7)
dependabot[bot] Apr 17, 2026
a9db48f
deps(python): bump Pillow >=10.0 → >=12.2.0
prezis Apr 17, 2026
7a46d33
fix: 3 CRITICAL + 4 HIGH security/correctness bugs found by code revi…
prezis Apr 17, 2026
831b62b
docs(next-session): GitHub deep-analyzer spec — 20-source rubric, 16-…
prezis Apr 18, 2026
6aca279
docs(reports): add 2026-04-18 dependency vulnerability audit
prezis Apr 18, 2026
740d61a
feat: Add initial GitHub repo trust analysis implementation
prezis Apr 18, 2026
d1b9e51
chore(auto): update social_db.py,test_github_analyzer_cache.py
prezis Apr 18, 2026
a0421e0
feat: Add initial GitHub API client implementation
prezis Apr 18, 2026
2667e42
feat: Add initial scoring heuristics for GitHub repo analysis
prezis Apr 18, 2026
500d180
feat: Add support for multiple external platforms in github_analyzer men
prezis Apr 18, 2026
8f8e79a
chore(auto): update semantic.py,test_github_semantic.py
prezis Apr 18, 2026
3158e70
chore(auto): update trending.py,test_github_trending.py
prezis Apr 18, 2026
b96f38e
feat: Synthesize verdict and rationale for GitHub repo trust assessment
prezis Apr 18, 2026
0753af8
chore(auto): update __main__.py,cli.py,core.py
prezis Apr 18, 2026
22c19c0
chore(auto): update test_github_pipeline_e2e.py
prezis Apr 18, 2026
e865685
chore(auto): update CHANGELOG.md,README.md,pyproject.toml
prezis Apr 18, 2026
be1204e
checkpoint (00:18): session crash protection — devto.py,hn.py,reddit.…
prezis Apr 18, 2026
fb9fbe2
feat: Add safe_float to handle numeric coercion quirks in Reddit respons
prezis Apr 18, 2026
515ebe5
feat(telemetry): add --log-verdict flag + verdicts.jsonl corpus build…
prezis Apr 22, 2026
ee94e20
The provided text appears to be a large list of URLs (web addresses) for
prezis Apr 22, 2026
3040fbf
chore: add .mailmap for canonical author attribution
prezis Apr 22, 2026
ce809a2
docs: add API endpoint discovery runbook
prezis Apr 24, 2026
217e7ea
chore(auto): update 2026-04.jsonl,__init__.py,_output.py
prezis Apr 25, 2026
6e846ec
chore(auto): update 2026-04.jsonl,nhtsa.py
prezis Apr 25, 2026
010c269
chore(auto): update 2026-04.jsonl,__init__.py,_http.py
prezis Apr 25, 2026
62af291
docs(bmw_corpus): README skill doc + initial JSONL outputs
prezis Apr 25, 2026
2a11eb1
docs(bmw_corpus): add Future Paths section to README
prezis Apr 25, 2026
f6df0e9
chore(auto): update 2026-04.jsonl,2026-04.jsonl.processed,2026-04.jsonl
prezis Apr 25, 2026
19c1bde
chore(auto): update 2026-04.jsonl.processed
prezis Apr 25, 2026
453ad8e
fix(sqlite): apply full WAL hygiene PRAGMA stack to all 3 connection …
prezis Apr 25, 2026
ab5b1eb
checkpoint (12:08): session crash protection — 2026-04.jsonl,2026-04.…
prezis Apr 25, 2026
7523021
chore(release): 1.4.3 — SQLite WAL hygiene fix
prezis Apr 25, 2026
96bb1c4
chore(auto): update 2026-04.jsonl,2026-04.jsonl.processed,dvsa.py
prezis Apr 25, 2026
ecad2d4
chore(auto): update 2026-04.jsonl,2026-04.jsonl.processed,2026-04.jso…
prezis Apr 25, 2026
03f7b9b
security: add .gitleaksignore (Solana mint addresses, false positives)
prezis Apr 26, 2026
b3ae606
chore(auto): update 2026-04.jsonl,2026-04.jsonl.processed,2026-04.jsonl
prezis Apr 26, 2026
426d472
feat(fetch): add smart_fetch thin-client (Jina → urllib → Playwright …
prezis Apr 26, 2026
42f6182
feat(gh-discover): topic-first GitHub repo discovery (Phase 2.2 P3)
prezis Apr 26, 2026
fec1798
feat(tv-resolve): TradingView symbol/exchange resolver with negative …
prezis Apr 26, 2026
6685b5f
chore(auto): update 2026-04.jsonl,2026-04.jsonl.processed
prezis Apr 26, 2026
6bcf4ab
deps(python): Update beautifulsoup4 requirement from >=4.12 to >=4.14.3
dependabot[bot] Apr 26, 2026
4 changes: 2 additions & 2 deletions .github/workflows/ci.yml
@@ -18,10 +18,10 @@ jobs:
         python-version: ["3.10", "3.11", "3.12"]

     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6

       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@v6
         with:
           python-version: ${{ matrix.python-version }}
           cache: pip
6 changes: 3 additions & 3 deletions .github/workflows/publish.yml
@@ -13,10 +13,10 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6

       - name: Set up Python 3.11
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@v6
         with:
           python-version: "3.11"

@@ -31,7 +31,7 @@ jobs:
       - name: Check dist
         run: twine check dist/*

-      - uses: actions/upload-artifact@v4
+      - uses: actions/upload-artifact@v7
         with:
           name: dist
           path: dist/
6 changes: 6 additions & 0 deletions .gitleaksignore
@@ -0,0 +1,6 @@
# .gitleaksignore — confirmed false positives (auto-generated 2026-04-26)
# Triaged: test fixtures (0xabcdef placeholders), upstream fork code, public on-chain addresses, dev tokens.
# Format: <commit>:<file>:<rule>:<startline> (gitleaks fingerprint)

7a396da7d18004bf61a0ed0b69145e2bbeb62098:tests/test_token_extractor.py:generic-api-key:105
7a396da7d18004bf61a0ed0b69145e2bbeb62098:tests/test_token_extractor.py:generic-api-key:119
4 changes: 4 additions & 0 deletions .mailmap
@@ -0,0 +1,4 @@
# Canonical author mapping — GitHub honours this for attribution display.
# Rewrites the 5 early commits (scraperx v1.0.0/v1.1.0/thread-walk/search/errors)
# that were made before global git config was corrected, without force-push.
Przemyslaw Palyska <przemyslaw.palyska@gmail.com> <noreply@users.noreply.github.com>
79 changes: 79 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,85 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [1.4.3] — 2026-04-25

Bug-fix release: production-grade SQLite WAL hygiene across all storage callsites. Important for anyone running scraperx components as long-lived daemons (BMW corpus ingester, Reddit/KBA/forum scrapers, GitHub analyzer in batch mode) — closes the unbounded-WAL disaster vector.

### Added

- **`scraperx/_sqlite_pragmas.py`** — shared `apply_pragmas(conn)` helper that applies the production-grade WAL hygiene stack (`journal_mode=WAL`, `journal_size_limit=64MB`, `synchronous=NORMAL`, `busy_timeout=5000`, `foreign_keys=ON`, `mmap_size=256MB`, `temp_store=MEMORY`). Idempotent. Per-connection PRAGMAs are LOST on close, so the helper MUST run on every new connection — that's why it's a function, not a one-time DB header write. A sketch follows this list.
- **`tests/test_sqlite_pragmas.py`** — 7 tests covering the helper itself + end-to-end PRAGMA verification on `SocialDB`, `AvatarMatcher`, and `VerifiedAvatarRegistry`.
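
A minimal sketch of what the helper can look like, reconstructed from the PRAGMA list above (illustrative, not the shipped `_sqlite_pragmas.py`):

```python
# Illustrative reconstruction of apply_pragmas() from the PRAGMA stack named
# above, not the actual scraperx/_sqlite_pragmas.py source.
import sqlite3

def apply_pragmas(conn: sqlite3.Connection) -> None:
    """Apply the WAL hygiene stack. PRAGMAs are per-connection, so this
    must run after every connect(); calling it twice is harmless."""
    conn.execute("PRAGMA journal_mode=WAL")             # readers don't block the writer
    conn.execute("PRAGMA journal_size_limit=67108864")  # truncate WAL to <=64 MB at checkpoint
    conn.execute("PRAGMA synchronous=NORMAL")           # safe under WAL, far cheaper than FULL
    conn.execute("PRAGMA busy_timeout=5000")            # wait 5 s on lock contention
    conn.execute("PRAGMA foreign_keys=ON")              # FK enforcement is off by default
    conn.execute("PRAGMA mmap_size=268435456")          # 256 MB memory-mapped I/O
    conn.execute("PRAGMA temp_store=MEMORY")            # temp tables/indices in RAM

conn = sqlite3.connect("social.db")
apply_pragmas(conn)
```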

### Fixed

- **Closes the unbounded-WAL disaster vector** for long-running scraperx daemons (BMW corpus ingester, Reddit/KBA/forum scrapers running 24/7). Pre-1.4.3, `~/.scraperx/social.db` had `journal_size_limit=-1` (uncapped) — same root cause that produced an 87 GB WAL on a sister project. With `journal_size_limit=64MB` + `wal_autocheckpoint` defaults, the WAL is bounded by design.
- **`scraperx/social_db.py`** `SocialDB.__init__`: was setting only `PRAGMA journal_mode=WAL`. Now applies the full 7-PRAGMA stack via `apply_pragmas()`.
- **`scraperx/avatar_matcher.py`** `AvatarMatcher.__init__`: was setting only `PRAGMA journal_mode=WAL`. Now applies the full stack.
- **`scraperx/avatar_matcher.py`** `VerifiedAvatarRegistry.__init__`: was setting **NO PRAGMAs at all** — implicitly assumed another consumer (`SocialDB` / `AvatarMatcher`) had opened the shared `~/.scraperx/social.db` first. That assumption broke whenever a fresh process imported `VerifiedAvatarRegistry` standalone (e.g. the GitHub analyzer telemetry path). Now applies the full stack on every connect.

### Notes

The fix is fully additive — no schema migration, no behaviour change beyond performance + safety. Existing DB files keep working; the WAL bound only takes effect on the next checkpoint after upgrade.

Research grounding (2026): loke.dev "20GB WAL File That Shouldn't Exist", oneuptime "How to Set Up SQLite for Production Use", powersync "SQLite Optimizations For Ultra High-Performance", phiresky tune.md, sqlite.org/pragma.html.

## [1.4.2] — 2026-04-18

Telemetry: `--log-verdict` flag + agree/disagree corpus builder for calibrating v1.5.0.

### Added

- **`scraperx/github_analyzer/telemetry.py`** — `log_verdict(report, feedback=None)` appends one JSONL event to `~/.scraperx/verdicts.jsonl`. Fields: `timestamp`, `repo`, `url`, `overall`, `sub_scores` (all 4), `mentions_count`, `warnings_count`, `warnings[:5]`, `scraperx_version`, `feedback`. Returns `True/False` — never raises. Creates `~/.scraperx/` automatically. A usage sketch follows this list.
- **`prompt_and_log_verdict(report)`** — interactive wrapper for CLI use. Logs the scoring event first (feedback-free), then prompts `Agree? [y/n/<reason>] (Enter to skip)` on stderr (safe for `--json` mode). User response coerced: `y/yes/agree/tak → "agree"`, `n/no/disagree/nie → "disagree"`, anything else stored as free-text. Non-TTY stdin (pipes) is detected and silently skipped.
- **`scraperx github --log-verdict`** — new CLI flag. Fires `prompt_and_log_verdict` after output so it never delays the report rendering.
- **`_normalise_feedback(raw)`** — canonical alias coercion. Handles Polish (`tak`/`nie`) and common informal aliases (`ok`, `yep`, `nope`).
- **44 new tests** in `tests/test_github_telemetry.py` covering all feedback aliases, JSONL field correctness, multi-event append, warning cap, permission-error graceful return, non-TTY auto-skip, and timestamp ISO-8601-Z round-trip.
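
A hedged usage sketch (import path inferred from the file path above; the report object comes from the analyzer):

```python
# Illustrative call shape only: log_verdict(report, feedback=None) returns
# True/False and never raises, per the contract described above.
from scraperx import analyze_github_repo
from scraperx.github_analyzer.telemetry import log_verdict

report = analyze_github_repo("yt-dlp/yt-dlp")
ok = log_verdict(report, feedback="agree")  # appends one event to ~/.scraperx/verdicts.jsonl
print("logged" if ok else "skipped")
```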

### Changed

- **`__version__` bumped to `1.4.2`** (1.4.1 was the metadata-enrichment commit; 1.4.2 adds telemetry).
- **`cli.py`** imports `prompt_and_log_verdict` from `telemetry`; `log_verdict` import removed (unused at CLI level — CLI always uses the interactive wrapper).

## [1.4.0] — 2026-04-18

Major feature release: deep GitHub repository trust analysis with scored verdicts, community mention aggregation across 6 dedicated platforms + 6 generic sites, a GitHub Trending scraper, and optional GPU-backed LLM synthesis with graceful heuristic fallback.

### Added — `scraperx.github_analyzer` module

- **`analyze_github_repo(url)` / `GithubAnalyzer`** — end-to-end pipeline: REST metadata → scoring → community mentions → optional web-search layer → LLM-synthesized 3-bullet verdict with inline citations + 0-100 overall score. Dependency-injected at every external call (GitHub token, SQLite cache, web-search fn, LLM fn) so the whole thing is unit-testable without a network.
- **`github_api.py`** — stdlib-only GitHub REST client. 8 endpoints: `get_repo`, `get_contributors`, `get_recent_commits`, `get_releases`, `get_top_forks`, `get_readme`, `get_workflows`, `get_advisories` (GHSA). Rate-limit header absorption + fail-fast pre-flight when the window is exhausted. Exceptions: `GithubAPIError`, `RepoNotFoundError`, `RateLimitExceededError(reset_at)`.
- **`scoring.py`** — 4 pure heuristics (0-100 int each): `bus_factor_score` (k-at-50% contribution share), `momentum_score` (commits + star delta over 90 days), `health_score` (archived / license / issue & fork ratios), `readme_quality_score` (length + heading + code + link + install keyword). Graceful on malformed input — never raises.
- **`mentions/`** — 6 Tier-A platform adapters: `hn` (Algolia HN Search), `reddit` (`/search.json`), `stackoverflow` (StackExchange API 2.3), `devto` (dev.to articles + client-side slug filter), `arxiv` (Atom XML, `xml.etree`), `pwc` (Papers With Code). Every adapter: common contract (never raise, return `[]` on any error, normalize to `ExternalMention`, cache hit/miss via SQLite). `ALL_SOURCES` registry for iteration.
- **`semantic.py`** — Tier-B generic web search. Takes an injected `web_search_fn` (matches `local_web_search` MCP signature), composes `(site:lobste.rs OR site:substack.com …) "owner/repo"` queries, filters hits to an allowlist of hosts (Lobsters, Substack, Medium, Product Hunt, Bluesky, LinkedIn). Graceful degradation when `web_search_fn` is None.
- **`trending.py`** — `fetch_trending(since, language, spoken_language_code)` scrapes github.com/trending (no public API). Dual parser: BeautifulSoup preferred, regex fallback when bs4 unavailable (same optional-bs4 pattern as `video_discovery.py`). Returns `list[TrendingRepo]`. Browser User-Agent required — GitHub blocks naked urllib.
- **`synthesis.py`** — populated report → `trust.overall` + `trust.rationale` + `verdict_markdown`. Dependency-injected `local_llm_fn` (qwen3:4b fast, qwen3.5:27b on `deep=True`). Robust JSON extraction via brace-counter (qwen sometimes wraps its output in prose or code fences). Heuristic fallback (sub-score weighted average) when the LLM is unreachable or returns unparseable output. An illustrative sketch of the brace-counter follows this list.
- **`schemas.py`** — 7 stdlib dataclasses: `GithubReport`, `RepoTrustScore`, `ContributorInfo`, `ForkInfo`, `ExternalMention`, `SecurityAdvisory` (GHSA), `TrendingRepo`. No Pydantic — matches scraperx core discipline. Full JSON serialization via `to_dict()`.
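
The brace-counter is simple enough to sketch. This is an illustrative reconstruction, not the shipped `synthesis.py`; exact behaviour may differ:

```python
# Illustrative brace-counter: find the first balanced {...} region so JSON
# survives being wrapped in prose or code fences. Naive about braces inside
# string literals; the real synthesis.py may handle more cases.
import json

def extract_first_json_object(text: str) -> dict | None:
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but not valid JSON, try the next "{"
        start = text.find("{", start + 1)
    return None  # caller falls back to the sub-score weighted average

print(extract_first_json_object('Sure! Here you go: {"overall": 82}.'))  # {'overall': 82}
```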

### Added — CLI

- **`scraperx github OWNER/REPO [--json] [--deep] [--no-mentions] [--no-cache]`** — produces markdown trust report (or JSON dump with `--json`). Accepts shorthand `owner/repo`, full URL, `.git` suffix, SSH form, or sub-path URLs. Invalid URL → exit 2 with stderr message.
- **`scraperx trending [--since daily|weekly|monthly] [--lang python] [--spoken en] [--limit 25] [--json]`** — lists github.com/trending. Defaults to daily + all languages (per Q2 handoff decision).

### Added — SQLite cache

- **3 new tables** in `social_db.py` (share the existing `~/.scraperx/social.db`): `github_repo_cache` (composite key `(full_name, kind)`, per-kind TTL: repo 24h, commits 6h, etc.), `github_fork_cache` (6h TTL), `github_mentions_cache` (4h TTL). Composite-kind design means one table covers repo / contributors / commits / releases / readme / workflows / issues / advisories without schema churn.
- **New SocialDB methods**: `save_repo_cache`/`get_repo_cache`, `save_fork_cache`/`get_fork_cache`, `save_mentions_cache`/`get_mentions_cache`, `purge_expired_github_cache`. Query-hash normalisation so `"Yt-Dlp"` and `" yt-dlp "` collide. Empty results NOT cached — lets transient errors retry next call.
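
A sketch of the intended round-trip; the method names are from the list above, but exact signatures are assumptions:

```python
# Illustrative cache round-trip. save_repo_cache/get_repo_cache exist per the
# entry above; their parameter shapes here are guesses, not the real API.
from scraperx import SocialDB

db = SocialDB()
cached = db.get_repo_cache("yt-dlp/yt-dlp", "repo")  # None on miss or expired TTL
if cached is None:
    payload = fetch_repo_from_github("yt-dlp/yt-dlp")  # hypothetical fetch step
    db.save_repo_cache("yt-dlp/yt-dlp", "repo", payload)
```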

### Added — top-level exports

- **`scraperx` package** re-exports: `GithubAnalyzer`, `GithubReport`, `InvalidRepoUrlError`, `analyze_github_repo`, `parse_github_repo_url`.

### Added — Tests

- **236 new tests** covering: URL parsing across 6 shapes, schema round-trip, SQLite cache (hit/miss/TTL/purge/case-insensitivity), GitHub REST (auth/404/403-rate-limit/URLError/invalid-JSON/pre-flight), scoring (34 parametrized heuristic cases), mention adapters (happy + error + cache per platform), semantic layer (graceful degradation + site filter + subdomain acceptance), trending (dual-parser + URL building), synthesis (JSON extraction + heuristic fallback + LLM happy + 5 error paths), CLI end-to-end (argv dispatch + flags + `__main__` routing), full-pipeline e2e integration (happy + partial-failure + 404-short-circuit + skip-mentions). Total suite: **441 passing, 0 ruff warnings**.

### Changed

- **`pyproject.toml`**: `description` extended to mention GitHub analyzer; `keywords` +5 entries.
- **`README.md`**: new top-level feature section (see below).

## [Unreleased-prior-to-1.4.0]

### Fixed
- **`VimeoScraper.get_metadata()` — fallback to player config when oEmbed 404s.** Vimeo's oEmbed endpoint has been unreliable since late 2025 (returns 404 on live queries even for public videos). `get_metadata` now tries oEmbed first and transparently falls back to `player.vimeo.com/video/{id}/config` for durable metadata (title, author, duration, thumbnail). Result dict now includes a `source` field (`"oembed"` | `"player_config"`). Only raises if BOTH endpoints fail.
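
A minimal sketch of the fallback shape (endpoints as named above; error handling simplified, not the actual `VimeoScraper` code):

```python
# Sketch of the oEmbed -> player-config fallback described above. The two
# endpoints are the public URLs named in this entry; everything else is
# simplified for illustration.
import json
import urllib.request

def vimeo_metadata(video_id: str) -> dict:
    endpoints = [
        ("oembed", f"https://vimeo.com/api/oembed.json?url=https://vimeo.com/{video_id}"),
        ("player_config", f"https://player.vimeo.com/video/{video_id}/config"),
    ]
    for source, url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                data = json.loads(resp.read())
        except Exception:
            continue  # fall through to the next endpoint
        data["source"] = source  # matches the result-dict field described above
        return data
    raise RuntimeError("both oEmbed and player config failed")
```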

85 changes: 84 additions & 1 deletion README.md
@@ -23,7 +23,9 @@ ScraperX fetches social-media posts, transcribes videos, and verifies authenticity
 - **Impersonation detection** (NEW) — perceptual-hash avatar matcher (pHash 8×8) with SQLite cache + rolling-window registry. Catches scammers who re-upload a victim's avatar under a typosquat handle.
 - **Scam content detection** — crypto-giveaway phrases, wallet addresses, shortener domains, emoji spam.
 - **Token extraction** — `$CASHTAG` mentions + known Solana tokens.
-- **SQLite persistence** — tweets, profiles, mentions, avatar hashes, search cache.
+- **GitHub deep analyzer** (NEW in 1.4.0) — paste any `owner/repo` URL, get a 0–100 trust verdict with 3-bullet rationale, community mention aggregation across HN / Reddit / StackOverflow / dev.to / arXiv / Papers With Code, notable forks, security advisories (GHSA), and sub-scores for bus factor / momentum / health / README quality. Optional LLM synthesis via local GPU.
+- **GitHub trending** (NEW in 1.4.0) — `scraperx trending` lists github.com/trending for daily / weekly / monthly windows with language filters.
+- **SQLite persistence** — tweets, profiles, mentions, avatar hashes, search cache, GitHub repo/fork/mention caches with per-kind TTL.

 Why no API keys? The official APIs are expensive, rate-limited, and unstable. ScraperX leans on public endpoints (oEmbed, FxTwitter, vxTwitter, syndication, yt-dlp) with no auth wall.

Expand Down Expand Up @@ -302,6 +304,87 @@ result = fetch_any_video_transcript("https://some-blog.com/post-with-vimeo-embed

Deduplicates by `(provider, id)`. Works without `beautifulsoup4` (regex fallback). Returns `VideoRef` objects with `page_url` + `referer` for embed-locked downstream calls.

### 7. GitHub deep analyzer — trust verdicts + community mentions (NEW in 1.4.0)

One command, one verdict. Paste a repo URL, get back:

- 0–100 overall trust score with a one-line rationale
- 4 sub-scores: bus factor (k-at-50%, sketched after this list), momentum, health, README quality
- Community mentions across 6 dedicated platforms (HN, Reddit, StackOverflow, dev.to, arXiv, Papers With Code) + 6 generic sites via the Tier-B semantic layer (Lobsters, Medium, Bluesky, Product Hunt, Substack, LinkedIn)
- Notable forks (catches "community took over" signals)
- Security advisories (GHSA)
- 3-bullet verdict with inline `[n]` citations to the mentions list
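
The bus-factor sub-score is the least self-explanatory of the four; here is a hedged sketch of the k-at-50% idea (the shipped `scoring.py` may weight differently):

```python
# Illustrative k-at-50%: how many top contributors it takes to cover half of
# all commits. The k -> score mapping below is an assumption for illustration.
def bus_factor_score(commit_counts: list[int]) -> int:
    total = sum(commit_counts)
    if total == 0:
        return 0  # graceful on empty/malformed input, per the module contract
    running, k = 0, 0
    for count in sorted(commit_counts, reverse=True):
        running += count
        k += 1
        if running * 2 >= total:
            break
    return min(100, k * 20)  # assumed: 1 person = 20 ... 5+ people = 100

print(bus_factor_score([900, 50, 30, 20]))  # 20: one maintainer dominates
```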

#### CLI

```bash
# Markdown report
scraperx github yt-dlp/yt-dlp

# Full URL, JSON output
scraperx github https://github.com/rust-lang/rust --json

# Deep mode — qwen3.5:27b synthesis instead of qwen3:4b (slower, higher quality)
scraperx github yt-dlp/yt-dlp --deep

# Skip community mentions for a quick metadata-only check
scraperx github yt-dlp/yt-dlp --no-mentions

# Disable SQLite cache for this run
scraperx github yt-dlp/yt-dlp --no-cache

# Also: trending
scraperx trending # daily, all languages
scraperx trending --since weekly --lang python --limit 10
scraperx trending --json
```

#### Python

```python
from scraperx import GithubAnalyzer, analyze_github_repo

# One-shot — heuristic verdict (no LLM, no cache)
report = analyze_github_repo("yt-dlp/yt-dlp")
print(f"Trust: {report.trust.overall}/100 — {report.trust.rationale}")

# With full wiring: cache + web-search + LLM synthesis
from scraperx import SocialDB

analyzer = GithubAnalyzer(
    github_token=None,            # or os.environ["GITHUB_TOKEN"] for 5000/h
    db=SocialDB(),                # SQLite cache, 4-24h TTL per kind
    web_search_fn=my_web_search,  # Tier B — any local_web_search-compatible callable
    local_llm_fn=my_local_llm,    # qwen3:4b fast / qwen3.5:27b deep
)
report = analyzer.analyze_repo("https://github.com/rust-lang/rust", deep=True)

print(report.verdict_markdown)
for m in report.mentions:
    print(f"[{m.source}] {m.title} — {m.url}")
```

#### Authentication (Q1 handoff decision)

Unauth by default — 60 requests per hour, enough for personal use. Set `GITHUB_TOKEN` env var to upgrade to 5000/h; the analyzer picks it up automatically. No config file, no prompt.

#### How it avoids API-key lock-in

- HN: Algolia HN Search (free, unauthed)
- Reddit: `/search.json` (free, unauthed, UA required)
- StackOverflow: StackExchange API 2.3 (free, unauthed, 300/day)
- dev.to: public `/api/articles` (free)
- arXiv: Atom XML export (free)
- Papers With Code: public v1 API (free)
- Trending: HTML scrape of github.com/trending (no API exists)
- GitHub REST: works unauthed at 60/h
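
As an example of how thin these adapters can be, a sketch of the HN call (the real `mentions/hn.py` adds SQLite caching and `ExternalMention` normalisation; the query shape here is an assumption):

```python
# Illustrative Algolia HN Search call. Adapter contract from above: never
# raise, return [] on any error.
import json
import urllib.parse
import urllib.request

def hn_mentions(full_name: str) -> list[dict]:
    query = urllib.parse.quote(f'"{full_name}"')
    url = f"https://hn.algolia.com/api/v1/search?query={query}&tags=story"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            hits = json.loads(resp.read()).get("hits", [])
    except Exception:
        return []
    return [{"title": h.get("title"), "url": h.get("url")} for h in hits]

for m in hn_mentions("yt-dlp/yt-dlp"):
    print(m["title"], m["url"])
```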

#### Cache discipline

`SocialDB` caches repo metadata for 24h, commits/issues for 6h, mentions for 4h. Empty results are NOT cached — transient network errors can retry next call. All new tables share the existing `~/.scraperx/social.db` file.

---

### 6. Profile, search, token extraction
