Shai ("Sherut Yediot")

A TypeScript scraping pipeline backed by BullMQ and Redis. The backend discovers article URLs, crawls pages, extracts article content, normalizes the result, and stores it as JSON.

Codebase Overview

The backend is organized as a job pipeline:

DISCOVERY -> CRAWL -> EXTRACT -> NORMALIZE -> JSON storage
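
As a rough illustration of the chain above, the stage hand-off can be modeled as a lookup table. This is a minimal sketch, not the repository's code: the names Stage, NEXT_STAGE, and stagesFrom are hypothetical, and the real job types in types/ (which also include RETRY) may look different.

```typescript
// Hypothetical names for illustration; the real types live in types/.
type Stage = "DISCOVERY" | "CRAWL" | "EXTRACT" | "NORMALIZE";

// Each stage emits a job for the next one. NORMALIZE ends in JSON storage,
// so it has no successor job.
const NEXT_STAGE: Record<Stage, Stage | null> = {
  DISCOVERY: "CRAWL",
  CRAWL: "EXTRACT",
  EXTRACT: "NORMALIZE",
  NORMALIZE: null,
};

// Walk the chain from a starting stage to the end of the pipeline.
function stagesFrom(start: Stage): Stage[] {
  const path: Stage[] = [];
  let current: Stage | null = start;
  while (current !== null) {
    path.push(current);
    current = NEXT_STAGE[current];
  }
  return path;
}

console.log(stagesFrom("DISCOVERY").join(" -> "));
// prints "DISCOVERY -> CRAWL -> EXTRACT -> NORMALIZE"
```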

The main runtime pieces are:

workers/: BullMQ workers and handlers for each pipeline stage.
queues/: typed BullMQ queue factories, queue names, and enqueue helpers.
services/: scraping, extraction, normalization, storage, session, URL policy, and job ID utilities.
types/: shared TypeScript types for jobs, articles, sessions, and site config.
config/: Redis and per-site scraping configuration.

Pipeline Flow

A DISCOVERY job is seeded from configured site sources.
Discovery reads RSS feeds, sitemaps, or page links and filters URLs through the site URL policy.
Each discovered article URL becomes a CRAWL job.
Crawling fetches page HTML using the configured session type and timeout.
The crawler emits an EXTRACT job containing the fetched HTML, status, headers, and final URL.
Extraction parses article fields such as title, author, published date, site name, and content.
Extraction emits a NORMALIZE job with a RawArticle payload.
Normalization cleans text, validates minimum content length, computes IDs/hashes, and produces an Article.
Storage writes articles to data/articles.json.
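
The normalization step described above (clean text, check minimum length, compute IDs and hashes) might look roughly like the sketch below. Everything here is an assumption made for illustration: the field names on RawArticle and Article, the 200-character threshold, and the choice of SHA-256 are not taken from the repository.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes; the real RawArticle and Article types are in types/.
interface RawArticle {
  url: string;
  title: string;
  content: string;
}

interface Article extends RawArticle {
  id: string;
  contentHash: string;
}

// Assumed threshold; the real value comes from per-site normalization rules.
const MIN_CONTENT_LENGTH = 200;

function sha256(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Collapse runs of whitespace and trim the ends.
function cleanText(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Returns null when the cleaned content is too short to keep.
function normalize(raw: RawArticle, minLength = MIN_CONTENT_LENGTH): Article | null {
  const title = cleanText(raw.title);
  const content = cleanText(raw.content);
  if (content.length < minLength) {
    return null;
  }
  return {
    url: raw.url,
    title,
    content,
    id: sha256(raw.url), // assumed: ID derived from the URL
    contentHash: sha256(content), // hash of the cleaned body text
  };
}
```

Hashing the cleaned text rather than the raw HTML means two fetches that differ only in whitespace produce the same contentHash, which is what a later deduplication step would want.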

Important Modules

workers/index.ts: starts all workers and handles graceful shutdown.
workers/runner.ts: creates one BullMQ worker per job type.
workers/handlers.ts: orchestrates the pipeline from discovery through storage.
queues/enqueue.ts: adds jobs to queues with attempts, priority, and deterministic BullMQ jobIds.
services/discovery/feed-discovery.ts: discovers URLs from feeds, sitemaps, and HTML links.
services/crawling/http-crawler.ts: fetches article pages with fetch and reports session health.
services/extraction/generic-extractor.ts: parses article data from HTML with Cheerio.
services/normalization/article-normalizer.ts: converts raw article data into the final stored article shape.
services/storage/json-article-store.ts: persists articles to JSON.
services/session/in-memory-session-manager.ts: provides basic in-memory per-domain session reuse.
services/urls/url-policy.ts: enforces domain, allow-path, and deny-path URL rules.
services/jobs/job-id.ts: creates stable hash-based job IDs for duplicate prevention while jobs are in Redis.
config/sites.ts: defines supported sites, discovery sources, extraction selectors, and normalization rules.
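
To illustrate the idea behind services/jobs/job-id.ts, here is a minimal sketch of a stable hash-based job ID. The function name makeJobId, the SHA-256 hash, and the ID format are assumptions, not the module's actual implementation.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: the same job type and URL always yield the same ID.
function makeJobId(jobType: string, url: string): string {
  const digest = createHash("sha256")
    .update(`${jobType}\n${url}`)
    .digest("hex");
  // A hyphen is used as the separator because BullMQ uses ":" in its Redis keys.
  return `${jobType.toLowerCase()}-${digest.slice(0, 32)}`;
}

const first = makeJobId("CRAWL", "https://example.com/news/1");
const second = makeJobId("CRAWL", "https://example.com/news/1");
console.log(first === second); // prints "true"
```

Passing such an ID as the jobId option when adding a job lets BullMQ skip a second add with the same ID, as long as the earlier job still exists in Redis.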

Scripts

npm run build       # type-check the backend
npm run dev:worker  # start all workers and wait for queued jobs
npm run demo:scrape # start workers, seed configured discovery jobs, then stop after DEMO_MS or 30s
npm run smoke       # run a small BullMQ plumbing smoke test

Redis is required for BullMQ. By default the backend connects to 127.0.0.1:6379; override with REDIS_HOST and REDIS_PORT if needed.
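
A minimal sketch of how those defaults and overrides could be resolved is shown below. The function name redisConnectionFromEnv and the port validation are assumptions; the actual logic lives in config/ and may differ.

```typescript
interface RedisConnection {
  host: string;
  port: number;
}

// Hypothetical helper: read REDIS_HOST and REDIS_PORT, falling back to
// 127.0.0.1:6379 when they are unset or empty.
function redisConnectionFromEnv(
  env: Record<string, string | undefined> = process.env,
): RedisConnection {
  const host = env.REDIS_HOST || "127.0.0.1";
  const port = Number(env.REDIS_PORT || "6379");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid REDIS_PORT: ${env.REDIS_PORT}`);
  }
  return { host, port };
}
```

An object of this shape can be passed as the connection option when constructing a BullMQ Queue or Worker.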

Current State And Gaps

This is a working foundation/prototype rather than a fully mature scraper. The typed pipeline, queue plumbing, basic HTTP crawling, generic extraction, and JSON storage are in place.

Known gaps and future hooks:

RETRY jobs are typed but currently only logged.
Browser and hybrid crawling are represented in types/config but not implemented.
rateLimitPerSecond exists in site config but is not enforced yet.
session.sticky and session.maxFailuresBeforeRotate exist in config but are not wired into session rotation yet.
Session state is in-memory only.
Storage is JSON-file based, not database-backed.
contentHash is stored for future deduplication but not actively used yet.
BullMQ duplicate prevention is not permanent: completed jobs use removeOnComplete: true, so once a job finishes and is removed from Redis, the same jobId can be enqueued again.
