Shai ("Sherut Yediot")
A TypeScript scraping pipeline backed by BullMQ and Redis. The backend discovers article URLs, crawls pages, extracts article content, normalizes the result, and stores it as JSON.
The backend is organized as a job pipeline:
DISCOVERY -> CRAWL -> EXTRACT -> NORMALIZE -> JSON storage
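The stage order above can be written down as a small lookup table. This is only an illustrative sketch: the stage names come from the pipeline description, while `Stage`, `NEXT_STAGE`, `nextStage`, and `stagesFrom` are hypothetical names that may not match the types in `types/`.

```typescript
// Stage names follow the pipeline above; the helpers are hypothetical.
type Stage = "DISCOVERY" | "CRAWL" | "EXTRACT" | "NORMALIZE";

// Each stage hands its output to the next queue.
const NEXT_STAGE: Record<Stage, Stage | null> = {
  DISCOVERY: "CRAWL",
  CRAWL: "EXTRACT",
  EXTRACT: "NORMALIZE",
  NORMALIZE: null, // NORMALIZE ends in JSON storage, not in another queue
};

function nextStage(stage: Stage): Stage | null {
  return NEXT_STAGE[stage];
}

// Lists the stages a job passes through, starting from `start`.
function stagesFrom(start: Stage): Stage[] {
  const path: Stage[] = [];
  let current: Stage | null = start;
  while (current !== null) {
    path.push(current);
    current = nextStage(current);
  }
  return path;
}
```

Keeping the order in one table means a handler only needs to ask for the next stage rather than hard-coding queue names.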
The main runtime pieces are:
- `workers/`: BullMQ workers and handlers for each pipeline stage.
- `queues/`: typed BullMQ queue factories, queue names, and enqueue helpers.
- `services/`: scraping, extraction, normalization, storage, session, URL policy, and job ID utilities.
- `types/`: shared TypeScript types for jobs, articles, sessions, and site config.
- `config/`: Redis and per-site scraping configuration.
- A `DISCOVERY` job is seeded from configured site sources.
- Discovery reads RSS feeds, sitemaps, or page links and filters URLs through the site URL policy.
- Each discovered article URL becomes a `CRAWL` job.
- Crawling fetches page HTML using the configured session type and timeout.
- The crawler emits an `EXTRACT` job containing the fetched HTML, status, headers, and final URL.
- Extraction parses article fields such as title, author, published date, site name, and content.
- Extraction emits a `NORMALIZE` job with a `RawArticle` payload.
- Normalization cleans text, validates minimum content length, computes IDs/hashes, and produces an `Article`.
- Storage writes articles to `data/articles.json`.
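The normalization step (clean text, check minimum length, compute IDs and hashes) might look roughly like the sketch below. The field names on `RawArticle` and `Article`, the 200-character threshold, the choice of SHA-256, and deriving `id` from the URL are all assumptions made for illustration; the real types and rules live in `types/` and `config/sites.ts` and may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes; the project's RawArticle and Article carry more fields.
interface RawArticle {
  url: string;
  title: string;
  content: string;
}

interface Article extends RawArticle {
  id: string;
  contentHash: string;
}

const MIN_CONTENT_LENGTH = 200; // assumed threshold, not taken from the project

function sha256Hex(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Collapses runs of whitespace and trims both ends.
function cleanText(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Returns null when the cleaned content is too short to keep.
function normalizeArticle(
  raw: RawArticle,
  minLength: number = MIN_CONTENT_LENGTH,
): Article | null {
  const title = cleanText(raw.title);
  const content = cleanText(raw.content);
  if (content.length < minLength) return null;
  return {
    url: raw.url,
    title,
    content,
    id: sha256Hex(raw.url), // assumption: ID derived from the URL
    contentHash: sha256Hex(content), // hash of the cleaned text
  };
}
```

Hashing the cleaned text rather than the raw HTML means two copies of an article that differ only in whitespace produce the same `contentHash`.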
- `workers/index.ts`: starts all workers and handles graceful shutdown.
- `workers/runner.ts`: creates one BullMQ worker per job type.
- `workers/handlers.ts`: orchestrates the pipeline from discovery through storage.
- `queues/enqueue.ts`: adds jobs to queues with attempts, priority, and deterministic BullMQ `jobId`s.
- `services/discovery/feed-discovery.ts`: discovers URLs from feeds, sitemaps, and HTML links.
- `services/crawling/http-crawler.ts`: fetches article pages with `fetch` and reports session health.
- `services/extraction/generic-extractor.ts`: parses article data from HTML with Cheerio.
- `services/normalization/article-normalizer.ts`: converts raw article data into the final stored article shape.
- `services/storage/json-article-store.ts`: persists articles to JSON.
- `services/session/in-memory-session-manager.ts`: provides basic in-memory per-domain session reuse.
- `services/urls/url-policy.ts`: enforces domain, allow-path, and deny-path URL rules.
- `services/jobs/job-id.ts`: creates stable hash-based job IDs for duplicate prevention while jobs are in Redis.
- `config/sites.ts`: defines supported sites, discovery sources, extraction selectors, and normalization rules.
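To make the domain, allow-path, and deny-path rules concrete, here is a minimal sketch of such a check. The `UrlPolicy` shape, the `isAllowed` name, the use of regular expressions, and the "deny wins over allow" ordering are assumptions for illustration, not the actual contents of `services/urls/url-policy.ts`.

```typescript
// Hypothetical policy shape; the real site config may use other fields.
interface UrlPolicy {
  domains: string[]; // bare domains; subdomains are accepted too
  allowPaths: RegExp[]; // an empty list means "allow any path"
  denyPaths: RegExp[]; // checked before allowPaths
}

function parseUrl(raw: string): URL | null {
  try {
    return new URL(raw);
  } catch {
    return null; // not an absolute URL
  }
}

function isAllowed(rawUrl: string, policy: UrlPolicy): boolean {
  const url = parseUrl(rawUrl);
  if (url === null) return false;
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;

  // Match the exact domain or any subdomain of it.
  const host = url.hostname.toLowerCase();
  const domainOk = policy.domains.some(
    (d) => host === d || host.endsWith("." + d),
  );
  if (!domainOk) return false;

  const path = url.pathname;
  if (policy.denyPaths.some((re) => re.test(path))) return false;
  return (
    policy.allowPaths.length === 0 ||
    policy.allowPaths.some((re) => re.test(path))
  );
}
```

With `domains: ["example.com"]`, `allowPaths: [/^\/news\//]`, and `denyPaths: [/^\/news\/live\//]`, a URL such as `https://www.example.com/news/story-1` passes, while `https://example.com/news/live/blog` and `https://example.com/about` are filtered out.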
npm run build # type-check the backend
npm run dev:worker # start all workers and wait for queued jobs
npm run demo:scrape # start workers, seed configured discovery jobs, then stop after DEMO_MS or 30s
npm run smoke # run a small BullMQ plumbing smoke test
Redis is required for BullMQ. By default the backend connects to `127.0.0.1:6379`; override with `REDIS_HOST` and `REDIS_PORT` if needed.
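The defaults and overrides described above could be resolved with a helper along these lines. This is a sketch under assumptions: `redisConnectionFromEnv`, `RedisConnection`, and the port validation are hypothetical and not necessarily how `config/` reads the environment.

```typescript
// Hypothetical helper; only REDIS_HOST, REDIS_PORT, and the defaults
// come from the text above.
interface RedisConnection {
  host: string;
  port: number;
}

type Env = Record<string, string | undefined>;

function redisConnectionFromEnv(env: Env): RedisConnection {
  // Fall back to the documented defaults when a variable is unset or blank.
  const host = env["REDIS_HOST"]?.trim() || "127.0.0.1";
  const rawPort = env["REDIS_PORT"]?.trim();
  const port = rawPort ? Number(rawPort) : 6379;

  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid REDIS_PORT: ${rawPort}`);
  }
  return { host, port };
}
```

In a Node process this would be called as `redisConnectionFromEnv(process.env)`, and the resulting `{ host, port }` object has the shape BullMQ accepts for its `connection` option.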
This is a working foundation/prototype rather than a fully mature scraper. The typed pipeline, queue plumbing, basic HTTP crawling, generic extraction, and JSON storage are in place.
Known gaps and future hooks:
- `RETRY` jobs are typed but currently only logged.
- Browser and hybrid crawling are represented in types/config but not implemented.
- `rateLimitPerSecond` exists in site config but is not enforced yet.
- `session.sticky` and `session.maxFailuresBeforeRotate` exist in config but are not wired into session rotation yet.
- Session state is in-memory only.
- Storage is JSON-file based, not database-backed.
- `contentHash` is stored for future deduplication but not actively used yet.
- BullMQ duplicate prevention is not permanent because completed jobs use `removeOnComplete: true`.
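The last gap is easier to see with a toy model. BullMQ skips an `add` whose `jobId` already exists in the queue, so a stable hash-based ID blocks duplicates only for as long as the earlier job is still stored in Redis. The sketch below is not BullMQ code: `stableJobId` is a hypothetical stand-in for whatever `services/jobs/job-id.ts` does, and `ToyQueue` only imitates the ID check in memory.

```typescript
import { createHash } from "node:crypto";

// Hypothetical ID scheme: job type plus a truncated SHA-256 of type and URL.
// A "-" separator is used because BullMQ custom IDs must not contain ":".
function stableJobId(jobType: string, url: string): string {
  const digest = createHash("sha256")
    .update(`${jobType}\n${url}`)
    .digest("hex");
  return `${jobType.toLowerCase()}-${digest.slice(0, 32)}`;
}

// In-memory imitation of "an existing jobId is not added again".
class ToyQueue {
  private readonly ids = new Set<string>();

  // Returns true if the job was accepted, false if the ID was already known.
  add(jobId: string): boolean {
    if (this.ids.has(jobId)) return false;
    this.ids.add(jobId);
    return true;
  }

  // With removeOnComplete the ID is forgotten once the job finishes.
  complete(jobId: string, removeOnComplete: boolean): void {
    if (removeOnComplete) this.ids.delete(jobId);
  }
}

const queue = new ToyQueue();
const id = stableJobId("CRAWL", "https://example.com/news/story-1");

const firstAdd = queue.add(id); // accepted
const duplicateAdd = queue.add(id); // rejected while the job is still stored
queue.complete(id, true);
const addAfterComplete = queue.add(id); // accepted again: the ID was removed
```

Lasting deduplication would need a record that outlives the job, for example a persisted set of seen URLs or the stored `contentHash`.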