A high-performance, horizontally scalable, and fault-tolerant distributed task queue built from scratch using Node.js, Redis, and Lua Scripting. Designed to handle high-concurrency background job processing with strict at-least-once delivery guarantees, automated dead-letter routing, and real-time observability.
This system abandons basic, fragile list-popping in favor of a robust state-machine architecture, ensuring no tasks are lost to microsecond hardware failures or "poison pill" data.
- Producers: Ingest tasks into a Redis
List(queue:pending) at extremely high throughput. - Worker Cluster (Consumers): Scaled horizontally via PM2. Workers use atomic operations to safely acquire tasks.
- The State Engine: Tasks are moved to a Redis
Sorted Set(queue:processing) using the current timestamp as a score. Workers emit continuous Heartbeats to extend their lease. - The Janitor (Reaper): A detached microservice that sweeps the
queue:processingset for "zombie" tasks (where a worker crashed and stopped emitting heartbeats) and safely re-queues them. - Observability API: An Express.js backend and React dashboard to monitor queue health and manually recover Dead Letters.
- 100% Atomic Task Acquisition: Utilized custom Redis Lua Scripts to combine
RPOPandZADDinto a single atomic transaction, closing the microsecond gap where tasks could be lost during a crash. - Crash Resilience & Heartbeats: Workers run in parallel and maintain tasks via a heartbeat lease mechanism. If a worker process is killed (
SIGINT, Out of Memory), the task is preserved and automatically recovered by the independent Janitor process. - Poison Pill Isolation: Tasks that continuously fail logic blocks are tracked via a retry counter. Upon exceeding
MAX_RETRIES, they are banished to a Dead Letter Queue (DLQ) to prevent infinite loop processing. - Horizontal Scalability: Deployed using PM2 in cluster mode, allowing seamless CPU core utilization and linear throughput scaling.
The system was stress-tested with a burst of 10,000 concurrent jobs to validate ingestion limits, concurrency handling, and error recovery routing.
The Redis producer easily handles massive traffic spikes. In benchmark testing, the system successfully ingested 10,000 tasks in 631ms (achieving an ingestion rate of ~15,800 tasks/second).
The React Observability Dashboard tracks real-time engine state. Below, the system is dynamically distributing the 10,000 queued tasks across the multi-worker PM2 cluster.
PM2 logs validate the parallel nature of the system. Workers operate completely independently. When bench-8576 fails, the worker immediately logs the error, increments the retry counter, re-queues it, and continues processing other tasks without blocking the event loop.
Tasks that systematically fail (simulated 20% crash rate) exhaust their retry limits (3/3). The system successfully catches these "poison pills" and routes them to the DLQ, ensuring the main processing queue remains unblocked.
Using the API Control Panel, all 22 tasks in the Dead Letter Queue were reset and successfully re-processed. The system reaches a completely healthy, empty state with 0% task loss.
- Node.js (v20+)
- Redis Server (Running locally or via Docker)
- pnpm (
npm install -g pnpm) - PM2 (
npm install -g pm2)
-
Clone the repository:
git clone [https://github.com/RahulGIT24/Distributed-Job-Queue-System](https://github.com/RahulGIT24/Distributed-Job-Queue-System) cd main -
Install dependencies:
pnpm install
-
Start the Redis server (if not already running):
-
Configure environment variables: Create a
.envfile in the root directory with the following content:REDIS_HOST=localhost REDIS_PORT=6379 PORT=3000 QUEUE_PENDING="queue:pending" QUEUE_PROCESSING = "queue:processing" QUEUE_DLQ = "queue:dead_letters" -
Build the TypeScript code:
pnpm run build
-
Start the worker cluster:
pm2 start ecosystem.config.js
-
Start the Janitor process:
pnpm run janitor
-
Start the Observability API:
pnpm run start
-
Start stress testing:
pnpm run benchmark
-
Now go to frontend directory and start the React dashboard:
cd client pnpm install pnpm run start -
Open the dashboard at
http://localhost:5173to monitor the queue in real-time.




