A scalable, fault-tolerant distributed task scheduling engine designed for high-throughput environments. This project implements core concepts of priority scheduling, dependency management, and worker health monitoring.
- Priority Scheduling: Min-heap based scheduler supporting O(log n) operations.
- Dependency Awareness: Handles Task Dependencies via Directed Acyclic Graphs (DAG) and Kahn's Algorithm for topological sorting.
- Fault Tolerance: Heartbeat-based monitoring with automatic task reassignment on worker failure.
- Scalability: Designed to handle 1M+ tasks per day with parallel worker pools.
- ML Prediction Runner: A Python-based implementation for parallelizing ML inference jobs with exponential backoff and dead-letter queuing.
/core-engine: Java implementation of the scheduling core./python-runner: Python implementation of the distributed prediction runner./docs: Detailed architecture and design documents.
The scheduler uses a PriorityBlockingQueue to ensure tasks with higher priority (lower numerical value) are processed first.
Tasks can define dependencies. The scheduler ensures a task only moves to the READY state once all its parent tasks are COMPLETED. Circular dependencies are detected at submission time using DFS-based cycle detection.
The WorkerManager tracks heartbeats from distributed nodes. If a worker fails to check in within the configurable timeout (default 30s), its active tasks are automatically returned to the READY queue for reassignment.
- Exponential Backoff: Retries failing tasks with increasing wait times.
- Dead Letter Queue (DLQ): Persistently failing tasks are moved to a DLQ with full failure metadata for debugging.
The Python prediction runner demonstrates an ~87% reduction in batch processing time by distributing workloads across parallel workers (simulated 47m down to 6m for 100k requests).
MIT