Skip to content

sharathchandra1110/distributed-task-engine

Repository files navigation

Distributed Task Scheduling & Execution Engine

A scalable, fault-tolerant distributed task scheduling engine designed for high-throughput environments. This project implements core concepts of priority scheduling, dependency management, and worker health monitoring.

Key Features

  • Priority Scheduling: Min-heap based scheduler supporting O(log n) operations.
  • Dependency Awareness: Handles Task Dependencies via Directed Acyclic Graphs (DAG) and Kahn's Algorithm for topological sorting.
  • Fault Tolerance: Heartbeat-based monitoring with automatic task reassignment on worker failure.
  • Scalability: Designed to handle 1M+ tasks per day with parallel worker pools.
  • ML Prediction Runner: A Python-based implementation for parallelizing ML inference jobs with exponential backoff and dead-letter queuing.

Project Structure

  • /core-engine: Java implementation of the scheduling core.
  • /python-runner: Python implementation of the distributed prediction runner.
  • /docs: Detailed architecture and design documents.

Core Concepts

1. Priority Scheduler

The scheduler uses a PriorityBlockingQueue to ensure tasks with higher priority (lower numerical value) are processed first.

2. Dependency Management

Tasks can define dependencies. The scheduler ensures a task only moves to the READY state once all its parent tasks are COMPLETED. Circular dependencies are detected at submission time using DFS-based cycle detection.

3. Worker Monitoring

The WorkerManager tracks heartbeats from distributed nodes. If a worker fails to check in within the configurable timeout (default 30s), its active tasks are automatically returned to the READY queue for reassignment.

4. Fault Recovery

  • Exponential Backoff: Retries failing tasks with increasing wait times.
  • Dead Letter Queue (DLQ): Persistently failing tasks are moved to a DLQ with full failure metadata for debugging.

Performance

The Python prediction runner demonstrates an ~87% reduction in batch processing time by distributing workloads across parallel workers (simulated 47m down to 6m for 100k requests).

License

MIT

About

Distributed task execution engine with scheduler, worker management, priority queue, DLQ, and metrics.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors