Distributed Task Scheduling & Execution Engine

A scalable, fault-tolerant distributed task scheduling engine designed for high-throughput environments. This project implements core concepts of priority scheduling, dependency management, and worker health monitoring.

Key Features

Priority Scheduling: Min-heap based scheduler supporting O(log n) operations.
Dependency Awareness: Handles Task Dependencies via Directed Acyclic Graphs (DAG) and Kahn's Algorithm for topological sorting.
Fault Tolerance: Heartbeat-based monitoring with automatic task reassignment on worker failure.
Scalability: Designed to handle 1M+ tasks per day with parallel worker pools.
ML Prediction Runner: A Python-based implementation for parallelizing ML inference jobs with exponential backoff and dead-letter queuing.

Project Structure

/core-engine: Java implementation of the scheduling core.
/python-runner: Python implementation of the distributed prediction runner.
/docs: Detailed architecture and design documents.

Core Concepts

1. Priority Scheduler

The scheduler uses a PriorityBlockingQueue to ensure tasks with higher priority (lower numerical value) are processed first.

2. Dependency Management

Tasks can define dependencies. The scheduler ensures a task only moves to the READY state once all its parent tasks are COMPLETED. Circular dependencies are detected at submission time using DFS-based cycle detection.

3. Worker Monitoring

The WorkerManager tracks heartbeats from distributed nodes. If a worker fails to check in within the configurable timeout (default 30s), its active tasks are automatically returned to the READY queue for reassignment.

4. Fault Recovery

Exponential Backoff: Retries failing tasks with increasing wait times.
Dead Letter Queue (DLQ): Persistently failing tasks are moved to a DLQ with full failure metadata for debugging.

Performance

The Python prediction runner demonstrates an ~87% reduction in batch processing time by distributing workloads across parallel workers (simulated 47m down to 6m for 100k requests).

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
core-engine		core-engine
docs		docs
python-runner/src		python-runner/src
.gitignore		.gitignore
COMPLETION_REPORT.txt		COMPLETION_REPORT.txt
FEATURES.md		FEATURES.md
IMPLEMENTATION_SUMMARY.md		IMPLEMENTATION_SUMMARY.md
QUICK_REFERENCE.md		QUICK_REFERENCE.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Distributed Task Scheduling & Execution Engine

Key Features

Project Structure

Core Concepts

1. Priority Scheduler

2. Dependency Management

3. Worker Monitoring

4. Fault Recovery

Performance

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Distributed Task Scheduling & Execution Engine

Key Features

Project Structure

Core Concepts

1. Priority Scheduler

2. Dependency Management

3. Worker Monitoring

4. Fault Recovery

Performance

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages