This project is a distributed system for discovering large prime numbers. It combines a coordinator that distributes work to a pool of workers, a fault-tolerant file system, server replication, and snapshot-based recovery.
The repository is organized into three main subsystems:
- PrimeFS: a small distributed file system with gRPC and non-gRPC variants, client-side caching, and commit-on-close semantics.
- Replication + Fault Tolerance: a replicated storage layer with crash recovery, leader-aware client retries, and Raft-based coordination.
- Prime App + Snapshot: a worker-coordinator application that assigns dataset files, computes primes, aggregates results, and restores progress from snapshots.
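
As a rough illustration of the client-side caching and commit-on-close semantics mentioned for PrimeFS, here is a minimal in-memory sketch. It is not the repository's implementation: `InMemoryServer`, `CachedFile`, and their methods are hypothetical names chosen for this example, and a plain dictionary stands in for the real server and its RPC layer.

```python
class InMemoryServer:
    """Hypothetical stand-in for a PrimeFS server: maps file names to bytes."""

    def __init__(self):
        self.files = {}

    def fetch(self, name):
        return self.files.get(name, b"")

    def store(self, name, data):
        self.files[name] = data


class CachedFile:
    """Hypothetical client handle: reads and writes use a local copy until close()."""

    def __init__(self, server, name):
        self.server = server
        self.name = name
        self.buffer = bytearray(server.fetch(name))  # local cached copy
        self.dirty = False

    def read(self):
        return bytes(self.buffer)

    def write(self, data):
        self.buffer.extend(data)  # append to the cached copy only
        self.dirty = True

    def close(self):
        if self.dirty:  # commit on close: upload the whole file once
            self.server.store(self.name, bytes(self.buffer))
            self.dirty = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False


server = InMemoryServer()
with CachedFile(server, "primes.txt") as f:
    f.write(b"2\n3\n")
    visible_before_close = server.fetch("primes.txt")  # nothing committed yet
visible_after_close = server.fetch("primes.txt")       # committed by close()
```

In this sketch other clients only see a write once the writer closes the file, which keeps each update to a single upload. The real system may differ, for example in how it invalidates stale caches.
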
- Workers read input datasets through PrimeFS, compute primes, and report progress back to the coordinator.
- The coordinator assigns work, tracks worker state, detects failures, and writes unique results to `results.txt`.
- Snapshot files allow the coordinator to recover its state after a crash and continue processing from the latest known progress.
- The system is designed to tolerate transient failures while keeping results consistent and recoverable.
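
The assign, report, and recover cycle described above can be sketched as follows. This is a simplified, single-process illustration under stated assumptions, not the project's actual code: the `Coordinator` class, the JSON snapshot layout, and the in-memory `datasets` dictionary (standing in for files read through PrimeFS) are all hypothetical.

```python
import json
import os
import tempfile


def is_prime(n):
    """Trial division; adequate for a small illustration."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


class Coordinator:
    """Hypothetical coordinator that persists its progress after every report."""

    def __init__(self, snapshot_path, datasets):
        self.snapshot_path = snapshot_path
        self.pending = list(datasets)  # dataset names not yet processed
        self.done = []
        self.primes = set()            # a set keeps the results unique
        if os.path.exists(snapshot_path):
            self._restore()

    def _restore(self):
        with open(self.snapshot_path) as fh:
            state = json.load(fh)
        self.done = state["done"]
        self.primes = set(state["primes"])
        self.pending = [d for d in self.pending if d not in self.done]

    def snapshot(self):
        state = {"done": self.done, "primes": sorted(self.primes)}
        tmp = self.snapshot_path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp, self.snapshot_path)  # swap in the new snapshot in one step

    def assign(self):
        return self.pending[0] if self.pending else None

    def report(self, dataset, found):
        self.primes.update(found)
        self.pending.remove(dataset)
        self.done.append(dataset)
        self.snapshot()


datasets = {"a.txt": [2, 3, 4, 5], "b.txt": [5, 6, 7]}
path = os.path.join(tempfile.mkdtemp(), "snapshot.json")

first = Coordinator(path, datasets)
name = first.assign()
first.report(name, [n for n in datasets[name] if is_prime(n)])

# Simulate a crash: build a fresh coordinator from the snapshot on disk.
second = Coordinator(path, datasets)
remaining = list(second.pending)
name = second.assign()
second.report(name, [n for n in datasets[name] if is_prime(n)])
final_primes = sorted(second.primes)
```

After the simulated crash, the second coordinator skips the dataset that was already processed and still ends with each prime recorded once. Failure detection and concurrent workers are left out to keep the sketch short.
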
- `File_System/`: PrimeFS implementation and supporting files.
- `Replication + Fault Tolerance/`: replicated storage and Raft-based fault tolerance.
- `Prime App + Snapshot/`: coordinator, workers, and snapshot recovery logic.
- 22110103 — Jaidev Sanjay Khalane
- 22110108 — John Twipraham Debbarma
- 22110016 — Aditya Mangla
- Part 1: File System (`File_System`)
- Part 2: Fault Tolerance (`Replication + Fault Tolerance`)
- Part 3: Server Replication (`Replication + Fault Tolerance`)
- Part 4: Prime Number Application (`Prime App + Snapshot`)
- Part 5: Snapshot (`Prime App + Snapshot`)
For more details on the design and implementation of each part, please refer to the corresponding design documents:
- Part 1: DesignDoc_P1.md
- Part 2: DesignDoc_P2.md
- Part 3: DesignDoc_P3.md
- Part 4 and 5: DesignDoc_P45.md