Imagine a large factory producing cars. This factory is too big and complex for one person to handle everything. So, the work is divided:
- One group handles the engine.
- Another group works on painting.
- Another handles the interior.

Each group is located in a separate building, but they all coordinate to build the final car. If one group stops, the car can't be completed.
This is how a distributed system works — the job is too big for one machine, so multiple machines (servers) are given parts of the job. They work independently but must stay in sync to complete the task.
A distributed system is a collection of independent computers (nodes or servers) that appear to the end-user as a single coherent system. These nodes communicate with each other over a network to perform tasks in a coordinated manner. Each component in the system may be:
- On different physical machines.
- In different locations or data centers.
- Running different parts of the application or different types of services.
Despite the physical separation, they work together to offer high performance, scalability, fault tolerance, and reliability.
Here is a categorized view of technologies and systems that are considered distributed systems, with explanations:
Distributed Databases
- Examples: Cassandra, MongoDB, Google Spanner
- Layman Analogy: Imagine storing books in multiple libraries instead of one. If one library is full or burns down, the rest still have copies.
- Technical View: Data is spread across multiple servers. Each server holds a portion of the dataset (called shards or partitions). If one server fails, others can take over, ensuring availability and durability.
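As an illustrative sketch (not any real database's algorithm), the partitioning-plus-replication idea can be shown in a few lines of Python; the shard names and replication factor here are invented:

```python
import hashlib

# Hypothetical sketch: assign each record to a shard by hashing its key.
SHARDS = ["shard-a", "shard-b", "shard-c"]
REPLICAS = 2  # each record is stored on this many shards for durability

def shards_for_key(key: str) -> list:
    """Pick a primary shard by hash, plus the next shard as a replica."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    primary = h % len(SHARDS)
    return [SHARDS[(primary + i) % len(SHARDS)] for i in range(REPLICAS)]

# Every key deterministically maps to the same shards, so reads know
# where to look, and a replica survives the loss of the primary shard.
placement = shards_for_key("user:1001")
```

Because placement is deterministic, any client can locate a record without asking a coordinator, and the replica copy keeps the data available if the primary shard's server fails.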
Distributed File Systems
- Examples: Hadoop Distributed File System (HDFS), Google File System (GFS)
- Layman Analogy: A large movie is broken into small scenes and stored across multiple USB drives. To watch the full movie, the scenes are played from each USB in order.
- Technical View: Large files are split into chunks and stored on different machines. A master node keeps track of where each chunk is located. Clients request data from the correct node, enabling parallel access and better performance.
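A toy version of chunking with a master-style location table might look like this (the chunk size and node names are made up for illustration; HDFS, for instance, defaults to 128 MB chunks):

```python
CHUNK_SIZE = 4  # bytes per chunk; real systems use tens or hundreds of MB

def split_into_chunks(data: bytes, nodes: list) -> dict:
    """Split data into fixed-size chunks and record which node stores
    each one, mimicking (very loosely) a master node's location table."""
    table = {}
    for i in range(0, len(data), CHUNK_SIZE):
        chunk_id = i // CHUNK_SIZE
        table[chunk_id] = {
            "node": nodes[chunk_id % len(nodes)],  # round-robin placement
            "data": data[i:i + CHUNK_SIZE],
        }
    return table

def reassemble(table: dict) -> bytes:
    """Client-side read: fetch chunks in order and stitch them together."""
    return b"".join(table[i]["data"] for i in sorted(table))
```

In a real system the table would hold only locations, not data, and the client would fetch chunks from the storage nodes in parallel.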
Microservices Architecture
- Examples: Netflix microservices, e-commerce services (catalog, cart, payment)
- Layman Analogy: A food delivery service with separate departments for order taking, cooking, packing, and delivery. Each team works independently but must stay in sync for the final delivery.
- Technical View: Each part of the application (service) runs independently, often in containers like Docker, and communicates via APIs or messaging queues. These services are usually deployed on separate servers or clusters.
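The "independent services talking through messages" idea can be sketched in-process; the service names and message shapes below are invented, and a real deployment would use HTTP APIs or a message broker rather than local queues:

```python
from queue import Queue

# Toy sketch: each "service" owns an inbox queue and only talks to others
# by sending messages; no service shares state with another directly.
inboxes = {"catalog": Queue(), "cart": Queue(), "payment": Queue()}

def send(service: str, message: dict) -> None:
    inboxes[service].put(message)

def cart_service() -> dict:
    """Consume one 'add item' message and forward an order to payment."""
    msg = inboxes["cart"].get()
    order = {"item": msg["item"], "amount": msg["price"]}
    send("payment", order)
    return order

send("cart", {"item": "book", "price": 12})
order = cart_service()
charged = inboxes["payment"].get()
```

The point of the sketch is the coupling model: the cart service knows only the payment service's inbox, not its internals, so either side can be redeployed or scaled independently.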
Cloud Computing Platforms
- Examples: AWS, GCP, Azure
- Layman Analogy: Renting storage and machines across the globe rather than owning one big computer in one place.
- Technical View: Provides compute, storage, and networking services across distributed physical locations. Cloud platforms themselves are built using distributed principles to offer elastic scaling and high availability.
Peer-to-Peer (P2P) Systems
- Examples: BitTorrent, Bitcoin, Ethereum
- Layman Analogy: Instead of downloading a file from one central source, users share the file chunks with each other.
- Technical View: Every node in the system acts as both a client and a server. Data is exchanged directly between nodes without a central coordinator. This enhances fault tolerance and decentralization.
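A toy swarm illustrates the idea: each peer starts with a few chunks and pulls missing ones from whichever peer has them, with no central server involved (peer names and chunk counts are invented):

```python
# Each peer begins with a subset of a 4-chunk file.
peers = {
    "alice": {0, 1},
    "bob": {2},
    "carol": {3},
}
ALL_CHUNKS = {0, 1, 2, 3}

def exchange_round(peers: dict) -> None:
    """One gossip round: every peer pulls one missing chunk from any
    other peer that already has it (acting as both client and server)."""
    for name, have in peers.items():
        missing = ALL_CHUNKS - have
        for chunk in sorted(missing):
            if any(chunk in other for p, other in peers.items() if p != name):
                have.add(chunk)  # "download" from whichever peer has it
                break

# Rounds continue until every peer holds the complete file.
while any(have != ALL_CHUNKS for have in peers.values()):
    exchange_round(peers)
```

Notice that no single node ever needed the whole file to start with; the swarm converges because every chunk exists somewhere in it.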
Content Delivery Networks (CDNs)
- Examples: Cloudflare, Akamai
- Layman Analogy: Storing ice cream in multiple freezers across the city. When someone orders it, the nearest freezer delivers it to avoid melting.
- Technical View: Static content (e.g., images, CSS, JS, videos) is cached and stored in edge servers located close to users. Requests are served from the nearest node, improving latency and reducing server load.
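Nearest-edge routing plus caching can be sketched as follows; the edge names, coordinates, and origin content are invented, and real CDNs route on measured network latency rather than simple distance:

```python
# Hypothetical edge locations on a 1-D "distance" axis for simplicity.
EDGES = {"edge-ny": 10, "edge-ldn": 60, "edge-tok": 95}

cache = {name: {} for name in EDGES}     # each edge keeps its own cache
ORIGIN = {"/logo.png": b"<png bytes>"}   # the origin server's content

def fetch(path: str, user_pos: int) -> bytes:
    """Serve from the nearest edge; on a cache miss, pull from origin once."""
    nearest = min(EDGES, key=lambda e: abs(EDGES[e] - user_pos))
    if path not in cache[nearest]:
        cache[nearest][path] = ORIGIN[path]  # cache miss: fill from origin
    return cache[nearest][path]
```

After the first request from a region, subsequent requests are served entirely from that region's edge, which is exactly the latency and origin-load win the analogy describes.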
Distributed Caching Systems
- Examples: Redis Cluster, Memcached
- Layman Analogy: Keeping frequently used tools in multiple toolboxes located in different rooms instead of one central locker.
- Technical View: Frequently accessed data is stored in memory across several servers to allow fast retrieval and reduce load on backend systems.
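A minimal sketch of client-side sharding with the cache-aside pattern (roughly the model Memcached clients follow; the node count and keys are invented, and real clients typically use consistent hashing rather than a bare modulo):

```python
# Three in-memory "cache servers"; the client hashes the key to pick one.
NODES = [dict(), dict(), dict()]

def node_for(key: str) -> dict:
    return NODES[hash(key) % len(NODES)]

def cache_get(key, loader):
    """Cache-aside: return the cached value, or load it once and cache it."""
    node = node_for(key)
    if key not in node:
        node[key] = loader(key)  # slow path: hit the backend once
    return node[key]
```

Spreading keys across nodes means no single cache server holds the whole working set, and the backend is only touched on the first miss for each key.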
Blockchain Networks
- Examples: Ethereum, Hyperledger Fabric
- Layman Analogy: Everyone in a group keeps a copy of the same notebook. When someone writes in it, all others update their copy.
- Technical View: Every node maintains a copy of the ledger. Changes are validated through consensus algorithms like Proof of Work or Proof of Stake. It is inherently fault-tolerant and tamper-resistant.
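The hash-linked ledger idea can be shown in miniature; this sketch omits consensus entirely and only demonstrates the tamper-evidence property:

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Hash a block's contents (excluding its own hash field)."""
    payload = {k: v for k, v in block.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, data: str) -> None:
    """Link the new block to the previous one by embedding its hash."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    block = {"index": len(chain), "data": data, "prev": prev}
    block["hash"] = block_hash(block)
    chain.append(block)

def is_valid(chain: list) -> bool:
    """Tampering with any block breaks the hash links after it."""
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block):
            return False
        if i > 0 and block["prev"] != chain[i - 1]["hash"]:
            return False
    return True
```

Every node validating every copy against these links is what makes silently rewriting history so hard; consensus algorithms then decide which valid chain the group agrees on.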
These systems qualify as distributed because they meet one or more of the following conditions:
- Data/logic is split across multiple nodes.
- Nodes communicate to share work, coordinate actions, or replicate state.
- The system can scale horizontally by adding more nodes.
- They maintain availability and fault tolerance despite node failures or network issues.
- They rely on network communication as the glue between parts.
Summary Table
| Type | Real-World Analogy | Key Features |
|---|---|---|
| Distributed DB | Books in multiple libraries | Data partitioning and replication |
| Distributed File System | Scenes of a movie on different drives | Chunking, redundancy |
| Microservices | Food delivery departments | Independent, loosely coupled |
| Cloud Platforms | Renting resources globally | Scalable, elastic infrastructure |
| P2P Systems | File sharing among friends | No central authority |
| CDNs | Ice cream from nearest freezer | Edge caching |
| Caching Systems | Tools in multiple rooms | Fast access to frequent data |
| Blockchain | Everyone keeps the same notebook | Decentralized and secure ledger |
Deep Dive into Partition Tolerance in CAP Theorem (With MongoDB, Kafka, and Application-Level Perspective)
Partition Tolerance means that a distributed system can continue to operate correctly even if network failures or partitions occur—i.e., when some parts of the system can't communicate with others.
In CAP (Consistency, Availability, Partition Tolerance), Partition Tolerance is non-negotiable in any distributed system. If a system spans networked nodes, the possibility of network failure mandates designing with partition tolerance in mind.
MongoDB and Partition Tolerance
- Sharded Architecture: MongoDB can be horizontally scaled using shards.
- With Replication: If a network partition occurs and a primary becomes unreachable, MongoDB uses an automatic election to promote a new primary. Data consistency depends on the write concern level.
- Without Replication: If a shard has no replica and goes down, all data in that shard becomes unavailable, affecting reads and writes for that data segment.
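A greatly simplified failover sketch: when the primary is unreachable, promote the most up-to-date reachable node. The field names below are invented, and MongoDB's real election protocol (Raft-based) also involves terms, votes, and member priorities:

```python
# Toy replica set state; "oplog_ts" stands in for how current each node is.
nodes = [
    {"name": "node-a", "role": "primary",   "reachable": False, "oplog_ts": 105},
    {"name": "node-b", "role": "secondary", "reachable": True,  "oplog_ts": 104},
    {"name": "node-c", "role": "secondary", "reachable": True,  "oplog_ts": 101},
]

def elect_primary(nodes: list) -> str:
    """Demote an unreachable primary and promote the freshest reachable node."""
    candidates = [n for n in nodes if n["reachable"]]
    winner = max(candidates, key=lambda n: n["oplog_ts"])
    for n in nodes:
        n["role"] = "primary" if n is winner else "secondary"
    return winner["name"]
```

Note that node-b wins despite being slightly behind node-a: the writes node-a accepted after timestamp 104 may be rolled back, which is exactly why durability depends on the write concern used.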
Kafka and Partition Tolerance
- Partitions: Kafka topics are split into partitions (like shards) distributed across brokers.
- Replication: Kafka replicates each partition across brokers. If the leader broker goes down, an in-sync replica can take over.
- Without Replication: If a broker hosting an unreplicated partition goes down, that partition becomes unavailable, and its messages are at risk of permanent loss.
- Tolerance: Depending on configuration (e.g., producer acks and unclean leader election settings), Kafka can favor availability over strict consistency, allowing writes to continue during a partition and reconciling once the partition heals.
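The per-key partitioning that preserves ordering can be sketched like this (the hash here is a toy stand-in; Kafka's default partitioner actually uses murmur2):

```python
# Kafka's placement idea in miniature: messages with the same key always
# land in the same partition, which preserves per-key ordering.
NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

def produce(key: str, value: str) -> int:
    """Hash the key to a partition and append the message there."""
    p = sum(key.encode()) % NUM_PARTITIONS  # toy stand-in for murmur2
    partitions[p].append((key, value))
    return p
```

Because each partition lives on specific brokers, losing a broker only affects the partitions it hosts, and replication (when enabled) lets a follower on another broker take over as leader.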
Application Deployed on VMs (Without Kubernetes)
- Services are deployed manually across different VMs.
- No automatic health checks or rescheduling of failed services.
- Network partitions between services (e.g., Product Service can't reach Payment Service) can cause failures.
- Partition handling is manual—needs retry logic, circuit breakers, and fallback mechanisms.
- Fault Tolerance requires custom load balancers, redundant instances, and monitoring scripts.
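The retry-plus-circuit-breaker logic these bullets describe might be sketched as follows (the thresholds and backoff values are invented):

```python
import time

class CircuitBreaker:
    """After too many consecutive failures, fail fast instead of retrying,
    giving the unreachable service time to recover."""

    def __init__(self, max_failures: int = 3):
        self.failures = 0
        self.max_failures = max_failures

    def call(self, fn, retries: int = 2, backoff: float = 0.01):
        if self.failures >= self.max_failures:
            raise RuntimeError("circuit open: failing fast")
        for attempt in range(retries + 1):
            try:
                result = fn()
                self.failures = 0  # success closes the circuit again
                return result
            except ConnectionError:
                self.failures += 1
                if attempt == retries:
                    raise
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
```

On VMs, every service team writes and tunes logic like this themselves; as the next section shows, an orchestrator absorbs much of that burden.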
Application Deployed with Kubernetes
- Kubernetes manages pod health and automatically restarts failed pods.
- Services are replicated and distributed across nodes.
- Network partitions are detected and handled through service discovery and health checks.
- Supports self-healing and auto-rescheduling of pods.
- Ensures high availability even during partial network outages.
- Stronger abstraction for partition tolerance with less manual effort.
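The self-healing behavior can be illustrated with a toy reconciliation loop in the spirit of a Kubernetes controller: compare desired state with observed state and schedule replacements (the service and pod names are invented):

```python
# Desired state: how many replicas each service should have.
desired = {"product-svc": 3, "payment-svc": 2}
# Observed state: two product-svc pods have crashed.
running = {"product-svc": ["p1"], "payment-svc": ["q1", "q2"]}

def reconcile(desired: dict, running: dict) -> list:
    """One pass of the control loop: schedule replacements for missing pods."""
    started = []
    for svc, want in desired.items():
        have = running.get(svc, [])
        for i in range(len(have), want):
            pod = f"{svc}-new-{i}"
            have.append(pod)        # "schedule" a replacement pod
            started.append(pod)
        running[svc] = have
    return started
```

Running this loop continuously is the core idea: no operator intervenes, yet the observed state keeps converging back to the desired state after failures.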
Fault Tolerance vs. Partition Tolerance
- Fault Tolerance: Focuses on system behavior during component failures (e.g., a pod crashes).
- Partition Tolerance: Focuses on system behavior during network communication failures (e.g., one service can't reach another due to a network issue).
Both are critical but solve different classes of problems.
Best Practices for Partition Tolerance
- Always assume network partitions will happen.
- Use replication wherever possible (databases, brokers, services).
- Design services to be stateless or persist state in resilient systems.
- Implement retry mechanisms, timeouts, and circuit breakers.
- Use message brokers like Kafka for asynchronous, fault-tolerant communication.
- For mission-critical systems, orchestrate with Kubernetes to leverage built-in recovery, load balancing, and observability.
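The "stateless services" guideline above can be shown in miniature: if instances keep no local state, any replica can serve any request after a failover (the dict below stands in for a resilient external store such as a replicated database):

```python
# Shared, durable store (stand-in). Instances hold no state of their own.
store = {}

def handle_request(instance_id: str, user: str) -> int:
    """Increment a per-user counter. The result does not depend on which
    instance handles the request, so requests can be routed to any replica."""
    store[user] = store.get(user, 0) + 1
    return store[user]
```

Had each instance kept the counter in its own memory, a failover or a partition between instances would silently lose or fork state; externalizing it makes replicas interchangeable.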
Summary
| Component | With Replication | Without Replication | Behavior Under Partition |
|---|---|---|---|
| MongoDB | Elects new primary, continues | Data unavailable | Partial unavailability or split-brain |
| Kafka | Replica becomes leader, no data loss | Data loss risk | Partition unavailable |
| VMs (Apps) | Manual failover, possible downtime | Full outage | Requires retries/manual recovery |
| Kubernetes | Auto-recovery and rescheduling | Limited scope (pod-specific) | Partial failure managed gracefully |