30 production-grade system design case studies for mastering distributed systems, preparing for FAANG interviews, and building real-world architectures. From Netflix-scale video streaming to Uber's geospatial matching, learn how industry giants handle billions of requests per day.
Each case study follows a 5-chapter structure:
- Requirements & Scale - Functional/non-functional requirements, traffic estimates, cost analysis
- Architecture Design - Components (what & why), data flows, data models, APIs, monitoring
- Key Technical Decisions - Trade-offs with "when to reconsider" triggers (e.g., SQL vs NoSQL)
- Wrap-Up & Deep Dives - Scaling playbook (MVP→Production→Scale), failure scenarios, SLOs, pitfalls, interview tips
Writing Philosophy: Concise & practical (ByteByteGo/Educative style), real-world trade-offs over theory, domain-specific depth (not templated copy-paste).
| # | System | Key Concepts | Scale Targets |
|---|---|---|---|
| 01 | Real-Time Chat | WebSocket, message queue, presence, E2EE | 100M users, 1M concurrent |
| 02 | Ride-Sharing | Geospatial matching (Geohash), ETA prediction, surge pricing | 10M rides/day, real-time tracking |
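
The Ride-Sharing row names Geohash as its geospatial index. As a rough illustration of the idea (not the case study's own code), here is a minimal sketch of standard Geohash encoding; the function name `geohash_encode` and the default precision are assumptions made for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a Geohash string of `precision` chars."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate longitude, latitude, longitude, ...
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, 5))
```

Nearby points usually share a prefix, which is why a matching service can bucket drivers by a short Geohash and search the rider's cell plus its neighbors. Points on opposite sides of a cell boundary can have very different hashes, so the neighbor lookup is not optional.
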
| # | System | Key Concepts | Scale Targets |
|---|---|---|---|
| 03 | Video Streaming | Adaptive bitrate (HLS), CDN, multi-region, transcoding | 100M hours/day, Netflix-scale |
| 06 | Collaborative Docs | Operational Transform/CRDT, WebSocket, conflict resolution | 1M concurrent editors, Google Docs |
| 07 | CDN | Edge caching, origin shielding, cache invalidation, geo-routing | 10TB/day bandwidth, Cloudflare-scale |
| 12 | Live Streaming | LL-HLS/WebRTC, RTMP ingest, chat moderation, DVR | 1M concurrent viewers, Twitch-scale |
| # | System | Key Concepts | Scale Targets |
|---|---|---|---|
| 04 | Social Media Feed | Fanout (push/pull hybrid), timeline ranking, newsfeed generation | 500M users, Twitter/Instagram |
| 05 | E-Commerce | Inventory management, shopping cart, checkout, order fulfillment | 10M products, Amazon-scale |
| 08 | Stock Trading | Order matching engine, order book, market data, low-latency | <10ms p99 latency, NASDAQ-scale |
| 20 | Hotel Reservation | Pessimistic locking, overbooking prevention, payment auth/capture | 1M bookings/day, Booking.com |
| # | System | Key Concepts | Scale Targets |
|---|---|---|---|
| 09 | Email Service | SMTP/IMAP, spam filtering, attachment storage, rate limiting | 10B emails/day, Gmail-scale |
| 10 | Search Engine | Inverted index, PageRank, query processing, autocomplete | 10B documents, Google-scale |
| 11 | Task Scheduler | Cron-like scheduling, DAG execution, retry logic, priority queues | 1M tasks/day, Airflow/Temporal |
| 17 | IoT Pipeline | MQTT ingestion, stream processing (Flink), time-series DB, OTA updates | 10M devices, 1B events/day |
| 18 | Distributed Cache | Consistent hashing, LRU/LFU eviction, master-replica, pub/sub | 100K RPS, Redis/Memcached |
| 19 | Recommendation Engine | Collaborative filtering, two-tower embeddings, candidate generation, real-time signals | 100M users, Netflix/YouTube |
| 23 | Observability Platform | Metrics (Prometheus), logs (Loki), traces (Jaeger), alerting, cardinality limits | 10M metrics/sec, Datadog-scale |
| 27 | Distributed File Storage | Erasure coding (Reed-Solomon), replication, metadata sharding, 11 9's durability | 10B objects, 10PB, S3-scale |
| 28 | Web Crawler | URL frontier, robots.txt, politeness, deduplication (MD5/Simhash), BFS | 10B pages, Googlebot-scale |
| 30 | Real-Time Analytics | Stream processing (Flink), OLAP (ClickHouse), pre-aggregation, caching | 10B events/day, Google Analytics |
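
The Distributed Cache row lists consistent hashing as a key concept. The sketch below is one common way to build a hash ring with virtual nodes; the class name `HashRing`, the `vnodes` default, and the node names are hypothetical, and MD5 is used only as a convenient stable hash, not for security.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of MD5 as an integer: stable across processes.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        # Each node is placed at `vnodes` positions to smooth the distribution.
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First position clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = HashRing(["cache-a", "cache-b", "cache-c"])
    print(ring.get("user:42"))
```

The property that matters for a cache cluster is that removing a node only remaps the keys that node owned; keys on the surviving nodes stay where they were, so most of the cache stays warm.
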
| # | System | Key Concepts | Scale Targets |
|---|---|---|---|
| 13 | Food Delivery | Regional dispatch, ML ETA, adaptive telemetry, surge pricing | 10M orders/day, DoorDash/UberEats |
| 14 | Online Banking | Double-entry ledger, ACID transactions, fraud detection, reconciliation | 100M accounts, Chase/Wells Fargo |
| 15 | Ad Serving | <100ms decisioning, frequency caps, first-price auction, pacing, privacy-first | 1M RPS, Google Ads |
| # | System | Key Concepts | Scale Targets |
|---|---|---|---|
| 16 | Video Conferencing | SFU architecture, WebRTC, simulcast, adaptive jitter buffer, server-side recording | 10K concurrent rooms, Zoom/Meet |
| 21 | Message Broker | Topics/partitions, ISR replication, consumer groups, leader election, exactly-once | 1M msg/sec, Kafka-scale |
| # | System | Key Concepts | Scale Targets |
|---|---|---|---|
| 22 | API Gateway | Rate limiting (token bucket), JWT auth, circuit breakers, PSP routing | 100K RPS, Kong/AWS Gateway |
| 24 | ML Model Inference | Dynamic batching, GPU optimization, A/B testing, feature store, canary deployment | 100K predictions/sec, TensorFlow Serving |
| 25 | Payment Gateway | PSP routing, fraud detection, PCI DSS (HSM tokenization), settlement | 10K TPS, Stripe/Adyen |
| 26 | Content Moderation | AI classifiers (BERT/ResNet), confidence routing, human review, CSAM detection | 100M items/day, Meta/YouTube |
| 29 | Proximity Service | Geohash/H3, Redis Geo, radius search, geofencing, real-time location updates | 100M places, 10M updates/sec, Yelp/Uber |
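
The API Gateway row mentions token-bucket rate limiting. A minimal single-process sketch follows; the class name `TokenBucket` and the injectable `clock` parameter are assumptions for illustration, and a real gateway would typically keep this state in a shared store rather than in memory.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec, holds at most `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if available, else False."""
        now = self.clock()
        # Refill lazily based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5.0, capacity=10.0)
    accepted = sum(bucket.allow() for _ in range(20))
    print(f"accepted {accepted} of 20 back-to-back requests")
```

The bucket permits bursts up to `capacity` while holding the long-run average to `rate`, which is the usual reason to prefer it over a fixed-window counter.
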
- Start with 3 of the intermediate systems (Chat, Social Feed, Hotel Reservation, Content Moderation)
- Master 3 advanced core systems (Stock Trading, Email Service, Video Streaming)
- Deep dive into 2 advanced specialized systems (Ride-Sharing, Ad Serving Platform)
Pick 3-5 systems to showcase:
- Full-Stack: Email Service (#9) + Social Feed (#4) + Content Moderation (#26)
- Backend/Infrastructure: Message Broker (#21) + API Gateway (#22) + Cache (#18)
- Fintech: Stock Trading (#8) + Payment Gateway (#25) + Banking (#14)
- Media/Ads: Live Streaming (#12) + Ad Serving (#15) + CDN (#7)
- Study: Browse the `case-studies/` folder and read the `README.md` files
- Generate: Use the `PROMPT.md` template with AI (Claude/GPT-4) to create new case studies
- Customize: Fork and adapt for your own projects
Backend: Python, Node.js, Go, Java | Databases: PostgreSQL, MongoDB, Redis, Cassandra
Queues: Kafka, RabbitMQ | Real-Time: WebSocket, WebRTC, gRPC
Infrastructure: Docker, Kubernetes, Terraform | Cloud: AWS, GCP, Azure
Monitoring: Prometheus, Grafana, ELK | AI/ML: TensorFlow Serving, PyTorch
✅ Job Interviews: Covers common FAANG system design questions
✅ Portfolio: Showcase architectural thinking to employers
✅ Learning: Structured path from beginner to advanced
✅ Reference: Real-world patterns for your own projects
Found an issue or want to add a case study? Pull requests welcome!
MIT License - Free to use for personal and commercial projects.
If this helped you land an interview or learn something new, give it a star!
Happy Learning! 🚀