Understanding Distributed Systems and System Design: A Beginner’s Guide
Table of contents
- Introduction
- What Is a Distributed System?
- Stateless Servers and Session Management
- Scaling Strategies: Replication vs. Sharding
- Avoiding Single Points of Failure
- Consistency Models: Strong vs. Eventual
- Leader-Based Replication (Raft)
- Achieving Eventual Consistency (CRDTs & Vector Clocks)
- Handling the Thundering Herd Problem
- Reducing Data Transfer: Compression & Caching
- Preventing Hotspots in Sharded Systems
- Conclusion
Introduction
Have you ever wondered how apps like Amazon, Netflix, or Instagram handle millions of users at the same time without crashing?
The secret lies in distributed systems, a collection of computers working together as one powerful system. But building such systems isn’t easy. You need smart strategies to scale, stay available, and keep data consistent.
In this article, we’ll break down key concepts in distributed systems and system design, using simple language and real-world analogies, so you can understand how large-scale applications really work.
Let’s dive in.
What Is a Distributed System?
A distributed system is a group of computers (nodes) connected over a network that work together to behave like a single system.
Example: When you search on Google, your request doesn’t go to just one server. It’s handled by thousands of machines across the world, all cooperating.
Why Use Distributed Systems?
- Scale: Handle more users and data than a single machine ever could.
- Availability: If one server fails, others keep working.
- Performance: Serve users from locations close to them (like CDNs).
But with great power comes great complexity, like network delays, data conflicts, and partial failures.
So we need smart design patterns to manage it all.
Stateless Servers and Session Management
Modern web apps often use stateless servers, meaning they don’t store user data (like login status) locally.
Why? Because if a server stores your session, and it crashes, you get logged out. Not good.
So where do we store session data?
Best Options: DynamoDB & ElastiCache
- DynamoDB: A fast, scalable NoSQL database. Stores session tokens safely and durably.
- ElastiCache (Redis): Keeps session data in memory, super fast for reads.
👉 Like keeping your ID card in a digital wallet instead of handing it to each clerk.
Both allow any server to look up your session, making your app scalable and highly available.
Scaling Strategies: Replication vs. Sharding
As your app grows, you need to scale your database. Two main ways:
Replication
- Copy the same data to multiple servers.
- Great for read-heavy apps.
- Example: News site, thousands read articles, few write them.
But: All replicas must process every write, so write performance doesn’t scale.
Sharding (Partitioning)
- Split data into pieces (shards) stored on different servers.
- Example: User IDs 1–1M → Shard A, 1M–2M → Shard B.
Now, each shard handles only part of the load, great for write-heavy workloads.
👉 Think of it like splitting a giant book into chapters, each stored in a different library.
Avoiding Single Points of Failure
A single point of failure (SPOF) is any component that, if it breaks, brings down the whole system.
Example: One database server. If it dies, the app stops.
Solution: Redundancy + Failover
- Run multiple copies of critical components.
- Use automatic failover, if Server A fails, traffic shifts to Server B.
Like having backup generators in a hospital.
Other helpful tools:
- Load balancers: Distribute traffic.
- Circuit breakers: Stop cascading failures (like Netflix’s Hystrix).
But the core idea is: no single component should be irreplaceable.
Consistency Models: Strong vs. Eventual
When multiple users access data, how fresh should it be?
Strong Consistency (e.g., Linearizability)
- Every read returns the most recent write.
- Feels intuitive, like a bank balance.
- But slow: requires coordination between nodes.
Used in banking systems, where accuracy is critical.
Eventual Consistency
- Updates eventually reach all nodes.
- Reads may return stale data temporarily.
- But super fast and highly available.
Like WhatsApp message delivery: you might see a message a second later than your friend, but eventually, everyone sees it.
Used in social media, DNS, and collaborative tools.
Leader-Based Replication (Raft)
How do distributed systems agree on what’s true?
One way: elect a leader.
Raft Consensus Algorithm
- One node becomes the leader.
- All writes go through the leader.
- Leader replicates changes to followers.
- If leader fails, new election happens.
Used in etcd, Kubernetes, and Consul.
Like a team with a team lead: everyone reports to them, and they make final decisions.
This ensures safe, ordered updates, a key part of many databases.
Achieving Eventual Consistency (CRDTs & Vector Clocks)
In systems without a leader (like peer-to-peer networks), how do we merge data safely?
CRDTs (Conflict-Free Replicated Data Types)
- Special data types that automatically resolve conflicts.
- Example: A “last-writer-wins” register or a “grow-only counter”.
No matter the order of updates, all nodes converge to the same result.
Used in real-time apps like Google Docs.
Vector Clocks
- Track causality between events across nodes.
- Helps determine: Did update A happen before B? Or were they simultaneous?
Used to detect conflicts during sync.
Together, CRDTs and vector clocks help systems stay available and eventually consistent, even when disconnected.
Handling the Thundering Herd Problem
Imagine a service goes down, then comes back.
Suddenly, thousands of clients retry at once, boom! Server gets overwhelmed.
This is the thundering herd problem.
Solutions:
- Caching: Serve repeated requests from cache, not backend.
- Request throttling: Limit how many requests a client can send per second.
Like a bouncer at a club: only lets a few people in at a time, even if everyone shows up together.
This smooths out traffic spikes and protects your system.
Reducing Data Transfer: Compression & Caching
Sending less data = faster apps + lower costs.
Data Compression (Gzip, Brotli)
- Shrinks HTML, JSON, JS before sending.
- Saves bandwidth, speeds up load time.
Like zipping a file before email.
Caching
- Store static assets (images, scripts) in browser or CDN.
- Avoid re-downloading the same data.
Like saving your favorite playlist offline, no need to stream every time.
Together, they reduce load on servers and improve user experience.
Preventing Hotspots in Sharded Systems
A hotspot is a shard that gets way more traffic than others, causing overload.
Example: All VIP users are in one shard → that shard crashes.
Solution: Smart Partitioning
- Consistent Hashing: Distributes keys evenly, minimizes reshuffling.
- Dynamic Partitioning: Automatically split busy shards.
Like spreading party guests evenly across rooms, not cramming everyone into the kitchen.
This keeps load balanced and prevents bottlenecks.
Conclusion
Building distributed systems is all about trade-offs:
- Speed vs. consistency
- Availability vs. accuracy
- Simplicity vs. scalability
We learned how:
- Sharding scales writes.
- Replication improves read performance.
- Eventual consistency enables high availability.
- CRDTs and Raft help manage data safely.
- Caching, compression, and throttling protect and optimize performance.
No single solution fits all, but by combining these patterns, you can build systems that are fast, reliable, and scalable.
Master these concepts, and you’re well on your way to designing the next big app.
See you on the next post.
Sincerely,
Eng. Adrian Beria