Understanding Distributed Systems and System Design: A Beginner’s Guide

August 5, 2025—

Introduction
What Is a Distributed System?
- Why Use Distributed Systems?
Stateless Servers and Session Management
- Best Options: DynamoDB & ElastiCache
Scaling Strategies: Replication vs. Sharding
- Replication
- Sharding (Partitioning)
Avoiding Single Points of Failure
- Solution: Redundancy + Failover
Consistency Models: Strong vs. Eventual
- Strong Consistency (e.g., Linearizability)
- Eventual Consistency
Leader-Based Replication (Raft)
- Raft Consensus Algorithm
Achieving Eventual Consistency (CRDTs & Vector Clocks)
- CRDTs (Conflict-Free Replicated Data Types)
- Vector Clocks
Handling the Thundering Herd Problem
- Solutions:
Reducing Data Transfer: Compression & Caching
- Data Compression (Gzip, Brotli)
- Caching
Preventing Hotspots in Sharded Systems
- Solution: Smart Partitioning
Conclusion

Introduction

Have you ever wondered how apps like Amazon, Netflix, or Instagram handle millions of users at the same time without crashing?

The secret lies in distributed systems, a collection of computers working together as one powerful system. But building such systems isn’t easy. You need smart strategies to scale, stay available, and keep data consistent.

In this article, we’ll break down key concepts in distributed systems and system design, using simple language and real-world analogies, so you can understand how large-scale applications really work.

Let’s dive in.

What Is a Distributed System?

A distributed system is a group of computers (nodes) connected over a network that work together to behave like a single system.

Example: When you search on Google, your request doesn’t go to just one server. It’s handled by thousands of machines across the world, all cooperating.

Why Use Distributed Systems?

Scale: Handle more users and data than a single machine ever could.
Availability: If one server fails, others keep working.
Performance: Serve users from locations close to them (like CDNs).

But with great power comes great complexity, like network delays, data conflicts, and partial failures.

So we need smart design patterns to manage it all.

Stateless Servers and Session Management

Modern web apps often use stateless servers, meaning they don’t store user data (like login status) locally.

Why? Because if a server stores your session, and it crashes, you get logged out. Not good.

So where do we store session data?

Best Options: DynamoDB & ElastiCache

DynamoDB: A fast, scalable NoSQL database. Stores session tokens safely and durably.
ElastiCache (Redis): Keeps session data in memory, super fast for reads.

👉 Like keeping your ID card in a digital wallet instead of handing it to each clerk.

Both allow any server to look up your session, making your app scalable and highly available.

Scaling Strategies: Replication vs. Sharding

As your app grows, you need to scale your database. Two main ways:

Replication

Copy the same data to multiple servers.
Great for read-heavy apps.
Example: News site, thousands read articles, few write them.

But: All replicas must process every write, so write performance doesn’t scale.

Sharding (Partitioning)

Split data into pieces (shards) stored on different servers.
Example: User IDs 1–1M → Shard A, 1M–2M → Shard B.

Now, each shard handles only part of the load, great for write-heavy workloads.

👉 Think of it like splitting a giant book into chapters, each stored in a different library.

Avoiding Single Points of Failure

A single point of failure (SPOF) is any component that, if it breaks, brings down the whole system.

Example: One database server. If it dies, the app stops.

Solution: Redundancy + Failover

Run multiple copies of critical components.
Use automatic failover, if Server A fails, traffic shifts to Server B.

Like having backup generators in a hospital.

Other helpful tools:

Load balancers: Distribute traffic.
Circuit breakers: Stop cascading failures (like Netflix’s Hystrix).

But the core idea is: no single component should be irreplaceable.

Consistency Models: Strong vs. Eventual

When multiple users access data, how fresh should it be?

Strong Consistency (e.g., Linearizability)

Every read returns the most recent write.
Feels intuitive, like a bank balance.
But slow: requires coordination between nodes.

Used in banking systems, where accuracy is critical.

Eventual Consistency

Updates eventually reach all nodes.
Reads may return stale data temporarily.
But super fast and highly available.

Like WhatsApp message delivery: you might see a message a second later than your friend, but eventually, everyone sees it.

Used in social media, DNS, and collaborative tools.

Leader-Based Replication (Raft)

How do distributed systems agree on what’s true?

One way: elect a leader.

Raft Consensus Algorithm

One node becomes the leader.
All writes go through the leader.
Leader replicates changes to followers.
If leader fails, new election happens.

Used in etcd, Kubernetes, and Consul.

Like a team with a team lead: everyone reports to them, and they make final decisions.

This ensures safe, ordered updates, a key part of many databases.

Achieving Eventual Consistency (CRDTs & Vector Clocks)

In systems without a leader (like peer-to-peer networks), how do we merge data safely?

CRDTs (Conflict-Free Replicated Data Types)

Special data types that automatically resolve conflicts.
Example: A “last-writer-wins” register or a “grow-only counter”.

No matter the order of updates, all nodes converge to the same result.

Used in real-time apps like Google Docs.

Vector Clocks

Track causality between events across nodes.
Helps determine: Did update A happen before B? Or were they simultaneous?

Used to detect conflicts during sync.

Together, CRDTs and vector clocks help systems stay available and eventually consistent, even when disconnected.

Handling the Thundering Herd Problem

Imagine a service goes down, then comes back.

Suddenly, thousands of clients retry at once, boom! Server gets overwhelmed.

This is the thundering herd problem.

Solutions:

Caching: Serve repeated requests from cache, not backend.
Request throttling: Limit how many requests a client can send per second.

Like a bouncer at a club: only lets a few people in at a time, even if everyone shows up together.

This smooths out traffic spikes and protects your system.

Reducing Data Transfer: Compression & Caching

Sending less data = faster apps + lower costs.

Data Compression (Gzip, Brotli)

Shrinks HTML, JSON, JS before sending.
Saves bandwidth, speeds up load time.

Like zipping a file before email.

Caching

Store static assets (images, scripts) in browser or CDN.
Avoid re-downloading the same data.

Like saving your favorite playlist offline, no need to stream every time.

Together, they reduce load on servers and improve user experience.

Preventing Hotspots in Sharded Systems

A hotspot is a shard that gets way more traffic than others, causing overload.

Example: All VIP users are in one shard → that shard crashes.

Solution: Smart Partitioning

Consistent Hashing: Distributes keys evenly, minimizes reshuffling.
Dynamic Partitioning: Automatically split busy shards.

Like spreading party guests evenly across rooms, not cramming everyone into the kitchen.

This keeps load balanced and prevents bottlenecks.

Conclusion

Building distributed systems is all about trade-offs:

Speed vs. consistency
Availability vs. accuracy
Simplicity vs. scalability

We learned how:

Sharding scales writes.
Replication improves read performance.
Eventual consistency enables high availability.
CRDTs and Raft help manage data safely.
Caching, compression, and throttling protect and optimize performance.

No single solution fits all, but by combining these patterns, you can build systems that are fast, reliable, and scalable.

Master these concepts, and you’re well on your way to designing the next big app.

See you on the next post.

Sincerely,
Eng. Adrian Beria