Distributed Systems Consensus
Understand consensus algorithms that keep distributed systems consistent. Covers Raft, Paxos, leader election, log replication, split-brain prevention, and the CAP trade-offs that shape distributed system design.
Consensus is the problem of getting multiple nodes to agree on a value. It sounds simple, but in the presence of network partitions, node failures, and message delays, it is one of the hardest problems in computer science. Every distributed database, message queue, and coordination service relies on consensus algorithms.
The Consensus Problem
Nodes: A, B, C (any two form a majority)
Normal operation:
Client → Node A: "Set x = 5"
Node A → Node B, C: "Agree to set x = 5?"
Node B, C: "Agreed"
Node A → Client: "Done, x = 5"
With partition:
[Node A, Node B] | [Node C]
Node A + B = majority (2/3), can still make progress
Node C alone = minority, cannot make progress
Consistency maintained!
Without consensus:
Both sides accept writes → split-brain → data loss
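The majority rule above fits in a few lines of code. Here is a minimal Go sketch (all types and names are illustrative, not any real system's API) of a write that commits only when a quorum of nodes acknowledges it:

```go
package main

import "fmt"

// node simulates a cluster member that may be unreachable,
// e.g. on the wrong side of a partition.
type node struct {
	name      string
	reachable bool
}

// replicateWrite asks every node to acknowledge a write and commits
// only if a majority of the FULL cluster (not just reachable nodes)
// acknowledges it.
func replicateWrite(nodes []node, cmd string) bool {
	acks := 0
	for _, n := range nodes {
		if n.reachable { // a reachable node acknowledges the write
			acks++
		}
	}
	quorum := len(nodes)/2 + 1
	if acks >= quorum {
		fmt.Printf("commit %q: %d/%d acks (quorum %d)\n", cmd, acks, len(nodes), quorum)
		return true
	}
	fmt.Printf("reject %q: %d/%d acks (quorum %d)\n", cmd, acks, len(nodes), quorum)
	return false
}

func main() {
	// Partition [A, B] | [C]: A's side still holds a majority.
	majority := []node{{"A", true}, {"B", true}, {"C", false}}
	replicateWrite(majority, "set x = 5") // commits with 2/3 acks
	// C alone cannot reach quorum, so it must reject writes.
	minority := []node{{"C", true}, {"A", false}, {"B", false}}
	replicateWrite(minority, "set x = 9") // rejected with 1/3 acks
}
```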
CAP Theorem and Real-World Trade-offs
CAP Theorem: Pick 2 of 3
C: Consistency
A: Availability
P: Partition tolerance
In practice, P is mandatory (networks partition).
So the choice is: C vs A during partitions.
CP systems (choose consistency):
etcd, ZooKeeper, CockroachDB
During partition: minority side rejects writes
AP systems (choose availability):
Cassandra, DynamoDB, CouchDB
During partition: both sides accept writes, reconcile later
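The difference between the two write paths can be sketched as follows; the function names are made up for illustration and do not correspond to any real database's API:

```go
package main

import (
	"errors"
	"fmt"
)

var errNoQuorum = errors.New("no quorum: rejecting write to stay consistent")

// cpWrite refuses writes unless a majority is reachable (etcd-style).
func cpWrite(reachable, total int, cmd string) error {
	if reachable < total/2+1 {
		return errNoQuorum
	}
	fmt.Println("CP committed:", cmd)
	return nil
}

// apWrite always accepts and records the write for later
// reconciliation (Dynamo-style, e.g. via vector clocks or
// last-writer-wins).
func apWrite(pending *[]string, cmd string) {
	*pending = append(*pending, cmd) // reconcile when the partition heals
	fmt.Println("AP accepted (may conflict):", cmd)
}

func main() {
	// Minority side of a 5-node cluster: only 2 nodes reachable.
	fmt.Println(cpWrite(2, 5, "set x = 5")) // rejected
	var pending []string
	apWrite(&pending, "set x = 5") // accepted, reconciled later
}
```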
Raft Consensus
Raft is the most widely used consensus algorithm (etcd, CockroachDB, TiKV):
Leader Election
1. All nodes start as FOLLOWERS
2. If no heartbeat from the leader within a randomized election timeout → become CANDIDATE
3. CANDIDATE increments its term, votes for itself, and requests votes from all other nodes
4. If majority votes received → become LEADER
5. LEADER sends heartbeats to prevent new elections
Timeline:
T=0: Nodes A, B, C all FOLLOWERS
T=3s: Node B timeout (no leader heartbeat)
T=3s: Node B → CANDIDATE, requests votes
T=3.1s: Nodes A and C vote for B
T=3.1s: Node B → LEADER (own vote + A + C = 3/3; any 2/3 is a majority)
T=3.2s: Node B starts sending heartbeats
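As a rough illustration, here is the follower → candidate → leader transition in Go. The 150-300 ms timeout range is the one suggested in the Raft paper (it assumes low network latency), and the vote counting is simulated rather than driven by real RPCs:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type state int

const (
	follower state = iota
	candidate
	leader
)

// electionTimeout returns a randomized timeout; randomization is how
// Raft avoids repeated split votes, since followers rarely time out
// at the same instant.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// runElection simulates one candidate tallying votes. In a real
// implementation votesGranted would come from RequestVote RPCs.
func runElection(clusterSize, votesGranted int) state {
	votes := 1 + votesGranted // the candidate votes for itself
	if votes >= clusterSize/2+1 {
		return leader
	}
	return follower // lost or split vote: wait for the next timeout
}

func main() {
	fmt.Println("timeout:", electionTimeout())
	// Node B in a 3-node cluster receives votes from A and C.
	if runElection(3, 2) == leader {
		fmt.Println("B is leader; begins sending heartbeats")
	}
}
```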
Log Replication
Client: "Set x = 5"
↓
Leader (Node B):
1. Append to local log: [index=3, term=1, cmd="set x=5"]
2. Send AppendEntries to followers
↓
Followers (Nodes A, C):
3. Append to local log
4. Reply "success" to leader
↓
Leader:
5. Majority replied (2/3) → COMMIT entry
6. Apply to state machine (x = 5)
7. Reply to client: "success"
8. Advertise the new commit index in the next AppendEntries/heartbeat so followers apply the entry too
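The commit rule in steps 5-6 (advance the commit index once a majority stores an entry) can be sketched in Go. This is a deliberately simplified model of the idea, omitting term checks, log-matching, and retries that real Raft requires:

```go
package main

import "fmt"

// entry mirrors the fields in the walkthrough above.
type entry struct {
	index int
	term  int
	cmd   string
}

type leaderNode struct {
	log         []entry
	commitIndex int
	matchIndex  map[string]int // highest index known replicated per follower
}

// onAppendReply records a follower's ack and advances commitIndex
// once a majority (counting the leader itself) stores the entry.
func (l *leaderNode) onAppendReply(follower string, index, clusterSize int) {
	l.matchIndex[follower] = index
	for n := l.commitIndex + 1; n <= len(l.log); n++ {
		replicas := 1 // the leader's own copy
		for _, m := range l.matchIndex {
			if m >= n {
				replicas++
			}
		}
		if replicas >= clusterSize/2+1 {
			l.commitIndex = n
			fmt.Printf("committed %+v\n", l.log[n-1])
		}
	}
}

func main() {
	l := &leaderNode{matchIndex: map[string]int{}}
	l.log = append(l.log, entry{index: 1, term: 1, cmd: "set x = 5"})
	// First follower ack gives 2/3 replicas, so the entry commits.
	l.onAppendReply("A", 1, 3)
}
```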
Paxos
Proposer → Acceptors (majority) → Learners
Phase 1 (Prepare):
Proposer: "Prepare: proposal number N" (no value is sent yet)
Acceptors: "Promise: I will not accept any proposal numbered < N" (plus the highest-numbered value each has already accepted, if any)
Phase 2 (Accept):
Proposer: "Accept value V with proposal number N", where V is the highest-numbered value reported in Phase 1 (or the proposer's own value if none was reported)
Acceptors (majority): "Accepted"
→ Value V is chosen
Paxos is correct but hard to implement.
Raft was designed as an "understandable Paxos."
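Even so, the heart of a single-decree Paxos acceptor is small. A minimal Go sketch of its two promises follows (names are illustrative, and a real acceptor must also persist its state to disk before replying):

```go
package main

import "fmt"

// acceptor implements the two invariants of single-decree Paxos:
// never accept a proposal numbered below a promise, and report any
// previously accepted value so proposers adopt it instead of their own.
type acceptor struct {
	promised int    // highest proposal number promised
	accepted int    // number of the accepted proposal (0 = none)
	value    string // value of the accepted proposal
}

// prepare handles Phase 1: promise to reject anything numbered < n,
// and return the previously accepted (number, value), if any.
func (a *acceptor) prepare(n int) (ok bool, prevN int, prevV string) {
	if n <= a.promised {
		return false, 0, ""
	}
	a.promised = n
	return true, a.accepted, a.value
}

// accept handles Phase 2: accept (n, v) unless a higher-numbered
// prepare has been promised in the meantime.
func (a *acceptor) accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised, a.accepted, a.value = n, n, v
	return true
}

func main() {
	a := &acceptor{}
	if ok, _, _ := a.prepare(1); ok {
		fmt.Println("phase 2 accepted:", a.accept(1, "x=5")) // true
	}
	ok, _, _ := a.prepare(1) // a stale prepare with the same number is refused
	fmt.Println("stale prepare accepted:", ok) // false
}
```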
Split-Brain Prevention
3-node cluster:
Quorum = ⌊3/2⌋ + 1 = 2
5-node cluster:
Quorum = ⌊5/2⌋ + 1 = 3
Rule: Only the side holding a majority (a quorum) can make decisions.
Partition: [A, B] | [C, D, E]
[A, B] = 2 nodes, cannot form quorum of 3 → REJECT writes
[C, D, E] = 3 nodes, forms quorum → ACCEPT writes
No split-brain: only one partition is active
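The quorum rule is a one-line function. This Go sketch shows why at most one side of a partition can stay active:

```go
package main

import "fmt"

// quorum returns the minimum number of nodes that must agree:
// floor(n/2) + 1, as in the examples above.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// canDecide reports whether one side of a partition may keep
// accepting writes.
func canDecide(partitionSize, clusterSize int) bool {
	return partitionSize >= quorum(clusterSize)
}

func main() {
	// Partition of a 5-node cluster: [A, B] | [C, D, E]
	fmt.Println(canDecide(2, 5)) // false: [A, B] rejects writes
	fmt.Println(canDecide(3, 5)) // true:  [C, D, E] accepts writes
	// Two disjoint sides can never both reach quorum(clusterSize),
	// so split-brain is impossible by construction.
}
```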
Practical Applications
| System | Algorithm | Use Case |
|---|---|---|
| etcd | Raft | Kubernetes state store |
| ZooKeeper | ZAB (Paxos-like atomic broadcast) | Distributed coordination |
| CockroachDB | Raft | Distributed SQL |
| TiKV | Raft | Distributed key-value |
| Consul | Raft | Service discovery |
| Kafka | KRaft (Raft-based) | Metadata management |
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Even number of nodes (2, 4) | A 50/50 split leaves no majority on either side; 4 nodes tolerate no more failures than 3 | Use odd sizes (3, 5, 7) |
| Cluster too large (> 7 nodes) | Consensus latency increases | 3 or 5 nodes for consensus, more for read replicas |
| Ignoring network latency | Spurious elections or slow failure detection | Measure latency, set timeouts accordingly (see the sketch below) |
| No monitoring of leader changes | Frequent elections indicate problems | Alert on leader election frequency |
| Rolling updates without quorum safety | Temporary quorum loss | Update one node at a time |
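On the timeout anti-pattern: election timeouts must comfortably exceed both the heartbeat interval and the measured round-trip time, or followers start elections against a healthy leader. The 10x multipliers below are illustrative rules of thumb, not values from any specific system:

```go
package main

import (
	"fmt"
	"time"
)

// suggestTimeouts derives heartbeat and election timeouts from a
// measured RTT. The ordering RTT << heartbeat << election timeout
// follows the Raft paper's timing requirement; the exact factors
// here are assumptions for illustration.
func suggestTimeouts(measuredRTT time.Duration) (heartbeat, electionMin time.Duration) {
	heartbeat = 10 * measuredRTT // heartbeat interval well above the RTT
	electionMin = 10 * heartbeat // election timeout well above the heartbeat
	return heartbeat, electionMin
}

func main() {
	hb, el := suggestTimeouts(5 * time.Millisecond) // typical same-region RTT
	fmt.Println("heartbeat:", hb, "election timeout >=", el)
}
```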
Consensus algorithms are the foundation of distributed systems reliability. They let systems built on unreliable networks and failing nodes behave correctly, as long as a majority of nodes remains available.