Distributed Systems Consensus
Understand consensus algorithms that keep distributed systems consistent. Covers Raft, Paxos, leader election, log replication, split-brain prevention, and the CAP trade-offs that shape distributed system design.
Consensus is the problem of getting multiple nodes to agree on a value. It sounds simple, but in the presence of network partitions, node failures, and message delays, it is one of the hardest problems in computer science. Every distributed database, message queue, and coordination service relies on consensus algorithms.
The Consensus Problem
Nodes: A, B, C (any two form a majority)
Normal operation:
Client → Node A: "Set x = 5"
Node A → Node B, C: "Agree to set x = 5?"
Node B, C: "Agreed"
Node A → Client: "Done, x = 5"
With partition:
[Node A, Node B] | [Node C]
Node A + B = majority (2/3), can still make progress
Node C alone = minority, cannot make progress
Consistency maintained!
Without consensus:
Both sides accept writes → split-brain → data loss
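The majority rule above fits in a few lines of code. Here is a minimal Go sketch (all types and names are illustrative, not any real system's API) of a write that commits only when a quorum of nodes acknowledges it:

```go
package main

import "fmt"

// node simulates a cluster member that may be unreachable,
// e.g. on the wrong side of a partition.
type node struct {
	name      string
	reachable bool
}

// replicateWrite asks every node to acknowledge a write and commits
// only if a majority of the FULL cluster (not just reachable nodes)
// acknowledges it.
func replicateWrite(nodes []node, cmd string) bool {
	acks := 0
	for _, n := range nodes {
		if n.reachable { // a reachable node acknowledges the write
			acks++
		}
	}
	quorum := len(nodes)/2 + 1
	if acks >= quorum {
		fmt.Printf("commit %q: %d/%d acks (quorum %d)\n", cmd, acks, len(nodes), quorum)
		return true
	}
	fmt.Printf("reject %q: %d/%d acks (quorum %d)\n", cmd, acks, len(nodes), quorum)
	return false
}

func main() {
	// Partition [A, B] | [C]: A's side still holds a majority.
	majority := []node{{"A", true}, {"B", true}, {"C", false}}
	replicateWrite(majority, "set x = 5") // commits with 2/3 acks
	// C alone cannot reach quorum, so it must reject writes.
	minority := []node{{"C", true}, {"A", false}, {"B", false}}
	replicateWrite(minority, "set x = 9") // rejected with 1/3 acks
}
```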
CAP Theorem and Real-World Trade-offs
CAP Theorem: Pick 2 of 3
C: Consistency
A: Availability
P: Partition tolerance
In practice, P is mandatory (networks partition).
So the choice is: C vs A during partitions.
CP systems (choose consistency):
etcd, ZooKeeper, CockroachDB
During partition: minority side rejects writes
AP systems (choose availability):
Cassandra, DynamoDB, CouchDB
During partition: both sides accept writes, reconcile later
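The difference between the two write paths can be sketched as follows; the function names are made up for illustration and do not correspond to any real database's API:

```go
package main

import (
	"errors"
	"fmt"
)

var errNoQuorum = errors.New("no quorum: rejecting write to stay consistent")

// cpWrite refuses writes unless a majority is reachable (etcd-style).
func cpWrite(reachable, total int, cmd string) error {
	if reachable < total/2+1 {
		return errNoQuorum
	}
	fmt.Println("CP committed:", cmd)
	return nil
}

// apWrite always accepts and records the write for later
// reconciliation (Dynamo-style, e.g. via vector clocks or
// last-writer-wins).
func apWrite(pending *[]string, cmd string) {
	*pending = append(*pending, cmd) // reconcile when the partition heals
	fmt.Println("AP accepted (may conflict):", cmd)
}

func main() {
	// Minority side of a 5-node cluster: only 2 nodes reachable.
	fmt.Println(cpWrite(2, 5, "set x = 5")) // rejected
	var pending []string
	apWrite(&pending, "set x = 5") // accepted, reconciled later
}
```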
Raft Consensus
Raft is the most widely used consensus algorithm (etcd, CockroachDB, TiKV):
Leader Election
1. All nodes start as FOLLOWERS
2. If no heartbeat from the leader within a randomized election timeout → become CANDIDATE
3. CANDIDATE increments its term, votes for itself, and requests votes from all other nodes
4. If majority votes received → become LEADER
5. LEADER sends heartbeats to prevent new elections
Timeline:
T=0: Nodes A, B, C all FOLLOWERS
T=3s: Node B timeout (no leader heartbeat)
T=3s: Node B → CANDIDATE, requests votes
T=3.1s: Nodes A and C vote for B
T=3.1s: Node B → LEADER (own vote + A + C = 3/3; any 2/3 is a majority)
T=3.2s: Node B starts sending heartbeats
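As a rough illustration, here is the follower → candidate → leader transition in Go. The 150-300 ms timeout range is the one suggested in the Raft paper (it assumes low network latency), and the vote counting is simulated rather than driven by real RPCs:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type state int

const (
	follower state = iota
	candidate
	leader
)

// electionTimeout returns a randomized timeout; randomization is how
// Raft avoids repeated split votes, since followers rarely time out
// at the same instant.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// runElection simulates one candidate tallying votes. In a real
// implementation votesGranted would come from RequestVote RPCs.
func runElection(clusterSize, votesGranted int) state {
	votes := 1 + votesGranted // the candidate votes for itself
	if votes >= clusterSize/2+1 {
		return leader
	}
	return follower // lost or split vote: wait for the next timeout
}

func main() {
	fmt.Println("timeout:", electionTimeout())
	// Node B in a 3-node cluster receives votes from A and C.
	if runElection(3, 2) == leader {
		fmt.Println("B is leader; begins sending heartbeats")
	}
}
```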
Log Replication
Client: "Set x = 5"
↓
Leader (Node B):
1. Append to local log: [index=3, term=1, cmd="set x=5"]
2. Send AppendEntries to followers
↓
Followers (Nodes A, C):
3. Append to local log
4. Reply "success" to leader
↓
Leader:
5. Majority replied (2/3) → COMMIT entry
6. Apply to state machine (x = 5)
7. Reply to client: "success"
8. Advertise the new commit index in the next AppendEntries/heartbeat so followers apply the entry too
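The commit rule in steps 5-6 (advance the commit index once a majority stores an entry) can be sketched in Go. This is a deliberately simplified model of the idea, omitting term checks, log-matching, and retries that real Raft requires:

```go
package main

import "fmt"

// entry mirrors the fields in the walkthrough above.
type entry struct {
	index int
	term  int
	cmd   string
}

type leaderNode struct {
	log         []entry
	commitIndex int
	matchIndex  map[string]int // highest index known replicated per follower
}

// onAppendReply records a follower's ack and advances commitIndex
// once a majority (counting the leader itself) stores the entry.
func (l *leaderNode) onAppendReply(follower string, index, clusterSize int) {
	l.matchIndex[follower] = index
	for n := l.commitIndex + 1; n <= len(l.log); n++ {
		replicas := 1 // the leader's own copy
		for _, m := range l.matchIndex {
			if m >= n {
				replicas++
			}
		}
		if replicas >= clusterSize/2+1 {
			l.commitIndex = n
			fmt.Printf("committed %+v\n", l.log[n-1])
		}
	}
}

func main() {
	l := &leaderNode{matchIndex: map[string]int{}}
	l.log = append(l.log, entry{index: 1, term: 1, cmd: "set x = 5"})
	// First follower ack gives 2/3 replicas, so the entry commits.
	l.onAppendReply("A", 1, 3)
}
```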
Paxos
Proposer → Acceptors (majority) → Learners
Phase 1 (Prepare):
Proposer: "Prepare: proposal number N" (no value is sent yet)
Acceptors: "Promise: I will not accept any proposal numbered < N" (plus the highest-numbered value each has already accepted, if any)
Phase 2 (Accept):
Proposer: "Accept value V with proposal number N", where V is the highest-numbered value reported in Phase 1 (or the proposer's own value if none was reported)
Acceptors (majority): "Accepted"
→ Value V is chosen
Paxos is correct but hard to implement.
Raft was designed as an "understandable Paxos."
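Even so, the heart of a single-decree Paxos acceptor is small. A minimal Go sketch of its two promises follows (names are illustrative, and a real acceptor must also persist its state to disk before replying):

```go
package main

import "fmt"

// acceptor implements the two invariants of single-decree Paxos:
// never accept a proposal numbered below a promise, and report any
// previously accepted value so proposers adopt it instead of their own.
type acceptor struct {
	promised int    // highest proposal number promised
	accepted int    // number of the accepted proposal (0 = none)
	value    string // value of the accepted proposal
}

// prepare handles Phase 1: promise to reject anything numbered < n,
// and return the previously accepted (number, value), if any.
func (a *acceptor) prepare(n int) (ok bool, prevN int, prevV string) {
	if n <= a.promised {
		return false, 0, ""
	}
	a.promised = n
	return true, a.accepted, a.value
}

// accept handles Phase 2: accept (n, v) unless a higher-numbered
// prepare has been promised in the meantime.
func (a *acceptor) accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised, a.accepted, a.value = n, n, v
	return true
}

func main() {
	a := &acceptor{}
	if ok, _, _ := a.prepare(1); ok {
		fmt.Println("phase 2 accepted:", a.accept(1, "x=5")) // true
	}
	ok, _, _ := a.prepare(1) // a stale prepare with the same number is refused
	fmt.Println("stale prepare accepted:", ok) // false
}
```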
Split-Brain Prevention
3-node cluster:
Quorum = ⌊3/2⌋ + 1 = 2
5-node cluster:
Quorum = ⌊5/2⌋ + 1 = 3
Rule: Only the side holding a majority (a quorum) can make decisions.
Partition: [A, B] | [C, D, E]
[A, B] = 2 nodes, cannot form quorum of 3 → REJECT writes
[C, D, E] = 3 nodes, forms quorum → ACCEPT writes
No split-brain: only one partition is active
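The quorum rule is a one-line function. This Go sketch shows why at most one side of a partition can stay active:

```go
package main

import "fmt"

// quorum returns the minimum number of nodes that must agree:
// floor(n/2) + 1, as in the examples above.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// canDecide reports whether one side of a partition may keep
// accepting writes.
func canDecide(partitionSize, clusterSize int) bool {
	return partitionSize >= quorum(clusterSize)
}

func main() {
	// Partition of a 5-node cluster: [A, B] | [C, D, E]
	fmt.Println(canDecide(2, 5)) // false: [A, B] rejects writes
	fmt.Println(canDecide(3, 5)) // true:  [C, D, E] accepts writes
	// Two disjoint sides can never both reach quorum(clusterSize),
	// so split-brain is impossible by construction.
}
```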
Practical Applications
| System | Algorithm | Use Case |
|---|---|---|
| etcd | Raft | Kubernetes state store |
| ZooKeeper | ZAB (Paxos-like atomic broadcast) | Distributed coordination |
| CockroachDB | Raft | Distributed SQL |
| TiKV | Raft | Distributed key-value |
| Consul | Raft | Service discovery |
| Kafka | KRaft (Raft-based) | Metadata management |
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Even number of nodes (2, 4) | A 50/50 split leaves no majority on either side; 4 nodes tolerate no more failures than 3 | Use odd sizes (3, 5, 7) |
| Cluster too large (> 7 nodes) | Consensus latency increases | 3 or 5 nodes for consensus, more for read replicas |
| Ignoring network latency | Spurious elections or slow failure detection | Measure latency, set timeouts accordingly (see the sketch below) |
| No monitoring of leader changes | Frequent elections indicate problems | Alert on leader election frequency |
| Rolling updates without quorum safety | Temporary quorum loss | Update one node at a time |
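On the timeout anti-pattern: election timeouts must comfortably exceed both the heartbeat interval and the measured round-trip time, or followers start elections against a healthy leader. The 10x multipliers below are illustrative rules of thumb, not values from any specific system:

```go
package main

import (
	"fmt"
	"time"
)

// suggestTimeouts derives heartbeat and election timeouts from a
// measured RTT. The ordering RTT << heartbeat << election timeout
// follows the Raft paper's timing requirement; the exact factors
// here are assumptions for illustration.
func suggestTimeouts(measuredRTT time.Duration) (heartbeat, electionMin time.Duration) {
	heartbeat = 10 * measuredRTT // heartbeat interval well above the RTT
	electionMin = 10 * heartbeat // election timeout well above the heartbeat
	return heartbeat, electionMin
}

func main() {
	hb, el := suggestTimeouts(5 * time.Millisecond) // typical same-region RTT
	fmt.Println("heartbeat:", hb, "election timeout >=", el)
}
```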
Consensus algorithms are the foundation of distributed systems reliability. They let systems built on unreliable networks and failing nodes behave correctly, as long as a majority of nodes remains available.