
Distributed Systems Consensus

Understand consensus algorithms that keep distributed systems consistent. Covers Raft, Paxos, leader election, log replication, split-brain prevention, and the fundamental impossibility results that shape distributed system design.

Consensus is the problem of getting multiple nodes to agree on a value. It sounds simple, but in the presence of network partitions, node failures, and message delays, it is one of the hardest problems in computer science. Every distributed database, message queue, and coordination service relies on consensus algorithms.


The Consensus Problem

Nodes: A, B, C (any two form a majority)

Normal operation:
  Client → Node A: "Set x = 5"
  Node A → Node B, C: "Agree to set x = 5?"
  Node B, C: "Agreed"
  Node A → Client: "Done, x = 5"

With partition:
  [Node A, Node B] | [Node C]
  
  Node A + B = majority (2/3), can still make progress
  Node C alone = minority, cannot make progress
  Consistency maintained!

Without consensus:
  Both sides accept writes → split-brain → data loss
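
The majority rule can be made concrete in a few lines. Below is a minimal sketch in Go (the names quorum and tryCommit are invented for illustration, not any real library's API) of a leader that commits a write only once a majority of the cluster has acknowledged it:

package main

import "fmt"

// quorum returns the minimum number of acknowledgments that
// constitutes a majority in a cluster of n nodes.
func quorum(n int) int { return n/2 + 1 }

// tryCommit simulates one write round: the leader counts its own
// acknowledgment plus those of the followers that responded, and
// commits only if the total reaches a majority.
func tryCommit(clusterSize int, followerAcks []bool) bool {
    votes := 1 // the leader acknowledges its own write
    for _, acked := range followerAcks {
        if acked {
            votes++
        }
    }
    return votes >= quorum(clusterSize)
}

func main() {
    fmt.Println(tryCommit(3, []bool{true, true}))   // true: all three nodes have the write
    fmt.Println(tryCommit(3, []bool{true, false}))  // true: the A+B side of the partition (2/3)
    fmt.Println(tryCommit(3, []bool{false, false})) // false: C alone stalls instead of diverging
}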

CAP Theorem and Real-World Trade-offs

CAP Theorem: Pick 2 of 3
  C: Consistency
  A: Availability
  P: Partition tolerance

In practice, P is mandatory (networks do partition),
so the real choice is C vs. A during a partition.

CP systems (choose consistency):
  etcd, ZooKeeper, CockroachDB
  During a partition: the minority side rejects writes

AP systems (choose availability):
  Cassandra, DynamoDB (eventually consistent reads), CouchDB
  During a partition: both sides accept writes and reconcile later
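
Many AP stores let clients tune this trade-off per request through read and write quorum sizes. Here is a hedged sketch of the standard R + W > N overlap rule (the function name is invented; the rule itself is what Cassandra-style consistency levels such as ONE and QUORUM implement):

package main

import "fmt"

// readsSeeLatestWrite reports whether a read quorum of size r is
// guaranteed to overlap a write quorum of size w across n replicas:
// the classic R + W > N rule.
func readsSeeLatestWrite(n, r, w int) bool {
    return r+w > n
}

func main() {
    fmt.Println(readsSeeLatestWrite(3, 2, 2)) // true: quorum reads + quorum writes overlap
    fmt.Println(readsSeeLatestWrite(3, 1, 1)) // false: a read may miss the latest write
}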

Raft Consensus

Raft is the most widely used consensus algorithm (etcd, CockroachDB, TiKV):

Leader Election

1. All nodes start as FOLLOWERS
2. If no heartbeat from leader after timeout → become CANDIDATE
3. CANDIDATE requests votes from all nodes
4. If majority votes received → become LEADER
5. LEADER sends heartbeats to prevent new elections

Timeline:
  T=0:  Nodes A, B, C all FOLLOWERS
  T=3s: Node B times out (no heartbeat from the leader)
  T=3s: Node B → CANDIDATE, increments its term, votes for itself, requests votes
  T=3.1s: Nodes A and C grant their votes
  T=3.1s: Node B → LEADER (3/3 votes; 2/3 was already a majority)
  T=3.2s: Node B starts sending heartbeats
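
The candidate's side of this flow can be sketched compactly. The model below is deliberately simplified (real Raft also refuses to grant a vote to a candidate whose log is less up to date than the voter's, and each node votes at most once per term):

package main

import (
    "fmt"
    "math/rand"
    "time"
)

type Role int

const (
    Follower Role = iota
    Candidate
    Leader
)

// electionTimeout returns a randomized timeout. Randomization makes it
// unlikely that two followers time out at the same instant, which is
// how Raft avoids repeated split votes.
func electionTimeout() time.Duration {
    return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// runElection simulates one election from a candidate's point of view:
// it increments its term, votes for itself, and wins on a majority.
// peerGrants[i] is true if peer i granted its vote.
func runElection(term int, peerGrants []bool) (Role, int) {
    term++     // becoming a candidate increments the term
    votes := 1 // the candidate votes for itself
    for _, granted := range peerGrants {
        if granted {
            votes++
        }
    }
    clusterSize := len(peerGrants) + 1
    if votes >= clusterSize/2+1 {
        return Leader, term
    }
    return Follower, term // lost or split vote: back off and retry
}

func main() {
    role, term := runElection(0, []bool{true, true}) // A and C grant votes
    fmt.Println(role == Leader, term)                // true 1
    fmt.Println(electionTimeout())                   // e.g. 217ms
}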

Log Replication

Client: "Set x = 5"

Leader (Node B):
  1. Append to local log: [index=3, term=1, cmd="set x=5"]
  2. Send AppendEntries to followers

Followers (Nodes A, C):
  3. Append to local log
  4. Reply "success" to leader

Leader:
  5. Entry is now on a majority of nodes (leader + 1 follower = 2/3) → COMMIT
  6. Apply to state machine (x = 5)
  7. Reply to client: "success"
  8. Advertise the new commit index in the next AppendEntries; followers apply the entry
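
A sketch of the commit decision on the leader's side: given the highest log index known to be stored on each follower (Raft calls this bookkeeping matchIndex), the committed index is the highest index replicated on a majority. Simplified: real Raft additionally requires that entry to carry the leader's current term before committing it.

package main

import (
    "fmt"
    "sort"
)

// commitIndex returns the highest log index stored on a majority of
// the cluster. leaderLast is the leader's own last index; matchIndex[i]
// is the last index known to be replicated on follower i.
func commitIndex(leaderLast int, matchIndex []int) int {
    indexes := append([]int{leaderLast}, matchIndex...)
    sort.Sort(sort.Reverse(sort.IntSlice(indexes)))
    majority := len(indexes)/2 + 1
    // The (majority)th-highest index is stored on at least that many nodes.
    return indexes[majority-1]
}

func main() {
    // Leader holds entries through index 3; follower A through 3, follower C through 2.
    fmt.Println(commitIndex(3, []int{3, 2})) // 3: index 3 is on 2/3 nodes → committed
    fmt.Println(commitIndex(3, []int{1, 1})) // 1: only index 1 reaches a majority
}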

Paxos

Proposer → Acceptors (majority) → Learners

Phase 1 (Prepare):
  Proposer: "Prepare: proposal number N" (no value is sent yet)
  Acceptors: "Promise: I will not accept any proposal numbered below N.
              Here is the highest-numbered value I have already accepted, if any."

Phase 2 (Accept):
  Proposer: "Accept value V with proposal number N"
            (V is the highest-numbered value reported in Phase 1,
             or the proposer's own value if none was reported)
  Acceptors (majority): "Accepted"
  → Value V is chosen

Paxos is proven correct but notoriously hard to understand and implement.
Raft was designed explicitly as an "understandable Paxos."
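
To pin the two phases down, here is a minimal sketch of a single acceptor's state machine (illustrative only: real Paxos persists this state to disk before replying, and the proposer must still gather promises and accepts from a majority of acceptors):

package main

import "fmt"

// acceptor holds the durable Paxos state: the highest proposal number
// promised, and the highest-numbered proposal actually accepted.
type acceptor struct {
    promised      int
    acceptedNum   int
    acceptedValue string
}

// prepare handles Phase 1: promise to ignore proposals numbered below n
// and report any previously accepted value so the proposer can adopt it.
func (a *acceptor) prepare(n int) (ok bool, prevNum int, prevVal string) {
    if n <= a.promised {
        return false, 0, ""
    }
    a.promised = n
    return true, a.acceptedNum, a.acceptedValue
}

// accept handles Phase 2: accept (n, v) unless a higher-numbered
// prepare has been promised in the meantime.
func (a *acceptor) accept(n int, v string) bool {
    if n < a.promised {
        return false
    }
    a.promised, a.acceptedNum, a.acceptedValue = n, n, v
    return true
}

func main() {
    a := &acceptor{}
    ok, _, _ := a.prepare(1)
    fmt.Println(ok, a.accept(1, "x=5")) // true true: chosen once a majority accepts
    fmt.Println(a.accept(0, "x=9"))     // false: stale proposal number is rejected
}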

Split-Brain Prevention

3-node cluster:
  Quorum = ⌊3/2⌋ + 1 = 2

5-node cluster:
  Quorum = ⌊5/2⌋ + 1 = 3

Rule: Only the side of a partition that holds a quorum (majority) can make decisions.

Partition: [A, B] | [C, D, E]
  [A, B] = 2 nodes, cannot form quorum of 3 → REJECT writes
  [C, D, E] = 3 nodes, forms quorum → ACCEPT writes
  No split-brain: only one partition is active
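
The quorum arithmetic generalizes to any cluster size, and it also explains the even-node anti-pattern listed further below: an even-sized cluster pays for an extra node without tolerating any extra failures.

package main

import "fmt"

func main() {
    // quorum = n/2 + 1 (integer division); the cluster tolerates
    // n - quorum failures, i.e. floor((n-1)/2).
    for _, n := range []int{2, 3, 4, 5, 6, 7} {
        quorum := n/2 + 1
        fmt.Printf("nodes=%d  quorum=%d  tolerated failures=%d\n", n, quorum, n-quorum)
    }
    // 3 and 4 nodes both tolerate 1 failure; 5 and 6 both tolerate 2.
}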

Practical Applications

System      | Algorithm           | Use Case
etcd        | Raft                | Kubernetes state store
ZooKeeper   | ZAB (Paxos variant) | Distributed coordination
CockroachDB | Raft                | Distributed SQL
TiKV        | Raft                | Distributed key-value store
Consul      | Raft                | Service discovery
Kafka       | KRaft (Raft-based)  | Metadata management

Anti-Patterns

Anti-Pattern                          | Consequence                                      | Fix
Even number of nodes (2, 4)           | Tie votes: an even split cannot form a majority  | Use an odd number of nodes (3, 5, 7)
Cluster too large (> 7 nodes)         | Consensus latency grows with every extra voter   | Keep 3 or 5 voting nodes; add read replicas for scale
Ignoring network latency              | Mistuned consensus timeouts, spurious elections  | Measure latency and set timeouts accordingly
No monitoring of leader changes       | Frequent elections (a warning sign) go unnoticed | Alert on leader-election frequency
Rolling updates without quorum safety | Temporary quorum loss                            | Update one node at a time

Consensus algorithms are the foundation of distributed systems reliability. They turn unreliable networks and failing nodes into systems that behave correctly — as long as a majority of nodes are available.
