ESC
Type to search guides, tutorials, and reference documentation.
Verified by Garnet Grid

Database Replication Topologies

Design and operate database replication for high availability, disaster recovery, and read scaling. Covers single-leader, multi-leader, and leaderless replication, replication lag, conflict resolution, and the patterns that keep data consistent across replicas.

Database replication copies data from one server to others to achieve high availability, disaster recovery, and read scalability. But replication introduces complexity: lag, conflicts, split-brain scenarios, and consistency tradeoffs. Understanding replication topologies is essential for designing reliable data systems.


Replication Topologies

Single-Leader (Primary-Replica):
  ┌──────────┐    Replication    ┌──────────┐
  │ Primary  │ ────────────────→ │ Replica 1│ (read-only)
  │ (writes) │ ────────────────→ │ Replica 2│ (read-only)
  └──────────┘                   └──────────┘
  
  Writes: Primary only
  Reads: Primary + Replicas
  Failover: Promote a replica to primary
  Best for: Read-heavy workloads, simple architecture
  
Multi-Leader (Active-Active):
  ┌──────────┐    ←→    ┌──────────┐
  │ Leader 1 │          │ Leader 2 │
  │ (US-East)│ ←──────→ │ (EU-West)│
  └──────────┘          └──────────┘
  
  Writes: Any leader
  Conflict: Same row modified on both leaders simultaneously
  Resolution: Last-write-wins, custom merge, CRDT
  Best for: Multi-region, offline-capable apps
  
Leaderless (Quorum):
  ┌──────────┐  ┌──────────┐  ┌──────────┐
  │  Node 1  │  │  Node 2  │  │  Node 3  │
  └──────────┘  └──────────┘  └──────────┘
  
  Write: To W of N nodes (e.g., 2 of 3)
  Read: From R of N nodes (e.g., 2 of 3)
  Rule: W + R > N guarantees overlap (latest value)
  Best for: High availability, partition tolerance

Replication Lag

# Monitoring replication lag

# PostgreSQL: Check replication lag
"""
SELECT 
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes,
    EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS lag_seconds
FROM pg_stat_replication;
"""

# Read-your-writes consistency pattern
class ReadYourWritesRouter:
    """Ensure user reads their own recent writes."""
    
    def route_read(self, user_id: str, query: str):
        last_write = self.get_last_write_time(user_id)
        replica_lag = self.get_replica_lag()
        
        if last_write and replica_lag > 0:
            time_since_write = time.time() - last_write
            if time_since_write < replica_lag + 1:
                # Recent write — read from primary
                return self.primary.execute(query)
        
        # Safe to read from replica
        return self.replica.execute(query)

Failover Strategies

Automatic Failover (Managed Services):
  AWS RDS: Automatic failover in ~60-120 seconds
  Azure SQL: Automatic failover groups with DNS redirect
  Cloud SQL: Regional HA with automatic failover
  
Manual Failover (Self-Managed):
  1. Detect primary failure (health checks, heartbeat)
  2. Verify primary is truly down (avoid split-brain)
  3. Select promotion candidate (most up-to-date replica)
  4. Promote replica to primary
  5. Redirect application connections
  6. Reconfigure remaining replicas to follow new primary
  
  Risk: Data loss if async replication and primary failed before
        replicas caught up.
  
  Mitigation: Synchronous replication (zero data loss, higher latency)

Anti-Patterns

Anti-PatternConsequenceFix
Ignore replication lagStale reads, user confusionMonitor lag, route critical reads to primary
No failover testingFirst failover is during an outageRegular failover drills
Multi-leader without conflict resolutionData corruption, lost writesExplicit conflict resolution strategy
Sync replication everywhereHigh latency, reduced availabilitySync for critical data, async for the rest
No monitoring of replica healthSilent replica failureAlert on lag, connection status, query errors

Replication is the foundation of high availability, but it introduces complexity that must be actively managed. Choose the simplest topology that meets your requirements, and always monitor replication health.

Jakub Dimitri Rezayev
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.

View Full Profile →