DNS Engineering
Master DNS as a critical infrastructure component. Covers DNS architecture, caching, security extensions (DNSSEC), split-horizon DNS, DNS-based service discovery, failover patterns, and the DNS problems that cause the most outages.
DNS is the most critical infrastructure component that most engineers take for granted. Nearly every API call, page load, and microservice request begins with name resolution, whether it is answered from a cache or by a fresh lookup. When DNS breaks, everything that depends on names breaks with it. Understanding DNS deeply (its caching layers, failure modes, and security implications) is essential for building reliable systems.
DNS Resolution Path
Application → Local Cache → OS Resolver → Recursive Resolver (ISP or 8.8.8.8) → Root → TLD → Authoritative
1. App calls getaddrinfo("api.example.com")
2. Check local DNS cache (browser, OS)
3. OS resolver sends query to configured recursive resolver
4. Recursive resolver checks its cache
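5. If not cached: the recursive resolver queries Root → .com TLD → example.com authoritative, following each referral in turn
6. Authoritative server responds with the requested record (for an A query, the IPv4 address)
7. Each caching layer keeps the result for up to the record's TTL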
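The caching behaviour in steps 2, 4, and 7 can be sketched as a single layer that sits in front of the next resolver up. This is a minimal illustration, not how any particular resolver is implemented; `CachingResolver`, the `upstream` callable, and the injectable `clock` are hypothetical names introduced for this example.

```python
import time
from typing import Callable, Dict, Tuple

# (address, ttl_seconds) as returned by the next layer up
Answer = Tuple[str, int]


class CachingResolver:
    """One caching layer: answers from cache until the record's TTL expires."""

    def __init__(self, upstream: Callable[[str], Answer],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._upstream = upstream
        self._clock = clock
        # name -> (address, absolute expiry time)
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, name: str) -> str:
        now = self._clock()
        hit = self._cache.get(name)
        if hit is not None and hit[1] > now:
            return hit[0]                          # served from this layer's cache
        address, ttl = self._upstream(name)        # cache miss: ask the next layer
        self._cache[name] = (address, now + ttl)   # keep it until the TTL runs out
        return address
```

Stacking several of these (browser, OS, recursive resolver) shows why a record change can take up to one TTL to become visible at each layer that already holds the old answer.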
Record Types
| Type | Purpose | Example |
|---|---|---|
| A | IPv4 address | api.example.com → 192.168.1.100 |
| AAAA | IPv6 address | api.example.com → 2001:db8::1 |
| CNAME | Alias to another name | www → api.example.com |
| MX | Mail server | example.com → mail.example.com |
| TXT | Arbitrary text | SPF, DKIM, domain verification |
| NS | Nameserver delegation | example.com → ns1.provider.com |
| SRV | Service location | _sip._tcp.example.com → sip.example.com:5060 |
| CAA | Certificate authority authorization | example.com → letsencrypt.org |
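Most record types carry a single value, but SRV packs four fields into its data: priority, weight, port, and target. As a small sketch of how a client might read one, the snippet below parses the presentation form of the SRV example from the table. The `SrvRecord` class, the `parse_srv_rdata` helper, and the priority and weight values (10 and 60) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int   # lower value is tried first
    weight: int     # relative share among records with equal priority
    port: int
    target: str


def parse_srv_rdata(rdata: str) -> SrvRecord:
    """Parse the presentation form 'priority weight port target'."""
    priority, weight, port, target = rdata.split()
    return SrvRecord(int(priority), int(weight), int(port), target.rstrip("."))


record = parse_srv_rdata("10 60 5060 sip.example.com.")
print(f"{record.target}:{record.port}")  # sip.example.com:5060
```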
TTL Strategy
| Record Type | Recommended TTL | Rationale |
|---|---|---|
| Production A/AAAA | 300s (5 min) | Balance between caching and failover speed |
| CDN CNAME | 3600s (1 hr) | CDN handles its own failover |
| MX records | 3600s (1 hr) | Mail routing changes rarely |
| TXT (SPF/DKIM) | 3600s (1 hr) | Email auth changes rarely |
| NS records | 86400s (24 hr) | Nameserver changes are planned |
| Pre-migration | 60s (1 min) | Lower before DNS changes, raise after |
TTL Before Migration
Week before migration: Lower TTL to 60s
Day of migration: Change DNS records
Post-migration: Verify, then raise TTL to normal
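The timing behind this sequence is simple arithmetic: a cache that fetched the record just before the TTL was lowered may hold it for the full old TTL, so the cutover should wait at least that long. The sketch below works through it with made-up dates; both function names are hypothetical.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """Caches filled just before the TTL was lowered can keep the record
    for up to old_ttl_seconds, so wait at least that long before the change."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def worst_case_stale_until(changed_at: datetime, current_ttl_seconds: int) -> datetime:
    """After the record changes, a cache may serve the old answer until this time."""
    return changed_at + timedelta(seconds=current_ttl_seconds)


lowered = datetime(2024, 1, 1, 9, 0)           # TTL lowered from 86400s to 60s
print(earliest_safe_cutover(lowered, 86400))   # 2024-01-02 09:00:00

cutover = datetime(2024, 1, 8, 9, 0)           # records changed a week later
print(worst_case_stale_until(cutover, 60))     # 2024-01-08 09:01:00
```

Lowering the TTL a week ahead leaves a wide margin over the 24-hour minimum in this example, which also covers resolvers that hold records somewhat longer than the TTL allows.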
DNS-Based Load Balancing
Round-Robin
api.example.com. 300 IN A 192.168.1.100
api.example.com. 300 IN A 192.168.1.101
api.example.com. 300 IN A 192.168.1.102
Simple distribution, no health awareness.
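As a rough model of what a round-robin server does, the sketch below returns the full record set on every query and rotates the order between answers, so clients that take the first address spread across the pool. `RoundRobinRecordSet` is a hypothetical name, and real servers and resolvers may order records differently.

```python
from collections import deque
from typing import Deque, List


class RoundRobinRecordSet:
    def __init__(self, addresses: List[str]) -> None:
        self._addresses: Deque[str] = deque(addresses)

    def answer(self) -> List[str]:
        """Return all records, then rotate so the next answer starts one later."""
        current = list(self._addresses)
        self._addresses.rotate(-1)
        return current


rrset = RoundRobinRecordSet(["192.168.1.100", "192.168.1.101", "192.168.1.102"])
first_choices = [rrset.answer()[0] for _ in range(4)]
print(first_choices)
# ['192.168.1.100', '192.168.1.101', '192.168.1.102', '192.168.1.100']
```

Nothing here checks whether an address is healthy, which is exactly the limitation noted above: a dead server keeps receiving its share of traffic until the record is removed.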
Weighted Routing
# Weighted routing in the style of Route 53 (illustrative config, not actual Route 53 syntax)
api.example.com:
  - record: 192.168.1.100
    weight: 70  # Primary
  - record: 192.168.1.101
    weight: 20  # Secondary
  - record: 192.168.1.102
    weight: 10  # Canary
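Conceptually, a weighted policy answers each query with one record chosen in proportion to its weight. The following sketch simulates that selection with the standard library; it is an illustration of the idea, not a description of how Route 53 implements it, and `pick_weighted` is a hypothetical helper.

```python
import random
from collections import Counter
from typing import Dict, Optional


def pick_weighted(weights: Dict[str, int],
                  rng: Optional[random.Random] = None) -> str:
    """Choose one address with probability weight / sum(weights)."""
    rng = rng if rng is not None else random.Random()
    addresses = list(weights)
    return rng.choices(addresses, weights=[weights[a] for a in addresses], k=1)[0]


WEIGHTS = {"192.168.1.100": 70, "192.168.1.101": 20, "192.168.1.102": 10}

# Simulate 10,000 queries with a seeded generator so runs are repeatable.
rng = random.Random(0)
counts = Counter(pick_weighted(WEIGHTS, rng) for _ in range(10_000))
print(counts.most_common())
```

The observed split approaches 70/20/10 over many queries. Because resolvers cache each answer for its TTL, the traffic split seen by the servers can deviate from the query split.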
Geolocation Routing
api.example.com:
  - region: us-east-1
    target: us-east.api.example.com
  - region: eu-west-1
    target: eu-west.api.example.com
  - region: ap-southeast-1
    target: ap-southeast.api.example.com
  - default:
      target: us-east.api.example.com
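The lookup this config describes reduces to a table with a fallback: if the client's region has an entry, return its target, otherwise return the default. A minimal sketch follows, assuming the region has already been derived from the client or resolver address, which is the hard part and is not shown. `GEO_TARGETS` and `target_for` are hypothetical names.

```python
from typing import Dict

GEO_TARGETS: Dict[str, str] = {
    "us-east-1": "us-east.api.example.com",
    "eu-west-1": "eu-west.api.example.com",
    "ap-southeast-1": "ap-southeast.api.example.com",
}
DEFAULT_TARGET = "us-east.api.example.com"


def target_for(region: str) -> str:
    """Return the regional target, falling back to the default for unknown regions."""
    return GEO_TARGETS.get(region, DEFAULT_TARGET)


print(target_for("eu-west-1"))   # eu-west.api.example.com
print(target_for("sa-east-1"))   # us-east.api.example.com (no entry, so default)
```

Without the default entry, clients from unmapped regions would get no answer at all, so the fallback is the part most worth verifying.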
DNS Security
DNSSEC
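DNSSEC adds cryptographic signatures to DNS records so that a validating resolver can detect forged or altered responses. This defends against cache poisoning and on-path tampering, but it does not encrypt queries or responses: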
Without DNSSEC:
Attacker can forge DNS responses → redirect traffic to malicious server
With DNSSEC:
Resolver validates signature chain → forged responses are rejected
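One link in that signature chain is the DS record: the parent zone publishes a digest of the child zone's key, and a resolver recomputes it to confirm the key it received is the one the parent vouches for. The sketch below shows only that digest comparison (SHA-256, DS digest type 2), under the assumption of made-up key bytes. It does not verify any RRSIG signatures, and the helper names are hypothetical.

```python
import hashlib
import struct


def name_to_wire(name: str) -> bytes:
    """Canonical wire form: lowercase, length-prefixed labels, zero terminator."""
    stripped = name.lower().rstrip(".")
    if not stripped:
        return b"\x00"  # the root name
    wire = b""
    for label in stripped.split("."):
        encoded = label.encode("ascii")
        wire += bytes([len(encoded)]) + encoded
    return wire + b"\x00"


def ds_digest_sha256(owner: str, flags: int, protocol: int,
                     algorithm: int, public_key: bytes) -> bytes:
    """SHA-256 over owner name + DNSKEY RDATA (flags, protocol, algorithm, key)."""
    rdata = struct.pack("!HBB", flags, protocol, algorithm) + public_key
    return hashlib.sha256(name_to_wire(owner) + rdata).digest()


KEY = bytes(range(64))  # made-up key material, not a real public key
# flags 257 = key signing key, protocol 3, algorithm 13 = ECDSA P-256 with SHA-256
published_ds = ds_digest_sha256("example.com.", 257, 3, 13, KEY)

forged_key = b"\xff" + KEY[1:]
print(ds_digest_sha256("Example.COM", 257, 3, 13, KEY) == published_ds)        # True
print(ds_digest_sha256("example.com.", 257, 3, 13, forged_key) == published_ds)  # False
```

A production validator would use a maintained DNSSEC library rather than hand-rolled hashing; the point here is only that a substituted key no longer matches what the parent zone published.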
DNS over HTTPS (DoH) / DNS over TLS (DoT)
Traditional DNS: Plaintext over UDP (and TCP) on port 53 (anyone on the network path can see queries)
DoT: Encrypted DNS over TLS on port 853
DoH: Encrypted DNS over HTTPS on port 443
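To make the DoH case concrete, the sketch below builds a standard DNS query message and wraps it in the GET form defined by RFC 8484, where the message travels in a `dns` parameter as unpadded base64url. It only constructs the URL and sends nothing over the network. The endpoint `https://doh.example/dns-query` is a placeholder, and both helper names are hypothetical.

```python
import base64
import struct


def build_query(name: str, qtype: int = 1) -> bytes:
    """Build a minimal DNS query message (qtype 1 = A, class 1 = IN)."""
    # Header: ID 0 (RFC 8484 recommends 0 so HTTP caches work well),
    # flags 0x0100 (recursion desired), one question, no other sections.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def doh_get_url(endpoint: str, name: str) -> str:
    """RFC 8484 GET form: the query goes in ?dns= as unpadded base64url."""
    encoded = base64.urlsafe_b64encode(build_query(name)).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={encoded}"


print(doh_get_url("https://doh.example/dns-query", "www.example.com"))
# https://doh.example/dns-query?dns=AAABAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
```

The bytes inside the parameter are the same message that would otherwise go out in plaintext on port 53; DoH and DoT change the transport, not the DNS message itself.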
Common DNS Problems
| Problem | Symptom | Fix |
|---|---|---|
| TTL too high during migration | Old IP served for hours | Lower TTL days before change |
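| CNAME at zone apex | Not allowed: a CNAME cannot coexist with the SOA and NS records required at the apex (RFC 1034) | Use a provider-specific ALIAS/ANAME record or plain A/AAAA records |
| Long CNAME chains | Slow resolution, timeouts | Flatten the chain so each name points directly at its final target |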
| No CAA records | Any CA can issue certificates | Add CAA restricting to your CA |
| NS delegation mismatch | Partial resolution failures | Ensure NS records match at registrar and zone |
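CNAME chain problems are easy to check for offline if the zone's aliases are available as data. The sketch below walks a chain from a starting name and flags loops and chains beyond a hop limit. The `cnames` mapping, the `follow_cnames` helper, and the limit of 8 hops are assumptions for illustration; real resolvers each choose their own limit.

```python
from typing import Dict, List

MAX_CHAIN = 8  # assumed limit for this sketch; resolvers pick their own


def follow_cnames(name: str, cnames: Dict[str, str],
                  max_chain: int = MAX_CHAIN) -> List[str]:
    """Return the names visited, raising on loops or overly long chains."""
    chain = [name]
    while chain[-1] in cnames:
        nxt = cnames[chain[-1]]
        if nxt in chain:
            raise ValueError(f"CNAME loop: {' -> '.join(chain + [nxt])}")
        chain.append(nxt)
        if len(chain) - 1 > max_chain:
            raise ValueError(f"CNAME chain longer than {max_chain} hops")
    return chain


zone = {
    "www.example.com": "api.example.com",
    "api.example.com": "lb.example.net",
}
print(follow_cnames("www.example.com", zone))
# ['www.example.com', 'api.example.com', 'lb.example.net']
```

Each extra hop can cost another round trip when the intermediate names are not cached, which is where the slow resolution in the table comes from.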
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Hardcoded IP addresses | Cannot change infrastructure without code change | Use DNS names everywhere |
| TTL of 86400s on active records | Failover can take up to 24 hours | Use 300s for production records |
| Single DNS provider | DNS provider outage = total outage | Multi-provider DNS |
| No DNS monitoring | Resolution failures go undetected | Monitor query success rate and latency |
| Ignoring negative caching | NXDOMAIN is cached, so newly created records stay invisible | Check the negative cache TTL (the lower of the SOA record's TTL and its MINIMUM field) |
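For the monitoring row, a basic probe only needs to resolve a list of names and record how many succeed and how long the slowest takes. The sketch below does that through the OS resolver by default, so it measures what applications on the host actually see, including local caching. The `probe` function and its injectable `resolve` parameter are hypothetical; a real setup would export these numbers to a metrics system.

```python
import socket
import time
from typing import Callable, Iterable, Tuple


def probe(names: Iterable[str],
          resolve: Callable[[str], object] = lambda n: socket.getaddrinfo(n, None)
          ) -> Tuple[float, float]:
    """Return (success_rate, worst_latency_seconds) over the given names."""
    attempts = successes = 0
    worst = 0.0
    for name in names:
        attempts += 1
        start = time.perf_counter()
        try:
            resolve(name)
            successes += 1
        except OSError:  # socket.gaierror is a subclass of OSError
            pass
        worst = max(worst, time.perf_counter() - start)
    rate = successes / attempts if attempts else 0.0
    return rate, worst
```

A call such as `probe(["api.example.com", "www.example.com"])` run on a schedule gives the two signals the table names. Probing from several networks and against each DNS provider separately helps tell a provider outage apart from a local resolver problem.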
DNS is deceptively simple on the surface and deeply complex underneath. Many “network issues” turn out to be DNS issues, and many “slow connections” start with slow DNS. Understanding DNS at depth prevents entire categories of outages.