DNS for Engineers Who Keep Breaking Things: From Resolution to Debugging
Understand DNS from the ground up so you stop debugging the wrong thing when name resolution fails. Covers how DNS actually works, TTL traps, split-horizon, DNSSEC, and the debugging tools that will save you at 3 AM.
DNS is the most critical infrastructure service you never think about — until it breaks. Then it is the only thing anyone thinks about, because when DNS fails, everything fails. Your APIs return connection errors. Your databases become unreachable. Your CDN stops serving content. Kubernetes pods cannot find each other. And the error messages all say something different, because nobody logs “DNS resolution failed” — they log “connection refused” or “host not found” or simply “timeout.”
This guide is for engineers who need to understand DNS well enough to debug production issues and design reliable name resolution architectures.
How DNS Resolution Actually Works
Every time your application makes an HTTP request, this is what happens before a single byte of data is exchanged:
Your Application
│
▼
1. Check /etc/hosts (local override)
│ Not found
▼
2. Check local DNS cache (systemd-resolved, nscd, or app-level)
│ Not found or expired
▼
3. Query recursive resolver (usually your ISP or 8.8.8.8 / 1.1.1.1)
│ Recursive resolver starts the hunt:
▼
4. Query root nameserver → "Who handles .com?"
│ Response: "a.gtld-servers.net handles .com"
▼
5. Query .com TLD nameserver → "Who handles example.com?"
│ Response: "ns1.example.com is authoritative for example.com"
▼
6. Query authoritative nameserver → "What is the IP for api.example.com?"
│ Response: "93.184.216.34, TTL 300"
▼
7. Recursive resolver caches the result for 300 seconds
│ Returns answer to your application
▼
8. Your application connects to 93.184.216.34
│ Total time: 20-200ms for first resolution, <1ms for cached
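You can reproduce steps 4-6 by hand with dig. A minimal sketch, using the nameservers from the diagram (real delegations will differ):

```bash
# Step 4: ask a root server who handles .com (+norecurse asks for a referral only)
dig @a.root-servers.net com. NS +norecurse

# Step 5: ask a .com TLD server who is authoritative for example.com
dig @a.gtld-servers.net example.com. NS +norecurse

# Step 6: ask the authoritative server for the actual A record
dig @ns1.example.com api.example.com. A +norecurse

# Or let dig walk the entire chain in one command:
dig api.example.com +trace
```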
Record Types You Need to Know
| Type | Purpose | Example |
|---|---|---|
| A | Domain → IPv4 address | api.example.com → 93.184.216.34 |
| AAAA | Domain → IPv6 address | api.example.com → 2606:2800:220:1:: |
| CNAME | Domain → another domain (alias) | www.example.com → example.com |
| MX | Mail exchange server | example.com → mail.example.com (priority 10) |
| TXT | Arbitrary text (SPF, DKIM, verification) | example.com → "v=spf1 include:_spf.google.com ~all" |
| NS | Nameserver delegation | example.com → ns1.example.com |
| SRV | Service discovery (port + host) | _http._tcp.example.com → 10 0 8080 api.example.com |
| SOA | Zone authority info | Contains serial, refresh, retry, expire timers |
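Every one of these is queryable with dig; `+short` strips the output down to the record data, which is what you want in scripts. The values below mirror the examples in the table:

```bash
dig example.com A +short      # 93.184.216.34
dig example.com AAAA +short   # 2606:2800:220:1::
dig example.com MX +short     # 10 mail.example.com.
dig example.com TXT +short    # "v=spf1 include:_spf.google.com ~all"
dig example.com NS +short     # ns1.example.com.
dig example.com SOA +short    # primary NS, contact, serial, refresh, retry, expire, minimum
```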
TTL: The Trap That Gets Everyone
TTL (Time To Live) controls how long DNS records are cached. Setting TTL wrong causes two categories of pain:
| TTL Too Short (< 60s) | TTL Too Long (> 3600s) |
|---|---|
| Increased DNS query traffic (cost) | Changes take hours to propagate |
| Higher resolution latency | Disaster recovery is slow |
| Resolver rate limiting risk | Cannot quickly redirect traffic |
| Appropriate for: failover records | Appropriate for: stable records that never change |
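You can watch caching happen: query the same resolver twice and the TTL in the answer counts down. When it reaches zero, the resolver re-fetches from the authoritative server:

```bash
$ dig api.example.com +noall +answer
api.example.com.  300  IN  A  93.184.216.34   # freshly fetched, full TTL

$ sleep 30; dig api.example.com +noall +answer
api.example.com.  270  IN  A  93.184.216.34   # same cached answer, 30s closer to expiry
```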
TTL Strategy
- Static infrastructure (mail servers, NS records): 86400 (24 hours)
- Normal web services: 300-3600 (5 minutes to 1 hour)
- Records you might need to change quickly: 60-300 (1-5 minutes)
Pro tip: Before a planned migration:
1. Lower TTL to 60 seconds 48 hours before the change
2. Make the DNS change
3. Wait for propagation (now only ~60 seconds)
4. Raise TTL back to normal after verifying
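What step 1 looks like depends entirely on your DNS provider. As one sketch, for a zone hosted in AWS Route 53 (the hosted zone ID and IP below are placeholders):

```bash
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0PLACEHOLDER \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "93.184.216.34"}]
      }
    }]
  }'
```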
The most common DNS mistake in production: changing a DNS record and expecting it to take effect immediately. If the TTL was 3600 (1 hour), resolvers around the world will keep serving the old record for up to an hour after your change. Aside from the cache-flush pages a few public resolvers offer, there is nothing you can do to hurry them. Plan accordingly.
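You can at least measure the wait. The TTL a resolver returns is the number of seconds before it re-queries, so comparing a public resolver's cached answer against the authoritative one shows who is still serving stale data and for how long:

```bash
# What a public resolver is still serving, and for how many more seconds (TTL column)
dig @8.8.8.8 api.example.com +noall +answer

# What the authoritative nameserver says right now
dig @ns1.example.com api.example.com +noall +answer
```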
DNS in Kubernetes
Kubernetes has its own internal DNS system (CoreDNS) that resolves service names to cluster IPs.
# Inside a Kubernetes cluster, service discovery is DNS:
# Same namespace:
curl http://api-service:8080/health
# Resolves via the search path to: api-service.default.svc.cluster.local (if the pod is in the default namespace)
# Cross-namespace:
curl http://api-service.production.svc.cluster.local:8080/health
# External:
curl http://api.example.com
# CoreDNS forwards to upstream resolver
Kubernetes DNS Debugging
# 1. Check if CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# 2. Test DNS resolution from inside a pod
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
nslookup api-service.production.svc.cluster.local
# 3. Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# 4. Verify /etc/resolv.conf in a pod
kubectl exec deploy/api-server -- cat /etc/resolv.conf
# Should show: nameserver 10.96.0.10 (CoreDNS ClusterIP)
# search production.svc.cluster.local svc.cluster.local cluster.local
# 5. Common issue: ndots:5 default causes excessive DNS queries
# For external domains, pods try 5 search suffixes before querying the real domain
# Fix: set dnsConfig.options ndots to 2 in pod spec
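A sketch of that fix as a pod spec; the pod name and image are placeholders, and the same dnsConfig block goes in a Deployment's pod template:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-pod        # placeholder name
spec:
  containers:
  - name: app
    image: busybox:1.36      # placeholder image
    command: ["sleep", "3600"]
  dnsConfig:
    options:
    - name: ndots
      value: "2"             # names with 2+ dots are tried as-is before search suffixes
EOF
```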
Debugging DNS in Production
When something is broken and you suspect DNS, use these tools in order:
Quick Diagnosis Workflow
# Step 1: Can you resolve the domain at all?
dig api.example.com
# Step 2: Which nameservers are authoritative? (+trace walks the delegation from the root, bypassing caches)
dig api.example.com +trace
# Step 3: Is the answer cached somewhere stale?
# Query authoritative nameserver directly
dig @ns1.example.com api.example.com
# Step 4: Is it a specific record type issue?
dig api.example.com A
dig api.example.com AAAA
dig api.example.com CNAME
# Step 5: What does your application see?
# (Use the same resolver as the app)
nslookup api.example.com
host api.example.com
# Step 6: Check for DNS propagation across resolvers
dig @8.8.8.8 api.example.com # Google
dig @1.1.1.1 api.example.com # Cloudflare
dig @208.67.222.222 api.example.com # OpenDNS
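Step 6 is worth scripting; this loop shows at a glance which resolvers have picked up a change:

```bash
for resolver in 8.8.8.8 1.1.1.1 208.67.222.222; do
  echo -n "$resolver: "
  dig @"$resolver" api.example.com +short
done
```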
Reading dig Output
$ dig api.example.com
; <<>> DiG 9.18.1 <<>> api.example.com
;; QUESTION SECTION:
;api.example.com. IN A ← What we asked for
;; ANSWER SECTION:
api.example.com. 300 IN A 93.184.216.34 ← The answer; 300 is the TTL (seconds remaining)
;; Query time: 23 msec ← Resolution time
;; SERVER: 192.168.1.1#53 ← Which resolver answered
;; WHEN: Mon Jul 15 10:30:00 UTC 2024
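A few dig flags worth keeping at hand during an incident:

```bash
dig api.example.com +short           # answer data only, ideal for scripts
dig api.example.com +noall +answer   # answer section only, with TTLs
dig api.example.com +trace           # walk the delegation from the root
dig api.example.com +tcp             # force TCP (spots truncation or firewall issues)
dig -x 93.184.216.34                 # reverse (PTR) lookup
```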
Split-Horizon DNS
Split-horizon (or split-brain) DNS returns different answers depending on where the query comes from. Common use case: internal services resolve to private IPs, external users resolve to public IPs.
External query for api.company.com:
→ 203.0.113.50 (public load balancer IP)
Internal query for api.company.com (from office/VPN):
→ 10.0.1.50 (private IP, bypasses public internet)
| When to Use | When to Avoid |
|---|---|
| Internal services accessed by both internal and external users | When it adds debugging complexity without clear benefit |
| Reducing hairpin routing through the public internet | When internal and external behavior should be identical |
| Security: hiding internal topology from external resolvers | When you do not have separate DNS infrastructure to manage |
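The fastest way to confirm split-horizon is behaving: ask an internal resolver and a public one and compare. The internal resolver IP below is a placeholder:

```bash
dig @10.0.0.2 api.company.com +short   # internal view: expect 10.0.1.50
dig @8.8.8.8  api.company.com +short   # external view: expect 203.0.113.50
```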
Implementation Checklist
- Audit all DNS records: remove stale entries, verify TTLs are appropriate
- Set TTL to 300s (5 min) for records that might need emergency changes
- Verify Kubernetes CoreDNS is healthy: pods running, no error logs
- Fix ndots setting in Kubernetes pods to reduce unnecessary DNS queries
- Document your DNS architecture: authoritative nameservers, resolvers, caching layers
- Set up DNS monitoring: alert on resolution failures, not just application errors
- Before any DNS migration, lower TTL 48 hours in advance
- Add `dig` and `nslookup` to your incident response debugging checklist
- Implement DNSSEC for domains that handle sensitive traffic (a quick check follows this list)
- Test DNS failover: can you redirect traffic within 5 minutes?
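For the DNSSEC item, a quick validation check: query a validating resolver (8.8.8.8 validates) and look for the ad (authenticated data) flag plus RRSIG records in the response:

```bash
dig @8.8.8.8 example.com +dnssec
# Healthy DNSSEC: the header shows "flags: qr rd ra ad" and the answer
# section includes RRSIG records alongside the A record
```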