DNS for Engineers Who Keep Breaking Things: From Resolution to Debugging
Understand DNS from the ground up so you stop debugging the wrong thing when name resolution fails. Covers how DNS actually works, TTL traps, split-horizon, DNSSEC, and the debugging tools that will save you at 3 AM.
DNS is the most critical infrastructure service you never think about — until it breaks. Then it is the only thing anyone thinks about, because when DNS fails, everything fails. Your APIs return connection errors. Your databases become unreachable. Your CDN stops serving content. Kubernetes pods cannot find each other. And the error messages all say something different, because nobody logs “DNS resolution failed” — they log “connection refused” or “host not found” or simply “timeout.”
This guide is for engineers who need to understand DNS well enough to debug production issues and design reliable name resolution architectures.
How DNS Resolution Actually Works
Every time your application makes an HTTP request, this is what happens before a single byte of data is exchanged:
Your Application
│
▼
1. Check /etc/hosts (local override)
│ Not found
▼
2. Check local DNS cache (systemd-resolved, nscd, or app-level)
│ Not found or expired
▼
3. Query recursive resolver (usually your ISP or 8.8.8.8 / 1.1.1.1)
│ Recursive resolver starts the hunt:
▼
4. Query root nameserver → "Who handles .com?"
│ Response: "a.gtld-servers.net handles .com"
▼
5. Query .com TLD nameserver → "Who handles example.com?"
│ Response: "ns1.example.com is authoritative for example.com"
▼
6. Query authoritative nameserver → "What is the IP for api.example.com?"
│ Response: "93.184.216.34, TTL 300"
▼
7. Recursive resolver caches the result for 300 seconds
│ Returns answer to your application
▼
8. Your application connects to 93.184.216.34
│ Total time: 20-200ms for first resolution, <1ms for cached
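You can reproduce steps 4-6 by hand with dig. A minimal sketch, using the nameservers from the diagram (real delegations will differ):

```bash
# Step 4: ask a root server who handles .com (+norecurse asks for a referral only)
dig @a.root-servers.net com. NS +norecurse

# Step 5: ask a .com TLD server who is authoritative for example.com
dig @a.gtld-servers.net example.com. NS +norecurse

# Step 6: ask the authoritative server for the actual A record
dig @ns1.example.com api.example.com. A +norecurse

# Or let dig walk the entire chain in one command:
dig api.example.com +trace
```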
Record Types You Need to Know
| Type | Purpose | Example |
|---|---|---|
| A | Domain → IPv4 address | api.example.com → 93.184.216.34 |
| AAAA | Domain → IPv6 address | api.example.com → 2606:2800:220:1:: |
| CNAME | Domain → another domain (alias) | www.example.com → example.com |
| MX | Mail exchange server | example.com → mail.example.com (priority 10) |
| TXT | Arbitrary text (SPF, DKIM, verification) | example.com → "v=spf1 include:_spf.google.com ~all" |
| NS | Nameserver delegation | example.com → ns1.example.com |
| SRV | Service discovery (port + host) | _http._tcp.example.com → 10 0 8080 api.example.com |
| SOA | Zone authority info | Contains serial, refresh, retry, expire timers |
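Every one of these is queryable with dig; `+short` strips the output down to the record data, which is what you want in scripts. The values below mirror the examples in the table:

```bash
dig example.com A +short      # 93.184.216.34
dig example.com AAAA +short   # 2606:2800:220:1::
dig example.com MX +short     # 10 mail.example.com.
dig example.com TXT +short    # "v=spf1 include:_spf.google.com ~all"
dig example.com NS +short     # ns1.example.com.
dig example.com SOA +short    # primary NS, contact, serial, refresh, retry, expire, minimum
```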
TTL: The Trap That Gets Everyone
TTL (Time To Live) controls how long DNS records are cached. Setting TTL wrong causes two categories of pain:
| TTL Too Short (< 60s) | TTL Too Long (> 3600s) |
|---|---|
| Increased DNS query traffic (cost) | Changes take hours to propagate |
| Higher resolution latency | Disaster recovery is slow |
| Resolver rate limiting risk | Cannot quickly redirect traffic |
| Appropriate for: failover records | Appropriate for: stable records that never change |
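You can watch caching happen: query the same resolver twice and the TTL in the answer counts down. When it reaches zero, the resolver re-fetches from the authoritative server:

```bash
$ dig api.example.com +noall +answer
api.example.com.  300  IN  A  93.184.216.34   # freshly fetched, full TTL

$ sleep 30; dig api.example.com +noall +answer
api.example.com.  270  IN  A  93.184.216.34   # same cached answer, 30s closer to expiry
```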
TTL Strategy
- Static infrastructure (mail servers, NS records): 86400 (24 hours)
- Normal web services: 300-3600 (5 minutes to 1 hour)
- Records you might need to change quickly: 60-300 (1-5 minutes)
Pro tip: Before a planned migration:
1. Lower TTL to 60 seconds 48 hours before the change
2. Make the DNS change
3. Wait for propagation (now only ~60 seconds)
4. Raise TTL back to normal after verifying
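What step 1 looks like depends entirely on your DNS provider. As one sketch, for a zone hosted in AWS Route 53 (the hosted zone ID and IP below are placeholders):

```bash
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0PLACEHOLDER \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "93.184.216.34"}]
      }
    }]
  }'
```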
The most common DNS mistake in production: changing a DNS record and expecting it to take effect immediately. If the TTL was 3600 (1 hour), resolvers around the world will keep serving the old record for up to an hour after your change. Aside from the cache-flush pages a few public resolvers offer, there is nothing you can do to hurry them. Plan accordingly.
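You can at least measure the wait. The TTL a resolver returns is the number of seconds before it re-queries, so comparing a public resolver's cached answer against the authoritative one shows who is still serving stale data and for how long:

```bash
# What a public resolver is still serving, and for how many more seconds (TTL column)
dig @8.8.8.8 api.example.com +noall +answer

# What the authoritative nameserver says right now
dig @ns1.example.com api.example.com +noall +answer
```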
DNS in Kubernetes
Kubernetes has its own internal DNS system (CoreDNS) that resolves service names to cluster IPs.
# Inside a Kubernetes cluster, service discovery is DNS:
# Same namespace:
curl http://api-service:8080/health
# Resolves via the search path to: api-service.default.svc.cluster.local (if the pod is in the default namespace)
# Cross-namespace:
curl http://api-service.production.svc.cluster.local:8080/health
# External:
curl http://api.example.com
# CoreDNS forwards to upstream resolver
Kubernetes DNS Debugging
# 1. Check if CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# 2. Test DNS resolution from inside a pod
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
nslookup api-service.production.svc.cluster.local
# 3. Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# 4. Verify /etc/resolv.conf in a pod
kubectl exec deploy/api-server -- cat /etc/resolv.conf
# Should show: nameserver 10.96.0.10 (CoreDNS ClusterIP)
# search production.svc.cluster.local svc.cluster.local cluster.local
# 5. Common issue: ndots:5 default causes excessive DNS queries
# For external domains, pods try 5 search suffixes before querying the real domain
# Fix: set dnsConfig.options ndots to 2 in pod spec
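A sketch of that fix as a pod spec; the pod name and image are placeholders, and the same dnsConfig block goes in a Deployment's pod template:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-pod        # placeholder name
spec:
  containers:
  - name: app
    image: busybox:1.36      # placeholder image
    command: ["sleep", "3600"]
  dnsConfig:
    options:
    - name: ndots
      value: "2"             # names with 2+ dots are tried as-is before search suffixes
EOF
```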
Debugging DNS in Production
When something is broken and you suspect DNS, use these tools in order:
Quick Diagnosis Workflow
# Step 1: Can you resolve the domain at all?
dig api.example.com
# Step 2: Which nameservers are authoritative? (+trace walks the delegation from the root, bypassing caches)
dig api.example.com +trace
# Step 3: Is the answer cached somewhere stale?
# Query authoritative nameserver directly
dig @ns1.example.com api.example.com
# Step 4: Is it a specific record type issue?
dig api.example.com A
dig api.example.com AAAA
dig api.example.com CNAME
# Step 5: What does your application see?
# (Use the same resolver as the app)
nslookup api.example.com
host api.example.com
# Step 6: Check for DNS propagation across resolvers
dig @8.8.8.8 api.example.com # Google
dig @1.1.1.1 api.example.com # Cloudflare
dig @208.67.222.222 api.example.com # OpenDNS
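Step 6 is worth scripting; this loop shows at a glance which resolvers have picked up a change:

```bash
for resolver in 8.8.8.8 1.1.1.1 208.67.222.222; do
  echo -n "$resolver: "
  dig @"$resolver" api.example.com +short
done
```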
Reading dig Output
$ dig api.example.com
; <<>> DiG 9.18.1 <<>> api.example.com
;; QUESTION SECTION:
;api.example.com. IN A ← What we asked for
;; ANSWER SECTION:
api.example.com. 300 IN A 93.184.216.34 ← The answer; 300 is the TTL (seconds remaining)
;; Query time: 23 msec ← Resolution time
;; SERVER: 192.168.1.1#53 ← Which resolver answered
;; WHEN: Mon Jul 15 10:30:00 UTC 2024
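A few dig flags worth keeping at hand during an incident:

```bash
dig api.example.com +short           # answer data only, ideal for scripts
dig api.example.com +noall +answer   # answer section only, with TTLs
dig api.example.com +trace           # walk the delegation from the root
dig api.example.com +tcp             # force TCP (spots truncation or firewall issues)
dig -x 93.184.216.34                 # reverse (PTR) lookup
```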
Split-Horizon DNS
Split-horizon (or split-brain) DNS returns different answers depending on where the query comes from. Common use case: internal services resolve to private IPs, external users resolve to public IPs.
External query for api.company.com:
→ 203.0.113.50 (public load balancer IP)
Internal query for api.company.com (from office/VPN):
→ 10.0.1.50 (private IP, bypasses public internet)
| When to Use | When to Avoid |
|---|---|
| Internal services accessed by both internal and external users | When it adds debugging complexity without clear benefit |
| Reducing hairpin routing through the public internet | When internal and external behavior should be identical |
| Security: hiding internal topology from external resolvers | When you do not have separate DNS infrastructure to manage |
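The fastest way to confirm split-horizon is behaving: ask an internal resolver and a public one and compare. The internal resolver IP below is a placeholder:

```bash
dig @10.0.0.2 api.company.com +short   # internal view: expect 10.0.1.50
dig @8.8.8.8  api.company.com +short   # external view: expect 203.0.113.50
```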
Implementation Checklist
- Audit all DNS records: remove stale entries, verify TTLs are appropriate
- Set TTL to 300s (5 min) for records that might need emergency changes
- Verify Kubernetes CoreDNS is healthy: pods running, no error logs
- Fix ndots setting in Kubernetes pods to reduce unnecessary DNS queries
- Document your DNS architecture: authoritative nameservers, resolvers, caching layers
- Set up DNS monitoring: alert on resolution failures, not just application errors
- Before any DNS migration, lower TTL 48 hours in advance
- Add `dig` and `nslookup` to your incident response debugging checklist
- Implement DNSSEC for domains that handle sensitive traffic (a quick check follows this list)
- Test DNS failover: can you redirect traffic within 5 minutes?
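For the DNSSEC item, a quick validation check: query a validating resolver (8.8.8.8 validates) and look for the ad (authenticated data) flag plus RRSIG records in the response:

```bash
dig @8.8.8.8 example.com +dnssec
# Healthy DNSSEC: the header shows "flags: qr rd ra ad" and the answer
# section includes RRSIG records alongside the A record
```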