
DNS for Engineers Who Keep Breaking Things: From Resolution to Debugging

Understand DNS from the ground up so you stop debugging the wrong thing when name resolution fails. Covers how DNS actually works, TTL traps, split-horizon, DNSSEC, and the debugging tools that will save you at 3 AM.

DNS is the most critical infrastructure service you never think about — until it breaks. Then it is the only thing anyone thinks about, because when DNS fails, everything fails. Your APIs return connection errors. Your databases become unreachable. Your CDN stops serving content. Kubernetes pods cannot find each other. And the error messages all say something different, because nobody logs “DNS resolution failed” — they log “connection refused” or “host not found” or simply “timeout.”

This guide is for engineers who need to understand DNS well enough to debug production issues and design reliable name resolution architectures.


How DNS Resolution Actually Works

Every time your application makes an HTTP request, this is what happens before a single byte of data is exchanged:

Your Application


1. Check /etc/hosts (local override)
   │ Not found

2. Check local DNS cache (systemd-resolved, nscd, or app-level)
   │ Not found or expired

3. Query recursive resolver (usually your ISP or 8.8.8.8 / 1.1.1.1)
   │ Recursive resolver starts the hunt:

4. Query root nameserver → "Who handles .com?"
   │ Response: "a.gtld-servers.net handles .com"

5. Query .com TLD nameserver → "Who handles example.com?"
   │ Response: "ns1.example.com is authoritative for example.com"

6. Query authoritative nameserver → "What is the IP for api.example.com?"
   │ Response: "93.184.216.34, TTL 300"

7. Recursive resolver caches the result for 300 seconds
   │ Returns answer to your application

8. Your application connects to 93.184.216.34
   │ Total time: 20-200ms for first resolution, <1ms for cached
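The delegation chain in steps 4-6 can be sketched as a toy resolver walking a mocked zone tree. This is a simplified model (the `ZONES` dict, including the nameserver names, just restates the example above — it is not real DNS data):

```python
# Mock delegation data: each zone either refers you down one level
# (steps 4-5) or holds the authoritative records (step 6).
ZONES = {
    ".":            {"referral": ("com.", "a.gtld-servers.net")},
    "com.":         {"referral": ("example.com.", "ns1.example.com")},
    "example.com.": {"records": {"api.example.com.": ("93.184.216.34", 300)}},
}

def resolve(fqdn: str) -> tuple[str, int]:
    """Walk root -> TLD -> authoritative and return (IP, TTL)."""
    zone = "."
    while True:
        data = ZONES[zone]
        if "records" in data:
            return data["records"][fqdn]   # authoritative answer
        zone, _ns = data["referral"]       # follow the delegation; _ns is
                                           # the server we'd query next

ip, ttl = resolve("api.example.com.")
# ip == "93.184.216.34", ttl == 300
```

A real recursive resolver caches every referral along the way, which is why only the first lookup pays the full chain's latency.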

Record Types You Need to Know

Type    Purpose                                     Example
A       Domain → IPv4 address                       api.example.com → 93.184.216.34
AAAA    Domain → IPv6 address                       api.example.com → 2606:2800:220:1::
CNAME   Domain → another domain (alias)             www.example.com → example.com
MX      Mail exchange server                        example.com → mail.example.com (priority 10)
TXT     Arbitrary text (SPF, DKIM, verification)    example.com → "v=spf1 include:_spf.google.com"
NS      Nameserver delegation                       example.com → ns1.example.com
SRV     Service discovery (port + host)             _http._tcp.example.com → 10 0 8080 api.example.com
SOA     Zone authority info                         Contains serial, refresh, retry, expire timers
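The SRV record packs four fields into its value. A quick sketch of unpacking them (field names follow RFC 2782; the record data is the example from the table):

```python
from typing import NamedTuple

class SRVRecord(NamedTuple):
    priority: int   # lower value is preferred
    weight: int     # load-balancing weight among equal priorities
    port: int
    target: str

def parse_srv(rdata: str) -> SRVRecord:
    """Parse SRV record data like '10 0 8080 api.example.com'."""
    priority, weight, port, target = rdata.split()
    return SRVRecord(int(priority), int(weight), int(port), target)

rec = parse_srv("10 0 8080 api.example.com")
# rec.port == 8080, rec.target == "api.example.com"
```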

TTL: The Trap That Gets Everyone

TTL (Time To Live) controls how long DNS records are cached. Setting TTL wrong causes two categories of pain:

TTL Too Short (< 60s):
  • Increased DNS query traffic (cost)
  • Higher resolution latency
  • Resolver rate limiting risk
  • Appropriate for: failover records

TTL Too Long (> 3600s):
  • Changes take hours to propagate
  • Disaster recovery is slow
  • Cannot quickly redirect traffic
  • Appropriate for: stable records that never change

TTL Strategy

Static infrastructure (mail servers, NS records): 86400 (24 hours)
Normal web services: 300-3600 (5 minutes to 1 hour)
Records you might need to change quickly: 60-300 (1-5 minutes)

Pro tip: Before a planned migration:
1. Lower TTL to 60 seconds 48 hours before the change
2. Make the DNS change
3. Wait for propagation (now only ~60 seconds)
4. Raise TTL back to normal after verifying

The most common DNS mistake in production: Changing a DNS record and expecting it to take effect immediately. If the TTL was 3600 (1 hour), resolvers around the world will serve the old record for up to an hour. Short of asking individual resolver operators to flush their caches, there is nothing you can do about this. Plan accordingly.
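The migration schedule and the worst-case staleness window both reduce to simple arithmetic. A sketch (all times are illustrative):

```python
from datetime import datetime, timedelta

def migration_plan(change_at: datetime, old_ttl_s: int, low_ttl_s: int = 60):
    """Return (when to lower TTL, staleness with plan, staleness without)."""
    # Lower the TTL 48h ahead so every copy cached under the OLD TTL
    # has expired before the change (assumes old_ttl_s is well under 48h).
    lower_at = change_at - timedelta(hours=48)
    # A resolver that caches just before the change serves the old record
    # for up to the TTL in effect at that moment.
    stale_with_plan = change_at + timedelta(seconds=low_ttl_s)
    stale_without_plan = change_at + timedelta(seconds=old_ttl_s)
    return lower_at, stale_with_plan, stale_without_plan

change = datetime(2024, 7, 15, 10, 0)
lower_at, with_plan, without_plan = migration_plan(change, old_ttl_s=3600)
# lower_at     = 2024-07-13 10:00  (48h before)
# with_plan    = 10:01 — stale answers gone a minute after the change
# without_plan = 11:00 — a full hour of stale answers
```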


DNS in Kubernetes

Kubernetes has its own internal DNS system (CoreDNS) that resolves service names to cluster IPs.

# Inside a Kubernetes cluster, service discovery is DNS:

# Same namespace:
curl http://api-service:8080/health
# Resolves to: api-service.default.svc.cluster.local

# Cross-namespace:
curl http://api-service.production.svc.cluster.local:8080/health

# External:
curl http://api.example.com
# CoreDNS forwards to upstream resolver
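The FQDN expansion shown in the comments follows a fixed pattern. A sketch of building it yourself (assumes the default cluster domain, cluster.local):

```python
def service_fqdn(service: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Build the fully qualified in-cluster DNS name for a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"

service_fqdn("api-service")                # api-service.default.svc.cluster.local
service_fqdn("api-service", "production")  # api-service.production.svc.cluster.local
```

Using the full FQDN in cross-namespace calls avoids any ambiguity from the pod's search list.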

Kubernetes DNS Debugging

# 1. Check if CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 2. Test DNS resolution from inside a pod
kubectl run dns-test --image=busybox:1.36 --rm -it -- \
  nslookup api-service.production.svc.cluster.local

# 3. Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# 4. Verify /etc/resolv.conf in a pod
kubectl exec deploy/api-server -- cat /etc/resolv.conf
# Should show: nameserver 10.96.0.10 (CoreDNS ClusterIP)
#              search production.svc.cluster.local svc.cluster.local cluster.local

# 5. Common issue: the default ndots:5 causes excessive DNS queries
# For external domains (fewer than 5 dots), pods query every search suffix
# before trying the name as-is
# Fix: set dnsConfig.options ndots to 2 in the pod spec
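The ndots behavior in step 5 can be simulated. With ndots:5, any name containing fewer than five dots is tried against every search suffix before being queried as-is — a sketch of the stub resolver's search logic (the search list here is the one from the resolv.conf above; real pods may have additional node-level suffixes):

```python
def query_order(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Names a stub resolver will try, in order, per resolv.conf rules."""
    if name.endswith("."):
        return [name]                    # already fully qualified: no search
    tries = []
    if name.count(".") >= ndots:
        tries.append(name + ".")         # enough dots: try as-is first
    tries += [f"{name}.{suffix}." for suffix in search]
    if name.count(".") < ndots:
        tries.append(name + ".")         # otherwise try as-is last
    return tries

search = ["production.svc.cluster.local", "svc.cluster.local", "cluster.local"]
query_order("api.example.com", search)
# "api.example.com" has only 2 dots (< 5), so three doomed cluster-suffix
# queries are sent before "api.example.com." is finally tried.
```

Lowering ndots to 2, or using trailing-dot FQDNs for external names, skips the wasted queries.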

Debugging DNS in Production

When something is broken and you suspect DNS, use these tools in order:

Quick Diagnosis Workflow

# Step 1: Can you resolve the domain at all?
dig api.example.com

# Step 2: What nameserver is answering?
dig api.example.com +trace

# Step 3: Is the answer cached somewhere stale?
# Query authoritative nameserver directly
dig @ns1.example.com api.example.com

# Step 4: Is it a specific record type issue?
dig api.example.com A
dig api.example.com AAAA
dig api.example.com CNAME

# Step 5: What does your application see?
# (Use the same resolver as the app)
nslookup api.example.com
host api.example.com

# Step 6: Check for DNS propagation across resolvers
dig @8.8.8.8 api.example.com    # Google
dig @1.1.1.1 api.example.com    # Cloudflare
dig @208.67.222.222 api.example.com  # OpenDNS

Reading dig Output

$ dig api.example.com

; <<>> DiG 9.18.1 <<>> api.example.com
;; QUESTION SECTION:
;api.example.com.       IN  A          ← What we asked for

;; ANSWER SECTION:
api.example.com.  300   IN  A  93.184.216.34   ← The answer
                  ^^^                           ← TTL (300 seconds remaining)

;; Query time: 23 msec                 ← Resolution time
;; SERVER: 192.168.1.1#53              ← Which resolver answered
;; WHEN: Mon Jul 15 10:30:00 UTC 2024
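When you need the answer programmatically (say, in a monitoring script), the ANSWER section fields above can be pulled out with a small parser. A sketch that handles the common case; dig's output format varies with options, so treat this as illustrative:

```python
def parse_dig_answer(output: str):
    """Extract (name, ttl, type, value) rows from dig's ANSWER section."""
    rows, in_answer = [], False
    for line in output.splitlines():
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if in_answer:
            if not line.strip() or line.startswith(";"):
                break                    # blank line or next section ends it
            # e.g. "api.example.com.  300  IN  A  93.184.216.34"
            name, ttl, _cls, rtype, value = line.split(maxsplit=4)
            rows.append((name, int(ttl), rtype, value))
    return rows

out = """;; ANSWER SECTION:
api.example.com.\t300\tIN\tA\t93.184.216.34
"""
parse_dig_answer(out)  # [('api.example.com.', 300, 'A', '93.184.216.34')]
```

For anything beyond a quick script, `dig +short` or a DNS library is less fragile than parsing full output.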

Split-Horizon DNS

Split-horizon (or split-brain) DNS returns different answers depending on where the query comes from. Common use case: internal services resolve to private IPs, external users resolve to public IPs.

External query for api.company.com:
  → 203.0.113.50 (public load balancer IP)

Internal query for api.company.com (from office/VPN):
  → 10.0.1.50 (private IP, bypasses public internet)
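Under the hood, split-horizon is just matching the client's source address against a view before answering. A minimal sketch using Python's ipaddress module (the networks and IPs are the illustrative ones above):

```python
import ipaddress

# View order matters: the first network that matches the source wins.
VIEWS = [
    (ipaddress.ip_network("10.0.0.0/8"),                      # internal view
     {"api.company.com": "10.0.1.50"}),
    (ipaddress.ip_network("0.0.0.0/0"),                       # external view
     {"api.company.com": "203.0.113.50"}),
]

def answer(name: str, client_ip: str) -> str:
    """Return the record for `name` as seen from `client_ip`."""
    src = ipaddress.ip_address(client_ip)
    for network, zone in VIEWS:
        if src in network:
            return zone[name]
    raise LookupError(name)

answer("api.company.com", "10.0.1.7")      # "10.0.1.50"  (private answer)
answer("api.company.com", "198.51.100.9")  # "203.0.113.50" (public answer)
```

This is also why split-horizon complicates debugging: dig from your laptop and dig from a production host can both be "right" while disagreeing.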
When to Use:
  • Internal services accessed by both internal and external users
  • Reducing hairpin routing through the public internet
  • Security: hiding internal topology from external resolvers

When to Avoid:
  • When it adds debugging complexity without clear benefit
  • When internal and external behavior should be identical
  • When you do not have separate DNS infrastructure to manage

Implementation Checklist

  • Audit all DNS records: remove stale entries, verify TTLs are appropriate
  • Set TTL to 300s (5 min) for records that might need emergency changes
  • Verify Kubernetes CoreDNS is healthy: pods running, no error logs
  • Fix ndots setting in Kubernetes pods to reduce unnecessary DNS queries
  • Document your DNS architecture: authoritative nameservers, resolvers, caching layers
  • Set up DNS monitoring: alert on resolution failures, not just application errors
  • Before any DNS migration, lower TTL 48 hours in advance
  • Add dig and nslookup to your incident response debugging checklist
  • Implement DNSSEC for domains that handle sensitive traffic
  • Test DNS failover: can you redirect traffic within 5 minutes?
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
