Network Monitoring and Troubleshooting

Network issues are among the most frustrating production problems because symptoms often appear far from the cause. A database timeout might trace back to congestion three hops away. An API returning 502s might come from a misconfigured load balancer health check. Effective network monitoring means instrumenting every layer and knowing which tool to reach for when things break.

Network Monitoring Stack

Layer	What to Monitor	Tools
Application	HTTP latency, error rates, throughput	Prometheus, Datadog, New Relic
Transport	TCP connections, retransmissions, RSTs	ss, netstat, tcpdump, Wireshark
Network	Packet loss, latency, routing	ping, traceroute, mtr, smokeping
DNS	Resolution time, failures, TTL	dig, nslookup, dnstop
Infrastructure	Interface errors, bandwidth, drops	SNMP, Telegraf, CloudWatch
Security	Blocked connections, DDoS indicators	VPC Flow Logs, WAF logs

Common Network Failure Patterns

Symptom	Likely Cause	Diagnostic Tool
Intermittent timeouts	Network congestion, packet loss	mtr, tcpdump
Connection refused	Service not listening, firewall block	telnet, nc, iptables -L
DNS resolution failure	DNS server down, wrong resolver, stale cached record (TTL not yet expired)	dig, nslookup, /etc/resolv.conf
High latency to specific host	Routing issue, congested link	traceroute, mtr
SSL/TLS errors	Expired certificate, hostname mismatch, unsupported protocol version	openssl s_client, curl -v
Connection reset (RST)	Firewall, load balancer timeout, server crash	tcpdump, firewall logs
Slow first request, fast subsequent	DNS resolution, TCP cold start, TLS handshake	curl -w timing, dig

Diagnostic Tools

Tool	Purpose	When to Use
ping	Basic connectivity + latency	First check: is the host reachable?
traceroute/mtr	Path analysis, hop-by-hop latency	Where is latency introduced?
dig/nslookup	DNS resolution	DNS issues
curl -v	HTTP request debugging	API connectivity, SSL/TLS issues
telnet/nc	TCP port connectivity	Is the service listening on expected port?
ss/netstat	Local socket state	Connection counts, states, listening ports
tcpdump	Packet capture	Deep analysis of network conversations
Wireshark	Packet analysis (GUI)	Detailed protocol analysis
iperf3	Bandwidth testing	Network throughput between two points
openssl s_client	TLS debugging	Certificate chain, protocol, cipher issues
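
As a rough illustration of the "connection counts, states" use of ss, the sketch below tallies TCP sockets by state. The helper name `count_tcp_states` is hypothetical; it assumes the usual `ss -tan` layout, where the first line is a header and the first column is the socket state (ESTAB, TIME-WAIT, CLOSE-WAIT, and so on).

```shell
#!/bin/sh
# count_tcp_states (hypothetical helper): read `ss -tan` output on stdin
# and print one "STATE COUNT" line per TCP state, most frequent first.
# Assumes line 1 is the header and column 1 is the socket state.
count_tcp_states() {
    awk 'NR > 1 && NF { count[$1]++ }
         END { for (s in count) print s, count[s] }' |
        LC_ALL=C sort -k2,2nr -k1,1
}

# Live usage (requires iproute2):
#   ss -tan | count_tcp_states
```

A large and growing TIME-WAIT or CLOSE-WAIT count is usually the thing to look for here; the function reads from stdin so it can also be run against output saved during an incident.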

Latency Diagnosis Framework

Measure total request latency:
    curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nFirst byte: %{time_starttransfer}s\nTotal: %{time_total}s\n" -o /dev/null -s https://api.example.com

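Interpretation (curl reports each value cumulatively from the start of the request, so subtract the previous value to get the time spent in each phase):
    DNS:         50ms → Normal (< 100ms)
    Connect:    150ms → 100ms TCP handshake; check geographic distance, network path
    TLS:        300ms → 150ms TLS handshake; check TLS version, certificate chain length
    First byte: 800ms → 500ms waiting on the server (application issue)
    Total:      850ms → 50ms download time (minimal)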
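
Because curl's `time_*` values are all measured from the start of the request, per-phase durations come from subtracting neighbouring values. The following is a minimal sketch of that arithmetic; `phase_breakdown` is a hypothetical helper name, and the `wait` figure is only an approximation of server processing time (it also includes sending the request and one network round trip).

```shell
#!/bin/sh
# phase_breakdown (hypothetical helper): convert curl's cumulative timings
# (seconds since the request started) into per-phase durations in ms.
# Args: time_namelookup time_connect time_appconnect time_starttransfer time_total
phase_breakdown() {
    awk -v dns="$1" -v conn="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
        # For plain HTTP curl reports time_appconnect as 0: treat TLS as absent.
        if (tls + 0 == 0) tls = conn
        printf "dns=%.0fms tcp=%.0fms tls=%.0fms wait=%.0fms download=%.0fms\n",
            dns * 1000, (conn - dns) * 1000, (tls - conn) * 1000,
            (ttfb - tls) * 1000, (total - ttfb) * 1000
    }'
}

# Example with the sample figures from the interpretation above:
phase_breakdown 0.050 0.150 0.300 0.800 0.850
# prints: dns=50ms tcp=100ms tls=150ms wait=500ms download=50ms

# Live usage (network required):
#   phase_breakdown $(curl -o /dev/null -s -w \
#     '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' \
#     https://api.example.com)
```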

DNS Troubleshooting

Issue	Diagnosis	Fix
Slow resolution	`dig +stats example.com`	Use faster DNS (8.8.8.8, 1.1.1.1) or local cache
Wrong IP returned	`dig example.com @8.8.8.8`	Check DNS records, TTL, propagation
NXDOMAIN	Record doesn’t exist	Verify record in DNS provider
SERVFAIL	DNS server error	Try different resolver, check zone config
High TTL after change	Old record cached	Lower TTL before change, wait for propagation
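
The status codes in the table (NXDOMAIN, SERVFAIL) and the resolution time both appear in dig's standard output, so a small filter can pull them out for logging or comparison across resolvers. This is a sketch only: `dig_summary` is a hypothetical helper, and it assumes dig's default verbose output with the `;; ->>HEADER<<-` and `;; Query time:` lines present (they are suppressed by `+short`).

```shell
#!/bin/sh
# dig_summary (hypothetical helper): read dig output on stdin and print
# the response status and the query time reported by dig.
dig_summary() {
    awk '
        /->>HEADER<<-/ {
            # Header looks like: ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4242"
            for (i = 1; i < NF; i++)
                if ($i == "status:") { status = $(i + 1); sub(/,$/, "", status) }
        }
        /^;; Query time:/ { ms = $4 }
        END {
            printf "status=%s query_time_ms=%s\n",
                (status != "" ? status : "unknown"), (ms != "" ? ms : "unknown")
        }'
}

# Live usage (network required), comparing two resolvers:
#   dig example.com @8.8.8.8 | dig_summary
#   dig example.com @1.1.1.1 | dig_summary
```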

Cloud Network Monitoring

Cloud	Flow Logs	Network Monitor	DNS Logging
AWS	VPC Flow Logs	CloudWatch, Reachability Analyzer	Route 53 Query Logging
Azure	NSG Flow Logs	Network Watcher, Connection Monitor	Private DNS Query Logging
GCP	VPC Flow Logs	Network Intelligence Center	Cloud DNS Logging
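
Flow logs are most useful once they are summarized. The sketch below counts rejected connections per source address and destination port. It assumes AWS VPC Flow Logs in the default (version 2) space-separated format, where field 4 is the source address, field 7 the destination port, and field 13 the action; Azure and GCP flow logs are JSON and would need a different parser. `top_rejected` is a hypothetical helper name.

```shell
#!/bin/sh
# top_rejected (hypothetical helper): read AWS VPC Flow Log records in the
# default format on stdin and print "COUNT SRCADDR -> port DSTPORT" for
# every REJECT record, highest count first.
# Default fields: version account-id interface-id srcaddr dstaddr srcport
#                 dstport protocol packets bytes start end action log-status
top_rejected() {
    awk '$13 == "REJECT" { hits[$4 " -> port " $7]++ }
         END { for (k in hits) print hits[k], k }' |
        LC_ALL=C sort -k1,1nr -k2
}

# Usage against a downloaded log file (the path is an assumption):
#   top_rejected < flow-logs.txt | head -20
```

A single source hitting many ports suggests scanning, while many rejects to one expected port more often points at a missing security group rule.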

Anti-Patterns

Anti-Pattern	Problem	Fix
No network metrics	Blind to network-related issues	Monitor at every layer (application to infra)
Overly permissive security groups	Hard to diagnose what’s allowed vs. blocked	Least-privilege rules with logging
DNS TTL too high	Changes take hours to propagate	TTL of 300s for dynamic records, lower before migrations
No baseline metrics	Can’t tell if current latency is normal	Establish baselines, alert on deviation
Diagnosing at wrong layer	Chasing application bugs when the cause is the network	Rule out network basics (ping, traceroute) first
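
To make "establish baselines, alert on deviation" concrete, here is one simple way to do it: compare a current latency reading against the mean plus k standard deviations of recent samples. This is a sketch under assumptions, not a recommended alerting policy; `check_against_baseline` is a hypothetical helper, k defaults to 3, and real latency data is often skewed enough that percentile-based thresholds work better.

```shell
#!/bin/sh
# check_against_baseline (hypothetical helper): read baseline latency samples
# (one number per line, in ms) on stdin and compare a current reading to
# mean + k * sample standard deviation.
# Args: current_value [k]   (k defaults to 3)
check_against_baseline() {
    awk -v current="$1" -v k="${2:-3}" '
        NF { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n < 2) { print "UNKNOWN not enough samples"; exit }
            mean = sum / n
            var = (sumsq - n * mean * mean) / (n - 1)
            if (var < 0) var = 0          # guard against rounding error
            limit = mean + k * sqrt(var)
            printf "%s current=%.1f limit=%.1f\n",
                (current + 0 > limit ? "ALERT" : "OK"), current, limit
        }'
}

# Example: five baseline samples, then a reading of 150 ms.
printf '%s\n' 100 110 90 100 100 | check_against_baseline 150
# prints: ALERT current=150.0 limit=121.2
```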

Checklist

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For network engineering consulting, visit garnetgrid.com.
:::