Network issues are the most frustrating category of production problems because symptoms appear far from the cause. A database timing out might be a network congestion issue three hops away. An API returning 502s might be a load balancer health check misconfiguration. Effective network monitoring means instrumenting every layer and knowing which tool to reach for when things break.
Network Monitoring Stack
| Layer | What to Monitor | Tools |
|---|---|---|
| Application | HTTP latency, error rates, throughput | Prometheus, Datadog, New Relic |
| Transport | TCP connections, retransmissions, RSTs | ss, netstat, tcpdump, Wireshark |
| Network | Packet loss, latency, routing | ping, traceroute, mtr, smokeping |
| DNS | Resolution time, failures, TTL | dig, nslookup, dnstop |
| Infrastructure | Interface errors, bandwidth, drops | SNMP, Telegraf, CloudWatch |
| Security | Blocked connections, DDoS indicators | VPC Flow Logs, WAF logs |
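The Transport row lists retransmissions as a signal worth tracking. As one illustration, the sketch below derives a retransmission percentage from two snapshots of the TCP counters that Linux exposes in `/proc/net/snmp`. The function names are hypothetical, and the approach assumes a Linux host; other platforms expose these counters differently.

```python
from pathlib import Path


def parse_tcp_counters(snmp_text: str) -> dict[str, int]:
    """Pair the 'Tcp:' header line with the 'Tcp:' value line that follows it."""
    tcp_lines = [line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp: header/value pair found")
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def read_tcp_counters(path: str = "/proc/net/snmp") -> dict[str, int]:
    """Take one snapshot of the kernel's cumulative TCP counters (Linux only)."""
    return parse_tcp_counters(Path(path).read_text())


def retransmit_percent(before: dict[str, int], after: dict[str, int]) -> float:
    """Percentage of segments sent between two snapshots that were retransmissions."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return 100.0 * retransmitted / sent if sent > 0 else 0.0
```

Calling `read_tcp_counters()` twice, some seconds apart, and passing both snapshots to `retransmit_percent` gives a rate for that interval. The counters are cumulative since boot, so a single snapshot says little on its own.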
Common Network Failure Patterns
| Symptom | Likely Cause | Diagnostic Tool |
|---|---|---|
| Intermittent timeouts | Network congestion, packet loss | mtr, tcpdump |
| Connection refused | Service not listening, firewall block | telnet, nc, iptables -L |
| DNS resolution failure | DNS server down, wrong resolver, stale cached record | dig, nslookup, check /etc/resolv.conf |
| High latency to specific host | Routing issue, congested link | traceroute, mtr |
| SSL/TLS errors | Expired certificate, hostname mismatch, unsupported protocol version | openssl s_client, curl -v |
| Connection reset (RST) | Firewall, load balancer timeout, server crash | tcpdump, firewall logs |
| Slow first request, fast subsequent | Uncached DNS lookup, new TCP connection setup, TLS handshake | curl -w timing, dig |
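Several of the symptoms in this table surface in application code as distinct exception types. The sketch below maps Python's standard socket and TLS exceptions onto the table's symptom names so that error logs can be grouped by likely cause. The function name and the returned labels are illustrative assumptions, not part of any library.

```python
import socket
import ssl


def classify_failure(exc: BaseException) -> str:
    """Map a connection-time exception to a symptom name from the table above."""
    if isinstance(exc, socket.gaierror):
        return "DNS resolution failure"
    if isinstance(exc, ssl.SSLError):
        # Covers certificate verification errors as well as protocol mismatches.
        return "SSL/TLS error"
    if isinstance(exc, ConnectionRefusedError):
        return "Connection refused"
    if isinstance(exc, ConnectionResetError):
        return "Connection reset (RST)"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "Timeout"
    return "Unclassified"
```

Wrapping an outbound call in `try`/`except OSError as exc` and logging `classify_failure(exc)` gives a rough first cut; it narrows the search but does not replace the diagnostic tools listed in the table.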
Diagnostic Tools
| Tool | Purpose | When to Use |
|---|---|---|
| ping | Basic connectivity + latency | First check: is the host reachable? |
| traceroute/mtr | Path analysis, hop-by-hop latency | Where is latency introduced? |
| dig/nslookup | DNS resolution | DNS issues |
| curl -v | HTTP request debugging | API connectivity, SSL/TLS issues |
| telnet/nc | TCP port connectivity | Is the service listening on expected port? |
| ss/netstat | Local socket state | Connection counts, states, listening ports |
| tcpdump | Packet capture | Deep analysis of network conversations |
| Wireshark | Packet analysis (GUI) | Detailed protocol analysis |
| iperf3 | Bandwidth testing | Network throughput between two points |
| openssl s_client | TLS debugging | Certificate chain, protocol, cipher issues |
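When `telnet` or `nc` is not installed on a host, the same "is the port listening?" check can be approximated from a script. The following is a minimal sketch using only the standard library; `probe_tcp` and its return labels are hypothetical names chosen for this example.

```python
import socket
import time


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> tuple[str, float]:
    """Attempt a TCP connection and return (state, elapsed_ms).

    state is 'open', 'refused', 'timeout', or 'error: <details>'.
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            state = "open"
    except ConnectionRefusedError:
        state = "refused"   # host reachable, nothing listening (or an active reject)
    except socket.timeout:
        state = "timeout"   # no answer at all: often a firewall silently dropping packets
    except OSError as exc:
        state = f"error: {exc}"
    return state, (time.perf_counter() - start) * 1000.0
```

A quick `refused` usually means the host answered but no service is bound to the port, while a `timeout` more often points to a firewall or routing problem, which matches the split between the "Connection refused" and "Intermittent timeouts" rows earlier.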
Latency Diagnosis Framework
Measure total request latency:
curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nFirst byte: %{time_starttransfer}s\nTotal: %{time_total}s\n" -o /dev/null -s https://api.example.com
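Interpretation (each curl timer is cumulative from the start of the request, so subtract the previous value to get the time spent in a phase):
DNS: 50ms → Normal (< 100ms)
Connect: 150ms → 100ms for the TCP handshake; check geographic distance, network path
TLS: 300ms → 150ms for the TLS handshake; check TLS version, certificate chain length
First byte: 800ms → 500ms waiting on the server; processing time (application issue)
Total: 850ms → 50ms of download; transfer time is minimal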
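Because curl reports each of these timers as elapsed time since the request began, the per-phase cost is the difference between consecutive values. The sketch below does that subtraction; the function name and the phase labels are assumptions made for illustration, and the input keys are the curl `--write-out` variable names used in the command above.

```python
def phase_durations_ms(timings: dict[str, float]) -> dict[str, int]:
    """Convert cumulative curl timers (seconds) into per-phase durations (ms)."""
    dns = timings["time_namelookup"]
    connect = timings["time_connect"]
    tls_done = timings["time_appconnect"]  # curl reports 0 when no TLS handshake happened
    first_byte = timings["time_starttransfer"]
    total = timings["time_total"]

    # For plain HTTP, measure the server wait from the end of the TCP handshake.
    handshake_end = tls_done if tls_done > 0 else connect
    phases = {
        "dns": dns,
        "tcp_connect": connect - dns,
        "tls_handshake": handshake_end - connect,
        "server_wait": first_byte - handshake_end,
        "download": total - first_byte,
    }
    return {name: round(seconds * 1000) for name, seconds in phases.items()}


# The example figures from the interpretation above.
example = {
    "time_namelookup": 0.05,
    "time_connect": 0.15,
    "time_appconnect": 0.30,
    "time_starttransfer": 0.80,
    "time_total": 0.85,
}
print(phase_durations_ms(example))
```

With the example figures, the largest single phase is the wait for the first byte, which points at server-side processing and away from the network path.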
DNS Troubleshooting
| Issue | Diagnosis | Fix |
|---|---|---|
| Slow resolution | dig +stats example.com | Use faster DNS (8.8.8.8, 1.1.1.1) or local cache |
| Wrong IP returned | dig example.com @8.8.8.8 | Check DNS records, TTL, propagation |
| NXDOMAIN | Record doesn’t exist | Verify record in DNS provider |
| SERVFAIL | DNS server error | Try different resolver, check zone config |
| High TTL after change | Old record cached | Lower TTL before change, wait for propagation |
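`dig` queries a DNS server directly, which can differ from what an application experiences through the operating system's resolver (hosts file, local cache, search domains). As a complement, the sketch below times a lookup through the system resolver. `time_resolution` is a hypothetical helper name, and an empty address list here simply means the lookup failed.

```python
import socket
import time


def time_resolution(hostname: str) -> tuple[float, list[str]]:
    """Resolve hostname via the system resolver; return (elapsed_ms, sorted addresses)."""
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
    except socket.gaierror:
        infos = []  # resolution failed (for example NXDOMAIN or an unreachable resolver)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return elapsed_ms, addresses
```

Running it twice in a row for the same name is a simple way to see whether a local cache is in play: a much faster second lookup suggests the first one went out to the network.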
Cloud Network Monitoring
| Cloud | Flow Logs | Network Monitor | DNS Logging |
|---|---|---|---|
| AWS | VPC Flow Logs | CloudWatch, Reachability Analyzer | Route 53 Query Logging |
| Azure | NSG Flow Logs | Network Watcher, Connection Monitor | Private DNS Query Logging |
| GCP | VPC Flow Logs | Network Intelligence Center | Cloud DNS Logging |
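Flow logs are most useful once they are summarized. As a sketch, the code below parses AWS VPC Flow Logs records in the default (version 2) space-separated format and counts rejected connections per destination port. The helper names are hypothetical; custom log formats, and the JSON-based Azure and GCP flow logs, would need a different parser.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# Field order of the default (version 2) VPC Flow Logs record format.
DEFAULT_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


class Flow(NamedTuple):
    srcaddr: str
    dstaddr: str
    dstport: int
    protocol: int  # IANA protocol number, e.g. 6 = TCP, 17 = UDP
    action: str    # "ACCEPT" or "REJECT"


def parse_flow(line: str) -> Optional[Flow]:
    """Parse one default-format record; return None for malformed or no-data lines."""
    parts = line.split()
    if len(parts) != len(DEFAULT_FIELDS):
        return None
    row = dict(zip(DEFAULT_FIELDS, parts))
    if row["log_status"] != "OK":
        return None  # NODATA and SKIPDATA records carry "-" in place of traffic fields
    return Flow(row["srcaddr"], row["dstaddr"], int(row["dstport"]),
                int(row["protocol"]), row["action"])


def rejected_ports(lines: Iterable[str]) -> Counter:
    """Count REJECT records per destination port."""
    flows = (parse_flow(line) for line in lines)
    return Counter(f.dstport for f in flows if f is not None and f.action == "REJECT")
```

A sudden spike on one port in `rejected_ports(...).most_common()` is a reasonable starting point when a "connection refused" or timeout report might really be a security group or network ACL decision.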
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| No network metrics | Blind to network-related issues | Monitor at every layer (application to infra) |
| Overly permissive security groups | Hard to diagnose what’s allowed vs. blocked | Least-privilege rules with logging |
| DNS TTL too high | Changes take hours to propagate | TTL of 300s for dynamic records, lower before migrations |
| No baseline metrics | Can’t tell if current latency is normal | Establish baselines, alert on deviation |
| Diagnosing at wrong layer | Chasing application bugs when the fault is in the network | Rule out network basics (ping, traceroute) first |
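The "No baseline metrics" row recommends alerting on deviation from a baseline. One simple way to express that, sketched below, is to flag a latency sample that exceeds the baseline mean by more than a chosen number of standard deviations. The function name and the default of three standard deviations are assumptions for illustration; latency distributions are often skewed, so a percentile-based threshold may suit real data better.

```python
import statistics
from typing import Sequence


def exceeds_baseline(baseline_ms: Sequence[float], current_ms: float, sigmas: float = 3.0) -> bool:
    """Return True when current_ms is more than `sigmas` standard deviations above the baseline mean."""
    if len(baseline_ms) < 2:
        raise ValueError("need at least two baseline samples to estimate spread")
    mean = statistics.fmean(baseline_ms)
    spread = statistics.stdev(baseline_ms)  # sample standard deviation
    return current_ms > mean + sigmas * spread


# Hypothetical baseline of round-trip latencies collected during normal operation.
baseline = [20, 22, 19, 21, 20, 23, 18, 21]
print(exceeds_baseline(baseline, 24))
print(exceeds_baseline(baseline, 40))
```

The baseline should come from a period known to be healthy and be refreshed periodically, otherwise gradual drift will either hide real regressions or trigger constant alerts.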
Checklist
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For network engineering consulting, visit garnetgrid.com.
:::
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting