
Network Monitoring and Troubleshooting

Monitor and troubleshoot network issues in production environments. Covers monitoring tools, common failure patterns, packet analysis, latency diagnosis, and DNS troubleshooting.

Network issues are among the hardest production problems to diagnose because symptoms often appear far from the cause. A database timeout may stem from network congestion three hops away. An API returning 502s may stem from a misconfigured load balancer health check. Effective network monitoring means instrumenting every layer and knowing which tool to reach for when things break.


Network Monitoring Stack

| Layer | What to Monitor | Tools |
| --- | --- | --- |
| Application | HTTP latency, error rates, throughput | Prometheus, Datadog, New Relic |
| Transport | TCP connections, retransmissions, RSTs | ss, netstat, tcpdump, Wireshark |
| Network | Packet loss, latency, routing | ping, traceroute, mtr, smokeping |
| DNS | Resolution time, failures, TTL | dig, nslookup, dnstop |
| Infrastructure | Interface errors, bandwidth, drops | SNMP, Telegraf, CloudWatch |
| Security | Blocked connections, DDoS indicators | VPC Flow Logs, WAF logs |

Common Network Failure Patterns

| Symptom | Likely Cause | Diagnostic Tool |
| --- | --- | --- |
| Intermittent timeouts | Network congestion, packet loss | mtr, tcpdump |
| Connection refused | Service not listening, firewall block | telnet, nc, iptables -L |
| DNS resolution failure | DNS server down, wrong resolver, TTL | dig, nslookup, /etc/resolv.conf |
| High latency to specific host | Routing issue, congested link | traceroute, mtr |
| SSL/TLS errors | Certificate expired, mismatch, protocol | openssl s_client, curl -v |
| Connection reset (RST) | Firewall, load balancer timeout, server crash | tcpdump, firewall logs |
| Slow first request, fast subsequent | DNS resolution, TCP cold start, TLS handshake | curl -w timing, dig |
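
Several rows in the table above depend on telling a refused connection apart from a silent timeout. The sketch below shows one way to make that distinction with Python's standard `socket` module. The function name `probe_tcp` and the 3-second default timeout are illustrative assumptions, not part of the original guide.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error: ...'.

    Hypothetical helper for illustration; it performs a single connect attempt.
    """
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with a RST: reachable, but nothing accepted the connection.
        return "refused"
    except socket.timeout:
        # No answer within the timeout: packets are probably being dropped somewhere.
        return "timeout"
    except OSError as exc:
        # Everything else, e.g. DNS failure or "network unreachable".
        return f"error: {exc}"


# Example call (hostname and port are placeholders):
#   print(probe_tcp("api.example.com", 443))
```

A `refused` result usually means the host is reachable but nothing is listening on that port (or a firewall is actively rejecting the connection), while `timeout` more often points to packets being silently dropped along the path.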

Diagnostic Tools

| Tool | Purpose | When to Use |
| --- | --- | --- |
| ping | Basic connectivity + latency | First check: is the host reachable? |
| traceroute/mtr | Path analysis, hop-by-hop latency | Where is latency introduced? |
| dig/nslookup | DNS resolution | DNS issues |
| curl -v | HTTP request debugging | API connectivity, SSL/TLS issues |
| telnet/nc | TCP port connectivity | Is the service listening on expected port? |
| ss/netstat | Local socket state | Connection counts, states, listening ports |
| tcpdump | Packet capture | Deep analysis of network conversations |
| Wireshark | Packet analysis (GUI) | Detailed protocol analysis |
| iperf3 | Bandwidth testing | Network throughput between two points |
| openssl s_client | TLS debugging | Certificate chain, protocol, cipher issues |

Latency Diagnosis Framework

Measure total request latency:
    curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nFirst byte: %{time_starttransfer}s\nTotal: %{time_total}s\n" -o /dev/null -s https://api.example.com

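Interpretation (each curl timer is cumulative from the start of the request, so subtract the previous timer to get the time spent in a phase):
    DNS:         50ms → 50ms lookup: normal (< 100ms)
    Connect:    150ms → 100ms TCP handshake: check geographic distance, network path
    TLS:        300ms → 150ms TLS handshake: check TLS version, certificate chain length
    First byte: 800ms → 500ms waiting on the server: processing time (application issue)
    Total:      850ms → 50ms download: minimal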
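
Because curl's `-w` timers all count from the start of the request, the time spent in each phase is the difference between consecutive timers. The sketch below works through that arithmetic for the sample values above. The class and function names (`CurlTimings`, `phase_breakdown_ms`) are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CurlTimings:
    """Cumulative timers reported by curl -w, in seconds since the request started."""
    namelookup: float      # %{time_namelookup}
    connect: float         # %{time_connect}
    appconnect: float      # %{time_appconnect} (0 when there is no TLS handshake)
    starttransfer: float   # %{time_starttransfer}
    total: float           # %{time_total}


def phase_breakdown_ms(t: CurlTimings) -> dict:
    """Convert cumulative curl timers into per-phase durations in milliseconds."""
    # Plain HTTP has no TLS handshake, so fall back to the TCP connect time.
    tls_done = t.appconnect if t.appconnect > 0 else t.connect
    phases = {
        "dns": t.namelookup,
        "tcp_connect": t.connect - t.namelookup,
        "tls_handshake": tls_done - t.connect,
        # Sending the request plus waiting for the first response byte.
        "server_wait": t.starttransfer - tls_done,
        "download": t.total - t.starttransfer,
    }
    return {name: round(seconds * 1000, 1) for name, seconds in phases.items()}


if __name__ == "__main__":
    # The sample values from the interpretation above, expressed in seconds.
    sample = CurlTimings(0.050, 0.150, 0.300, 0.800, 0.850)
    for name, ms in phase_breakdown_ms(sample).items():
        print(f"{name:14s}{ms:8.1f} ms")
```

For the sample request this attributes 50 ms to DNS, 100 ms to the TCP handshake, 150 ms to TLS, 500 ms to waiting on the server, and 50 ms to the download, which points at the application rather than the network as the largest contributor.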

DNS Troubleshooting

| Issue | Diagnosis | Fix |
| --- | --- | --- |
| Slow resolution | dig +stats example.com | Use faster DNS (8.8.8.8, 1.1.1.1) or local cache |
| Wrong IP returned | dig example.com @8.8.8.8 | Check DNS records, TTL, propagation |
| NXDOMAIN | Record doesn’t exist | Verify record in DNS provider |
| SERVFAIL | DNS server error | Try different resolver, check zone config |
| High TTL after change | Old record cached | Lower TTL before change, wait for propagation |
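
To see what an application on the host actually experiences, resolution time can also be measured through the system resolver. The following is a minimal sketch using the standard library's `socket.getaddrinfo`; the function name `time_resolution` and the default of three attempts are assumptions made for illustration.

```python
import socket
import time


def time_resolution(hostname: str, attempts: int = 3) -> list:
    """Resolve hostname repeatedly via the system resolver.

    Returns one (elapsed_ms, outcome) tuple per attempt, where outcome is a
    sorted list of addresses or a 'failed: ...' string.
    """
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
            # Each entry's sockaddr tuple starts with the address string.
            outcome = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            outcome = f"failed: {exc}"
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append((round(elapsed_ms, 2), outcome))
    return results


# Example call (hostname is a placeholder):
#   for ms, outcome in time_resolution("example.com"):
#       print(ms, outcome)
```

This path goes through `/etc/hosts` and any local cache, so a slow first attempt followed by fast repeats suggests caching is working. It does not replace `dig example.com @8.8.8.8`, which queries a specific resolver directly.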

Cloud Network Monitoring

| Cloud | Flow Logs | Network Monitor | DNS Logging |
| --- | --- | --- | --- |
| AWS | VPC Flow Logs | CloudWatch, Reachability Analyzer | Route 53 Query Logging |
| Azure | NSG Flow Logs | Network Watcher, Connection Monitor | Private DNS Query Logging |
| GCP | VPC Flow Logs | Network Intelligence Center | Cloud DNS Logging |

Anti-Patterns

| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| No network metrics | Blind to network-related issues | Monitor at every layer (application to infra) |
| Overly permissive security groups | Hard to diagnose what’s allowed vs. blocked | Least-privilege rules with logging |
| DNS TTL too high | Changes take hours to propagate | TTL of 300s for dynamic records, lower before migrations |
| No baseline metrics | Can’t tell if current latency is normal | Establish baselines, alert on deviation |
| Diagnosing at wrong layer | Chasing application bugs when it’s network | Start with network basics (ping, traceroute) first |
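
The "establish baselines, alert on deviation" fix can be sketched as a simple statistical check. The example below flags a latency reading that sits well above the baseline mean. The three-standard-deviation threshold and the function name `is_anomalous` are assumptions for illustration; a real deployment would tune the threshold to its own traffic.

```python
import statistics


def is_anomalous(baseline_ms, current_ms, sigmas=3.0):
    """Return True when current_ms exceeds the baseline mean by more than
    `sigmas` sample standard deviations."""
    if len(baseline_ms) < 2:
        raise ValueError("need at least two baseline samples")
    mean = statistics.mean(baseline_ms)
    stdev = statistics.stdev(baseline_ms)
    return current_ms > mean + sigmas * stdev


if __name__ == "__main__":
    # Illustrative latency samples in milliseconds, not measured data.
    baseline = [20, 22, 19, 21, 20, 23, 18, 21]
    print(is_anomalous(baseline, 24))
    print(is_anomalous(baseline, 45))
```

With these illustrative samples the baseline mean is 20.5 ms, so a 24 ms reading stays within the threshold while a 45 ms reading is flagged.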

Checklist

  • Network monitoring in place for latency, packet loss, throughput
  • DNS monitoring: resolution time, failure rate, TTL tracking
  • Cloud flow logs enabled for network traffic analysis
  • Baseline metrics established for normal network performance
  • Alerting on latency spikes, packet loss > 0.1%, connection failures
  • Diagnostic runbook: standard troubleshooting steps for common issues
  • SSL/TLS certificate monitoring with expiry alerts (30-day warning)
  • Firewall/security group rules documented and reviewed quarterly
  • Bandwidth capacity planned for peak traffic
  • Network troubleshooting tools available on all servers (curl, dig, ss, tcpdump)
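
The certificate expiry item in the checklist can be sketched with Python's standard `ssl` module. The helper names below (`fetch_not_after`, `days_until_expiry`, `needs_renewal`) are hypothetical. The 30-day default mirrors the checklist's warning window.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field from a server's certificate (needs network access).

    An already expired certificate fails verification, so this raises
    ssl.SSLCertVerificationError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a notAfter time such as 'Jan  5 09:34:43 2018 GMT'."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def needs_renewal(not_after: str, warn_days: int = 30,
                  now: Optional[float] = None) -> bool:
    """True when the certificate expires within warn_days."""
    return days_until_expiry(not_after, now) <= warn_days


# Example call (hostname is a placeholder):
#   print(needs_renewal(fetch_not_after("api.example.com")))
```

Keeping the date arithmetic separate from the network call lets the threshold logic be exercised with fixed timestamps, independent of any live endpoint.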

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For network engineering consulting, visit garnetgrid.com. :::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
