Network Observability: Seeing What Flows Through Your Infrastructure
Implement network observability to detect anomalies, debug connectivity issues, and understand traffic patterns. Covers flow logs, packet capture, eBPF-based monitoring, DNS analytics, and the dashboards that make network behavior visible to application teams.
Network observability makes the invisible visible. When an API call takes 3 seconds instead of 300 milliseconds, is it the application, the database, the network, or DNS? Without network observability, the answer is “we don’t know” — and the debugging session becomes a blame-shifting exercise between infrastructure and application teams.
Network observability captures what is flowing through your infrastructure — source, destination, ports, protocols, latency, packet loss — and presents it in a way that both network engineers and application developers can use.
The Three Pillars of Network Observability
Flow Data
Flow records capture metadata about network connections without inspecting packet contents:
Source: 10.0.1.15:45283
Destination: 10.0.2.8:5432
Protocol: TCP
Bytes: 2,847,593
Packets: 1,943
Duration: 4.2s
Action: ACCEPT
Sources:
- AWS VPC Flow Logs: Capture accepted/rejected traffic at ENI level
- Azure NSG Flow Logs: Network security group level capture
- GCP VPC Flow Logs: Subnet-level sampling
- NetFlow/IPFIX: On-premises router/switch data
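For the first source, flow logs can be enabled per VPC, subnet, or ENI. A minimal sketch with the AWS CLI; the VPC ID, log group name, and IAM role ARN are placeholders you would substitute for your environment:
# Enable VPC Flow Logs for an entire VPC, delivered to CloudWatch Logs
# (vpc-0abc123, the log group, and the role ARN are placeholder values)
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0abc123 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /vpc/flow-logs/production \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role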
DNS Analytics
DNS is the first step of nearly every network connection. DNS failures and slow lookups are among the most common — and most underdiagnosed — network issues:
Query: api.stripe.com
Type: A
Response: 54.187.174.169
Latency: 15ms
Source: pod/order-service-7d5f4
Query: internal-db.cluster.local
Type: A
Response: NXDOMAIN ← problem!
Latency: 200ms ← also a problem
Source: pod/payment-service-a3b2
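To reproduce what a workload actually sees, run a resolver query from inside the pod's network namespace. A sketch using the nicolaka/netshoot debug image; the pod name and query target are illustrative:
# Spot-check resolution latency and errors from inside a pod's network namespace
kubectl debug -it pod/payment-service-a3b2 --image=nicolaka/netshoot -- \
  dig internal-db.cluster.local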
Packet Capture
When flow data is not detailed enough, packet capture shows exactly what happened at the wire level:
# Capture packets on a specific interface
tcpdump -i eth0 -w capture.pcap host 10.0.2.8 and port 5432
# Capture inside a Kubernetes pod
kubectl debug -it pod/order-service --image=nicolaka/netshoot -- tcpdump -i eth0 -c 100
Packet capture is a debugging tool, not a monitoring tool. Use it for specific investigations, not continuous monitoring (the volume is prohibitive).
eBPF-Based Network Monitoring
eBPF (extended Berkeley Packet Filter) enables high-performance network monitoring directly in the Linux kernel without modifying applications or adding sidecars:
Cilium Hubble
# Enable Hubble observability in Cilium
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
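If Cilium was installed with its Helm chart, the same values can usually be applied in place. A sketch; the release name, chart repo alias, and namespace are assumptions about your install:
# Apply the Hubble values to an existing Cilium release
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true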
Hubble provides:
- Real-time service map (which service talks to which)
- Per-request latency histograms
- DNS query logging
- Network policy verdict logging (allowed vs denied)
- HTTP/gRPC protocol-aware visibility
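The same data is queryable from the Hubble CLI once the relay is reachable (for example via cilium hubble port-forward). A few illustrative queries; the namespace name is made up:
# Stream live flows for one namespace
hubble observe --namespace checkout --follow
# Show only DNS traffic (queries and responses)
hubble observe --protocol dns
# Show flows dropped by network policy
hubble observe --verdict DROPPED --last 100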
Why eBPF Matters
Traditional network monitoring requires either mirroring traffic (expensive, bandwidth-consuming) or inserting proxies (latency overhead, operational complexity). eBPF programs run inside the kernel itself, observing every packet and socket event where it happens: no traffic copies, no extra hops, and minimal overhead.
Building a Network Dashboard
Service Communication Map
Visualize which services communicate and how much traffic flows between them:
api-gateway ──[2,500 RPS]──▶ order-service
api-gateway ──[1,800 RPS]──▶ user-service
order-service ──[2,500 RPS]──▶ postgres
order-service ──[900 RPS]───▶ payment-service
payment-service ──[900 RPS]──▶ stripe.com (external)
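If Hubble's Prometheus metrics are enabled, a map like this can be approximated from the flow counters. The query below assumes the hubble_flows_processed_total metric, with its source and destination labels, is being scraped:
# Approximate flow rate between service pairs from Hubble flow metrics
sum by (source, destination) (
  rate(hubble_flows_processed_total{verdict="FORWARDED"}[5m])
)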
Key Metrics to Display
Connection Metrics:
- TCP connection establishment time (P50, P95, P99)
- Connection error rate
- Active connections by service pair
- Retransmission rate
DNS Metrics:
- Query latency (P50, P95)
- NXDOMAIN rate
- Query volume by source
Traffic Metrics:
- Bandwidth by service pair
- Cross-AZ traffic volume (cost indicator)
- External egress volume
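Several of these can be expressed directly against commonly available exporters. The examples below assume node_exporter and CoreDNS metrics are already scraped; adjust metric names to whatever your stack exposes:
# Retransmission rate per node (node_exporter netstat counters)
rate(node_netstat_Tcp_RetransSegs[5m])
# DNS P95 query latency (CoreDNS)
histogram_quantile(0.95, rate(coredns_dns_request_duration_seconds_bucket[5m]))
# NXDOMAIN rate (CoreDNS)
sum(rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m]))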
Alerting Rules
# TCP retransmission rate spike
- alert: HighRetransmissionRate
  expr: rate(tcp_retransmits_total[5m]) > 100
  for: 5m
  annotations:
    summary: "High TCP retransmission rate on {{ $labels.instance }}"

# DNS latency spike
- alert: DNSLatencyHigh
  expr: histogram_quantile(0.95, rate(dns_query_duration_seconds_bucket[5m])) > 0.1
  for: 5m
  annotations:
    summary: "DNS P95 latency exceeds 100ms"

# Unexpected external traffic
- alert: UnexpectedEgressTraffic
  expr: sum(rate(network_egress_bytes_total{destination!~"10\\..*"}[5m])) by (source_pod) > 1e6
  for: 10m
  annotations:
    summary: "Pod {{ $labels.source_pod }} sending unexpected external traffic"
Troubleshooting Workflows
“It’s the Network” Investigation
When application teams report latency issues, use this workflow:
Step 1: Check application-level metrics
→ Is the latency in the app (processing time) or network (wait time)?
Step 2: Check DNS latency
→ Is name resolution slow? NXDOMAIN errors?
Step 3: Check TCP connection metrics
→ Are connections being established slowly? High SYN retransmits?
Step 4: Check flow data between the two services
→ Is there packet loss? Retransmissions?
Step 5: Check network policy verdicts
→ Are legitimate requests being blocked by a policy?
Step 6: Packet capture (last resort)
→ Capture 60 seconds of traffic and analyze in Wireshark
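For steps 3 and 4, per-connection TCP state (round-trip time, retransmits, congestion window) can often be read without a packet capture. A sketch from inside a debug container; the pod name and destination IP are illustrative:
# Per-connection RTT and retransmit counters for traffic to the database
kubectl debug -it pod/order-service --image=nicolaka/netshoot -- \
  ss -ti dst 10.0.2.8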
Cross-AZ Latency
Cross-AZ communication adds 0.5-2ms per round trip. For chatty services, this accumulates:
Service A (us-east-1a) → Service B (us-east-1b)
Single request: +1ms
Feature requiring 15 sequential service calls: +15ms
Under load (100 concurrent requests): added wait time can exhaust thread pools
Detection: Correlate flow log source/destination AZ with latency metrics. Remediation: Co-locate chatty services in the same AZ or reduce cross-service call volume.
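On Kubernetes, one way to co-locate chatty services is pod affinity keyed on the zone topology label. A minimal sketch for the payment-service deployment, assuming the peer carries an app: order-service label; the selector and weight are illustrative:
# Prefer scheduling payment-service in the same zone as order-service
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: order-service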
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No flow logs enabled | Blind to network-level issues | Enable VPC flow logs on all production subnets |
| Packet capture as monitoring | Storage and performance explosion | Use flow data for monitoring, pcap for debugging |
| Ignoring DNS metrics | Slow or failing lookups silently inflate every request's latency | Monitor DNS latency and error rates |
| No cross-AZ traffic visibility | Surprise data transfer costs | Tag and measure inter-AZ traffic |
| Network-team-only dashboards | App teams cannot self-serve | Build service-centric views, not network-centric ones |
Network observability is not a network team tool — it is a platform capability that empowers every team to debug their own connectivity issues. The goal is to make “it’s the network” a testable hypothesis instead of a scapegoat.