Network Observability: Seeing What Flows Through Your Infrastructure
Implement network observability to detect anomalies, debug connectivity issues, and understand traffic patterns. Covers flow logs, packet capture, eBPF-based monitoring, DNS analytics, and the dashboards that make network behavior visible to application teams.
Network observability makes the invisible visible. When an API call takes 3 seconds instead of 300 milliseconds, is it the application, the database, the network, or DNS? Without network observability, the answer is “we don’t know” — and the debugging session becomes a blame-shifting exercise between infrastructure and application teams.
Network observability captures what is flowing through your infrastructure — source, destination, ports, protocols, latency, packet loss — and presents it in a way that both network engineers and application developers can use.
The Three Pillars of Network Observability
Flow Data
Flow records capture metadata about network connections without inspecting packet contents:
Source: 10.0.1.15:45283
Destination: 10.0.2.8:5432
Protocol: TCP
Bytes: 2,847,593
Packets: 1,943
Duration: 4.2s
Action: ACCEPT
Sources:
- AWS VPC Flow Logs: Capture accepted/rejected traffic at ENI level
- Azure NSG Flow Logs: Network security group level capture
- GCP VPC Flow Logs: Subnet-level sampling
- NetFlow/IPFIX: On-premises router/switch data
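For the first source, flow logs can be enabled per VPC, subnet, or ENI. A minimal sketch with the AWS CLI; the VPC ID, log group name, and IAM role ARN are placeholders you would substitute for your environment:
# Enable VPC Flow Logs for an entire VPC, delivered to CloudWatch Logs
# (vpc-0abc123, the log group, and the role ARN are placeholder values)
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0abc123 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /vpc/flow-logs/production \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role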
DNS Analytics
DNS is the first step of nearly every network connection. DNS failures and slow lookups are among the most common — and most underdiagnosed — network issues:
Query: api.stripe.com
Type: A
Response: 54.187.174.169
Latency: 15ms
Source: pod/order-service-7d5f4
Query: internal-db.cluster.local
Type: A
Response: NXDOMAIN ← problem!
Latency: 200ms ← also a problem
Source: pod/payment-service-a3b2
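To reproduce what a workload actually sees, run a resolver query from inside the pod's network namespace. A sketch using the nicolaka/netshoot debug image; the pod name and query target are illustrative:
# Spot-check resolution latency and errors from inside a pod's network namespace
kubectl debug -it pod/payment-service-a3b2 --image=nicolaka/netshoot -- \
  dig internal-db.cluster.local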
Packet Capture
When flow data is not detailed enough, packet capture shows exactly what happened at the wire level:
# Capture packets on a specific interface
tcpdump -i eth0 -w capture.pcap host 10.0.2.8 and port 5432
# Capture inside a Kubernetes pod
kubectl debug -it pod/order-service --image=nicolaka/netshoot -- tcpdump -i eth0 -c 100
Packet capture is a debugging tool, not a monitoring tool. Use it for specific investigations, not continuous monitoring (the volume is prohibitive).
eBPF-Based Network Monitoring
eBPF (extended Berkeley Packet Filter) enables high-performance network monitoring directly in the Linux kernel without modifying applications or adding sidecars:
Cilium Hubble
# Enable Hubble observability in Cilium
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
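If Cilium was installed with its Helm chart, the same values can usually be applied in place. A sketch; the release name, chart repo alias, and namespace are assumptions about your install:
# Apply the Hubble values to an existing Cilium release
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true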
Hubble provides:
- Real-time service map (which service talks to which)
- Per-request latency histograms
- DNS query logging
- Network policy verdict logging (allowed vs denied)
- HTTP/gRPC protocol-aware visibility
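The same data is queryable from the Hubble CLI once the relay is reachable (for example via cilium hubble port-forward). A few illustrative queries; the namespace name is made up:
# Stream live flows for one namespace
hubble observe --namespace checkout --follow
# Show only DNS traffic (queries and responses)
hubble observe --protocol dns
# Show flows dropped by network policy
hubble observe --verdict DROPPED --last 100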
Why eBPF Matters
Traditional network monitoring requires either mirroring traffic (expensive, bandwidth-consuming) or inserting proxies (latency overhead, operational complexity). eBPF programs run inside the kernel itself, observing every packet and socket event where it happens: no traffic copies, no extra hops, and minimal overhead.
Building a Network Dashboard
Service Communication Map
Visualize which services communicate and how much traffic flows between them:
api-gateway ──[2,500 RPS]──▶ order-service
api-gateway ──[1,800 RPS]──▶ user-service
order-service ──[2,500 RPS]──▶ postgres
order-service ──[900 RPS]───▶ payment-service
payment-service ──[900 RPS]──▶ stripe.com (external)
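If Hubble's Prometheus metrics are enabled, a map like this can be approximated from the flow counters. The query below assumes the hubble_flows_processed_total metric, with its source and destination labels, is being scraped:
# Approximate flow rate between service pairs from Hubble flow metrics
sum by (source, destination) (
  rate(hubble_flows_processed_total{verdict="FORWARDED"}[5m])
)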
Key Metrics to Display
Connection Metrics:
- TCP connection establishment time (P50, P95, P99)
- Connection error rate
- Active connections by service pair
- Retransmission rate
DNS Metrics:
- Query latency (P50, P95)
- NXDOMAIN rate
- Query volume by source
Traffic Metrics:
- Bandwidth by service pair
- Cross-AZ traffic volume (cost indicator)
- External egress volume
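Several of these can be expressed directly against commonly available exporters. The examples below assume node_exporter and CoreDNS metrics are already scraped; adjust metric names to whatever your stack exposes:
# Retransmission rate per node (node_exporter netstat counters)
rate(node_netstat_Tcp_RetransSegs[5m])
# DNS P95 query latency (CoreDNS)
histogram_quantile(0.95, rate(coredns_dns_request_duration_seconds_bucket[5m]))
# NXDOMAIN rate (CoreDNS)
sum(rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m]))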
Alerting Rules
# TCP retransmission rate spike
- alert: HighRetransmissionRate
  expr: rate(tcp_retransmits_total[5m]) > 100
  for: 5m
  annotations:
    summary: "High TCP retransmission rate on {{ $labels.instance }}"

# DNS latency spike
- alert: DNSLatencyHigh
  expr: histogram_quantile(0.95, rate(dns_query_duration_seconds_bucket[5m])) > 0.1
  for: 5m
  annotations:
    summary: "DNS P95 latency exceeds 100ms"

# Unexpected external traffic
- alert: UnexpectedEgressTraffic
  expr: sum(rate(network_egress_bytes_total{destination!~"10\\..*"}[5m])) by (source_pod) > 1e6
  for: 10m
  annotations:
    summary: "Pod {{ $labels.source_pod }} sending unexpected external traffic"
Troubleshooting Workflows
“It’s the Network” Investigation
When application teams report latency issues, use this workflow:
Step 1: Check application-level metrics
→ Is the latency in the app (processing time) or network (wait time)?
Step 2: Check DNS latency
→ Is name resolution slow? NXDOMAIN errors?
Step 3: Check TCP connection metrics
→ Are connections being established slowly? High SYN retransmits?
Step 4: Check flow data between the two services
→ Is there packet loss? Retransmissions?
Step 5: Check network policy verdicts
→ Are legitimate requests being blocked by a policy?
Step 6: Packet capture (last resort)
→ Capture 60 seconds of traffic and analyze in Wireshark
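For steps 3 and 4, per-connection TCP state (round-trip time, retransmits, congestion window) can often be read without a packet capture. A sketch from inside a debug container; the pod name and destination IP are illustrative:
# Per-connection RTT and retransmit counters for traffic to the database
kubectl debug -it pod/order-service --image=nicolaka/netshoot -- \
  ss -ti dst 10.0.2.8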
Cross-AZ Latency
Cross-AZ communication adds 0.5-2ms per round trip. For chatty services, this accumulates:
Service A (us-east-1a) → Service B (us-east-1b)
Single request: +1ms
Feature requiring 15 sequential service calls: +15ms
Under load (100 concurrent requests): added wait time can exhaust thread pools
Detection: Correlate flow log source/destination AZ with latency metrics. Remediation: Co-locate chatty services in the same AZ or reduce cross-service call volume.
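On Kubernetes, one way to co-locate chatty services is pod affinity keyed on the zone topology label. A minimal sketch for the payment-service deployment, assuming the peer carries an app: order-service label; the selector and weight are illustrative:
# Prefer scheduling payment-service in the same zone as order-service
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: order-service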
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No flow logs enabled | Blind to network-level issues | Enable VPC flow logs on all production subnets |
| Packet capture as monitoring | Storage and performance explosion | Use flow data for monitoring, pcap for debugging |
| Ignoring DNS metrics | Slow or failing lookups silently inflate every request's latency | Monitor DNS latency and error rates |
| No cross-AZ traffic visibility | Surprise data transfer costs | Tag and measure inter-AZ traffic |
| Network-team-only dashboards | App teams cannot self-serve | Build service-centric views, not network-centric ones |
Network observability is not a network team tool — it is a platform capability that empowers every team to debug their own connectivity issues. The goal is to make “it’s the network” a testable hypothesis instead of a scapegoat.