Service Mesh Architecture
Implement a service mesh to manage service-to-service communication with zero application code changes. Covers sidecar proxies, mTLS, traffic management, observability, and deciding whether a service mesh is worth the operational complexity.
As microservices architectures grow beyond 10-20 services, the cross-cutting concerns — mutual TLS, retries, circuit breaking, observability — become too complex to implement in every service individually. A service mesh extracts these concerns into infrastructure, providing them uniformly to every service through sidecar proxies.
How a Service Mesh Works
     Service A Pod                       Service B Pod
┌──────────────────────┐            ┌──────────────────────┐
│  Application         │            │  Application         │
│  (no mesh awareness) │            │  (no mesh awareness) │
│         ↓            │            │         ↑            │
│  Sidecar Proxy       │───────────▶│  Sidecar Proxy       │
│  (Envoy)             │    mTLS    │  (Envoy)             │
└──────────────────────┘            └──────────────────────┘
           ↑                                   ↑
           └────── Control Plane ──────────────┘
                  (Istio/Linkerd)
The sidecar proxy intercepts all inbound and outbound traffic. The application sends plain HTTP; the sidecar handles TLS, retries, load balancing, and telemetry transparently.
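As one illustration of how the sidecar gets into a pod without touching the application, Istio can inject it automatically for every pod created in a labelled namespace. The sketch below assumes a hypothetical namespace named shop and a default (non-revisioned) Istio install; revision-based installs use a different label, and other meshes such as Linkerd have their own mechanism.

```yaml
# Hypothetical namespace; any pod created here gets an Envoy sidecar injected.
apiVersion: v1
kind: Namespace
metadata:
  name: shop                    # assumed name, for illustration only
  labels:
    istio-injection: enabled    # tells Istio's injection webhook to add the sidecar
```

Pods that already exist are not modified; they pick up the sidecar only when they are recreated.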
Core Capabilities
Mutual TLS (mTLS)
Zero-trust networking without application changes:
# Istio: Enable strict mTLS for the entire mesh
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT  # All traffic must be mTLS
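During a migration, some workloads may still need to accept plain-text traffic from clients outside the mesh. A minimal sketch of a temporary exception, assuming a hypothetical namespace named legacy-batch: a namespace-scoped policy takes precedence over the mesh-wide one for that namespace only.

```yaml
# Temporary exception: accept both mTLS and plain text in one namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy-batch   # hypothetical namespace still being migrated
spec:
  mtls:
    mode: PERMISSIVE        # remove this policy once all clients are in the mesh
```

Keeping the exception scoped to one namespace leaves the rest of the mesh in strict mode and makes the remaining work easy to see.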
Traffic Management
# Canary deployment: 90% to v1, 10% to v2
# (subsets v1 and v2 must be defined in a DestinationRule for order-service)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 90
        - destination:
            host: order-service
            subset: v2
          weight: 10
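The v1 and v2 subsets referenced by the canary route have to be declared somewhere, otherwise the route has nothing to resolve to. A minimal sketch of the companion DestinationRule, assuming the two Deployments label their pods with version: v1 and version: v2 (the label key and values are assumptions, not something the mesh mandates):

```yaml
# Maps subset names used by the VirtualService to pod labels.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  subsets:
    - name: v1
      labels:
        version: v1   # assumed pod label on the stable Deployment
    - name: v2
      labels:
        version: v2   # assumed pod label on the canary Deployment
```

Shifting more traffic to the canary then only requires editing the two weight values in the VirtualService.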
Retry and Timeout Policies
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment-service
  http:
    - route:
        - destination:
            host: payment-service
      retries:
        attempts: 3          # retry a failed request up to 3 times
        perTryTimeout: 2s    # deadline for each individual attempt
        retryOn: "5xx,connect-failure,retriable-4xx"
      timeout: 10s           # overall deadline, including all retries
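The policy above retries every request, including writes. For a payment service that may not be safe, because a retried POST can charge a customer twice unless the endpoint is idempotent. One possible variant, sketched under the assumption that only GET requests are safe to repeat, splits the route by HTTP method:

```yaml
# Variant of the payment-service VirtualService: retry reads, not writes.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment-service
  http:
    # Rule 1: GET requests are assumed idempotent, so retries are allowed.
    - match:
        - method:
            exact: GET
      route:
        - destination:
            host: payment-service
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: "5xx,connect-failure"
      timeout: 10s
    # Rule 2: everything else (POST, PUT, ...) gets a deadline but no retries.
    - route:
        - destination:
            host: payment-service
      retries:
        attempts: 0   # explicitly turn off the mesh's default retries
      timeout: 10s
```

Rules are evaluated in order, so the catch-all route must come last. If the write endpoints accept idempotency keys, retries could be re-enabled for them as well.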
Circuit Breaking
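apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100            # cap on TCP connections to the service
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 100   # requests allowed to queue for a connection
        http2MaxRequests: 1000         # cap on concurrent active requests
    outlierDetection:
      consecutive5xxErrors: 5          # eject a host after 5 consecutive 5xx responses
      interval: 30s                    # how often hosts are evaluated
      baseEjectionTime: 30s            # minimum time an ejected host stays out
      maxEjectionPercent: 50           # never eject more than half of the hosts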
Mesh Selection
| Feature | Istio | Linkerd | Consul Connect |
|---|---|---|---|
| Complexity | High | Low | Medium |
| Resource overhead (approximate memory; varies with configuration and load) | ~100 MB per sidecar | ~20 MB per sidecar | Medium |
| mTLS | Yes | Yes | Yes |
| Traffic management | Advanced | Basic | Medium |
| Multi-cluster | Yes | Yes | Yes |
| Best for | Complex requirements | Simplicity, performance | HashiCorp ecosystem |
Observability
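Because every request passes through a sidecar, a service mesh provides uniform telemetry without per-service instrumentation (at the cost of the proxy's CPU and memory):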
Metrics (per-service, per-endpoint):
- Request rate, error rate, latency (RED metrics)
- Connection count, bytes transferred
- Retry count, circuit breaker trips
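Traces:
- Spans generated automatically at each sidecar
- Full distributed traces across mesh services, provided each application forwards the incoming trace headers (for example traceparent or the B3 headers) on its outbound calls; the sidecar cannot correlate inbound and outbound requests on its own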
Access Logs:
- Structured logs for every request
- Source, destination, duration, status code, response size
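In Istio, access logs are typically not emitted until they are switched on. A minimal sketch using the Telemetry API, assuming Istio's built-in envoy log provider and a mesh whose root namespace is istio-system; the resource name mesh-default is arbitrary:

```yaml
# Enable Envoy access logs for every sidecar in the mesh.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default        # arbitrary name
  namespace: istio-system   # root namespace, so the setting applies mesh-wide
spec:
  accessLogging:
    - providers:
        - name: envoy       # built-in provider that writes to the proxy's stdout
```

Placing the same resource in an application namespace instead would limit logging to that namespace, which can help keep log volume manageable.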
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Mesh for < 10 services | Operational overhead exceeds benefit | Use simple retry libraries instead |
| Not monitoring sidecar resources | Proxy memory/CPU use grows unnoticed and competes with the application | Monitor and set sidecar resource limits |
| mTLS in permissive mode forever | False sense of security | Set to strict after testing |
| Overly aggressive retries | Amplify failures during outages | Use retry budgets and exponential backoff |
| Mesh as a substitute for good design | Infrastructure cannot fix bad architecture | Fix service boundaries first |
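As an illustration of the sidecar resource limits fix from the table, Istio reads per-pod annotations that set requests and limits for the injected proxy container. The sketch below is an assumption-laden example: the Deployment name, image, and every numeric value are hypothetical and should be sized from observed usage.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
      annotations:
        # Requests and limits for the injected istio-proxy container.
        sidecar.istio.io/proxyCPU: "100m"          # CPU request (illustrative)
        sidecar.istio.io/proxyMemory: "128Mi"      # memory request (illustrative)
        sidecar.istio.io/proxyCPULimit: "500m"     # CPU limit (illustrative)
        sidecar.istio.io/proxyMemoryLimit: "256Mi" # memory limit (illustrative)
    spec:
      containers:
        - name: order-service
          image: example.com/order-service:1.0.0   # hypothetical image
```

Setting these per workload lets high-traffic services get a larger proxy without raising the default for every pod in the mesh.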
A service mesh is infrastructure for infrastructure. It is most valuable when the alternative is implementing the same cross-cutting concerns in 50 different services in 5 different languages.