Service Catalog Design Patterns
Design and operate an internal service catalog that gives developers and platform teams visibility into every service, its ownership, health, and dependencies. Covers catalog data models, ownership enforcement, health scoring, Backstage integration, and service maturity frameworks.
An internal service catalog answers the question every engineer asks during an incident: “Who owns this service, what does it do, and how do I contact the team?” Without a catalog, this information lives in Slack history, wiki pages last updated two years ago, and one person’s head.
What a Service Catalog Tracks
Per service:
Identity: Name, description, team, tier
Ownership: Team, on-call rotation, escalation path
Health: SLO status, error rate, latency p95
Dependencies: What it calls, what calls it
Infrastructure: Where it runs, how many instances
Documentation: Runbooks, architecture docs, API docs
Code: Repository, language, framework
Deployment: Last deploy, deploy frequency, rollback status
Cost: Monthly infrastructure cost
Data Model
apiVersion: catalog/v1
kind: Service
metadata:
  name: order-service
  description: "Manages order lifecycle from creation to fulfillment"
  tier: 1
spec:
  owner: order-team
  lifecycle: production
  links:
    - title: API Docs
      url: https://docs.internal/order-service/api
    - title: Runbook
      url: https://runbooks.internal/order-service
    - title: Dashboard
      url: https://grafana.internal/d/order-service
  dependencies:
    consumes:
      - payment-service
      - inventory-service
    provides:
      - order-api (REST)
      - order-events (Kafka)
  slos:
    - name: availability
      target: 99.95%
    - name: latency_p95
      target: 200ms
  contacts:
    oncall: order-team-oncall
    slack: "#order-team"
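The `consumes` list in a descriptor records only outbound calls, so the "what calls it" half of the dependency picture has to be derived by inverting those edges across all descriptors. The following is a minimal sketch of that inversion and of a blast-radius lookup, assuming the descriptors have already been loaded into plain dicts with the layout shown above. The function names and `checkout-service` are hypothetical and used only for illustration.

```python
from collections import defaultdict, deque


def build_callers_index(descriptors):
    """Map each service name to the set of services that consume it."""
    callers = defaultdict(set)
    for doc in descriptors:
        name = doc["metadata"]["name"]
        consumes = doc.get("spec", {}).get("dependencies", {}).get("consumes", [])
        for upstream in consumes:
            callers[upstream].add(name)
    return callers


def blast_radius(service, callers):
    """Return every service that depends on `service`, directly or transitively."""
    seen = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen and caller != service:
                seen.add(caller)
                queue.append(caller)
    return seen


# Hypothetical catalog: order-service mirrors the descriptor above,
# checkout-service is invented for illustration.
descriptors = [
    {"metadata": {"name": "order-service"},
     "spec": {"dependencies": {"consumes": ["payment-service", "inventory-service"]}}},
    {"metadata": {"name": "checkout-service"},
     "spec": {"dependencies": {"consumes": ["order-service"]}}},
]

callers = build_callers_index(descriptors)
print(sorted(blast_radius("payment-service", callers)))
# prints ['checkout-service', 'order-service']
```

This only sees dependencies that teams have declared. Edges discovered from live traffic would need to be merged into the same index to cover undeclared calls.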
Health Scoring
health_score_components:
  reliability: 30%
    - SLO compliance (28 days)
    - Incident count and severity
  operational_readiness: 25%
    - Has runbook: yes/no
    - Has on-call rotation: yes/no
    - Has alerting configured: yes/no
  code_quality: 20%
    - Test coverage > 80%
    - No critical security vulnerabilities
  documentation: 15%
    - API docs up to date
    - Architecture diagram exists
  deployment_health: 10%
    - Deploy frequency (weekly+)
    - Rollback rate (< 5%)
scores:
  A (90-100): Excellent
  B (75-89): Good
  C (60-74): Needs attention
  D (< 60): Critical risk
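The weights and grade bands above combine into a single number as a weighted average. The sketch below shows one way to compute it, under two assumptions that the scheme itself does not spell out: each component is first reduced to a 0-100 value, and a yes/no checklist is scored as the percentage of checks that pass. All function names and the sample inputs are hypothetical.

```python
# Component weights in percent, taken from the breakdown above.
WEIGHTS = {
    "reliability": 30,
    "operational_readiness": 25,
    "code_quality": 20,
    "documentation": 15,
    "deployment_health": 10,
}

# Lower bound of each grade band; anything below 60 is a D.
GRADE_FLOORS = [(90, "A"), (75, "B"), (60, "C")]


def checklist_score(checks):
    """Assumed scoring rule: percentage of yes/no checks that pass."""
    return 100.0 * sum(1 for passed in checks if passed) / len(checks)


def health_score(components):
    """Weighted average of 0-100 component scores."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100


def grade(score):
    """Map a 0-100 score to its letter grade."""
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    return "D"


# Hypothetical service: runbook, on-call rotation, and alerting all present.
components = {
    "reliability": 90,
    "operational_readiness": checklist_score([True, True, True]),
    "code_quality": 50,
    "documentation": 100,
    "deployment_health": 50,
}
score = health_score(components)
print(score, grade(score))
# prints 82.0 B
```

The bands are written as integer ranges, so a fractional score such as 89.5 falls between B and A. This sketch assigns it to the lower band by comparing against each band's floor.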
Ownership Enforcement
catalog_validation:
  required_fields:
    - metadata.name
    - spec.owner
    - spec.lifecycle
    - spec.contacts.oncall
  enforcement:
    - PR check: Validates descriptor
    - Weekly audit: Flags missing info
    - Quarterly: Review orphan services
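The PR check and the weekly audit can share the same validation logic: resolve each required dotted path against the descriptor and report the ones that are absent or empty. The following is a minimal sketch, assuming the descriptor has already been parsed into a nested dict. The helper names are hypothetical.

```python
# Required dotted paths, taken from the validation rules above.
REQUIRED_FIELDS = [
    "metadata.name",
    "spec.owner",
    "spec.lifecycle",
    "spec.contacts.oncall",
]


def lookup(doc, dotted_path):
    """Follow a dotted path through nested dicts; return None if any key is absent."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def missing_fields(doc, required=REQUIRED_FIELDS):
    """List required fields that are absent or blank in a descriptor."""
    missing = []
    for path in required:
        value = lookup(doc, path)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(path)
    return missing


# Hypothetical descriptor with a Slack channel but no on-call rotation.
descriptor = {
    "metadata": {"name": "order-service"},
    "spec": {
        "owner": "order-team",
        "lifecycle": "production",
        "contacts": {"slack": "#order-team"},
    },
}
print(missing_fields(descriptor))
# prints ['spec.contacts.oncall']
```

A PR check could fail the build whenever the returned list is non-empty, while the weekly audit could run the same function over every descriptor in the catalog and report the results per owning team.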
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Static wiki page as catalog | Outdated within weeks | Automated catalog from code + infra |
| No ownership enforcement | Orphan services | Owner required for all services |
| Catalog without health data | No signal on what needs attention | Integrate SLO, incident, deploy data |
| Only infra team maintains | Bottleneck, stale data | Each team owns their entries |
| No dependency mapping | Cannot assess blast radius | Auto-discover from traffic + config |
A service catalog is a living system, not a project. If it is not continuously updated, it becomes another stale wiki page.