Observability
Metrics, logging, tracing, and alerting for Kubernetes workloads. KubeShark treats observability as mandatory for production -- if you cannot measure it, you cannot operate it. For full configuration examples and the LLM mistake checklist, see references/observability.md.
Probes as the Foundation
Liveness, readiness, and startup probes are the most basic form of observability. They tell Kubernetes whether your application is alive, ready to serve traffic, and initialized. Without correct probes, no amount of metrics or logging prevents cascading failures. See fragile-rollouts for detailed probe rules.
Prometheus Metrics
Annotations Pattern
Add prometheus.io/scrape: "true", prometheus.io/port, and prometheus.io/path annotations to the Pod template metadata (not the Deployment metadata). This enables Prometheus auto-discovery without the prometheus-operator.
ServiceMonitor Pattern
When using prometheus-operator, prefer ServiceMonitor CRDs for type-safe configuration. The ServiceMonitor selector.matchLabels must match the Service labels, and the release label must match the Prometheus operator selector.
RED Method
Every service should expose at minimum:
- Rate -- request throughput (
http_requests_totalcounter) - Errors -- failed request count (
http_requests_total{status=~"5.."}) - Duration -- request latency (
http_request_duration_secondshistogram)
Align histogram buckets to your SLO thresholds, not arbitrary defaults. For resource-oriented services (queues, databases), add saturation metrics like queue depth and connection pool usage.
Structured Logging
Applications must log structured JSON to stdout/stderr. Rules:
- Use
timestamp,level,msgas standard fields - Include
trace_idandspan_idfor correlation with distributed traces - Never log secrets, tokens, PII, or full request bodies
- Never log to files inside the container -- it defeats node-level collection and fills the writable layer
Log aggregation uses a DaemonSet pattern (Fluent Bit on every node reading /var/log/containers/). Use sidecars only when per-pod log transformation is required.
OpenTelemetry Tracing
Auto-Instrumentation
The OpenTelemetry Operator injects instrumentation via pod annotations (e.g., instrumentation.opentelemetry.io/inject-java: "true").
Collector Sidecar
For fine-grained control, run the OTel Collector as a sidecar with gRPC (4317) and HTTP (4318) OTLP receivers. Set resource requests and limits on the sidecar to prevent it from starving the main workload.
Context Propagation
Propagate trace context (traceparent header / W3C Trace Context) across all service boundaries. Without propagation, traces are fragmented and useless.
Alerting Patterns
Write symptom-based alerts (what the user experiences), not cause-based alerts (what broke internally):
- HighErrorRate -- error rate above SLO threshold for a sustained period
- HighLatencyP99 -- p99 latency above target for a sustained period
Every PrometheusRule alert must include a runbook_url annotation pointing to actionable remediation steps. Alerts without runbooks are noise.
Use Grafana deployment annotations to correlate metric changes with releases. Integrate annotation creation into your CI/CD pipeline as a post-deploy step.
Common LLM Mistakes
Key observability errors LLMs produce include: placing Prometheus annotations on the Deployment metadata instead of the Pod template, not declaring the metrics port in the container ports list, generating file-based logging instead of structured JSON to stdout, omitting trace context propagation, writing cause-based alerts instead of symptom-based ones, and omitting resource limits on sidecar containers. See the full checklist in the reference file.