Kubernetes Performance Monitoring: Tools and Techniques for Optimization
Monitor Kubernetes performance with useful metrics, Prometheus, Grafana, kubectl, and practical resource tuning habits.
Kubernetes performance monitoring is not just watching CPU charts. A cluster can show low average CPU while users see slow requests. A pod can have enough memory most of the day and still get killed during a batch job. A node can look healthy until disk pressure starts evicting pods. Good monitoring connects cluster signals to the experience people actually care about: is the service fast, available, and predictable?
The first mistake is starting with tools instead of questions. Prometheus, Grafana, metrics-server, kube-state-metrics, and cloud monitoring platforms are all useful, but they do not decide what matters. You decide that by understanding the workload. A public API cares about latency and errors. A queue worker cares about backlog and processing rate. A nightly job cares about completion time and failed pods. A database-like workload cares about disk latency and memory pressure.
For a quick look, kubectl top is still useful:
kubectl top nodes
kubectl top pods -A
kubectl top pod -n production api-7d9c8f7b9d-2x4mq --containers
These commands depend on metrics-server. They give recent CPU and memory usage, not a full history. Use them during triage, not as your only monitoring system. If a pod restarted ten minutes ago because it ran out of memory, kubectl top may not show the spike that caused it.
Prometheus is the common foundation for Kubernetes metrics because it scrapes time-series data and works well with Kubernetes service discovery. In a typical setup, metrics come from several places. The kubelet exposes container and pod resource metrics. cAdvisor, integrated with kubelet, contributes container CPU, memory, filesystem, and network data. node-exporter reports host-level metrics. kube-state-metrics turns Kubernetes object state into metrics: desired replicas, available replicas, pod phases, node conditions, and more.
Grafana then turns those metrics into dashboards. A good dashboard is not a wall of gauges. It should answer specific questions quickly: which service is slow, which pods are throttled, which nodes are under pressure, which Deployment is failing to roll out, and whether autoscaling is keeping up.
Start at the application layer. For user-facing services, the most important signals are request rate, error rate, and latency. If you have SLOs, graph them. A pod CPU chart does not tell you whether checkout is failing. Application metrics do. Instrument services with Prometheus client libraries, OpenTelemetry, or the monitoring system your platform already uses. Application metrics tell you that the service is unhealthy; Kubernetes metrics help explain why.
Then connect application symptoms to pod resources. CPU usage is easy to misread in Kubernetes. A container with a CPU limit can be throttled even when average CPU does not look dramatic. Throttling happens when the container tries to use more CPU time than its limit allows within a single CFS enforcement period (100 ms by default), so a short burst can be throttled even though usage averaged over a minute looks modest. For latency-sensitive apps, this can cause slow requests that appear random.
A useful PromQL query for throttling is:
rate(container_cpu_cfs_throttled_periods_total{namespace="production", container!=""}[5m])
A value above zero means the container is being throttled, and a rising value means it is happening more often. To judge severity, divide by the rate of container_cpu_cfs_periods_total, which gives the fraction of enforcement periods that were throttled. Pair it with CPU usage and request latency. If latency spikes line up with throttling, consider raising or removing the CPU limit, increasing replicas, or optimizing the code path. Some teams set CPU requests but avoid CPU limits for latency-sensitive services, relying on requests, autoscaling, and node capacity controls instead. That can be reasonable, but it needs cluster-level discipline so noisy workloads do not starve others.
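To make the throttling numbers concrete, here is a minimal Python sketch of the arithmetic behind a throttled-period ratio. It assumes you have two raw samples of the cAdvisor counters container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total for one container; the function name and the sample values are hypothetical.

```python
def throttled_fraction(periods_start, periods_end, throttled_start, throttled_end):
    """Fraction of CFS periods in which the container was throttled between two samples."""
    elapsed_periods = periods_end - periods_start
    throttled_periods = throttled_end - throttled_start
    if elapsed_periods <= 0 or throttled_periods < 0:
        # Counter reset or no enforcement periods elapsed: no reliable answer.
        return None
    return throttled_periods / elapsed_periods


# Hypothetical counter samples taken five minutes apart.
fraction = throttled_fraction(12_000, 15_000, 300, 1_050)
print(f"{fraction:.0%} of periods throttled")  # 750 / 3000 -> 25%
```

A ratio like this is usually easier to reason about than the raw rate, because it does not depend on how many enforcement periods the container was active in.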
Memory behaves differently. CPU can be throttled; memory cannot be slowed down the same way. If a container exceeds its memory limit, it can be OOMKilled. Look for restart reasons:
kubectl describe pod -n production api-7d9c8f7b9d-2x4mq
kubectl get pod -n production api-7d9c8f7b9d-2x4mq -o jsonpath='{.status.containerStatuses[*].lastState}'
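If you want to check many pods at once, the same field can be read from JSON output. The sketch below assumes pod JSON shaped like the output of kubectl get pod -o json; the helper name and the trimmed sample document are hypothetical.

```python
import json


def oom_killed_containers(pod):
    """Return names of containers whose previous termination reason was OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names


# Trimmed, hypothetical pod document for illustration only.
sample = json.loads("""
{"status": {"containerStatuses": [
  {"name": "api", "restartCount": 3,
   "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
  {"name": "sidecar", "restartCount": 0, "lastState": {}}
]}}
""")
print(oom_killed_containers(sample))  # ['api']
```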
In Prometheus, watch working set memory and compare it to limits:
container_memory_working_set_bytes{namespace="production", container!=""}
Do not tune memory from a single quiet hour. Look at peak traffic, batch windows, deployments, and garbage collection behavior. Java, Go, Node.js, and Python services have different memory profiles. A limit that looks generous during normal traffic may be too tight during startup, cache warmup, or a large request.
Resource requests matter because the scheduler uses them to place pods. If requests are too low, Kubernetes may pack too many busy pods onto the same node. Everything looks efficient until those pods become busy at the same time. If requests are too high, the cluster wastes capacity and autoscaling may add nodes sooner than needed. The best request is usually based on observed usage plus headroom, not a copied value from another service.
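One way to turn "observed usage plus headroom" into a number is to take a high percentile of usage samples and add a margin. The sketch below is only an illustration: the 95th percentile, the 20 percent headroom, and the sample values are assumptions, not recommendations.

```python
import math


def suggest_request(samples, percentile=0.95, headroom=0.2):
    """Suggest a request: a percentile of observed usage plus fractional headroom."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest sample with at least
    # `percentile` of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1] * (1 + headroom)


# Hypothetical CPU usage samples in millicores, including one busy interval.
usage_millicores = [120, 150, 135, 180, 400, 160, 170, 155, 145, 165]
print(round(suggest_request(usage_millicores)))                  # sized for the busy interval
print(round(suggest_request(usage_millicores, percentile=0.5)))  # sized for typical load
```

The gap between the two results is the point: the percentile you pick decides whether the request covers busy periods or only typical ones, so choose it from a window that includes peak traffic and batch jobs.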
Vertical Pod Autoscaler can help by recommending requests from historical usage. Many teams run VPA in recommendation mode first because automatic updates can restart pods depending on configuration and workload type. Treat recommendations as input, not law. A service with rare but important spikes may need more headroom than its average history suggests.
Horizontal Pod Autoscaler is useful when more replicas actually improve throughput. It works well for stateless web services and workers that can share load. It does not fix a single-threaded bottleneck, a database lock, or a downstream dependency that is already saturated.
A basic HPA might look like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Monitor HPA behavior, not only replica count. If it constantly scales up and down, adjust stabilization windows, targets, or the metric. If it reaches maxReplicas and latency is still bad, the problem may be capacity, code, or a dependency. If it never scales while pods are clearly overloaded, check metrics availability and whether requests are set. CPU utilization targets depend on CPU requests; missing or unrealistic requests can make autoscaling misleading.
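The dependence on requests is easier to see with the documented scaling formula, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The Python sketch below is a simplification that ignores unready pods, missing metrics, and stabilization windows; the function name and the usage figures are hypothetical, and the 0.1 tolerance is assumed to be the controller default.

```python
import math


def desired_replicas(current_replicas, usage_millicores, request_millicores,
                     target_utilization=70, min_replicas=3, max_replicas=20,
                     tolerance=0.1):
    """Simplified HPA calculation for a CPU utilization target."""
    # Utilization is average usage expressed as a percentage of the CPU request.
    utilization = 100 * usage_millicores / request_millicores
    ratio = utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Same real usage (450m per pod), two different CPU requests.
print(desired_replicas(4, 450, 500))   # 90% utilization -> 6
print(desired_replicas(4, 450, 1000))  # 45% utilization -> 3
```

With identical load, an inflated request makes the pods look idle and the autoscaler scales in, which is why unrealistic requests make CPU-based autoscaling misleading.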
Node health is the next layer. A pod problem that appears across many services on one node is usually a node problem. Watch CPU saturation, load average, memory available, disk pressure, inode usage, filesystem latency, network errors, and kubelet health. Node conditions such as MemoryPressure, DiskPressure, and PIDPressure should be visible in dashboards and alerts.
Use kubectl describe node when a node looks suspicious:
kubectl describe node worker-12
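Look at conditions, allocated resources, events, and pods scheduled on the node. The scheduler normally keeps the sum of requests within the node's allocatable capacity, but limits can add up to far more than the node has, and actual usage can run well above requests. The allocated resources section helps you see whether scheduling assumptions match reality.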
Control plane monitoring matters even if your application pods look fine. API server latency can slow deployments, autoscaling, and controllers. etcd latency or disk issues can make the whole cluster feel sluggish. Controller manager and scheduler problems can delay pod placement or reconciliation. In managed Kubernetes, you may not see every control plane component, but cloud providers usually expose some health and API latency metrics.
Events are useful during incidents, but they are not a long-term metrics store. Still, they often explain what just happened:
kubectl get events -A --sort-by=.lastTimestamp
Look for failed scheduling, image pull errors, probe failures, evictions, and back-off messages. Events can be noisy, so filter by namespace or involved object when needed.
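When the event stream is too noisy to read, filtering the JSON output can help. This sketch assumes input shaped like kubectl get events -A -o json; the set of reasons is an illustrative assumption rather than a complete list, and the sample events are hypothetical.

```python
# Reasons to surface first. This set is an assumption for illustration;
# adjust it to the event reasons your cluster actually emits.
INTERESTING_REASONS = {"FailedScheduling", "BackOff", "Unhealthy", "Evicted", "Failed"}


def interesting_events(event_list, namespace=None):
    """Return (timestamp, kind, name, reason) for matching Warning events, oldest first."""
    selected = []
    for event in event_list.get("items", []):
        obj = event.get("involvedObject", {})
        if event.get("type") != "Warning":
            continue
        if event.get("reason") not in INTERESTING_REASONS:
            continue
        if namespace and obj.get("namespace") != namespace:
            continue
        selected.append((event.get("lastTimestamp") or "", obj.get("kind"),
                         obj.get("name"), event.get("reason")))
    return sorted(selected, key=lambda row: row[0])


# Hypothetical, trimmed event list.
events = {"items": [
    {"type": "Normal", "reason": "Pulled", "lastTimestamp": "2024-05-01T10:00:00Z",
     "involvedObject": {"kind": "Pod", "namespace": "production", "name": "api-1"}},
    {"type": "Warning", "reason": "Unhealthy", "lastTimestamp": "2024-05-01T10:02:00Z",
     "involvedObject": {"kind": "Pod", "namespace": "production", "name": "api-1"}},
    {"type": "Warning", "reason": "FailedScheduling", "lastTimestamp": "2024-05-01T10:01:00Z",
     "involvedObject": {"kind": "Pod", "namespace": "batch", "name": "job-7"}},
]}
for row in interesting_events(events):
    print(row)
```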
Probes deserve careful monitoring. Liveness probes that are too aggressive can restart a slow-but-recovering app and make an incident worse. Readiness probes that fail correctly can protect users by removing bad pods from service. Track probe failures and correlate them with CPU throttling, GC pauses, downstream timeouts, and deploys.
For storage-heavy workloads, container CPU and memory are not enough. Watch persistent volume latency, disk throughput, queue depth, and filesystem fullness. A pod waiting on slow storage may show low CPU because it is blocked. If a database or queue runs on Kubernetes, storage metrics are part of application performance, not infrastructure trivia.
A practical troubleshooting path starts wide and narrows. First, confirm the user-facing symptom: latency, errors, failed jobs, or backlog. Second, identify scope: one pod, one Deployment, one node, one namespace, or the whole cluster. Third, check recent changes: deployments, config updates, autoscaler activity, node rotations, or traffic spikes. Fourth, inspect pod resource behavior: CPU throttling, memory pressure, restarts, and probe failures. Fifth, inspect node and dependency health.
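The scoping step can be approximated mechanically: list the affected pods and see which dimensions they all share. The sketch below is a rough heuristic under the assumption that you already have such a list; the field names and sample pods are hypothetical.

```python
def shared_dimensions(affected_pods):
    """Return the dimensions on which every affected pod agrees."""
    dimensions = ("namespace", "deployment", "node")
    return [d for d in dimensions
            if len({pod[d] for pod in affected_pods}) == 1]


# Hypothetical affected pods: different workloads, same node.
affected = [
    {"namespace": "production", "deployment": "api", "node": "worker-12"},
    {"namespace": "batch", "deployment": "reports", "node": "worker-12"},
    {"namespace": "production", "deployment": "web", "node": "worker-12"},
]
print(shared_dimensions(affected))  # ['node']
```

An empty result suggests a wider problem, such as a shared dependency or the cluster itself, while a single shared node or Deployment tells you where to look next.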
Alerting should avoid waking people for harmless noise. Alert on user impact first: high error rate, high latency, missed job deadline, growing queue age. Then alert on strong leading indicators: frequent OOMKills, sustained CPU throttling on latency-sensitive services, pods unavailable below desired replicas, node pressure, persistent pending pods, and HPA stuck at max replicas while service metrics are bad.
The goal is not perfect utilization. A cluster running at 95 percent resource usage all day may look efficient until one node fails and there is no room to reschedule pods. Leave capacity for rollouts, retries, traffic bursts, and failures. Optimization should reduce waste without removing the buffer that keeps incidents small.
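The node-failure argument comes down to simple arithmetic. As a rough sketch, assuming equally sized nodes and load that can be spread perfectly across survivors, the highest average utilization that still tolerates a failure is (N - f) / N; the function name is hypothetical.

```python
def max_safe_utilization(node_count, failures_tolerated=1):
    """Highest average utilization at which surviving nodes can absorb failed nodes' load."""
    if node_count <= failures_tolerated:
        raise ValueError("need more nodes than tolerated failures")
    return (node_count - failures_tolerated) / node_count


for nodes in (3, 10, 20):
    print(nodes, f"{max_safe_utilization(nodes):.0%}")
# 3 67%
# 10 90%
# 20 95%
```

This is an upper bound. It ignores scheduling fragmentation, rollout surge, and traffic bursts, so a practical target sits below it.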
Good Kubernetes performance monitoring feels practical. You can open a dashboard and see service health, pod health, node health, and scaling behavior without hunting through twenty tabs. You can answer whether a slowdown is code, resource limits, node pressure, storage, network, or the control plane. And when you change requests, limits, or autoscaling, you can see whether the change helped instead of guessing.
Namespace-level views are useful when many teams share a cluster. A single team may not see node-level saturation coming if they only watch their own Deployments. Platform teams should expose dashboards that show namespace CPU and memory requests, actual usage, pod counts, restarts, and throttling. That makes capacity conversations less emotional. Instead of saying a team is using "too much," you can show request trends, peak usage, and waste.
Cost optimization should come after reliability signals, not before them. If a service has never had requests tuned, you may find easy savings. But cutting requests aggressively can create scheduling pressure and noisy-neighbor problems. A good process changes one workload class at a time, watches latency and restarts, and leaves rollback notes. Treat resource tuning like production code: small changes, measured results.
Deployments themselves can create performance incidents. A rollout that replaces too many pods at once can overload cold caches, connection pools, or downstream services. Watch rollout duration, unavailable replicas, and application latency during deploys. Tune maxSurge and maxUnavailable based on how the service behaves during startup. A service with slow warmup may need a conservative rollout even if steady-state performance is fine.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
That setting is not universally best, but it shows the tradeoff: slower rollout, more protection against capacity drops. For a stateless service that starts instantly, you may choose a faster rollout. For a JVM service warming caches and opening many downstream connections, slower may be safer.
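To compare rollout settings, it helps to work out how many pods can exist and how many must stay available during the update. The sketch below follows the documented rounding rules for percentages (maxSurge rounds up, maxUnavailable rounds down); the helper names are hypothetical.

```python
import math


def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string such as '25%' to a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Pod-count limits that apply while a rolling update is in progress."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {"max_total_pods": replicas + surge,
            "min_available_pods": replicas - unavailable}


print(rollout_bounds(10, 1, 0))          # {'max_total_pods': 11, 'min_available_pods': 10}
print(rollout_bounds(10, "25%", "25%"))  # {'max_total_pods': 13, 'min_available_pods': 8}
```

The first call matches the conservative setting above: capacity never drops, but only one new pod starts at a time. The second uses the 25 percent defaults, which roll faster at the cost of temporarily running with fewer available pods.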
Keep an eye on cardinality in metrics. Kubernetes labels are tempting, but high-cardinality labels such as pod UID, request ID, or user ID can make Prometheus expensive and slow. Use labels that help you aggregate: namespace, workload, pod, container, node, status code, route pattern. Avoid labels that create a new time series for every user or every request. Monitoring should not become the thing that hurts cluster performance.
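A quick estimate shows why one unbounded label is so costly. In the worst case, the number of time series for a metric is the product of the distinct values of each label. The label counts below are hypothetical.

```python
import math


def estimated_series(label_cardinalities):
    """Worst-case time series count for one metric: product of distinct values per label."""
    return math.prod(label_cardinalities.values())


# Hypothetical label cardinalities for a request counter.
bounded = {"namespace": 5, "workload": 40, "status_code": 6, "route": 25}
with_user = dict(bounded, user_id=10_000)

print(f"{estimated_series(bounded):,}")    # 30,000
print(f"{estimated_series(with_user):,}")  # 300,000,000
```

Real counts are usually lower because not every combination occurs, but the multiplication explains how a single per-user or per-request label can dominate the cost.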
Logs and traces complete the picture. Metrics tell you that latency increased; traces can show which downstream call got slow; logs can show the exact error or timeout. OpenTelemetry is commonly used to connect these signals, but the tool matters less than correlation. Use consistent service names, namespaces, versions, and trace IDs so you can move from an alert to the relevant logs without guessing.
For batch and worker systems, watch backlog age rather than only pod CPU. A queue worker can be healthy at the pod level while falling behind because incoming work exceeds processing capacity. Metrics such as oldest message age, jobs completed per minute, retries, and dead-letter counts often matter more than container utilization. HPA can scale from custom or external metrics when CPU is the wrong signal.
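Backlog-driven sizing can be sketched with basic rate arithmetic: workers must cover the arrival rate, plus whatever extra rate is needed to drain the existing backlog in the time you allow. The function name, rates, and backlog size below are hypothetical, and the model assumes each worker processes at a steady rate.

```python
import math


def workers_needed(arrival_rate, per_worker_rate, backlog=0, drain_seconds=None):
    """Workers required to keep up with arrivals and optionally drain a backlog in time."""
    required_rate = arrival_rate
    if backlog and drain_seconds:
        # Extra throughput needed to clear the existing backlog within the window.
        required_rate += backlog / drain_seconds
    return math.ceil(required_rate / per_worker_rate)


# 120 messages/s arriving, each worker handles 25 messages/s.
print(workers_needed(120, 25))                                    # 5
# Same traffic, plus 90,000 queued messages to clear within 10 minutes.
print(workers_needed(120, 25, backlog=90_000, drain_seconds=600)) # 11
```

None of these inputs show up in container CPU, which is why backlog and processing rate often make a better scaling signal for workers.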
Review dashboards after incidents. If responders had to run five manual commands to answer the same question, that question belongs on a dashboard or in a runbook. Monitoring gets better through use. The goal is not to predict every failure; it is to make the next investigation shorter and less dependent on one person's memory.