Advanced Troubleshooting: Kubernetes Logs, Events, and Metrics Deep Dive
Correlate Kubernetes logs, events, and metrics to debug pod failures, scheduling issues, and performance bottlenecks.
Kubernetes troubleshooting gets easier when you separate three questions: what did the container say, what did the control plane do, and what do the metrics show? Logs, events, and metrics answer different parts of the same incident.
The examples below show how to use all three together when a pod crashes, an image will not pull, a workload cannot schedule, or a service looks healthy but responds slowly.
Kubernetes Logs: The Foundation of Debugging
Logs are the detailed records of what an application or system process is doing. In Kubernetes, logs are generated by the containers running within your pods. They are often the first place to look when an application isn't behaving as expected.
Accessing Container Logs
The kubectl logs command is your primary tool for retrieving logs from pods. It's versatile and offers several useful options.
Get logs from a single container in a pod:
    kubectl logs <pod-name>
If a pod has only one container, this command works directly.
Get logs from a specific container in a multi-container pod:
    kubectl logs <pod-name> -c <container-name>
View logs from a previous instance of a crashed container: If a container has restarted due to an error, you can view its logs from before the restart using the --previous flag:
    kubectl logs <pod-name> --previous
Follow logs in real time: Similar to tail -f, the -f (or --follow) flag streams new log entries as they are generated, which is invaluable for debugging live issues.
    kubectl logs -f <pod-name> -c <container-name>
Limit logs by line count or time: You can specify how many lines from the end to retrieve (--tail) or how far back in time to go (--since).
    kubectl logs <pod-name> --tail=100   # Last 100 lines
    kubectl logs <pod-name> --since=1h   # Logs from the last hour
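When these flags are combined from a script or runbook tool, it can help to assemble the command in one place. The sketch below is illustrative only: build_logs_command is a hypothetical helper, not part of kubectl or any Kubernetes client library, and it only builds the argument list from the flags described above.

```python
from typing import List, Optional


def build_logs_command(
    pod: str,
    container: Optional[str] = None,
    previous: bool = False,
    follow: bool = False,
    tail: Optional[int] = None,
    since: Optional[str] = None,
) -> List[str]:
    """Return an argv list for `kubectl logs` (hypothetical helper)."""
    cmd = ["kubectl", "logs", pod]
    if container:
        cmd += ["-c", container]          # pick one container in a multi-container pod
    if previous:
        cmd.append("--previous")          # logs from the instance before the last restart
    if follow:
        cmd.append("--follow")            # stream new entries, like tail -f
    if tail is not None:
        cmd.append(f"--tail={tail}")      # last N lines
    if since:
        cmd.append(f"--since={since}")    # relative duration such as "1h"
    return cmd


if __name__ == "__main__":
    # Last 100 lines from the crashed instance of a container named "app" (made-up names).
    print(" ".join(build_logs_command("my-app-789c6f66-abcde", "app", previous=True, tail=100)))
```

The returned list can be passed to subprocess.run without a shell, which avoids quoting problems with pod or container names.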
Centralized Logging Solutions
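While kubectl logs is excellent for immediate debugging, it only reads what the node still retains: logs disappear when a pod is deleted, and older entries are lost once the node rotates the container's log files. That makes it impractical for large-scale, long-term log management. For production environments, centralized logging solutions are essential. These solutions typically involve: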
- Log Agents: Running an agent (e.g., Fluentd, Fluent Bit, Filebeat) on each node to collect logs from all pods.
- Log Storage & Indexing: Storing logs in a central repository (e.g., Elasticsearch, Loki, Splunk).
- Log Visualization & Analysis: Providing an interface to search, filter, and visualize logs (e.g., Kibana, Grafana, Splunk UI).
Best Practices for Logging
- Structured Logging: Emit logs in a structured format (e.g., JSON) to make them easily parsable and queryable by centralized logging systems.
- Appropriate Log Levels: Use different log levels (DEBUG, INFO, WARN, ERROR, FATAL) to categorize messages and control verbosity.
- Avoid Sensitive Information: Do not log sensitive data (passwords, PII) directly.
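As one possible illustration of structured logging with levels, the sketch below uses Python's standard logging module to write one JSON object per line to stdout, which is the stream the container runtime captures and kubectl logs reads. The JsonFormatter and make_logger names and the field names (ts, level, logger, message) are assumptions chosen for this example, not a required schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as a single-line JSON object (illustrative field names)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback inside the same JSON object so it stays one log entry.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def make_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_logger("orders")
    log.info("order accepted")
    # Log an identifier, not the secret itself.
    log.warning("payment retry for order %s", "A-1001")
```

Because each entry is a single JSON line, a log agent can parse the level and message fields without custom regular expressions.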
Kubernetes Events: The Cluster's Storyteller
Kubernetes events are records of state changes and operations occurring within the cluster. They provide crucial insights into what Kubernetes itself is doing (or failing to do) in response to your desired state. Events are invaluable for understanding why pods aren't scheduling, images aren't pulling, or volumes aren't mounting.
Accessing Kubernetes Events
Events in the current namespace:
    kubectl get events
This command shows recent events in the current namespace. Add --all-namespaces (or -A) to see events across the entire cluster.
A typical event output looks like this:
    LAST SEEN   TYPE      REASON      OBJECT                      MESSAGE
    3m21s       Normal    Scheduled   pod/my-app-789c6f66-abcde   Successfully assigned default/my-app-789c6f66-abcde to node01
    3m20s       Normal    Pulling     pod/my-app-789c6f66-abcde   Pulling image "example/my-app:1.2.3"
    2m58s       Warning   BackOff     pod/my-app-789c6f66-abcde   Back-off restarting failed container app
Events for one object:
    kubectl describe pod <pod-name>
The Events section at the bottom is often the fastest way to see scheduling, pulling, mounting, and restart problems for a single pod.
Sort events by creation time:
    kubectl get events --sort-by=.metadata.creationTimestamp
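For scripted triage, the same data is available as JSON through kubectl get events -o json. The sketch below assumes that output has already been parsed into a dictionary; recent_warnings and event_time are hypothetical helper names, and the sample timestamps used with it are made up. It keeps only Warning events and orders them by the most recent timestamp each event carries.

```python
from typing import Any, Dict, List


def event_time(event: Dict[str, Any]) -> str:
    """Best available timestamp for an event, as an RFC 3339 string."""
    return (
        event.get("lastTimestamp")
        or event.get("eventTime")
        or event.get("metadata", {}).get("creationTimestamp")
        or ""
    )


def recent_warnings(event_list: Dict[str, Any]) -> List[str]:
    """Return one summary line per Warning event, oldest first."""
    warnings = [e for e in event_list.get("items", []) if e.get("type") == "Warning"]
    # String comparison is enough for UTC timestamps that share one format.
    warnings.sort(key=event_time)
    lines = []
    for e in warnings:
        obj = e.get("involvedObject", {})
        target = obj.get("kind", "").lower() + "/" + obj.get("name", "")
        lines.append(f"{event_time(e)} {e.get('reason', '')} {target}: {e.get('message', '')}")
    return lines
```

A typical use would be to load the JSON with json.loads and print the returned lines next to the pod's logs for the same time window.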
What Events Usually Tell You
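Events are short-lived records (the API server keeps them for one hour by default, though managed clusters may configure this differently), so they are best for recent failures. Look for these common reasons: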
- FailedScheduling: The scheduler could not place the pod. Check node selectors, taints, tolerations, resource requests, and available capacity.
- ImagePullBackOff or ErrImagePull: Kubernetes could not pull the image. Check the image name, tag, registry access, and image pull secret.
- FailedMount: A volume could not mount. Check PVC binding, storage class, node permissions, and CSI driver health.
- BackOff: A container keeps failing. Pair the event with kubectl logs --previous.
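The reasons listed above can be captured as a small lookup table for a runbook or chat bot. This is only a sketch: NEXT_CHECKS and next_checks are hypothetical names, and the table contains just the checks already named in this section rather than a complete catalogue of event reasons.

```python
from typing import Dict, List

# Event reason -> things to check next (taken from the list above).
NEXT_CHECKS: Dict[str, List[str]] = {
    "FailedScheduling": [
        "node selectors", "taints and tolerations",
        "resource requests", "available node capacity",
    ],
    "ImagePullBackOff": [
        "image name and tag", "registry access", "image pull secret",
    ],
    "ErrImagePull": [
        "image name and tag", "registry access", "image pull secret",
    ],
    "FailedMount": [
        "PVC binding", "storage class", "node permissions", "CSI driver health",
    ],
    "BackOff": [
        "kubectl logs <pod-name> --previous",
    ],
}

DEFAULT_CHECK = ["kubectl describe the object and read the full event message"]


def next_checks(reason: str) -> List[str]:
    """Return suggested follow-up checks for an event reason."""
    return NEXT_CHECKS.get(reason, DEFAULT_CHECK)
```

For example, next_checks("FailedMount") returns the four storage-related checks, while an unlisted reason falls back to reading the full event message.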
Kubernetes Metrics: The Resource View
Metrics tell you whether the cluster has enough CPU, memory, and capacity for the workload. They also help you separate application bugs from resource pressure.
Quick Checks with metrics-server
If metrics-server is installed, use kubectl top:
kubectl top nodes
kubectl top pods
kubectl top pod <pod-name> --containers
High pod memory near the container limit often lines up with OOMKilled restarts. High node CPU can explain latency even when the pod logs look clean.
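To make "near the container limit" concrete, usage reported by kubectl top (for example 480Mi) can be compared with the limit from the pod spec. The sketch below is a simplified illustration: parse_memory and near_limit are hypothetical helpers, the 0.9 threshold is an arbitrary assumption, and the parser handles only plain byte counts and the common Ki/Mi/Gi/Ti and k/M/G/T suffixes, not every Kubernetes quantity form.

```python
_BINARY = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
_DECIMAL = {"k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' or '1G' to bytes."""
    quantity = quantity.strip()
    # Check two-letter binary suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    for suffix, factor in _DECIMAL.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(float(quantity))  # plain number of bytes


def near_limit(usage: str, limit: str, threshold: float = 0.9) -> bool:
    """True when usage is at or above `threshold` of the limit (assumed cutoff)."""
    return parse_memory(usage) / parse_memory(limit) >= threshold


if __name__ == "__main__":
    print(near_limit("480Mi", "512Mi"))  # 480/512 = 0.9375, so this prints True
```

A container that repeatedly reports True here shortly before restarting is a reasonable candidate for an OOMKilled investigation.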
Deeper Metrics with Prometheus
In production, Prometheus and Grafana usually provide the historical view that kubectl top lacks. Useful signals include:
- Container restarts over time.
- CPU throttling for containers with low CPU limits.
- Memory working set compared with limits.
- Pending pods by namespace.
- API server request latency and error rate.
- Node disk pressure, memory pressure, and network saturation.
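As a worked example for the CPU throttling signal, the fraction of throttled scheduler periods can be computed from two samples of a pair of counters. The sketch assumes the cAdvisor counters container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total are being scraped; throttled_fraction is a hypothetical helper and the sample numbers are invented. In a real setup the same ratio would normally be computed in a Prometheus query rather than in application code.

```python
def throttled_fraction(
    periods_start: float,
    periods_end: float,
    throttled_start: float,
    throttled_end: float,
) -> float:
    """Fraction of CFS periods in which the container was throttled.

    Inputs are two samples of each counter taken at the start and end of a window.
    """
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    # No elapsed periods, or a counter reset inside the window: report nothing.
    if periods <= 0 or throttled < 0:
        return 0.0
    return throttled / periods


if __name__ == "__main__":
    # 600 periods elapsed, 150 of them throttled: prints 0.25
    print(throttled_fraction(1000, 1600, 50, 200))
```

A sustained value well above zero for a latency-sensitive container suggests its CPU limit is too low for the work it is doing.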
Correlating Logs, Events, and Metrics
Use a time window and move from symptom to cause:
- Check pod state:
    kubectl get pod <pod-name> -o wide
    kubectl describe pod <pod-name>
- Read current and previous logs:
    kubectl logs <pod-name> -c <container-name>
    kubectl logs <pod-name> -c <container-name> --previous
- Check recent namespace events:
    kubectl get events --sort-by=.metadata.creationTimestamp
- Compare resource usage:
    kubectl top pod <pod-name> --containers
    kubectl top node
For example, a pod with CrashLoopBackOff, previous logs ending in an out-of-memory error, and metrics showing memory near the limit points to a memory limit or application memory problem. A pod stuck in Pending with FailedScheduling events and low node capacity points to scheduling pressure, not a container bug.
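The two cases in the example above can be written down as explicit rules. This is a deliberately small sketch, not a general diagnosis engine: Signals and diagnose are hypothetical names, the 0.9 memory threshold and the log substring match are assumptions, and anything that does not match either rule is reported as inconclusive.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Signals:
    pod_status: str                       # e.g. "CrashLoopBackOff" or "Pending"
    previous_log_tail: str = ""           # end of `kubectl logs --previous`
    event_reasons: List[str] = field(default_factory=list)
    memory_ratio: Optional[float] = None  # usage / limit, if metrics are available


def diagnose(s: Signals) -> str:
    """Combine the three signals into one of the two conclusions above."""
    if s.pod_status == "Pending" and "FailedScheduling" in s.event_reasons:
        return "scheduling pressure, not a container bug"

    oom_in_logs = "out of memory" in s.previous_log_tail.lower()
    memory_high = s.memory_ratio is not None and s.memory_ratio >= 0.9
    if s.pod_status == "CrashLoopBackOff" and oom_in_logs and memory_high:
        return "memory limit or application memory problem"

    return "inconclusive: widen the time window and gather more signals"


if __name__ == "__main__":
    print(diagnose(Signals("Pending", event_reasons=["FailedScheduling"])))
```

Requiring all three signals to agree before naming a cause mirrors the point of this section: one signal alone is a symptom, not a diagnosis.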
Takeaway
Do not debug Kubernetes from one signal alone. Logs explain application behavior, events explain control-plane decisions, and metrics explain resource pressure. When you line them up by time and object name, root causes become much easier to separate from symptoms.