Advanced Troubleshooting: Kubernetes Logs, Events, and Metrics Deep Dive

Correlate Kubernetes logs, events, and metrics to debug pod failures, scheduling issues, and performance bottlenecks.

Kubernetes troubleshooting gets easier when you separate three questions: what did the container say, what did the control plane do, and what do the metrics show? Logs, events, and metrics answer different parts of the same incident.

The examples below show how to use all three together when a pod crashes, an image will not pull, a workload cannot schedule, or a service looks healthy but responds slowly.

Kubernetes Logs: The Foundation of Debugging

Logs record what an application or system process is doing. In Kubernetes, kubectl logs returns whatever each container in a pod writes to stdout and stderr, so logs are often the first place to look when an application isn't behaving as expected.

Accessing Container Logs

The kubectl logs command is your primary tool for retrieving logs from pods. The options below cover the most common debugging cases.

  • Get logs from a single container in a pod:

    kubectl logs <pod-name>
    

    If a pod has only one container, this command works directly.

  • Get logs from a specific container in a multi-container pod:

    kubectl logs <pod-name> -c <container-name>
    
  • View logs from a previous instance of a crashed container: If a container has restarted due to an error, you can view its logs before the restart using the --previous flag:

    kubectl logs <pod-name> --previous
    
  • Follow logs in real-time: Similar to tail -f, the -f (or --follow) flag allows you to stream new log entries as they are generated, which is invaluable for debugging live issues.

    kubectl logs -f <pod-name> -c <container-name>
    
  • Limit logs by line count or time: Use --tail to retrieve only the last N lines, or --since to retrieve only logs newer than a duration such as 10m or 1h.

    kubectl logs <pod-name> --tail=100 # Last 100 lines
    kubectl logs <pod-name> --since=1h # Logs from the last hour
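
The flags above can be combined in one call. The following is a minimal sketch, assuming you want recent, timestamped output from every container in a pod. The helper name recent_logs and its defaults are illustrative; --all-containers, --prefix, and --timestamps are standard kubectl logs flags that are not covered in the list above.

```shell
# Sketch: recent, timestamped logs from every container in a pod.
# recent_logs is a hypothetical helper name; defaults are arbitrary.
recent_logs() {
  pod="$1"
  since="${2:-15m}"   # look-back window, defaults to 15 minutes
  lines="${3:-200}"   # max lines per container, defaults to 200
  kubectl logs "$pod" \
    --all-containers=true \
    --prefix \
    --timestamps \
    --since="$since" \
    --tail="$lines"
}

# Usage: recent_logs <pod-name> 1h 500
```

The --prefix flag labels each line with its pod and container, which keeps interleaved output from multi-container pods readable.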
    

Centralized Logging Solutions

While kubectl logs is excellent for immediate debugging, it reads log files kept on the node, which are rotated and then removed when the pod is deleted. That makes it impractical for large-scale, long-term log management. For production environments, centralized logging solutions are essential. These solutions typically involve:

  • Log Agents: Running an agent (e.g., Fluentd, Fluent Bit, Filebeat) on each node to collect logs from all pods.
  • Log Storage & Indexing: Storing logs in a central repository (e.g., Elasticsearch, Loki, Splunk).
  • Log Visualization & Analysis: Providing an interface to search, filter, and visualize logs (e.g., Kibana, Grafana, Splunk UI).

Best Practices for Logging

  • Structured Logging: Emit logs in a structured format (e.g., JSON) to make them easily parsable and queryable by centralized logging systems.
  • Appropriate Log Levels: Use different log levels (DEBUG, INFO, WARN, ERROR, FATAL) to categorize messages and control verbosity.
  • Avoid Sensitive Information: Do not log sensitive data (passwords, PII) directly.
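
Structured logs and log levels pay off as soon as you need to filter. The sketch below keeps only ERROR and FATAL entries from a stream of JSON lines. It assumes one JSON object per line with a string field named "level"; that field name and the helper name errors_only are assumptions, not something Kubernetes defines.

```shell
# Sketch: keep only ERROR and FATAL entries from JSON-lines logs on stdin.
# Assumes one JSON object per line with a string field named "level";
# both the field name and the helper name are illustrative.
errors_only() {
  grep -E '"level"[[:space:]]*:[[:space:]]*"(ERROR|FATAL)"'
}

# Usage: kubectl logs <pod-name> --since=1h | errors_only
```

A pattern match like this is only a quick check. For nested fields or escaped strings, a JSON-aware tool or the query language of your centralized logging system is more reliable.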

Kubernetes Events: The Cluster's Storyteller

Kubernetes events are records of state changes and operations occurring within the cluster. They provide crucial insights into what Kubernetes itself is doing (or failing to do) in response to your desired state. Events are invaluable for understanding why pods aren't scheduling, images aren't pulling, or volumes aren't mounting.

Accessing Kubernetes Events

  • Events in the current namespace:

    kubectl get events
    

    This command shows all recent events in the current namespace. You can add --all-namespaces to see events across the entire cluster.

    A typical event output looks like this:

    LAST SEEN   TYPE      REASON      OBJECT                         MESSAGE
    3m21s       Normal    Scheduled   pod/my-app-789c6f66-abcde      Successfully assigned default/my-app-789c6f66-abcde to node01
    3m20s       Normal    Pulling     pod/my-app-789c6f66-abcde      Pulling image "example/my-app:1.2.3"
    2m58s       Warning   BackOff     pod/my-app-789c6f66-abcde      Back-off restarting failed container app
    
  • Events for one object:

    kubectl describe pod <pod-name>
    

    The Events section at the bottom is often the fastest way to see scheduling, pulling, mounting, and restart problems for a single pod.

  • Sort events by creation time:

    kubectl get events --sort-by=.metadata.creationTimestamp
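
On a busy namespace the full event list is noisy. As a hedged sketch, field selectors can narrow it to Warning events for a single pod. The helper name pod_warnings and the default namespace are illustrative; type, involvedObject.kind, and involvedObject.name are field selectors that kubectl get events accepts.

```shell
# Sketch: Warning events for one pod, sorted by creation time.
# pod_warnings is a hypothetical helper name.
pod_warnings() {
  pod="$1"
  ns="${2:-default}"   # namespace, defaults to "default"
  kubectl get events \
    --namespace "$ns" \
    --field-selector "type=Warning,involvedObject.kind=Pod,involvedObject.name=$pod" \
    --sort-by=.metadata.creationTimestamp
}

# Usage: pod_warnings <pod-name> <namespace>
```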
    

What Events Usually Tell You

Events are short-lived records (by default the API server keeps them for about one hour), so they are best for recent failures. Look for these common reasons:

  • FailedScheduling: The scheduler could not place the pod. Check node selectors, taints, tolerations, resource requests, and available capacity.
  • ImagePullBackOff or ErrImagePull: Kubernetes could not pull the image. These strings appear as the pod status and in the message of Failed and BackOff events. Check the image name, tag, registry access, and image pull secret.
  • FailedMount: A volume could not mount. Check PVC binding, storage class, node permissions, and CSI driver health.
  • BackOff: A container keeps failing. Pair the event with kubectl logs --previous.
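
The list above amounts to a lookup from reason to first check. A minimal sketch of that lookup follows; the function name next_check is hypothetical and the hints simply restate the list, so treat it as a memory aid rather than a diagnosis.

```shell
# Sketch: map a reason string from events or pod status to a first check.
# next_check is a hypothetical helper; the hints restate the list above.
next_check() {
  case "$1" in
    FailedScheduling)
      echo "check node selectors, taints, tolerations, requests, and capacity" ;;
    ErrImagePull|ImagePullBackOff)
      echo "check image name, tag, registry access, and image pull secret" ;;
    FailedMount)
      echo "check PVC binding, storage class, node permissions, and CSI driver" ;;
    BackOff)
      echo "read the previous container logs with kubectl logs --previous" ;;
    *)
      echo "no hint for reason: $1" ;;
  esac
}

# Usage: next_check FailedMount
```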

Kubernetes Metrics: The Resource View

Metrics tell you whether the cluster has enough CPU, memory, and capacity for the workload. They also help you separate application bugs from resource pressure.

Quick Checks with metrics-server

If metrics-server is installed, use kubectl top:

    kubectl top nodes
    kubectl top pods
    kubectl top pod <pod-name> --containers

High pod memory near the container limit often lines up with OOMKilled restarts. High node CPU can explain latency even when the pod logs look clean.
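
To spot containers that are close to a known limit, the kubectl top output can be filtered by a threshold. This is a sketch under stated assumptions: that kubectl top pod --containers --no-headers prints pod, container, CPU, and memory columns in that order, and that memory is reported in Mi (for example 480Mi). The helper name high_memory is hypothetical, and the threshold is a number you supply, not one read from the pod spec.

```shell
# Sketch: list containers whose memory usage exceeds a threshold in MiB.
# Assumes the columns are POD, CONTAINER, CPU, MEMORY and that memory is
# printed in Mi. high_memory is a hypothetical helper name.
high_memory() {
  threshold_mi="$1"
  kubectl top pod --containers --no-headers |
    awk -v max="$threshold_mi" '
      $4 ~ /Mi$/ {
        mem = $4
        sub(/Mi$/, "", mem)   # strip the unit, keep the number
        if (mem + 0 > max + 0) print $1, $2, $4
      }'
}

# Usage: high_memory 400
```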

Deeper Metrics with Prometheus

In production, Prometheus and Grafana usually provide the historical view that kubectl top lacks. Useful signals include:

  • Container restarts over time.
  • CPU throttling for containers with low CPU limits.
  • Memory working set compared with limits.
  • Pending pods by namespace.
  • API server request latency and error rate.
  • Node disk pressure, memory pressure, and network saturation.
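
Several of these signals map to short PromQL expressions. The sketch below assumes Prometheus scrapes kube-state-metrics and cAdvisor under their usual metric names; if your setup relabels or omits them, the queries need adjusting. The PROM_URL variable, its default, and both function names are hypothetical.

```shell
# Sketch: PromQL for a few of the signals above, plus a query helper.
# Assumes kube-state-metrics and cAdvisor metrics are scraped under their
# usual names; PROM_URL and both function names are hypothetical.
promql_for() {
  case "$1" in
    restarts)
      echo 'increase(kube_pod_container_status_restarts_total[1h])' ;;
    throttling)
      echo 'rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m])' ;;
    pending)
      echo 'sum by (namespace) (kube_pod_status_phase{phase="Pending"})' ;;
    *)
      echo "unknown signal: $1" >&2
      return 1 ;;
  esac
}

prom_query() {
  query="$(promql_for "$1")" || return 1
  # Instant query against the Prometheus HTTP API.
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$query"
}

# Usage: PROM_URL=http://<prometheus-host>:9090 prom_query restarts
```

The response is JSON from the Prometheus query API. In practice the same expressions usually live in Grafana panels or alert rules rather than in ad hoc curl calls.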

Correlating Logs, Events, and Metrics

Use a time window and move from symptom to cause:

  1. Check pod state:
    kubectl get pod <pod-name> -o wide
    kubectl describe pod <pod-name>
    
  2. Read current and previous logs:
    kubectl logs <pod-name> -c <container-name>
    kubectl logs <pod-name> -c <container-name> --previous
    
  3. Check recent namespace events:
    kubectl get events --sort-by=.metadata.creationTimestamp
    
  4. Compare resource usage:
    kubectl top pod <pod-name> --containers
    kubectl top node
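
The four steps can be wrapped in one helper so the output lands in a single, time-adjacent capture. This is a sketch, not a standard tool: triage_pod is a hypothetical name, the --tail=100 cap is an arbitrary choice, and the function only runs the commands already listed.

```shell
# Sketch: run the four steps in order for one pod and container.
# triage_pod is a hypothetical helper; it only wraps the commands above.
triage_pod() {
  pod="$1"
  container="$2"

  echo "== 1. pod state =="
  kubectl get pod "$pod" -o wide
  kubectl describe pod "$pod"

  echo "== 2. current and previous logs =="
  kubectl logs "$pod" -c "$container" --tail=100
  # --previous fails when the container has never restarted; keep going.
  kubectl logs "$pod" -c "$container" --previous --tail=100 ||
    echo "(no previous container instance)"

  echo "== 3. recent events =="
  kubectl get events --sort-by=.metadata.creationTimestamp

  echo "== 4. resource usage =="
  kubectl top pod "$pod" --containers
  kubectl top node
}

# Usage: triage_pod <pod-name> <container-name> > triage.txt
```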
    

For example, a pod in CrashLoopBackOff, with previous logs that end in an out-of-memory error or simply stop mid-operation, a last state of OOMKilled in kubectl describe pod, and metrics showing memory near the limit, points to a memory limit that is too low or an application memory problem. A pod stuck in Pending with FailedScheduling events and low node capacity points to scheduling pressure, not a container bug.
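
A container killed for exceeding its memory limit may not get the chance to log anything, so it helps to confirm the termination reason from pod status. The sketch below reads the lastState.terminated.reason field of each container status; the helper name last_exit_reason is hypothetical.

```shell
# Sketch: print why each container in a pod last terminated.
# last_exit_reason is a hypothetical helper name; the JSONPath reads the
# pod status field lastState.terminated.reason for every container.
last_exit_reason() {
  kubectl get pod "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Usage: last_exit_reason <pod-name>
```

A reason of OOMKilled next to a container name supports the memory explanation. An empty reason means that container has no recorded previous termination.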

Takeaway

Do not debug Kubernetes from one signal alone. Logs explain application behavior, events explain control-plane decisions, and metrics explain resource pressure. When you line them up by time and object name, root causes become much easier to separate from symptoms.