Kubernetes Pod Failure Troubleshooting: A Comprehensive Guide

Navigate the complexities of Kubernetes Pod failures with this comprehensive guide. Learn the structured process for diagnosing common issues like CrashLoopBackOff, ImagePullBackOff, and resource exhaustion. We detail how to leverage crucial tools like `kubectl describe` and `kubectl logs --previous` to pinpoint the root cause, interpret container exit states, and implement practical fixes to maintain reliable application uptime and stability.

Kubernetes pod failure troubleshooting is less about memorizing every status and more about learning where Kubernetes leaves clues. A pod almost never fails "silently." The scheduler, kubelet, container runtime, image registry, volume plugin, and your application all leave traces in different places. The trick is checking them in the right order so you do not spend twenty minutes reading application logs for a pod that never pulled its image.

I usually start with one question: did the pod fail before the container started, while the container was starting, or after the application began running? That single split keeps the investigation grounded. Pending usually points to scheduling, storage, or image setup. ImagePullBackOff points to the registry path, tag, credentials, or node egress. CrashLoopBackOff usually means the process starts and then exits, though the reason might be configuration, a missing file, a bad command, a failed dependency, or memory pressure.


The Three Pillars of Pod Diagnosis

Troubleshooting starts with three kubectl commands that, taken together, surface most of what Kubernetes knows about a failing Pod.

1. Initial Status Check (kubectl get pods)

The first step is always to determine the current state of the Pod and its containers. Pay close attention to the STATUS and READY columns.

kubectl get pods -n my-namespace

Interpreting Pod Status

Status | Meaning | Initial Action
Running | At least one container is running; this does not always mean the app is serving traffic. | Check READY, restarts, and readiness events.
Pending | Pod has been accepted by Kubernetes but containers are not yet created. | Check scheduling events or image pull status.
CrashLoopBackOff | Container starts, exits, and the kubelet restarts it repeatedly with an increasing back-off delay. | Check application logs (kubectl logs --previous).
ImagePullBackOff | The kubelet cannot pull the required container image. | Check image name, tag, and registry credentials.
Error | A container exited due to a runtime error or failed startup command. | Check logs and describe events.
Terminating/Unknown | Pod is shutting down, or the node's kubelet is unresponsive. | Check node health.
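In a busy namespace it helps to filter the listing down to the pods worth a second look. The following is a minimal sketch, not a standard tool: pods_needing_attention is a hypothetical helper name, and it assumes the default single-namespace column layout (NAME, READY, STATUS as the first three columns).

```shell
# Filter `kubectl get pods` output (single namespace, default columns:
# NAME READY STATUS ...) down to pods that deserve a closer look.
pods_needing_attention() {
  awk '
    NR == 1 { print; next }        # keep the header row
    $3 == "Completed" { next }     # finished Jobs are expected to show 0/1
    {
      split($2, ready, "/")        # READY looks like "1/2"
      if ($3 != "Running" || ready[1] != ready[2]) print
    }
  '
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get pods -n my-namespace | pods_needing_attention
```

A pod that is Running with READY 1/2 is kept on purpose, because that is exactly the case the STATUS column alone hides.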

2. Deep Inspection (kubectl describe pod)

If the status is anything other than Running, or the READY count is lower than expected, the describe command provides crucial context, detailing scheduling decisions, resource allocation, and container states.

kubectl describe pod [POD_NAME] -n my-namespace

Focus on these sections in the output:

  • Containers/Init Containers: Check the State (especially Waiting or Terminated) and the Last State (where the failure reason is often recorded, e.g., Reason: OOMKilled).
  • Resource Limits: Verify that Limits and Requests are correctly set.
  • Events: This is the most critical section. Events show the lifecycle history, including scheduling failures, volume mounting issues, and image pull attempts.

Tip: If the Events section shows a message like "0/N nodes are available," the Pod is failing to schedule. The rest of the message names the reason for each group of nodes: insufficient resources (CPU, memory), untolerated taints, or affinity rules not being met.

Read events from the bottom upward when you want the newest clue, but do not ignore older events. A pod can have more than one problem. For example, a deployment may start with FailedScheduling because the requested memory is too high, then later move to ImagePullBackOff after a node is added. If you only look at the final status, you may miss the change that made the problem move forward.
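To see the whole sequence at once, events can also be listed on their own, oldest first. This is a sketch under a few assumptions: pod_events is a hypothetical wrapper name, the events have not yet expired from the cluster (they are kept only for a limited time), and sorting by lastTimestamp is good enough for your cluster's event format.

```shell
# Print events for one pod, oldest first, so the order of failures is visible.
pod_events() {
  # $1 = namespace, $2 = pod name
  kubectl get events -n "$1" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$2" \
    --sort-by=.lastTimestamp
}

# Usage:
#   pod_events my-namespace [POD_NAME]
```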

3. Reviewing Logs (kubectl logs)

For runtime application issues, logs provide the stack trace or error messages explaining why the process terminated.

# Check current container logs
kubectl logs [POD_NAME] -n my-namespace

# Check logs for a specific container within the Pod
kubectl logs [POD_NAME] -c [CONTAINER_NAME] -n my-namespace

# Crucial for CrashLoopBackOff: Check the logs from the *previous* failed run
kubectl logs [POD_NAME] --previous -n my-namespace

If the pod has sidecars, always include -c. Many frustrating investigations come from reading the logs of the healthy sidecar instead of the failing application container. For init container failures, use the init container name with -c as well:

kubectl logs [POD_NAME] -c [INIT_CONTAINER_NAME] -n my-namespace
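When you do not yet know which container is failing, one option is to loop over all of them. The sketch below uses a hypothetical helper name (previous_logs_all) and an arbitrary 50-line tail; it reads container names from the Pod spec and prints the previous run's logs for each.

```shell
# Print the previous-run logs of every container in a pod, labelled by name.
previous_logs_all() {
  # $1 = namespace, $2 = pod name
  names=$(kubectl get pod "$2" -n "$1" \
    -o jsonpath='{.spec.initContainers[*].name} {.spec.containers[*].name}') || return 1
  for c in $names; do
    echo "=== $c (previous run) ==="
    # --previous fails for containers that never restarted; report and continue.
    kubectl logs "$2" -c "$c" -n "$1" --previous --tail=50 \
      || echo "(no previous logs for $c)"
  done
}

# Usage:
#   previous_logs_all my-namespace [POD_NAME]
```

Containers that have never restarted simply report that no previous logs exist, which is itself a useful hint about where the failure is not.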

Common Pod Failure Scenarios and Solutions

Most Pod failures fall into a few recognizable patterns, each requiring a targeted diagnostic approach.

Scenario 1: CrashLoopBackOff

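This is one of the most common Pod failures. It means the container starts, its main process exits (usually with a non-zero exit code, though a clean exit with code 0 also counts when the restart policy is Always), and the kubelet restarts it with an increasing back-off delay between attempts.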

Diagnosis:

  1. Use kubectl logs --previous to view the traceback or exit reason.
  2. Use kubectl describe to check the Exit Code in the Last State section.

Common Causes & Fixes:

  • Exit Code 1/2: General application error, missing configuration file, database connectivity failure, or application crash due to bad input.
    • Fix: Debug the application code or check ConfigMaps/Secrets being mounted.
  • Missing Dependencies: The entry point script requires files or environments that are not present.
    • Fix: Verify the Dockerfile and image build process.
  • Bad command or args: The container image is valid, but the command in the Pod spec overrides the image entrypoint incorrectly.
    • Fix: Compare the Deployment command and args with the image's expected startup command. Test the same image locally if possible.
  • Probe-induced restarts: A liveness probe may kill a slow-starting app before it finishes warming up.
    • Fix: Increase initialDelaySeconds, use a startupProbe, or point the probe at a cheaper health endpoint.
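The exit code in Last State narrows these causes down quickly. The helper below is a rough sketch: explain_exit_code is a hypothetical name, and the mapping follows common shell and signal conventions (128 plus the signal number), not anything Kubernetes defines. An application is free to use its own codes.

```shell
# Translate a container exit code into a likely cause (conventions, not guarantees).
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly; check whether the process is meant to stay in the foreground" ;;
    1)   echo "general application error; read the previous logs" ;;
    126) echo "command found but not executable; check file permissions and command/args" ;;
    127) echo "command not found; check command/args and the image contents" ;;
    137) echo "killed by SIGKILL (128+9); check for OOMKilled or a forced kill" ;;
    139) echo "segmentation fault (128+11, SIGSEGV)" ;;
    143) echo "terminated by SIGTERM (128+15); often a graceful stop request" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-defined exit code; read the previous logs"
      fi ;;
  esac
}

# Usage (reads the first container's last exit code):
#   explain_exit_code "$(kubectl get pod [POD_NAME] -n my-namespace \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```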

A practical pattern is to deploy one temporary copy with the same image but a harmless command, then inspect the filesystem and environment:

kubectl run debug-image \
  --image=registry.example.com/app:tag \
  --restart=Never \
  --command -- sleep 3600

kubectl exec -it debug-image -- /bin/sh

This does not replace fixing the Deployment, but it helps answer simple questions quickly: is the config file actually in the image, does the binary exist, does the container have the expected shell, and are environment variables present?

Scenario 2: ImagePullBackOff / ErrImagePull

This occurs when the Kubelet cannot fetch the container image specified in the Pod definition.

Diagnosis:

  1. Check the Events section of kubectl describe for the specific error message (e.g., 404 Not Found or authentication required).

Common Causes & Fixes:

  • Typo or Wrong Tag: The image name or tag is incorrect.
    • Fix: Correct the image name in the Deployment or Pod specification.
  • Private Registry Access: The cluster does not have credentials to pull from a private registry (for example, a private repository on Docker Hub, GCR, or ECR).
    • Fix: Ensure an appropriate imagePullSecret is referenced in the Pod spec and that the Secret exists in the same namespace as the Pod.

# Example Pod spec snippet for using a pull secret
spec:
  containers:
  ...
  imagePullSecrets:
  - name: my-registry-secret

Also check where the pull secret lives. Kubernetes secrets are namespaced. A secret named regcred in default will not help a pod in payments. If the same image works in one namespace but fails in another, compare service accounts and image pull secrets before assuming the registry is broken:

kubectl get serviceaccount default -n payments -o yaml
kubectl get secret regcred -n payments
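Those two checks can be combined into one small helper. This is only a sketch: check_pull_secret is a hypothetical name, and it assumes the secret uses the kubernetes.io/dockerconfigjson type (older clusters may still carry the legacy dockercfg type). It also does not inspect imagePullSecrets set directly on the Pod spec.

```shell
# Check that a pull secret exists in the namespace and see what the service account references.
check_pull_secret() {
  # $1 = namespace, $2 = secret name, $3 = service account (usually "default")
  secret_type=$(kubectl get secret "$2" -n "$1" -o jsonpath='{.type}' 2>/dev/null) || secret_type=""
  if [ -z "$secret_type" ]; then
    echo "secret $2 not found in namespace $1"
  elif [ "$secret_type" != "kubernetes.io/dockerconfigjson" ]; then
    echo "secret $2 has type $secret_type, expected kubernetes.io/dockerconfigjson"
  else
    echo "secret $2 exists with the expected type"
  fi
  attached=$(kubectl get serviceaccount "$3" -n "$1" \
    -o jsonpath='{.imagePullSecrets[*].name}' 2>/dev/null) || attached=""
  echo "pull secrets on service account $3: ${attached:-none}"
}

# Usage:
#   check_pull_secret payments regcred default
```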

Scenario 3: Pending Status (Stuck)

A Pod remains in Pending status, usually indicating that it cannot be scheduled onto a Node or it is waiting for resources (like a PersistentVolume).

Diagnosis:

  1. Run kubectl describe and look at the Events section.

Common Causes & Fixes:

  • Resource Exhaustion: The cluster lacks Nodes with enough available CPU or Memory to satisfy the Pod's requests.
    • Fix: Increase cluster size, or reduce Pod resource requests (if feasible).
    • Event Message Example: 0/4 nodes are available: 4 Insufficient cpu.
  • Volume Binding Issues: The Pod requires a PersistentVolumeClaim (PVC) that cannot be bound to an underlying PersistentVolume (PV).
    • Fix: Check the status of the PVC (kubectl get pvc) and ensure the StorageClass is functioning.
  • Selector or affinity mismatch: The pod asks for a node label that does not exist, or a required affinity rule excludes every node.
    • Fix: Compare nodeSelector, nodeAffinity, and node labels with kubectl get nodes --show-labels.
  • Taints not tolerated: Nodes are available, but they repel this pod because it lacks a matching toleration.
    • Fix: Add the intended toleration to the pod, or remove the taint if it no longer represents a real placement rule.
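The causes above usually show up as recognizable phrases in the FailedScheduling event message. The sketch below maps those phrases to a next check. scheduling_hint is a hypothetical helper, and the keyword matching is an assumption: scheduler wording varies between Kubernetes versions, so treat a miss as "read the full message," not as "no problem."

```shell
# Suggest where to look based on keywords in a FailedScheduling message.
scheduling_hint() {
  # $1 = FailedScheduling event message
  msg=$1
  found=0
  case "$msg" in *Insufficient*)
    echo "resources: compare pod requests with node allocatable (kubectl describe node)"; found=1 ;;
  esac
  case "$msg" in *taint*)
    echo "taints: compare pod tolerations with node taints"; found=1 ;;
  esac
  case "$msg" in *affinity*|*selector*)
    echo "placement: compare nodeSelector/affinity with kubectl get nodes --show-labels"; found=1 ;;
  esac
  case "$msg" in *PersistentVolumeClaim*)
    echo "storage: check kubectl get pvc and the StorageClass"; found=1 ;;
  esac
  [ "$found" -eq 1 ] || echo "unrecognized: read the full event text"
}

# Usage:
#   scheduling_hint "0/4 nodes are available: 4 Insufficient cpu."
```

A single message can name several reasons at once, one per group of nodes, so the helper prints every hint that matches.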

Scenario 4: OOMKilled (Out of Memory Killed)

While this usually results in a CrashLoopBackOff status, the underlying cause is specific: the container used more memory than the limit defined in its specification, so the Linux kernel's OOM killer terminated the process. The kubelet then reports the container as OOMKilled, typically with exit code 137 (128 + SIGKILL).

Diagnosis:

  1. Check kubectl describe -> Last State -> Reason: OOMKilled.

Fixes:

  1. Increase Limits: Increase the memory limit in the Pod spec, providing more headroom.
  2. Optimize Application: Profile the application to reduce its memory footprint.
  3. Set Requests: Ensure requests are set close to the actual steady-state usage to improve scheduling reliability.

# Example container resources snippet
resources:
  limits:
    memory: "512Mi" # Increase this value
  requests:
    memory: "256Mi"

Be careful with memory "fixes" that only raise the limit. If the application has a leak, a higher limit may only delay the next failure and make the node carry more risk. Look at memory over time in your metrics system. A sawtooth pattern that returns to baseline after garbage collection is different from a steady climb until OOM.
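Before touching limits, it is worth confirming which container was actually killed and how often. A minimal sketch, assuming a hypothetical helper name (oom_killed_containers) and that the last termination is still recorded in the Pod status:

```shell
# List containers in a pod whose last termination reason was OOMKilled.
oom_killed_containers() {
  # $1 = namespace, $2 = pod name
  kubectl get pod "$2" -n "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.restartCount}{"\n"}{end}' \
    | awk -F '\t' '$2 == "OOMKilled" { print $1 " was OOMKilled (restarts: " $3 ")" }'
}

# Usage:
#   oom_killed_containers my-namespace [POD_NAME]
```

In a multi-container pod this avoids raising the limit on the application when the sidecar was the one running out of memory.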

Scenario 5: CreateContainerConfigError and CreateContainerError

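These statuses are easy to overlook because they do not sound like application failures. CreateContainerConfigError usually means the kubelet could not assemble the container configuration, most often because a referenced object is missing. CreateContainerError means the configuration was assembled but the container runtime could not create the container from it.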

Common causes include:

  • A referenced ConfigMap or Secret does not exist in the namespace.
  • A key inside a ConfigMap or Secret is misspelled.
  • A volume mount path conflicts with another mount.
  • The container references an invalid security context.

The fastest check is still describe:

kubectl describe pod [POD_NAME] -n my-namespace

Look for event messages such as secret "app-config" not found or configmap "settings" not found. Then verify the object:

kubectl get secret app-config -n my-namespace
kubectl get configmap settings -n my-namespace -o yaml

This is a common deployment-pipeline mistake. The application manifest is applied, but the secret creation step failed or ran against the wrong namespace.
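One way to catch this earlier is a pre-flight step in the pipeline that fails before the manifest is applied. The sketch below is an assumption about how such a step could look, not a standard tool: require_objects is a hypothetical helper, and you supply the list of objects the manifest depends on.

```shell
# Fail (non-zero exit) if any required object is missing from the namespace.
require_objects() {
  # $1 = namespace; remaining args = kind/name pairs such as secret/app-config
  ns=$1
  shift
  missing=0
  for obj in "$@"; do
    if kubectl get "$obj" -n "$ns" >/dev/null 2>&1; then
      echo "ok       $obj"
    else
      echo "MISSING  $obj"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   require_objects my-namespace secret/app-config configmap/settings
```

Because the namespace is an explicit argument, the check also catches the case where the secret was created, but in the wrong namespace.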

Scenario 6: Running but Not Ready

A pod can show Running while still being unusable. The READY column tells you how many containers are ready according to their readiness probes. A pod with 1/2 or 0/1 may be alive but removed from Service endpoints.

Check endpoints when traffic is failing but pods look alive:

kubectl get endpoints [SERVICE_NAME] -n my-namespace
kubectl describe pod [POD_NAME] -n my-namespace

If the endpoint list is empty, the problem may be a readiness probe, a Service selector mismatch, or the application listening on a different port than the Service expects. In real incidents, this is where people lose time: they keep restarting pods even though the pods are not the reason traffic is missing.
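To test the selector-mismatch theory directly, you can ask which pods the Service selector actually matches. This is a sketch with a hypothetical helper name (service_backends); it assumes the Service uses a plain label selector and reads it with a Go template.

```shell
# Show the pods that a Service's label selector matches.
service_backends() {
  # $1 = namespace, $2 = service name
  selector=$(kubectl get service "$2" -n "$1" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}') || return 1
  selector=${selector%,}   # drop the trailing comma
  if [ -z "$selector" ]; then
    echo "service $2 has no selector"
    return 1
  fi
  echo "selector: $selector"
  kubectl get pods -n "$1" -l "$selector" -o wide
}

# Usage:
#   service_backends my-namespace [SERVICE_NAME]
```

If no pods come back, the selector or the pod labels are wrong. If pods come back but the endpoint list is still empty, look at readiness instead. On newer clusters the same endpoint data is also available through kubectl get endpointslices -n my-namespace -l kubernetes.io/service-name=[SERVICE_NAME].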


Preventing Future Failures: Best Practices

Robust applications require careful configuration to prevent common deployment pitfalls.

Use Liveness and Readiness Probes

Proper implementation of probes allows Kubernetes to intelligently manage container health:

  • Liveness Probes: Determine if the container is healthy enough to continue running. If the liveness probe fails failureThreshold times in a row, the kubelet restarts the container, which recovers processes that are deadlocked or hung but still running.
  • Readiness Probes: Determine if the container is ready to serve traffic. If the readiness probe fails, the Pod is removed from service endpoints, preventing failed requests while the container recovers.

Do not point liveness probes at deep dependency checks unless you really want Kubernetes to restart the container whenever that dependency has a short outage. A database being unavailable for thirty seconds is usually not proof that the process is dead. Readiness is the better place to say, "do not send this pod traffic right now."
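For slow-starting applications, a startupProbe holds off liveness checks until the first success, and its time budget is roughly failureThreshold multiplied by periodSeconds. The sketch below prints a snippet sized for a given worst-case startup time. The helper name, the /healthz path, and port 8080 are all hypothetical placeholders to replace with your own values.

```shell
# Print a startupProbe snippet whose budget covers a worst-case startup time.
startup_probe_snippet() {
  # $1 = worst-case startup time in seconds, $2 = periodSeconds
  # failureThreshold is rounded up so threshold * period covers the startup time.
  threshold=$(( ($1 + $2 - 1) / $2 ))
  cat <<EOF
startupProbe:
  httpGet:
    path: /healthz   # hypothetical endpoint
    port: 8080       # hypothetical port
  periodSeconds: $2
  failureThreshold: $threshold
EOF
}

# Usage: allow up to 300 seconds of startup, probing every 10 seconds
#   startup_probe_snippet 300 10
```

Sizing the budget from a measured startup time is usually safer than raising initialDelaySeconds on the liveness probe, because a fast start is still detected quickly.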

Enforce Resource Limits and Requests

Always define resource requirements for containers. This prevents Pods from consuming excessive resources (leading to Node instability) and ensures the scheduler can place the Pod on a Node with sufficient capacity.

Utilize Init Containers for Setup

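If a Pod requires a dependency check or data setup before the main application starts (e.g., waiting for a database migration to complete), use an Init Container. Init containers run to completion, in order, before any application container starts. If one fails, the kubelet retries it according to the Pod's restartPolicy (with Never, the Pod is marked failed instead), and the STATUS column shows values such as Init:Error or Init:CrashLoopBackOff, cleanly isolating setup errors from application runtime errors.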

A Practical Triage Flow

When you are on call, use a repeatable path:

  1. Check kubectl get pods -n <namespace> -o wide so you see status, restarts, age, and node placement.
  2. Run kubectl describe pod and read Events, State, Last State, mounts, and resource settings.
  3. Pull logs with kubectl logs, adding --previous for restarted containers and -c for multi-container pods.
  4. If the pod is Pending, inspect scheduling, taints, node labels, and PVCs before reading app logs.
  5. If the pod is Running but not receiving traffic, inspect readiness probes, Service selectors, and endpoints.
  6. If the pod was OOMKilled, compare limits with real memory graphs before simply increasing the number.

This order keeps you from jumping straight to the application when Kubernetes has not even started it yet.
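The same flow can be written down as a lookup from the STATUS column to the next step, which is handy in a runbook. This is a sketch only: next_step is a hypothetical helper, and the suggestions simply restate the list above.

```shell
# Map a pod STATUS value to the next troubleshooting step.
next_step() {
  # $1 = value of the STATUS column from kubectl get pods
  case "$1" in
    Pending)
      echo "describe pod: check FailedScheduling events, taints, node labels, PVCs" ;;
    ImagePullBackOff|ErrImagePull)
      echo "describe pod: check image name, tag, and imagePullSecrets" ;;
    CreateContainerConfigError|CreateContainerError)
      echo "describe pod: check referenced ConfigMaps, Secrets, and mounts" ;;
    CrashLoopBackOff|Error)
      echo "logs --previous: then check Last State exit code and probes" ;;
    OOMKilled)
      echo "compare the memory limit with real memory graphs" ;;
    Running)
      echo "check READY, readiness probes, Service selectors, and endpoints" ;;
    *)
      echo "describe pod: read Events, State, and Last State" ;;
  esac
}

# Usage:
#   next_step CrashLoopBackOff
```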

Final Check

The most useful habit is to separate symptoms from causes. CrashLoopBackOff is a symptom. The cause might be a missing secret, a bad migration, a liveness probe, or a memory limit. Pending is a symptom. The cause might be CPU requests, a PVC, a taint, or a node selector. Let the pod status tell you where to look, then let events and logs tell you what changed.