Kubernetes Pod Failure Troubleshooting: A Comprehensive Guide

Pods are the smallest deployable units in Kubernetes, running the containers that power your application. When a Pod fails, it directly impacts the availability and reliability of your service. Diagnosing Pod failures quickly and accurately is a fundamental skill for any Kubernetes administrator or developer.

This guide provides a structured, step-by-step approach to diagnosing the most common Pod failure scenarios. We will cover the essential kubectl commands used for inspection, interpret various Pod statuses, and outline actionable solutions to restore your applications to a stable, running state.

The Three Pillars of Pod Diagnosis

Troubleshooting begins with three kubectl commands that, together, gather most of the available information about a failing Pod.

1. Initial Status Check (`kubectl get pods`)

The first step is always to determine the current state of the Pod and its containers. Pay close attention to the STATUS and READY columns.

kubectl get pods -n my-namespace

Interpreting Pod Status

Status	Meaning	Initial Action
Running	Pod is bound to a node and at least one container is running, starting, or restarting.	Check the READY column; a Running Pod can still be failing its readiness probe.
Pending	Pod has been accepted by Kubernetes but containers are not yet created.	Check scheduling events or image pull status.
CrashLoopBackOff	Container starts, crashes, and Kubelet attempts to restart it repeatedly.	Check application logs (`kubectl logs --previous`).
ImagePullBackOff	Kubelet cannot pull the required container image.	Check image name, tag, and registry credentials.
Error	Pod exited due to a runtime error or failed startup command.	Check logs and `describe` events.
Terminating/Unknown	Pod is shutting down or the Kubelet host is unresponsive.	Check node health.
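
On a busy namespace it can help to filter the listing down to the Pods that need attention. The following is a minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS); the function name unhealthy_pods is hypothetical.

```shell
# Hypothetical helper: print NAME, READY, and STATUS for Pods that need attention.
# A Pod is listed when its STATUS is neither Running nor Completed, or when it is
# Running but the READY column shows fewer ready containers than expected.
# Usage: unhealthy_pods my-namespace
unhealthy_pods() {
  kubectl get pods -n "$1" --no-headers |
    awk '{
      split($2, ready, "/")   # READY looks like "1/2"
      if (($3 != "Running" && $3 != "Completed") ||
          ($3 == "Running" && ready[1] != ready[2]))
        print $1, $2, $3
    }'
}
```

Calling unhealthy_pods my-namespace lists only the Pods worth passing to the describe and logs steps that follow.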

2. Deep Inspection (`kubectl describe pod`)

If the status is anything other than Running, the describe command provides crucial context, detailing scheduling decisions, resource allocation, and container states.

kubectl describe pod [POD_NAME] -n my-namespace

Focus on these sections in the output:

Containers/Init Containers: Check the State (especially Waiting or Terminated) and the Last State (where the failure reason is often recorded, e.g., Reason: OOMKilled).
Resource Limits: Verify that Limits and Requests are correctly set.
Events: This is the most critical section. Events show the lifecycle history, including scheduling failures, volume mounting issues, and image pull attempts.

Tip: If the Events section shows a message beginning with "0/N nodes are available," the scheduler could not place the Pod on any Node. The rest of the message gives the reason, typically insufficient CPU or memory, an untolerated taint, or unmet node selector or affinity rules.
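
The Events section can also be queried directly, which is convenient when the describe output is long. This is a sketch under the assumption that the events are still retained by the cluster (they expire after a while); the function name pod_warnings is hypothetical.

```shell
# Hypothetical helper: show only Warning events for one Pod, oldest first.
# Usage: pod_warnings my-namespace my-pod
pod_warnings() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$2,type=Warning" \
    --sort-by=.lastTimestamp
}
```

Sorting by timestamp keeps the most recent warning at the bottom of the output, next to the prompt.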

3. Reviewing Logs (`kubectl logs`)

For runtime application issues, logs provide the stack trace or error messages explaining why the process terminated.

# Check current container logs
kubectl logs [POD_NAME] -n my-namespace

# Check logs for a specific container within the Pod
kubectl logs [POD_NAME] -c [CONTAINER_NAME] -n my-namespace

# Crucial for CrashLoopBackOff: Check the logs from the *previous* failed run
kubectl logs [POD_NAME] --previous -n my-namespace
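
When a Pod has several containers, it may not be obvious which one crashed. A possible approach, sketched below, is to read the container names from the Pod spec and print the tail of each container's previous run; the function name previous_logs and the 50-line tail are arbitrary choices.

```shell
# Hypothetical helper: print the last 50 log lines from the previous run of
# every container in a Pod.
# Usage: previous_logs my-namespace my-pod
previous_logs() {
  for c in $(kubectl get pod "$2" -n "$1" -o jsonpath='{.spec.containers[*].name}'); do
    echo "== $c =="
    # --previous fails for containers that have never restarted; keep going.
    kubectl logs "$2" -c "$c" -n "$1" --previous --tail=50 ||
      echo "(no previous run for $c)"
  done
}
```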

Common Pod Failure Scenarios and Solutions

Most Pod failures fall into a few recognizable patterns, each requiring a targeted diagnostic approach.

Scenario 1: CrashLoopBackOff

This is one of the most common Pod failures. It signifies that the container starts, but its main process exits shortly afterward, usually with a non-zero exit code. The Kubelet then restarts the container with an increasing back-off delay between attempts, capped at five minutes.

Diagnosis:
1. Use kubectl logs --previous to view the traceback or exit reason.
2. Use kubectl describe to check the Exit Code in the Last State section.

Common Causes & Fixes:

Exit Code 1/2: General application error, missing configuration file, database connectivity failure, or application crash due to bad input.
- Fix: Debug the application code or check ConfigMaps/Secrets being mounted.
Missing Dependencies: The entry point script requires files or environments that are not present.
- Fix: Verify the Dockerfile and image build process.
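
Exit codes above 128 usually mean the process was killed by a signal (128 plus the signal number), which narrows the search considerably. The sketch below reads the last exit code of the first container in a Pod and translates a few conventional values; both function names are hypothetical, and the explanations are general shell conventions rather than anything specific to your application.

```shell
# Hypothetical helper: print the exit code of the previous run of the first
# container in a Pod.
# Usage: last_exit_code my-namespace my-pod
last_exit_code() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Hypothetical helper: translate a conventional exit code into a likely cause.
# Usage: explain_exit_code 137
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check why the main process is not long-running" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found; check the image entrypoint and PATH" ;;
    137) echo "killed by SIGKILL (128+9); often OOMKilled" ;;
    139) echo "segmentation fault, SIGSEGV (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)   echo "application-defined error; check the logs" ;;
  esac
}
```

The two can be combined as explain_exit_code "$(last_exit_code my-namespace my-pod)".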

Scenario 2: ImagePullBackOff / ErrImagePull

This occurs when the Kubelet cannot fetch the container image specified in the Pod definition.

Diagnosis:
1. Check the Events section of kubectl describe for the specific error message (e.g., 404 Not Found or authentication required).

Common Causes & Fixes:

Typo or Wrong Tag: The image name or tag is incorrect.
- Fix: Correct the image name in the Deployment or Pod specification.
Private Registry Access: The image lives in a private repository (for example on Docker Hub, GCR, or ECR) and the cluster has no credentials to pull it.
- Fix: Ensure the Pod spec references a registry Secret under imagePullSecrets and that the Secret exists in the same namespace as the Pod.

# Example Pod spec snippet for using a pull secret
spec:
  containers:
  ...
  imagePullSecrets:
  - name: my-registry-secret
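
If the Secret does not exist yet, it can be created with kubectl create secret docker-registry. The sketch below is one way to wrap that command; the function name, the REGISTRY_PASSWORD environment variable, and the example server and user names are assumptions for illustration.

```shell
# Hypothetical helper: create a registry pull Secret in the Pod's namespace.
# The password is read from the REGISTRY_PASSWORD environment variable
# (an assumed name) so that it does not end up in shell history.
# Usage: create_pull_secret my-namespace my-registry-secret registry.example.com ci-bot
create_pull_secret() {
  kubectl create secret docker-registry "$2" \
    --namespace "$1" \
    --docker-server="$3" \
    --docker-username="$4" \
    --docker-password="${REGISTRY_PASSWORD:?set REGISTRY_PASSWORD first}"
}
```

The Secret name passed here must match the name listed under imagePullSecrets in the Pod spec.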

Scenario 3: Pending Status (Stuck)

A Pod remains in Pending status, usually indicating that it cannot be scheduled onto a Node or it is waiting for resources (like a PersistentVolume).

Diagnosis:
1. Run kubectl describe and look at the Events section.

Common Causes & Fixes:

Resource Exhaustion: The cluster lacks Nodes with enough available CPU or Memory to satisfy the Pod's requests.
- Fix: Increase cluster size, or reduce Pod resource requests (if feasible).
- Event Message Example: 0/4 nodes are available: 4 Insufficient cpu.
Volume Binding Issues: The Pod requires a PersistentVolumeClaim (PVC) that cannot be bound to an underlying PersistentVolume (PV).
- Fix: Check the status of the PVC (kubectl get pvc) and ensure the StorageClass is functioning.
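
Both causes can be checked from the command line without opening each Pod individually. This is a minimal sketch, assuming the default column layout of kubectl get pvc (NAME, STATUS, ...); the function names are hypothetical.

```shell
# Hypothetical helper: list scheduler failure messages per Pod in a namespace.
# Usage: scheduling_failures my-namespace
scheduling_failures() {
  kubectl get events -n "$1" \
    --field-selector reason=FailedScheduling \
    -o custom-columns=POD:.involvedObject.name,MESSAGE:.message \
    --no-headers
}

# Hypothetical helper: list PersistentVolumeClaims that are not Bound.
# Usage: unbound_pvcs my-namespace
unbound_pvcs() {
  kubectl get pvc -n "$1" --no-headers | awk '$2 != "Bound" { print $1, $2 }'
}
```

An empty result from both functions suggests the Pod is Pending for a different reason, such as an image that is still being pulled.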

Scenario 4: OOMKilled (Out of Memory Killed)

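While this usually surfaces as a CrashLoopBackOff status, the underlying cause is specific: the container tried to use more memory than the limit defined in its specification, so the Linux kernel's out-of-memory (OOM) killer, enforcing the container's cgroup memory limit, forcefully terminated the process. The Kubelet then reports the reason as OOMKilled, typically with exit code 137.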

Diagnosis:
1. Check kubectl describe -> Last State -> Reason: OOMKilled.
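
The same check can be scripted, which is useful for multi-container Pods where only one container hits its limit. A sketch, with a hypothetical function name:

```shell
# Hypothetical helper: print the names of containers in a Pod whose previous
# run ended with reason OOMKilled.
# Usage: oomkilled_containers my-namespace my-pod
oomkilled_containers() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}' |
    awk -F '\t' '$2 == "OOMKilled" { print $1 }'
}
```

If a metrics server is installed in the cluster (an assumption), kubectl top pod with the --containers flag shows current memory usage per container for comparison against the limit.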

Fixes:

Increase Limits: Increase the memory limit in the Pod spec, providing more headroom.
Optimize Application: Profile the application to reduce its memory footprint.
Set Requests: Ensure requests are set close to the actual steady-state usage to improve scheduling reliability.

resources:
  limits:
    memory: "512Mi" # Increase this value
  requests:
    memory: "256Mi"
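
For a quick experiment, the same values can be changed on a live Deployment with kubectl set resources. The sketch below assumes the Pod is managed by a Deployment; the function name and the example values are hypothetical.

```shell
# Hypothetical helper: set the memory request and limit on a Deployment.
# This triggers a rollout of new Pods with the updated values.
# Usage: set_memory my-namespace my-app 256Mi 768Mi
set_memory() {
  kubectl set resources deployment "$2" -n "$1" \
    --requests=memory="$3" --limits=memory="$4"
}
```

Changes made this way are not reflected in your manifests, so once a suitable value is found it should be copied back into the source YAML.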

Preventing Future Failures: Best Practices

Robust applications require careful configuration to prevent common deployment pitfalls.

Use Liveness and Readiness Probes

Proper implementation of probes allows Kubernetes to intelligently manage container health:

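Liveness Probes: Determine if the container is healthy enough to continue running. If the liveness probe fails repeatedly (failureThreshold consecutive failures), the Kubelet restarts the container, which can recover a process that is deadlocked or hung but still running.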
Readiness Probes: Determine if the container is ready to serve traffic. If the readiness probe fails, the Pod is removed from service endpoints, preventing failed requests while the container recovers.
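
As an illustration of both probe types, the function below prints a small Pod manifest that can be piped to kubectl. It is only a sketch: the image name, the /healthz and /ready paths, port 8080, and the timing values are all assumptions that would need to match your application.

```shell
# Hypothetical helper: print an example Pod manifest with both probe types.
# Usage: probe_example | kubectl apply --dry-run=client -f -
probe_example() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # assumed image
    livenessProbe:
      httpGet:
        path: /healthz                    # assumed health endpoint
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3                 # restart after 3 consecutive failures
    readinessProbe:
      httpGet:
        path: /ready                      # assumed readiness endpoint
        port: 8080
      periodSeconds: 5
EOF
}
```

Using --dry-run=client validates the manifest locally without creating the Pod.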

Enforce Resource Limits and Requests

Always define resource requests and limits for containers. Requests tell the scheduler how much capacity to reserve, so the Pod is placed on a Node that can accommodate it. Limits cap what a container may consume at runtime, so a single Pod cannot starve its neighbors and destabilize the Node.

Utilize Init Containers for Setup

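If a Pod requires a dependency check or data setup before the main application starts (e.g., waiting for a database migration to complete), use an Init Container. If the Init Container fails, the Kubelet restarts it until it succeeds (unless the Pod's restartPolicy is Never, in which case the Pod is marked as failed), and the main containers do not start until every Init Container has completed. This cleanly isolates setup errors from application runtime errors.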
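
The pattern can be sketched as follows. The function prints an example manifest in which an Init Container waits until a Service name resolves in DNS before the main container starts; the Service name db, the image names, and the function name are assumptions for illustration.

```shell
# Hypothetical helper: print an example Pod manifest with an Init Container
# that blocks until the Service name "db" (assumed) resolves in cluster DNS.
# Usage: init_container_example | kubectl apply --dry-run=client -f -
init_container_example() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.28
    command: ["sh", "-c", "until nslookup db; do echo waiting for db; sleep 2; done"]
  containers:
  - name: app
    image: registry.example.com/app:1.0   # assumed image
EOF
}
```

While the Init Container is still running, kubectl get pods reports a status such as Init:0/1, and its output can be read with kubectl logs using -c wait-for-db.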

Conclusion

Mastering Kubernetes Pod troubleshooting hinges on a methodical approach built on the output of kubectl get, kubectl describe, and kubectl logs. By systematically analyzing the Pod status, reading the event history, and understanding common exit codes, you can quickly diagnose and resolve CrashLoopBackOff, ImagePullBackOff, and resource-related failures and keep application downtime short.