Kubernetes Pod Failure Troubleshooting: A Comprehensive Guide
Navigate the complexities of Kubernetes Pod failures with this comprehensive guide. Learn the structured process for diagnosing common issues like CrashLoopBackOff, ImagePullBackOff, and resource exhaustion. We detail how to leverage crucial tools like `kubectl describe` and `kubectl logs --previous` to pinpoint the root cause, interpret container exit states, and implement practical fixes to maintain reliable application uptime and stability.
Kubernetes pod failure troubleshooting is less about memorizing every status and more about learning where Kubernetes leaves clues. A pod almost never fails "silently." The scheduler, kubelet, container runtime, image registry, volume plugin, and your application all leave traces in different places. The trick is checking them in the right order so you do not spend twenty minutes reading application logs for a pod that never pulled its image.
I usually start with one question: did the pod fail before the container started, while the container was starting, or after the application began running? That single split keeps the investigation grounded. Pending usually points to scheduling, storage, or image setup. ImagePullBackOff points to the registry path, tag, credentials, or node egress. CrashLoopBackOff usually means the process starts and then exits, though the reason might be configuration, a missing file, a bad command, a failed dependency, or memory pressure.
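The split described above can be written down as a small lookup. This is a minimal sketch: the function name `first_place_to_look` is hypothetical, and the mapping only restates the guidance in the preceding paragraph rather than covering every status Kubernetes can report.

```shell
# Hypothetical helper: map a pod STATUS string to the first place to look.
# The mapping mirrors the prose above and is not exhaustive.
first_place_to_look() {
  case "$1" in
    Pending)
      echo "before start: scheduling, storage, or image setup (describe events)" ;;
    ImagePullBackOff|ErrImagePull)
      echo "before start: registry path, tag, credentials, or node egress" ;;
    CrashLoopBackOff)
      echo "after start: previous container logs and last exit state" ;;
    *)
      echo "unclassified: start with describe events" ;;
  esac
}
```

Calling `first_place_to_look CrashLoopBackOff` prints the "after start" hint. The point is the ordering, not the exact wording of each hint.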
The Three Pillars of Pod Diagnosis
Troubleshooting starts with three kubectl commands that, between them, surface most of what Kubernetes knows about a failing Pod.
1. Initial Status Check (kubectl get pods)
The first step is always to determine the current state of the Pod and its containers. Pay close attention to the STATUS and READY columns.
kubectl get pods -n my-namespace
Interpreting Pod Status
| Status | Meaning | Initial Action |
|---|---|---|
| Running | At least one container is running; this does not always mean the app is serving traffic. | Check READY, restarts, and readiness events. |
| Pending | Pod has been accepted by Kubernetes but containers are not yet created. | Check scheduling events or image pull status. |
| CrashLoopBackOff | Container starts, crashes, and Kubelet attempts to restart it repeatedly. | Check application logs (kubectl logs --previous). |
| ImagePullBackOff | Kubelet cannot pull the required container image. | Check image name, tag, and registry credentials. |
| Error | Pod exited due to a runtime error or failed startup command. | Check logs and describe events. |
| Terminating/Unknown | Pod is shutting down or the Kubelet host is unresponsive. | Check node health. |
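In a busy namespace it can help to filter the listing down to pods that deserve attention. The sketch below assumes the usual column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE). The function name `flag_unhealthy` is hypothetical, and treating any restart count above zero as noteworthy is a judgment call, not a rule.

```shell
# Hypothetical filter: reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and prints the pods worth a look.
flag_unhealthy() {
  awk '
    $3 == "Completed" { next }            # finished Jobs are not failures
    {
      split($2, ready, "/")               # READY looks like "1/2"
      if (ready[1] != ready[2] || $3 != "Running" || $4 + 0 > 0)
        print $1, "status=" $3, "ready=" $2, "restarts=" $4
    }
  '
}

# Usage (requires cluster access):
#   kubectl get pods -n my-namespace --no-headers | flag_unhealthy
```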
2. Deep Inspection (kubectl describe pod)
If the status is anything other than Running, or the Pod is Running but not ready, the describe command provides the missing context: scheduling decisions, resource settings, and container states.
kubectl describe pod [POD_NAME] -n my-namespace
Focus on these sections in the output:
- Containers/Init Containers: Check the `State` (especially `Waiting` or `Terminated`) and the `Last State` (where the failure reason is often recorded, e.g., `Reason: OOMKilled`).
- Resource Limits: Verify that `Limits` and `Requests` are correctly set.
- Events: This is the most critical section. Events show the lifecycle history, including scheduling failures, volume mounting issues, and image pull attempts.
Tip: If the `Events` section shows a message like "0/N nodes available," the Pod is likely failing to schedule due to insufficient resources (CPU, memory) or affinity rules not being met.
Read events from the bottom upward when you want the newest clue, but do not ignore older events. A pod can have more than one problem. For example, a deployment may start with FailedScheduling because the requested memory is too high, then later move to ImagePullBackOff after a node is added. If you only look at the final status, you may miss the change that made the problem move forward.
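To read one pod's events in time order without the rest of the describe output, the events can also be queried directly. This is a sketch: the wrapper name `pod_events` is hypothetical, while `--field-selector` and `--sort-by` are standard `kubectl get` flags.

```shell
# Hypothetical wrapper: list events for one pod, sorted by creation time.
# Usage: pod_events <namespace> <pod-name>
pod_events() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$2" \
    --sort-by=.metadata.creationTimestamp
}
```

Repeated events are aggregated into a single entry, so the count and last-seen columns are worth reading alongside the order. Events are also retained only for a limited time, which means an old failure may no longer appear.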
3. Reviewing Logs (kubectl logs)
For runtime application issues, logs provide the stack trace or error messages explaining why the process terminated.
# Check current container logs
kubectl logs [POD_NAME] -n my-namespace
# Check logs for a specific container within the Pod
kubectl logs [POD_NAME] -c [CONTAINER_NAME] -n my-namespace
# Crucial for CrashLoopBackOff: Check the logs from the *previous* failed run
kubectl logs [POD_NAME] --previous -n my-namespace
If the pod has sidecars, always include -c. Many frustrating investigations come from reading the logs of the healthy sidecar instead of the failing application container. For init container failures, use the init container name with -c as well:
kubectl logs [POD_NAME] -c [INIT_CONTAINER_NAME] -n my-namespace
Common Pod Failure Scenarios and Solutions
Most Pod failures fall into a few recognizable patterns, each requiring a targeted diagnostic approach.
Scenario 1: CrashLoopBackOff
This is one of the most common Pod failures. It means the container starts, but its process exits soon afterwards (usually with a non-zero exit code), and the kubelet waits an increasing back-off delay before each restart attempt.
Diagnosis:
- Use `kubectl logs --previous` to view the traceback or exit reason.
- Use `kubectl describe` to check the `Exit Code` in the `Last State` section.
Common Causes & Fixes:
- Exit Code 1/2: General application error, missing configuration file, database connectivity failure, or application crash due to bad input.
  - Fix: Debug the application code or check ConfigMaps/Secrets being mounted.
- Missing Dependencies: The entry point script requires files or environment variables that are not present.
  - Fix: Verify the Dockerfile and image build process.
- Bad command or args: The container image is valid, but the command in the Pod spec overrides the image entrypoint incorrectly.
  - Fix: Compare the Deployment `command` and `args` with the image's expected startup command. Test the same image locally if possible.
- Probe-induced restarts: A liveness probe may kill a slow-starting app before it finishes warming up.
  - Fix: Increase `initialDelaySeconds`, use a `startupProbe`, or point the probe at a cheaper health endpoint.
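Exit codes carry more information than the list above suggests, because codes above 128 conventionally mean the process was killed by a signal (code minus 128). The helper below is a rough sketch of that convention; the function name `explain_exit_code` and the wording of each hint are illustrative, and an application is free to use any code for its own purposes.

```shell
# Hypothetical helper: translate a container exit code into a first guess.
# Codes above 128 conventionally mean "killed by signal (code - 128)".
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly; check whether the process is meant to stay running" ;;
    126) echo "command found but not executable; check file permissions" ;;
    127) echo "command not found; check command, args, and image contents" ;;
    137) echo "SIGKILL (128+9); often an OOM kill, confirm with Reason: OOMKilled" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined error; read the previous container logs"
      fi ;;
  esac
}

# The code itself can be read with (first container only):
#   kubectl get pod [POD_NAME] -n my-namespace \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```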
A practical pattern is to deploy one temporary copy with the same image but a harmless command, then inspect the filesystem and environment:
kubectl run debug-image \
--image=registry.example.com/app:tag \
--restart=Never \
--command -- sleep 3600
kubectl exec -it debug-image -- /bin/sh
This does not replace fixing the Deployment, but it helps answer simple questions quickly: is the config file actually in the image, does the binary exist, does the container have the expected shell, and are environment variables present?
Scenario 2: ImagePullBackOff / ErrImagePull
This occurs when the Kubelet cannot fetch the container image specified in the Pod definition.
Diagnosis:
- Check the `Events` section of `kubectl describe` for the specific error message (e.g., `404 Not Found` or `authentication required`).
Common Causes & Fixes:
- Typo or Wrong Tag: The image name or tag is incorrect.
  - Fix: Correct the image name in the Deployment or Pod specification.
- Private Registry Access: The cluster does not have credentials to pull from a private registry (like Docker Hub, GCR, ECR).
  - Fix: Ensure an appropriate `imagePullSecret` is referenced in the Pod spec and that the Secret exists in the namespace.
# Example Pod spec snippet for using a pull secret
spec:
  containers:
    ...
  imagePullSecrets:
    - name: my-registry-secret
Also check where the pull secret lives. Kubernetes secrets are namespaced. A secret named regcred in default will not help a pod in payments. If the same image works in one namespace but fails in another, compare service accounts and image pull secrets before assuming the registry is broken:
kubectl get serviceaccount default -n payments -o yaml
kubectl get secret regcred -n payments
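If the secret turns out to be missing from the namespace, one way to create it and attach it to the default service account is sketched below. The function name `add_pull_secret` and its argument order are hypothetical; `kubectl create secret docker-registry` and `kubectl patch serviceaccount` are the standard commands for this. Passing a password as an argument leaves it in shell history, so treat this as an illustration rather than a production procedure.

```shell
# Hypothetical helper: create a registry pull secret in a namespace and
# attach it to that namespace's default service account.
# Usage: add_pull_secret <namespace> <secret-name> <server> <user> <password>
add_pull_secret() {
  kubectl create secret docker-registry "$2" -n "$1" \
    --docker-server="$3" \
    --docker-username="$4" \
    --docker-password="$5" &&
  kubectl patch serviceaccount default -n "$1" \
    -p "{\"imagePullSecrets\": [{\"name\": \"$2\"}]}"
}
```

After running it, re-check the service account with the `kubectl get serviceaccount` command above to confirm the `imagePullSecrets` list looks the way you expect. Pods created before the change do not pick it up.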
Scenario 3: Pending Status (Stuck)
A Pod remains in Pending status, usually indicating that it cannot be scheduled onto a Node or it is waiting for resources (like a PersistentVolume).
Diagnosis:
- Run `kubectl describe` and look at the `Events` section.
Common Causes & Fixes:
- Resource Exhaustion: The cluster lacks Nodes with enough available CPU or Memory to satisfy the Pod's `requests`.
  - Fix: Increase cluster size, or reduce Pod resource requests (if feasible).
  - Event Message Example: `0/4 nodes are available: 4 Insufficient cpu.`
- Volume Binding Issues: The Pod requires a `PersistentVolumeClaim` (PVC) that cannot be bound to an underlying `PersistentVolume` (PV).
  - Fix: Check the status of the PVC (`kubectl get pvc`) and ensure the StorageClass is functioning.
- Selector or affinity mismatch: The pod asks for a node label that does not exist, or a required affinity rule excludes every node.
  - Fix: Compare `nodeSelector`, `nodeAffinity`, and node labels with `kubectl get nodes --show-labels`.
- Taints not tolerated: Nodes are available, but they repel this pod because it lacks a matching toleration.
  - Fix: Add the intended toleration to the pod, or remove the taint if it no longer represents a real placement rule.
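Two of the causes above, taints and exhausted capacity, are easier to confirm from the node side. The helpers below are a sketch with hypothetical names (`node_taints`, `node_allocated`); the second assumes the `kubectl describe node` output contains an "Allocated resources:" section followed by "Events:", which is the usual layout but not a stable interface.

```shell
# Hypothetical helpers for a Pending pod.

# Show each node's taint keys and effects next to its name.
node_taints() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key,EFFECTS:.spec.taints[*].effect'
}

# Show how much CPU and memory is already requested on one node.
# Usage: node_allocated <node-name>
node_allocated() {
  kubectl describe node "$1" | sed -n '/Allocated resources:/,/Events:/p'
}
```

The scheduler works from requests, not live usage, so a node can look idle in a metrics dashboard and still reject a pod because its allocatable CPU is already reserved.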
Scenario 4: OOMKilled (Out of Memory Killed)
While this usually surfaces as a CrashLoopBackOff status, the underlying cause is specific: the container used more memory than the limit defined in its specification, so the Linux kernel's OOM killer terminated it. The kubelet then reports the container's last state as OOMKilled.
Diagnosis:
- Check `kubectl describe` -> `Last State` -> `Reason: OOMKilled`.
Fixes:
- Increase Limits: Increase the memory `limit` in the Pod spec, providing more headroom.
- Optimize Application: Profile the application to reduce its memory footprint.
- Set Requests: Ensure `requests` are set close to the actual steady-state usage to improve scheduling reliability.
resources:
  limits:
    memory: "512Mi" # Increase this value
  requests:
    memory: "256Mi"
Be careful with memory "fixes" that only raise the limit. If the application has a leak, a higher limit may only delay the next failure and make the node carry more risk. Look at memory over time in your metrics system. A sawtooth pattern that returns to baseline after garbage collection is different from a steady climb until OOM.
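Before changing the limit, it helps to confirm which container was killed and how close current usage sits to the limit. The following sketch uses hypothetical function names (`last_termination`, `current_usage`); `kubectl top` only works when a metrics pipeline such as metrics-server is installed, and it shows a point-in-time value, not the trend described above.

```shell
# Hypothetical helper: print name, last termination reason, and exit code
# for every container in a pod (empty fields mean no previous termination).
# Usage: last_termination <namespace> <pod-name>
last_termination() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Current per-container usage; needs a metrics pipeline such as metrics-server.
# Usage: current_usage <namespace> <pod-name>
current_usage() {
  kubectl top pod "$2" -n "$1" --containers
}
```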
Scenario 5: CreateContainerConfigError and CreateContainerError
These statuses are easy to overlook because they do not sound like application failures. They usually mean the kubelet could not assemble the container configuration.
Common causes include:
- A referenced ConfigMap or Secret does not exist in the namespace.
- A key inside a ConfigMap or Secret is misspelled.
- A volume mount path conflicts with another mount.
- The container references an invalid security context.
The fastest check is still describe:
kubectl describe pod [POD_NAME] -n my-namespace
Look for event messages such as secret "app-config" not found or configmap "settings" not found. Then verify the object:
kubectl get secret app-config -n my-namespace
kubectl get configmap settings -n my-namespace -o yaml
This is a common deployment-pipeline mistake. The application manifest is applied, but the secret creation step failed or ran against the wrong namespace.
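A pipeline can guard against this by checking that the referenced objects exist before applying the manifest. The function below is a minimal sketch; the name `require_objects` and its argument order are hypothetical, and it only checks that each object exists, not that it contains the keys the pod expects.

```shell
# Hypothetical pre-deploy check: confirm that named Secrets or ConfigMaps
# exist in the target namespace before applying the application manifest.
# Usage: require_objects <namespace> <kind> <name>...
require_objects() {
  ns=$1; kind=$2; shift 2
  missing=0
  for name in "$@"; do
    if ! kubectl get "$kind" "$name" -n "$ns" >/dev/null 2>&1; then
      echo "missing $kind/$name in namespace $ns"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_objects my-namespace secret app-config && kubectl apply -f app.yaml` would skip the apply when the secret is absent (`app.yaml` is a placeholder file name).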
Scenario 6: Running but Not Ready
A pod can show Running while still being unusable. The READY column tells you how many containers are ready according to their readiness probes. A pod with 1/2 or 0/1 may be alive but removed from Service endpoints.
Check endpoints when traffic is failing but pods look alive:
kubectl get endpoints [SERVICE_NAME] -n my-namespace
kubectl describe pod [POD_NAME] -n my-namespace
If the endpoint list is empty, the problem may be a readiness probe, a Service selector mismatch, or the application listening on a different port than the Service expects. In real incidents, this is where people lose time: they keep restarting pods even though the pods are not the reason traffic is missing.
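To check for a selector or port mismatch, put the Service's selector and target ports next to the pod labels. This is a sketch with a hypothetical function name (`service_vs_pods`); the comparison itself is left to the reader, and the way the selector map is printed varies between kubectl versions.

```shell
# Hypothetical helper: show a Service's selector and target ports, then list
# pods in the namespace with their labels so the two can be compared by eye.
# Usage: service_vs_pods <namespace> <service-name>
service_vs_pods() {
  kubectl get service "$2" -n "$1" \
    -o jsonpath='{"selector: "}{.spec.selector}{"\n"}{"targetPorts: "}{.spec.ports[*].targetPort}{"\n"}'
  kubectl get pods -n "$1" --show-labels
}
```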
Preventing Future Failures: Best Practices
Robust applications require careful configuration to prevent common deployment pitfalls.
Use Liveness and Readiness Probes
Proper implementation of probes allows Kubernetes to intelligently manage container health:
- Liveness Probes: Determine if the container is healthy enough to continue running. If the liveness probe fails, the kubelet restarts the container, which can recover a process that is deadlocked or otherwise hung.
- Readiness Probes: Determine if the container is ready to serve traffic. If the readiness probe fails, the Pod is removed from service endpoints, preventing failed requests while the container recovers.
Do not point liveness probes at deep dependency checks unless you really want Kubernetes to restart the container whenever that dependency has a short outage. A database being unavailable for thirty seconds is usually not proof that the process is dead. Readiness is the better place to say, "do not send this pod traffic right now."
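One way to express that separation in a manifest is sketched below. The function simply prints an example Pod, and `print_probe_example` is a hypothetical name. The paths `/healthz` and `/ready`, the port 8080, and all timing values are assumptions about the application rather than recommended settings.

```shell
# Hypothetical example: print a Pod manifest that separates the three probes.
# The paths /healthz and /ready and port 8080 are assumptions about the app.
print_probe_example() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-example
spec:
  containers:
    - name: app
      image: registry.example.com/app:tag
      ports:
        - containerPort: 8080
      startupProbe:            # gives a slow starter up to 30 x 5s to come up
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 5
        failureThreshold: 30
      livenessProbe:           # cheap check: is the process responsive at all?
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:          # may include dependency checks; gates traffic only
        httpGet: { path: /ready, port: 8080 }
        periodSeconds: 5
EOF
}
```

Piping the output to `kubectl apply --dry-run=client -f -` validates the structure without creating anything. The liveness and readiness probes do not run until the startup probe has succeeded, which is what protects a slow-starting application from early restarts.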
Enforce Resource Limits and Requests
Always define resource requirements for containers. This prevents Pods from consuming excessive resources (leading to Node instability) and ensures the scheduler can place the Pod on a Node with sufficient capacity.
Utilize Init Containers for Setup
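If a Pod requires a dependency check or data setup before the main application starts (e.g., waiting for a database migration to complete), use an Init Container. Application containers do not start until every Init Container has succeeded. If an Init Container fails, the kubelet retries it (unless the Pod's restartPolicy is Never, in which case the Pod is marked as failed), which cleanly separates setup errors from application runtime errors.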
A Practical Triage Flow
When you are on call, use a repeatable path:
- Check `kubectl get pods -n <namespace> -o wide` so you see status, restarts, age, and node placement.
- Run `kubectl describe pod` and read Events, State, Last State, mounts, and resource settings.
- Pull logs with `kubectl logs`, adding `--previous` for restarted containers and `-c` for multi-container pods.
- If the pod is `Pending`, inspect scheduling, taints, node labels, and PVCs before reading app logs.
- If the pod is `Running` but not receiving traffic, inspect readiness probes, Service selectors, and endpoints.
- If the pod was OOMKilled, compare limits with real memory graphs before simply increasing the number.
This order keeps you from jumping straight to the application when Kubernetes has not even started it yet.
Final Check
The most useful habit is to separate symptoms from causes. CrashLoopBackOff is a symptom. The cause might be a missing secret, a bad migration, a liveness probe, or a memory limit. Pending is a symptom. The cause might be CPU requests, a PVC, a taint, or a node selector. Let the pod status tell you where to look, then let events and logs tell you what changed.