Troubleshooting: Why Is My Kubernetes Pod Stuck in Pending or CrashLoopBackOff?

Pending and CrashLoopBackOff look similar when you are waiting for a rollout, but they mean very different things. Pending usually means Kubernetes has not been able to place or prepare the pod. CrashLoopBackOff means the container did start, then exited, and Kubernetes is delaying the next restart.

That difference matters. A pending pod is often a scheduler, image, or storage problem. A crashing pod is usually an application, command, probe, permission, or memory problem. Start with that split and the troubleshooting path gets much shorter.

Understanding Pod States: Pending vs. CrashLoopBackOff

Before diving into troubleshooting, it's essential to understand what these two states signify.

Pod Status: Pending

A Pod in the Pending state means Kubernetes has accepted the Pod object, but it has not fully moved into a running container state. Sometimes it has not been scheduled onto a node. Sometimes it has a node assigned, but image pull, volume attach, or sandbox setup has not completed.

Pod Status: CrashLoopBackOff

A Pod in CrashLoopBackOff means that a container within the Pod is repeatedly starting, crashing, and then restarting. Kubernetes implements an exponential back-off delay between restarts to prevent overwhelming the node. This state almost always points to an issue with the application running inside the container itself or its immediate environment.
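
To make the back-off concrete, here is a minimal sketch of how the delay grows. It assumes the documented kubelet defaults (a 10-second initial delay that doubles and is capped at 5 minutes); the function name `restart_delays` is made up for illustration, and your cluster's values may differ.

```python
def restart_delays(restarts, base=10, cap=300):
    """Return the assumed back-off delay (seconds) before each restart."""
    # The delay doubles after every crash until it reaches the cap.
    return [min(base * 2 ** i, cap) for i in range(restarts)]


print(restart_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

This is why a pod that has been crashing for a while can appear idle for minutes between attempts.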

One subtle case: a container can exit with code 0 and still enter a restart loop if the workload is supposed to be a long-running server. That often happens when a Deployment runs a one-shot command by mistake, such as a migration script or a shell command that finishes immediately.

Troubleshooting Pods in Pending State

When a Pod is Pending, the first place to look is the scheduler and the node it's trying to get onto. Here are the common causes and diagnostic steps.

1. Insufficient Resources on Nodes

One of the most frequent reasons for a Pod being Pending is that no node in the cluster has enough available resources (CPU, memory) to satisfy the Pod's requests. The scheduler cannot find a suitable node.

Diagnostic Steps:

Describe the Pod: The kubectl describe pod command is your best friend here. It will often show events detailing why the Pod cannot be scheduled.
```
kubectl describe pod <pod-name> -n <namespace>
```
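Look for FailedScheduling events with messages such as "0/3 nodes are available: 3 Insufficient cpu" or "3 Insufficient memory".
Check Node Resources: Compare what each node can allocate with what is already requested. The scheduler decides based on requests, not live usage, so the "Allocated resources" section of kubectl describe node is usually more telling than kubectl top.
```
kubectl get nodes
kubectl describe node <node-name>   # see "Allocatable" and "Allocated resources"
kubectl top nodes                   # live usage; requires metrics-server
```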

Solution:

Increase Cluster Capacity: Add more nodes to your Kubernetes cluster.
Adjust Pod Resource Requests: Reduce the requests for CPU and memory in your Pod's manifest if they are set too high.
```
resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
```
Evict Other Pods: Manually evict lower-priority Pods from nodes to free up resources (use with caution).
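
The arithmetic behind an "Insufficient cpu" message can be sketched as follows. This is a simplified illustration, not the scheduler's actual code: it handles only the common quantity suffixes, and the helper names (`cpu_millicores`, `memory_bytes`, `fits`) and the sample numbers are assumptions.

```python
from decimal import Decimal

# Common memory suffixes (a subset of what Kubernetes accepts).
_MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
              "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_millicores(quantity):
    """Convert '250m', '2' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_bytes(quantity):
    """Convert '128Mi', '1G' or a plain byte count to bytes."""
    for suffix, factor in _MEM_UNITS.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[:-len(suffix)]) * factor)
    return int(quantity)


def fits(pod_requests, allocatable, already_requested):
    """True if the pod's requests fit into allocatable minus existing requests."""
    free_cpu = cpu_millicores(allocatable["cpu"]) - cpu_millicores(already_requested["cpu"])
    free_mem = memory_bytes(allocatable["memory"]) - memory_bytes(already_requested["memory"])
    return (cpu_millicores(pod_requests["cpu"]) <= free_cpu
            and memory_bytes(pod_requests["memory"]) <= free_mem)


# Hypothetical node: 2 CPUs allocatable, 1800m already requested by other pods.
print(fits({"cpu": "250m", "memory": "128Mi"},
           {"cpu": "2", "memory": "4Gi"},
           {"cpu": "1800m", "memory": "1Gi"}))  # False: only 200m CPU is free
```

Note that the node can look idle in kubectl top and still fail this check, because the comparison uses requests rather than live usage.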

2. Image Pull Errors

If Kubernetes schedules the Pod to a node but the node cannot pull the container image, the Pod's phase stays Pending. In kubectl get pods, the STATUS column typically shows ErrImagePull or ImagePullBackOff.

Common Causes:

Incorrect Image Name/Tag: Typos in the image name or using a non-existent tag.
Private Registry Authentication: Missing or incorrect ImagePullSecrets for private registries.
Network Issues: Node unable to reach the image registry.

Diagnostic Steps:

Describe the Pod: Again, kubectl describe pod is key. Look for events like "Failed" or "ErrImagePull" or "ImagePullBackOff".
```
kubectl describe pod <pod-name> -n <namespace>
```
Example output event: Failed to pull image "my-private-registry/my-app:v1.0": rpc error: code = Unknown desc = Error response from daemon: pull access denied for my-private-registry/my-app, repository does not exist or may require 'docker login'
Check ImagePullSecrets: Verify that imagePullSecrets are correctly configured in your Pod or ServiceAccount.
```
kubectl get secret <your-image-pull-secret> -o yaml -n <namespace>
```

Solution:

Correct Image Name/Tag: Double-check the image name and tag in your deployment manifest.

Configure ImagePullSecrets: Ensure you have created a docker-registry secret and linked it to your Pod or ServiceAccount.

kubectl create secret docker-registry my-registry-secret \
      --docker-server=your-registry.com \
      --docker-username=your-username \
      --docker-password=your-password \
      --docker-email=your-email -n <namespace>

Then, add it to your Pod spec:

spec:
  imagePullSecrets:
  - name: my-registry-secret
  containers:
  ...

Network Connectivity: Verify network connectivity from the node to the image registry.

If you are using a private registry, check the ServiceAccount too. Many teams attach imagePullSecrets to the namespace's default ServiceAccount instead of every Deployment:

kubectl get serviceaccount default -n <namespace> -o yaml

If the secret exists but the pull still fails, confirm the registry hostname in the secret exactly matches the hostname in the image reference. registry.example.com/app:v1 and https://registry.example.com/app:v1 are not the same reference.
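
One way to check the hostname match is to extract the registry from the image reference and compare it with the server keys stored in the secret. The sketch below is a simplified comparison under stated assumptions: the function names are hypothetical, it compares hostnames only, and the kubelet's real matching handles more cases (paths, wildcards, Docker Hub aliases).

```python
def registry_host(image):
    """Return the registry part of an image reference, defaulting to docker.io."""
    first, sep, _ = image.partition("/")
    # The first component is a registry only if it looks like a host:
    # it contains a dot or a port, or is exactly "localhost".
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def normalize_server(server):
    """Strip the scheme and any path from a registry key found in a secret."""
    without_scheme = server.split("://", 1)[-1]
    return without_scheme.split("/", 1)[0]


def secret_covers_image(auth_servers, image):
    """True if any server key in the secret has the same host as the image."""
    return registry_host(image) in {normalize_server(s) for s in auth_servers}


print(registry_host("registry.example.com/app:v1"))  # registry.example.com
print(registry_host("nginx:1.25"))                   # docker.io
print(secret_covers_image(["https://registry.example.org"],
                          "registry.example.com/app:v1"))  # False
```

The server keys can be read from the decoded .dockerconfigjson field of the secret, under its auths map.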

3. Volume-Related Issues

If your Pod requires a PersistentVolumeClaim (PVC) and the corresponding PersistentVolume (PV) cannot be provisioned or bound, the Pod will remain Pending.

Diagnostic Steps:

Describe the Pod: Look for events related to volumes.
```
kubectl describe pod <pod-name> -n <namespace>
```
Events might show FailedAttachVolume, FailedMount, or similar messages.
Check PVC and PV Status: Inspect the status of the PVC and PV.
```
kubectl get pvc <pvc-name> -n <namespace>
kubectl get pv
```
Look for PVCs stuck in Pending or PVs not bound.

Solution:

Ensure StorageClass: Make sure a StorageClass is defined and available, especially if using dynamic provisioning.
Check PV Availability: If using static provisioning, ensure the PV exists and matches the PVC criteria.
Verify Access Modes: Ensure the access modes (e.g., ReadWriteOnce, ReadWriteMany) are compatible.

Also check whether the pod is scheduled in a zone where the volume can attach. In cloud clusters, a disk created in one availability zone may not attach to a node in another. The event usually mentions volume node affinity or attach failure. In that case, the fix may be scheduling constraints, a different StorageClass, or recreating the volume in the right zone.

4. Taints, Tolerations, and Node Selectors

A pod can stay Pending even when the cluster has plenty of CPU and memory. The scheduler also has to respect placement rules.

Common examples:

The pod has a nodeSelector that matches no nodes.
The pod requires node affinity that is too strict.
The only matching nodes have taints, and the pod has no matching toleration.
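The namespace has a ResourceQuota that blocks the requested resources. Quota is enforced at admission, so in this case the Pod is usually rejected outright and the failure shows up in the ReplicaSet or Deployment events rather than as a Pending Pod.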

Check the scheduling events first:

kubectl describe pod <pod-name> -n <namespace>

Then compare the pod's placement rules with node labels:

kubectl get pod <pod-name> -n <namespace> -o yaml
kubectl get nodes --show-labels
kubectl describe node <node-name>

If the event says a taint was not tolerated, either schedule the pod somewhere else or add a toleration only if that workload really belongs on those nodes. Do not blindly tolerate every taint. Taints often protect special nodes, GPU nodes, infrastructure nodes, or nodes under pressure.
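
The matching rule between a taint and a toleration can be sketched in a few lines. This is an illustration of the documented rules, not the scheduler's code; the function names and the sample GPU taint are assumptions. The dictionaries mirror the fields found under spec.taints on a node and spec.tolerations on a pod.

```python
def tolerates(toleration, taint):
    """True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key")
    if not key:
        # An empty key is only meaningful with Exists and matches every taint.
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated(taints, tolerations):
    """Taints that would keep the pod off the node (hard effects only)."""
    return [t for t in taints
            if t.get("effect") in ("NoSchedule", "NoExecute")
            and not any(tolerates(tol, t) for tol in tolerations)]


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
print(untolerated([gpu_taint], []))  # the GPU taint blocks a pod with no tolerations
print(untolerated([gpu_taint],
                  [{"key": "nvidia.com/gpu", "operator": "Exists",
                    "effect": "NoSchedule"}]))  # []
```

PreferNoSchedule taints are left out of the result because they only discourage placement; they do not block it.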

Troubleshooting Pods in CrashLoopBackOff State

A CrashLoopBackOff state usually indicates a problem with the application or its immediate environment. The container started, then exited, and Kubernetes keeps restarting it with a growing delay.

1. Application Errors

The most common cause is the application itself failing to start or encountering a fatal error shortly after starting.

Common Causes:

Missing Dependencies/Configuration: The application can't find critical configuration files, environment variables, or external services it relies on.
Incorrect Command/Arguments: The command or args specified in the container spec are incorrect or lead to an immediate exit.
Application Logic Errors: Bugs in the application code that cause it to crash on startup.

Diagnostic Steps:

View Pod Logs: This is the most critical step. The logs will often show the exact error message that caused the application to crash.
```
kubectl logs <pod-name> -n <namespace>
```
By default, kubectl logs shows output from the current container instance. If the container has just restarted, that output may be empty or may stop before the crash. To see logs from the previous, crashed instance, use the -p (--previous) flag:
```
kubectl logs <pod-name> -p -n <namespace>
```
Describe the Pod: Look for Restart Count in the Containers section, which indicates how many times the container has crashed. Also, check Last State for Exit Code.
```
kubectl describe pod <pod-name> -n <namespace>
```
An exit code of 0 usually means the process exited cleanly; any non-zero exit code signals an error or termination by a signal. Common non-zero exit codes include 1 (general error), 137 (SIGKILL, often OOMKilled), and 139 (SIGSEGV, segmentation fault).

Solution:

Review Application Logs: Based on the logs, debug your application code or configuration. Ensure all required environment variables, ConfigMaps, and Secrets are correctly mounted/injected.
Test Locally: Try running the container image locally with the same environment variables and commands to reproduce and debug the issue.

If the pod has multiple containers, always specify the container name:

kubectl logs <pod-name> -c <container-name> -n <namespace>
kubectl logs <pod-name> -c <container-name> -p -n <namespace>

Without -c, you may be reading the sidecar logs while the main app is the one crashing.

2. Liveness and Readiness Probes Failing

Kubernetes uses liveness and readiness probes to determine the health and availability of your application. If a liveness probe keeps failing, Kubernetes restarts the container, which can lead to CrashLoopBackOff. A failing readiness probe does not restart the container; it only removes the Pod from Service endpoints.

Diagnostic Steps:

Describe the Pod: Check the Liveness and Readiness probe definitions in the Containers section, along with the container's Last State.
```
kubectl describe pod <pod-name> -n <namespace>
```
Look for messages indicating probe failures, such as "Liveness probe failed: HTTP probe failed with statuscode: 500".
Review Application Logs: Sometimes the application logs will provide context for why the probe endpoint is failing.

Solution:

Adjust Probe Configuration: Correct the probe's path, port, command, initialDelaySeconds, periodSeconds, or failureThreshold.
Ensure Probe Endpoint Health: Verify that the application endpoint targeted by the probe is actually healthy and responding as expected. The application might be taking too long to start, requiring a larger initialDelaySeconds.

For slow-starting applications, consider a startupProbe. It gives the application more time to initialize before the liveness probe starts judging it. This is cleaner than setting a huge liveness initialDelaySeconds for every restart.
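
To size a startupProbe, it helps to compute how much time it actually grants. The sketch below assumes the documented probe defaults (failureThreshold 3, periodSeconds 10, initialDelaySeconds 0) and ignores timeoutSeconds, so treat the result as an approximation; the function name is hypothetical.

```python
def startup_budget_seconds(probe):
    """Approximate time a startupProbe allows before the container is restarted."""
    return (probe.get("initialDelaySeconds", 0)
            + probe.get("failureThreshold", 3) * probe.get("periodSeconds", 10))


# Mirrors a startupProbe with failureThreshold: 30 and periodSeconds: 10.
print(startup_budget_seconds({"failureThreshold": 30, "periodSeconds": 10}))  # 300
print(startup_budget_seconds({}))  # 30 with the defaults
```

If the application's slowest observed startup is close to this number, raise failureThreshold rather than loosening the liveness probe.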

3. Resource Limits Exceeded

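If a container tries to use more memory than its memory limit (resources.limits.memory), the kernel terminates the process and Kubernetes reports the container as OOMKilled (Out Of Memory Killed). Exceeding the CPU limit is different: the container is throttled, not killed, although heavy throttling can slow startup enough to fail a liveness probe.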

Diagnostic Steps:

Describe the Pod: Look for OOMKilled in the Last State or Events section. An Exit Code: 137 often indicates an OOMKilled event.
```
kubectl describe pod <pod-name> -n <namespace>
```
Check kubectl top: If metrics-server is installed, use kubectl top pod to see the actual resource usage of your Pods.
```
kubectl top pod <pod-name> -n <namespace>
```

Solution:

Increase Resource Limits: If your application genuinely needs more resources, increase the memory and/or cpu limits in your Pod's manifest. This might require more capacity on your nodes.
```
resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "512Mi" # Increase this
    cpu: "1000m"   # Increase this
```
Optimize Application: Profile your application to identify and reduce its resource consumption.
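
If you prefer to check for OOM kills programmatically, the same information shown under Last State is available in the pod's JSON. This is a minimal sketch that assumes the input is the parsed output of kubectl get pod -o json; the function name and the sample pod are made up for illustration.

```python
def oom_killed_containers(pod):
    """Return names of containers whose last termination reason was OOMKilled."""
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            hits.append(status["name"])
    return hits


# Hypothetical pod with one OOM-killed container and one healthy sidecar.
sample_pod = {"status": {"containerStatuses": [
    {"name": "app", "restartCount": 4,
     "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
    {"name": "sidecar", "restartCount": 0, "lastState": {}},
]}}
print(oom_killed_containers(sample_pod))  # ['app']
```

Checking the reason field is more reliable than checking for exit code 137 alone, since other SIGKILL sources produce the same code.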

4. Permissions Issues

Containers might crash if they lack the necessary permissions to access files, directories, or network resources they require.

Diagnostic Steps:

Review Logs: The application logs might show permission denied errors (EACCES).
Describe Pod: Check the ServiceAccount being used and any securityContext settings on the Pod and its containers.

Solution:

Adjust securityContext: Set runAsUser, fsGroup, or allowPrivilegeEscalation as needed.
ServiceAccount Permissions: Ensure the ServiceAccount associated with the Pod has the necessary Roles and ClusterRoles bound via RoleBindings and ClusterRoleBindings.
Volume Permissions: Ensure mounted volumes (e.g., emptyDir, hostPath, ConfigMap, Secret) have correct permissions for the container's user.

A Fast Decision Tree

When someone says "the pod is broken," run these in order:

kubectl get pod <pod-name> -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --all-containers=true --tail=100
kubectl logs <pod-name> -n <namespace> --all-containers=true -p --tail=100

Then branch:

If there is no node in kubectl get pod -o wide, focus on scheduling: requests, taints, affinity, quota, and node availability.
If there is a node but the event mentions image pull, focus on image name, tag, registry auth, and node-to-registry network access.
If the event mentions mount or attach, focus on PVCs, PVs, StorageClass, access modes, and zone placement.
If the pod starts then restarts, focus on logs, exit code, probes, command/args, config, secrets, and memory limits.

This order avoids a common mistake: reading application logs for a pod that never actually started an application container.
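
The branching above can be sketched as a small function over the pod's JSON. It is a rough first-pass classifier, not a complete diagnosis: the function name, the returned labels, and the sample pods are assumptions, and the input is expected to be the parsed output of kubectl get pod -o json.

```python
IMAGE_REASONS = ("ErrImagePull", "ImagePullBackOff", "InvalidImageName")


def triage(pod):
    """Return which branch of the decision tree to follow first."""
    spec, status = pod.get("spec", {}), pod.get("status", {})
    if not spec.get("nodeName"):
        return "scheduling"  # no node assigned yet
    statuses = (status.get("initContainerStatuses", [])
                + status.get("containerStatuses", []))
    for cs in statuses:
        reason = cs.get("state", {}).get("waiting", {}).get("reason", "")
        if reason in IMAGE_REASONS:
            return "image"
        if reason == "CrashLoopBackOff" or cs.get("restartCount", 0) > 0:
            return "crash"
    if status.get("phase") == "Pending":
        return "volume-or-sandbox"  # scheduled, but containers not created
    return "no-obvious-problem"


print(triage({"spec": {}, "status": {"phase": "Pending"}}))  # scheduling
print(triage({"spec": {"nodeName": "node-1"},
              "status": {"phase": "Running", "containerStatuses": [
                  {"name": "app", "restartCount": 6,
                   "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}}))  # crash
```

The events from kubectl describe pod are still needed to confirm the cause; this only tells you where to look first.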

Reading Exit Codes Without Overreacting

Exit codes are clues, not complete explanations.

1 usually means the application returned a general error. The logs matter more than the number.
2 can point to command-line usage errors in many programs.
126 often means the command exists but cannot execute.
127 often means the command was not found.
137 commonly appears when the process receives SIGKILL; in Kubernetes that is often, but not always, connected to OOMKilled.
143 means the process received SIGTERM, which can happen during normal termination.

If the exit code is 137, check the pod's Last State and events before assuming a memory leak. A node drain, eviction, or manual kill can also terminate a container.
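
Codes above 128 conventionally mean "terminated by signal (code minus 128)". A small helper can make that translation explicit. This sketch assumes a Unix-like system for the signal names, and the wording of the descriptions is illustrative rather than official.

```python
import signal

# Conventional meanings used by shells and many programs; not guaranteed.
_CONVENTIONS = {
    1: "general application error",
    2: "command-line usage error (by convention)",
    126: "command found but not executable",
    127: "command not found",
}


def describe_exit_code(code):
    """Translate a container exit code into a human-readable hint."""
    if code == 0:
        return "exited cleanly"
    if 128 < code < 128 + signal.NSIG:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = f"signal {number}"
        return f"terminated by {name}"
    return _CONVENTIONS.get(code, "application-defined error")


for code in (0, 1, 127, 137, 143):
    print(code, describe_exit_code(code))
```

On Linux, 137 maps to SIGKILL and 143 to SIGTERM. The helper says what ended the process, not why, so the pod's Last State and events remain the source of truth.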

General Diagnostic Steps and Tools

Here's a quick checklist of commands to run when facing Pod issues:

Get a Quick Overview: Check the status of your Pods.

kubectl get pods -n <namespace>
kubectl get pods -n <namespace> -o wide

Detailed Pod Information: The most crucial command for understanding Pod events, states, and conditions.
```
kubectl describe pod <pod-name> -n <namespace>
```

Container Logs: See what your application is reporting.

kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -p -n <namespace> # Previous instance
kubectl logs <pod-name> -f -n <namespace> # Follow logs

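Cluster-wide Events: Sometimes the issue isn't with a specific Pod but with the namespace or a node (e.g., node pressure). Sort by time so the most recent events appear last.
```
kubectl get events -n <namespace> --sort-by=.lastTimestamp
kubectl get events -A --sort-by=.lastTimestamp   # all namespaces
```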
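Interactive Debugging: If the container stays up long enough, you can exec into it. Many minimal images ship without bash, so sh is the safer choice. If the container exits too quickly or has no shell at all, kubectl debug can attach an ephemeral debug container to the Pod.
```
kubectl exec -it <pod-name> -n <namespace> -- sh
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name>
```
(Note: kubectl exec works only if the container stays alive long enough to attach.)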

Best Practices to Avoid Pod Issues

Prevention is always better than cure. Following these best practices can significantly reduce Pending and CrashLoopBackOff incidents:

Set Realistic Resource Requests and Limits: Start with reasonable requests and limits, then fine-tune them based on application profiling and monitoring.
Use Specific Image Tags: Avoid latest tags in production. Use immutable tags (e.g., v1.2.3, commit-sha) for reproducibility.
Implement Robust Probes: Configure liveness and readiness probes that accurately reflect your application's health. Account for startup times with a startupProbe or, failing that, initialDelaySeconds.
Centralized Logging and Monitoring: Use tools like Prometheus, Grafana, ELK stack, or cloud-native logging services to collect and analyze Pod logs and metrics.
Version Control for Manifests: Store your Kubernetes manifests in a version control system (e.g., Git) to track changes and facilitate rollbacks.
Thorough Testing: Test your container images and Kubernetes deployments in development and staging environments before deploying to production.
Graceful Shutdowns: Ensure your applications handle SIGTERM signals for graceful shutdowns, allowing them to release resources before termination.

What Usually Fixes It Fastest

For Pending, kubectl describe pod is usually the fastest path because scheduler and kubelet events explain what Kubernetes could not do. For CrashLoopBackOff, previous logs are usually the fastest path because the current container may be too new to show the crash that caused the loop.

After you fix the immediate problem, look for the prevention step: right-sized requests, better image tags, a startup probe, a missing secret check in CI, or a clearer runbook. The best pod incident is the one that becomes easier to recognize next time.