Debugging Failed Deployments: Identifying Common YAML and Configuration Errors

Kubernetes deployment failures usually show up as pods stuck in Pending, ImagePullBackOff, CrashLoopBackOff, or zero available replicas. The cause is often visible in events, but you need to inspect the Deployment, ReplicaSet, and Pods in the right order.

This guide walks through the checks that usually find YAML mistakes, image pull problems, bad probes, resource constraints, and scheduling rules that keep your deployment from becoming healthy.

The First Steps: Checking Deployment Status and Events

When a deployment fails, start with the Deployment itself, then the events on its ReplicaSets and Pods. This gives the highest-level view of what Kubernetes is trying to do and why it is failing.

1. Inspecting the Deployment Health

Use kubectl get deployments to see the overall status. Look at the READY, UP-TO-DATE, and AVAILABLE columns: READY shows ready replicas against the desired count (for example, 0/3), and a shortfall in any column points to a problem with the underlying Pods.

kubectl get deployments <deployment-name>

If the deployment status shows few or zero ready replicas, proceed to check the ReplicaSet.
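The same check can be scripted. The sketch below is a minimal example, assuming a hypothetical helper name (deployment_ready) and a default namespace of "default"; it reads the desired and ready replica counts with kubectl's jsonpath output.

```shell
# Hypothetical helper: print ready vs desired replicas for a Deployment.
# Usage: deployment_ready my-app [namespace]
deployment_ready() {
  name="$1"
  ns="${2:-default}"
  desired=$(kubectl get deployment "$name" -n "$ns" -o jsonpath='{.spec.replicas}')
  ready=$(kubectl get deployment "$name" -n "$ns" -o jsonpath='{.status.readyReplicas}')
  # .status.readyReplicas is omitted while no replica is ready, so default to 0.
  echo "${ready:-0}/${desired:-0} ready"
}
```

To block until a rollout finishes or times out instead, kubectl rollout status deployment/my-app --timeout=60s is an alternative (my-app is a placeholder name).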

2. Reviewing ReplicaSet and Pod Events

ReplicaSets maintain the desired number of Pods. If a ReplicaSet cannot create Pods at all (for example, because a ResourceQuota or an admission webhook rejects them), the error appears in the ReplicaSet's events rather than on any Pod. Use the describe command on the ReplicaSet, which is usually named <deployment-name>-<hash>:

kubectl describe rs <replicaset-name>

Examine the Events section at the bottom of the output. For a ReplicaSet it records Pod creation, including FailedCreate events when the Pods are rejected. Scheduling attempts, image pull failures, and volume mounting issues are recorded on the Pods themselves, and those events are often the smoking gun.
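When describe output is long, it can help to list only the Warning events for one object. This is a minimal sketch: the helper name (warning_events) is hypothetical, and it assumes the object lives in the namespace you pass (or "default").

```shell
# Hypothetical helper: list Warning events for one object, oldest first.
# Usage: warning_events <replicaset-or-pod-name> [namespace]
warning_events() {
  obj="$1"
  ns="${2:-default}"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$obj,type=Warning" \
    --sort-by=.lastTimestamp
}
```

Events are retained for a limited time, so an empty result on an old failure does not mean nothing went wrong.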

Finally, check the Pods themselves, as they report the immediate failure:

kubectl get pods -l app=<your-app-label>
kubectl describe pod <pod-name>

Look for the State and Reason fields in the pod description. Common reasons include ImagePullBackOff, ErrImagePull, CrashLoopBackOff, and Pending.
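To see the phase and waiting reason for every Pod behind a label selector at once, a jsonpath query is one option. The helper name below (pod_waiting_reasons) is hypothetical, and the selector you pass (for example app=my-app) is assumed to match your Deployment's Pod labels.

```shell
# Hypothetical helper: print name, phase, and waiting reason for matching Pods.
# Usage: pod_waiting_reasons app=my-app [namespace]
pod_waiting_reasons() {
  selector="$1"
  ns="${2:-default}"
  # The last column is empty for containers that are running normally.
  kubectl get pods -n "$ns" -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
}
```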

Common Configuration Errors in YAML Manifests

Misconfigurations in your YAML files are a frequent cause of deployment failures. These errors range from simple indentation mistakes to fields placed at the wrong level of the manifest.

1. YAML Syntax and Structure Errors

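Kubernetes is strict about YAML syntax (indentation, spacing, and data types). If the file is not valid YAML, kubectl apply fails immediately with a parse error. If the YAML parses but a field sits at the wrong level, validation typically rejects it as an unknown field or reports a required field as missing.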

Best Practice: Use a linter. Always validate your YAML syntax before applying. Tools like yamllint or integrated IDE support can catch basic syntax errors immediately.
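A linter checks YAML syntax only; kubectl can also check a manifest against the Kubernetes schema without changing the cluster. The sketch below assumes a hypothetical helper name (validate_manifest) and a manifest path of your choosing; the exact error wording depends on your kubectl and cluster versions.

```shell
# Hypothetical helper: validate a manifest without creating anything.
# Usage: validate_manifest deployment.yaml
validate_manifest() {
  file="$1"
  # Client-side dry run: parses the file and builds the objects locally.
  kubectl apply --dry-run=client -f "$file" || return 1
  # Server-side dry run: the API server processes the request
  # (including admission checks) but does not persist it.
  kubectl apply --dry-run=server -f "$file"
}
```

If you are unsure where a field belongs, kubectl explain deployment.spec.template.spec prints the fields accepted at that level.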

Example of a common structural error: Incorrect mapping or indentation.

# INCORRECT: containers belongs under spec.template.spec, not directly under spec.template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  # apps/v1 requires a selector that matches the template labels
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    containers:
      - name: my-app
        image: myregistry/myapp:v1

Correct placement:

spec:
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: myregistry/myapp:v1
          ports:
            - containerPort: 8080

2. Image Reference and Pull Errors

If Pods enter ImagePullBackOff or ErrImagePull, the issue is related to accessing the container image.

Typo in Image Name/Tag: Double-check the spelling of the image repository, name, and tag.
Private Registry Authentication: If pulling from a private registry, ensure you have created an image pull secret and referenced it as imagePullSecrets in the Pod spec.

# Pod spec (spec.template.spec in a Deployment) referencing an image pull secret
spec:
  imagePullSecrets:
    - name: my-registry-secret
  containers:
    - name: my-app
      image: private.example.com/my-app:v1
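The secret referenced by imagePullSecrets has to exist in the same namespace as the Pod. A minimal sketch of creating one follows; the helper name (create_pull_secret) and all argument values are hypothetical, and in practice you would avoid typing a real password where it lands in shell history.

```shell
# Hypothetical helper: create a docker-registry secret for private image pulls.
# Usage: create_pull_secret <secret-name> <registry-host> <username> <password> [namespace]
create_pull_secret() {
  kubectl create secret docker-registry "$1" \
    --docker-server="$2" \
    --docker-username="$3" \
    --docker-password="$4" \
    -n "${5:-default}"
}
```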

3. Resource Requests and Limits Violations

If a Pod remains in Pending status and the events mention insufficient resources, the cluster nodes cannot satisfy the CPU or memory requirements defined in the YAML.

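Check the requests and limits specified in your deployment manifest. The scheduler places Pods based on requests, not limits, so an oversized request is what leaves a Pod Pending: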

resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "1"

Troubleshooting Steps:

Use kubectl describe nodes to see available capacity.
If you see events like 0/3 nodes are available: 3 Insufficient cpu, you must either lower the requests in your YAML or add more nodes to the cluster.
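For a compact view across nodes, custom-columns output can list each node's allocatable CPU and memory. This is a sketch with a hypothetical helper name (node_allocatable); note that allocatable is the total a node can offer to Pods, not what is still free.

```shell
# Hypothetical helper: show allocatable CPU and memory per node.
# Usage: node_allocatable
node_allocatable() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}
```

To see how much of that is already requested by running Pods, look at the "Allocated resources" section of kubectl describe nodes.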

Advanced Configuration Issues

Beyond basic syntax and image pulls, configurations involving probes, storage, and scheduling can lead to difficult-to-diagnose failures.

1. Misconfigured Probes (Liveness and Readiness)

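If a Pod transitions to CrashLoopBackOff, the container is starting and then repeatedly exiting, either because the application crashes or because a failing liveness probe makes Kubernetes kill and restart it. A failing readiness probe does not restart the container; it leaves the Pod running but not Ready.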

Liveness Probes: If this fails repeatedly, Kubernetes restarts the container. Check kubectl describe pod <pod-name> for probe failure events and kubectl logs <pod-name> --previous after a restart.
Readiness Probes: If this fails, the Pod keeps running but is kept out of the Service endpoints. Ensure the path, port, and expected response code match what your application is actually serving.

Example of a common Readiness Probe error: Targeting the wrong port or expecting HTTP when the app only exposes TCP.
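One way to spot a port mismatch is to print the configured readiness probe next to the container's declared ports. The sketch below assumes a hypothetical helper name (probe_vs_ports) and inspects only the first container in the Pod template; the probe is printed in whatever form your kubectl version renders jsonpath objects.

```shell
# Hypothetical helper: show the readiness probe beside the declared container ports.
# Usage: probe_vs_ports my-app [namespace]
probe_vs_ports() {
  deploy="$1"
  ns="${2:-default}"
  printf 'readinessProbe: '
  kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}{"\n"}'
  printf 'containerPorts: '
  kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.template.spec.containers[0].ports[*].containerPort}{"\n"}'
}
```

If the probe targets a port that is not in the second line, or uses httpGet against a service that only speaks TCP, the probe will keep failing even though the application is healthy.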

2. Volume and PersistentVolumeClaim (PVC) Failures

If Pods are pending due to volume issues, inspect the PVC status:

kubectl get pvc <pvc-name>

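If the PVC status is Pending, the cluster could not find a matching StorageClass or a suitable PersistentVolume to bind to. A StorageClass with volumeBindingMode set to WaitForFirstConsumer also keeps the PVC Pending until a Pod that uses it is scheduled, which is expected behavior. Check the PVC events for specific binding errors.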
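The PVC checks can be gathered in one place. This is a minimal sketch with a hypothetical helper name (pvc_binding_info): it prints the claim's phase and requested StorageClass, lists the StorageClasses the cluster offers, and then shows the events recorded for the claim.

```shell
# Hypothetical helper: collect the facts needed to debug a Pending PVC.
# Usage: pvc_binding_info <pvc-name> [namespace]
pvc_binding_info() {
  pvc="$1"
  ns="${2:-default}"
  # Phase (Pending/Bound) and the StorageClass the claim asks for.
  kubectl get pvc "$pvc" -n "$ns" \
    -o jsonpath='{.status.phase}{"\t"}{.spec.storageClassName}{"\n"}'
  # StorageClasses that actually exist in the cluster.
  kubectl get storageclass
  # Binding and provisioning events for this claim.
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=PersistentVolumeClaim,involvedObject.name=$pvc"
}
```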

3. Affinity and Anti-Affinity Rules

Complex scheduling rules, such as nodeAffinity or podAntiAffinity, can unintentionally prevent a Pod from being scheduled if no node satisfies all criteria. If Pods remain Pending and events mention scheduling restrictions, review these rules.

For instance, if you require a Pod to run on a node with a specific label (nodeSelector) that no existing node possesses, the Pod will never schedule.

# Example: Restricting deployment to nodes labeled 'disktype: ssd'
spec:
  nodeSelector:
    disktype: ssd
  containers:
  # ...

Troubleshooting Tip: Temporarily comment out restrictive nodeSelector or affinity rules. If the Pod schedules successfully, you know the issue lies in the selection criteria.
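Before editing the manifest, you can check whether any node carries the label a nodeSelector asks for. The helper name below (count_nodes_matching) is hypothetical; the selector uses kubectl's key=value form, so the YAML entry disktype: ssd becomes disktype=ssd.

```shell
# Hypothetical helper: count nodes that carry a given label.
# Usage: count_nodes_matching disktype=ssd
count_nodes_matching() {
  # -o name prints one line per matching node; wc -l counts them.
  kubectl get nodes -l "$1" -o name | wc -l | tr -d ' '
}
```

A count of 0 means no node can satisfy the selector; kubectl get nodes --show-labels lists the labels that do exist.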

Debugging Workflow

When faced with a failed deployment, follow this structured path:

Deployment Status: kubectl get deployments -> Are replicas reporting ready?
Pod Events: kubectl describe pod <failing-pod> -> Check the Events section for immediate errors (image pulls, failed scheduling, volume issues) and the container's Last State for OOMKilled.
Container Logs: kubectl logs <failing-pod> --previous -> If the container starts but crashes (CrashLoopBackOff), the logs of the previous run show whether application logic or a liveness probe is at fault.
Resource Check: If the Pod is Pending, check for resource constraints (Insufficient cpu/memory) or failed volume bindings (PVC status).
Configuration Validation: Review the YAML for indentation, correct field names, and valid resource values (requests/limits).
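The path above amounts to a lookup from the Pod's reported reason to the next check. As an illustrative sketch only (the function name next_step and the wording of each hint are assumptions, not part of kubectl), that mapping could be written as:

```shell
# Hypothetical helper: map a Pod status reason to the next check in this guide.
# Usage: next_step CrashLoopBackOff
next_step() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "check image name, tag, and imagePullSecrets" ;;
    CrashLoopBackOff)
      echo "check kubectl logs --previous and liveness probe events" ;;
    Pending)
      echo "check scheduling events, resource requests, and PVC status" ;;
    *)
      echo "check the Events section of kubectl describe pod" ;;
  esac
}
```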

Work from the cluster's evidence instead of guessing. Check status, events, logs, resources, and manifest structure in that order, then make the smallest YAML change that matches the error you found.