A Practical Guide to Debugging Failed Shell and Command Modules

The Ansible command and shell modules are the backbone of many advanced playbooks, allowing users to execute arbitrary binaries or scripts on remote hosts. While powerful, these modules are often the hardest part of a playbook to debug. When a script fails, Ansible by default judges the task by its exit status alone and knows nothing about why the script failed.

Mastering the debugging techniques for these modules (checking return codes, capturing standard error, and using the failed_when conditional) is essential for building reliable, production-grade Ansible playbooks. This guide provides actionable steps and practical examples for identifying, diagnosing, and controlling failures arising from external command execution.

Command vs. Shell: Understanding the Difference

Before diving into debugging, it is vital to understand the fundamental difference between the two modules, as their execution environment impacts failure modes.

`ansible.builtin.command`

This module executes the command directly, without passing it through a shell. This makes it safer and more predictable, because shell features such as variable expansion ($HOME), globbing (*), pipes (|), and redirection (>) are not interpreted.

Best Practice: Use command whenever the task is simple and does not require shell features.
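As a minimal sketch of what running without a shell looks like in practice, the task below passes its arguments as a list through the module's argv parameter, so a value containing spaces or shell metacharacters reaches the program as a single literal argument. The script path and argument values are hypothetical.

```yaml
- name: Run a binary directly, with no shell parsing of its arguments
  ansible.builtin.command:
    argv:
      - /usr/local/bin/report.sh      # hypothetical script
      - --title
      - "Disk usage > 90% on *"       # passed literally: no redirection, no globbing
  register: report_result
```

Because nothing is interpreted by a shell, quoting mistakes in the argument cannot turn into an accidental redirect or glob expansion.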

`ansible.builtin.shell`

This module runs the command through a shell on the remote host (/bin/sh by default). It is needed when the command relies on shell syntax such as pipes, redirection, variable expansion, or command chaining (e.g., cd /tmp && ls -l).

Warning: Since shell relies on the environment, it is more prone to unpredictable failures related to PATH configuration, hidden environment variables, or complex quoting.
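One failure mode specific to shell is the pipeline: by default the shell reports only the exit status of the last command in a pipe, so a failure in an earlier stage can go unnoticed. The sketch below assumes bash is installed at /bin/bash on the target host, and the log path is hypothetical.

```yaml
- name: Count lines in a compressed log, failing if any pipeline stage fails
  ansible.builtin.shell: |
    set -o pipefail
    zcat /var/log/app/archive.log.gz | wc -l
  args:
    executable: /bin/bash   # pipefail is not available in every /bin/sh
  register: line_count
  changed_when: false
```

Without pipefail, a missing archive would make zcat fail while wc -l still exits 0, and the task would report success with a count of 0.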

The Anatomy of an Ansible Command Failure

By default, Ansible determines the success or failure of a command or shell module task based on the process's return code (RC).

Return Code (RC)	Interpretation
`rc = 0`	Success (Task continues)
`rc != 0`	Failure (Task is marked failed; by default the host runs no further tasks in the play)

However, this simple check often doesn't capture the nuance of real-world scripts. A command might return an RC of 0 but still produce an unwanted result (a logical failure), or a command might return an expected non-zero RC (e.g., grep returns 1 if it finds no matches).

To handle these nuances, we must capture the output and conditionally control the failure state.
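As a preview of the grep case mentioned above, here is a minimal sketch that treats "no match" as a normal outcome. The grep path and the file being searched are assumptions; the register and failed_when keywords are explained in the steps that follow.

```yaml
- name: Look for a setting that may legitimately be absent
  ansible.builtin.command: /usr/bin/grep -q "^PermitRootLogin" /etc/ssh/sshd_config
  register: grep_result
  failed_when: grep_result.rc not in [0, 1]   # 0 = match, 1 = no match, 2+ = real error
  changed_when: false
```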

Step 1: Capturing Command Output with `register`

The first step in effective debugging is capturing all available output streams into an Ansible variable using the register keyword. This allows inspection of the return code, standard output, and standard error.

To prevent the playbook from halting immediately upon a non-zero return code during initial testing, it is often useful to temporarily use ignore_errors: yes.

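- name: Execute a potentially unreliable command and capture results
  # Keep stdout and stderr separate (no '2>&1') so each stream can be
  # inspected on its own, and let the script's own return code through.
  ansible.builtin.command: /usr/local/bin/check_config.sh
  register: cmd_output
  ignore_errors: yes  # Temporarily allow RC != 0 to proceed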

Once registered, the cmd_output variable will contain several useful keys, most notably:

cmd_output.rc: The integer return code.
cmd_output.stdout: The standard output stream.
cmd_output.stderr: The standard error stream.
cmd_output.failed: A boolean indicating if Ansible currently considers the task failed.
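These keys can be used directly in later tasks. The sketch below gates a follow-up action on the captured return code; the service name is hypothetical.

```yaml
- name: Reload the application only when the config check passed
  ansible.builtin.service:
    name: my_app          # hypothetical service name
    state: reloaded
  when: cmd_output.rc == 0
```

Registered results from command and shell also expose stdout_lines and stderr_lines, which hold the same output split into a list of lines and are often easier to read or loop over.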

Step 2: Inspecting Captured Data with `debug`

Use the debug module immediately after the failed task to inspect the contents of the registered variable. This helps distinguish between a true technical failure (e.g., command not found) and a logical failure (e.g., script ran but reported an internal error).

- name: Display full captured output for debugging
  ansible.builtin.debug:
    var: cmd_output
  # Only show this when the task failed, to keep the output clean
  when: cmd_output is failed

- name: Highlight stderr contents
  ansible.builtin.debug:
    msg: "Captured STDERR: {{ cmd_output.stderr }}"
  when: cmd_output.stderr | length > 0

By inspecting the full output, you can pinpoint the specific error message or pattern that indicates a true failure.
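Once the cause is understood, one option is to turn the inspection into an explicit check that fails with a readable message. This is a sketch using ansible.builtin.assert; the message wording is illustrative only.

```yaml
- name: Stop with a readable message once the cause is known
  ansible.builtin.assert:
    that:
      - cmd_output.rc == 0
    fail_msg: >-
      check_config.sh exited with rc={{ cmd_output.rc }}.
      stderr: {{ cmd_output.stderr | default('', true) }}
    success_msg: "Configuration check passed"
```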

Step 3: Overriding Default Failure Behavior with `failed_when`

The failed_when conditional is the most powerful tool for debugging and managing complex shell module results. It allows you to define custom logic, using Jinja2 expressions, to determine if a task should be marked as failed, regardless of the default return code.

Scenario A: Ignoring a Non-Zero Return Code

Often, a utility returns a non-zero code to indicate an expected state. For instance, if you are checking if a service exists using a command that returns RC=1 when the service is missing, you may only want to fail if the RC is greater than 1.

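- name: Check service status, but ignore RC=1 (service not found)
  # The exact exit codes depend on the tool and its version; confirm what
  # your systemctl returns for a missing unit before relying on this threshold.
  ansible.builtin.command: systemctl is-enabled my_optional_service
  register: service_status
  failed_when: service_status.rc > 1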

Scenario B: Failing on Logical Errors (RC=0, but Bad Output)

If a script always returns RC=0 even when an internal error occurs, but prints a specific error string to stdout or stderr, use failed_when to catch that string.

- name: Validate database connectivity script
  ansible.builtin.command: /opt/scripts/db_connect_test.sh
  register: db_result
  # Check both stdout and stderr for common error phrases.
  # A YAML list under failed_when is combined with AND, so the checks
  # are joined with 'or' to fail when either phrase appears.
  failed_when: >-
    'Connection refused' in db_result.stderr or
    'Authentication failure' in db_result.stdout
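When the exact wording of the error varies, a pattern match can be more forgiving than a literal substring. The sketch below uses Ansible's search test with a case-insensitive regular expression over both streams; the pattern itself is hypothetical and should be adapted to what the script really prints. Note that defining failed_when replaces the default return-code check, so the rc test is included explicitly.

```yaml
- name: Validate database connectivity script (pattern match)
  ansible.builtin.command: /opt/scripts/db_connect_test.sh
  register: db_result
  failed_when: >-
    db_result.rc != 0 or
    (db_result.stdout ~ db_result.stderr) is search('refused|authentication fail', ignorecase=true)
```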

Scenario C: Combining RC and Output Checks

For robust checks, combine the return code and content checks using logical operators (and, or, parentheses).

- name: Check deployment logs
  ansible.builtin.command: tail -n 50 /var/log/deployment.log
  register: log_check
  # Fail if the RC is non-zero OR if the successful output contains the word 'FATAL'
  failed_when: log_check.rc != 0 or 'FATAL' in log_check.stdout

Tip: When using failed_when, you should generally remove ignore_errors: yes unless you explicitly want the failure to be logged but the play to continue.
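If the goal is to record a failure and carry on, a block with a rescue section is an alternative to ignore_errors that keeps the handling explicit. This is a sketch under assumptions: the script path is hypothetical, and ansible_failed_result is the variable Ansible makes available inside rescue with the result of the failed task.

```yaml
- name: Run the migration and handle a failure explicitly
  block:
    - name: Run migration script
      ansible.builtin.command: /opt/scripts/migrate.sh   # hypothetical script
      register: migrate_result
      failed_when: migrate_result.rc != 0 or 'FATAL' in migrate_result.stdout
  rescue:
    - name: Report what went wrong, then let the play continue
      ansible.builtin.debug:
        msg: "Migration failed (rc={{ ansible_failed_result.rc }}): {{ ansible_failed_result.stderr }}"
```

Adding an ansible.builtin.fail task at the end of the rescue section would stop the host after reporting instead of continuing.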

Best Practices for Reliable Command Execution

To minimize the need for complex debugging, follow these standards when writing tasks that use command or shell:

1. Always Use Absolute Paths

Do not rely on the remote user's $PATH. Always specify the full path to the executable (e.g., /usr/bin/python, not just python). This avoids failures caused by inconsistent environments or subtle differences in the execution path.
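A short sketch of both approaches: calling the executable by its full path, and pinning PATH with the environment keyword for a script that calls other tools by bare name. The interpreter location, script path, and PATH value are assumptions about the target host.

```yaml
- name: Call the interpreter by absolute path
  ansible.builtin.command: /usr/bin/python3 --version
  register: python_version
  changed_when: false

- name: Pin PATH for a script that calls other tools by bare name
  ansible.builtin.command: /opt/scripts/build.sh   # hypothetical script
  environment:
    PATH: /usr/local/bin:/usr/bin:/bin
```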

2. Leverage Conditionals over Shell Logic

Instead of using complex shell logic like || or && inside the shell module, utilize Ansible's native conditionals (when:, failed_when:, changed_when:) and the register keyword. This keeps the playbook logic transparent and easier to debug.
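For example, a chained command can usually be split into separate tasks. In the sketch below, both script paths are hypothetical. Because a failed task stops the host by default, the second task only runs when the first succeeds, which mirrors && while attributing any failure to the step that caused it.

```yaml
# Instead of: shell: /opt/app/bin/validate.sh && /opt/app/bin/deploy.sh
- name: Validate the release
  ansible.builtin.command: /opt/app/bin/validate.sh   # hypothetical script
  register: validate_result
  changed_when: false

- name: Deploy the release
  ansible.builtin.command: /opt/app/bin/deploy.sh     # hypothetical script
  register: deploy_result
```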

3. Explicitly Control Change Detection (`changed_when`)

By default, command and shell report a task as changed every time they run, because Ansible cannot tell whether the command modified anything. If your script makes no changes to the system (e.g., a simple status check), you should define when the task counts as a change using changed_when.

- name: Check disk space (should not result in 'changed')
  ansible.builtin.command: df -h /data
  changed_when: false
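changed_when also accepts an expression, which helps when a script sometimes changes the system and says so in its output. In this sketch both the script path and the "0 updates applied" message are hypothetical; substitute whatever your script prints when it has nothing to do.

```yaml
- name: Apply pending schema updates
  ansible.builtin.command: /opt/scripts/apply_updates.sh   # hypothetical script
  register: update_result
  changed_when: "'0 updates applied' not in update_result.stdout"
```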

4. Use State Modules Where Possible

If you find yourself using shell to check for file existence, start/stop services, or install packages, stop and look for a dedicated Ansible module (e.g., ansible.builtin.stat, ansible.builtin.service, ansible.builtin.package). Dedicated modules handle idempotency and error checking internally, reducing debugging effort significantly.
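A brief sketch of the kind of replacement this suggests. The package and service names are examples only and differ between distributions (the SSH daemon, for instance, is named ssh on some systems and sshd on others).

```yaml
# Instead of: shell: yum install -y rsync
- name: Ensure rsync is installed
  ansible.builtin.package:
    name: rsync
    state: present

# Instead of: shell: systemctl start sshd
- name: Ensure the SSH daemon is running
  ansible.builtin.service:
    name: sshd
    state: started
```

Both tasks report changed only when they actually alter the host, so no changed_when or failed_when tuning is needed.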

Conclusion

Debugging failed shell and command modules goes beyond reading an error message: it requires analyzing the process output streams and controlling how Ansible decides that a task has failed. By using register to capture output, debug to inspect it, and failed_when to define precise failure conditions, you gain control over external execution and keep your playbooks predictable even when the commands they run are unreliable or complex.