A Practical Guide to Debugging Failed Shell and Command Modules

Debug Ansible shell and command failures with register, stdout, stderr, rc, failed_when, and changed_when examples.

The Ansible command and shell modules are useful when no purpose-built module exists, but they can be awkward to debug. A failed task may only show a return code unless you capture the command output yourself.

This guide shows you how to debug failed shell and command modules by checking rc, stdout, and stderr, then using failed_when and changed_when to make Ansible report the real result.


Command vs. Shell: Understanding the Difference

The two modules run commands in different ways, and that difference determines how each one can fail.

ansible.builtin.command

This module executes the command directly, without passing it through a shell. That makes it safer and more predictable, but shell features such as variable expansion ($HOME), globbing (*), pipes (|), redirection (>), and command chaining (&&) are not available.

Best Practice: Use command whenever the task is simple and does not require shell features.
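As a minimal sketch of what direct execution means in practice, the task below passes its arguments as a list through the module's argv parameter, so no shell ever parses them. The directory path is a hypothetical example.

```yaml
- name: List a directory whose name contains a space, without shell parsing
  ansible.builtin.command:
    argv:
      - ls
      - -l
      - /srv/app data/releases   # hypothetical path; the space needs no quoting
  register: listing
  changed_when: false   # listing files never changes the host
```

Because each list item reaches the program as a single argument, characters such as *, |, or > would be passed through literally rather than interpreted.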

ansible.builtin.shell

This module executes the command through a shell on the remote host (/bin/sh by default). This is necessary when the command depends on shell syntax such as pipes, redirection, variable expansion, or chained commands (e.g., cd /tmp && ls -l).

Warning: Since shell relies on the environment, it is more prone to unpredictable failures related to PATH configuration, hidden environment variables, or complex quoting.
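One failure mode worth illustrating is the pipeline: under plain /bin/sh the return code comes from the last command only, so an earlier stage can fail unnoticed. The sketch below assumes bash is installed on the target and uses a hypothetical export script and output path.

```yaml
- name: Export data through a pipeline and surface failures from any stage
  ansible.builtin.shell:
    # pipefail makes the pipeline return non-zero if any stage fails
    cmd: set -o pipefail && /opt/scripts/export_data.sh | gzip > /tmp/export.gz
    executable: /bin/bash   # assumption: bash exists on the remote host
  register: export_result
```

Without set -o pipefail, a crash in the export script would be masked by a successful gzip, and the task would report rc = 0.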

The Anatomy of an Ansible Command Failure

By default, Ansible determines the success or failure of a command or shell module task based on the process's return code (RC).

Return Code (RC)    Interpretation
rc = 0              Success (the play continues with the next task)
rc != 0             Failure (the task is marked failed and Ansible stops running tasks on that host)

However, this simple check often doesn't capture the nuance of real-world scripts. A command might return an RC of 0 but still produce an unwanted result (a logical failure), or a command might return an expected non-zero RC (e.g., grep returns 1 if it finds no matches).

To handle these nuances, we must capture the output and conditionally control the failure state.

Step 1: Capturing Command Output with register

The first step in effective debugging is capturing all available output streams into an Ansible variable using the register keyword. This allows inspection of the return code, standard output, and standard error.

To prevent the playbook from halting immediately upon a non-zero return code during initial testing, it is often useful to temporarily use ignore_errors: yes.

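- name: Execute a potentially unreliable command and capture results
  # No '2>&1' and no '|| exit 1': keep stderr separate from stdout
  # and preserve the script's real return code for inspection
  ansible.builtin.command: /usr/local/bin/check_config.sh
  register: cmd_output
  ignore_errors: yes  # Temporarily let the play continue when rc != 0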

Once registered, the cmd_output variable will contain several useful keys, most notably:

  • cmd_output.rc: The integer return code.
  • cmd_output.stdout: The standard output stream.
  • cmd_output.stderr: The standard error stream.
  • cmd_output.failed: A boolean indicating if Ansible currently considers the task failed.

Step 2: Inspecting Captured Data with debug

Use the debug module immediately after the failed task to inspect the contents of the registered variable. This helps distinguish between a true technical failure (e.g., command not found) and a logical failure (e.g., script ran but reported an internal error).

- name: Display full captured output for debugging
  ansible.builtin.debug:
    var: cmd_output
  # Only show this when the task failed, to keep the output clean
  when: cmd_output is failed

- name: Highlight stderr contents
  ansible.builtin.debug:
    msg: "Captured STDERR: {{ cmd_output.stderr }}"
  when: cmd_output.stderr | length > 0

By inspecting the full output, you can pinpoint the specific error message or pattern that indicates a true failure.
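When the full variable dump is too noisy, a shorter summary can help. The following is a sketch that assumes the cmd_output variable registered in Step 1; the default filters are a precaution in case a key is missing from the result.

```yaml
- name: Summarize the captured result
  ansible.builtin.debug:
    msg:
      - "rc: {{ cmd_output.rc | default('n/a') }}"
      - "stdout lines: {{ cmd_output.stdout_lines | default([]) }}"
      - "stderr lines: {{ cmd_output.stderr_lines | default([]) }}"
  when: cmd_output is failed
```

The stdout_lines and stderr_lines keys hold the same text as stdout and stderr, split into lists, which tends to be easier to read in playbook output.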

Step 3: Overriding Default Failure Behavior with failed_when

The failed_when conditional lets you state, as a Jinja2 expression, exactly when a task counts as failed. When it is set, its result replaces the default return-code check, so include rc in the expression if a non-zero code should still fail the task.

Scenario A: Handling an Expected Non-Zero Return Code

Some utilities return a non-zero code for an expected result. For example, grep returns 1 when it finds no match and greater than 1 for actual errors.

- name: Check whether a setting exists, but do not fail when absent
  ansible.builtin.command: grep -q '^feature_enabled=true' /etc/myapp.conf
  register: grep_result
  failed_when: grep_result.rc > 1
  changed_when: false
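Once rc = 1 no longer fails the task, later tasks can branch on it. A minimal sketch, repeating the check above and adding a follow-up task (the message text is illustrative):

```yaml
- name: Check whether the setting exists
  ansible.builtin.command: grep -q '^feature_enabled=true' /etc/myapp.conf
  register: grep_result
  failed_when: grep_result.rc > 1   # 1 means "no match", which is acceptable
  changed_when: false

- name: Report that the setting is absent
  ansible.builtin.debug:
    msg: "feature_enabled=true was not found in /etc/myapp.conf"
  when: grep_result.rc == 1
```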

Scenario B: Failing on Logical Errors (RC=0, but Bad Output)

If a script always returns RC=0 even when an internal error occurs, but prints a specific error string to stdout or stderr, use failed_when to catch that string.

- name: Validate database connectivity script
  ansible.builtin.command: /opt/scripts/db_connect_test.sh
  register: db_result
  # Check both stdout and stderr for common error phrases
  failed_when: >
    ('Connection refused' in db_result.stderr) or
    ('Authentication failure' in db_result.stdout)
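Exact substring checks miss variations in letter case or in which stream carries the message. One possible variation, sketched here, joins both streams and applies Ansible's search test with a case-insensitive regular expression. The error phrases are the ones used above; treating the script as read-only is an assumption.

```yaml
- name: Validate database connectivity, matching error text case-insensitively
  ansible.builtin.command: /opt/scripts/db_connect_test.sh
  register: db_result
  changed_when: false   # assumption: the test script only reads state
  failed_when: >-
    (db_result.stdout ~ "\n" ~ db_result.stderr)
    is search('connection refused|authentication failure', ignorecase=true)
```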

Scenario C: Combining RC and Output Checks

For robust checks, combine the return code and content checks using logical operators (and, or, parentheses).

- name: Check deployment logs
  ansible.builtin.command: tail -n 50 /var/log/deployment.log
  register: log_check
  # Fail if the RC is non-zero OR if the successful output contains the word 'FATAL'
  failed_when: log_check.rc != 0 or 'FATAL' in log_check.stdout

Tip: When using failed_when, you should generally remove ignore_errors: yes unless you explicitly want the failure to be logged but the play to continue.
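failed_when also accepts a list of conditions, which Ansible joins with an implicit and. The sketch below uses a hypothetical health-check script and assumes it returns 3 for warnings that should not fail the task.

```yaml
- name: Run a health check that uses rc 3 for warnings
  ansible.builtin.command: /opt/scripts/health_check.sh   # hypothetical script
  register: health
  changed_when: false
  failed_when:
    - health.rc != 0
    - health.rc != 3   # assumption: 3 means "warning", not failure
```

The task fails only when every condition in the list is true, that is, when rc is neither 0 nor 3.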

Best Practices for Reliable Command Execution

To minimize the need for complex debugging, follow these standards when writing tasks that use command or shell:

1. Always Use Absolute Paths

Do not rely on the remote user's $PATH. Always specify the full path to the executable (e.g., /usr/bin/python, not just python). This avoids failures caused by inconsistent environments or subtle differences in the execution path.
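The two tasks below sketch both options: calling an executable by its full path, and setting PATH explicitly through the environment keyword when the tool lives in a non-standard directory. The tool name and its directory are hypothetical.

```yaml
- name: Call the interpreter by its absolute path
  ansible.builtin.command: /usr/bin/python3 --version
  changed_when: false

- name: Run a tool from a non-standard directory with an explicit PATH
  ansible.builtin.command: mytool --version   # hypothetical executable
  environment:
    PATH: "/opt/mytool/bin:/usr/bin:/bin"
  changed_when: false
```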

2. Leverage Conditionals over Shell Logic

Instead of chaining commands with || or && inside the shell module, use the register keyword together with Ansible's native conditionals (when:, failed_when:, changed_when:). This keeps the playbook logic transparent and easier to debug.
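As an illustration, a shell one-liner of the form "status || start" could be split into two tasks. The scripts are hypothetical, and the sketch assumes the status script returns 1 when the application is not running and a higher code on real errors.

```yaml
- name: Check whether the application is running
  ansible.builtin.command: /opt/myapp/bin/status   # hypothetical script
  register: app_status
  failed_when: app_status.rc > 1   # assumption: 1 means "not running"
  changed_when: false

- name: Start the application only when it was not running
  ansible.builtin.command: /opt/myapp/bin/start   # hypothetical script
  when: app_status.rc == 1
```

Each step now appears as its own line in the playbook output, with its own rc, stdout, and stderr.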

3. Explicitly Control Change Detection (changed_when)

By default, command and shell report changed every time they run successfully, because Ansible cannot tell whether the command modified the system. If your script runs but makes no changes (e.g., a simple status check), define when the task results in a change using changed_when.

- name: Check disk space (should not result in 'changed')
  ansible.builtin.command: df -h /data
  changed_when: false
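For commands that sometimes change the system, changed_when can inspect the registered output instead of being fixed to false. This sketch assumes a hypothetical migration script that prints the word "applied" only when it actually changed something.

```yaml
- name: Apply pending migrations and report a change only when some ran
  ansible.builtin.command: /opt/scripts/migrate.sh   # hypothetical script
  register: migrate_result
  changed_when: "'applied' in migrate_result.stdout"   # assumed output marker
```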

4. Use State Modules Where Possible

If you find yourself using shell to check for file existence, start/stop services, or install packages, stop and look for a dedicated Ansible module (e.g., ansible.builtin.stat, ansible.builtin.service, ansible.builtin.package). Dedicated modules handle idempotency and error checking internally, reducing debugging effort significantly.
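A short sketch of the three replacements named above follows. The file path and the nginx package and service names are illustrative assumptions.

```yaml
# Replaces a shell test such as "test -f /etc/myapp.conf"
- name: Check whether the config file exists
  ansible.builtin.stat:
    path: /etc/myapp.conf
  register: conf_file

# Replaces a shell call to the system package manager
- name: Ensure the package is installed
  ansible.builtin.package:
    name: nginx
    state: present

# Replaces a shell call to the service manager
- name: Ensure the service is running
  ansible.builtin.service:
    name: nginx
    state: started
  when: conf_file.stat.exists
```

None of these tasks needs failed_when or changed_when, since each module reports failure and change status itself.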

Final Takeaway

When a shell or command task fails, capture the result first, inspect rc, stdout, and stderr, then encode the real success condition in failed_when. Once the task is stable, add changed_when so status checks do not show false changes in every playbook run.