Maximizing Ansible Performance with ControlPersist and Pipelining

Slow Ansible runs usually feel mysterious until you watch what is happening on the wire. A playbook with twenty small tasks against a hundred hosts can spend a surprising amount of time opening SSH sessions, copying temporary module files, running Python, collecting output, and closing connections. The work on the remote host may take milliseconds, but the connection overhead repeats again and again.

Two settings often help: SSH connection reuse through ControlPersist, and Ansible pipelining. They are not magic switches, and they will not fix slow package mirrors, overloaded databases, or tasks that do heavy work. They do reduce avoidable communication overhead, which is exactly where many small-task playbooks waste time.

First, measure the current pain

Before changing configuration, run the playbook once with timing enabled:

ANSIBLE_CALLBACKS_ENABLED=ansible.posix.profile_tasks ansible-playbook site.yml

If that callback is not installed, add the ansible.posix collection (ansible-galaxy collection install ansible.posix), rely on the timing your CI system already reports, or wrap the command with time. The goal is not a perfect benchmark. You want a baseline and a sense of whether the slow tasks are actual work or tiny tasks repeated across many hosts.

A useful smoke test is an ad-hoc ping across a representative inventory:

time ansible all -m ping

Run it twice. If the second run is much faster after connection reuse is configured, you have confirmed that SSH setup cost was part of the problem.
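The two-run comparison can be scripted so the numbers are easy to record. This is a minimal sketch rather than a benchmark harness: elapsed_seconds is a hypothetical helper name, it assumes a shell whose date command supports +%s, and it only resolves whole seconds.

```shell
# elapsed_seconds CMD [ARGS...]
# Hypothetical helper: runs the command, discards its output, and prints
# how many whole seconds it took.
elapsed_seconds() {
  es_start=$(date +%s)
  "$@" >/dev/null 2>&1
  es_end=$(date +%s)
  echo $((es_end - es_start))
}

# Example use (commented out so the block has no side effects):
#   first=$(elapsed_seconds ansible all -m ping)
#   second=$(elapsed_seconds ansible all -m ping)
#   echo "first=${first}s second=${second}s"
```

Because the helper hides command output, run the command by hand once first so that authentication or inventory errors are not mistaken for a fast run.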

What ControlPersist changes

ControlPersist is an OpenSSH feature. It keeps a master SSH connection open for a period of time so later SSH commands to the same host, user, and port can reuse it. Ansible typically runs at least one SSH command per task per host, so connection multiplexing removes the repeated handshakes.

A practical project-level ansible.cfg looks like this:

[defaults]
forks = 20

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=600s -o ControlPath=~/.ansible/cp/%h-%p-%r

Create the socket directory with private permissions:

mkdir -p ~/.ansible/cp
chmod 700 ~/.ansible ~/.ansible/cp

ControlMaster=auto tells SSH to use an existing master connection when one is available and create one when it is not. ControlPersist=600s keeps the master around for ten minutes after the last session exits. That value is not sacred. For short CI jobs, a few minutes may be enough. For an operator repeatedly running playbooks during a maintenance window, ten or fifteen minutes can be comfortable.

The ControlPath matters more than people expect. Unix domain socket paths are limited to roughly 104 to 108 bytes on common platforms, and the limit applies to the fully expanded path. A long inventory hostname inside a deep workspace path can break multiplexing with a confusing error. Keeping the path short, such as under ~/.ansible/cp, avoids that class of failure.
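A quick length check can catch an overlong control path before SSH does. The sketch below is illustrative only: control_path_len, control_path_ok, and SOCKET_PATH_LIMIT are hypothetical names, the default limit of 100 is an assumed conservative value rather than a documented constant, and the directory argument must already be expanded (pass "$HOME/.ansible/cp", not a literal ~).

```shell
# Assumed conservative limit; real limits vary by platform.
SOCKET_PATH_LIMIT=${SOCKET_PATH_LIMIT:-100}

# control_path_len DIR HOST PORT USER
# Prints the character length of DIR/HOST-PORT-USER, the shape of path
# produced by a host-port-user ControlPath pattern.
control_path_len() {
  cp_path="$1/$2-$3-$4"
  echo "${#cp_path}"
}

# control_path_ok DIR HOST PORT USER
# Succeeds when the expanded path fits within SOCKET_PATH_LIMIT.
control_path_ok() {
  [ "$(control_path_len "$@")" -le "$SOCKET_PATH_LIMIT" ]
}

# Example use (commented out):
#   control_path_ok "$HOME/.ansible/cp" app01.internal.example.com 22 ansible \
#     || echo "control path may be too long"
```

Running the check against the longest hostname in the inventory is usually enough to know whether the pattern is safe.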

Ansible's default SSH arguments already enable multiplexing (ControlMaster=auto with ControlPersist=60s), but explicit settings in a project ansible.cfg make behavior easier to audit and let you choose a longer persistence window. Check the active configuration with:

ansible-config dump --only-changed

What pipelining changes

Without pipelining, Ansible often copies a module to a temporary directory on the remote host, executes it, then cleans up. With pipelining enabled, Ansible can pass module code through the SSH connection instead of writing as many temporary files. That saves round trips and remote filesystem work.

Enable it in the same file:

[ssh_connection]
pipelining = True

Or test it for one run:

ANSIBLE_PIPELINING=True ansible-playbook site.yml

The setting is most noticeable when the playbook has many small modules: file, lineinfile, template, user, service, and short command tasks. It matters less when a task spends most of its time installing packages, building software, transferring large artifacts, or waiting for an external service.

The sudo requiretty trap

The classic pipelining problem is requiretty in sudoers. Some older enterprise Linux configurations required a TTY for sudo. Pipelining does not work well with that requirement because Ansible is trying to stream work through SSH non-interactively.

Check sudoers carefully. Do not edit /etc/sudoers with a normal editor; use visudo:

sudo visudo

If you see a global line like this:

Defaults requiretty

You may need to remove it or override it for the Ansible automation user:

Defaults:ansible !requiretty

Only make that change if it matches your organization’s security policy. On many modern distributions, requiretty is not enabled by default.
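Before changing anything, it helps to know whether a global requiretty line is present at all. The function below is a rough sketch with a hypothetical name, has_global_requiretty. It only looks for an uncommented global Defaults requiretty line in the one file it is given, and it does not follow include directives or evaluate per-user overrides, so treat a negative result as a hint rather than proof.

```shell
# has_global_requiretty FILE
# Succeeds if FILE contains an uncommented global "Defaults requiretty" line.
# Sketch only: ignores included files and scoped Defaults entries such as
# "Defaults:ansible !requiretty".
has_global_requiretty() {
  grep -Eq '^[[:space:]]*Defaults[[:space:]]+requiretty([[:space:],]|$)' "$1"
}

# Example use (commented out; reading /etc/sudoers normally requires root):
#   has_global_requiretty /etc/sudoers && echo "requiretty is set globally"
```

Files under /etc/sudoers.d can set the same option, so a fuller check would run the function over each of them as well.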

A safer combined configuration

For many teams, this is a reasonable starting point:

[defaults]
forks = 20
gathering = smart
fact_caching = jsonfile
fact_caching_connection = .ansible_facts
fact_caching_timeout = 86400

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=600s -o ControlPath=~/.ansible/cp/%h-%p-%r
pipelining = True

forks controls parallelism. It is related to performance, but it is not the same optimization. Raising forks from 5 to 20 may help if your control node and network can handle it. Raising it too far can overload bastion hosts, package repositories, or the managed nodes themselves.

Fact caching helps a different kind of slowness. If your playbooks gather facts on every run and the facts do not need to be fresh every minute, caching can remove repeated setup work. Use it deliberately; stale facts can be confusing if you rely on current memory, disk, or interface data.

How to test without fooling yourself

Test on a subset that looks like production. Five idle test VMs on the same subnet will not tell you much about a hundred mixed hosts behind a bastion.

Run the same playbook before and after the change. Clear existing control sockets between test runs if you need a cold comparison:

rm -f ~/.ansible/cp/*
time ansible-playbook site.yml --limit web

Then run a warm comparison:

time ansible-playbook site.yml --limit web

Look for fewer seconds spent in small repeated tasks. Also watch for new failure modes. If tasks that use privilege escalation (become) start failing once pipelining is on, investigate the sudo configuration before blaming pipelining itself.

When these settings will not help much

ControlPersist and pipelining do not make slow remote work fast. If apt update waits on a mirror, connection reuse will not save you. If a service restart waits thirty seconds because the app drains connections, pipelining is irrelevant. If your playbook copies a 2 GB artifact to every host, focus on artifact distribution, caching, or local package repositories.

They also do not replace idempotent playbook design. A playbook that runs shell commands unconditionally will still waste time. Use modules that can detect current state, add creates or removes to command tasks when appropriate, and avoid gathering facts when the play does not use them.

The practical approach is simple: enable ControlPersist with a short, private control path; test pipelining with your sudo policy; tune forks gradually; and measure with the same inventory and workload each time. In many real Ansible environments, these changes turn a sluggish run into a tolerable one because they remove repeated overhead rather than hiding it.

Bastion hosts and jump hosts

Many inventories do not connect directly to managed nodes. They go through a bastion. ControlPersist still helps, but you need to think about both connections: control node to bastion, and bastion to target.

A common inventory variable looks like this:

[private]
app01 ansible_host=10.0.10.11
app02 ansible_host=10.0.10.12

[private:vars]
ansible_user=ansible
ansible_ssh_common_args='-o ProxyJump=bastion.example.com'

If every task repeatedly builds a jump connection, the bastion becomes part of the overhead. Keep the control path short and private, and watch the bastion’s SSH connection limits. A playbook that runs fine against ten hosts can overload a small bastion when forks is raised to fifty.

For OpenSSH clients older than 7.3, which do not support ProxyJump, you may see ProxyCommand instead:

ansible_ssh_common_args='-o ProxyCommand="ssh -W %h:%p bastion.example.com"'

Test a plain SSH connection before blaming Ansible:

ssh -o ProxyJump=bastion.example.com ansible@10.0.10.11 hostname

If that is slow, Ansible will be slow too.

Pipelining and become in real playbooks

The old advice that pipelining does not work with privilege escalation is too broad. Pipelining can work with become; the common blocker is sudo requiring a TTY or a policy that interferes with non-interactive execution. The only reliable answer is to test with the same become settings your playbooks use.

A small test play is enough:

- hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Confirm privileged command works
      ansible.builtin.command: id
      changed_when: false

Run it with pipelining enabled. If it fails, check sudoers, authentication prompts, and whether the automation user can run the required commands without a password prompt. Password prompts in large automation runs are brittle; they also make timing measurements noisy.

Tune forks with respect for dependencies

Increasing forks is tempting because it produces an immediate visible change. It can also create a new bottleneck. If one play updates package caches on every host at once, a high fork count can punish the package mirror. If every host restarts and reconnects to the same database at once, the database sees the blast.

A measured approach is better:

[defaults]
forks = 10

Run the play. Try 20. Then 30. Watch the control node CPU, SSH process count, bastion load, network saturation, package repository, and the service being changed. The fastest setting is not always the one with the highest parallelism. For rolling application deployments, you may also want serial in the playbook to protect availability:

- hosts: web
  serial: 10

forks caps how many hosts Ansible works on in parallel. serial splits the play into batches of hosts, and each batch finishes the play before the next batch starts. They solve different problems.
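The stepwise forks comparison described above can be scripted so each value is tried under the same conditions. This is a sketch under stated assumptions: sweep_forks is a hypothetical helper name, it relies on the ANSIBLE_FORKS environment variable to override the configured forks value, and it reports whole seconds only.

```shell
# sweep_forks "VALUES" CMD [ARGS...]
# Runs CMD once per forks value, with ANSIBLE_FORKS set for that run only,
# and prints one summary line per run.
sweep_forks() {
  sf_values=$1
  shift
  for sf_n in $sf_values; do
    sf_start=$(date +%s)
    ANSIBLE_FORKS=$sf_n "$@" >/dev/null 2>&1
    sf_status=$?
    sf_end=$(date +%s)
    echo "forks=$sf_n seconds=$((sf_end - sf_start)) exit=$sf_status"
  done
}

# Example use (commented out):
#   sweep_forks "10 20 30" ansible-playbook site.yml --limit web
```

The timing lines say nothing about load on the bastion, mirror, or database, so keep watching those while the sweep runs.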

Clean up stale control sockets

Control sockets usually clean themselves up, but laptops sleep, CI jobs get killed, and network paths change. If SSH starts reporting multiplexing errors, remove stale sockets:

rm -f ~/.ansible/cp/*

That is safe when no Ansible run is active for those sockets. In shared automation runners, avoid placing control sockets in a shared writable directory. Each automation user should have its own private path.
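A slightly more careful variant removes only socket files and leaves anything else in the directory alone. This is a sketch: clean_control_sockets is a hypothetical name, and it assumes a find that supports -maxdepth, which GNU, BSD, and BusyBox find do. It does not check whether a master connection is still alive, so the same caution about active runs applies.

```shell
# clean_control_sockets DIR
# Removes only Unix socket files directly inside DIR and prints each path
# it removes. Regular files and subdirectories are left untouched.
clean_control_sockets() {
  [ -d "$1" ] || return 0
  find "$1" -maxdepth 1 -type s -print -exec rm -f {} +
}

# Example use (commented out):
#   clean_control_sockets "$HOME/.ansible/cp"
```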

Inventory design can erase performance gains

Connection tuning helps, but inventory design can still slow everything down. Dynamic inventory scripts that call cloud APIs slowly, group vars that run expensive lookups, and playbooks that gather facts for hosts they never touch can add delay before SSH even starts.

If a run pauses before the first task, profile inventory loading. Cache dynamic inventory where the plugin supports it. Avoid expensive variable lookups at parse time. Keep host patterns tight:

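ansible-playbook site.yml --limit 'web:&prod'

That command targets hosts that are in both web and prod. The quotes matter: an unquoted & would make the shell run the command in the background and cut the pattern short. Running a broad play against all and then skipping most tasks with when conditions wastes time and makes output harder to read.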

Prefer fewer, clearer tasks when state is related

Ansible readability matters, but excessive tiny tasks can make connection overhead more visible. If you set ten related lines in one config file with ten separate lineinfile tasks, you pay task overhead ten times and make rollback reasoning harder. A template may be clearer and faster:

- name: Render application config
  ansible.builtin.template:
    src: app.conf.j2
    dest: /etc/app/app.conf
    mode: '0644'
  notify: restart app

This is not an argument for giant shell scripts inside Ansible. It is a reminder to model the desired state at the right level. One template for one config file is often better than many micro-edits.

Avoid premature global changes

Put performance settings in the project ansible.cfg when possible. Changing /etc/ansible/ansible.cfg on a shared control node can surprise other teams. A project-level file makes the behavior travel with the repository and keeps CI, laptops, and automation runners closer to the same configuration.

Confirm which config file Ansible is using:

ansible --version

The output includes the active config path. This catches a common mistake: editing one ansible.cfg while Ansible is reading another.
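In CI it can be useful to extract that path and compare it with the file you expect. The filter below is a small sketch with a hypothetical name, active_config_path. It assumes the version output contains a line of the form "config file = PATH" and simply prints the PATH part.

```shell
# active_config_path
# Reads `ansible --version` output on stdin and prints the reported
# config file path.
active_config_path() {
  sed -n 's/^[[:space:]]*config file = //p'
}

# Example use (commented out):
#   cfg=$(ansible --version | active_config_path)
#   [ "$cfg" = "$PWD/ansible.cfg" ] || echo "unexpected config: $cfg"
```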

Know when to stop tuning Ansible

If connection overhead is no longer a major part of runtime, stop tuning SSH and look at the work. Package cache updates, container pulls, database migrations, service health checks, and cloud API calls often dominate mature playbooks. At that point, the better optimization may be a local package mirror, artifact caching, smaller deployment batches, or moving one-time provisioning out of every deploy run.