Mastering Performance: A Practical Guide to Using the Sysstat Toolset
Unlock the full potential of Linux performance monitoring with this practical guide to the Sysstat toolset. Learn how to install and configure Sysstat for historical logging and master the use of the powerful `sar` utility. This article provides actionable command examples for analyzing CPU utilization, memory pressure, disk I/O saturation, and network activity, enabling administrators to establish performance baselines and quickly diagnose and resolve system bottlenecks in production environments.
Performance work gets messy when you only have the current moment. A server is slow now, but was it slow ten minutes ago? Did the disk start backing up before the CPU climbed? Did the problem begin after the cron job, the deploy, or the backup window? The sysstat toolset is useful because it gives you both live readings and a historical record you can compare against.
The main tool is sar, the System Activity Reporter. I reach for it when top is too brief, when an incident already passed, or when I need to show that a problem was storage, memory pressure, network traffic, or CPU saturation instead of guessing from symptoms. The rest of the suite, especially iostat and mpstat, fills in details when sar points you toward a likely bottleneck.
This is not a replacement for full observability. You still want application metrics, logs, tracing, and external checks. But on a Linux host, sysstat is one of the fastest ways to answer the first practical question: what was the machine actually doing?
1. Installation and Initial Configuration of Sysstat
The sysstat package is typically available in the standard repositories of all major Linux distributions.
1.1 Installation Commands
Use the appropriate package manager command for your system:
Debian/Ubuntu:
sudo apt update
sudo apt install sysstat
RHEL/CentOS/Fedora:
sudo yum install sysstat
# or use dnf for newer systems
sudo dnf install sysstat
1.2 Enabling Historical Data Collection
For sar to be truly useful, it must collect data historically. By default, installation often sets up a cron job or systemd timer, but verification is crucial.
On modern systems, ensure the sysstat service is active:
sudo systemctl enable --now sysstat
Configuration File
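Collection is configured in a few places. On Debian/Ubuntu, /etc/default/sysstat holds the ENABLED switch; setting ENABLED="true" turns on periodic collection. On RHEL/CentOS, /etc/sysconfig/sysstat holds retention settings such as HISTORY, the number of days of data files to keep. The sampling frequency itself is set by the cron entry or systemd timer that runs the collector, commonly every 10 minutes.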
Tip: On RHEL-family systems, sysstat data files are stored in /var/log/sa/ by default, with filenames like saXX where XX is the day of the month. Debian-based systems typically use /var/log/sysstat/ instead. Check your package defaults before assuming the path.
After enabling collection, wait for at least one interval and confirm files are appearing:
ls -lh /var/log/sa/
If the directory is empty, check the systemd timers:
systemctl list-timers | grep sysstat
systemctl status sysstat-collect.timer
On older systems, collection may still be driven by cron. The exact packaging varies by distribution, so verify instead of relying on memory from another server.
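Because the data directory differs between distributions, a small helper can make the check portable. This is a minimal sketch, not part of sysstat: find_sa_dir is a hypothetical function name, and the two default paths are simply the common packaging locations mentioned in the tip above.

```shell
# Hypothetical helper: print the first candidate directory that exists.
# With no arguments it tries the two common sysstat data locations.
find_sa_dir() {
    if [ "$#" -eq 0 ]; then
        set -- /var/log/sa /var/log/sysstat
    fi
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Usage: list data files wherever this host keeps them.
#   ls -lh "$(find_sa_dir)"
```

A non-zero exit status means neither directory exists, which usually points back to collection not being enabled.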
2. The Core Utility: System Activity Reporter (sar)
sar is the primary interface for viewing statistics. It can display real-time data or analyze previously collected historical data.
2.1 Basic Syntax for Real-Time Monitoring
The basic syntax is designed to report specific metrics at a specified interval for a defined count.
sar [options] [interval] [count]
Example: To report general CPU statistics every 3 seconds, 10 times:
sar -u 3 10
That command is good during an incident because it gives you a short moving sample instead of one snapshot. A single line can catch a quiet second and mislead you. Ten samples over thirty seconds show whether the pattern is steady, spiky, or already gone.
| Option | Description |
|---|---|
| -u | CPU utilization (default) |
| -r | Memory utilization statistics |
| -d | Block device activity (disk I/O) |
| -n | Network statistics (e.g., -n DEV for interface stats) |
| -q | Run queue length and load averages |
| -W | Swapping activity (pages swapped in and out per second) |
| -A | All metrics (useful for comprehensive snapshots) |
For historical files, the shape is the same. You add -f to choose the data file and often -s and -e to limit the time range. That matters because reading a whole day of output during an outage is slow and noisy.
3. Key Performance Metrics and Practical sar Examples
Understanding the output of sar requires knowledge of what metrics indicate performance health or stress.
3.1 CPU Utilization (sar -u)
CPU utilization is often the first place to look for bottlenecks. High utilization across specific categories indicates the nature of the workload.
sar -u 5 3
| Metric | Description | Bottleneck Indicator |
|---|---|---|
| %user | CPU time spent running user-level processes. | High indicates application/service saturation. |
| %system | CPU time spent running kernel/system tasks. | High suggests intensive system calls or driver issues. |
| %iowait | Time the CPU was idle while the system had outstanding disk I/O requests. | High points toward an I/O bottleneck rather than a CPU shortage. |
| %idle | Time the CPU was idle with no outstanding disk I/O (spare capacity). | Low (e.g., < 5%) suggests CPU saturation. |
Be careful with %iowait. It is commonly misread as "the CPU is busy with disk." It actually means the CPU was idle while at least one I/O request was outstanding. A high value can point toward storage latency, but it needs confirmation with disk metrics. On a database server, for example, high %iowait plus high disk await is a much stronger signal than %iowait by itself.
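When a day of samples is too long to read by eye, a filter can pull out only the saturated intervals. The following is a rough sketch under a few assumptions: low_idle is a hypothetical function name, the 5% default threshold is only an example, and the filter expects sar -u text output on standard input. It finds the %idle column from the header row instead of assuming a fixed position, because the column layout shifts with the time format.

```shell
# Hypothetical filter: print sar -u samples whose %idle is below a limit.
# The limit is the first argument and defaults to 5 (percent).
low_idle() {
    awk -v limit="${1:-5}" '
        /%idle/ {                       # header row: remember where %idle sits
            for (i = 1; i <= NF; i++) if ($i == "%idle") col = i
            next
        }
        /^Average/ { next }             # summary rows use a different layout
        col && $col ~ /^[0-9.]+$/ && $col + 0 < limit { print }
    '
}

# Usage (LC_ALL=C is suggested so decimals use a dot):
#   LC_ALL=C sar -u -f /var/log/sa/sa10 | low_idle 5
```

An empty result means no sample dropped below the threshold, which is itself useful evidence that CPU was not the bottleneck.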
Another useful CPU view is the run queue:
sar -q 5 5
runq-sz shows how many tasks are waiting to run. If load average is high but runq-sz is modest and %iowait is high, you may be looking at blocked I/O rather than pure CPU pressure. If runq-sz stays high and %idle is near zero, the machine probably needs fewer runnable processes, faster code, or more CPU capacity.
3.2 Memory and Paging (sar -r and sar -W)
Memory statistics reveal both consumption and whether the system is resorting to swapping or paging.
Memory Utilization (sar -r):
sar -r 1 5
Focus on kbavail (available memory). If kbmemfree is low, but kbcached and kbbuffers are high, the memory is being used efficiently by the kernel's caching mechanism.
Swapping Activity (sar -W):
sar -W 1 5
Look at pswpin/s (pages swapped in per second) and pswpout/s (pages swapped out per second). Occasional small values are normal, but sustained non-zero values mean the system is actively swapping, which is a strong sign of memory pressure.
Linux memory output can look alarming until you remember that cache is not wasted memory. A server with very little kbmemfree may still be healthy if kbavail is comfortable and swap activity is quiet. The dangerous pattern is different: available memory falls, swap-in and swap-out activity appears, and application latency climbs. That tells you processes are touching memory that no longer fits in RAM.
For a web server, that might happen after a deploy that accidentally doubles worker counts. For a batch host, it might happen when two large jobs overlap. sar will not tell you which process caused it, but it gives you the timeline. Pair it with ps, top, service logs, or cgroup metrics to identify the owner.
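To turn the swap timeline into a quick yes-or-no answer, the samples can be counted. This is a minimal sketch assuming sar -W text output on standard input; swap_summary is a hypothetical name, and the output format is invented for illustration.

```shell
# Hypothetical summary: count how many sar -W samples show any swap traffic.
swap_summary() {
    awk '
        /pswpin\/s/ {                   # header row: locate both columns
            for (i = 1; i <= NF; i++) {
                if ($i == "pswpin/s")  in_col = i
                if ($i == "pswpout/s") out_col = i
            }
            next
        }
        /^Average/ || !in_col || !out_col { next }
        $in_col ~ /^[0-9.]+$/ && $out_col ~ /^[0-9.]+$/ {
            total++
            if ($in_col + $out_col > 0) active++
        }
        END { printf "active=%d total=%d\n", active, total }
    '
}

# Usage:
#   LC_ALL=C sar -W -f /var/log/sa/sa10 | swap_summary
```

A result such as a handful of active samples clustered after a deploy is the kind of timeline evidence worth pairing with process-level tools.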
3.3 Disk I/O Activity (sar -d)
Monitoring disk activity is crucial for database servers or heavily utilized storage systems.
sar -d 3 5
This output requires identifying the specific devices (e.g., sda, vda). Key metrics include:
- tps: Transfers (I/O requests) per second issued to the device. A high value means a high request rate.
- rd_sec/s and wr_sec/s: Sectors read and written per second. Newer sysstat releases report rkB/s and wkB/s (kilobytes per second) instead.
- %util: Percentage of time the device was busy servicing requests. If %util stays near 100% on a traditional block device, storage may be saturated.
On modern SSDs and virtual disks, %util deserves context. Some devices handle parallel I/O well, and cloud volumes may be limited by provisioned IOPS, throughput, or burst credits. Treat %util as a prompt to look closer, not as a complete diagnosis. Confirm with iostat -xd, application latency, and platform-level storage metrics if you are on AWS, Azure, GCP, or another virtualized environment.
One practical workflow is:
sar -d -f /var/log/sa/sa24 -s 09:00:00 -e 10:00:00
iostat -xd 2 5
Use sar to find the bad hour, then use iostat during a live recurrence to inspect device-level latency.
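Finding the bad hour is easier when each device is reduced to its worst sample. The sketch below assumes sar -d text output on standard input with a header row naming the DEV and %util columns; peak_util is a hypothetical function name.

```shell
# Hypothetical summary: report the highest %util seen for each device.
peak_util() {
    awk '
        /%util/ {                       # header row: locate DEV and %util
            for (i = 1; i <= NF; i++) {
                if ($i == "DEV")   dev = i
                if ($i == "%util") util = i
            }
            next
        }
        /^Average/ || !dev || !util || NF < util { next }
        $util ~ /^[0-9.]+$/ && $util + 0 >= peak[$dev] + 0 { peak[$dev] = $util }
        END { for (d in peak) printf "%s %s\n", d, peak[d] }
    ' | sort
}

# Usage (-p asks sar for readable device names such as sda):
#   LC_ALL=C sar -d -p -f /var/log/sa/sa24 -s 09:00:00 -e 10:00:00 | peak_util
```

The peak alone does not prove saturation, for the reasons given above about SSDs and cloud volumes, but it narrows down which device to inspect with iostat.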
3.4 Network Statistics (sar -n)
sar can report activity across various network layers. The most common check is interface activity (DEV).
sar -n DEV 5 1
This command shows metrics like rxpk/s (received packets per second) and txkB/s (transmitted kilobytes per second) for each network interface. Use this to identify interfaces experiencing heavy load or potential errors.
For error counters, add EDEV:
sar -n EDEV 5 3
This can show receive errors, transmit errors, drops, and collisions where supported by the driver. Drops are especially useful when a service complains about intermittent timeouts but CPU and disk look normal. If drops rise during traffic spikes, you may need to inspect NIC queues, kernel network settings, container networking, or the load balancer path.
For TCP-level behavior, try:
sar -n TCP,ETCP 5 3
Retransmits, resets, and failed connection attempts can turn a vague "the site is slow" report into a more specific network or upstream problem.
4. Historical Analysis and Baseline Creation
The true power of sysstat lies in its ability to analyze system activity over extended periods, which is essential for establishing performance baselines (what is normal for your system).
4.1 Analyzing Previous Days
To view data collected on a previous day, use the -f flag to specify the path to the daily saXX file.
Example: To view CPU statistics from the 10th day of the current month:
sar -u -f /var/log/sa/sa10
To review statistics across a specific time window on that day, add the -s (start time) and -e (end time) flags (using 24-hour format).
# View network stats from 14:00 to 16:30 on the 10th
sar -n DEV -f /var/log/sa/sa10 -s 14:00:00 -e 16:30:00
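Typing the file name by hand is error-prone when the day has a single digit. A small helper can build the path; this is a sketch that assumes the saXX day-of-month naming described earlier (some configurations use full-date file names instead, which this does not handle). sa_file_for_day is a hypothetical name, and /var/log/sa is only the default directory.

```shell
# Hypothetical helper: build the data file path for a day of the month.
# Arguments: day (1-31, with or without a leading zero), optional directory.
sa_file_for_day() {
    day=${1#0}                      # strip a leading zero so 08 is not read as octal
    dir=${2:-/var/log/sa}
    printf '%s/sa%02d\n' "$dir" "$day"
}

# Usage:
#   sar -u -f "$(sa_file_for_day 10)"
# With GNU date (an assumption about the host), yesterday's file is:
#   sar -u -f "$(sa_file_for_day "$(date -d yesterday +%d)")"
```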
4.2 Establishing Baselines
- Collect Data: Run sysstat through normal high-load and low-load periods.
- Identify Norms: Analyze historical data (sar -f) to determine average CPU utilization (%user, %system), peak device utilization (%util) and I/O latency (await), and average memory usage.
- Define Thresholds: Treat sustained deviations from your own baseline as investigation triggers. A busy database host and a quiet jump box should not share the same thresholds.
Baselines are more useful when they are tied to real business rhythms. A Monday morning batch import, a nightly backup, and a product launch all create different "normal" shapes. Keep notes when you investigate: "backup started at 01:00," "new release at 14:30," "marketing email at 09:05." Those notes make historical sar output much easier to interpret later.
5. Supporting Sysstat Tools
While sar is the umbrella tool, the sysstat suite includes specialized utilities that offer focused, high-detail reports.
5.1 iostat (Input/Output Statistics)
iostat provides detailed metrics specifically focused on device utilization, particularly useful when diagnosing storage bottlenecks.
# Report disk stats every 2 seconds, 4 times, including extended stats (x)
iostat -xd 2 4
Key iostat metrics:
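- %util: The percentage of elapsed time during which I/O requests were issued to the device (a key indicator of saturation on devices that serve requests serially).
- await: The average time (in milliseconds) that I/O requests spent queued and being serviced. Newer sysstat releases split this into r_await and w_await for reads and writes. High await indicates slow storage responsiveness.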
If await jumps but throughput is not high, look for small random I/O, filesystem issues, noisy neighbors on virtual infrastructure, or an application doing sync-heavy writes. If throughput is high and latency rises with it, the device may simply be at its practical limit.
5.2 mpstat (Multi-Processor Statistics)
If you suspect CPU scheduling issues or uneven workload distribution across cores, mpstat provides per-processor usage statistics, something sar -u aggregates.
# Show usage for every CPU (-P ALL) over one 2-second sample
mpstat -P ALL 2 1
This is invaluable for identifying single-threaded applications that are saturating a single core while others remain idle, or for diagnosing hyperthreading efficiency.
5.3 sadf (Exporting Sysstat Data)
sadf reads the same collected data as sar but can print it in formats that are easier for scripts and dashboards to consume.
sadf -d /var/log/sa/sa24 -- -u
sadf -j /var/log/sa/sa24 -- -r
The -d output is useful for delimited text processing. The -j output is useful when you want JSON. This is handy when you need to attach evidence to an incident review or compare two hosts without manually copying terminal output.
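The delimited output also makes baseline numbers easy to compute. The sketch below assumes the semicolon-separated layout that sadf -d typically produces, with a header line that starts with # and names each column; col_stats is a hypothetical function name, and the output format is invented for illustration.

```shell
# Hypothetical helper: sample count, average, and maximum of one named column
# read from semicolon-delimited sadf -d output on standard input.
col_stats() {
    awk -F';' -v name="$1" '
        /^#/ {                          # header row: find the requested column
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            next
        }
        col && $col ~ /^[0-9.]+$/ {
            n++; sum += $col
            if (n == 1 || $col + 0 > max) max = $col + 0
        }
        END { if (n) printf "n=%d avg=%.2f max=%.2f\n", n, sum / n, max }
    '
}

# Usage:
#   LC_ALL=C sadf -d /var/log/sa/sa24 -- -u | col_stats '%user'
```

Running the same command against a known-good day and an incident day gives two comparable lines to paste into a review. Nothing is printed if the column name is not found.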
6. A Practical Incident Walkthrough
Imagine an API server that started timing out at 10:15. The application logs show requests piling up, but they do not explain why. Start with the historical CPU view:
sar -u -f /var/log/sa/sa24 -s 10:00:00 -e 10:30:00
If %user is high and %idle is low, the app may be CPU-bound. Check per-core usage:
sar -P ALL -f /var/log/sa/sa24 -s 10:00:00 -e 10:30:00
If one core is pinned while others are quiet, suspect a single-threaded worker, hot lock, or uneven process distribution. If all cores are busy, look at request rate, recent deploys, and expensive code paths.
If CPU looks mostly idle but %iowait rises, switch to disk:
sar -d -f /var/log/sa/sa24 -s 10:00:00 -e 10:30:00
High device utilization or rising queue depth around the same time points toward storage. On a database-backed service, the next stop is database logs and slow query data. On a file-serving host, check whether a backup, compression job, or log rotation ran at the same time.
If CPU and disk look fine, inspect memory and network:
sar -r -f /var/log/sa/sa24 -s 10:00:00 -e 10:30:00
sar -W -f /var/log/sa/sa24 -s 10:00:00 -e 10:30:00
sar -n DEV,EDEV,TCP,ETCP -f /var/log/sa/sa24 -s 10:00:00 -e 10:30:00
The point is not to run every command every time. The point is to follow the evidence. sar gives you a timeline across resource classes, which is usually what you need to stop chasing the loudest symptom.
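If the same window gets checked repeatedly, the commands from this walkthrough can be generated instead of retyped. This sketch only prints the commands so they can be reviewed or run one at a time; incident_commands is a hypothetical name, and the list of views simply mirrors the order used above.

```shell
# Hypothetical helper: print the walkthrough's sar commands for one window.
# Arguments: data file, start time, end time (HH:MM:SS, 24-hour).
incident_commands() {
    file=$1 start=$2 end=$3
    for view in "-u" "-P ALL" "-d" "-r" "-W" "-n DEV,EDEV,TCP,ETCP"; do
        printf 'sar %s -f %s -s %s -e %s\n' "$view" "$file" "$start" "$end"
    done
}

# Usage:
#   incident_commands /var/log/sa/sa24 10:00:00 10:30:00
```

Printing instead of executing keeps the habit of following the evidence: run the CPU view first, then pick the next command based on what it shows.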
A Simple Operating Habit
The best way to learn sysstat is to use it before something breaks. Check a healthy server during normal traffic. Check it during backups. Check it after a deploy. Save a few command patterns that match your environment.
When an incident happens, you will already know what normal looks like. That is the real value of the toolset. sar, iostat, mpstat, and sadf do not magically diagnose the application for you, but they keep the conversation grounded in evidence: when the problem started, which resource changed, and whether the host was actually under pressure.