Troubleshooting Linux Resource Exhaustion: CPU, Memory, and Disk Space

Troubleshoot Linux CPU, memory, and disk exhaustion with practical commands, safer cleanup steps, and root-cause checks.

When a Linux server runs out of CPU, memory, or disk space, the first symptom is usually vague: the site is slow, SSH hangs after login, deployments fail, or a service keeps restarting. The fastest way through the incident is to identify which resource is exhausted, then find the process or filesystem behind it.

Do the least risky checks first. Read-only commands such as top, free, df, du, vmstat, and journalctl give you a picture without changing the machine. Killing processes and deleting files can be necessary, but they are not diagnosis.

Identifying the Culprit: Monitoring System Resources

Before you can fix a resource exhaustion problem, you need to pinpoint which resource is being overutilized and which process is responsible. Linux provides a rich set of command-line tools for this purpose.

CPU Usage Monitoring

High CPU usage can make your system feel slow and unresponsive. It's often caused by a runaway process, a demanding application, or an inefficient script.

  • top: This is an indispensable real-time system monitor. It displays a dynamic list of processes, sorted by CPU usage by default. You can see the overall CPU utilization, memory usage, and individual process details.

    top
    

    Within top, press 1 to see individual CPU core usage. Press P to sort by CPU usage. Look for processes consistently consuming a high percentage of CPU.

  • htop: An enhanced, interactive version of top. It's often preferred for its user-friendliness, colorized output, and easier navigation.

    htop
    

    Similar to top, htop allows sorting by CPU usage and provides detailed process information.

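  • mpstat: Part of the sysstat package, mpstat reports per-processor CPU statistics, broken down into user, system, I/O wait, steal, and idle time. With the -I option it also reports interrupt counts. It does not report context switches; use vmstat or pidstat -w for those.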

    mpstat -P ALL 1
    

    This command will display CPU statistics for all cores every second.

Also check load average against CPU count:

uptime
nproc

A load average of 8 means something very different on a 2-core VM than on a 32-core host. Load also includes tasks waiting on uninterruptible I/O, so a high load average with low CPU use may actually point to disk or network storage.
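
To make that comparison concrete, the following is a minimal sketch that divides a load average by the core count. It assumes bash and awk; the function name load_per_core is illustrative, not a standard tool. A result well above 1.00 that persists suggests more runnable or I/O-blocked tasks than cores.

```shell
# Print load average divided by core count, to two decimals.
# Arguments: <load_average> <cpu_count>
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (cores <= 0) { print "cpu count must be positive" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", load / cores
  }'
}

# Typical call on a live host, using the 1-minute load average:
#   load_per_core "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"
```

With the numbers from the example above, load_per_core 8 2 prints 4.00 and load_per_core 8 32 prints 0.25.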

Memory Usage Monitoring

When RAM runs low, the kernel reclaims page cache and moves less-used pages to swap space on disk, which is far slower than RAM and leads to severe performance degradation. When both RAM and swap are exhausted, the kernel's OOM killer starts terminating processes.

  • free -h: Displays the total amount of free and used physical and swap memory in the system, along with the buffers and caches used by the kernel. The -h flag makes the output human-readable (e.g., MB, GB).

    free -h
    

    Pay attention to the available column and the used swap. Low available memory combined with ongoing swap activity points to insufficient RAM. Swap that is used but idle is usually harmless.

  • top / htop: Both top and htop show memory usage per process. Look for processes with a high %MEM value.

  • vmstat: Reports virtual memory statistics. It can show information about processes, memory, paging, block IO, traps, and CPU activity.

    vmstat 5
    

    This command will report statistics every 5 seconds. Look at the si (swap-in) and so (swap-out) columns; high values indicate significant memory swapping.
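
For scripts and alerts, the available figure that free reports can be read directly from /proc/meminfo. The sketch below assumes a kernel that exposes MemAvailable (Linux 3.14 or later); the function name mem_available_pct is hypothetical.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# Argument: path to a meminfo-format file (defaults to /proc/meminfo).
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      # Fail instead of printing a misleading number if a field is missing.
      if (total <= 0 || avail == "") exit 1
      printf "%d\n", avail * 100 / total
    }
  ' "${1:-/proc/meminfo}"
}

# Usage: mem_available_pct
```

What counts as too low depends on the workload. A single reading matters less than a value that keeps falling while si and so stay non-zero.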

For possible OOM kills, check the kernel log:

sudo dmesg -T | grep -i 'killed process'
journalctl -k --since "1 hour ago" | grep -i oom

An OOM kill changes the incident. The immediate question becomes which process was killed, why it exceeded available memory, and whether systemd or an orchestrator restarted it.
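
To answer the first of those questions quickly, you can pull the victim's PID and name out of the kernel log. This is a rough sketch: it assumes the message contains "Killed process <pid> (<name>)", which is common on recent kernels, but the wording varies between kernel versions. The function name oom_victims is illustrative.

```shell
# Read kernel log text on stdin and print "PID NAME" for each OOM victim.
# Assumes the "Killed process <pid> (<name>)" message format.
oom_victims() {
  sed -nE 's/.*Killed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}

# Usage: journalctl -k --since "1 hour ago" | oom_victims
```

If the same name shows up repeatedly, check whether systemd or an orchestrator is restarting it into the same memory limit.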

Disk Space Monitoring

A full disk partition can prevent applications from writing data, cause errors, and even prevent the system from booting.

  • df -h: Reports file system disk space usage. The -h flag makes the output human-readable.

    df -h
    

    This command will list all mounted file systems and show their total size, used space, available space, and mount point. Look for partitions at or near 100% usage.

  • du -sh <directory>: Estimates file space usage for a given directory. The -s flag summarizes, and -h makes it human-readable.

    sudo du -sh /var/log/* | sort -rh | head -n 10
    

    Use this to find which subdirectories are consuming the most disk space.

Check inode usage too:

df -ih

A filesystem can have free gigabytes and still be unable to create files if it has run out of inodes. This happens with millions of tiny files: cache entries, mail queues, session files, build artifacts, or badly rotated logs.
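
When df -ih shows inode usage near 100%, the next step is finding the directories that hold the most files. The sketch below assumes GNU find (for -printf); the function name files_per_dir is hypothetical. It counts regular files only, so directories and symlinks, which also consume inodes, are not included.

```shell
# Print the directories under a path that hold the most regular files,
# one "COUNT DIRECTORY" line each, largest first. Stays on one filesystem.
# Arguments: <path> [max_lines, default 10]
files_per_dir() {
  find "$1" -xdev -type f -printf '%h\n' 2>/dev/null |
    sort | uniq -c | sort -rn | head -n "${2:-10}" |
    awk '{ count = $1; sub(/^ *[0-9]+ /, ""); print count, $0 }'
}

# Usage: sudo bash -c "$(declare -f files_per_dir); files_per_dir /var 20"
```

On a large tree this walks every file, so expect it to take a while and to add I/O load.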

Resolving Resource Exhaustion Issues

Once you've identified the problematic resource and the offending process, you can take steps to resolve the issue.

Addressing High CPU Usage

  1. Identify the Process: Use top or htop to find the process ID (PID) consuming high CPU.
  2. Investigate the Process: Determine what the process is. Is it a user application, a system service, or something unexpected?
    • Legitimate High Usage: If a legitimate application is using a lot of CPU (e.g., compiling software, video encoding), you might need to wait for it to finish, schedule it for off-peak hours, or upgrade your hardware.
    • Runaway Process: If a process is stuck in a loop or consuming excessive CPU unintentionally, you can try to restart it. If that doesn't work, you may need to terminate it.
  3. Terminate the Process (Use with Caution!): You can use the kill command to send signals to processes. The most common signals are:
    • SIGTERM (15): Gracefully asks the process to terminate.
    • SIGKILL (9): Forcefully terminates the process immediately. This should be a last resort as it doesn't allow the process to clean up.
    # Gracefully terminate process with PID 1234
    kill 1234
    
    # Forcefully terminate process with PID 1234
    kill -9 1234
    
  4. Check Logs: Examine system logs (e.g., /var/log/syslog, /var/log/messages, application-specific logs) for errors related to the problematic process.
  5. Optimize Applications/Scripts: If the high CPU usage is due to an inefficient application or script, consider optimizing the code or configuration.
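
The SIGTERM-then-SIGKILL escalation from step 3 can be wrapped so that the forceful signal is only sent after a grace period. This is a minimal sketch assuming bash; the function name stop_gracefully and the 10-second default are illustrative choices, not a standard.

```shell
# Send SIGTERM, wait up to <timeout> seconds, then fall back to SIGKILL.
# Arguments: <pid> [timeout_seconds, default 10]
stop_gracefully() {
  local pid=$1 timeout=${2:-10} waited=0
  # Fails if the process does not exist or we are not allowed to signal it.
  kill -TERM "$pid" 2>/dev/null || return 1
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "PID $pid still running after ${timeout}s; sending SIGKILL" >&2
      kill -KILL "$pid" 2>/dev/null
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage: stop_gracefully 1234 30
```

For services managed by systemd, prefer systemctl stop or restart, which applies the unit's own stop timeout and then escalates in the same way.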

High CPU is not always bad. A batch job using all cores for a short time may be fine. A single-threaded process stuck at 100% of one core while requests queue behind it is different. Look at duration, user impact, and whether the process is expected to be busy.

If you need more context before restarting a service, capture a snapshot:

ps -fp <pid>
sudo lsof -p <pid> | head
sudo timeout 10 strace -p <pid> -tt -T -f

Use strace carefully on production systems. It can slow the traced process noticeably, so keep the sample short; the timeout 10 prefix above detaches after ten seconds. A short sample often tells you whether the process is looping, waiting on files, failing network calls, or repeatedly opening the same resource.

Resolving Memory Leaks and Exhaustion

A memory leak occurs when a program fails to release memory it no longer needs, gradually consuming all available RAM. This can lead to excessive swapping and system unresponsiveness.

  1. Identify the Process: Use top or htop to find processes with high memory (%MEM) or resident set size (RSS) values that are steadily increasing over time.
  2. Investigate the Process: Determine the nature of the application. Is it a known application with potential memory issues, or something custom?
  3. Restart the Application/Service: Often, simply restarting the application or service can temporarily resolve a memory leak by freeing up the accumulated memory.
    # Example: Restarting Apache web server
    sudo systemctl restart apache2
    
  4. Check Application-Specific Monitoring: Many applications (e.g., web servers, databases) have their own monitoring tools or logs that can help diagnose memory issues.
  5. Analyze Core Dumps: For critical applications, you might need to enable core dumps and use debugging tools (like gdb) to analyze the memory state when the leak occurs. This is an advanced troubleshooting step.
  6. Increase Swap Space (Temporary Measure): If you cannot immediately resolve the leak, you can increase swap space to provide more virtual memory. However, this is a workaround, not a solution.
  7. Hardware Upgrade: If your system consistently runs out of memory for its workload, you may need to add more physical RAM.

A better memory investigation watches change over time. One top screenshot only says who is large now. A leak is a trend.

while true; do
  date
  ps -eo pid,comm,rss,%mem --sort=-rss | head -15
  sleep 60
done

If the same process climbs steadily across samples without dropping after traffic falls, you have a stronger leak signal. If many processes grow together during peak traffic, the workload may simply exceed capacity or concurrency limits.
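
Once you have a series of samples for one process, a small summary makes the trend easier to judge than reading raw numbers. The sketch below assumes one RSS value in KiB per line, oldest first; the function name rss_trend and the rss.log file are hypothetical.

```shell
# Read RSS samples (KiB, oldest first, one per line) on stdin and print
# "FIRST LAST GROWTH_PERCENT RISES/STEPS".
rss_trend() {
  awk '
    NR == 1 { first = $1 }
    NR > 1 && $1 > prev { rises++ }   # count sample-to-sample increases
    { prev = $1 }
    END {
      if (NR < 2 || first <= 0) exit 1
      printf "%d %d %d %d/%d\n", first, prev, (prev - first) * 100 / first, rises, NR - 1
    }
  '
}

# Hypothetical collection loop for one PID, sampling once a minute:
#   while sleep 60; do ps -o rss= -p "$pid"; done >> rss.log
#   rss_trend < rss.log
```

Samples of 1000, 1200, 1500, and 2000 KiB produce "1000 2000 100 3/3": memory doubled and rose on every step. A process that rises and falls with traffic shows a lower ratio and little net growth.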

For systemd services, check whether memory limits already exist:

systemctl show <service> -p MemoryCurrent -p MemoryMax

For containers, host-level free -h may look fine while a container hits its own limit. Check docker stats, kubectl top pod, or the orchestrator events for OOM kills.

Managing Full Disk Partitions

When a disk partition fills up, it can cause various system failures. Immediate action is usually required.

  1. Identify the Full Partition: Use df -h to locate the partition(s) at 100% capacity.
  2. Find Large Files/Directories: Use du -sh or du -h --max-depth=1 <directory> to navigate down the directory tree and find what's consuming the space.
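    # Find the largest top-level directories on the root filesystem only
    # (-x keeps du from descending into /proc and other mounted filesystems)
    sudo du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -n 20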
    
    Common culprits include log files (/var/log), temporary files (/tmp), package caches, and user data.
  3. Clean Up Log Files: Log files can grow very large. You can often safely delete old logs, or configure log rotation (logrotate) to manage their size automatically.
    • Deleting Old Logs: Be cautious and ensure you're not deleting currently active logs. You can use find to delete files older than a certain number of days.
      # Delete .log files older than 30 days in /var/log/myapp
      sudo find /var/log/myapp -name "*.log" -type f -mtime +30 -delete
      
    • Log Rotation: Ensure logrotate is configured correctly for your services. It typically runs daily and handles archiving and deleting old logs.
  4. Clear Package Manager Cache: Package managers often keep downloaded package files. Clearing these can free up significant space.
    • Debian/Ubuntu (apt):
      sudo apt autoremove
      sudo apt clean
      
    • CentOS/RHEL/Fedora (yum/dnf):
      sudo yum autoremove  # or dnf autoremove
      sudo yum clean all   # or dnf clean all
      
  5. Remove Unused Packages: Uninstall software you no longer need.
    • Debian/Ubuntu: sudo apt remove <package_name>
    • CentOS/RHEL/Fedora: sudo yum remove <package_name> or sudo dnf remove <package_name>
  6. Check Temporary Directories: Files in /tmp that no running process is using are usually safe to delete. Many distributions clear /tmp at boot or on a timer, but check with lsof before removing files that an application may still have open.
  7. Empty Trash: If you are using a desktop environment, check user trash bins.
  8. Consider Resizing Partitions: If space is consistently an issue and cleanup isn't sufficient, you may need to resize partitions or add more storage. This is a more advanced operation that might require unmounting partitions or booting from a live environment.
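
Before running a find command with -delete, such as the one in step 3, it helps to list exactly what would be removed. This sketch assumes GNU find; the function name preview_old_files is hypothetical, and it deletes nothing.

```shell
# List regular files matching a name pattern that were last modified more
# than <days> days ago, as "SIZE_BYTES<TAB>PATH", largest first.
# Arguments: <directory> <days> [name_pattern, default "*.log"]
preview_old_files() {
  find "$1" -xdev -type f -name "${3:-*.log}" -mtime +"$2" -printf '%s\t%p\n' |
    sort -rn
}

# Usage: sudo bash -c "$(declare -f preview_old_files); preview_old_files /var/log/myapp 30"
# After reviewing the list, rerun the same find expression with -delete
# in place of -printf.
```

Note that rotated logs are often named like app.log.1 or app.log.2.gz, which "*.log" does not match. Pass a different pattern as the third argument if that applies to your service.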

Be careful with deleted files that are still open. df may show a full filesystem even after you removed a large log file, because a running process still has the file handle open.

sudo lsof +L1

If a deleted file is still held open, restarting or reloading the owning service releases the space. Do that intentionally; do not restart a database or critical service in the middle of an incident without understanding the impact.
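
If lsof is not installed, the same information is available under /proc. The following is a rough sketch, assuming bash and GNU stat on Linux; the function name deleted_open_files is hypothetical. It shows how much space each deleted file still pins for a given process, which helps you decide whether a restart is worth it.

```shell
# Print "FD SIZE_BYTES PATH" for each deleted file a process still holds open.
# Argument: <pid>. Inspecting another user's process requires root.
deleted_open_files() {
  local fd target
  for fd in /proc/"$1"/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case $target in
      /memfd:*) continue ;;   # anonymous memory, not a file on disk
      /*' (deleted)')
        printf '%s %s %s\n' "${fd##*/}" \
          "$(stat -Lc %s "$fd" 2>/dev/null)" "${target% (deleted)}"
        ;;
    esac
  done
}

# Usage: sudo bash -c "$(declare -f deleted_open_files); deleted_open_files 1234"
```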

For journal logs, prefer journalctl cleanup over deleting files manually:

journalctl --disk-usage
sudo journalctl --vacuum-time=14d

For Docker hosts, check container logs and unused images:

docker system df
docker ps --size

Do not run broad prune commands blindly on a production host. They can remove images, build cache, stopped containers, and networks that someone expected to keep.

A Triage Order That Works Under Pressure

When everything is slow, use a fixed order so you do not jump between theories.

  1. Confirm the host is reachable and not read-only:

    uptime
    date
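    # Some read-only entries (squashfs, snap images) are normal;
    # look for / or data filesystems that should be writable.
    mount | grep -E '[(,]ro[,)]'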
    
  2. Check CPU and load:

    top
    uptime
    
  3. Check memory and swap:

    free -h
    vmstat 1 5
    
  4. Check disk space and inodes:

    df -h
    df -ih
    
  5. Check recent kernel and service errors:

    journalctl -p warning..alert --since "30 minutes ago"
    journalctl -k --since "30 minutes ago"
    

This order catches the common failures quickly: CPU saturation, swap storms, full filesystems, inode exhaustion, OOM kills, and storage errors.
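
Most of the checks above can be collected into one read-only script so the output is captured for later review. This is a sketch under assumptions: the function names check and triage are hypothetical, top is left out because it is interactive by default, and any tool that is not installed is skipped instead of failing.

```shell
# Run a read-only check if its command exists; otherwise note the skip.
# Each check gets a header so the saved output is easy to scan later.
check() {
  printf '\n== %s ==\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || printf '(exit status %d)\n' "$?"
  else
    printf '(skipped: %s not installed)\n' "$1"
  fi
}

triage() {
  check uptime
  check nproc
  check free -h
  check vmstat 1 5
  check df -h
  check df -ih
  check journalctl -k --since "30 minutes ago" --no-pager
}

# Usage: triage | tee "triage-$(date +%Y%m%dT%H%M%S).txt"
```

Saving the output also gives you the command evidence you will want when writing up the incident.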

Choosing the Least Bad Immediate Fix

During an outage, you may need a short-term fix before the permanent fix is ready.

For CPU exhaustion, a graceful service restart may be safer than kill -9, especially for software that writes state. If one background job is starving user traffic, lower its priority:

sudo renice +10 -p <pid>
sudo ionice -c2 -n7 -p <pid>

For memory exhaustion, reducing concurrency is often safer than adding swap and hoping. Lower web worker counts, pause batch jobs, or temporarily disable expensive features. Swap can buy time, but heavy swap usually turns a clear failure into a slow failure.

For disk exhaustion, delete or rotate files you understand. Good candidates are old compressed logs, package caches, obsolete build artifacts, and temporary files from stopped jobs. Bad candidates are database files, active logs, unknown files under application data directories, and anything you cannot explain.

Root Cause Notes to Capture

After the system is stable, write down what changed. Useful notes are concrete:

  • The exact filesystem or resource that was exhausted.
  • The process, user, service, container, or cron job involved.
  • The command output that proved it.
  • The immediate action taken.
  • The permanent fix needed.

This is not paperwork for its own sake. The next incident is much easier when you know that /var filled because debug logs grew after a deploy, or that memory pressure began when worker count doubled.

Best Practices for Prevention

  • Regular Monitoring: Implement regular monitoring of CPU, memory, and disk space using tools like top, htop, free, df, and dedicated monitoring solutions (e.g., Nagios, Zabbix, Prometheus).
  • Automate Log Rotation: Ensure logrotate is properly configured for all services generating logs.
  • Tune Application Configurations: Optimize application settings to be more resource-efficient. For example, tune web server worker processes, database connection pools, etc.
  • Set Up Alerts: Configure alerts for sustained high usage, fast growth, OOM kills, filesystem fullness, inode exhaustion, and service restarts. Alert on trends, not only hard limits.
  • System Updates: Keep your system and applications updated, as performance improvements and bug fixes are often included in newer versions.
  • Resource Limits: For multi-user systems or containerized environments, consider setting resource limits (e.g., using ulimit or cgroups) to prevent a single process from starving others.
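
As a starting point for filesystem alerts where no monitoring system is in place yet, a threshold check can be built on df. This is a minimal sketch assuming POSIX-format output from df -P and mount points without spaces; the function name fs_over_threshold and the 90% figure are illustrative.

```shell
# Read `df -P` output on stdin and print "MOUNTPOINT USE%" for every
# filesystem at or above the threshold.
# Argument: <threshold_percent>
fs_over_threshold() {
  awk -v limit="$1" '
    NR > 1 {                      # skip the header line
      use = $5; sub(/%$/, "", use)
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}

# Usage (space):  df -P  | fs_over_threshold 90
# Usage (inodes): df -Pi | fs_over_threshold 90
```

A fixed threshold like this only catches hard limits. Growth-rate alerts need stored history, which is where a dedicated monitoring system earns its keep.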

Resource exhaustion troubleshooting is mostly disciplined narrowing. Find the constrained resource, identify the owner, make the smallest stabilizing change, then fix the reason it happened. The basic tools are enough for most incidents if you use them in that order and resist the urge to delete or kill before you understand what you are touching.