Troubleshooting Failed Systemd Services: A Practical Guide for Sysadmins
Systemd services are the backbone of modern Linux systems, but they can fail. This guide shows sysadmins how to troubleshoot common systemd service failures systematically: using `journalctl` for log analysis, diagnosing dependency issues, interpreting exit codes, and applying specific fixes for web servers, databases, and more.
Failed systemd services are usually less mysterious than they first look. The useful evidence is already on the machine: the unit definition, the exact command systemd tried to run, the exit status, and the journal lines around the failure. The trick is to read them in the right order instead of restarting the service ten times and hoping the message changes.
I usually start with three questions: did systemd find the unit, did the process start, and did the application itself reject its configuration or environment? The commands below keep that investigation grounded.
Understanding Systemd Service Failures
When a systemd service fails to start or crashes unexpectedly, the cause is usually one of a small set: a configuration error, a missing dependency, a resource limit, or a bug in the service itself. Systemd records enough detail about each failure to tell these apart.
Common Causes of Service Failures:
- Configuration Errors: Incorrect settings in the service's .service unit file or related configuration files.
- Missing Dependencies: The service relies on other system resources (like network, other services, specific filesystems) that are not available or have not started yet.
- Resource Exhaustion: The service requires more memory, CPU, or disk I/O than the system can provide.
- Permissions Issues: The service process lacks the necessary permissions to access required files, directories, or network ports.
- Bugs in the Service: The application itself has a bug that causes it to crash during startup or operation.
- Corrupted Data: Essential data files used by the service are corrupted.
- Network Issues: Problems with network interfaces, DNS, or firewall rules preventing the service from binding to ports or communicating.
Step 1: Inspecting Service Status
The first step in troubleshooting any failed service is to check its current status. Systemd's systemctl command is your primary tool for this.
Using systemctl status
The systemctl status <service_name>.service command provides a concise overview of the service's current state, recent log entries, and process information.
sudo systemctl status nginx.service
Example Output (Failed Service):
● nginx.service - A high performance web server and reverse proxy
Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2023-10-27 10:30:00 UTC; 1min ago
Docs: man:nginx(8)
Process: 1234 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=1/FAILURE)
Main PID: 1234 (code=exited, status=1/FAILURE)
Oct 27 10:30:00 your-server systemd[1]: Starting A high performance web server and reverse proxy...
Oct 27 10:30:00 your-server nginx[1234]: nginx: [emerg] bind() to port 80 failed (98: Address already in use)
Oct 27 10:30:00 your-server systemd[1]: nginx.service: Main process exited, code=exited, status=1/FAILURE
Oct 27 10:30:00 your-server systemd[1]: Failed to start A high performance web server and reverse proxy.
Key information to look for in systemctl status output:
- Active: This line indicates the current state. failed is the state we are interested in. It might also show failed (Result: exit-code) or failed (Result: oom-kill). The Result value often provides a clue.
- Process: Details about the process that systemd tried to run. If it shows code=exited, status=..., that status is the exit code the program returned, and it is critical.
- Log Entries: The most recent log lines often contain the direct error message from the service.
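The same fields can be read in machine-friendly form, which helps in scripts and monitoring checks. The sketch below assumes a systemd recent enough to support `--value`; the helper names `unit_failure_summary` and `unit_has_failed_result` are made up for illustration.

```shell
# Print the properties behind the "Active:" and "Process:" lines.
# ActiveState, SubState, Result, ExecMainCode and ExecMainStatus are
# properties exposed by `systemctl show`.
unit_failure_summary() {
    systemctl show "$1" \
        -p ActiveState -p SubState -p Result \
        -p ExecMainCode -p ExecMainStatus
}

# Succeeds (exit status 0) when the unit's Result is anything but "success".
unit_has_failed_result() {
    result=$(systemctl show "$1" -p Result --value)
    [ "$result" != "success" ]
}

# Example usage:
#   unit_failure_summary nginx.service
#   unit_has_failed_result nginx.service && echo "nginx needs attention"
```

A Result of exit-code, signal, oom-kill, or timeout narrows the search before you read a single log line.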
Step 2: Analyzing Logs with journalctl
The journalctl command is systemd's powerful tool for querying and displaying logs from the systemd journal. It's essential for getting detailed insights into why a service failed.
Basic journalctl Usage for Services
To view logs for a specific service, use the -u flag:
sudo journalctl -u <service_name>.service
To follow logs in real-time:
sudo journalctl -f -u <service_name>.service
To view logs from the current boot (useful for services that failed during startup; use -b -1 for the previous boot):
sudo journalctl -b -u <service_name>.service
To see logs since a specific time:
sudo journalctl --since "2023-10-27 10:00:00" -u <service_name>.service
Interpreting journalctl Output
Look for error messages, stack traces, or specific error codes reported by the application or systemd itself. The example output from systemctl status already showed a key error: bind() to port 80 failed (98: Address already in use). This clearly indicates another process is already using port 80, preventing Nginx from starting.
Tip: If the service is very verbose, you can limit the output:
sudo journalctl -n 50 -u <service_name>.service # Show last 50 lines
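During an incident a relative time window is often quicker to type than an absolute timestamp. A minimal sketch, where `recent_unit_log` is a hypothetical helper name and the default window of 15 minutes is an arbitrary assumption:

```shell
# Show a unit's journal entries from a recent time window.
# recent_unit_log is a hypothetical helper; -u, --since, -o short-iso and
# --no-pager are standard journalctl options.
recent_unit_log() {
    unit=$1
    since=${2:-15 minutes ago}   # assumed default window
    journalctl -u "$unit" --since "$since" -o short-iso --no-pager
}

# Example usage:
#   sudo sh -c '. ./helpers.sh; recent_unit_log nginx.service "1 hour ago"'
```

Adding `-p err` narrows the output to error priority and worse, but be careful with it: lines a service writes to stdout or stderr are typically stored at info priority, so a priority filter can hide the application's own error message.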
Step 3: Checking Service Dependencies and Requirements
Systemd services often depend on other services or system resources being available. If a dependency isn't met, the service won't start.
Viewing Dependencies
You can inspect the dependencies of a service using systemctl cat and looking at directives like Requires=, Wants=, After=, Before=, and PartOf=.
systemctl cat <service_name>.service
For example, a service that binds to a specific address may need ordering after the network is configured. After=network-online.target only controls order; it does not, by itself, pull that target into the transaction. If the service truly needs it, you often see both:
Wants=network-online.target
After=network-online.target
Be conservative with Requires=. It creates a stronger relationship and can stop your service when the required unit stops. Many application services only need Wants= plus After=.
Checking for Missing Dependencies
While systemctl status often indicates dependency issues, explicitly checking if required services are active can be helpful.
systemctl is-active <dependency_service_name>.service
If a required service is masked or stopped, it can prevent your target service from starting.
systemctl list-dependencies <service_name>.service
This command shows the full dependency tree.
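Checking each dependency by hand gets tedious for units with several Requires= or Wants= entries. The following sketch loops over them and prints only the ones that are not active. `inactive_deps` is a hypothetical helper, and it assumes `systemctl show --value` prints the dependency names separated by whitespace.

```shell
# Print every unit listed in Requires= or Wants= that is not "active".
# Runs in a subshell (note the parentheses) so its variables do not leak.
inactive_deps() (
    unit=$1
    for dep in $(systemctl show "$unit" -p Requires -p Wants --value); do
        # is-active exits non-zero for inactive units, so ignore the status
        state=$(systemctl is-active "$dep") || true
        [ "$state" = "active" ] || printf '%s %s\n' "$dep" "$state"
    done
)

# Example usage:
#   inactive_deps myapp.service
```

Empty output means every direct dependency is active; it says nothing about ordering problems, which still need After= and Before= to be read from the unit.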
Step 4: Understanding Exit Codes
When a service fails, it exits with a specific exit code. This code provides valuable information about the nature of the failure.
- Exit Code 0: Success.
- Exit Code 1: Generic failure for many programs. The specific meaning depends on the application.
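- Exit Code 127: Command not found, typically reported by a shell wrapper that cannot find the program. When the ExecStart binary itself is missing or not executable, systemd reports status=203/EXEC instead.
- Exit Code 137: Killed by SIGKILL (128 + 9). This is often, but not always, related to memory pressure.
- Exit Code 139: Killed by SIGSEGV (128 + 11), a segmentation fault.
The 128 + N values appear when a shell or wrapper script relays the signal. For a process systemd started directly, the status line shows the signal itself, for example code=killed, status=9/KILL or code=dumped, status=11/SEGV.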
From the systemctl status output, we saw status=1/FAILURE. This is a generic failure, and the preceding log messages are essential to understand why it failed with status 1.
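The arithmetic behind the signal-related codes is easy to forget under pressure, so a small lookup can help. This is a rough sketch: `describe_exit_status` is a hypothetical helper, and the 203 entry reflects systemd's own EXEC status for an ExecStart binary it could not run.

```shell
# Translate a numeric exit status into a rough hint.
# describe_exit_status is a hypothetical helper, not part of systemd.
describe_exit_status() {
    code=$1
    case "$code" in
        0)   echo "success" ;;
        126) echo "command found but not executable" ;;
        127) echo "command not found" ;;
        203) echo "systemd could not execute the ExecStart binary (203/EXEC)" ;;
        *)
            # Shells report death by signal N as 128 + N
            if [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
                echo "terminated by signal $((code - 128))"
            else
                echo "application-defined failure, check the journal"
            fi
            ;;
    esac
}

# Example usage:
#   describe_exit_status 137
```

Anything that falls through to the last branch, including status 1, only has meaning in the application's own documentation or logs.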
Identifying OOM Kills
If systemctl status shows failed (Result: oom-kill), it means the Linux Out-Of-Memory (OOM) killer terminated the service's process because the system, or the service's own control group, ran out of memory.
To confirm this, you can often find related messages in journalctl or dmesg:
dmesg | grep -i oom
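A slightly broader filter catches the common kernel phrasings in one pass and works on either dmesg or `journalctl -k` output. The patterns below are an assumption based on typical kernel messages and may need adjusting for your kernel version; `filter_oom_lines` is a hypothetical helper name.

```shell
# Keep only log lines that look like OOM-killer activity.
# Reads log text on stdin, so it works with dmesg or journalctl -k output.
filter_oom_lines() {
    # "|| true" keeps a no-match result from looking like a failure
    grep -Ei 'out of memory|oom-kill|oom_reaper' || true
}

# Example usage:
#   sudo dmesg | filter_oom_lines
#   sudo journalctl -k -b --no-pager | filter_oom_lines
```

The journal-based form has the advantage of surviving a full kernel ring buffer, as long as the journal itself is persistent.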
Troubleshooting OOM Errors
- Increase system RAM: If possible.
- Reduce memory usage: Optimize the service or other running processes.
- Configure Swap: Ensure adequate swap space is available.
- Check service memory limits: A MemoryMax= setting can cause a service-specific OOM even when the host still has free memory.
- Review recent deploys: Memory failures often follow a configuration change, traffic change, or version change.
Step 5: Check the Unit File Systemd Is Actually Using
Do not assume the file in your editor is the complete unit. Packages, drop-ins, and overrides can combine into the final definition:
systemctl cat <service_name>.service
systemctl show <service_name>.service -p FragmentPath -p DropInPaths
This catches a common problem: someone edited /usr/lib/systemd/system/app.service, while an override in /etc/systemd/system/app.service.d/override.conf still changes Environment= or ExecStart=.
After editing unit files or drop-ins, reload systemd:
sudo systemctl daemon-reload
If you forget this step, systemctl restart may keep using the old unit definition.
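Systemd tracks whether a unit's files changed on disk since they were loaded, and exposes that as the NeedDaemonReload property. A small check, assuming `--value` is available and using the hypothetical helper name `unit_needs_reload`:

```shell
# Succeeds when systemd reports that the unit's files changed on disk
# after they were last loaded.
unit_needs_reload() {
    [ "$(systemctl show "$1" -p NeedDaemonReload --value)" = "yes" ]
}

# Example usage:
#   if unit_needs_reload myapp.service; then
#       sudo systemctl daemon-reload
#   fi
```

This is handy in deploy scripts, where a restart against a stale definition can look like a failed fix.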
Step 6: Common Service-Specific Issues and Fixes
While the above steps are general, specific services have common failure modes.
Web Servers (Nginx, Apache)
- Port already in use: As seen in the example, another process might be listening on port 80 or 443. Use sudo ss -tulnp | grep :80 to find the offending process.
- Configuration syntax errors: Run the web server's configuration test (e.g., sudo nginx -t or sudo apachectl configtest).
- Missing SSL certificates: Ensure certificate files are present and readable.
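One caveat with the grep :80 filter is that it also matches ports such as 8080 or 8000. A stricter match on the local address column avoids that. The sketch assumes the column layout of `ss -tulnp`, where the fifth field is Local Address:Port; `listeners_on_port` is a hypothetical helper name.

```shell
# Print ss rows whose local address ends in exactly :PORT.
# Assumes `ss -tulnp` output on stdin (field 5 = Local Address:Port).
listeners_on_port() {
    awk -v port="$1" '$5 ~ (":" port "$")'
}

# Example usage:
#   sudo ss -tulnp | listeners_on_port 80
```

The last column of each matching row names the process and PID holding the port, which is usually all you need to decide whether to stop it or move your service.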
Databases (MySQL, PostgreSQL)
- Data directory permissions: Ensure the database user has correct read/write access to its data directory.
- Corrupted data files: May require restoring from backup or using database-specific recovery tools.
- Disk space full: Databases can consume significant disk space.
Networking Services
- Incorrect IP addresses or hostnames: Verify network configuration.
- Firewall rules: Ensure necessary ports are open.
- DNS resolution issues: Check /etc/resolv.conf and network connectivity.
Step 7: Advanced Troubleshooting Techniques
Re-enabling and Restarting the Service
After making changes, reload units if needed, then restart the service. You do not need to run enable every time unless you are changing boot behavior.
sudo systemctl daemon-reload # Reload systemd manager configuration
sudo systemctl restart <service_name>.service
Using systemctl --failed
This command lists all units that are currently in a failed state.
systemctl --failed
Checking Resource Limits (ulimit)
Some services may fail if they hit operating system-level resource limits. Check limits with ulimit -a as the user the service runs as, or check systemd's own resource control directives in the unit file.
For systemd-managed services, unit properties are often more relevant than an interactive shell's ulimit:
systemctl show <service_name>.service -p LimitNOFILE -p User -p Group -p MemoryMax -p TasksMax
If an application says too many open files, compare LimitNOFILE with the application's connection count and file usage. If a service cannot create threads or child processes, look at TasksMax.
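To compare LimitNOFILE with real usage, you can count the descriptors the main process currently holds. This sketch is Linux-specific because it reads /proc, and it generally needs root to inspect another user's process; `open_fd_count` is a hypothetical helper name.

```shell
# Count open file descriptors for a PID by listing /proc/PID/fd (Linux only).
# Returns non-zero if the process does not exist or /proc is unavailable.
open_fd_count() {
    [ -d "/proc/$1/fd" ] || return 1
    ls "/proc/$1/fd" | grep -c ''
}

# Example usage (as root):
#   pid=$(systemctl show myapp.service -p MainPID --value)
#   open_fd_count "$pid"
#   systemctl show myapp.service -p LimitNOFILE --value
```

If the count sits close to the limit while the service is healthy, the limit is too tight for peak load; if it climbs steadily over hours, suspect a descriptor leak instead.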
Debugging Flags
Many applications have debug modes or verbose logging that can be enabled via command-line arguments in the ExecStart line of the .service file. Consult the application's documentation.
A Quick Example: Service Works Manually, Fails at Boot
This is one of the most common systemd complaints. A developer runs the command by hand and it works. The same command fails as a service. The usual difference is environment.
Check the service user and working directory:
systemctl show myapp.service -p User -p Group -p WorkingDirectory
systemctl cat myapp.service
Then look for assumptions in the app: relative paths, files in a home directory, environment variables from .bashrc, or credentials loaded by an interactive shell. systemd does not read your shell startup files for a service. If the app needs APP_ENV=production or DATABASE_URL=..., put that configuration in the unit with Environment=, an EnvironmentFile=, or your normal secret-management path.
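A quick way to flush out hidden environment assumptions is to run the command with almost nothing set. The sketch below is only an approximation of what a service sees: it clears the environment with `env -i` and keeps a minimal PATH, which is an assumed value rather than systemd's exact default. `run_with_clean_env` is a hypothetical helper name.

```shell
# Run a command with an empty environment apart from a minimal PATH.
# This approximates, but does not reproduce, a systemd service environment.
run_with_clean_env() {
    env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin "$@"
}

# Example usage:
#   run_with_clean_env /opt/myapp/bin/myapp --check-config
#   run_with_clean_env sh -c 'echo "${DATABASE_URL:-DATABASE_URL is unset}"'
```

If the application fails here with the same error it shows in the journal, the missing piece is almost certainly a variable your interactive shell was supplying.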
Boot-only failures can also be ordering problems. A service may start before DNS, the network address, or a mounted filesystem is ready. Do not fix that with a blind sleep in the application. Express the dependency in the unit:
Wants=network-online.target
After=network-online.target
RequiresMountsFor=/srv/myapp
RequiresMountsFor= is useful when the service needs a specific path, especially if that path comes from a separate disk or network mount. It is clearer than hoping a broad target happens to finish first.
Resetting Failed State
After a service fails, systemd remembers the failed state until it is reset or the unit succeeds. That is helpful for visibility, but it can confuse status checks after you have already fixed the issue:
sudo systemctl reset-failed myapp.service
sudo systemctl restart myapp.service
systemctl status myapp.service
Use reset-failed after you have captured the evidence you need. During an incident, the failed state and journal timestamps are useful breadcrumbs.
One more small habit helps after noisy failures: check whether the unit is restart-looping before you edit anything.
systemctl show myapp.service -p NRestarts -p RestartUSec
If the restart count is climbing quickly, stop the unit while you investigate. That protects dependencies from repeated bad connections and keeps the journal readable.
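To judge whether the count is actually climbing, sample it twice and look at the difference. A minimal sketch, assuming the unit is a service that exposes NRestarts; `restart_delta` is a hypothetical helper and the 10 second default interval is arbitrary.

```shell
# Print how many times a unit restarted during a short sampling interval.
restart_delta() {
    unit=$1
    interval=${2:-10}   # seconds between samples (assumed default)
    before=$(systemctl show "$unit" -p NRestarts --value)
    sleep "$interval"
    after=$(systemctl show "$unit" -p NRestarts --value)
    # Treat a missing value as 0 so non-service units do not break the math
    echo $(( ${after:-0} - ${before:-0} ))
}

# Example usage:
#   restart_delta myapp.service 30
```

A non-zero result over a short interval is a good reason to stop the unit before changing anything else.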
The reliable pattern is: read status, read the journal, inspect the effective unit with systemctl cat, verify dependencies and paths, then restart only after you know what changed. That keeps systemd troubleshooting boring, which is exactly what you want during an outage.