Troubleshooting Common Systemd Service Failures Effectively
Diagnose common systemd service failures with systemctl, journalctl, exit codes, unit checks, and practical repair steps.
Most systemd service failures are not mysterious once you separate three questions: did systemd read the unit file, did it manage to execute the command, and did the application stay healthy after it started? Those are different failure points, and they leave different clues.
The mistake I see most often is jumping straight into editing the unit file. First read the status and the logs. A failed service usually tells you whether it hit a missing executable, a bad user, a permission problem, a dependency ordering issue, or an application crash. The exact wording matters.
The Essential Diagnostic Toolkit
Effective troubleshooting relies on two primary systemd tools that provide immediate feedback on service state and operational logs.
1. Checking the Service Status
The systemctl status command provides an immediate snapshot of the unit's condition, including its current state, recent logs, and critical metadata like the process ID (PID) and exit code.
$ systemctl status myapp.service
Key information to look for:
- Load: Confirms the unit file was read correctly. loaded is good. If it shows not found, your service file is in the wrong location or misspelled.
- Active: This is the core status. If it reads failed, the service attempted to start and exited unexpectedly.
- Exit Code: This numerical code, often displayed alongside Active: failed, is vital. It indicates why the process terminated (e.g., 0 for clean exit, 1 or 2 for general application errors, 203 for execution path errors).
- Recent Logs: Systemd often includes the last few lines of log output from the service, which may instantly reveal the error.
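As a rough aid for reading that exit code, the mapping used in this article can be written as a small shell function. This is a minimal sketch: explain_exit_status is a hypothetical helper name, not a systemd command, and it only covers the statuses this article discusses.

```shell
# explain_exit_status is a hypothetical helper, not part of systemd.
# It maps the exit statuses discussed in this article to a first place to look.
explain_exit_status() {
    case "$1" in
        0)   echo "clean exit" ;;
        1|2) echo "application error: read the service's journal output" ;;
        203) echo "203/EXEC: check the ExecStart path, execute bit, and shebang" ;;
        217) echo "217/USER: check the User= and Group= directives" ;;
        *)   echo "unrecognized status $1: read the journal" ;;
    esac
}

# Example, reading the status of the last main process from systemd:
#   explain_exit_status "$(systemctl show -p ExecMainStatus --value myapp.service)"
```

The function is deliberately a lookup table and nothing more. It points you at a section of the journal to read; it does not replace reading it.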
2. Deep Dive into Logs with Journalctl
While systemctl status gives a summary, journalctl provides the full context of the service's execution history, including standard output and standard error streams.
Use the following command to view the journal for your failing service. The -u flag selects the unit, -x adds explanatory text where available, and -e jumps to the end (most recent entries):
$ journalctl -xeu myapp.service
Tip: If the failure happened hours or days ago, use the time filtering options, such as
journalctl -u myapp.service --since "2 hours ago".
Step-by-Step Diagnosis of Common Failures
Systemd failures typically fall into a few predictable categories. By examining the status and the logs, you can quickly categorize the issue and apply the appropriate solution.
Failure Type 1: Execution Errors (Exit Code 203)
An exit code of 203/EXEC means systemd could not execute the file specified in the ExecStart directive. This is one of the most common configuration mistakes.
Causes and Solutions:
Incorrect Path: The path to the executable is wrong or not absolute.
- Solution: Always use the full, absolute path in ExecStart. Ensure the executable exists at that exact location.
# INCORRECT
ExecStart=myapp
# CORRECT
ExecStart=/usr/local/bin/myapp
Missing Permissions: The file lacks execute permission for the user running the service.
- Solution: Check and apply execute permissions: chmod +x /path/to/executable.
Missing Interpreter (Shebang): If ExecStart points to a script (e.g., Python or Bash), the shebang line (#!/usr/bin/env python) might be missing or incorrect, preventing execution.
- Solution: Verify the script has a valid shebang line.
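The three checks above can be bundled into one pre-flight function. This is a sketch under assumptions: check_exec_path is a hypothetical helper name, and it tests permissions for whoever runs it, so run it as the service user to get a meaningful answer.

```shell
# check_exec_path is a hypothetical helper that mirrors the three causes above.
# It prints one diagnosis line and returns non-zero on the first problem found.
# Permission tests apply to the current user, not to the unit's User=.
check_exec_path() {
    local path="$1" line interp
    case "$path" in
        /*) ;;
        *)  echo "not absolute: $path"; return 1 ;;
    esac
    [ -f "$path" ] || { echo "missing: $path"; return 1; }
    [ -x "$path" ] || { echo "not executable: $path"; return 1; }
    # For scripts, confirm that the interpreter named on the shebang line exists.
    if [ "$(head -c 2 "$path")" = "#!" ]; then
        line="$(head -n 1 "$path")"
        interp="$(printf '%s\n' "$line" | sed -e 's/^#![[:space:]]*//' -e 's/[[:space:]].*$//')"
        [ -x "$interp" ] || { echo "bad interpreter: $interp"; return 1; }
        echo "ok: $path"
    else
        # A compiled binary has no shebang; a script without one will not execute.
        echo "ok if binary: $path has no shebang line"
    fi
}

# Example: check_exec_path /usr/local/bin/myapp
```

The shebang check only looks at the first word of the line. For a line such as #!/usr/bin/env python it confirms that /usr/bin/env exists, not that python can be found on the service's PATH.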
Failure Type 2: Application Crashes (Exit Code 1 or 2)
If the service is starting successfully (systemd finds the executable) but then immediately enters the failed state with a generic application error code (usually 1 or 2), the problem lies within the application logic or environment.
Causes and Solutions:
Configuration File Errors: The application could not read its required configuration file, or the file contains invalid syntax.
- Solution: Review the journalctl output carefully. The application usually prints a specific error message about the configuration file path or syntax. Use the WorkingDirectory= directive if configuration files are relative.
Resource Contention/Access Denied: The application failed to open a necessary port, access a database, or write to a log file due to permission restrictions.
- Solution: Verify the User= directive in the service file and ensure that user has R/W access to all necessary resources and directories.
Failure Type 3: Dependency Failures
The service might fail because it starts before a required dependency is ready, such as a database, network interface, or mounted filesystem.
Causes and Solutions:
Network Not Ready: Services that require network connectivity (e.g., web servers, proxies) often fail if they start before the network stack is initialized.
- Solution: If the service needs an address or route during startup, add the network-online.target ordering and make sure your distribution's wait-online service is enabled for your network manager:
[Unit]
Description=My Web Service
After=network-online.target
Wants=network-online.target
Filesystem Not Mounted: The service attempts to access files on a volume that hasn't been mounted yet (especially critical for secondary storage or network mounts).
- Solution: Use RequiresMountsFor= to explicitly tell systemd which path must be available before starting.
[Unit]
RequiresMountsFor=/mnt/data/storage
Failure Type 4: User and Environment Issues (Exit Code 217)
Exit code 217/USER means systemd could not switch to the user or group configured for the service. Missing environment variables do not produce 217; they usually show up as an application error (exit code 1 or 2). They are covered here because the root cause is the same: the service does not run inside your login session.
Causes and Solutions:
Invalid User/Group: The user specified in the User= or Group= directive does not exist on the system.
- Solution: Verify the username exists via id <username>.
Missing Environment Variables: Systemd services do not inherit your shell's environment. They start with a minimal default environment, so a customized PATH or exported values such as API keys are not available unless you set them.
- Solution: Define necessary variables directly in the service file or via an environment file.
[Service]
# Direct definition
Environment="API_KEY=ABCDEFG"
# Using an external file (e.g., /etc/sysconfig/myapp)
EnvironmentFile=/etc/sysconfig/myapp
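An environment file is not a shell script, and a line copied from a shell profile (for example one starting with export) is, as far as I know, not read as a plain assignment. The following sketch flags such lines before you point EnvironmentFile= at the file. lint_env_file is a hypothetical helper name, and the pattern is a deliberately strict assumption about what a safe line looks like.

```shell
# lint_env_file is a hypothetical helper. It prints, with line numbers, every
# line that is not blank, not a comment (# or ;), and not a plain NAME=value
# assignment. It returns 1 if any such line exists, 2 if the file is unreadable.
lint_env_file() {
    [ -r "$1" ] || { echo "cannot read: $1" >&2; return 2; }
    if grep -n -v -E '^[[:space:]]*([#;].*)?$|^[A-Za-z_][A-Za-z0-9_]*=' "$1"; then
        return 1   # at least one suspicious line was printed
    fi
    return 0
}

# Example: lint_env_file /etc/sysconfig/myapp
```

A clean result does not prove the values are right. It only rules out the syntax slip of treating the file as shell.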
Troubleshooting Workflow and Best Practices
When modifying a service file, always follow this three-step cycle to ensure your changes are picked up and tested correctly.
1. Validate Configuration Syntax
Use systemd-analyze verify to check the service unit file before attempting to start it. This catches simple syntax errors.
$ systemd-analyze verify /etc/systemd/system/myapp.service
2. Reload the Daemon
Systemd caches configuration files. After any change to a unit file, you must tell systemd to reload its configuration.
$ systemctl daemon-reload
3. Restart and Check Status
Attempt to restart the service and immediately check its status and logs.
$ systemctl restart myapp.service
$ systemctl status myapp.service
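The cycle can be wrapped so that a failed step stops the sequence instead of restarting a service with a broken unit file. This is a minimal sketch: apply_unit_change is a hypothetical name, it must run as root, and it assumes systemd-analyze verify exits non-zero when it reports a problem, which may depend on your systemd version.

```shell
# apply_unit_change is a hypothetical wrapper around the three steps above.
# Run it as root. It stops at the first step that fails.
apply_unit_change() {
    local unit_file="$1" unit
    unit="$(basename "$unit_file")"
    # Step 1: validate the unit file (assumes non-zero exit on errors).
    systemd-analyze verify "$unit_file" || return 1
    # Step 2: make systemd re-read unit files.
    systemctl daemon-reload || return 1
    # Step 3: restart, then show status whether or not the restart worked.
    systemctl restart "$unit" || {
        systemctl status "$unit" --no-pager
        return 1
    }
    systemctl status "$unit" --no-pager
}

# Example: apply_unit_change /etc/systemd/system/myapp.service
```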
Handling Immediate Restarts and Timeouts
If your service enters a restarting loop or immediately fails without an obvious log message, consider adjusting these directives in the [Service] section:
| Directive | Purpose | Best Practice |
|---|---|---|
| Type= | How systemd manages the process (e.g., simple, forking). | Use simple unless the application explicitly daemonizes. |
| TimeoutStartSec= | How long systemd waits for the main process to signal success. | Increase this value if the application has a lengthy startup (e.g., large database initialization). |
| Restart= | Defines when the service should be automatically restarted (e.g., always, on-failure). | Use on-failure for production applications so crashes are restarted but clean exits are not. A repeated configuration error still triggers restarts until the start rate limit stops them. |
Reading Failure States More Carefully
failed is not the only bad state. A unit can be inactive (dead) after a clean exit, which is normal for Type=oneshot jobs but suspicious for a daemon you expected to keep running. A unit can be activating until TimeoutStartSec= expires. A unit can be active (exited) when the command finished and systemd believes that is acceptable. Before changing restart policy, make sure the service type matches the program.
For a normal foreground process, start with:
[Service]
Type=simple
ExecStart=/usr/local/bin/myapp
For a script that runs once and exits:
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/rotate-reports
For older daemons that fork themselves into the background, Type=forking may be needed, but do not use it by habit. Many modern applications already stay in the foreground when run under systemd. If you tell systemd to expect forking and the process does not fork the way systemd expects, you can get misleading startup failures.
A Triage Checklist That Works Under Pressure
When a service is down and people are waiting, use a fixed sequence:
systemctl status myapp.service --no-pager
journalctl -u myapp.service -b --no-pager
systemctl cat myapp.service
systemctl show myapp.service -p FragmentPath -p User -p Group -p WorkingDirectory -p ExecStart
Look for the first real error, not the last line. The final journal entry may only say that systemd marked the unit failed. The useful line is often above it: Permission denied, No such file or directory, Address already in use, Failed at step USER, or an application-specific exception.
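To find that first useful line quickly, the journal output can be piped through a filter for the phrases listed above. A minimal sketch: first_error_line is a hypothetical helper name, and the pattern list covers only the phrases named in this section, so an application-specific exception will not match unless you add it.

```shell
# first_error_line is a hypothetical helper. It reads journal text on stdin
# and prints the first line that contains one of the error phrases named above.
first_error_line() {
    grep -m 1 -E 'Permission denied|No such file or directory|Address already in use|Failed at step [A-Z_]+'
}

# Example:
#   journalctl -u myapp.service -b --no-pager | first_error_line
```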
If the service was recently edited, check for syntax and reload state:
sudo systemd-analyze verify /etc/systemd/system/myapp.service
sudo systemctl daemon-reload
If systemctl status says the unit file changed on disk, systemd is warning you that the manager has not reloaded the new definition. Restarting the service before daemon-reload may keep using stale settings.
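The same condition can be tested in a script through the NeedDaemonReload unit property. A small sketch, assuming a systemctl recent enough to support --value; needs_daemon_reload is a hypothetical helper name.

```shell
# needs_daemon_reload is a hypothetical helper. NeedDaemonReload is the unit
# property behind the "changed on disk" warning; it reads "yes" or "no".
needs_daemon_reload() {
    [ "$(systemctl show -p NeedDaemonReload --value "$1")" = "yes" ]
}

# Example:
#   if needs_daemon_reload myapp.service; then
#       sudo systemctl daemon-reload
#   fi
```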
Permission Problems That Do Not Look Like Permission Problems
A service can run perfectly from your shell and fail under systemd because it is not running as you. Check the User=, Group=, WorkingDirectory=, and any hardening options such as ProtectSystem=, ReadWritePaths=, PrivateTmp=, or NoNewPrivileges=.
For example:
[Service]
User=webapp
WorkingDirectory=/srv/webapp
ExecStart=/srv/webapp/bin/server
ReadWritePaths=/srv/webapp/var
ProtectSystem=strict
With ProtectSystem=strict, most of the filesystem is read-only to the service. That is a good hardening setting, but it means the application must write only to paths you explicitly allow. If the journal says the app cannot create a PID file, cache file, SQLite database, or upload directory, the unit's sandboxing may be the reason.
Also check parent directory permissions. The executable may be mode 755, but if /srv/webapp is not searchable by the service user, systemd will still fail to execute it. Use:
namei -l /srv/webapp/bin/server
sudo -u webapp /srv/webapp/bin/server --check-config
Running a safe config check as the service user catches a lot of issues without starting the full daemon.
Restart Loops and Rate Limits
Restart=on-failure is useful, but it can hide the original error in a flood of repeated starts. Systemd also applies start rate limiting. When a service fails too many times in a short window, you may see start-limit-hit.
Useful commands:
systemctl status myapp.service
sudo systemctl reset-failed myapp.service
sudo systemctl start myapp.service
reset-failed does not fix the cause. It only clears systemd's failed state and rate-limit memory so you can test again after making a change. If you keep needing it, slow down and fix the first failure in the journal.
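Before reaching for reset-failed, it helps to confirm that the rate limit is what you are looking at. The sketch below reads two unit properties, Result and NRestarts. restart_loop_report is a hypothetical helper name, and NRestarts is assumed to be available, which is the case on reasonably recent systemd versions.

```shell
# restart_loop_report is a hypothetical helper. It prints the unit's last
# result and automatic restart count, and returns non-zero when the start
# rate limit has been hit.
restart_loop_report() {
    local unit="$1" result restarts
    result="$(systemctl show -p Result --value "$unit")"
    restarts="$(systemctl show -p NRestarts --value "$unit")"
    echo "result=$result restarts=$restarts"
    [ "$result" != "start-limit-hit" ]
}

# Example: restart_loop_report myapp.service
```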
Debugging Persistent Issues
If standard logs don't reveal the issue, the application might be redirecting its output.
- Review StandardOutput and StandardError: By default, these are directed to the journal. If they are set to /dev/null or a file, you must check those locations directly for error messages.
- Temporary Verbosity: If possible, temporarily configure the application (or its command line arguments in ExecStart) to run with maximum verbosity (e.g., --debug or -v) to generate more detailed log output when failing.
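One quick way to see whether output has been redirected is to filter the unit text for those two directives. A minimal sketch: output_overrides is a hypothetical helper name, and it only sees settings written in the unit file and its drop-ins, not manager-wide defaults.

```shell
# output_overrides is a hypothetical helper. It reads unit file text on stdin
# and prints any StandardOutput= or StandardError= lines it finds.
# No output means the unit leaves both at their defaults.
output_overrides() {
    grep -E '^[[:space:]]*Standard(Output|Error)='
}

# Example:
#   systemctl cat myapp.service | output_overrides
```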
A Sensible Stopping Point
Once the service starts, check one more thing: whether it does real work. systemctl status can only tell you the process state from systemd's point of view. A web service can be active while returning 500s. A worker can be active while failing every job. After fixing the unit-level problem, run the application's own health check, look at its application logs, and confirm the dependency it talks to is reachable.
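That final check can be scripted as a short retry loop around whatever health command your application offers. This is a sketch, not a prescription: wait_until_healthy is a hypothetical helper name, and the URL in the example is an assumption about your application.

```shell
# wait_until_healthy is a hypothetical helper. It runs a health-check command
# up to ATTEMPTS times, sleeping DELAY seconds between tries, and returns 0
# as soon as the command succeeds.
wait_until_healthy() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        # Sleep only if another attempt is coming.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Example (the endpoint is an assumption; use your application's own check):
#   wait_until_healthy 5 2 curl -fsS http://127.0.0.1:8080/healthz
```

Retrying matters because a service can report active a moment before it is ready to answer requests.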
For most incidents, the useful path is short: systemctl status, then journalctl -u, then inspect the unit with systemctl cat, then test the command as the configured service user. That keeps you close to evidence and away from random unit-file changes.
Write down the final cause in the service's runbook or deployment notes while it is still fresh. "Fixed systemd" is not useful later. "Service failed with 203/EXEC because the deploy created /opt/app/current/bin/server without execute permission" is useful. The next incident will usually rhyme with the last one.