Mastering OOM Policy: Tuning Systemd's Response to Out-of-Memory Events

Out-of-memory failures rarely happen at a convenient time. A batch import gets a larger file than usual, a service leaks memory overnight, a backup overlaps with a traffic spike, or a deploy doubles the number of worker processes. When Linux cannot free enough memory for an allocation, the kernel may invoke the OOM killer and terminate a process so the machine can keep running.

The uncomfortable part is that the default victim may not be the service you would have chosen. On a shared host, you might prefer a retryable queue worker to die before the main API. On a database server, you may want SSH and monitoring to stay alive so you can recover the machine. Systemd gives you two knobs for that kind of decision: OOMScoreAdjust= and OOMPolicy=.

OOMScoreAdjust= influences which process is selected. OOMPolicy= controls what systemd does after a process in the service has been killed. They solve different problems, and mixing them up leads to bad runbooks.

What the Kernel Is Scoring

Every Linux process has an OOM score, visible at /proc/<pid>/oom_score. A higher score means the process is a more likely OOM victim. The kernel derives that score from memory use and other context, then applies the adjustment value from /proc/<pid>/oom_score_adj.

Systemd's OOMScoreAdjust= writes that adjustment for processes it starts. The range is -1000 to 1000.

-1000 gives the strongest protection and effectively disables OOM killing for that process.
Negative values make the process less likely to be killed.
Positive values make the process more likely to be killed.
0 leaves the adjustment neutral.

The safest approach is usually not "protect everything important." If every service is protected, the kernel has fewer useful choices when the host is already short on memory. Protect a small number of services and make disposable work easier to kill.
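
To see which processes are currently most exposed, the scores can be read straight from /proc. The following is a minimal sketch, not a standard tool: the function name top_oom_candidates is hypothetical, and it assumes a Linux /proc layout with readable oom_score, oom_score_adj, and comm files.

```shell
# Print the N processes with the highest OOM score (default 10).
# Columns: oom_score, oom_score_adj, PID, command name.
top_oom_candidates() {
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        # Processes can exit while the loop runs, so skip unreadable entries.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || continue
        printf '%s\t%s\t%s\t%s\n' "$score" "$adj" "$pid" "$name"
    done | sort -rn | head -n "${1:-10}"
}

top_oom_candidates 5
```

The ranking is only a snapshot. Scores move with memory use, so a process near the top now may not be the one chosen during a real event.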

For a primary API service, a moderate adjustment is often enough:

[Service]
OOMScoreAdjust=-300

For a queue worker that can retry jobs:

[Service]
OOMScoreAdjust=500

That worker may die first during memory pressure, but that is the point. A failed job can go back to the queue. A dead database or unreachable host is a larger incident.

What OOMPolicy= Actually Does

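OOMPolicy= does not mark a unit as "critical," and it does not choose the first process to kill. The supported values are continue, stop, and kill. If a unit does not set it, the value comes from DefaultOOMPolicy= in the manager configuration, which defaults to stop (units with Delegate= enabled default to continue).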

continue: systemd logs the OOM event and leaves the unit running if any processes remain.
stop: systemd logs the event and stops the unit cleanly.
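kill: if one process in the unit is OOM-killed, the kernel kills the remaining processes in that unit as a group. systemd arranges this by setting memory.oom.group on the unit's cgroup, so it depends on cgroup v2 and a kernel that supports that attribute.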

Use this setting to avoid half-alive services. If a multi-process web service loses a worker and keeps accepting traffic in a broken state, continue can hide the failure. OOMPolicy=kill makes the failure obvious and lets Restart=on-failure bring the service back in a clean state.

[Service]
OOMPolicy=kill
Restart=on-failure
RestartSec=5s

For a batch job with helper processes, stop may be less abrupt for the remaining processes:

[Service]
OOMPolicy=stop

The process chosen by the kernel is already gone. stop only affects what systemd does to the rest of the service, so do not rely on it as a graceful save point. Long-running jobs should checkpoint their own work.
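
For a job that is not restarted automatically, one way to tell an OOM stop from an ordinary failure is the unit's Result property, which systemd sets to oom-kill in that case. The sketch below wraps that check. The function name unit_oom_result is hypothetical, and the --value flag assumes a reasonably recent systemctl.

```shell
# Report whether a unit's most recent run ended because of an OOM kill.
# Returns 0 for oom-kill, 1 for any other result, 2 if systemctl fails.
unit_oom_result() {
    unit=$1
    result=$(systemctl show "$unit" -p Result --value) || return 2
    if [ "$result" = "oom-kill" ]; then
        echo "$unit: last result was oom-kill"
        return 0
    fi
    echo "$unit: last result was ${result:-unknown}"
    return 1
}

# usage: unit_oom_result nightly-import.service   (unit name is illustrative)
```

Result describes the most recent run only. Once a unit has been started again it may already read success, so the journal remains the durable record.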

A Practical Tuning Pattern

Start by sorting services into three groups.

First, identify services that keep the host recoverable: SSH, networking, monitoring, and the primary workload. Give only the most important ones modest negative adjustments.

Second, identify services that can be retried: workers, importers, report generators, image processors, cache warmers, and development helpers. Give those positive adjustments.

Third, decide whether each service can safely keep running after one process is killed. If not, use OOMPolicy=kill and a restart policy.

A realistic worker override might look like this:

# /etc/systemd/system/image-worker.service.d/oom.conf
[Service]
OOMScoreAdjust=500
OOMPolicy=kill
Restart=on-failure
RestartSec=10s

A primary application service might look like this:

# /etc/systemd/system/api.service.d/oom.conf
[Service]
OOMScoreAdjust=-300
OOMPolicy=kill
Restart=on-failure
RestartSec=5s
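
The two overrides above share one shape, so a small helper can generate them consistently. This is a sketch under stated assumptions: write_oom_dropin is a hypothetical name, it writes only the two OOM settings (restart settings are left to you), and the optional fourth argument lets you preview the output in a scratch directory instead of /etc/systemd/system.

```shell
# Write an OOM drop-in for one unit and print the path of the file.
# usage: write_oom_dropin UNIT ADJUST POLICY [ROOT]
write_oom_dropin() {
    unit=$1 adjust=$2 policy=$3 root=${4:-/etc/systemd/system}

    # Accept only the three documented policies.
    case "$policy" in
        continue|stop|kill) ;;
        *) echo "invalid OOMPolicy: $policy" >&2; return 1 ;;
    esac

    # Accept only an integer (optional leading minus) in -1000..1000.
    digits=${adjust#-}
    case "$digits" in
        ''|*[!0-9]*) echo "invalid OOMScoreAdjust: $adjust" >&2; return 1 ;;
    esac
    if [ "$adjust" -lt -1000 ] || [ "$adjust" -gt 1000 ]; then
        echo "OOMScoreAdjust out of range: $adjust" >&2
        return 1
    fi

    dir="$root/$unit.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nOOMScoreAdjust=%s\nOOMPolicy=%s\n' "$adjust" "$policy" > "$dir/oom.conf"
    echo "$dir/oom.conf"
}

# usage: write_oom_dropin image-worker.service 500 kill /tmp/preview
```

Files written this way bypass systemctl edit, so run systemctl daemon-reload and restart the unit before expecting the values to apply.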

I would avoid OOMScoreAdjust=-1000 unless you have tested the failure mode. If that protected service is the one leaking memory, the machine still needs a way to recover.

Applying and Verifying the Change

Use drop-ins instead of editing packaged unit files:

sudo systemctl edit api.service

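systemctl edit reloads the manager configuration when you save, so a separate daemon-reload is only needed if you create or change drop-in files by hand. Either way, restart the service so the new settings apply to its processes: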

sudo systemctl daemon-reload
sudo systemctl restart api.service

Check the merged unit and the values systemd sees:

systemctl cat api.service
systemctl show api.service -p OOMPolicy -p OOMScoreAdjust

Then inspect the running process:

PID=$(systemctl show api.service -p MainPID --value)
cat /proc/$PID/oom_score_adj
cat /proc/$PID/oom_score

oom_score_adj should match your configured adjustment, unless the daemon changes its own value after startup, as some do. oom_score can change as the process uses more or less memory.
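
The same comparison can be scripted so it runs after every deploy. This is a minimal sketch: check_oom_adj is a hypothetical helper, and it only inspects the unit's main PID, not every process in the unit.

```shell
# Compare the adjustment systemd reports for a unit with the value on its
# main process. Returns 0 on match, 1 on mismatch, 2 if the check cannot run.
check_oom_adj() {
    unit=$1
    want=$(systemctl show "$unit" -p OOMScoreAdjust --value) || return 2
    pid=$(systemctl show "$unit" -p MainPID --value) || return 2
    if [ -z "$pid" ] || [ "$pid" = "0" ]; then
        echo "$unit: no main process running" >&2
        return 2
    fi
    have=$(cat "/proc/$pid/oom_score_adj") || return 2
    if [ "$want" = "$have" ]; then
        echo "$unit: ok ($have)"
        return 0
    fi
    echo "$unit: configured $want but PID $pid has $have" >&2
    return 1
}

# usage: check_oom_adj api.service
```

A mismatch is a prompt to investigate rather than proof that the override failed to load.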

After an incident, check both the unit logs and the kernel log:

journalctl -u api.service --since "1 hour ago"
journalctl -k --since "1 hour ago" | grep -i oom

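On systems that use systemd-oomd, a userspace daemon that can kill whole cgroups based on memory pressure and swap use before the kernel OOM killer fires, also check: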

systemctl status systemd-oomd
oomctl

OOM Policy Is Not Capacity Planning

OOM tuning is a last line of defense. You still need memory limits, alerts, and enough headroom for normal spikes. For services with predictable boundaries, consider cgroup memory controls:

[Service]
MemoryHigh=1500M
MemoryMax=2G

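MemoryHigh= is a throttling threshold: above it, the kernel slows the service's processes and reclaims their memory aggressively. MemoryMax= is a hard ceiling; if usage reaches it and cannot be reclaimed, the kernel OOM-kills within that unit's cgroup. Both settings require the unified cgroup hierarchy (cgroup v2), and exact behavior depends on the systemd and kernel versions, but the operational idea is simple: contain one service before it consumes the host.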
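
Once limits are in place, the cgroup itself records how often processes in it were OOM-killed. The sketch below reads the oom_kill counter from a cgroup v2 memory.events file. The function names are hypothetical, and the path assumes cgroup v2 is mounted at /sys/fs/cgroup.

```shell
# Print the oom_kill counter from a cgroup v2 memory.events file.
# Fails if the file has no oom_kill line.
cgroup_oom_kills() {
    awk '$1 == "oom_kill" { print $2; found = 1 } END { if (!found) exit 1 }' "$1"
}

# Locate memory.events for a running unit and print its oom_kill counter.
unit_oom_kills() {
    cg=$(systemctl show "$1" -p ControlGroup --value) || return 1
    [ -n "$cg" ] || return 1    # empty when the unit has no cgroup
    cgroup_oom_kills "/sys/fs/cgroup$cg/memory.events"
}

# usage: unit_oom_kills api.service
```

The counter lives with the cgroup, so expect it to start over when the unit is stopped and its cgroup is removed.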

Swap deserves the same kind of thought. No swap can make short spikes turn into abrupt OOM kills. Too much slow swap can keep the host alive while latency becomes useless. Review OOM policy together with swap, memory limits, restart behavior, and alerts.

Example: One Host, Three Services

Suppose a small production host runs an API, a Redis cache, and a background report worker. The report worker is useful, but it can retry work. Redis improves latency, but the application can still serve some requests by going to the database. The API is the customer-facing service.

A reasonable first pass might be:

# api.service
[Service]
OOMScoreAdjust=-300
OOMPolicy=kill
Restart=on-failure

# redis.service drop-in, if this Redis instance is cache-only
[Service]
OOMScoreAdjust=0
OOMPolicy=kill

# report-worker.service
[Service]
OOMScoreAdjust=600
OOMPolicy=kill
Restart=on-failure

That does not guarantee the worker dies first in every possible case, but it makes your intent clear. If the report worker grows too large, it is an easier target. If the API loses one of its processes, systemd kills the rest and restarts it cleanly. If Redis is only a cache, you may choose not to protect it heavily; if Redis is your primary data store, you would make a different decision.

This is why OOM policy should be tied to service role, not product name. "Redis" is not automatically critical or disposable. "The cache we can rebuild" and "the only copy of session state" are different operational objects.

Testing Without Creating a Disaster

You do not need to crash a production server to learn whether the settings are applied. Start with inspection:

systemctl show report-worker.service -p OOMScoreAdjust -p OOMPolicy
systemctl status report-worker.service

Then check the running process:

PID=$(systemctl show report-worker.service -p MainPID --value)
cat /proc/$PID/oom_score_adj

For deeper testing, use a staging host or a disposable virtual machine with the same systemd version and cgroup mode. Run a controlled memory-pressure tool there, not on a shared production box. The goal is to confirm broad behavior: the worker is easier to kill, the main service does not remain half alive, and restart behavior is visible in the journal.

If you use containers, test in the same shape you deploy. A service running directly under systemd does not behave exactly like a process inside a container with its own memory limit. The kernel may enforce the container limit before the host is globally out of memory. In that case, your container runtime, Kubernetes, or cgroup settings may be the first layer that decides what dies.

Reading the Incident Afterward

After an OOM event, avoid jumping straight to "we need more RAM." Sometimes you do. Sometimes a cache forgot TTLs. Sometimes a deploy changed worker concurrency. Sometimes persistence or backup activity caused copy-on-write memory to spike.

Look for three things:

journalctl -k --since "2026-05-24 01:00" | grep -i oom
journalctl -u api.service --since "2026-05-24 01:00"
systemctl show api.service -p Result -p NRestarts

The kernel log usually tells you which process was killed. The unit log tells you how systemd reacted. Restart counters tell you whether the service recovered cleanly or flapped.
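
Pulling the victims out of a long kernel log is easier with a small filter. This sketch assumes the kernel message contains the phrase "Killed process PID (name)", which is common but can vary between kernel versions; oom_victims is a hypothetical name.

```shell
# Read kernel log lines on stdin and print "PID NAME" for each OOM kill.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# usage: journalctl -k --since "1 hour ago" | oom_victims
```

If the filter prints nothing for an incident you know was an OOM event, read the raw log before concluding anything, since the message wording may differ on your kernel.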

Then compare the killed process with your intended priority. If a protected service died before a disposable worker, check whether the worker was actually running under the unit you tuned, whether the override was loaded, and whether another memory limit fired first. If the chosen victim matches the policy but the incident still hurt users, your service classification may need to change.

Document the Reason, Not Just the Value

OOM settings are easy to forget because they sit quietly in unit drop-ins until a bad day. Leave a short comment in the override or in your infrastructure repository explaining the reason for the adjustment.

[Service]
# Retryable queue worker. Prefer killing this before api.service during host pressure.
OOMScoreAdjust=600
OOMPolicy=kill

That comment saves time during an incident review. Without it, someone may see a positive OOM score and "fix" it back to zero without realizing it was an intentional priority decision.

Also record when you last reviewed the setting. A service can change roles over time. A worker that once handled disposable thumbnails might later process payments, exports, or customer-visible jobs. The OOM policy should follow the current risk, not the service's original purpose.
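
A periodic review can start from a list of overrides that carry no explanation at all. The following sketch flags files that set an OOM option but contain no comment line. The function name is hypothetical, the check is a rough heuristic (any comment counts, even an unrelated one), and it relies on recursive grep as found on typical Linux systems.

```shell
# List files under DIR that set OOMScoreAdjust= or OOMPolicy= but contain
# no comment line. DIR defaults to /etc/systemd/system.
undocumented_oom_overrides() {
    dir=${1:-/etc/systemd/system}
    grep -rl -E '^(OOMScoreAdjust|OOMPolicy)=' "$dir" 2>/dev/null |
    while IFS= read -r file; do
        # systemd treats lines starting with # or ; as comments.
        grep -q -E '^[[:space:]]*[#;]' "$file" || printf '%s\n' "$file"
    done
}

# usage: undocumented_oom_overrides /etc/systemd/system
```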

Common Bad Configurations

One bad configuration is protecting the database, the API, the worker, the cache, the log shipper, and the monitoring agent all at once. That feels careful, but it gives the kernel fewer options. Pick priorities.

Another bad configuration is setting OOMPolicy=continue on a service that cannot tolerate missing child processes. A process manager, web server, or custom daemon may keep the unit active even after part of the workload is gone. If your load balancer only checks whether the port is open, traffic can continue flowing to a degraded service.

A third bad configuration is positive adjustment without retry behavior. If you make a service easy to kill, make sure killing it is acceptable. For a queue worker, that means jobs are acknowledged only after successful processing. For a batch job, that means checkpoints. For a cache warmer, that means the cache can be rebuilt later.

Finally, avoid hiding OOM events with automatic restarts alone. Restarting a leaking service may buy time, but it can also create a loop where memory climbs, the service dies, and users see periodic failures. Add alerts on restart count and memory growth, not just process state.
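
A restart-count check is simple to wire into whatever monitoring you already run. This sketch reads the NRestarts property and exits non-zero above a threshold. check_restarts and the default limit of 3 are assumptions for illustration, and when the counter resets is worth confirming on your systemd version.

```shell
# Exit 1 when a unit has been restarted automatically more than LIMIT times.
# Exit 2 when the counter cannot be read.
check_restarts() {
    unit=$1 limit=${2:-3}
    n=$(systemctl show "$unit" -p NRestarts --value) || return 2
    case "$n" in
        ''|*[!0-9]*) echo "$unit: NRestarts unavailable" >&2; return 2 ;;
    esac
    if [ "$n" -gt "$limit" ]; then
        echo "$unit: $n automatic restarts (limit $limit)"
        return 1
    fi
    echo "$unit: $n automatic restarts"
}

# usage: check_restarts api.service 3
```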

A Short Runbook

When you tune a real server, use a repeatable checklist:

List services required for recovery and user traffic.
List retryable services that can be killed first.
Add positive OOMScoreAdjust values to disposable work.
Add moderate negative values only to the few services that deserve protection.
Use OOMPolicy=kill for services that should not run partially.
Verify the applied values through systemctl show and /proc.
Alert on memory pressure before OOM events happen.
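
For the last item, one possible early signal is the kernel's pressure stall information. The sketch below reads the ten-second "some" average from /proc/pressure/memory and compares it with a threshold. It assumes a kernel built with PSI support; the function names and any threshold you pick are illustrative, not recommendations.

```shell
# Print the "some avg10" value from a PSI file (default /proc/pressure/memory).
memory_pressure_avg10() {
    awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }' "${1:-/proc/pressure/memory}"
}

# Succeed when the avg10 value is above THRESHOLD (a percentage).
# usage: pressure_above THRESHOLD [FILE]
pressure_above() {
    threshold=$1
    value=$(memory_pressure_avg10 "$2") || return 2
    [ -n "$value" ] || return 2
    # Compare in awk because the values are decimals.
    awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v > t) }'
}

# usage: pressure_above 10 && echo "memory pressure is elevated"
```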

The goal is not to make OOM events harmless. The goal is to make them understandable. OOMScoreAdjust= helps choose the victim. OOMPolicy= helps define what happens to the rest of the unit. Together, they give you a more predictable failure order when memory is already exhausted.