I kept finding avahi-daemon pegging the CPU in some of my LXC containers, and I wanted a service policy that behaves like a human would: cap it at 10% CPU, restart it if it is pegged at that cap, and restart it if it keeps averaging above 5%.
Well, turns out systemd already gives us 90% of this, but the documentation for it is squirrelly, and after poking around a bit I found that the remaining 10% is just a tiny watchdog script and a timer.
Setup
First, contain the daemon with CPUQuota by adding a drop-in override (the command opens an editor, and the [Service] lines go into it):
sudo systemctl edit avahi-daemon
[Service]
CPUAccounting=yes
CPUQuota=10%
Restart=on-failure
RestartSec=10s
KillSignal=SIGTERM
TimeoutStopSec=30s
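To check that the override actually landed, it helps to know what number to expect. A minimal sketch of the conversion, where quota_usec_per_sec is a made-up helper name and the commented systemctl calls assume a host where the drop-in above has been saved:

```shell
# Sketch: what CPUQuota=N% means as CPU time per wall-clock second.
# quota_usec_per_sec is an illustrative helper, not a systemd tool.
quota_usec_per_sec() {
  pct=$1
  echo $(( pct * 10000 ))   # 1% of one second is 10,000 microseconds
}

echo "CPUQuota=10% -> $(quota_usec_per_sec 10) us of CPU per second"

# On the host itself, the effective settings can be inspected with:
#   systemctl cat avahi-daemon                          # unit file plus drop-ins
#   systemctl show avahi-daemon -p CPUQuotaPerSecUSec   # quota as a time span
```

As far as I can tell, systemctl reports the quota as a time span, so a 10% cap should show up as 100ms per second.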
Then create a generic watchdog script at /usr/local/sbin/cpu-watch.sh:
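#!/bin/bash
set -euo pipefail
UNIT="$1"
INTERVAL=30 # seconds; keep in sync with OnUnitActiveSec in the timer
# Policy thresholds, in nanoseconds of CPU time per interval
PEGGED_NS=$((INTERVAL * 1000000000 * 9 / 100)) # ~90% of the 10% CPUQuota
SUSTAINED_NS=$((INTERVAL * 1000000000 * 5 / 100)) # 5% CPU
STATE="/run/cpu-watch-${UNIT}.state"
current=$(systemctl show "$UNIT" -p CPUUsageNSec --value)
# Skip this run if there is no numeric reading (e.g. "[not set]")
[[ "$current" =~ ^[0-9]+$ ]] || exit 0
# First run: record a baseline and wait for the next tick
if [[ ! -f "$STATE" ]]; then
  echo "$current" > "$STATE"
  exit 0
fi
previous=$(cat "$STATE")
echo "$current" > "$STATE"
delta=$((current - previous))
# The counter starts over when the unit restarts, so a negative delta
# means everything in $current was used since that restart
if (( delta < 0 )); then
  delta=$current
fi
# Restart if pegged (hitting CPUQuota)
if (( delta >= PEGGED_NS )); then
  logger -t cpu-watch "CPU pegged for $UNIT (${delta}ns), restarting"
  systemctl restart "$UNIT"
  exit 0
fi
# Restart if average usage over the interval exceeded 5%
if (( delta >= SUSTAINED_NS )); then
  logger -t cpu-watch "Sustained CPU abuse for $UNIT (${delta}ns), restarting"
  systemctl restart "$UNIT"
fi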
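The thresholds are easier to reason about as seconds of CPU time per timer interval. A rough worked example, assuming a 30-second interval and CPUQuota=10% (cpu_budget_ns is an illustrative helper, not part of the script):

```shell
# Sketch: CPU-time budgets per timer interval, in nanoseconds.
# The 30 s interval and 10% quota are assumptions taken from this setup.
NS_PER_S=1000000000

# CPU time (ns) equal to pct% of one CPU over interval_s seconds
cpu_budget_ns() {
  interval_s=$1
  pct=$2
  echo $(( interval_s * NS_PER_S * pct / 100 ))
}

quota_ns=$(cpu_budget_ns 30 10)      # the most a unit capped at 10% can use
pegged_ns=$(( quota_ns * 9 / 10 ))   # 90% of that ceiling
sustained_ns=$(cpu_budget_ns 30 5)   # a 5% average over the window

echo "quota=${quota_ns}ns pegged=${pegged_ns}ns sustained=${sustained_ns}ns"
```

In other words, a unit capped at 10% can accumulate at most about 3 s of CPU time per 30 s window, so a "pegged" threshold only makes sense if it sits below that ceiling (2.7 s here) and above the "sustained" one (1.5 s).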
It’s not ideal to have hard-coded thresholds or to hit storage frequently, but in most modern systems /run is a tmpfs or similar, so for a simple watchdog this is acceptable.
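If the hard-coded values start to chafe, one option is to read them from the environment and fall back to the defaults. A minimal sketch, where CPU_WATCH_INTERVAL and CPU_WATCH_SUSTAINED_PCT are hypothetical variable names I made up, not anything systemd defines:

```shell
# Sketch: thresholds from the environment, with this post's values as defaults.
# CPU_WATCH_INTERVAL / CPU_WATCH_SUSTAINED_PCT are hypothetical names.
threshold_ns() {
  interval=${CPU_WATCH_INTERVAL:-30}     # seconds between watchdog runs
  pct=${CPU_WATCH_SUSTAINED_PCT:-5}      # average CPU % that counts as abuse
  echo $(( interval * 1000000000 * pct / 100 ))
}

echo "default: $(threshold_ns)ns"
echo "60s at 8%: $(CPU_WATCH_INTERVAL=60 CPU_WATCH_SUSTAINED_PCT=8 threshold_ns)ns"
```

The variables could then be set per service with an Environment= line in the watchdog's service unit, which keeps the script itself generic.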
The next step is to make it executable and figure out how to use it via systemd templates:
sudo chmod +x /usr/local/sbin/cpu-watch.sh
# cat /etc/systemd/system/cpu-watch@.service
[Unit]
Description=CPU watchdog for %i
After=%i.service
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/cpu-watch.sh %i
# cat /etc/systemd/system/cpu-watch@.timer
[Unit]
Description=Periodic CPU watchdog for %i
[Timer]
OnBootSec=2min
OnUnitActiveSec=30s
AccuracySec=5s
[Install]
WantedBy=timers.target
The trick I learned today was how to enable it with the target service name:
sudo systemctl daemon-reload
sudo systemctl enable --now cpu-watch@avahi-daemon.timer
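The part after the @ is the instance name, which systemd substitutes for %i in the template. A small sketch of how the names line up, assuming the template files are called cpu-watch@.service and cpu-watch@.timer (instance_unit is an illustrative helper, and cups is just a hypothetical second service):

```shell
# Sketch: derive the timer instance name for a given service.
# Assumes template units named cpu-watch@.timer / cpu-watch@.service.
instance_unit() {
  printf 'cpu-watch@%s.timer\n' "$1"
}

# Print (rather than run) the enable command for each watched service
for svc in avahi-daemon cups; do
  echo "sudo systemctl enable --now $(instance_unit "$svc")"
done
```

This only holds for plain service names; names with slashes or other special characters would need escaping first, which is what systemd-escape is for.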
You can check it’s working with:
sudo systemctl list-timers | grep cpu-watch
# this should show the script restart messages, if any:
sudo journalctl -t cpu-watch -f
Why This Works
The magic, according to Internet lore and a bit of LLM spelunking, is in using CPUUsageNSec deltas over a timer interval, which has a few nice properties:
- Short CPU spikes are ignored, since each check only sees the average over the whole timer interval
- Sustained abuse (an interval averaging above 5%) triggers a restart
- Pegged at quota (90% of the 10% cap) triggers a restart at the next check
- Runaway loops are contained by CPUQuota
- Everything is systemd-native and auditable via journalctl
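To make the delta idea concrete, here is a small simulation with made-up counter readings. It is a sketch under the same assumptions as above (30-second interval, 10% quota), and usage_delta and classify are illustrative names, not part of any systemd tooling:

```shell
# Sketch: classify one interval's CPU-time delta (all values in ns).
# Thresholds assume a 30 s interval and CPUQuota=10%.
PEGGED_NS=2700000000      # 90% of the 3 s that a 10% quota allows in 30 s
SUSTAINED_NS=1500000000   # 5% of one CPU over 30 s

# Difference between two counter readings. A reading lower than the
# previous one means the unit restarted and its counter started over.
usage_delta() {
  previous=$1
  current=$2
  if [ "$current" -lt "$previous" ]; then
    echo "$current"
  else
    echo $(( current - previous ))
  fi
}

classify() {
  delta=$1
  if [ "$delta" -ge "$PEGGED_NS" ]; then
    echo pegged
  elif [ "$delta" -ge "$SUSTAINED_NS" ]; then
    echo sustained
  else
    echo ok
  fi
}

# Made-up readings: idle, short spike, sustained load, at quota, after a restart
prev=0
for cur in 200000000 900000000 2600000000 5500000000 100000000; do
  d=$(usage_delta "$prev" "$cur")
  echo "delta=${d}ns -> $(classify "$d")"
  prev=$cur
done
```

With these readings the loop reports ok, ok, sustained, pegged, ok: the 0.7 s spike averages out to about 2% of the window and is ignored, while the last reading is treated as fresh usage after a restart rather than a negative delta.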
It’s not perfect, but at least I got a reusable pattern/template out of this experiment, and I can adapt this to other services as needed.