
Running local LLMs 24/7 means processes crash — a lot. Here's how I replaced a dumb restart loop with an intelligent health-check workflow that recovered my Ollama service in 10 seconds instead of 3 minutes.
The Problem: Crashes Are a Fact of Life
If you're running Draw Things, Ollama, or any memory-hungry local LLM process on a Mac Mini cluster, you already know this: the process will die. Out-of-memory on a long render, a failed model load, a weird signal from an overnight automation — the reasons pile up fast.
My M2 Pro Mac Mini cluster ran Ollama as the backbone for several n8n automation pipelines. When Ollama went down, everything downstream stalled silently. I'd notice three minutes later (if I was awake) or not at all (if I wasn't). That 3-minute window was the real problem.
The naive fix is a shell loop. I tried it. Here's why it's the wrong call.
Section 1: The Infinite Loop Trap
The first thing I wrote was this:
    #!/bin/bash
    while true
    do
        /Applications/Ollama.app/Contents/MacOS/Ollama serve
        echo "Ollama process crashed. Restarting in 5 seconds..."
        sleep 5
    done
It "works" in the same way a band-aid works on a broken leg. The real problems start immediately:
- No exit-code awareness. If Ollama exits with code 0 (a clean shutdown for a config change), this loop restarts it anyway.
- No health validation. The process could be running while the HTTP server on port 11434 is deadlocked. The loop has no idea.
- Orphan accumulation. On macOS, rapid restart cycles without proper cleanup can leave orphaned child processes that silently eat memory.
- No observability. If this loop is the only thing watching the process, you have zero audit trail. You wake up to a system in an unknown state.
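The exit-code point can be made concrete with a small decision function. This is a sketch under stated assumptions, not Ollama's documented behavior: it assumes a clean shutdown shows up as exit code 0 with no signal, that SIGTERM and SIGINT are sent on purpose by an operator, and `shouldRestartAfterExit` is a hypothetical name. The `(code, signal)` pair is what Node.js reports on a child process's `exit` event.

```javascript
// Hypothetical helper: decide whether a supervisor should restart a child
// process, given the exit code and signal reported when it exited.
function shouldRestartAfterExit(exitCode, signal) {
  // Assumption: code 0 with no signal is an intentional, clean shutdown.
  if (exitCode === 0 && signal == null) {
    return false;
  }
  // Assumption: SIGTERM/SIGINT were sent deliberately, so leave it stopped.
  if (signal === "SIGTERM" || signal === "SIGINT") {
    return false;
  }
  // Anything else (non-zero exit, SIGKILL, SIGSEGV, ...) is treated as a crash.
  return true;
}

console.log(shouldRestartAfterExit(0, null));         // false
console.log(shouldRestartAfterExit(1, null));         // true
console.log(shouldRestartAfterExit(null, "SIGKILL")); // true
```

The shell loop above makes none of these distinctions, which is why it restarts after a deliberate shutdown.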
The deeper gotcha: running ≠ healthy. A process can be alive and completely unresponsive. Any restart logic that doesn't verify the actual service endpoint is lying to you.
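One way to picture the distinction is as a three-state classification rather than a running/not-running flag. The sketch below is illustrative only: `classifyService` and the state names are hypothetical, and `httpStatus` is assumed to be `null` when the probe timed out or the connection was refused.

```javascript
// Hypothetical classifier: combine "is the process alive?" with
// "did the HTTP endpoint answer?" into one of three states.
function classifyService({ processAlive, httpStatus }) {
  // httpStatus: status code from the health probe, or null if the
  // request timed out or the connection was refused.
  const endpointOk = httpStatus !== null && httpStatus >= 200 && httpStatus < 300;
  if (endpointOk) return "healthy";
  if (processAlive) return "hung"; // alive but not serving: needs a restart
  return "down";                   // not running at all: needs a start
}

console.log(classifyService({ processAlive: true, httpStatus: 200 }));   // healthy
console.log(classifyService({ processAlive: true, httpStatus: null }));  // hung
console.log(classifyService({ processAlive: false, httpStatus: null })); // down
```

A process-only watchdog cannot tell "healthy" from "hung", because both have a live PID.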
Section 2: Designing an Immune System with n8n
The core insight is this — you don't want to restart a process. You want to verify a service endpoint, and only act when that check fails. That's the difference between a watchdog and an immune system.
n8n is perfect for this because it gives you:
- A visual, auditable flow
- Native HTTP request nodes with failure-path routing
- SSH nodes to execute remote commands
- Built-in logging and optional alerting
Here's the full workflow structure:
Step 1: Schedule Trigger
Set it to run every 30 seconds. n8n's built-in Schedule node handles this cleanly — no cron syntax needed, but you can use it if you prefer.
Step 2: HTTP Health Check Node
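    {
      "method": "GET",
      "url": "http://localhost:11434",
      "options": {
        "continueOnFail": true,
        "timeout": 5000
      }
    }
The critical setting here is continueOnFail: true (shown schematically above; in the n8n editor it lives in the node's Settings tab, and recent versions label it On Error). Without it, a failed request terminates the entire workflow. With it, the workflow keeps going and the failure becomes data you can route on: either test for an error field in an IF node, or choose the "Continue (using error output)" option to get a dedicated failure output. Success goes one way, failure goes another. This is where the intelligence lives.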
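Whichever way the failure is routed, something downstream has to decide what counts as unhealthy. Here is a minimal sketch of that decision in plain JavaScript, of the kind you could adapt into a Code node. It assumes a failed request arrives as an item whose `json` carries an `error` field and that a successful one may expose a numeric `statusCode`. Both field names are assumptions, so check what your n8n version actually emits. `isHealthy` is a hypothetical name.

```javascript
// Hypothetical check: interpret one output item from the health-check node.
function isHealthy(item) {
  // No item at all means the check produced nothing usable.
  if (!item || !item.json) return false;
  const data = item.json;
  // Assumed shape: a failed or timed-out request carries an `error` field.
  if (data.error) return false;
  // Assumed shape: if a status code is exposed, require a 2xx response.
  if (typeof data.statusCode === "number") {
    return data.statusCode >= 200 && data.statusCode < 300;
  }
  // No error and no status code exposed: treat it as a successful response.
  return true;
}

console.log(isHealthy({ json: { statusCode: 200 } }));      // true
console.log(isHealthy({ json: { error: "ECONNREFUSED" } })); // false
console.log(isHealthy({ json: { statusCode: 503 } }));      // false
```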
Step 3: SSH Restart Node (Failure Path Only)
Connect an SSH Command node on the failure branch of the HTTP node:
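    # SSH Command Node: restart Ollama
    # -x matches the process name exactly (-i ignores case). A plain -f pattern
    # can also match the SSH shell running this script, because that shell's
    # command line contains the word "Ollama" too.
    pkill -ix ollama || true
    sleep 2
    open -a Ollama
On macOS, open -a Ollama starts the full app bundle and brings up the menu bar icon. If you're running the CLI binary directly instead:
    pkill -x ollama || true
    sleep 2
    nohup /usr/local/bin/ollama serve > /tmp/ollama.log 2>&1 &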
Step 4: Verification + Logging
After the restart command, add a 5-second Wait node, then run the HTTP check again. If it passes, log "recovered" with a timestamp. If it fails again, fire a Slack or email alert — at that point, human intervention is probably warranted.
    // Set node fields: build log entry
    {
      "timestamp": "{{ $now.toISO() }}",
      "event": "ollama_restart",
      "trigger": "health_check_fail",
      "node": "mac-mini-m2-pro"
    }
Build this in an n8n Set node (the {{ }} expressions are evaluated in node fields, not inside a Code node) and pipe it into a Google Sheet, Notion database, or a simple HTTP POST to your logging endpoint. The audit trail pays off the first time you want to know whether a crash at 3 AM was a one-off or a pattern.
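If you would rather build the entry in code, the verification result can pick the event name at the same time. This is a sketch with hypothetical names: `buildLogEntry`, the `recovered` flag, and the two event strings are illustrative, and the timestamp in the usage example is made up.

```javascript
// Hypothetical helper: build one audit-log row after the second health check.
function buildLogEntry({ recovered, node, now = new Date() }) {
  return {
    timestamp: now.toISOString(),
    // recovered = the post-restart health check passed.
    event: recovered ? "ollama_recovered" : "ollama_restart_failed",
    trigger: "health_check_fail",
    node,
  };
}

const entry = buildLogEntry({
  recovered: true,
  node: "mac-mini-m2-pro",
  now: new Date("2024-01-01T03:00:00Z"), // illustrative timestamp
});
console.log(entry.timestamp, entry.event);
// 2024-01-01T03:00:00.000Z ollama_recovered
```

Logging the failed case as its own event makes it easy to count how often a restart did not help.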
Section 3: Variations and Gotchas
Multiple Services on the Same Cluster
If you're also running Draw Things or a custom inference server alongside Ollama, duplicate the workflow per service with different ports and restart commands. Don't try to cram everything into one mega-workflow — separate flows are easier to debug and disable independently.
| Service | Port | Health Check URL | Restart Command |
|---|---|---|---|
| Ollama | 11434 | http://localhost:11434 | pkill -ix ollama; open -a Ollama |
| Draw Things API | 7860 | http://localhost:7860/ | open -a "Draw Things" |
| Custom LLM server | 8080 | http://localhost:8080/health | launchctl kickstart -k gui/$(id -u)/com.custom.llm |
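With one workflow per service, it helps to keep the per-service values in a single place so the copies do not drift apart. The sketch below mirrors the ports and health URLs from the table. `SERVICES` and `healthUrlFor` are hypothetical names, and the default host is an assumption.

```javascript
// Hypothetical registry: one entry per monitored service, values from the table.
const SERVICES = [
  { name: "Ollama", port: 11434, healthPath: "" },
  { name: "Draw Things API", port: 7860, healthPath: "/" },
  { name: "Custom LLM server", port: 8080, healthPath: "/health" },
];

// Build the health-check URL for one service entry.
function healthUrlFor(service, host = "localhost") {
  return `http://${host}:${service.port}${service.healthPath}`;
}

for (const service of SERVICES) {
  console.log(service.name, healthUrlFor(service));
}
// Ollama http://localhost:11434
// Draw Things API http://localhost:7860/
// Custom LLM server http://localhost:8080/health
```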
macOS vs. Linux vs. Docker
On macOS, use open -a AppName for GUI apps and launchctl for system services. The pkill approach works fine for processes but won't relaunch an .app bundle on its own.
On Linux, swap open -a for systemctl restart servicename if you've set up a systemd unit (which you should — it handles crash restarts natively and you'd use n8n purely for alerting in that case).
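On Docker, replace the SSH node with an HTTP request to the Docker Engine API (POST /containers/container_name/restart), provided that API is reachable from n8n. Alternatively, if you have a Docker node installed in n8n, use it to call docker restart container_name. This is actually cleaner than SSH.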
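For the HTTP route, the request itself is small. The sketch below only builds the method and path for the Docker Engine API's container restart endpoint, where the `t` query parameter is the number of seconds to wait before the container is killed. `dockerRestartRequest`, the `graceSeconds` option, and the container name `ollama` are hypothetical. How you reach the API (TCP or a socket proxy) is left out.

```javascript
// Hypothetical helper: describe the Docker Engine API call that restarts
// a container. Endpoint shape: POST /containers/{id}/restart?t=<seconds>
function dockerRestartRequest(containerName, { graceSeconds = 10 } = {}) {
  return {
    method: "POST",
    // Encode the name so unusual characters cannot break the path.
    path: `/containers/${encodeURIComponent(containerName)}/restart?t=${graceSeconds}`,
  };
}

const request = dockerRestartRequest("ollama"); // container name is an assumption
console.log(request.method, request.path);
// POST /containers/ollama/restart?t=10
```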
The Back-off Trap
Don't restart on every 30-second tick indefinitely. Add a counter or a timestamp comparison — if Ollama has been restarted more than 3 times in 10 minutes, stop trying and page someone. A restart loop that fires 120 times in an hour is worse than the original crash.
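    // Code Node: guard against a restart storm
    // Workflow static data persists between production runs (not manual test runs).
    const state = $getWorkflowStaticData('global');
    const now = Date.now();
    const tenMinutes = 10 * 60 * 1000;
    const maxRestarts = 3;
    // Keep only the restart timestamps from the last 10 minutes.
    const recent = (state.ollamaRestarts || []).filter((t) => now - t < tenMinutes);
    if (recent.length >= maxRestarts) {
      // Already restarted 3 times in the window: skip the restart, go straight to alert.
      state.ollamaRestarts = recent;
      return [{ json: { action: "alert", reason: "restart_storm" } }];
    }
    // Record this restart so later runs can count it.
    state.ollamaRestarts = [...recent, now];
    return [{ json: { action: "restart" } }];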
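The guard can also be written as a pure function, which makes the counting rule from the text (no more than 3 restarts per 10 minutes) testable outside n8n. This is a sketch under assumptions: `decideRestart` and its option names are hypothetical, timestamps are in milliseconds, and the caller is responsible for persisting the returned history between runs.

```javascript
// Hypothetical sliding-window guard: allow at most `maxRestarts` restarts
// within any `windowMs` period, otherwise escalate to an alert.
function decideRestart(history, now, { windowMs = 10 * 60 * 1000, maxRestarts = 3 } = {}) {
  // Drop restart timestamps that have aged out of the window.
  const recent = history.filter((t) => now - t < windowMs);
  if (recent.length >= maxRestarts) {
    return { action: "alert", reason: "restart_storm", history: recent };
  }
  // Allowed: record this restart in the returned history.
  return { action: "restart", history: [...recent, now] };
}

// Simulate four consecutive failed checks, 30 seconds apart.
let restartHistory = [];
for (const tick of [0, 30000, 60000, 90000]) {
  const result = decideRestart(restartHistory, tick);
  restartHistory = result.history;
  console.log(tick, result.action);
}
// 0 restart
// 30000 restart
// 60000 restart
// 90000 alert
```

Because the window slides, restarts are allowed again once the oldest timestamps are more than 10 minutes old.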
What about launchd on macOS?
launchd with a KeepAlive plist is the "right" macOS-native answer for process supervision. If you just need Ollama to stay alive and nothing else, it's worth setting up. But for anything that needs observability, cross-service coordination, or downstream alerting integrated with the rest of your automation stack, n8n wins on flexibility.
Closing
Measured on my M2 Pro cluster: average Ollama downtime dropped from ~3 minutes (manual detection + intervention) to under 10 seconds (automated detect + restart + verify). That's roughly a 94% reduction in recovery time (170 of 180 seconds saved), and more importantly, the pipelines that depend on Ollama now self-heal without waking me up.
The real shift isn't the time savings. It's moving from "I restart things when they break" to "the system diagnoses itself and applies a fix." Once that pattern is in place for Ollama, it takes 20 minutes to add the same layer to every other service in the cluster.
Next up: combining this health-check data with Grafana to visualize crash frequency per service over time — useful for catching gradual memory leaks before they become outages. 🔧
🐦 Faster updates on X: @baegseungh7061