Coordinated OS Patching (Phase 8 pass-1)
Operational guide for push-mode Linux patching from infra-services, including observability and notifications. Architecture detail: patching architecture.
Status at a glance
| Area | In repo (pass-1) | Live on homelab (your deploy) |
|---|---|---|
| C&C timer + SSH key | Yes | After ansible-pull on infra-services |
| Wave patching (saltierpoop → prox → infra-services) | Yes | After first manual/systemd run |
| Discord weekly summary | Yes | After notify.sops.yaml + SOPS |
| ntfy critical (failure, pre-reboot) | Yes | After SOPS + phone subscribe |
| Prometheus patch alerts | Yes | After monitoring stack reload |
| Grafana patching dashboard | Yes | Auto-provisioned with stack |
| ansible_pull.prom fix | Yes | After ansible-pull on all hosts |
| Weekly schedule (Sun 04:00 PT) | Yes | Timer enabled by patch-controller role |
Pass-1 is code-complete in git. Production patching is not live until you complete the Owner checklist below.
Assigned ntfy topic
Use this topic for patch-critical alerts only (separate from the Whrrr capacity topic,
homelab-whrrr-capacity-7f2a):
homelab-patch-critical-b4e9
Server: https://ntfy.sh (public ntfy.sh — same as capacity alerts).
Subscribe on your phone now (takes 30 seconds):
- Install the ntfy app.
- Add subscription → topic homelab-patch-critical-b4e9.
- Test (basic):
curl -d "homelab patch ntfy test" -H "Title: Patch test" \
"https://ntfy.sh/homelab-patch-critical-b4e9"
- Test enriched notifications (orchestrator, pre-reboot, Alertmanager template):
# From repo root on infra-services (notify env is mode 600 root-only):
sudo bash -c 'set -a; source /etc/homelab/patching-notify.env; set +a; bash /var/lib/ansible-pull/homelab/scripts/test-patch-ntfy.sh --live'
CI runs scripts/test-patch-ntfy.sh --dry-run on every push. Live mode sends
three TEST messages; expect skull/patch tags, ARA/Grafana action buttons on the
failure-style message, and a formatted Alertmanager alert.
Enriched headers used in production:
| Path | Tags | Click | Actions |
|---|---|---|---|
| Orchestrator failure | skull,patch | ARA | Open ARA, Grafana |
| Pre-reboot | rotating_light,computer | Grafana patching dashboard | — |
| Alertmanager critical patching | (template) | (template) | — |
Alertmanager posts to ?template=alertmanager so webhook JSON becomes a readable
title + message instead of raw JSON on your phone.
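The enriched headers in the table above correspond to ntfy's Title, Tags, Click, and Actions publish headers. The following is a minimal sketch of how such a request could be assembled, not the repo's notifier: the function name send_patch_failure, the message text, and the example.invalid URLs are assumptions for illustration. In dry-run mode it prints the curl arguments instead of sending anything.

```shell
#!/bin/sh
# Hypothetical helper, not the repo's notifier.
# $1 = mode (dry-run|live), $2 = ntfy topic, $3 = ARA URL, $4 = Grafana URL
send_patch_failure() {
  mode=$1
  topic=$2
  ara_url=$3
  grafana_url=$4
  # Rebuild the positional parameters as the curl argument list.
  set -- -sS \
    -H "Title: Patch orchestrator failed" \
    -H "Tags: skull,patch" \
    -H "Click: ${ara_url}" \
    -H "Actions: view, Open ARA, ${ara_url}; view, Grafana, ${grafana_url}" \
    -d "Orchestrator exited non-zero" \
    "https://ntfy.sh/${topic}"
  if [ "$mode" = "live" ]; then
    curl "$@"
  else
    printf '%s\n' "$@"   # dry run: one curl argument per line
  fi
}

# Placeholder URLs; substitute the real ARA and Grafana addresses.
send_patch_failure dry-run homelab-patch-critical-b4e9 \
  "https://ara.example.invalid" "https://grafana.example.invalid"
```

Keeping the arguments in the positional parameters avoids quoting problems with header values that contain spaces, commas, or semicolons.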
Discord uses rich embeds (color, fields, timestamps) and link buttons via
/usr/local/lib/homelab/discord-patch-notify.py:
sudo bash -c 'set -a; source /etc/homelab/patching-notify.env; set +a; bash /var/lib/ansible-pull/homelab/scripts/test-patch-discord.sh --live'
Expect four TEST embeds: validation (blue), success (green), failure (red), reboot (yellow) — each with ARA / Grafana / Runbook buttons.
What pass-1 built (reference)
Patching mechanism
| Component | Path / unit |
|---|---|
| Patch controller role | infra/ansible/roles/patch-controller/ |
| Patching role | infra/ansible/roles/patching/ |
| Playbook | infra/ansible/playbooks/patch.yml |
| Systemd timer | homelab-patch-orchestrate.timer |
| Systemd service | homelab-patch-orchestrate.service |
| Wrapper script | /usr/local/bin/homelab-patch-orchestrate |
| SSH key | /etc/homelab/patch-controller/id_ed25519 |
| Notify env | /etc/homelab/patching-notify.env |
Wave order: saltierpoop (0) → prox (1) → infra-services (2). C&C
patches itself last.
Policy defaults (group_vars/patching_targets.yml):
- patching_upgrade_type: security (apt safe upgrade)
- patching_reboot: auto_if_required (reboot if /var/run/reboot-required exists)
- unattended-upgrades removed on targets
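The reboot policy reduces to two checks: the policy value and the presence of the reboot-required flag file. A minimal shell sketch of that decision, assuming a hypothetical helper name (should_reboot); the patching role itself implements this in Ansible, which is not shown here.

```shell
#!/bin/sh
# Hypothetical helper mirroring the patching_reboot policy values.
# $1 = policy (auto_if_required|never)
# $2 = flag path (optional; defaults to the Debian/Ubuntu flag file)
should_reboot() {
  policy=$1
  flag=${2:-/var/run/reboot-required}
  [ "$policy" = "auto_if_required" ] && [ -e "$flag" ]
}

if should_reboot auto_if_required; then
  echo "reboot required and allowed by policy"
else
  echo "no reboot"
fi
```

With patching_reboot set to never, the function returns non-zero even when the flag file exists, which is the state the PatchRebootPending alert is meant to surface.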
Observability
| Signal | Source |
|---|---|
| patch.prom | Each patching target: last success time, reboot flag |
| patch_orchestrate.prom | infra-services only: orchestrator exit code + run time |
| ansible_pull.prom | Each ansible-pull host: last successful apply |
| ARA | All orchestrator / playbook runs |
| Grafana | Dashboard Coordinated OS Patching (patching.json) |
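The .prom files are plain-text node_exporter textfile-collector files, so a single metric can be read on the host without going through Prometheus. A minimal sketch, assuming a hypothetical helper name (prom_value) and assuming that homelab_patch_last_success_unixtime is the metric written to patch.prom; the sample file below is fabricated for the demo only.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one metric from a textfile-collector file.
# $1 = .prom file, $2 = metric name (labels, if any, are ignored)
prom_value() {
  awk -v name="$2" '
    /^#/ { next }                                # skip HELP/TYPE comments
    { metric = $1; sub(/[{].*/, "", metric) }    # drop any {label="..."} suffix
    metric == name { print $NF; found = 1; exit }
    END { exit !found }                          # non-zero status if the metric is absent
  ' "$1"
}

# Demo against a throwaway sample file (values are made up).
sample=$(mktemp)
printf '%s\n' \
  '# TYPE homelab_patch_last_success_unixtime gauge' \
  'homelab_patch_last_success_unixtime 1700000000' > "$sample"
prom_value "$sample" homelab_patch_last_success_unixtime
rm -f "$sample"
```

On a managed host the first argument would be /var/lib/node_exporter/textfile_collector/patch.prom.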
Notifications
| Event | Discord | ntfy |
|---|---|---|
| Weekly patch success | Rich embed + per-host fields + link buttons | — |
| Orchestrator / playbook failure | Rich embed + @here + ARA/Grafana/Prometheus buttons | Rich (tags, click, actions) |
| Pre-reboot on a host | Rich embed + Grafana button | Rich (tags, click) |
| Prometheus patch alerts | Enhanced markdown + runbook links | Alertmanager template (critical only) |
| Converge validation | Rich embed probe on ansible-pull | Enriched curl probe |
Alerts (monitoring/prometheus/alerts/patching.yml):
| Alert | Severity | Condition |
|---|---|---|
| OSPatchStale | warning | patch.prom not updated in 8 days |
| OSPatchOrchestrateFailed | critical | orchestrator exit code ≠ 0 |
| PatchRebootPending | warning | reboot-required flag set for 30m |
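The OSPatchStale condition is an age comparison: eight days is 8 × 86400 = 691200 seconds. The actual PromQL expression lives in patching.yml and is not reproduced here; the sketch below is only a shell analogue for ad hoc checks, with a hypothetical function name (patch_is_stale).

```shell
#!/bin/sh
# Hypothetical helper: succeed when the last success is older than the threshold.
# $1 = last success (unix seconds), $2 = now (unix seconds), $3 = max age in days (default 8)
patch_is_stale() {
  max_age=$(( ${3:-8} * 86400 ))
  [ $(( $2 - $1 )) -gt "$max_age" ]
}

now=$(date +%s)
last=$(( now - 9 * 86400 ))   # pretend the last successful patch was nine days ago
if patch_is_stale "$last" "$now"; then
  echo "stale"
else
  echo "fresh"
fi
```

The first argument can be taken from the last-success timestamp in patch.prom when checking a real host.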
Owner checklist
Complete in order. Check off as you go.
- [ ] Subscribe to ntfy topic homelab-patch-critical-b4e9 (see above)
- [ ] Push pass-1 to main if not already on origin
- [ ] SOPS — Ansible notify (Discord webhook + ntfy topic):
cd infra/ansible/inventory/group_vars/patch_controller
cp notify.sops.yaml.example notify.sops.yaml
# Edit: patching_notify_discord_webhook_url = same URL as monitoring Discord webhook
# patching_notify_ntfy_topic = homelab-patch-critical-b4e9 (already in example)
SOPS_AGE_KEY_FILE=/etc/homelab/age-key.txt sops -e -i notify.sops.yaml
git add notify.sops.yaml && git commit && git push
- [ ] SOPS — Alertmanager (add to existing monitoring secrets):
Re-decrypt .env on infra-services and reload alertmanager (below).
- [ ] Converge all managed hosts — wait for ansible-pull or trigger on each:
- infra-services, saltierpoop (ansible-pull); prox optional full bootstrap
- [ ] Reload monitoring on infra-services
- [ ] Publish patch-controller.pub and verify SSH from infra-services (sudo — key is root-only):
sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes someone@192.168.6.243 true
sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes root@192.168.6.71 true
- [ ] Dry-run patch (see Testing)
- [ ] One live patch via systemd unit (not raw playbook)
- [ ] Notification test matrix (see Testing)
- [ ] Mark README Owner TODO rows for Phase 8 done
Prerequisites
- ansible-pull converging on infra-services (deploys patch-controller role)
- Patch-controller SSH from infra-services to saltierpoop (someone) and prox (root), verified with the patch private key (see below). infra-services has no general outbound SSH to other hosts; one-time bootstrap uses an operator workstation (scripts/bootstrap-patch-ssh-from-operator.sh).
- Repo checkout at /var/lib/ansible-pull/homelab (or /opt/homelab)
- SOPS age key on infra-services at /etc/homelab/age-key.txt
After bootstrap, targets also trust the committed
infra/ansible/inventory/files/patch-controller.pub via each host's ansible-pull
(from="192.168.6.17"). Optional backup-fetch push to saltierpoop only runs when
/etc/homelab/backup-fetch/id_ed25519 parses cleanly (ssh-keygen -y).
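The from= restriction is a standard authorized_keys option that limits which source address may use a key. A minimal sketch of the resulting line, assuming a hypothetical helper name (restricted_key_line) and a placeholder key string; the real entry is written by each host's ansible-pull.

```shell
#!/bin/sh
# Hypothetical helper: prefix a public key with a from= source restriction.
# $1 = allowed source address, $2 = public key line
restricted_key_line() {
  printf 'from="%s" %s\n' "$1" "$2"
}

# Placeholder key material, not a real key.
restricted_key_line 192.168.6.17 "ssh-ed25519 AAAAEXAMPLE patch-controller"
```

With this option in place, sshd rejects the patch key when the connection does not originate from 192.168.6.17, even if the private key is copied elsewhere.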
Deploy steps (infra-services)
1. Verify patch-controller landed
ssh someone@192.168.6.17
sudo systemctl list-timers homelab-patch-orchestrate.timer
sudo ls -la /etc/homelab/patch-controller/
sudo ls -la /etc/homelab/patching-notify.env # after notify.sops.yaml committed + pull
1b. Patch SSH bootstrap (one-time, if not already done)
From a workstation with SSH to all hosts (not from infra-services as someone), run scripts/bootstrap-patch-ssh-from-operator.sh.
Or verify manually on infra-services:
sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes someone@192.168.6.243 true
sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes root@192.168.6.71 true
patch-controller.pub in git keeps new targets converged via ansible-pull.
Expected:
- Timer: Sun … weekly, timezone America/Los_Angeles
- Private key: /etc/homelab/patch-controller/id_ed25519 (mode 600)
- Notify env: mode 600, contains NTFY_PATCH_TOPIC=homelab-patch-critical-b4e9
2. Reload monitoring (after SOPS + push)
cd /opt/homelab/services/monitoring
SOPS_AGE_KEY_FILE=/etc/homelab/age-key.txt sops -d .env.sops.yaml \
| sed 's/: /=/' > .env
docker compose up -d alertmanager prometheus grafana
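The sed expression in the decrypt step turns flat "KEY: value" YAML lines into "KEY=value" lines for the .env file. A small sketch of what it does and does not do, using made-up sample values; the function name yaml_to_env is hypothetical.

```shell
#!/bin/sh
# Same substitution as the decrypt pipeline: only the first ": " on each line
# is replaced, so values that themselves contain ": " are left intact.
yaml_to_env() {
  sed 's/: /=/'
}

# Sample input (made-up values) standing in for decrypted SOPS output.
printf '%s\n' \
  'NTFY_PATCH_TOPIC: homelab-patch-critical-b4e9' \
  'EXAMPLE_NOTE: first: second' | yaml_to_env
```

This only suits flat, unquoted key/value YAML. Quoted values keep their quotes and nested keys are not flattened, so it is worth checking the generated .env after adding a new secret.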
3. Confirm metrics paths (after converge)
On each managed host:
ls -la /var/lib/node_exporter/textfile_collector/patch.prom
ls -la /var/lib/node_exporter/textfile_collector/ansible_pull.prom
On infra-services only:
ls -la /var/lib/node_exporter/textfile_collector/patch_orchestrate.prom
Prometheus: https://prometheus.infra.realemail.app — query
homelab_patch_last_success_unixtime.
Testing
Bootstrap patch SSH (one-time)
Diagnose backup-fetch first (error in libcrypto = unreadable key file, not
“wrong host key”):
sudo ls -la /etc/homelab/backup-fetch/id_ed25519
sudo ssh-keygen -y -f /etc/homelab/backup-fetch/id_ed25519
# Must match infra/ansible/inventory/files/backup-fetch.pub
If ssh-keygen -y fails, fix or regenerate backup-fetch before relying on
automated saltierpoop push. After ansible-pull redeploys the key, re-run
ssh-keygen -y. See scripts/update-backup-sops.sh if SOPS and disk diverged.
saltierpoop (recommended — uses your normal someone SSH, not backup-fetch):
PUB=$(sudo cat /etc/homelab/patch-controller/id_ed25519.pub)
ssh someone@192.168.6.243 "mkdir -p ~/.ssh && chmod 700 ~/.ssh && (grep -qxF '$PUB' ~/.ssh/authorized_keys 2>/dev/null || echo '$PUB' >> ~/.ssh/authorized_keys) && chmod 600 ~/.ssh/authorized_keys"
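The remote command above appends the key only when an identical line is not already present, so it is safe to re-run. The same pattern in isolation, as a minimal sketch with a hypothetical function name (add_key_once) and a placeholder key, run against a temporary file rather than a real authorized_keys:

```shell
#!/bin/sh
# Hypothetical helper: append a key line only if the exact line is missing.
# $1 = authorized_keys file, $2 = public key line
add_key_once() {
  # -x matches whole lines, -F treats the key as a fixed string, not a regex.
  grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

keys=$(mktemp)
add_key_once "$keys" "ssh-ed25519 AAAAEXAMPLE patch-controller"   # placeholder, not a real key
add_key_once "$keys" "ssh-ed25519 AAAAEXAMPLE patch-controller"   # second call is a no-op
grep -c '' "$keys"   # line count after two calls
rm -f "$keys"
```

Matching the whole line with -x matters: a substring match would treat a key with a different comment or option prefix as already installed.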
prox — one-time password auth (patch private key is root-only):
sudo ssh-copy-id -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes root@192.168.6.71
Verify patch key (sudo required):
sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes someone@192.168.6.243 true
sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes root@192.168.6.71 true
Dry run (check mode)
Check mode does not send the Discord summary, reboot hosts, or reliably update all metrics. Use it only to preview apt changes.
cd /var/lib/ansible-pull/homelab/infra/ansible
sudo bash -c '
export ANSIBLE_PRIVATE_KEY_FILE=/etc/homelab/patch-controller/id_ed25519
export ANSIBLE_CONFIG=$PWD/ansible.cfg
ansible-playbook -i inventory/generated.yml playbooks/patch.yml --check --diff
'
Pick a low-traffic window for the first live run (saltierpoop is wave 0).
Live patch (production path)
Always use the systemd unit so the wrapper records exit metrics and failure notifications:
sudo systemctl start homelab-patch-orchestrate.service
sudo journalctl -u homelab-patch-orchestrate.service -n 100 --no-pager
Runs appear in ARA: https://ara.infra.realemail.app or http://192.168.6.17:8000.
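The reason to go through the unit is that the wrapper records the playbook's exit status where Prometheus can see it. The real wrapper at /usr/local/bin/homelab-patch-orchestrate is not shown in this guide; the following is a minimal sketch of the general pattern only. The function name run_and_record and the metric homelab_patch_orchestrate_last_run_unixtime are assumptions; homelab_patch_orchestrate_last_exit_code is the metric named in the test matrix.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, then atomically write its exit code and
# finish time in node_exporter textfile-collector format.
# $1 = output .prom path, remaining arguments = command to run
run_and_record() {
  prom_file=$1
  shift
  rc=0
  "$@" || rc=$?
  tmp="${prom_file}.$$"
  {
    echo "homelab_patch_orchestrate_last_exit_code ${rc}"
    echo "homelab_patch_orchestrate_last_run_unixtime $(date +%s)"   # assumed metric name
  } > "$tmp"
  mv "$tmp" "$prom_file"   # rename so the collector never reads a partial file
  return "$rc"
}

# Demo in a scratch directory with a command that fails on purpose.
workdir=$(mktemp -d)
run_and_record "$workdir/patch_orchestrate.prom" sh -c 'exit 3' \
  || echo "run failed with exit code $?"
cat "$workdir/patch_orchestrate.prom"
rm -rf "$workdir"
```

Running ansible-playbook directly skips this step, which is why the exit-code metric and failure notifications only reflect runs started through the unit.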
Notification test matrix (pass-1 acceptance)
| # | Test | How | Expected |
|---|---|---|---|
| 0 | ntfy enriched (automated) | bash scripts/test-patch-ntfy.sh --dry-run (CI) or --live on infra-services | Dry-run passes; live sends 3 TEST pushes with tags/click/actions |
| 0b | Discord rich (automated) | bash scripts/test-patch-discord.sh --dry-run (CI) or --live on infra-services | Four TEST embeds with colors + link buttons |
| 1 | ntfy subscribe | curl test above or --live script | Phone notification |
| 2 | Successful live patch | systemctl start homelab-patch-orchestrate.service | Discord summary with lines like saltierpoop: patched; no ntfy |
| 3 | Metrics fresh | Prometheus / Grafana | patch.prom, patch_orchestrate.prom updated; exit code 0 |
| 4 | No false stale | Prometheus alerts | OSPatchStale not firing |
| 5 | MAINTENANCE skip | sudo touch /etc/homelab/MAINTENANCE on saltierpoop; run patch; remove file | Summary shows saltierpoop: skipped_maintenance; others patch |
| 6 | Orchestrator failure | Temporarily break SSH to prox or edit wrapper to use bad inventory; run unit; revert | Discord + ntfy critical (skull/patch tags, ARA click); homelab_patch_orchestrate_last_exit_code != 0 |
| 7 | Pre-reboot ntfy | Only if a kernel update leaves /var/run/reboot-required | ntfy before reboot (rotating_light tag, Grafana click) |
Discord summary example
Success posts a green embed with inline fields per host (not plain text):
✅ Weekly OS Patch Complete
Coordinated patching finished across all waves. · 2 patched · 1 rebooted
saltierpoop prox infra-services
✅ Patched 🔄 Rebooted ✅ Patched
Buttons: 📋 ARA · 📊 Grafana · 📖 Runbook
With MAINTENANCE on one host:
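Operations
Pause patching on one host
Create /etc/homelab/MAINTENANCE on that host (sudo touch /etc/homelab/MAINTENANCE); remove the file to resume. The flag also pauses ansible-pull on that host.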
Pause all patching (C&C)
On infra-services only:
The orchestrator will not start (ExecCondition). Running the playbook manually is still possible.
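systemd's ExecCondition= runs a command before the service starts: exit status 0 lets the unit proceed, 1 through 254 skips it without marking it failed, and 255 marks it failed. A minimal sketch of a guard that fits that contract, assuming a hypothetical function name (patching_allowed); the flag path is passed as a parameter because the C&C pause flag used by the real unit is not shown in this guide.

```shell
#!/bin/sh
# Hypothetical guard suitable for use behind ExecCondition=.
# $1 = pause flag path (placeholder; substitute the flag the unit actually checks)
patching_allowed() {
  if [ -e "$1" ]; then
    echo "pause flag $1 present; skipping run" >&2
    return 1    # 1..254: systemd skips the unit without marking it failed
  fi
  return 0
}

# Demo with a throwaway flag file.
flag=$(mktemp)
patching_allowed "$flag" || echo "would skip"
rm -f "$flag"
patching_allowed "$flag" && echo "would run"
```

Because the guard belongs to the unit, it does not stop someone from invoking ansible-playbook by hand, which matches the note above.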
Change schedule
Edit patch_orchestrate_calendar / patch_orchestrate_timezone in
roles/patch-controller/defaults/main.yml, commit, wait for ansible-pull.
Default: Sunday 04:00 Pacific, RandomizedDelaySec=15min.
Change upgrade policy
| Goal | Variable | Value |
|---|---|---|
| Security updates only (default) | patching_upgrade_type | security |
| Full dist-upgrade | patching_upgrade_type | dist |
| Never auto-reboot | patching_reboot | never |
| Reboot if required | patching_reboot | auto_if_required |
Set in group_vars/patching_targets.yml or per-host host_vars/.
Onboard a new Linux host
- ansible.managed: true in inventory; add to patching_targets + patching_waveN
- uv run python inventory/generators/render-ansible.py and commit
- Bootstrap with site.yml (ansible-pull)
- Converge infra-services (patch-controller key authorize play)
- Add wave play in playbooks/patch.yml if you created a new wave group
Redistribute patch-controller SSH key
If infra-services was rebuilt:
cd /var/lib/ansible-pull/homelab/infra/ansible
export ANSIBLE_CONFIG=$PWD/ansible.cfg
SOPS_AGE_KEY_FILE=/etc/homelab/age-key.txt \
sudo -E ansible-playbook -i inventory/generated.yml playbooks/site.yml \
--tags patch-controller \
--limit infra-services \
-e ansible_connection=local
prox bootstrap (optional)
prox is patchable as root without full ansible-pull:
ansible-playbook -i infra/ansible/inventory/generated.yml \
infra/ansible/playbooks/site.yml \
--tags common,ansible-pull --limit prox -e ansible_user=root
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Permission denied (publickey) to prox | patch-controller pubkey not on root | Commit patch-controller.pub + ansible-pull on prox, or sudo ssh-copy-id -i /etc/homelab/patch-controller/id_ed25519 root@192.168.6.71 |
| Load key ... error in libcrypto (backup-fetch) | Corrupt private key on disk or bad SOPS deploy | sudo ssh-keygen -y -f /etc/homelab/backup-fetch/id_ed25519; fix SOPS + ansible-pull, or bootstrap patch SSH via normal someone SSH |
| Identity file ... not accessible: Permission denied | Manual test as someone | Patch key is root-only; prefix commands with sudo |
| Patch playbook UNREACHABLE on infra-services | C&C play used SSH with patch key | Pull latest main; C&C plays use ansible_connection: local |
| Patch skipped on host | MAINTENANCE flag | test -f /etc/homelab/MAINTENANCE |
| No Discord summary | Missing / invalid notify.sops.yaml or webhook | Check /etc/homelab/patching-notify.env; verify SOPS committed |
| No ntfy on failure | Topic not in env or not subscribed | grep NTFY_PATCH_TOPIC in notify env + monitoring .env; test curl |
| OSPatchStale firing | Patch hasn't run in 8+ days | Check timer; run manual patch |
| OSPatchOrchestrateFailed | Last wrapper exit ≠ 0 | journalctl -u homelab-patch-orchestrate; fix playbook error; re-run |
| PatchRebootPending | Reboot required but not done | SSH to host; sudo reboot or investigate hung reboot |
| AnsiblePullStale | ansible-pull not applying | systemctl status ansible-pull-apply.timer; check git pull / SSH URL |
| No patch.prom in Prometheus | node_exporter textfile not enabled | ansible-pull converge common role on host |
| saltierpoop media blip | Wave 0 runs first | Schedule is off-peak Sun 04:00; manual runs: pick quiet time |
| prox reboot concern | Hypervisor restart | Safe upgrade rarely requires reboot; ntfy warns if it does |
| Raw playbook vs unit | Ran ansible-playbook directly | Use systemctl start homelab-patch-orchestrate.service for metrics + failure notify |
Useful commands
# Orchestrator logs
sudo journalctl -u homelab-patch-orchestrate.service -n 200 --no-pager
# Timer next run
systemctl list-timers homelab-patch-orchestrate.timer
# Last orchestrator metrics
cat /var/lib/node_exporter/textfile_collector/patch_orchestrate.prom
# Test Discord (four rich TEST embeds; notify env is root-only)
sudo bash -c 'set -a; source /etc/homelab/patching-notify.env; set +a; bash /var/lib/ansible-pull/homelab/scripts/test-patch-discord.sh --live'
# Test ntfy (basic)
curl -d "test" -H "Title: Patch" \
"https://ntfy.sh/homelab-patch-critical-b4e9"
# Test enriched ntfy (three TEST notifications; notify env is root-only)
sudo bash -c 'set -a; source /etc/homelab/patching-notify.env; set +a; bash /var/lib/ansible-pull/homelab/scripts/test-patch-ntfy.sh --live'
Out of scope (pass-1)
- Saltbox container image updates (Saltbox / Komodo)
- Loki/journald shipping for orchestrator logs (Phase 9 syslog cohort)
- Per-wave Discord messages (summary-only design)
- Appliances (UDM, DSM, HAOS) — vendor patching
Related
- Patching architecture
- Synology capacity ntfy — same ntfy.sh pattern, different topic
- PLAN.md Phase 8
- secrets/patching/README.md
- Grafana: Coordinated OS Patching dashboard