
Coordinated OS Patching (Phase 8 pass-1)

Operational guide for push-mode Linux patching from infra-services, including observability and notifications. Architecture detail: patching architecture.


Status at a glance

| Area | In repo (pass-1) | Live on homelab (your deploy) |
| --- | --- | --- |
| C&C timer + SSH key | Yes | After ansible-pull on infra-services |
| Wave patching (saltierpoop → prox → infra-services) | Yes | After first manual/systemd run |
| Discord weekly summary | Yes | After notify.sops.yaml + SOPS |
| ntfy critical (failure, pre-reboot) | Yes | After SOPS + phone subscribe |
| Prometheus patch alerts | Yes | After monitoring stack reload |
| Grafana patching dashboard | Yes | Auto-provisioned with stack |
| ansible_pull.prom fix | Yes | After ansible-pull on all hosts |
| Weekly schedule (Sun 04:00 PT) | Yes | Timer enabled by patch-controller role |

Pass-1 is code-complete in git. Production patching is not live until you complete the Owner checklist below.


Assigned ntfy topic

Use this topic for patch-critical alerts only (separate from Whrrr capacity homelab-whrrr-capacity-7f2a):

homelab-patch-critical-b4e9

Server: https://ntfy.sh (public ntfy.sh — same as capacity alerts).

Subscribe on your phone now (takes 30 seconds):

  1. Install ntfy app.
  2. Add subscription → topic homelab-patch-critical-b4e9.
  3. Test (basic):
curl -d "homelab patch ntfy test" -H "Title: Patch test" \
  "https://ntfy.sh/homelab-patch-critical-b4e9"
  4. Test enriched notifications (orchestrator, pre-reboot, Alertmanager template):
# From repo root on infra-services (notify env is mode 600 root-only):
sudo bash -c 'set -a; source /etc/homelab/patching-notify.env; set +a; bash /var/lib/ansible-pull/homelab/scripts/test-patch-ntfy.sh --live'

CI runs scripts/test-patch-ntfy.sh --dry-run on every push. Live mode sends three TEST messages; expect skull/patch tags, ARA/Grafana action buttons on the failure-style message, and a formatted Alertmanager alert.

Enriched headers used in production:

| Path | Tags | Click | Actions |
| --- | --- | --- | --- |
| Orchestrator failure | skull,patch | ARA | Open ARA, Grafana |
| Pre-reboot | rotating_light,computer | Grafana patching dashboard | |
| Alertmanager critical | patching | (template) | (template) |
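
As an illustration of how these headers map onto an ntfy publish call, the sketch below assembles the curl arguments for a failure-style push and prints them one per line instead of sending anything. The function name, the title and message text, and the URLs passed in are placeholders, not the values used by the production wrapper.

```shell
# Hypothetical helper: print (do not send) the curl arguments for a
# failure-style ntfy push using the Title, Tags, Click and Actions headers.
ntfy_failure_args() {
  local server="$1" topic="$2" ara_url="$3" grafana_url="$4"
  printf '%s\n' \
    "-H" "Title: Patch orchestrator failed" \
    "-H" "Tags: skull,patch" \
    "-H" "Click: ${ara_url}" \
    "-H" "Actions: view, Open ARA, ${ara_url}; view, Grafana, ${grafana_url}" \
    "-d" "homelab patch run failed" \
    "${server}/${topic}"
}

# The Grafana URL below is an invented placeholder.
ntfy_failure_args "https://ntfy.sh" "homelab-patch-critical-b4e9" \
  "https://ara.infra.realemail.app" "https://grafana.example.invalid/d/patching"
```

Printing the arguments first makes it easy to review the headers before pointing the same values at a real curl invocation.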

Alertmanager posts to ?template=alertmanager so webhook JSON becomes a readable title + message instead of raw JSON on your phone.

Discord uses rich embeds (color, fields, timestamps) and link buttons via /usr/local/lib/homelab/discord-patch-notify.py:

sudo bash -c 'set -a; source /etc/homelab/patching-notify.env; set +a; bash /var/lib/ansible-pull/homelab/scripts/test-patch-discord.sh --live'

Expect four TEST embeds: validation (blue), success (green), failure (red), reboot (yellow) — each with ARA / Grafana / Runbook buttons.


What pass-1 built (reference)

Patching mechanism

| Component | Path / unit |
| --- | --- |
| Patch controller role | infra/ansible/roles/patch-controller/ |
| Patching role | infra/ansible/roles/patching/ |
| Playbook | infra/ansible/playbooks/patch.yml |
| Systemd timer | homelab-patch-orchestrate.timer |
| Systemd service | homelab-patch-orchestrate.service |
| Wrapper script | /usr/local/bin/homelab-patch-orchestrate |
| SSH key | /etc/homelab/patch-controller/id_ed25519 |
| Notify env | /etc/homelab/patching-notify.env |

Wave order: saltierpoop (0) → prox (1) → infra-services (2). C&C patches itself last.
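
The wave ordering can be pictured as a sort on the wave number. The sketch below is illustrative only: it takes hypothetical host:wave pairs on the command line, whereas the real ordering comes from the patching_waveN inventory groups and the plays in playbooks/patch.yml.

```shell
# Illustrative only: order "host:wave" pairs by ascending wave number.
wave_order() {
  # sort numerically on the field after the colon, then keep the host name
  printf '%s\n' "$@" | sort -t: -k2,2n | cut -d: -f1
}

wave_order "infra-services:2" "saltierpoop:0" "prox:1"
```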

Policy defaults (group_vars/patching_targets.yml):

  • patching_upgrade_type: security — apt safe upgrade
  • patching_reboot: auto_if_required — reboot if /var/run/reboot-required
  • unattended-upgrades removed on targets
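
The auto_if_required policy reduces to a check for the reboot-required flag file. The following is a minimal sketch of that decision, not the role's actual task logic; the function name and the optional flag-path argument are hypothetical and exist only to make the check easy to try out.

```shell
# Hypothetical helper mirroring the patching_reboot policy values.
# Returns 0 when a reboot should happen, non-zero otherwise.
should_reboot() {
  local policy="$1" flag="${2:-/var/run/reboot-required}"
  case "$policy" in
    auto_if_required) [ -f "$flag" ] ;;   # reboot only if the flag file exists
    never)            return 1 ;;
    *)                echo "unknown patching_reboot value: $policy" >&2; return 2 ;;
  esac
}

if should_reboot auto_if_required; then
  echo "reboot required"
else
  echo "no reboot needed"
fi
```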

Observability

| Signal | Source |
| --- | --- |
| patch.prom | Each patching target: last success time, reboot flag |
| patch_orchestrate.prom | infra-services only: orchestrator exit code + run time |
| ansible_pull.prom | Each ansible-pull host: last successful apply |
| ARA | All orchestrator / playbook runs |
| Grafana | Dashboard Coordinated OS Patching (patching.json) |
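
Files in the textfile collector directory use the Prometheus text exposition format, so a metric can be read back with standard tools when debugging. A small sketch, assuming the metric is written without labels; the helper name is hypothetical, and homelab_patch_last_success_unixtime is the metric queried later in this guide.

```shell
# Hypothetical helper: print the value of an unlabelled metric from a
# node_exporter textfile (.prom). Prints nothing if the metric is absent.
prom_value() {
  local file="$1" metric="$2"
  # "# HELP" / "# TYPE" lines have "#" as their first field, so they never match.
  awk -v m="$metric" '$1 == m { print $2; exit }' "$file"
}

# Example (path from this guide):
# prom_value /var/lib/node_exporter/textfile_collector/patch.prom \
#   homelab_patch_last_success_unixtime
```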

Notifications

| Event | Discord | ntfy |
| --- | --- | --- |
| Weekly patch success | Rich embed + per-host fields + link buttons | |
| Orchestrator / playbook failure | Rich embed + @here + ARA/Grafana/Prometheus buttons | Rich (tags, click, actions) |
| Pre-reboot on a host | Rich embed + Grafana button | Rich (tags, click) |
| Prometheus patch alerts | Enhanced markdown + runbook links | Alertmanager template (critical only) |
| Converge validation | Rich embed probe on ansible-pull | Enriched curl probe |

Alerts (monitoring/prometheus/alerts/patching.yml):

| Alert | Severity | Condition |
| --- | --- | --- |
| OSPatchStale | warning | patch.prom not updated in 8 days |
| OSPatchOrchestrateFailed | critical | orchestrator exit code ≠ 0 |
| PatchRebootPending | warning | reboot-required flag set for 30m |
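
The OSPatchStale condition amounts to comparing the last-success timestamp with the current time. The sketch below reproduces that arithmetic in shell for ad-hoc checks; it is not the alert rule itself, and the function name is made up for illustration.

```shell
# Illustrative check: is the last success older than 8 days?
# Usage: patch_is_stale LAST_SUCCESS_EPOCH [NOW_EPOCH]
patch_is_stale() {
  local last="$1" now="${2:-$(date +%s)}"
  local max_age=$((8 * 24 * 60 * 60))   # 8 days expressed in seconds
  [ $((now - last)) -gt "$max_age" ]
}

if patch_is_stale 1700000000 1700700000; then
  echo "stale"
else
  echo "fresh"
fi
```

Passing the current time as an optional second argument keeps the function easy to exercise with fixed timestamps.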

Owner checklist

Complete in order. Check off as you go.

  • [ ] Subscribe to ntfy topic homelab-patch-critical-b4e9 (see above)
  • [ ] Push pass-1 to main if not already on origin
  • [ ] SOPS — Ansible notify (Discord webhook + ntfy topic):
cd infra/ansible/inventory/group_vars/patch_controller
cp notify.sops.yaml.example notify.sops.yaml
# Edit: patching_notify_discord_webhook_url = same URL as monitoring Discord webhook
#       patching_notify_ntfy_topic = homelab-patch-critical-b4e9  (already in example)
SOPS_AGE_KEY_FILE=/etc/homelab/age-key.txt sops -e -i notify.sops.yaml
git add notify.sops.yaml && git commit && git push
  • [ ] SOPS — Alertmanager (add to existing monitoring secrets):
NTFY_SERVER: "https://ntfy.sh"
NTFY_PATCH_TOPIC: "homelab-patch-critical-b4e9"

Re-decrypt .env on infra-services and reload alertmanager (below).

  • [ ] Converge all managed hosts (wait for ansible-pull or trigger on each):
      • infra-services, saltierpoop (ansible-pull); prox optional full bootstrap
  • [ ] Reload monitoring on infra-services
  • [ ] Publish patch-controller.pub and verify SSH from infra-services (sudo required; key is root-only):

sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes someone@192.168.6.243 true
sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes root@192.168.6.71 true
  • [ ] Dry-run patch (see Testing)
  • [ ] One live patch via systemd unit (not raw playbook)
  • [ ] Notification test matrix (see Testing)
  • [ ] Mark README Owner TODO rows for Phase 8 done

Prerequisites

  1. ansible-pull converging on infra-services (deploys patch-controller role)
  2. Patch-controller SSH from infra-services to saltierpoop (someone) and prox (root) — verified with the patch private key (see below). infra-services has no general outbound SSH to other hosts; one-time bootstrap uses an operator workstation (scripts/bootstrap-patch-ssh-from-operator.sh).
  3. Repo checkout at /var/lib/ansible-pull/homelab (or /opt/homelab)
  4. SOPS age key on infra-services at /etc/homelab/age-key.txt

After bootstrap, targets also trust the committed infra/ansible/inventory/files/patch-controller.pub via each host's ansible-pull (from="192.168.6.17"). Optional backup-fetch push to saltierpoop only runs when /etc/homelab/backup-fetch/id_ed25519 parses cleanly (ssh-keygen -y).
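
The from= restriction is a standard authorized_keys option that limits a key to connections from a given source address. A sketch of what such an entry looks like when built from a public key line; the helper name is hypothetical, and how the role actually renders the entry may differ.

```shell
# Hypothetical helper: prefix a public key line with a from="..." option
# so sshd only accepts that key from the given source address.
restricted_key_line() {
  local source_ip="$1" pubkey="$2"
  printf 'from="%s" %s\n' "$source_ip" "$pubkey"
}

# Example with a placeholder string standing in for the real public key:
restricted_key_line "192.168.6.17" "ssh-ed25519 PLACEHOLDER_KEY_BODY patch-controller"
```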


Deploy steps (infra-services)

1. Verify patch-controller landed

ssh someone@192.168.6.17
sudo systemctl list-timers homelab-patch-orchestrate.timer
sudo ls -la /etc/homelab/patch-controller/
sudo ls -la /etc/homelab/patching-notify.env   # after notify.sops.yaml committed + pull

1b. Patch SSH bootstrap (one-time, if not already done)

From a workstation with SSH to all hosts (not from infra-services as someone):

./scripts/bootstrap-patch-ssh-from-operator.sh

Or verify manually on infra-services:

sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes someone@192.168.6.243 true
sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes root@192.168.6.71 true

patch-controller.pub in git keeps new targets converged via ansible-pull.

Expected (from the step 1 commands):

  • Timer: Sun … weekly, timezone America/Los_Angeles
  • Private key: /etc/homelab/patch-controller/id_ed25519 (mode 600)
  • Notify env: mode 600, contains NTFY_PATCH_TOPIC=homelab-patch-critical-b4e9

2. Reload monitoring (after SOPS + push)

cd /opt/homelab/services/monitoring
SOPS_AGE_KEY_FILE=/etc/homelab/age-key.txt sops -d .env.sops.yaml \
  | sed 's/: /=/' > .env
docker compose up -d alertmanager prometheus grafana
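
The sed expression rewrites only the first ": " on each line into "=", which is what turns the decrypted YAML pairs into dotenv lines. The sketch below shows the effect on sample input, using the two values from the Owner checklist; it assumes the decrypted file is a flat list of key: value pairs, and the wrapper function name is only for illustration. Note that the quotes from the YAML survive into the output.

```shell
# Demonstrate the YAML-to-dotenv rewrite on sample lines (no sops needed).
yaml_to_env() {
  sed 's/: /=/'
}

printf '%s\n' \
  'NTFY_SERVER: "https://ntfy.sh"' \
  'NTFY_PATCH_TOPIC: "homelab-patch-critical-b4e9"' | yaml_to_env
```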

3. Confirm metrics paths (after converge)

On each managed host:

ls -la /var/lib/node_exporter/textfile_collector/patch.prom
ls -la /var/lib/node_exporter/textfile_collector/ansible_pull.prom

On infra-services only:

ls -la /var/lib/node_exporter/textfile_collector/patch_orchestrate.prom

Prometheus: https://prometheus.infra.realemail.app — query homelab_patch_last_success_unixtime.


Testing

Bootstrap patch SSH (one-time)

Diagnose backup-fetch first (error in libcrypto = unreadable key file, not “wrong host key”):

sudo ls -la /etc/homelab/backup-fetch/id_ed25519
sudo ssh-keygen -y -f /etc/homelab/backup-fetch/id_ed25519
# Must match infra/ansible/inventory/files/backup-fetch.pub

If ssh-keygen -y fails, fix or regenerate backup-fetch before relying on automated saltierpoop push. After ansible-pull redeploys the key, re-run ssh-keygen -y. See scripts/update-backup-sops.sh if SOPS and disk diverged.
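
Matching the derived key against the committed .pub file only needs the key type and the base64 body; the trailing comment may differ. A sketch of that comparison, assuming both inputs are plain "type body [comment]" lines with no leading options; the helper name is hypothetical.

```shell
# Hypothetical helper: succeed when two public key lines have the same
# key type and key body, ignoring any trailing comment.
same_pubkey() {
  local a b
  a="$(printf '%s\n' "$1" | awk 'NF >= 2 { print $1, $2 }')"
  b="$(printf '%s\n' "$2" | awk 'NF >= 2 { print $1, $2 }')"
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example (run from the repo root on infra-services; paths from this guide):
# derived="$(sudo ssh-keygen -y -f /etc/homelab/backup-fetch/id_ed25519)"
# committed="$(cat infra/ansible/inventory/files/backup-fetch.pub)"
# same_pubkey "$derived" "$committed" && echo "match" || echo "MISMATCH"
```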

saltierpoop (recommended — uses your normal someone SSH, not backup-fetch):

PUB=$(sudo cat /etc/homelab/patch-controller/id_ed25519.pub)
ssh someone@192.168.6.243 "mkdir -p ~/.ssh && chmod 700 ~/.ssh && (grep -qxF '$PUB' ~/.ssh/authorized_keys 2>/dev/null || echo '$PUB' >> ~/.ssh/authorized_keys) && chmod 600 ~/.ssh/authorized_keys"
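
The grep -qxF ... || echo ... >> idiom in the command above is what keeps the append idempotent: the key is added only when no identical line already exists. The same pattern, isolated as a local function for experimentation (the function name and file argument are illustrative):

```shell
# Illustrative: append a line to a file only if that exact line is absent.
append_once() {
  local line="$1" file="$2"
  # -q quiet, -x whole-line match, -F fixed string (no regex interpretation)
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

Calling append_once twice with the same arguments leaves a single copy of the line in the file, so re-running the bootstrap does not grow authorized_keys.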

prox — one-time password auth (patch private key is root-only):

sudo ssh-copy-id -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes root@192.168.6.71

Verify patch key (sudo required):

sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes someone@192.168.6.243 true
sudo ssh -i /etc/homelab/patch-controller/id_ed25519 -o IdentitiesOnly=yes root@192.168.6.71 true

Dry run (check mode)

Check mode does not send the Discord summary, does not reboot, and does not reliably update metrics. Use it only to preview apt changes.

cd /var/lib/ansible-pull/homelab/infra/ansible
sudo bash -c '
  export ANSIBLE_PRIVATE_KEY_FILE=/etc/homelab/patch-controller/id_ed25519
  export ANSIBLE_CONFIG=$PWD/ansible.cfg
  ansible-playbook -i inventory/generated.yml playbooks/patch.yml --check --diff
'

Pick a low-traffic window for the first live run (saltierpoop is wave 0).

Live patch (production path)

Always use the systemd unit so the wrapper records exit metrics and failure notifications:

sudo systemctl start homelab-patch-orchestrate.service
sudo journalctl -u homelab-patch-orchestrate.service -n 100 --no-pager
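
To illustrate why the unit matters, here is a rough sketch of what a wrapper of this kind typically does: run the command, capture its exit code, and publish it for node_exporter. This is not the contents of /usr/local/bin/homelab-patch-orchestrate; the function name and the run-time metric name are assumptions, and only homelab_patch_orchestrate_last_exit_code appears in this guide.

```shell
# Sketch of an exit-code-recording wrapper (not the real script).
# Usage: run_and_record PROM_FILE COMMAND [ARGS...]
run_and_record() {
  local prom="$1"; shift
  local rc=0 tmp
  "$@" || rc=$?
  # Write to a temp file next to the target, then rename, so that
  # node_exporter never reads a half-written file.
  tmp="$(mktemp "${prom}.XXXXXX")" || return 1
  {
    printf 'homelab_patch_orchestrate_last_exit_code %d\n' "$rc"
    # Assumed metric name for the run timestamp:
    printf 'homelab_patch_orchestrate_last_run_unixtime %d\n' "$(date +%s)"
  } > "$tmp"
  chmod 644 "$tmp"
  mv "$tmp" "$prom"
  return "$rc"
}
```

Running ansible-playbook directly skips this bookkeeping, which is why a raw playbook run leaves the exit-code metric unchanged.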

Runs appear in ARA: https://ara.infra.realemail.app or http://192.168.6.17:8000.

Notification test matrix (pass-1 acceptance)

| # | Test | How | Expected |
| --- | --- | --- | --- |
| 0 | ntfy enriched (automated) | bash scripts/test-patch-ntfy.sh --dry-run (CI) or --live on infra-services | Dry-run passes; live sends 3 TEST pushes with tags/click/actions |
| 0b | Discord rich (automated) | bash scripts/test-patch-discord.sh --dry-run (CI) or --live on infra-services | Four TEST embeds with colors + link buttons |
| 1 | ntfy subscribe | curl test above or --live script | Phone notification |
| 2 | Successful live patch | systemctl start homelab-patch-orchestrate.service | Discord summary with lines like saltierpoop: patched; no ntfy |
| 3 | Metrics fresh | Prometheus / Grafana | patch.prom, patch_orchestrate.prom updated; exit code 0 |
| 4 | No false stale | Prometheus alerts | OSPatchStale not firing |
| 5 | MAINTENANCE skip | sudo touch /etc/homelab/MAINTENANCE on saltierpoop; run patch; remove file | Summary shows saltierpoop: skipped_maintenance; others patch |
| 6 | Orchestrator failure | Temporarily break SSH to prox or edit wrapper to use bad inventory; run unit; revert | Discord + ntfy critical (skull/patch tags, ARA click); homelab_patch_orchestrate_last_exit_code != 0 |
| 7 | Pre-reboot ntfy | Only if a kernel update leaves /var/run/reboot-required | ntfy before reboot (rotating_light tag, Grafana click) |

Discord summary example

Success posts a green embed with inline fields per host (not plain text):

✅ Weekly OS Patch Complete
Coordinated patching finished across all waves. · 2 patched · 1 rebooted

saltierpoop          prox                 infra-services
✅ Patched            🔄 Rebooted          ✅ Patched

Buttons: 📋 ARA · 📊 Grafana · 📖 Runbook

With MAINTENANCE on one host:

saltierpoop: skipped_maintenance
prox: patched
infra-services: patched

Operations

Pause patching on one host

sudo touch /etc/homelab/MAINTENANCE   # on target
sudo rm /etc/homelab/MAINTENANCE     # when done

Also pauses ansible-pull on that host.

Pause all patching (C&C)

On infra-services only:

sudo touch /etc/homelab/MAINTENANCE

The orchestrator unit will not start while the flag exists (ExecCondition). Running the playbook manually is still possible.
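
ExecCondition= is a systemd directive: when the condition command exits with a status from 1 to 254, the unit is skipped without being marked failed. A condition for this flag can be as simple as a file test. The sketch below shows equivalent logic as a shell function; the function name and the optional path argument are hypothetical, and the unit's actual ExecCondition line may be written differently.

```shell
# Hypothetical gate: exit status 0 means "go ahead", 1 means "skip".
maintenance_gate() {
  local flag="${1:-/etc/homelab/MAINTENANCE}"
  if [ -e "$flag" ]; then
    echo "MAINTENANCE flag present at $flag; skipping" >&2
    return 1
  fi
  return 0
}

if maintenance_gate; then
  echo "patching may proceed"
fi
```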

Change schedule

Edit patch_orchestrate_calendar / patch_orchestrate_timezone in roles/patch-controller/defaults/main.yml, commit, wait for ansible-pull.

Default: Sunday 04:00 Pacific, RandomizedDelaySec=15min.

Change upgrade policy

| Goal | Variable | Value |
| --- | --- | --- |
| Security updates only (default) | patching_upgrade_type | security |
| Full dist-upgrade | patching_upgrade_type | dist |
| Never auto-reboot | patching_reboot | never |
| Reboot if required | patching_reboot | auto_if_required |

Set in group_vars/patching_targets.yml or per-host host_vars/.

Onboard a new Linux host

  1. Set ansible.managed: true in inventory; add the host to patching_targets + patching_waveN
  2. Run uv run python inventory/generators/render-ansible.py and commit the result
  3. Bootstrap with site.yml (ansible-pull)
  4. Converge infra-services (patch-controller key authorize play)
  5. Add wave play in playbooks/patch.yml if you created a new wave group

Redistribute patch-controller SSH key

If infra-services was rebuilt:

cd /var/lib/ansible-pull/homelab/infra/ansible
export ANSIBLE_CONFIG=$PWD/ansible.cfg
SOPS_AGE_KEY_FILE=/etc/homelab/age-key.txt \
sudo -E ansible-playbook -i inventory/generated.yml playbooks/site.yml \
  --tags patch-controller \
  --limit infra-services \
  -e ansible_connection=local

prox bootstrap (optional)

prox is patchable as root without full ansible-pull:

ansible-playbook -i infra/ansible/inventory/generated.yml \
  infra/ansible/playbooks/site.yml \
  --tags common,ansible-pull --limit prox -e ansible_user=root

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Permission denied (publickey) to prox | patch-controller pubkey not on root | Commit patch-controller.pub + ansible-pull on prox, or sudo ssh-copy-id -i /etc/homelab/patch-controller/id_ed25519 root@192.168.6.71 |
| Load key ... error in libcrypto (backup-fetch) | Corrupt private key on disk or bad SOPS deploy | sudo ssh-keygen -y -f /etc/homelab/backup-fetch/id_ed25519; fix SOPS + ansible-pull, or bootstrap patch SSH via normal someone SSH |
| Identity file ... not accessible: Permission denied | Manual test as someone | Patch key is root-only; prefix commands with sudo |
| Patch playbook UNREACHABLE on infra-services | C&C play used SSH with patch key | Pull latest main; C&C plays use ansible_connection: local |
| Patch skipped on host | MAINTENANCE flag | test -f /etc/homelab/MAINTENANCE |
| No Discord summary | Missing / invalid notify.sops.yaml or webhook | Check /etc/homelab/patching-notify.env; verify SOPS committed |
| No ntfy on failure | Topic not in env or not subscribed | grep NTFY_PATCH_TOPIC in notify env + monitoring .env; test curl |
| OSPatchStale firing | Patch hasn't run in 8+ days | Check timer; run manual patch |
| OSPatchOrchestrateFailed | Last wrapper exit ≠ 0 | journalctl -u homelab-patch-orchestrate; fix playbook error; re-run |
| PatchRebootPending | Reboot required but not done | SSH to host; sudo reboot or investigate hung reboot |
| AnsiblePullStale | ansible-pull not applying | systemctl status ansible-pull-apply.timer; check git pull / SSH URL |
| No patch.prom in Prometheus | node_exporter textfile not enabled | ansible-pull converge common role on host |
| saltierpoop media blip | Wave 0 runs first | Schedule is off-peak Sun 04:00; manual runs: pick quiet time |
| prox reboot concern | Hypervisor restart | Safe upgrade rarely requires reboot; ntfy warns if it does |
| Raw playbook vs unit | Ran ansible-playbook directly | Use systemctl start homelab-patch-orchestrate.service for metrics + failure notify |

Useful commands

# Orchestrator logs
sudo journalctl -u homelab-patch-orchestrate.service -n 200 --no-pager

# Timer next run
systemctl list-timers homelab-patch-orchestrate.timer

# Last orchestrator metrics
cat /var/lib/node_exporter/textfile_collector/patch_orchestrate.prom

# Test Discord (four rich TEST embeds; notify env is root-only)
sudo bash -c 'set -a; source /etc/homelab/patching-notify.env; set +a; bash /var/lib/ansible-pull/homelab/scripts/test-patch-discord.sh --live'

# Test ntfy (basic)
curl -d "test" -H "Title: Patch" \
  "https://ntfy.sh/homelab-patch-critical-b4e9"

# Test enriched ntfy (three TEST notifications; notify env is root-only)
sudo bash -c 'set -a; source /etc/homelab/patching-notify.env; set +a; bash /var/lib/ansible-pull/homelab/scripts/test-patch-ntfy.sh --live'

Out of scope (pass-1)

  • Saltbox container image updates (Saltbox / Komodo)
  • Loki/journald shipping for orchestrator logs (Phase 9 syslog cohort)
  • Per-wave Discord messages (summary-only design)
  • Appliances (UDM, DSM, HAOS) — vendor patching