Coordinated OS Patching

Phase 8 delivers the original homelab goal: one command-and-control (C&C) host, `infra-services`, applies Linux package updates to all Ansible-managed hosts on a defined schedule.

Why separate from ansible-pull?

| Mechanism | Purpose | Cadence |
| --- | --- | --- |
| `ansible-pull` | Git config convergence (roles, secrets, timers) | Every 30 min |
| `homelab-patch-orchestrate` | OS package upgrades (`apt`) | Weekly (Sun 04:00 PT) |

Mixing `apt upgrade` into every pull would be noisy, hard to schedule around media workloads, and risky during incident response. Patching is therefore a separate, deliberate, wave-ordered run.
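
As an illustration of the weekly cadence, the tasks below sketch how a role could install and enable the timer. This is not taken from `roles/patch-controller/`: the unit file path, the description text, and the mapping of "PT" to `America/Los_Angeles` are assumptions.

```yaml
# Sketch only: unit path and timezone name are assumptions.
- name: Install the weekly patch timer
  ansible.builtin.copy:
    dest: /etc/systemd/system/homelab-patch-orchestrate.timer
    mode: "0644"
    content: |
      [Unit]
      Description=Weekly coordinated OS patching

      [Timer]
      # Assumption: "PT" corresponds to the America/Los_Angeles zone.
      OnCalendar=Sun *-*-* 04:00:00 America/Los_Angeles

      [Install]
      WantedBy=timers.target

- name: Enable and start the timer
  ansible.builtin.systemd:
    name: homelab-patch-orchestrate.timer
    enabled: true
    state: started
    daemon_reload: true
```

A timer without an explicit `Unit=` setting activates the service of the same name, here `homelab-patch-orchestrate.service`.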

Architecture

```mermaid
flowchart LR
  IS[infra-services C&C]
  SP[saltierpoop wave0]
  PX[prox wave1]
  IS2[infra-services wave2]

  IS -->|push SSH| SP
  IS -->|push SSH root| PX
  IS -->|local| IS2
```
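
The wave order in the diagram could be expressed as three consecutive plays. The following is a minimal sketch of what `playbooks/patch.yml` might look like, not its actual contents: the play names, `become`, and `any_errors_fatal` settings are assumptions.

```yaml
# Sketch of playbooks/patch.yml; play options are assumptions.
- name: Wave 0 (saltierpoop)
  hosts: patching_wave0
  become: true
  # Assumption: abort the run so later waves are not patched after a failure.
  any_errors_fatal: true
  roles:
    - patching

- name: Wave 1 (prox)
  hosts: patching_wave1
  become: true
  any_errors_fatal: true
  roles:
    - patching

- name: Wave 2 (infra-services, the controller patches itself last)
  hosts: patching_wave2
  become: true
  roles:
    - patching
```

Plays run in file order, so a failure in an earlier wave with `any_errors_fatal: true` stops the run before the controller touches itself.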

Components

| Piece | Location / purpose |
| --- | --- |
| `roles/patch-controller/` | infra-services only: SSH key, systemd timer |
| `roles/patching/` | All `patching_targets`: apt, reboot, metrics |
| `playbooks/patch.yml` | Invoked by timer; three wave plays |
| `homelab-patch-orchestrate.timer` | Weekly schedule on infra-services |
| `patch.prom` | node_exporter textfile on each target |
| `patch_orchestrate.prom` | infra-services: orchestrator exit code + last run time |
| `ansible_pull.prom` | Each host: last successful ansible-pull apply |
| `/etc/homelab/patching-notify.env` | Discord + ntfy vars for wrapper and playbook |
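
To make the "apt, reboot" part of `roles/patching/` concrete, here is a hedged sketch of what such tasks could look like. It is not the role's actual task file: the mapping from `patching_upgrade_type` to the apt module's `upgrade` value and the registered variable name are assumptions.

```yaml
# Sketch of roles/patching tasks; not the actual role.
- name: Remove unattended-upgrades so only the orchestrator upgrades packages
  ansible.builtin.apt:
    name: unattended-upgrades
    state: absent

- name: Upgrade installed packages
  ansible.builtin.apt:
    update_cache: true
    # Assumed mapping: "dist" runs a dist-upgrade, anything else a safe upgrade.
    upgrade: "{{ 'dist' if patching_upgrade_type == 'dist' else 'safe' }}"

- name: Check whether the upgrade requested a reboot
  ansible.builtin.stat:
    path: /var/run/reboot-required
  register: patching_reboot_flag  # hypothetical variable name

- name: Reboot when required and allowed by policy
  ansible.builtin.reboot:
  when:
    - patching_reboot == 'auto_if_required'
    - patching_reboot_flag.stat.exists
```

Note that `ansible.builtin.reboot` refuses to run over a local connection, so the wave 2 self-patch of infra-services would need a different reboot mechanism; the source does not say which one is used.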

Inventory groups

| Group | Members (today) |
| --- | --- |
| `patch_controller` | `infra-services` |
| `patching_targets` | `infra-services`, `prox`, `saltierpoop` |
| `patching_wave0` | `saltierpoop` |
| `patching_wave1` | `prox` |
| `patching_wave2` | `infra-services` (C&C patches itself last) |

To onboard a new Ansible-managed Linux host, add it to `patching_targets` and to one wave group in the inventory, then re-run `render-ansible.py`.
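
For orientation, the groups above would look roughly like this in Ansible's YAML inventory format. The real inventory is generated by `render-ansible.py`, so the layout shown here is an assumption; only the group and host names come from the table.

```yaml
# Sketch of the rendered inventory; the actual file layout is an assumption.
all:
  children:
    patch_controller:
      hosts:
        infra-services:
    patching_targets:
      hosts:
        infra-services:
        prox:
        saltierpoop:
    patching_wave0:
      hosts:
        saltierpoop:
    patching_wave1:
      hosts:
        prox:
    patching_wave2:
      hosts:
        infra-services:
```

A new host would appear twice: once under `patching_targets` and once under the wave group that determines when it is patched.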

Policy defaults

Defined in `infra/ansible/inventory/group_vars/patching_targets.yml`:

- `patching_upgrade_type: security` runs an apt safe upgrade (installed packages only)
- `patching_reboot: auto_if_required` reboots the host when `/var/run/reboot-required` exists
- The `unattended-upgrades` package is removed on targets

Override per host in `host_vars/` (e.g. `patching_upgrade_type: dist` before an LTS bump).
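
As a sketch of how the defaults and a per-host override fit together, the two YAML documents below show the group-level values named above and a temporary override. The host file name `example-host.yml` is hypothetical.

```yaml
---
# infra/ansible/inventory/group_vars/patching_targets.yml
patching_upgrade_type: security
patching_reboot: auto_if_required
---
# host_vars/example-host.yml (hypothetical host, shown for illustration only)
# Host variables take precedence over group variables, so this host alone
# would use the "dist" upgrade type until the line is removed again.
patching_upgrade_type: dist
```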

SSH access

| Target | User | Notes |
| --- | --- | --- |
| saltierpoop | `someone` | NOPASSWD sudo via `common` role |
| prox | `root` | `host_vars/prox.yml` |
| infra-services (wave 2) | local | `ansible_connection: local` |

The `patch-controller` role generates an SSH key at `/etc/homelab/patch-controller/id_ed25519` on `infra-services`. `site.yml` distributes the public key to the other targets when `ansible-pull` converges on `infra-services`.
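
The connection settings in the table could be expressed with standard Ansible connection variables, roughly as below. Only `host_vars/prox.yml` and `ansible_connection: local` are named in the source; the other file locations, and where the private key path is set, are assumptions.

```yaml
---
# host_vars/saltierpoop.yml (assumed location)
ansible_user: someone
---
# host_vars/prox.yml
ansible_user: root
---
# host_vars/infra-services.yml (assumed location)
ansible_connection: local
---
# group_vars/patching_targets.yml (assumed location for this setting)
ansible_ssh_private_key_file: /etc/homelab/patch-controller/id_ed25519
```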

Maintenance hold

While `/etc/homelab/MAINTENANCE` exists on a host, it:

- Skips `ansible-pull` (existing behavior)
- Skips the `homelab-patch-orchestrate` service (`ExecCondition`)
- Skips that host inside `roles/patching/` (`meta: end_host`)
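
The per-host skip can be sketched as two tasks at the top of the role. This is an illustration of the `meta: end_host` pattern, not the role's actual code; the registered variable name is hypothetical.

```yaml
# Sketch: end the play for this host only while the hold file exists.
- name: Check for a maintenance hold
  ansible.builtin.stat:
    path: /etc/homelab/MAINTENANCE
  register: patching_maintenance_flag  # hypothetical variable name

- name: Skip this host while the hold file exists
  ansible.builtin.meta: end_host
  when: patching_maintenance_flag.stat.exists
```

Other hosts in the same wave continue normally. On the service side, a line such as `ExecCondition=/usr/bin/test ! -e /etc/homelab/MAINTENANCE` would produce the skip, since systemd treats a condition exit code of 1 to 254 as "skip the unit"; the exact command used is not given here.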

Observability and notifications

```mermaid
flowchart TB
  subgraph orchestrator [infra-services]
    Timer[homelab-patch-orchestrate.timer]
    Wrapper[homelab-patch-orchestrate.sh]
    PB[patch.yml]
    Timer --> Wrapper --> PB
  end
  subgraph targets [patching_targets]
    SP[saltierpoop]
    PX[prox]
    IS[infra-services wave2]
  end
  PB --> SP
  PB --> PX
  PB --> IS
  Wrapper -->|failure| NtfyF[ntfy critical]
  Wrapper -->|failure| DiscordF[Discord]
  PB -->|success summary| DiscordS[Discord]
  SP -->|pre-reboot| NtfyR[ntfy critical]
  PX -->|pre-reboot| NtfyR
  IS -->|pre-reboot| NtfyR
  SP --> Prom[patch.prom]
  PX --> Prom
  IS --> Prom
  Wrapper --> OrchProm[patch_orchestrate.prom]
  Prom --> Alertmanager
  OrchProm --> Alertmanager
  Alertmanager --> DiscordA[Discord all severities]
  Alertmanager -->|critical patching| NtfyA[ntfy]
```

| Signal | Where |
| --- | --- |
| ARA | Every orchestrator and playbook run |
| Prometheus | `homelab_patch_last_success_unixtime`, `homelab_patch_reboot_required`, `homelab_patch_orchestrate_last_exit_code` |
| Grafana | `monitoring/grafana/dashboards/patching.json` |
| Discord | Rich embeds + link buttons (success, failure, pre-reboot, validation); Alertmanager markdown with runbook links |
| ntfy | Orchestrator failure + pre-reboot (Ansible/wrapper); critical patch alerts (Alertmanager) |
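
To show how the per-target metrics could reach Prometheus, the tasks below sketch writing `patch.prom` for the node_exporter textfile collector. The collector directory, the help strings, and the registered variable name are assumptions; the metric names are the ones listed above.

```yaml
# Sketch only: collector directory and help text are assumptions.
- name: Check whether a reboot is still pending
  ansible.builtin.stat:
    path: /var/run/reboot-required
  register: patching_reboot_flag  # hypothetical variable name

- name: Write patch metrics for the node_exporter textfile collector
  ansible.builtin.copy:
    dest: /var/lib/node_exporter/textfile_collector/patch.prom
    mode: "0644"
    # Lines starting with "#" below are part of the file, not YAML comments.
    content: |
      # HELP homelab_patch_last_success_unixtime Unix time of the last successful patch run.
      # TYPE homelab_patch_last_success_unixtime gauge
      homelab_patch_last_success_unixtime {{ ansible_date_time.epoch }}
      # HELP homelab_patch_reboot_required 1 if the host is waiting for a reboot.
      # TYPE homelab_patch_reboot_required gauge
      homelab_patch_reboot_required {{ 1 if patching_reboot_flag.stat.exists else 0 }}
```

`ansible_date_time.epoch` is only available when facts have been gathered, and it reflects the time of fact gathering rather than the moment the file is written.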

Alerts (`monitoring/prometheus/alerts/patching.yml`):

| Alert | Severity | When |
| --- | --- | --- |
| `OSPatchStale` | warning | `patch.prom` older than 8 days |
| `OSPatchOrchestrateFailed` | critical | orchestrator exit code ≠ 0 |
| `PatchRebootPending` | warning | `homelab_patch_reboot_required == 1` for 30m |
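
The three alerts could be written as Prometheus rules along these lines. The alert names, severities, and the 30m hold come from the table; the group name and the exact expressions are assumptions, in particular the use of `homelab_patch_last_success_unixtime` to measure staleness.

```yaml
# Sketch of monitoring/prometheus/alerts/patching.yml; expressions are assumptions.
groups:
  - name: patching
    rules:
      - alert: OSPatchStale
        # Assumed: staleness is measured from the last-success timestamp.
        expr: time() - homelab_patch_last_success_unixtime > 8 * 86400
        labels:
          severity: warning
      - alert: OSPatchOrchestrateFailed
        expr: homelab_patch_orchestrate_last_exit_code != 0
        labels:
          severity: critical
      - alert: PatchRebootPending
        expr: homelab_patch_reboot_required == 1
        for: 30m
        labels:
          severity: warning
```

The 8-day threshold leaves one day of slack on top of the weekly schedule before a host is reported as stale.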

Secrets:

- Ansible: `notify.sops.yaml.example`
- Alertmanager: `.env.sops.yaml.example` (`NTFY_PATCH_TOPIC`)

Topic: homelab-patch-critical-b4e9

Out of scope

- Saltbox containers: only the OS layer is patched on `saltierpoop`; use Saltbox/Komodo for image updates
- Proxmox guests not in inventory: patched once onboarded to Ansible
- Appliances (UDM, Synology DSM, HAOS): updated via vendor UI / firmware
- `customer-app` class: has its own lifecycle