Saltbox monitoring → infra-services migration policy
Owner policy (2026-06): Migration is acceptable only if all data moves. After cutover, Saltbox must read from and write to infra-services, using inventory overrides wherever Saltbox’s defaults assume local containers. No compromise is acceptable that breaks Saltbox as the management plane for saltierpoop.
This is not “retire Saltbox monitoring because it doesn’t belong on saltierpoop.”
Those containers are com.github.saltbox.saltbox_managed=true roles. The question
is whether one observability backend (infra-services) replaces two, while
Saltbox keeps working.
Non-negotiables
| # | Requirement |
|---|---|
| 1 | Full data migration — TSDB, Grafana DB/dashboards/datasources, Loki chunks, Uptime Kuma state. No silent loss; archive + verify before sb remove. |
| 2 | Saltbox remains authoritative on saltierpoop — changes via sb edit inventory / sb install, not homelab services/ on that host. |
| 3 | Saltbox R/W after cutover — scrape targets, log push, and any role config that today hits local prometheus / grafana / loki must aim at infra-services (192.168.6.17 or *.infra.realemail.app) via inventory overrides. User-facing monitoring URLs are infra only (see below). |
| 4 | Homelab owns infra-services stack — services/monitoring/ is canonical for ops C&C metrics/logs/alerts (*.infra.realemail.app). |
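Requirement 1 asks for "archive + verify" before any sb remove. One way to make the verify step concrete is a checksum manifest of each data directory, compared against the archived or restored copy. The sketch below is only an illustration: the function names are hypothetical, and it assumes the source container is stopped first so files (especially TSDB blocks and SQLite databases) do not change between the two passes.

```python
import hashlib
from pathlib import Path


def tree_manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    base = Path(root)
    manifest = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large Loki/TSDB files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(base).as_posix()] = digest.hexdigest()
    return manifest


def diff_manifests(source: dict[str, str], copy: dict[str, str]) -> dict[str, list[str]]:
    """Report files missing from the copy, unexpected in it, or with different content."""
    return {
        "missing": sorted(set(source) - set(copy)),
        "extra": sorted(set(copy) - set(source)),
        "changed": sorted(k for k in source.keys() & copy.keys() if source[k] != copy[k]),
    }
```

Running tree_manifest on a source path such as /opt/prometheus/data and again on the extracted archive, then requiring all three lists from diff_manifests to be empty, gives a simple "no silent loss" check per role.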
Current data (saltierpoop, 2026-06-24)
| Role | Path | Size (approx) | Notes |
|---|---|---|---|
| Prometheus | /opt/prometheus/data | ~559 MB | 15d retention |
| Grafana | /opt/grafana | ~91 MB | Includes provisioning + DB |
| Loki | /opt/loki/loki | ~3.6 GB | Largest; plan downtime or parallel ingest |
| Loki Alloy | /opt/loki/alloy | small | Config at /opt/loki/config/alloy-config.river |
| Uptime Kuma | /opt/uptime | ~197 MB | SQLite under /app/data |
Config overlays already in Saltbox inventory (localhost.yml): e.g. extra Grafana
dashboard bind-mounts under grafana_role_docker_volumes_custom.
Target architecture after migration
saltierpoop (Saltbox VM)
  Traefik *.realemail.app ── media / auth / *arr only (no monitoring UI proxy)
  Alloy / exporters / scrapes ──LAN──► 192.168.6.17:9090 / :3100
  (no local prom/grafana/loki/uptime containers after cutover)

infra-services
  services/monitoring/ ◄── canonical TSDB + Grafana + Loki + Alertmanager
  *.infra.realemail.app ◄── sole user-facing monitoring URLs (owner decision)
URL policy (owner decision)
Standardize on *.infra.realemail.app only — do not keep Saltbox Traefik routes
for prometheus.realemail.app, grafana.realemail.app, netdata.realemail.app, etc.
| Before (Saltbox) | After (canonical) |
|---|---|
| https://grafana.realemail.app | https://grafana.infra.realemail.app |
| https://prometheus.realemail.app | https://prometheus.infra.realemail.app |
| https://uptime.realemail.app | https://uptime.infra.realemail.app (once deployed) |
After cutover: sb remove monitoring roles, drop Cloudflare/DNS entries for the old
*.realemail.app monitoring hostnames if nothing else needs them. Bookmarks move to
infra hostnames.
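Because the hostname change is mechanical (insert an infra label before the base domain), bookmarks and documentation links can be rewritten with a small helper. The following is a sketch, not an existing tool: to_infra_url and MONITORING_LABELS are hypothetical names, and the label set contains only the three hostnames listed in the table above.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these monitoring hostnames are rewritten; everything else is left alone.
MONITORING_LABELS = {"grafana", "prometheus", "uptime"}


def to_infra_url(url: str, base_domain: str = "realemail.app") -> str:
    """Rewrite <label>.<base_domain> to <label>.infra.<base_domain> for monitoring hosts."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    suffix = "." + base_domain
    label = host[: -len(suffix)] if host.endswith(suffix) else None
    if label not in MONITORING_LABELS:
        return url  # media, auth, *arr, and already-migrated URLs stay untouched
    netloc = f"{label}.infra{suffix}"
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, to_infra_url("https://grafana.realemail.app/d/abc?orgId=1") yields the same path and query on grafana.infra.realemail.app, while a non-monitoring URL is returned unchanged.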
Authentik (required): Migrated UIs at *.infra.realemail.app must be
Authentik-protected via infra outpost + Traefik forward auth before cutover sign-off.
See ADR-002 and authentik-cross-host-sso.md. Admin-password-only
Grafana is not acceptable as an end state.
Internal Docker DNS: Saltbox docs assume http://prometheus:9090 on the saltbox
network. After removing local containers, inventory must repoint Alloy and any remaining
consumers to http://192.168.6.17:9090 / http://192.168.6.17:3100 (LAN) or
HTTPS infra URLs — verified before destroy.
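A quick way to support the "verified before destroy" step is to scan rendered config (for example the Alloy config or a dump of the inventory) for URLs that still use the Docker service names. This is a minimal sketch under assumptions: find_local_endpoints is a hypothetical helper, and the list of service names covers only prometheus, grafana, and loki.

```python
import re

# Matches http(s)://prometheus, //grafana, or //loki with an optional port, but not
# fully qualified names such as prometheus.infra.realemail.app.
LOCAL_SERVICE_URL = re.compile(r"https?://(?:prometheus|grafana|loki)(?::\d+)?(?![\w.-])")


def find_local_endpoints(text: str) -> list[tuple[int, str]]:
    """Return (line number, URL) pairs for URLs that still rely on Docker DNS names."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in LOCAL_SERVICE_URL.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

An empty result for every config file that remains on saltierpoop is a reasonable precondition for removing the local containers; any hit is a consumer that still needs an inventory override.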
Per-role migration gates
Prometheus
| Step | Action |
|---|---|
| Export | Copy TSDB (/opt/prometheus/data) + merge /opt/prometheus/prometheus.yml scrape jobs into monitoring/prometheus/ (Saltbox cAdvisor/node targets must remain scraped from infra-services). |
| Import | Stop infra-services Prometheus; restore TSDB into prometheus-data volume; reconcile config; start; verify series count. |
| Saltbox | sb remove prometheus; disable DNS for prometheus.realemail.app if present; inventory overrides for any role still scraping/querying locally → http://192.168.6.17:9090. |
| Gate | Query parity on critical metrics at prometheus.infra.realemail.app; 15d history visible in Grafana. |
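The "query parity" gate can be approximated by running the same instant query against the old and new Prometheus and listing series that exist only on the old side. The sketch below uses the standard /api/v1/query endpoint; the helper names are hypothetical, and ignoring the instance label is an assumption in case target addresses change during the move.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query(base_url: str, promql: str, timeout: float = 10.0) -> list[dict]:
    """Run an instant query via the Prometheus HTTP API and return the result list."""
    url = f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


def series_keys(result: list[dict], ignore: tuple[str, ...] = ()) -> set[frozenset]:
    """Reduce each sample to its label set, dropping labels listed in ignore."""
    return {
        frozenset((k, v) for k, v in sample["metric"].items() if k not in ignore)
        for sample in result
    }


def missing_series(old: list[dict], new: list[dict], ignore: tuple[str, ...] = ()) -> set[frozenset]:
    """Label sets present in the old backend's result but absent from the new one."""
    return series_keys(old, ignore) - series_keys(new, ignore)
```

During the parallel run, something like missing_series(instant_query(old_url, "up"), instant_query("http://192.168.6.17:9090", "up"), ignore=("instance",)) should come back empty for each critical metric, where old_url is whatever address the Saltbox Prometheus is reachable on.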
Grafana
| Step | Action |
|---|---|
| Export | /opt/grafana (DB + provisioning) or dashboard JSON exported through the Grafana HTTP API; merge with homelab file provisioning in monitoring/grafana/. |
| Import | Restore into infra-services grafana-data volume; reconcile datasources to infra Prometheus/Loki. |
| Saltbox | sb remove grafana; users/bookmarks → grafana.infra.realemail.app; migrate Saltbox-origin dashboard provisioning into homelab monitoring/grafana/ before remove. |
| Gate | All migrated dashboards render at infra URL; Authentik SSO per ADR-002. |
Loki + Alloy
| Step | Action |
|---|---|
| Export | /opt/loki/loki (~3.6 GB) — plan maintenance window; Loki version/compatibility check (Saltbox 2.9.x vs infra 3.5.x may require export/import tooling, not raw copy). |
| Import | Extend infra-services Loki retention/volume; migrate or rehydrate from archive. |
| Saltbox | Alloy must keep shipping — inventory override on loki-alloy (or successor) to push to http://192.168.6.17:3100 (or authenticated URL). Do not leave saltierpoop logs orphaned. |
| Gate | Historical log queries return pre-cutover data; new container logs appear in infra Loki within 5m. |
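The "within 5m" half of the Loki gate can be checked by asking infra Loki for the most recent entries of a saltierpoop stream and measuring how old the newest one is. This is a sketch against the standard /loki/api/v1/query_range endpoint; the helper names are hypothetical, and the stream selector (for example a host label) is an assumption that depends on how Alloy labels the logs.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_range_url(base_url: str, logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a query_range URL; Loki accepts nanosecond Unix timestamps for start/end."""
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit, "direction": "backward"}
    )
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


def fetch(url: str, timeout: float = 10.0) -> dict:
    """GET a Loki API URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def newest_entry_age_seconds(payload: dict, now_ns: int):
    """Age of the newest log line in a streams response, or None if nothing matched."""
    timestamps = [int(ts) for stream in payload["data"]["result"] for ts, _line in stream["values"]]
    if not timestamps:
        return None
    return (now_ns - max(timestamps)) / 1e9


def within_gate(payload: dict, now_ns: int, max_age_seconds: float = 300) -> bool:
    """True when at least one entry exists and the newest is at most max_age_seconds old."""
    age = newest_entry_age_seconds(payload, now_ns)
    return age is not None and age <= max_age_seconds
```

The same URL builder, with start and end set to a window before cutover, covers the historical half of the gate: a non-empty result shows that pre-cutover data survived the migration.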
Uptime Kuma
| Step | Action |
|---|---|
| Export | /opt/uptime SQLite + settings. |
| Import | Deploy services/uptime-kuma/ on infra-services (not yet in repo); restore DB. |
| Saltbox | Remove local role; update any bookmarks/monitors; status page URL decision. |
| Gate | All monitors green; notification channels intact. |
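Before trusting the restored Uptime Kuma database, it may help to run SQLite's own integrity check and compare per-table row counts between the original and the restored file. The sketch below is generic on purpose: it makes no assumption about Uptime Kuma's schema or database filename, table_row_counts is a hypothetical helper, and it assumes the container is stopped so the file is not mid-write.

```python
import sqlite3


def table_row_counts(db_path: str) -> dict[str, int]:
    """Open a SQLite file read-only, verify integrity, and count rows per user table."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        status = conn.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            raise RuntimeError(f"integrity_check failed for {db_path}: {status}")
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
            )
        ]
        # Table names come from sqlite_master itself, so quoting them is sufficient here.
        return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0] for name in tables}
    finally:
        conn.close()
```

Equal dictionaries from the saltierpoop copy and the infra-services copy indicate that monitors, notification settings, and history rows all arrived; the "all monitors green" gate still has to be confirmed in the UI.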
Netdata / Glances / Jaeger
Lower priority. Netdata/Glances overlap infra node_exporter scrape. Jaeger likely
unused. Migrate only if you actively use them; otherwise sb remove after confirming
no dependency.
Saltbox inventory (where overrides live)
- Path: /srv/git/saltbox/inventories/host_vars/localhost.yml on saltierpoop
- Edit: sb edit inventory (not homelab git; homelab only ships accounts.yml / settings.yml via SOPS)
- Patterns: role-scoped vars with _role_ infix; _custom merges, not _default replaces (inventory docs)
- Examples to research per role before cutover:
  - Alloy / Loki push URL → http://192.168.6.17:3100
  - Any scrape or API URL still pointing at docker service names → LAN infra endpoints
- *_role_dns_enabled: false before remove if Saltbox manages CF records for old hostnames
- sb remove <role> only after gates pass (not homelab compose)
Document every override in this file or a saltierpoop runbook appendix when executed.
Network / firewall
infra-services (192.168.6.17) must accept from saltierpoop (192.168.6.243):
- Prometheus 9090 (scrape + query from Alloy/exporters)
- Loki 3100 (push/query from Alloy)

Grafana 3000 is browser-facing on infra Traefik only; saltierpoop does not proxy it.
Confirm UDM ZBF Servers→Servers before cutover.
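The firewall requirement can be spot-checked from saltierpoop with a plain TCP connect test against the two ports. This is a minimal sketch with hypothetical helper names; it only proves that a TCP handshake completes from the machine it runs on, not that Prometheus or Loki answer correctly.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_endpoints(endpoints: list[tuple[str, int]], timeout: float = 3.0) -> dict[str, bool]:
    """Check several (host, port) pairs and key the results by 'host:port'."""
    return {f"{host}:{port}": port_open(host, port, timeout) for host, port in endpoints}
```

Run on saltierpoop, check_endpoints([("192.168.6.17", 9090), ("192.168.6.17", 3100)]) should report True for both entries once the Servers→Servers rule allows the traffic.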
Execution order (when approved)
- Parallel run: infra-services stack receives same scrapes + log streams; compare dashboards.
- Data migration: per table above; archive saltierpoop paths to infra-backups before delete.
- Inventory overrides: point Saltbox at infra-services; sb install affected roles.
- Verify Saltbox R/W: Alloy push, scrape parity, *arr health (no dependency on local prom/grafana).
- sb remove local prometheus, grafana, loki, uptime (only after owner sign-off).
- Update homelab docs: inventory notes, decom-old-monitoring.md, journal entry.
- Authentik: infra outpost live; Grafana/Prometheus/Uptime routers on forward-auth.
Out of scope
- Moving Plex, *arr, Traefik, Authentik off saltierpoop
- Managing Saltbox localhost.yml from homelab git (unless owner adds that workflow later)
- Partial migration (“infra gets metrics but Loki history stays on saltierpoop”): violates owner policy #1