Saltbox monitoring → infra-services migration policy

Owner policy (2026-06): Migration is acceptable only if all data moves. After cutover, Saltbox must read from and write to infra-services, using inventory overrides wherever Saltbox’s defaults assume local containers. No compromise is acceptable that breaks Saltbox as the management plane for saltierpoop.

This is not “retire Saltbox monitoring because it doesn’t belong on saltierpoop.” Those containers belong to Saltbox-managed roles (labelled com.github.saltbox.saltbox_managed=true). The question is whether one observability backend (infra-services) can replace two while Saltbox keeps working.

Non-negotiables

#	Requirement
1	Full data migration: TSDB, Grafana DB/dashboards/datasources, Loki chunks, Uptime Kuma state. No silent loss; archive and verify before `sb remove`.
2	Saltbox remains authoritative on saltierpoop: changes go through `sb edit inventory` / `sb install`, not homelab `services/` on that host.
3	Saltbox R/W after cutover: scrape targets, log push, and any role config that today hits local `prometheus` / `grafana` / `loki` must aim at infra-services (`192.168.6.17` or `*.infra.realemail.app`) via inventory overrides. User-facing monitoring URLs are infra only (see below).
4	Homelab owns the infra-services stack: `services/monitoring/` is canonical for ops C&C metrics/logs/alerts (`*.infra.realemail.app`).
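
The "archive + verify" part of requirement 1 could be checked with a checksum manifest. The following is a minimal sketch, not an established procedure for this migration: the function names are hypothetical, and it assumes both the source directory and the restored archive copy are readable as plain directory trees while the owning service is stopped (a live TSDB or SQLite file changes underneath the hash).

```python
import hashlib
from pathlib import Path


def tree_manifest(root: str) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    base = Path(root)
    manifest = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as fh:
                # Hash in 1 MiB chunks so large Loki chunk files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(base).as_posix()] = digest.hexdigest()
    return manifest


def manifest_diff(source: dict[str, str], archive: dict[str, str]) -> dict[str, list[str]]:
    """Report files missing from the archive or whose contents differ."""
    missing = sorted(set(source) - set(archive))
    changed = sorted(p for p in source.keys() & archive.keys() if source[p] != archive[p])
    return {"missing": missing, "changed": changed}
```

An empty `missing` and `changed` list for every path in the "Current data" table would be one reasonable precondition for `sb remove`.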

Current data (saltierpoop, 2026-06-24)

Role	Path	Size (approx)	Notes
Prometheus	`/opt/prometheus/data`	~559 MB	15d retention
Grafana	`/opt/grafana`	~91 MB	Includes provisioning + DB
Loki	`/opt/loki/loki`	~3.6 GB	Largest; plan downtime or parallel ingest
Loki Alloy	`/opt/loki/alloy`	small	Config at `/opt/loki/config/alloy-config.river`
Uptime Kuma	`/opt/uptime`	~197 MB	SQLite DB (container path `/app/data`)
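
As a rough preflight, the sizes in the table can be summed to estimate how much free space the archive and restore targets need. This sketch treats 3.6 GB as 3600 MB and uses a 2x headroom factor (one archive copy plus one restored copy); both are assumptions, not figures from the policy, and the function names are hypothetical.

```python
import shutil

# Approximate sizes from the table, in MB. Alloy is listed as "small" and left out.
SIZES_MB = {"prometheus": 559, "grafana": 91, "loki": 3600, "uptime-kuma": 197}


def required_mb(sizes_mb: dict[str, int], headroom: float = 2.0) -> float:
    """Total data size multiplied by a headroom factor (assumed, tune as needed)."""
    return sum(sizes_mb.values()) * headroom


def has_room(path: str, needed_mb: float) -> bool:
    """True if the filesystem holding `path` has at least `needed_mb` MB free."""
    free_mb = shutil.disk_usage(path).free / 1_000_000
    return free_mb >= needed_mb
```

With these numbers the data totals roughly 4.4 GB, so the sketch asks for about 8.9 GB free on the target filesystem.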

Config overlays already in Saltbox inventory (localhost.yml): e.g. extra Grafana dashboard bind-mounts under grafana_role_docker_volumes_custom.

Target architecture after migration

saltierpoop (Saltbox VM)
  Traefik *.realemail.app     ── media / auth / *arr only (no monitoring UI proxy)
  Alloy / exporters / scrapes ──LAN──► 192.168.6.17:9090 / :3100
  (no local prom/grafana/loki/uptime containers after cutover)

infra-services
  services/monitoring/        ◄── canonical TSDB + Grafana + Loki + Alertmanager
  *.infra.realemail.app       ◄── sole user-facing monitoring URLs (owner decision)

URL policy (owner decision)

Standardize on *.infra.realemail.app only — do not keep Saltbox Traefik routes for prometheus.realemail.app, grafana.realemail.app, netdata.realemail.app, etc.

Before (Saltbox)	After (canonical)
`https://grafana.realemail.app`	`https://grafana.infra.realemail.app`
`https://prometheus.realemail.app`	`https://prometheus.infra.realemail.app`
`https://uptime.realemail.app`	`https://uptime.infra.realemail.app` (once deployed)
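
The mapping in the table is mechanical, so bookmarks or links in docs could be rewritten with a small helper. This is an illustrative sketch only: `canonical_url` is a hypothetical name, it covers just the three services listed in the table, and it drops any port or userinfo in the original URL.

```python
from urllib.parse import urlsplit, urlunsplit

# Services whose UI moves to *.infra.realemail.app, per the table.
MONITORING_SERVICES = {"grafana", "prometheus", "uptime"}


def canonical_url(url: str) -> str:
    """Rewrite <service>.realemail.app to <service>.infra.realemail.app."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    service, _, rest = host.partition(".")
    if service in MONITORING_SERVICES and rest == "realemail.app":
        return urlunsplit(parts._replace(netloc=f"{service}.infra.realemail.app"))
    # Non-monitoring hosts and already-canonical URLs pass through unchanged.
    return url
```

For example, `canonical_url("https://grafana.realemail.app/d/abc?orgId=1")` keeps the path and query and only swaps the hostname.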

After cutover: run `sb remove` for the monitoring roles and drop the Cloudflare/DNS entries for the old *.realemail.app monitoring hostnames if nothing else needs them. Bookmarks move to the infra hostnames.

Authentik (required): Migrated UIs at *.infra.realemail.app must be Authentik-protected via infra outpost + Traefik forward auth before cutover sign-off. See ADR-002 and authentik-cross-host-sso.md. Admin-password-only Grafana is not acceptable as end state.

Internal Docker DNS: Saltbox docs assume http://prometheus:9090 on the saltbox network. Once the local containers are removed, inventory must repoint Alloy and any remaining consumers to http://192.168.6.17:9090 / http://192.168.6.17:3100 (LAN) or to the HTTPS infra URLs. Verify this before removing the containers.
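
One way to find leftover consumers is to scan rendered config files (Alloy config, role templates, inventory) for URLs that still use Docker service names. The sketch below is a hedged example: the service list and the function name are assumptions, and it only catches plain `http(s)://<service>[:port]` URLs.

```python
import re

# Docker service names that stop resolving once the local containers are removed.
LOCAL_SERVICES = ("prometheus", "grafana", "loki")

# Match http(s)://<service>[:port] only when the hostname ends there, so
# prometheus.infra.realemail.app is not flagged.
STALE_URL = re.compile(
    r"https?://(?:" + "|".join(LOCAL_SERVICES) + r")(?::\d+)?(?=[/\s'\"]|$)"
)


def find_stale_urls(text: str) -> list[tuple[int, str]]:
    """Return (line_number, url) for every URL that targets a local service name."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in STALE_URL.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

An empty result across all Saltbox-managed config on saltierpoop would support the "verified" claim; any hit names a line that still needs an inventory override.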

Per-role migration gates

Prometheus

Step	Action
Export	Copy TSDB (`/opt/prometheus/data`) + merge `/opt/prometheus/prometheus.yml` scrape jobs into `monitoring/prometheus/` (Saltbox cAdvisor/node targets must remain scraped from infra-services).
Import	Stop infra-services Prometheus; restore TSDB into `prometheus-data` volume; reconcile config; start; verify series count.
Saltbox	Set inventory overrides so any role still scraping/querying locally targets `http://192.168.6.17:9090`; disable DNS for `prometheus.realemail.app` if present; then `sb remove prometheus`.
Gate	Query parity on critical metrics at `prometheus.infra.realemail.app`; 15d history visible in Grafana.
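
The "query parity" gate could be automated by running the same instant query against both Prometheus instances and comparing the results. This is a sketch under stated assumptions: it uses the standard `/api/v1/query` endpoint, the helper names are hypothetical, and the 1% tolerance is an arbitrary placeholder rather than a value from this policy.

```python
import json
import urllib.parse
import urllib.request


def instant_query(base_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Run an instant query against the Prometheus HTTP API and return parsed JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def scalar_result(response: dict) -> float:
    """Extract the sample value from a response that contains exactly one series."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error')}")
    result = response["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected 1 series, got {len(result)}")
    # Each vector sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def within_tolerance(old: float, new: float, rel_tol: float = 0.01) -> bool:
    """True if `new` is no more than rel_tol below `old` (extra series are fine)."""
    return new >= old * (1 - rel_tol)
```

A possible use is comparing `scalar_result(instant_query(base, 'count({__name__=~".+"})'))` for the old and new base URLs before `sb remove prometheus`; that query can be expensive on a large TSDB, so a narrower selector on the critical metrics may be the better choice.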

Grafana

Step	Action
Export	`/opt/grafana` (DB + provisioning), or export dashboards as JSON through the Grafana UI or HTTP API (`grafana-cli` has no dashboard export command); merge with homelab file provisioning in `monitoring/grafana/`.
Import	Restore into infra-services `grafana-data` volume; reconcile datasources to infra Prometheus/Loki.
Saltbox	Migrate Saltbox-origin dashboard provisioning into homelab `monitoring/grafana/` first; then `sb remove grafana`; users/bookmarks → `grafana.infra.realemail.app`.
Gate	All migrated dashboards render at infra URL; Authentik SSO per ADR-002.

Loki + Alloy

Step	Action
Export	`/opt/loki/loki` (~3.6 GB) — plan maintenance window; Loki version/compatibility check (Saltbox 2.9.x vs infra 3.5.x may require export/import tooling, not raw copy).
Import	Extend infra-services Loki retention/volume; migrate or rehydrate from archive.
Saltbox	Alloy must keep shipping — inventory override on `loki-alloy` (or successor) to push to `http://192.168.6.17:3100` (or authenticated URL). Do not leave saltierpoop logs orphaned.
Gate	Historical log queries return pre-cutover data; new container logs appear in infra Loki within 5m.
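
The "within 5m" half of the gate can be checked by inspecting the newest timestamp in a Loki query response. The sketch below assumes the JSON shape returned by `/loki/api/v1/query_range` for log queries (streams whose `values` entries start with a nanosecond timestamp string); the function names are hypothetical and fetching the response is left out.

```python
import time
from typing import Optional


def newest_entry_ns(response: dict) -> Optional[int]:
    """Return the newest log timestamp (ns since epoch) in a Loki streams response."""
    stamps = [
        int(entry[0])  # entry is [timestamp_ns_string, log_line, ...]
        for stream in response["data"]["result"]
        for entry in stream["values"]
    ]
    return max(stamps) if stamps else None


def is_fresh(response: dict, max_age_s: float = 300.0, now_ns: Optional[int] = None) -> bool:
    """True if the newest entry is at most max_age_s old (default 5 minutes)."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    newest = newest_entry_ns(response)
    return newest is not None and (now_ns - newest) <= max_age_s * 1e9
```

Running this against a query scoped to saltierpoop containers after cutover would show whether Alloy is still shipping; the historical half of the gate needs a separate query over a pre-cutover time range.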

Uptime Kuma

Step	Action
Export	`/opt/uptime` SQLite + settings.
Import	Deploy `services/uptime-kuma/` on infra-services (not yet in repo); restore DB.
Saltbox	Remove local role; update any bookmarks/monitors; decide the status page URL.
Gate	All monitors green; notification channels intact.
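
Before trusting a restored Uptime Kuma database, the SQLite file itself can be sanity-checked. The following sketch runs SQLite's built-in `PRAGMA integrity_check` and counts rows per table so the source and restored copies can be compared; `sqlite_summary` is a hypothetical helper, and it assumes the database file was copied while the container was stopped.

```python
import sqlite3
from pathlib import Path


def sqlite_summary(db_path: str) -> dict:
    """Run an integrity check and count rows per table, opening the file read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        counts = {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()
    return {"integrity": integrity, "row_counts": counts}
```

Matching summaries on both hosts, with `integrity` equal to `"ok"`, would be one piece of evidence for the gate alongside the manual check that monitors and notification channels work.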

Netdata / Glances / Jaeger

Lower priority. Netdata/Glances overlap with the node_exporter metrics that infra-services already scrapes. Jaeger is likely unused. Migrate only if you actively use them; otherwise run sb remove after confirming nothing depends on them.

Saltbox inventory (where overrides live)

Path: /srv/git/saltbox/inventories/host_vars/localhost.yml on saltierpoop
Edit: sb edit inventory (not homelab git; homelab only ships accounts.yml / settings.yml via SOPS)
Patterns: role-scoped vars with a _role_ infix; prefer _custom variables, which merge with the role defaults, over _default variables, which replace them (inventory docs)
Examples to research per role before cutover:
  Alloy / Loki push URL → http://192.168.6.17:3100
  Any scrape or API URL still pointing at docker service names → LAN infra endpoints
  *_role_dns_enabled: false before remove if Saltbox manages CF records for old hostnames
sb remove <role> only after gates pass, and never through homelab compose

Document every override in this file or a saltierpoop runbook appendix when executed.
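
When documenting overrides, it may help to list which inventory keys merge (`_custom`) and which replace (`_default`). The sketch below is a simple text scan, not a YAML parser: it only looks at top-level keys at column 0, and the function name is hypothetical.

```python
import re

# Top-level YAML keys: an identifier at column 0 followed by a colon.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:", re.MULTILINE)


def classify_overrides(inventory_text: str) -> dict[str, list[str]]:
    """Split role-scoped inventory keys into merging and replacing overrides."""
    keys = [k for k in TOP_LEVEL_KEY.findall(inventory_text) if "_role_" in k]
    return {
        "merges_custom": [k for k in keys if k.endswith("_custom")],
        "replaces_default": [k for k in keys if k.endswith("_default")],
        "other": [k for k in keys if not k.endswith(("_custom", "_default"))],
    }
```

Anything that shows up under `replaces_default` deserves a second look, since it discards the role's built-in values rather than adding to them.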

Network / firewall

infra-services (192.168.6.17) must accept from saltierpoop (192.168.6.243):

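Prometheus 9090 (query and metrics traffic from Alloy). Prometheus pulls its scrapes, so scrapes of saltierpoop cAdvisor/node targets originate on infra-services and the reverse path to those exporter ports must be open too.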
Loki 3100 (push/query from Alloy)

Grafana 3000 is browser-facing on infra Traefik only — saltierpoop does not proxy it.

Confirm that the UDM zone-based firewall (ZBF) allows Servers→Servers traffic before cutover.
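
Reachability can be confirmed from saltierpoop with a plain TCP connect test before any role is removed. This is a minimal sketch: the helper names are hypothetical, and a successful connect only shows the port is open, not that Prometheus or Loki accepts the request.

```python
import socket

# Endpoints saltierpoop must reach on infra-services, per the list above.
REQUIRED = [("192.168.6.17", 9090), ("192.168.6.17", 3100)]


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_all(targets: list[tuple[str, int]]) -> dict[str, bool]:
    """Probe every (host, port) pair and report the result per endpoint."""
    return {f"{host}:{port}": port_open(host, port) for host, port in targets}
```

Calling `check_all(REQUIRED)` on saltierpoop should report `True` for both endpoints once the firewall rule is in place.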

Execution order (when approved)

1. Parallel run: the infra-services stack receives the same scrapes and log streams; compare dashboards.
2. Data migration: follow the per-role tables above; archive saltierpoop paths to infra-backups before deleting anything.
3. Inventory overrides: point Saltbox at infra-services; sb install affected roles.
4. Verify Saltbox R/W: Alloy push, scrape parity, *arr health (no dependency on local prom/grafana).
5. sb remove local prometheus, grafana, loki, uptime (only after owner sign-off).
6. Update homelab docs: inventory notes, decom-old-monitoring.md, journal entry.
7. Authentik: infra outpost live; Grafana/Prometheus/Uptime routers on forward-auth (required before cutover sign-off, per the URL policy).

Out of scope

Moving Plex, *arr, Traefik, Authentik off saltierpoop
Managing Saltbox localhost.yml from homelab git (unless owner adds that workflow later)
Partial migration (“infra gets metrics but Loki history stays on saltierpoop”): this violates non-negotiable #1 (full data migration)