Saltbox monitoring → infra-services migration policy

Owner policy (2026-06): Migration is acceptable only if all data moves. After cutover, Saltbox must read/write infra-services with inventory overrides wherever Saltbox’s default assumes local containers — no compromise that breaks Saltbox as the management plane for saltierpoop.

This is not “retire Saltbox monitoring because it doesn’t belong on saltierpoop.” Those containers are com.github.saltbox.saltbox_managed=true roles. The question is whether one observability backend (infra-services) replaces two, while Saltbox keeps working.


Non-negotiables

# Requirement
1 Full data migration — TSDB, Grafana DB/dashboards/datasources, Loki chunks, Uptime Kuma state. No silent loss; archive + verify before sb remove.
2 Saltbox remains authoritative on saltierpoop — changes via sb edit inventory / sb install, not homelab services/ on that host.
3 Saltbox R/W after cutover — scrape targets, log push, and any role config that today hits local prometheus / grafana / loki must aim at infra-services (192.168.6.17 or *.infra.realemail.app) via inventory overrides. User-facing monitoring URLs are infra only (see below).
4 Homelab owns infra-services stack: services/monitoring/ is canonical for ops C&C metrics/logs/alerts (*.infra.realemail.app).
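
Requirement 1 asks for "archive + verify" before sb remove. One way to make the verify step concrete is to compare checksum manifests of the source path and the archived copy. The following is a minimal sketch, assuming the archive is a plain directory copy (not a tarball); the function names are hypothetical and not part of Saltbox or the homelab repo.

```python
import hashlib
from pathlib import Path


def build_manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    base = Path(root)
    manifest = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB blocks so large Loki chunk files do not fill memory.
                for block in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(block)
            manifest[path.relative_to(base).as_posix()] = digest.hexdigest()
    return manifest


def diff_manifests(source: dict[str, str], archive: dict[str, str]) -> dict[str, list[str]]:
    """Report files missing from the archive, unexpected extras, and content mismatches."""
    shared = source.keys() & archive.keys()
    return {
        "missing": sorted(set(source) - set(archive)),
        "extra": sorted(set(archive) - set(source)),
        "changed": sorted(name for name in shared if source[name] != archive[name]),
    }
```

The comparison is only meaningful if the role's container is stopped while both manifests are built, since a running Prometheus or Loki keeps rewriting its data directory. An archive would count as verified when all three lists come back empty.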

Current data (saltierpoop, 2026-06-24)

Role          Path                   Size (approx)   Notes
Prometheus    /opt/prometheus/data   ~559 MB         15d retention
Grafana       /opt/grafana           ~91 MB          Includes provisioning + DB
Loki          /opt/loki/loki         ~3.6 GB         Largest; plan downtime or parallel ingest
Loki Alloy    /opt/loki/alloy        small           Config at /opt/loki/config/alloy-config.river
Uptime Kuma   /opt/uptime            ~197 MB         SQLite under /app/data

Config overlays already in Saltbox inventory (localhost.yml): e.g. extra Grafana dashboard bind-mounts under grafana_role_docker_volumes_custom.


Target architecture after migration

saltierpoop (Saltbox VM)
  Traefik *.realemail.app     ── media / auth / *arr only (no monitoring UI proxy)
  Alloy / exporters / scrapes ──LAN──► 192.168.6.17:9090 / :3100
  (no local prom/grafana/loki/uptime containers after cutover)

infra-services
  services/monitoring/        ◄── canonical TSDB + Grafana + Loki + Alertmanager
  *.infra.realemail.app       ◄── sole user-facing monitoring URLs (owner decision)

URL policy (owner decision)

Standardize on *.infra.realemail.app only — do not keep Saltbox Traefik routes for prometheus.realemail.app, grafana.realemail.app, netdata.realemail.app, etc.

Before (Saltbox)                    After (canonical)
https://grafana.realemail.app       https://grafana.infra.realemail.app
https://prometheus.realemail.app    https://prometheus.infra.realemail.app
https://uptime.realemail.app        https://uptime.infra.realemail.app (once deployed)
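
For bookmarks, runbooks, and notification templates that still carry the old hostnames, the Before/After mapping can be applied mechanically. This is a small sketch, assuming only the three hostnames in the table need rewriting; HOST_MAP and to_infra_url are hypothetical names introduced for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Hostname pairs taken from the Before/After table; extend as more UIs move.
HOST_MAP = {
    "grafana.realemail.app": "grafana.infra.realemail.app",
    "prometheus.realemail.app": "prometheus.infra.realemail.app",
    "uptime.realemail.app": "uptime.infra.realemail.app",
}


def to_infra_url(url: str) -> str:
    """Rewrite a Saltbox-era monitoring URL to its infra hostname, keeping path and query."""
    parts = urlsplit(url)
    new_host = HOST_MAP.get((parts.hostname or "").lower())
    if new_host is None:
        # Not a monitoring hostname (media, auth, *arr): leave it untouched.
        return url
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Exact hostname matching is deliberate: URLs for media or *arr services under *.realemail.app pass through unchanged, which matches the policy that only monitoring UIs move.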

After cutover: sb remove monitoring roles, drop Cloudflare/DNS entries for the old *.realemail.app monitoring hostnames if nothing else needs them. Bookmarks move to infra hostnames.

Authentik (required): Migrated UIs at *.infra.realemail.app must be Authentik-protected via infra outpost + Traefik forward auth before cutover sign-off. See ADR-002 and authentik-cross-host-sso.md. Admin-password-only Grafana is not acceptable as end state.

Internal Docker DNS: Saltbox docs assume http://prometheus:9090 on the saltbox network. After removing local containers, inventory must repoint Alloy and any remaining consumers to http://192.168.6.17:9090 / http://192.168.6.17:3100 (LAN) or HTTPS infra URLs — verified before destroy.
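
To support the "verified before destroy" check, leftover references to Docker service names can be searched for in exported config text (the Alloy config, rendered role configs, and so on). This is a rough sketch, assuming the three service names below are the ones that matter; it is a text scan, not a parser for any specific config format.

```python
import re

# Service names that resolve only on the local saltbox Docker network.
LOCAL_SERVICES = ("prometheus", "grafana", "loki")

# Match http(s)://<service>[:port] only when the hostname ends there, so
# prometheus.infra.realemail.app is not flagged.
_PATTERN = re.compile(
    r"https?://(?:%s)(?::\d+)?(?=[/\s\"']|$)" % "|".join(LOCAL_SERVICES)
)


def find_local_endpoints(text: str) -> list[tuple[int, str]]:
    """Return (line number, URL) for each reference to a local Docker service name."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((number, match.group(0)))
    return hits
```

An empty result for every config file that Saltbox still renders after cutover is a reasonable precondition for removing the local containers, although it cannot catch endpoints that are assembled at runtime.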


Per-role migration gates

Prometheus

Step Action
Export Copy TSDB (/opt/prometheus/data) + merge /opt/prometheus/prometheus.yml scrape jobs into monitoring/prometheus/ (Saltbox cAdvisor/node targets must remain scraped from infra-services).
Import Stop infra-services Prometheus; restore TSDB into prometheus-data volume; reconcile config; start; verify series count.
Saltbox sb remove prometheus; disable DNS for prometheus.realemail.app if present; inventory overrides for any role still scraping/querying locally → http://192.168.6.17:9090.
Gate Query parity on critical metrics at prometheus.infra.realemail.app; 15d history visible in Grafana.
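
The query-parity gate can be checked by running the same instant query against both Prometheus instances and comparing the label sets returned. The sketch below uses the standard Prometheus HTTP API endpoint /api/v1/query; the helper names and the choice to ignore selected labels are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query(base_url: str, expr: str, timeout: float = 10.0) -> dict:
    """Run an instant query against the Prometheus HTTP API and return the parsed JSON."""
    url = f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def series_keys(response: dict, ignore: tuple = ()) -> set:
    """Reduce an instant-vector response to the set of label sets it contains."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return {
        frozenset((k, v) for k, v in sample["metric"].items() if k not in ignore)
        for sample in response["data"]["result"]
    }


def missing_on_infra(old: dict, new: dict, ignore: tuple = ()) -> list:
    """List label sets present in the old response but absent from the new one."""
    gone = series_keys(old, ignore) - series_keys(new, ignore)
    return sorted((dict(key) for key in gone), key=lambda labels: sorted(labels.items()))
```

A possible use is missing_on_infra(instant_query("http://prometheus:9090", "up"), instant_query("http://192.168.6.17:9090", "up"), ignore=("instance",)), run from somewhere that can resolve both endpoints. Ignoring instance is a judgment call: target addresses are likely to change when infra-services scrapes saltierpoop over the LAN instead of the Docker network.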

Grafana

Step Action
Export /opt/grafana (DB + provisioning), or export dashboards as JSON through the Grafana HTTP API (grafana-cli has no dashboard export command); merge with homelab file provisioning in monitoring/grafana/.
Import Restore into infra-services grafana-data volume; reconcile datasources to infra Prometheus/Loki.
Saltbox sb remove grafana; users/bookmarks → grafana.infra.realemail.app; migrate Saltbox-origin dashboard provisioning into homelab monitoring/grafana/ before remove.
Gate All migrated dashboards render at infra URL; Authentik SSO per ADR-002.

Loki + Alloy

Step Action
Export /opt/loki/loki (~3.6 GB) — plan maintenance window; Loki version/compatibility check (Saltbox 2.9.x vs infra 3.5.x may require export/import tooling, not raw copy).
Import Extend infra-services Loki retention/volume; migrate or rehydrate from archive.
Saltbox Alloy must keep shipping — inventory override on loki-alloy (or successor) to push to http://192.168.6.17:3100 (or authenticated URL). Do not leave saltierpoop logs orphaned.
Gate Historical log queries return pre-cutover data; new container logs appear in infra Loki within 5m.
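
The "within 5m" half of the Loki gate can be tested by asking infra Loki for recent entries and measuring the age of the newest one. The sketch below uses Loki's /loki/api/v1/query_range endpoint, whose start and end parameters are Unix epoch nanoseconds; the helper names are hypothetical, and the LogQL selector to pass in depends on the labels the Alloy config actually attaches.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

FRESHNESS_LIMIT_S = 5 * 60  # the gate's "within 5m"


def query_range(base_url, logql, start_ns, end_ns, limit=100, timeout=10.0):
    """Call Loki's query_range endpoint; start_ns and end_ns are epoch nanoseconds."""
    params = urlencode({"query": logql, "start": start_ns, "end": end_ns, "limit": limit})
    url = f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def newest_entry_ns(response):
    """Return the newest log timestamp (ns) in a streams response, or None if empty."""
    stamps = [
        int(stamp)
        for stream in response["data"]["result"]
        for stamp, _line in stream["values"]
    ]
    return max(stamps, default=None)


def is_fresh(response, now_ns=None, limit_s=FRESHNESS_LIMIT_S):
    """True when the newest returned entry is no older than limit_s seconds."""
    newest = newest_entry_ns(response)
    if newest is None:
        return False
    now_ns = time.time_ns() if now_ns is None else now_ns
    return (now_ns - newest) <= limit_s * 1_000_000_000
```

The same query_range helper covers the historical half of the gate: a window that ends before the cutover time should still return entries once the migrated or rehydrated data is in place.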

Uptime Kuma

Step Action
Export /opt/uptime SQLite + settings.
Import Deploy services/uptime-kuma/ on infra-services (not yet in repo); restore DB.
Saltbox Remove local role; update any bookmarks/monitors; status page URL decision.
Gate All monitors green; notification channels intact.
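
Because Uptime Kuma state lives in SQLite, the exported copy can be checked before it is restored. The following is a minimal sketch using Python's sqlite3 module; it makes no assumption about Uptime Kuma's schema and simply runs SQLite's own integrity check and counts rows per table. The function name is hypothetical.

```python
import sqlite3
from pathlib import Path


def check_sqlite(path: str) -> dict:
    """Run SQLite's integrity check and count rows per table in a database file."""
    if not Path(path).is_file():
        # sqlite3.connect would silently create an empty database otherwise.
        raise FileNotFoundError(path)
    conn = sqlite3.connect(path)
    try:
        status = conn.execute("PRAGMA integrity_check").fetchone()[0]
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
            )
        ]
        counts = {
            table: conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
            for table in tables
        }
    finally:
        conn.close()
    return {"integrity": status, "rows": counts}
```

Run it against a copy taken while the container is stopped, not the live file. Comparing the row counts from the export with those from the restored database on infra-services gives a quick signal that monitors and notification channels were carried over.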

Netdata / Glances / Jaeger

Lower priority. Netdata/Glances overlap infra node_exporter scrape. Jaeger likely unused. Migrate only if you actively use them; otherwise sb remove after confirming no dependency.


Saltbox inventory (where overrides live)

  • Path: /srv/git/saltbox/inventories/host_vars/localhost.yml on saltierpoop
  • Edit: sb edit inventory (not homelab git — homelab only ships accounts.yml / settings.yml via SOPS)
  • Patterns: role-scoped vars carry a _role_ infix; set the _custom variant, which merges with the role defaults, rather than the _default variant, which replaces them (inventory docs)
  • Examples to research per role before cutover:
      • Alloy / Loki push URL → http://192.168.6.17:3100
      • Any scrape or API URL still pointing at Docker service names → LAN infra endpoints
      • *_role_dns_enabled: false before remove if Saltbox manages CF records for old hostnames
  • sb remove <role> only after gates pass; removal goes through sb, not homelab compose

Document every override in this file or a saltierpoop runbook appendix when executed.
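
The naming patterns above lend themselves to a quick pre-flight check on a planned set of overrides. This is an illustrative sketch only: lint_overrides is a hypothetical helper, it inspects variable names in a plain dict rather than reading localhost.yml, and its findings are warnings to review, not hard errors.

```python
def lint_overrides(overrides: dict) -> list[str]:
    """Flag inventory variable names that break the naming patterns noted above."""
    problems = []
    for name in overrides:
        if "_role_" not in name:
            problems.append(f"{name}: missing the _role_ infix")
        if name.endswith("_default"):
            problems.append(
                f"{name}: _default replaces role defaults; prefer the _custom variant"
            )
    return problems
```

For example, grafana_role_docker_volumes_custom (already present in the inventory) passes cleanly, while a variable ending in _default would be reported so that the replace-versus-merge choice is made deliberately.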


Network / firewall

infra-services (192.168.6.17) must accept from saltierpoop (192.168.6.243):

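  • Prometheus 9090 (queries, plus metric push if Alloy is configured to write there). Pull-based scrapes of saltierpoop exporters originate on infra-services, so the reverse direction (192.168.6.17 → 192.168.6.243 exporter ports) must be allowed as well.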
  • Loki 3100 (push/query from Alloy)

Grafana 3000 is browser-facing on infra Traefik only — saltierpoop does not proxy it.

Confirm UDM ZBF Servers→Servers before cutover.
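
A plain TCP connect test, run from saltierpoop, gives a quick confirmation that the firewall policy lets the two required ports through. The sketch below uses only the standard library; REQUIRED mirrors the list above and the function names are hypothetical. It proves reachability at the TCP level only, not that Prometheus or Loki answer correctly.

```python
import socket

# Endpoints from the list above: infra-services Prometheus and Loki.
REQUIRED = [("192.168.6.17", 9090), ("192.168.6.17", 3100)]


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def blocked_endpoints(endpoints=REQUIRED, timeout: float = 3.0) -> list:
    """Return the (host, port) pairs that could not be reached."""
    return [(host, port) for host, port in endpoints if not port_open(host, port, timeout)]
```

An empty list from blocked_endpoints() would indicate that both ports are reachable from the host where it runs.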


Execution order (when approved)

  1. Parallel run — infra-services stack receives same scrapes + log streams; compare dashboards.
  2. Data migration — per table above; archive saltierpoop paths to infra-backups before delete.
  3. Inventory overrides — point Saltbox at infra-services; sb install affected roles.
  4. Verify Saltbox R/W — Alloy push, scrape parity, *arr health (no dependency on local prom/grafana).
  5. sb remove local prometheus, grafana, loki, uptime (only after owner sign-off).
  6. Update homelab docs — inventory notes, decom-old-monitoring.md, journal entry.
  7. Authentik: infra outpost live; Grafana/Prometheus/Uptime routers on forward-auth. Per the URL policy, this must be complete before the owner sign-off in step 5.

Out of scope

  • Moving Plex, *arr, Traefik, Authentik off saltierpoop
  • Managing Saltbox localhost.yml from homelab git (unless owner adds that workflow later)
  • Partial migration (“infra gets metrics but Loki history stays on saltierpoop”) — violates owner policy #1