Decommission Old Monitoring LXCs
Runbook for migrating data from and decommissioning LXC 101, 108, and 112 on Proxmox, which are superseded by the Phase 5 monitoring stack on infra-services.
Targets
| VMID | Name | IP | RAM | Disk | Status | Superseded By |
|---|---|---|---|---|---|---|
| 101 | alpine-prometheus | - | 256MB | 1GB | stopped | Phase 5 Prometheus |
| 108 | prometheus | 192.168.6.237 | 2GB | 8GB | running | Phase 5 Prometheus (infra-services) |
| 112 | grafana | 192.168.6.249 | 512MB | 2GB | running | Phase 5 Grafana (infra-services) |
There are also prometheus and grafana containers on saltierpoop (Saltbox-managed).
Those are out of scope for this runbook — they stay as Saltbox-internal
monitoring. The new canonical stack is on infra-services at 192.168.6.17.
Pre-flight Audit
Before destroying anything, collect configuration from the running LXCs so we can identify what was sending data to them and re-point if needed.
Step 1: Audit LXC 108 (Prometheus)
SSH into LXC 108 and extract its scrape configuration:
ssh root@192.168.6.237
# Find the Prometheus config
cat /etc/prometheus/prometheus.yml # standard location
# or: find / -name prometheus.yml 2>/dev/null
# Save the scrape_configs section — this lists every target that was
# being scraped. Each target is a "data producer" that may need re-pointing.
Record the output. For each scrape target, determine:
- Already covered? Check if the target IP appears in the new monitoring/targets/nodes.yml (generated from inventory). If yes, no action.
- Missing? Add the host/appliance to inventory/hosts/ or inventory/appliances/ with ansible.managed: true so render-prometheus.py generates a target for it.
- Irrelevant? If the target was a stopped LXC or dead service, skip it.
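The triage above can be partly scripted. The following is a minimal sketch rather than part of the runbook: it assumes the old prometheus.yml has been copied to the workstation and that a whole-word match on the host is enough to call a target covered, which mirrors the "target IP appears in nodes.yml" rule. The function names and file arguments are hypothetical.

```shell
# Sketch only: function names and file arguments are hypothetical.

# list_targets FILE: print every host:port token found in FILE, one per line.
list_targets() {
  grep -Eo '[A-Za-z0-9._-]+:[0-9]+' "$1" | sort -u
}

# classify_targets OLD_CONFIG NEW_TARGETS: label each old target as
# "covered" (its host appears in NEW_TARGETS) or "missing" (it does not).
classify_targets() {
  list_targets "$1" | while read -r target; do
    host=${target%:*}
    # -w avoids 192.168.6.9 matching inside 192.168.6.98
    if grep -qwF -- "$host" "$2"; then
      echo "covered $target"
    else
      echo "missing $target"
    fi
  done
}

# Example (paths are assumptions):
#   classify_targets ./lxc108-prometheus.yml monitoring/targets/nodes.yml
```

A "covered" result only says the host is known to the new stack. It does not confirm that the same port or job is scraped, and "missing" entries still need the irrelevant-or-not decision made by hand.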
Step 2: Audit LXC 112 (Grafana)
SSH into LXC 112 and extract its datasource config and dashboards:
ssh root@192.168.6.249
# Grafana datasources (shows what Prometheus/InfluxDB it reads from)
cat /etc/grafana/provisioning/datasources/*.yml 2>/dev/null
# or check the Grafana API:
curl -s http://localhost:3000/api/datasources | jq .
# Export dashboards worth keeping
curl -s http://localhost:3000/api/search | jq '.[] | {uid, title}'   # uid is needed for the export below
# For each dashboard you want to preserve:
# curl -s http://localhost:3000/api/dashboards/uid/<UID> | jq .dashboard > dashboard-name.json
If any datasource points at LXC 108 (192.168.6.237:9090), that confirms
the dependency. The new Prometheus is at 192.168.6.17:9090.
Import any valuable dashboard JSON files into the new Grafana at
monitoring/grafana/dashboards/ and they will be auto-provisioned.
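If there are more than a handful of dashboards, the per-dashboard export can be looped. This is a hedged sketch, not the procedure that was actually used: GRAFANA_URL, OUT_DIR and export_dashboards are hypothetical names, and it assumes the API answers without credentials, as the curl calls above do.

```shell
# Sketch only: GRAFANA_URL, OUT_DIR and export_dashboards are hypothetical.
# Assumes unauthenticated API access, as in the curl calls above.
GRAFANA_URL=${GRAFANA_URL:-http://localhost:3000}
OUT_DIR=${OUT_DIR:-./dashboards-export}

export_dashboards() {
  mkdir -p "$OUT_DIR"
  # type=dash-db limits the search results to dashboards (no folders).
  curl -s "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid' |
  while read -r uid; do
    # The API wraps the dashboard in a "meta" envelope; keep only the body.
    curl -s "$GRAFANA_URL/api/dashboards/uid/$uid" | jq '.dashboard' > "$OUT_DIR/$uid.json"
    echo "exported $uid"
  done
}

# Example: export_dashboards
```

Files are named by UID here to avoid quoting problems with titles; rename them before committing if a readable name is preferred.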
Step 3: Audit saltierpoop Prometheus (informational only)
This instance is Saltbox-managed and out of scope for this runbook, but its scrape targets are worth documenting.
Record which targets it scrapes. These are likely Saltbox-internal services only (sonarr, radarr, etc.) and do not need re-pointing.
Step 4: Verify new stack coverage
On your workstation, confirm the new Prometheus is scraping everything important:
# Check current targets
curl -s http://prometheus.infra.realemail.app/api/v1/targets | jq '.data.activeTargets[].labels.instance'
Compare this list against the targets found in LXC 108's config. Any gaps should be added to inventory before proceeding.
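The comparison can be done mechanically once both lists are saved as text files with one instance per line. A minimal sketch, assuming exact host:port strings should match; missing_targets and its file arguments are hypothetical.

```shell
# Sketch only: missing_targets and its file arguments are hypothetical.
# missing_targets OLD_LIST NEW_LIST: print targets that are in OLD_LIST but
# not in NEW_LIST. Each file holds one instance (host:port) per line.
missing_targets() {
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  sort -u "$1" > "$old_sorted"
  sort -u "$2" > "$new_sorted"
  # comm -23 keeps only the lines unique to the first file.
  comm -23 "$old_sorted" "$new_sorted"
  rm -f "$old_sorted" "$new_sorted"
}

# Example (file names are assumptions):
#   curl -s http://prometheus.infra.realemail.app/api/v1/targets \
#     | jq -r '.data.activeTargets[].labels.instance' > new-targets.txt
#   missing_targets old-targets.txt new-targets.txt
```

Because the match is on the full instance string, a host that is scraped on a different port in the new stack is still reported as a gap, which is usually the safer default.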
Audit Results (completed 2026-05-14)
LXC 108 (Prometheus) findings
Scrape config at /etc/prometheus/prometheus.yml:
| Job | Target | Status |
|---|---|---|
| prometheus | localhost:9090 | Self-scrape, irrelevant after decom |
| tiktok-monitor2 | 192.168.6.98:80 | Needs assessment: customer-app metrics. Decide if this should be added to the new stack |
No alerting rules were configured. No remote-write destinations.
LXC 112 (Grafana) findings
Datasources:
| Name | Type | URL | Notes |
|---|---|---|---|
| influxdb-nfs-monitoring | InfluxDB (Flux) | http://192.168.6.132:8086 | Reads nfs-monitoring bucket |
| proxbox-influxdb | InfluxDB (Flux) | http://192.168.6.132:8086 | Reads thermal-data bucket |
| proxbox-prometheus-tiktok | Prometheus | http://192.168.6.237:9090 | Points at LXC 108 (confirmed dependency) |
Key finding (2026-05 audit): Both dashboards depended on InfluxDB at
192.168.6.132:8086, not Prometheus. Superseded 2026-06-24 (Phase 3): dashboards
rewritten for Prometheus; LXC 111 archived and destroyed; Influx datasources removed from
infra-services Grafana.
Dashboards exported:
| Dashboard | File | Datasource | Panels |
|---|---|---|---|
| NFS File Monitoring System | nfs-monitoring.json | InfluxDB (nfs-monitoring bucket) | Documents Indexed, ES Docs Over Time, Container CPU, NFS Storage gauge |
| Proxbox Thermals | proxbox_thermals.json | InfluxDB (thermal-data bucket) | CPU package/core temps, NVMe temps (Proxmox host) |
Both JSON files were saved to monitoring/grafana/dashboards/ for archival.
Proxbox Thermals was rewritten for Prometheus (Phase 3); NFS Monitoring remains a
historical export (LXC 114 retired — no live NFS probe metrics).
Coverage gap summary
All gaps identified during audit have been resolved:
| Gap | Action | Status |
|---|---|---|
| 192.168.6.98:80 (tiktok-monitor2) | Added to prometheus.yml as static scrape target | Done |
| InfluxDB datasource (192.168.6.132:8086) | Interim bridge (May 2026); removed Phase 3 (Prometheus-only datasources) | Superseded |
| Node exporter targets | nodes.yml committed to git; file_sd picks it up after git pull | Done |
| Traefik metrics | Added metrics entrypoint on :8080 and metrics.prometheus to traefik.yml | Done |
| INFLUXDB_TOKEN | Interim bridge (May 2026); removed Phase 3 from compose and SOPS | Superseded |
Data Migration
Prometheus TSDB (LXC 108)
Prometheus data is retention-bounded (typically 15-30d) and regenerated by ongoing scrapes. Migration is not recommended — the new Prometheus will build its own history. If historical data is needed for a specific investigation, it can be queried from the old instance before shutdown.
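Grafana dashboards (LXC 112)
Exported as JSON (see audit results above) and placed in
monitoring/grafana/dashboards/, where the provisioning config auto-imports them.
The original exports query InfluxDB, so they only rendered while the interim InfluxDB datasource existed. Since Phase 3 the datasources are Prometheus-only: the rewritten Proxbox Thermals dashboard shows live data, and NFS Monitoring is kept as a historical export.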
Decommission Steps
After the audit is complete and any re-pointing is done:
1. Stop the LXC services
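A hedged sketch of this step, run on the Proxmox host. It assumes a clean pct shutdown is preferred over a hard pct stop, and it disables onboot so the containers stay down during the waiting period. The run and stop_lxcs helpers and the DRY_RUN switch are hypothetical.

```shell
# Sketch only: run, stop_lxcs and DRY_RUN are hypothetical helpers.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

stop_lxcs() {
  for vmid in "$@"; do
    run pct shutdown "$vmid"         # clean shutdown of the container
    run pct set "$vmid" --onboot 0   # keep it down across host reboots
  done
}

# Example: preview first, then run for real.
#   DRY_RUN=1 stop_lxcs 108 112
#   stop_lxcs 108 112
```

LXC 101 is already stopped according to the Targets table, so only 108 and 112 are listed in the example.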
2. Wait and verify
Leave the LXCs stopped for 48 hours. Monitor for:
- Missing scrape targets in Prometheus (up == 0 alerts)
- Broken Grafana dashboards on the new stack
- Any service complaining about a missing 192.168.6.237 or 192.168.6.249
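The first check can be polled from a shell during the 48 hours. A minimal sketch, assuming the new Prometheus at 192.168.6.17:9090 is reachable from where the command runs; PROM_URL and down_targets are hypothetical names.

```shell
# Sketch only: PROM_URL and down_targets are hypothetical names.
PROM_URL=${PROM_URL:-http://192.168.6.17:9090}

# down_targets: print the instance label of every target whose `up` series is 0.
down_targets() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode 'query=up == 0' |
    jq -r '.data.result[].metric.instance'
}

# Example: an empty result means no scrape target is currently down.
#   down_targets
```

This only covers targets the new stack already knows about; anything that silently depended on the old LXCs still has to be caught by the other two checks.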
3. Destroy the LXCs
Once confident nothing depends on them:
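pct destroy removes the container and the volumes it owns; --purge additionally removes it from related configurations such as backup jobs, replication jobs, and HA.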
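A hedged sketch of the destroy step, run on the Proxmox host. The destroy_lxc wrapper is hypothetical; it adds a guard that refuses to act on a container that is not stopped, on the assumption that pct status prints a line of the form "status: stopped".

```shell
# Sketch only: destroy_lxc is a hypothetical wrapper around pct.
destroy_lxc() {
  vmid=$1
  state=$(pct status "$vmid")   # expected form: "status: stopped"
  if [ "$state" != "status: stopped" ]; then
    echo "refusing to destroy $vmid ($state)" >&2
    return 1
  fi
  pct destroy "$vmid" --purge
}

# Example:
#   for vmid in 101 108 112; do destroy_lxc "$vmid" || break; done
```

Destruction is irreversible, so taking a final backup of each container beforehand is worth considering if any doubt remains after the waiting period.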
4. Reclaim resources
Expected recovery: ~2.75GB RAM allocation, ~11GB disk on local-lvm.
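The expected figures follow directly from the Targets table. As a worked check, assuming 2GB of RAM means 2048MB:

```shell
# Allocations from the Targets table: RAM in MB, disk in GB (101, 108, 112).
ram_mb=$((256 + 2048 + 512))
disk_gb=$((1 + 8 + 2))
# Shell arithmetic is integer-only, so awk does the MB-to-GB division.
ram_gb=$(awk -v mb="$ram_mb" 'BEGIN { printf "%.2f", mb / 1024 }')
echo "RAM: ${ram_gb}GB, disk: ${disk_gb}GB"
# prints "RAM: 2.75GB, disk: 11GB"
```

These are allocation totals, not measured usage, so the space actually freed on local-lvm may differ.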
5. Update documentation
- Remove LXC 101/108/112 from docs/architecture/lab-audit.md section 3.2 (mark as decommissioned or remove rows).
- Remove the Phase 6 migration TODO row from README.md.
- Add a note to docs/postmortems/index.md recording the decommission date.