# Decommission Old Monitoring LXCs
Runbook for migrating data from and decommissioning LXC 101, 108, and 112 on Proxmox, which are superseded by the Phase 5 monitoring stack on infra-services.
## Targets

| VMID | Name | IP | RAM | Disk | Status | Superseded By |
|---|---|---|---|---|---|---|
| 101 | `alpine-prometheus` | — | 256MB | 1GB | stopped | Phase 5 Prometheus |
| 108 | `prometheus` | 192.168.6.237 | 2GB | 8GB | running | Phase 5 Prometheus (infra-services) |
| 112 | `grafana` | 192.168.6.249 | 512MB | 2GB | running | Phase 5 Grafana (infra-services) |
There are also prometheus and grafana containers on saltierpoop (Saltbox-managed).
Those are out of scope for this runbook — they stay as Saltbox-internal
monitoring. The new canonical stack is on infra-services at 192.168.6.17.
## Pre-flight Audit
Before destroying anything, collect configuration from the running LXCs so we can identify what was sending data to them and re-point if needed.
### Step 1 — Audit LXC 108 (Prometheus)
SSH into LXC 108 and extract its scrape configuration:
```bash
ssh root@192.168.6.237

# Find the Prometheus config
cat /etc/prometheus/prometheus.yml   # standard location
# or: find / -name prometheus.yml 2>/dev/null

# Save the scrape_configs section — this lists every target that was
# being scraped. Each target is a "data producer" that may need re-pointing.
```
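To pull just the target addresses out of that config, two hedged options (the `grep` assumes the standard `static_configs` layout; the `yq` line only applies if that tool happens to be installed on the LXC):

```bash
# Crude but dependency-free: show each targets: block with some context
grep -B4 -A3 'targets:' /etc/prometheus/prometheus.yml

# Cleaner, if mikefarah yq is installed:
# yq '.scrape_configs[].static_configs[].targets[]' /etc/prometheus/prometheus.yml
```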
Record the output. For each scrape target, determine:
- **Already covered?** Check if the target IP appears in the new `monitoring/targets/nodes.yml` (generated from inventory). If yes, no action. A grep sketch follows this list.
- **Missing?** Add the host/appliance to `inventory/hosts/` or `inventory/appliances/` with `ansible.managed: true` so `render-prometheus.py` generates a target for it.
- **Irrelevant?** If the target was a stopped LXC or dead service, skip it.
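For the "already covered?" check, a grep from the repo root; the IP here is `tiktok-monitor2`'s from the audit results below and stands in for whichever target you are checking:

```bash
# No hits in monitoring/targets/nodes.yml means the target is not yet covered.
grep -rn '192.168.6.98' monitoring/targets/nodes.yml inventory/
```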
### Step 2 — Audit LXC 112 (Grafana)
SSH into LXC 112 and extract its datasource config and dashboards:
```bash
ssh root@192.168.6.249

# Grafana datasources (shows what Prometheus/InfluxDB it reads from)
cat /etc/grafana/provisioning/datasources/*.yml 2>/dev/null
# or check the Grafana API:
curl -s http://localhost:3000/api/datasources | jq .

# Export dashboards worth keeping
curl -s http://localhost:3000/api/search | jq '.[].uri'
# For each dashboard you want to preserve:
# curl -s http://localhost:3000/api/dashboards/uid/<UID> | jq .dashboard > dashboard-name.json
```
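To export every dashboard in one pass instead of one at a time, a sketch that assumes this Grafana is 5.x or newer, where `/api/search` results include a `uid` field:

```bash
# Save each dashboard as <uid>.json in the current directory.
for uid in $(curl -s 'http://localhost:3000/api/search?type=dash-db' | jq -r '.[].uid'); do
  curl -s "http://localhost:3000/api/dashboards/uid/${uid}" | jq .dashboard > "${uid}.json"
done
```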
If any datasource points at LXC 108 (192.168.6.237:9090), that confirms
the dependency. The new Prometheus is at 192.168.6.17:9090.
To import any valuable dashboards, drop the exported JSON files into
`monitoring/grafana/dashboards/`; the new Grafana auto-provisions them.
### Step 3 — Audit saltierpoop Prometheus (informational only)
This stack is Saltbox-managed and out of scope, but worth documenting: record which targets it scrapes. These are likely Saltbox-internal services only (sonarr, radarr, etc.) and do not need re-pointing.
### Step 4 — Verify new stack coverage
On your workstation, confirm the new Prometheus is scraping everything important:
```bash
# Check current targets
curl -s http://prometheus.infra.realemail.app/api/v1/targets | jq '.data.activeTargets[].labels.instance'
```
Compare this list against the targets found in LXC 108's config. Any gaps should be added to inventory before proceeding.
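One way to mechanize that comparison, assuming Step 1's targets were saved to `old-targets.txt` as one `host:port` per line (the filename is illustrative):

```bash
# New stack's active targets, normalized:
curl -s http://prometheus.infra.realemail.app/api/v1/targets \
  | jq -r '.data.activeTargets[].labels.instance' | sort -u > new-targets.txt

# Lines only in old-targets.txt are potential coverage gaps:
comm -23 <(sort -u old-targets.txt) new-targets.txt
```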
## Audit Results (completed 2026-05-14)
### LXC 108 (Prometheus) findings

Scrape config at `/etc/prometheus/prometheus.yml`:

| Job | Target | Status |
|---|---|---|
| `prometheus` | `localhost:9090` | Self-scrape — irrelevant after decom |
| `tiktok-monitor2` | `192.168.6.98:80` | Needs assessment — customer-app metrics. Decide if this should be added to the new stack |
No alerting rules were configured. No remote-write destinations.
### LXC 112 (Grafana) findings
Datasources:
| Name | Type | URL | Notes |
|---|---|---|---|
| `influxdb-nfs-monitoring` | InfluxDB (Flux) | `http://192.168.6.132:8086` | Reads `nfs-monitoring` bucket |
| `proxbox-influxdb` | InfluxDB (Flux) | `http://192.168.6.132:8086` | Reads `thermal-data` bucket |
| `proxbox-prometheus-tiktok` | Prometheus | `http://192.168.6.237:9090` | Points at LXC 108 — confirmed dependency |
Key finding: Both dashboards depend on InfluxDB at 192.168.6.132:8086,
not Prometheus. These dashboards will not function in the new Grafana unless an
InfluxDB datasource is added. The InfluxDB instance itself (wherever
192.168.6.132 lives) is not being decommissioned here.
Dashboards exported:
| Dashboard | File | Datasource | Panels |
|---|---|---|---|
| NFS File Monitoring System | `nfs-monitoring.json` | InfluxDB (`nfs-monitoring` bucket) | Documents Indexed, ES Docs Over Time, Container CPU, NFS Storage gauge |
| Proxbox Thermals | `proxbox_thermals.json` | InfluxDB (`thermal-data` bucket) | CPU package/core temps, NVMe temps (Proxmox host) |
Both JSON files are saved to `monitoring/grafana/dashboards/` for archival.
They will not render until an InfluxDB datasource is provisioned in the new
Grafana — see the follow-up TODO below.
### Coverage gap summary
| Gap | Action |
|---|---|
| `192.168.6.98:80` (tiktok-monitor2) | Added to new Prometheus as a static scrape target |
| InfluxDB datasource (`192.168.6.132:8086`) | Add to `monitoring/grafana/provisioning/datasources/` if the dashboards are wanted (sketch below) |
| Node exporter targets | `git pull` on infra-services so `nodes.yml` is picked up by `file_sd` |
| Traefik metrics | Enable `metrics.prometheus` in Traefik static config (see below) |
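The InfluxDB datasource action above could look like the following. This is a minimal sketch, assuming the file lives alongside the other provisioning YAML; the `organization` and `token` values are placeholders that were not captured in this audit:

```bash
# Run from the repo root; Grafana auto-provisions datasources from this directory.
cat > monitoring/grafana/provisioning/datasources/influxdb.yml <<'EOF'
apiVersion: 1
datasources:
  - name: influxdb-nfs-monitoring
    type: influxdb
    access: proxy
    url: http://192.168.6.132:8086
    jsonData:
      version: Flux
      organization: CHANGE_ME        # placeholder - not captured in the audit
      defaultBucket: nfs-monitoring
    secureJsonData:
      token: CHANGE_ME               # placeholder - source from the InfluxDB admin
EOF
```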
## Data Migration
### Prometheus TSDB (LXC 108)
Prometheus data is retention-bounded (typically 15-30d) and regenerated by ongoing scrapes. Migration is not recommended — the new Prometheus will build its own history. If historical data is needed for a specific investigation, it can be queried from the old instance before shutdown.
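If a specific investigation does need old data, a hedged example of a one-off pull from the old instance before shutdown (the metric, range, and step here are illustrative):

```bash
# Dump 7 days of the `up` series from LXC 108 before it goes away (GNU date).
curl -s 'http://192.168.6.237:9090/api/v1/query_range' \
  --data-urlencode 'query=up' \
  --data-urlencode "start=$(date -d '7 days ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=5m' | jq . > lxc108-up-history.json
```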
### Grafana dashboards (LXC 112)
Exported as JSON (see audit results above) and placed in
`monitoring/grafana/dashboards/`. The provisioning config auto-imports them,
but the InfluxDB datasource must be added first.
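For reference, the auto-import relies on a file-based dashboard provider; a minimal sketch of that shape (the filename and container-side path are assumptions, since the deployed provisioning config was not captured here):

```bash
cat > monitoring/grafana/provisioning/dashboards/default.yml <<'EOF'
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards   # assumed path inside the Grafana container
EOF
```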
## Decommission Steps
After the audit is complete and any re-pointing is done:
### 1. Stop the LXC services
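On the Proxmox host (LXC 101 is already stopped, so only the two running containers need stopping):

```bash
pct stop 108   # prometheus
pct stop 112   # grafana
```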
### 2. Wait and verify
Leave the LXCs stopped for 48 hours. Monitor for:

- Missing scrape targets in Prometheus (`up == 0` alerts) — see the query sketch after this list
- Broken Grafana dashboards on the new stack
- Any service complaining about a missing `192.168.6.237` or `192.168.6.249`
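A quick check for the first item, run from any workstation against the new stack:

```bash
# Any instances currently failing scrapes on the new Prometheus:
curl -sG 'http://prometheus.infra.realemail.app/api/v1/query' \
  --data-urlencode 'query=up == 0' \
  | jq -r '.data.result[].metric.instance'
```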
### 3. Destroy the LXCs
Once confident nothing depends on them:
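```bash
# On the Proxmox host; pct destroy refuses to act on a running container.
pct destroy 101 --purge
pct destroy 108 --purge
pct destroy 112 --purge
```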
`pct destroy` removes the container and its volumes; `--purge` additionally removes the VMID from backup, replication, and HA job configurations.
### 4. Reclaim resources
Expected recovery: ~2.75GB RAM allocation, ~11GB disk on local-lvm.
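To confirm the reclaim after the destroys, still on the Proxmox host:

```bash
pct list                        # 101, 108, and 112 should be gone from the list
pvesm status | grep local-lvm   # available space should reflect the ~11GB freed
```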
### 5. Update documentation
- Remove LXC 101/108/112 from `docs/architecture/lab-audit.md` section 3.2 (mark as decommissioned or remove the rows).
- Remove the Phase 6 migration TODO row from `README.md`.
- Add a note to `docs/postmortems/index.md` recording the decommission date.