# Decommission Old Monitoring LXCs
Runbook for migrating data from and decommissioning LXC 101, 108, and 112 on Proxmox, which are superseded by the Phase 5 monitoring stack on infra-services.
## Targets

| VMID | Name | IP | RAM | Disk | Status | Superseded By |
|---|---|---|---|---|---|---|
| 101 | `alpine-prometheus` | — | 256MB | 1GB | stopped | Phase 5 Prometheus |
| 108 | `prometheus` | 192.168.6.237 | 2GB | 8GB | running | Phase 5 Prometheus (infra-services) |
| 112 | `grafana` | 192.168.6.249 | 512MB | 2GB | running | Phase 5 Grafana (infra-services) |
There are also prometheus and grafana containers on saltierpoop (Saltbox-managed).
Those are out of scope for this runbook — they stay as Saltbox-internal
monitoring. The new canonical stack is on infra-services at 192.168.6.17.
## Pre-flight Audit
Before destroying anything, collect configuration from the running LXCs so we can identify what was sending data to them and re-point if needed.
### Step 1 — Audit LXC 108 (Prometheus)
SSH into LXC 108 and extract its scrape configuration:
```bash
ssh root@192.168.6.237

# Find the Prometheus config
cat /etc/prometheus/prometheus.yml   # standard location
# or: find / -name prometheus.yml 2>/dev/null

# Save the scrape_configs section — this lists every target that was
# being scraped. Each target is a "data producer" that may need re-pointing.
```
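To pull just the target addresses out of that config, two hedged options (the `grep` assumes the standard `static_configs` layout; the `yq` line only applies if that tool happens to be installed on the LXC):

```bash
# Crude but dependency-free: show each targets: block with some context
grep -B4 -A3 'targets:' /etc/prometheus/prometheus.yml

# Cleaner, if mikefarah yq is installed:
# yq '.scrape_configs[].static_configs[].targets[]' /etc/prometheus/prometheus.yml
```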
Record the output. For each scrape target, determine:
- **Already covered?** Check if the target IP appears in the new `monitoring/targets/nodes.yml` (generated from inventory). If yes, no action. A grep sketch follows this list.
- **Missing?** Add the host/appliance to `inventory/hosts/` or `inventory/appliances/` with `ansible.managed: true` so `render-prometheus.py` generates a target for it.
- **Irrelevant?** If the target was a stopped LXC or dead service, skip it.
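For the "already covered?" check, a grep from the repo root; the IP here is `tiktok-monitor2`'s from the audit results below and stands in for whichever target you are checking:

```bash
# No hits in monitoring/targets/nodes.yml means the target is not yet covered.
grep -rn '192.168.6.98' monitoring/targets/nodes.yml inventory/
```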
### Step 2 — Audit LXC 112 (Grafana)
SSH into LXC 112 and extract its datasource config and dashboards:
```bash
ssh root@192.168.6.249

# Grafana datasources (shows what Prometheus/InfluxDB it reads from)
cat /etc/grafana/provisioning/datasources/*.yml 2>/dev/null
# or check the Grafana API:
curl -s http://localhost:3000/api/datasources | jq .

# Export dashboards worth keeping
curl -s http://localhost:3000/api/search | jq '.[].uri'
# For each dashboard you want to preserve:
# curl -s http://localhost:3000/api/dashboards/uid/<UID> | jq .dashboard > dashboard-name.json
```
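To export every dashboard in one pass instead of one at a time, a sketch that assumes this Grafana is 5.x or newer, where `/api/search` results include a `uid` field:

```bash
# Save each dashboard as <uid>.json in the current directory.
for uid in $(curl -s 'http://localhost:3000/api/search?type=dash-db' | jq -r '.[].uid'); do
  curl -s "http://localhost:3000/api/dashboards/uid/${uid}" | jq .dashboard > "${uid}.json"
done
```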
If any datasource points at LXC 108 (192.168.6.237:9090), that confirms
the dependency. The new Prometheus is at 192.168.6.17:9090.
To import any valuable dashboards, drop the exported JSON files into
`monitoring/grafana/dashboards/`; the new Grafana auto-provisions them.
### Step 3 — Audit saltierpoop Prometheus (informational only)
This stack is Saltbox-managed and out of scope, but worth documenting: record which targets it scrapes. These are likely Saltbox-internal services only (sonarr, radarr, etc.) and do not need re-pointing.
### Step 4 — Verify new stack coverage
On your workstation, confirm the new Prometheus is scraping everything important:
```bash
# Check current targets
curl -s http://prometheus.infra.realemail.app/api/v1/targets | jq '.data.activeTargets[].labels.instance'
```
Compare this list against the targets found in LXC 108's config. Any gaps should be added to inventory before proceeding.
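One way to mechanize that comparison, assuming Step 1's targets were saved to `old-targets.txt` as one `host:port` per line (the filename is illustrative):

```bash
# New stack's active targets, normalized:
curl -s http://prometheus.infra.realemail.app/api/v1/targets \
  | jq -r '.data.activeTargets[].labels.instance' | sort -u > new-targets.txt

# Lines only in old-targets.txt are potential coverage gaps:
comm -23 <(sort -u old-targets.txt) new-targets.txt
```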
## Audit Results (completed 2026-05-14)
### LXC 108 (Prometheus) findings

Scrape config at `/etc/prometheus/prometheus.yml`:

| Job | Target | Status |
|---|---|---|
| `prometheus` | `localhost:9090` | Self-scrape — irrelevant after decom |
| `tiktok-monitor2` | `192.168.6.98:80` | Needs assessment — customer-app metrics. Decide if this should be added to the new stack |
No alerting rules were configured. No remote-write destinations.
### LXC 112 (Grafana) findings
Datasources:
| Name | Type | URL | Notes |
|---|---|---|---|
| `influxdb-nfs-monitoring` | InfluxDB (Flux) | `http://192.168.6.132:8086` | Reads `nfs-monitoring` bucket |
| `proxbox-influxdb` | InfluxDB (Flux) | `http://192.168.6.132:8086` | Reads `thermal-data` bucket |
| `proxbox-prometheus-tiktok` | Prometheus | `http://192.168.6.237:9090` | Points at LXC 108 — confirmed dependency |
Key finding: Both dashboards depend on InfluxDB at 192.168.6.132:8086,
not Prometheus. These dashboards will not function in the new Grafana unless an
InfluxDB datasource is added. The InfluxDB instance itself (wherever
192.168.6.132 lives) is not being decommissioned here.
Dashboards exported:
| Dashboard | File | Datasource | Panels |
|---|---|---|---|
| NFS File Monitoring System | `nfs-monitoring.json` | InfluxDB (`nfs-monitoring` bucket) | Documents Indexed, ES Docs Over Time, Container CPU, NFS Storage gauge |
| Proxbox Thermals | `proxbox_thermals.json` | InfluxDB (`thermal-data` bucket) | CPU package/core temps, NVMe temps (Proxmox host) |
Both JSON files are saved to `monitoring/grafana/dashboards/` for archival.
They will not render until an InfluxDB datasource is provisioned in the new
Grafana — see the follow-up TODO below.
### Coverage gap summary
| Gap | Action |
|---|---|
| `192.168.6.98:80` (tiktok-monitor2) | Added to new Prometheus as a static scrape target |
| InfluxDB datasource (`192.168.6.132:8086`) | Add to `monitoring/grafana/provisioning/datasources/` if the dashboards are wanted (sketch below) |
| Node exporter targets | `git pull` on infra-services so `nodes.yml` is picked up by `file_sd` |
| Traefik metrics | Enable `metrics.prometheus` in Traefik static config (see below) |
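The InfluxDB datasource action above could look like the following. This is a minimal sketch, assuming the file lives alongside the other provisioning YAML; the `organization` and `token` values are placeholders that were not captured in this audit:

```bash
# Run from the repo root; Grafana auto-provisions datasources from this directory.
cat > monitoring/grafana/provisioning/datasources/influxdb.yml <<'EOF'
apiVersion: 1
datasources:
  - name: influxdb-nfs-monitoring
    type: influxdb
    access: proxy
    url: http://192.168.6.132:8086
    jsonData:
      version: Flux
      organization: CHANGE_ME        # placeholder - not captured in the audit
      defaultBucket: nfs-monitoring
    secureJsonData:
      token: CHANGE_ME               # placeholder - source from the InfluxDB admin
EOF
```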
## Data Migration
### Prometheus TSDB (LXC 108)
Prometheus data is retention-bounded (typically 15-30d) and regenerated by ongoing scrapes. Migration is not recommended — the new Prometheus will build its own history. If historical data is needed for a specific investigation, it can be queried from the old instance before shutdown.
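If a specific investigation does need old data, a hedged example of a one-off pull from the old instance before shutdown (the metric, range, and step here are illustrative):

```bash
# Dump 7 days of the `up` series from LXC 108 before it goes away (GNU date).
curl -s 'http://192.168.6.237:9090/api/v1/query_range' \
  --data-urlencode 'query=up' \
  --data-urlencode "start=$(date -d '7 days ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=5m' | jq . > lxc108-up-history.json
```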
### Grafana dashboards (LXC 112)
Exported as JSON (see audit results above) and placed in
`monitoring/grafana/dashboards/`. The provisioning config auto-imports them,
but the InfluxDB datasource must be added first.
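For reference, the auto-import relies on a file-based dashboard provider; a minimal sketch of that shape (the filename and container-side path are assumptions, since the deployed provisioning config was not captured here):

```bash
cat > monitoring/grafana/provisioning/dashboards/default.yml <<'EOF'
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards   # assumed path inside the Grafana container
EOF
```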
## Decommission Steps
After the audit is complete and any re-pointing is done:
### 1. Stop the LXC services
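On the Proxmox host (LXC 101 is already stopped, so only the two running containers need stopping):

```bash
pct stop 108   # prometheus
pct stop 112   # grafana
```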
### 2. Wait and verify
Leave the LXCs stopped for 48 hours. Monitor for:

- Missing scrape targets in Prometheus (`up == 0` alerts) — see the query sketch after this list
- Broken Grafana dashboards on the new stack
- Any service complaining about a missing `192.168.6.237` or `192.168.6.249`
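A quick check for the first item, run from any workstation against the new stack:

```bash
# Any instances currently failing scrapes on the new Prometheus:
curl -sG 'http://prometheus.infra.realemail.app/api/v1/query' \
  --data-urlencode 'query=up == 0' \
  | jq -r '.data.result[].metric.instance'
```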
### 3. Destroy the LXCs
Once confident nothing depends on them:
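```bash
# On the Proxmox host; pct destroy refuses to act on a running container.
pct destroy 101 --purge
pct destroy 108 --purge
pct destroy 112 --purge
```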
`pct destroy` removes the container and its volumes; `--purge` additionally removes the VMID from backup, replication, and HA job configurations.
### 4. Reclaim resources
Expected recovery: ~2.75GB RAM allocation, ~11GB disk on local-lvm.
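To confirm the reclaim after the destroys, still on the Proxmox host:

```bash
pct list                        # 101, 108, and 112 should be gone from the list
pvesm status | grep local-lvm   # available space should reflect the ~11GB freed
```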
### 5. Update documentation
- Remove LXC 101/108/112 from `docs/architecture/lab-audit.md` section 3.2 (mark as decommissioned or remove the rows).
- Remove the Phase 6 migration TODO row from `README.md`.
- Add a note to `docs/postmortems/index.md` recording the decommission date.