Decommission Old Monitoring LXCs

Runbook for migrating data from LXCs 101, 108, and 112 on Proxmox and then decommissioning them; all three are superseded by the Phase 5 monitoring stack on infra-services.

Targets

| VMID | Name | IP | RAM | Disk | Status | Superseded By |
|------|------|----|-----|------|--------|---------------|
| 101 | alpine-prometheus | - | 256MB | 1GB | stopped | Phase 5 Prometheus |
| 108 | prometheus | 192.168.6.237 | 2GB | 8GB | running | Phase 5 Prometheus (infra-services) |
| 112 | grafana | 192.168.6.249 | 512MB | 2GB | running | Phase 5 Grafana (infra-services) |

There are also prometheus and grafana containers on saltierpoop (Saltbox-managed). Those are out of scope for this runbook — they stay as Saltbox-internal monitoring. The new canonical stack is on infra-services at 192.168.6.17.

Pre-flight Audit

Before destroying anything, collect configuration from the running LXCs so we can identify what was sending data to them and re-point if needed.

Step 1 — Audit LXC 108 (Prometheus)

SSH into LXC 108 and extract its scrape configuration:

ssh root@192.168.6.237

# Find the Prometheus config
cat /etc/prometheus/prometheus.yml   # standard location
# or: find / -name prometheus.yml 2>/dev/null

# Save the scrape_configs section — this lists every target that was
# being scraped. Each target is a "data producer" that may need re-pointing.

Record the output. For each scrape target, determine:

  • Already covered? Check if the target IP appears in the new monitoring/targets/nodes.yml (generated from inventory). If yes, no action.
  • Missing? Add the host/appliance to inventory/hosts/ or inventory/appliances/ with ansible.managed: true so render-prometheus.py generates a target for it (see the sketch after this list).
  • Irrelevant? If the target was a stopped LXC or dead service, skip it.
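
For the "Missing?" case above, a minimal sketch of an inventory entry. The exact schema is whatever render-prometheus.py consumes; the field names below are illustrative, so mirror an existing file under inventory/hosts/ rather than copying this verbatim:

# inventory/hosts/tiktok-monitor2.yml -- illustrative sketch only
name: tiktok-monitor2
ip: 192.168.6.98
ansible.managed: true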

Step 2 — Audit LXC 112 (Grafana)

SSH into LXC 112 and extract its datasource config and dashboards:

ssh root@192.168.6.249

# Grafana datasources (shows what Prometheus/InfluxDB it reads from)
cat /etc/grafana/provisioning/datasources/*.y*ml 2>/dev/null   # .yml or .yaml
# or check the Grafana API:
curl -s http://localhost:3000/api/datasources | jq .

# Export dashboards worth keeping
curl -s http://localhost:3000/api/search | jq '.[].uri'
# For each dashboard you want to preserve:
# curl -s http://localhost:3000/api/dashboards/uid/<UID> | jq .dashboard > dashboard-name.json
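
To export everything in one pass rather than dashboard by dashboard, a loop over the search API works (this assumes the instance returns a uid field from /api/search, which holds for Grafana 5 and later):

# Export all dashboards by UID
for uid in $(curl -s 'http://localhost:3000/api/search?type=dash-db' | jq -r '.[].uid'); do
  curl -s "http://localhost:3000/api/dashboards/uid/${uid}" | jq '.dashboard' > "${uid}.json"
done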

If any datasource points at LXC 108 (192.168.6.237:9090), that confirms the dependency. The new Prometheus is at 192.168.6.17:9090.

Import any valuable dashboard JSON files into the new Grafana at monitoring/grafana/dashboards/ and they will be auto-provisioned.
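
For reference, file-based dashboard auto-provisioning in Grafana is driven by a provider config roughly like the following. The file name and the in-container path are assumptions about the new stack's layout, not confirmed from it:

# monitoring/grafana/provisioning/dashboards/default.yml -- sketch
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards   # adjust to the container's mount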

Step 3 — Audit saltierpoop Prometheus (informational only)

This is Saltbox-managed and out of our scope, but useful to document:

ssh someone@192.168.6.243
docker exec prometheus cat /etc/prometheus/prometheus.yml

Record which targets it scrapes. These are likely Saltbox-internal services only (sonarr, radarr, etc.) and do not need re-pointing.

Step 4 — Verify new stack coverage

On your workstation, confirm the new Prometheus is scraping everything important:

# Check current targets
curl -s http://prometheus.infra.realemail.app/api/v1/targets | jq '.data.activeTargets[].labels.instance'

Compare this list against the targets found in LXC 108's config. Any gaps should be added to inventory before proceeding.
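
One crude but serviceable way to diff the two lists, assuming the LXC 108 config was saved locally as prometheus-108.yml:

# Extract host:port pairs from the old config and the new active targets
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+' prometheus-108.yml | sort -u > old-targets.txt
curl -s http://prometheus.infra.realemail.app/api/v1/targets \
  | jq -r '.data.activeTargets[].labels.instance' | sort -u > new-targets.txt
comm -23 old-targets.txt new-targets.txt   # targets only in the old config = gaps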

Audit Results (completed 2026-05-14)

LXC 108 (Prometheus) findings

Scrape config at /etc/prometheus/prometheus.yml:

| Job | Target | Status |
|-----|--------|--------|
| prometheus | localhost:9090 | Self-scrape; irrelevant after decommission |
| tiktok-monitor2 | 192.168.6.98:80 | Needs assessment; customer-app metrics. Decide whether to add it to the new stack |

No alerting rules were configured. No remote-write destinations.
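
For reference, carrying the tiktok-monitor2 target forward as a static job looks roughly like this in the new Prometheus config (the job name is an assumption, and in this repo the target may belong in a file_sd file generated by render-prometheus.py instead):

# Addition to the new stack's prometheus.yml -- sketch
scrape_configs:
  - job_name: tiktok-monitor2
    static_configs:
      - targets: ['192.168.6.98:80']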

LXC 112 (Grafana) findings

Datasources:

| Name | Type | URL | Notes |
|------|------|-----|-------|
| influxdb-nfs-monitoring | InfluxDB (Flux) | http://192.168.6.132:8086 | Reads nfs-monitoring bucket |
| proxbox-influxdb | InfluxDB (Flux) | http://192.168.6.132:8086 | Reads thermal-data bucket |
| proxbox-prometheus-tiktok | Prometheus | http://192.168.6.237:9090 | Points at LXC 108; confirmed dependency |

Key finding: Both dashboards depend on InfluxDB at 192.168.6.132:8086, not Prometheus. These dashboards will not function in the new Grafana unless an InfluxDB datasource is added. The InfluxDB instance itself (wherever 192.168.6.132 lives) is not being decommissioned here.

Dashboards exported:

| Dashboard | File | Datasource | Panels |
|-----------|------|------------|--------|
| NFS File Monitoring System | nfs-monitoring.json | InfluxDB (nfs-monitoring bucket) | Documents Indexed, ES Docs Over Time, Container CPU, NFS Storage gauge |
| Proxbox Thermals | proxbox_thermals.json | InfluxDB (thermal-data bucket) | CPU package/core temps, NVMe temps (Proxmox host) |

Both JSON files are saved to monitoring/grafana/dashboards/ for archival. They will not render until an InfluxDB datasource is provisioned in the new Grafana; see the coverage gap summary below.

Coverage gap summary

| Gap | Action |
|-----|--------|
| 192.168.6.98:80 (tiktok-monitor2) | Added to new Prometheus as a static scrape target |
| InfluxDB datasource (192.168.6.132:8086) | Add to monitoring/grafana/provisioning/datasources/ if the dashboards are wanted |
| Node exporter targets | git pull on infra-services so nodes.yml is picked up by file_sd |
| Traefik metrics | Enable metrics.prometheus in the Traefik static config (see sketch below) |
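
The Traefik change referenced in the last row: enable the Prometheus endpoint in Traefik's static configuration. A minimal sketch, assuming Traefik v2+ and a dedicated metrics entrypoint on port 8082 (the port is a common convention, not confirmed from this deployment):

# traefik.yml (static config) -- sketch
entryPoints:
  metrics:
    address: ":8082"
metrics:
  prometheus:
    entryPoint: metrics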

Data Migration

Prometheus TSDB (LXC 108)

Prometheus data is retention-bounded (typically 15-30d) and regenerated by ongoing scrapes. Migration is not recommended — the new Prometheus will build its own history. If historical data is needed for a specific investigation, it can be queried from the old instance before shutdown.
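
If a one-off historical pull is wanted before shutdown, the old instance's HTTP API serves range queries. A sketch, using an illustrative metric name and GNU date:

# Pull 7 days of one series at 5m resolution from the old instance
curl -s 'http://192.168.6.237:9090/api/v1/query_range' \
  --data-urlencode 'query=node_memory_MemAvailable_bytes' \
  --data-urlencode "start=$(date -d '7 days ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=5m' | jq . > old-metrics.json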

Grafana dashboards (LXC 112)

Exported as JSON (see audit results above) and placed in monitoring/grafana/dashboards/. The provisioning config auto-imports them, but the InfluxDB datasource must be added first.
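
A sketch of that missing datasource file; the organization and token were not captured in the audit and are placeholders here (Flux datasources require both):

# monitoring/grafana/provisioning/datasources/influxdb.yml -- sketch
apiVersion: 1
datasources:
  - name: influxdb-nfs-monitoring
    type: influxdb
    access: proxy
    url: http://192.168.6.132:8086
    jsonData:
      version: Flux
      organization: <org>              # placeholder
      defaultBucket: nfs-monitoring
    secureJsonData:
      token: <token>                   # placeholder; inject via secrets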

Decommission Steps

After the audit is complete and any re-pointing is done:

1. Stop the LXC services

# On Proxmox host (192.168.6.71)
pct stop 108
pct stop 112
# 101 is already stopped
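
Confirm both are actually down before starting the clock:

pct status 108   # expect "status: stopped"
pct status 112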

2. Wait and verify

Leave the LXCs stopped for 48 hours. Monitor for:

  • Missing scrape targets in Prometheus (up == 0 alerts; a query sketch follows this list)
  • Broken Grafana dashboards on the new stack
  • Any service complaining about a missing 192.168.6.237 or 192.168.6.249
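
A direct check for the first item, runnable from the workstation:

# List any target the new Prometheus currently fails to scrape
curl -s 'http://prometheus.infra.realemail.app/api/v1/query' \
  --data-urlencode 'query=up == 0' | jq -r '.data.result[].metric.instance'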

3. Destroy the LXCs

Once confident nothing depends on them:

pct destroy 101 --purge
pct destroy 108 --purge
pct destroy 112 --purge

pct destroy removes the container and the volumes it owns; --purge additionally removes the VMID from related configurations such as backup jobs, replication jobs, and HA resources.

4. Reclaim resources

Expected recovery: ~2.75GB RAM allocation, ~11GB disk on local-lvm.

5. Update documentation

  • Remove LXC 101/108/112 from docs/architecture/lab-audit.md section 3.2 (mark as decommissioned or remove rows).
  • Remove the Phase 6 migration TODO row from README.md.
  • Add a note to docs/postmortems/index.md recording the decommission date.