Decommission Old Monitoring LXCs

Runbook for migrating data off LXC 101, 108, and 112 on Proxmox and then decommissioning them. All three are superseded by the Phase 5 monitoring stack on infra-services.

Targets

| VMID | Name | IP | RAM | Disk | Status | Superseded By |
|------|------|----|-----|------|--------|---------------|
| 101 | alpine-prometheus | | 256MB | 1GB | stopped | Phase 5 Prometheus |
| 108 | prometheus | 192.168.6.237 | 2GB | 8GB | running | Phase 5 Prometheus (infra-services) |
| 112 | grafana | 192.168.6.249 | 512MB | 2GB | running | Phase 5 Grafana (infra-services) |

There are also prometheus and grafana containers on saltierpoop (Saltbox-managed). Those are out of scope for this runbook — they stay as Saltbox-internal monitoring. The new canonical stack is on infra-services at 192.168.6.17.

Pre-flight Audit

Before destroying anything, collect configuration from the running LXCs so we can identify what was sending data to them and re-point if needed.

Step 1 — Audit LXC 108 (Prometheus)

SSH into LXC 108 and extract its scrape configuration:

ssh root@192.168.6.237

# Find the Prometheus config
cat /etc/prometheus/prometheus.yml   # standard location
# or: find / -name prometheus.yml 2>/dev/null

# Save the scrape_configs section — this lists every target that was
# being scraped. Each target is a "data producer" that may need re-pointing.

Record the output. For each scrape target, determine:

  • Already covered? Check if the target IP appears in the new monitoring/targets/nodes.yml (generated from inventory). If yes, no action.
  • Missing? Add the host/appliance to inventory/hosts/ or inventory/appliances/ with ansible.managed: true so render-prometheus.py generates a target for it.
  • Irrelevant? If the target was a stopped LXC or dead service, skip it.
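The covered/missing triage above can be roughed out with a small helper. This is a minimal sketch, not part of the existing tooling: `list_scrape_targets` and `classify_targets` are hypothetical names, and the host:port extraction is a text heuristic rather than a YAML parse, so review the output by hand (it cannot tell an irrelevant target from a missing one).

```shell
# list_scrape_targets FILE: print unique host:port strings found in a
# Prometheus config. Heuristic: matches any token shaped like name:port.
list_scrape_targets() {
  grep -oE '[A-Za-z0-9._-]+:[0-9]{1,5}' "$1" | LC_ALL=C sort -u
}

# classify_targets CONFIG NODES_YML: label each target "covered" if its
# host part appears as a whole word in NODES_YML, otherwise "missing".
classify_targets() {
  list_scrape_targets "$1" | while IFS= read -r target; do
    host=${target%:*}
    if grep -qwF -- "$host" "$2"; then
      echo "covered $target"
    else
      echo "missing $target"
    fi
  done
}
```

Assuming the old config was copied to the workstation as `old-prometheus.yml`, a call would look like `classify_targets old-prometheus.yml monitoring/targets/nodes.yml`. The self-scrape entry (`localhost:9090`) will show as missing and can be ignored.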

Step 2 — Audit LXC 112 (Grafana)

SSH into LXC 112 and extract its datasource config and dashboards:

ssh root@192.168.6.249

# Grafana datasources (shows what Prometheus/InfluxDB it reads from)
cat /etc/grafana/provisioning/datasources/*.yml 2>/dev/null
# or check the Grafana API:
curl -s http://localhost:3000/api/datasources | jq .

# Export dashboards worth keeping: list each dashboard's UID and title first
curl -s 'http://localhost:3000/api/search?type=dash-db' | jq -r '.[] | "\(.uid)\t\(.title)"'
# For each dashboard you want to preserve:
# curl -s http://localhost:3000/api/dashboards/uid/<UID> | jq .dashboard > dashboard-name.json

If any datasource points at LXC 108 (192.168.6.237:9090), that confirms the dependency. The new Prometheus is at 192.168.6.17:9090.

Copy any dashboard JSON worth keeping into monitoring/grafana/dashboards/; the new Grafana auto-provisions every file in that directory.
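If there are more than a handful of dashboards, the per-dashboard export can be looped. The sketch below assumes the Grafana API answers without credentials (as the curl calls in this step do) and that `jq` is installed; `slugify` and `export_dashboards` are hypothetical helper names, and the generated file names will not necessarily match names already used in the repo.

```shell
# slugify TITLE: lower-case a dashboard title and collapse every run of
# non-alphanumeric characters into a single hyphen.
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

# export_dashboards BASE_URL OUT_DIR: save every dashboard as
# OUT_DIR/<slugified-title>.json using the search and dashboards APIs.
export_dashboards() {
  base=$1
  out=$2
  mkdir -p "$out"
  curl -s "$base/api/search?type=dash-db" \
    | jq -r '.[] | [.uid, .title] | @tsv' \
    | while IFS="$(printf '\t')" read -r uid title; do
        curl -s "$base/api/dashboards/uid/$uid" \
          | jq .dashboard > "$out/$(slugify "$title").json"
      done
}
```

Run on LXC 112, this would be invoked as `export_dashboards http://localhost:3000 ./exported`. Two dashboards with the same title would overwrite each other, so check the file count against the search listing.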

Step 3 — Audit saltierpoop Prometheus (informational only)

This is Saltbox-managed and out of our scope, but useful to document:

ssh someone@192.168.6.243
docker exec prometheus cat /etc/prometheus/prometheus.yml

Record which targets it scrapes. These are likely Saltbox-internal services only (sonarr, radarr, etc.) and do not need re-pointing.

Step 4 — Verify new stack coverage

On your workstation, confirm the new Prometheus is scraping everything important:

# Check current targets
curl -s http://prometheus.infra.realemail.app/api/v1/targets | jq '.data.activeTargets[].labels.instance'

Compare this list against the targets found in LXC 108's config. Any gaps should be added to inventory before proceeding.
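The comparison can be done mechanically once both lists are saved one target per line. A minimal sketch, assuming exact string equality is the right test (a target scraped on a different port in the new stack will show up as a gap); `coverage_gaps` is a hypothetical helper name.

```shell
# coverage_gaps OLD_LIST NEW_LIST: print entries present in OLD_LIST but
# absent from NEW_LIST. Both files hold one target per line.
coverage_gaps() {
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  LC_ALL=C sort -u "$1" > "$old_sorted"
  LC_ALL=C sort -u "$2" > "$new_sorted"
  # comm -23 keeps lines unique to the first file
  LC_ALL=C comm -23 "$old_sorted" "$new_sorted"
  rm -f "$old_sorted" "$new_sorted"
}
```

The new list can be produced by adding `-r` to the jq call above and redirecting it to a file, for example `new-targets.txt`, then running `coverage_gaps old-targets.txt new-targets.txt`. Empty output means every old target is present in the new stack.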

Audit Results (completed 2026-05-14)

LXC 108 (Prometheus) findings

Scrape config at /etc/prometheus/prometheus.yml:

| Job | Target | Status |
|-----|--------|--------|
| prometheus | localhost:9090 | Self-scrape; irrelevant after decom |
| tiktok-monitor2 | 192.168.6.98:80 | Needs assessment: customer-app metrics. Decide if this should be added to the new stack |

No alerting rules were configured. No remote-write destinations.

LXC 112 (Grafana) findings

Datasources:

| Name | Type | URL | Notes |
|------|------|-----|-------|
| influxdb-nfs-monitoring | InfluxDB (Flux) | http://192.168.6.132:8086 | Reads nfs-monitoring bucket |
| proxbox-influxdb | InfluxDB (Flux) | http://192.168.6.132:8086 | Reads thermal-data bucket |
| proxbox-prometheus-tiktok | Prometheus | http://192.168.6.237:9090 | Points at LXC 108 (confirmed dependency) |

Key finding (2026-05 audit): Both dashboards depended on InfluxDB at 192.168.6.132:8086, not Prometheus. Superseded 2026-06-24 (Phase 3): dashboards rewritten for Prometheus; LXC 111 archived and destroyed; Influx datasources removed from infra-services Grafana.

Dashboards exported:

| Dashboard | File | Datasource | Panels |
|-----------|------|------------|--------|
| NFS File Monitoring System | nfs-monitoring.json | InfluxDB (nfs-monitoring bucket) | Documents Indexed, ES Docs Over Time, Container CPU, NFS Storage gauge |
| Proxbox Thermals | proxbox_thermals.json | InfluxDB (thermal-data bucket) | CPU package/core temps, NVMe temps (Proxmox host) |

Both JSON files were saved to monitoring/grafana/dashboards/ for archival. Proxbox Thermals was rewritten for Prometheus (Phase 3); NFS Monitoring remains a historical export (LXC 114 retired — no live NFS probe metrics).

Coverage gap summary

All gaps identified during audit have been resolved:

| Gap | Action | Status |
|-----|--------|--------|
| 192.168.6.98:80 (tiktok-monitor2) | Added to prometheus.yml as static scrape target | Done |
| InfluxDB datasource (192.168.6.132:8086) | Interim bridge (May 2026); removed Phase 3; Prometheus-only datasources | Superseded |
| Node exporter targets | nodes.yml committed to git; file_sd picks it up after git pull | Done |
| Traefik metrics | Added metrics entrypoint on :8080 and metrics.prometheus to traefik.yml | Done |
| INFLUXDB_TOKEN | Interim bridge (May 2026); removed Phase 3 from compose and SOPS | Superseded |

Data Migration

Prometheus TSDB (LXC 108)

Prometheus data is retention-bounded (typically 15-30d; the Prometheus default is 15d unless --storage.tsdb.retention.time is set) and is regenerated by ongoing scrapes. Migration is not recommended: the new Prometheus will build its own history. If historical data is needed for a specific investigation, query it from the old instance before shutdown.
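For a one-off look at history before shutdown, the old instance can be queried through the standard Prometheus `query_range` HTTP API. The sketch below is illustrative: `range_window` and `query_history` are hypothetical helper names, and the 5-minute step is an arbitrary choice.

```shell
# range_window DAYS [NOW]: print "START END" Unix timestamps covering the
# last DAYS days. NOW defaults to the current time.
range_window() {
  now=${2:-$(date +%s)}
  echo "$(( now - $1 * 86400 )) $now"
}

# query_history PROMQL DAYS: run a range query against the old Prometheus
# on LXC 108 and print the raw JSON response.
query_history() {
  window=$(range_window "$2")
  start=${window% *}
  end=${window#* }
  curl -sG 'http://192.168.6.237:9090/api/v1/query_range' \
    --data-urlencode "query=$1" \
    --data-urlencode "start=$start" \
    --data-urlencode "end=$end" \
    --data-urlencode 'step=300'
}
```

For example, `query_history 'up' 15 > up-last-15d.json` would save 15 days of the `up` series to a file for later inspection.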

Grafana dashboards (LXC 112)

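Exported as JSON (see audit results above) and placed in monitoring/grafana/dashboards/, where the provisioning config auto-imports them. A dashboard only renders if its datasource exists: the original exports needed an InfluxDB datasource, which was added as an interim bridge in May 2026 and removed in Phase 3. Since Phase 3, Proxbox Thermals reads from Prometheus and NFS Monitoring is kept as a historical export only.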
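Because an imported dashboard depends on its datasource, it can help to list which JSON files still mention InfluxDB. This is a rough text match, not a JSON parse, and `dashboards_using_influx` is a hypothetical helper name.

```shell
# dashboards_using_influx DIR: list dashboard JSON files in DIR whose
# text mentions "influx" (case-insensitive).
dashboards_using_influx() {
  grep -li 'influx' "$1"/*.json
}
```

Running `dashboards_using_influx monitoring/grafana/dashboards` from the repo root prints one path per matching file; no output means no file references InfluxDB.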

Decommission Steps

After the audit is complete and any re-pointing is done:

1. Stop the LXCs

# On Proxmox host (192.168.6.71)
pct stop 108
pct stop 112
# 101 is already stopped
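A small guard makes the stop step safe to re-run. This sketch relies on `pct status <vmid>` printing a line of the form `status: running`; `stop_if_running` is a hypothetical helper name.

```shell
# stop_if_running VMID: stop the container only when pct reports it as
# running, so an already-stopped container such as 101 is left alone.
stop_if_running() {
  if pct status "$1" | grep -q 'status: running'; then
    pct stop "$1"
  else
    echo "LXC $1 is not running; nothing to stop"
  fi
}
```

On the Proxmox host this could be used as `for id in 108 112; do stop_if_running "$id"; done`. `pct shutdown` is the graceful alternative to `pct stop` if a clean service shutdown matters.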

2. Wait and verify

Leave the LXCs stopped for 48 hours. Monitor for:

  • Missing scrape targets in Prometheus (up == 0 alerts)
  • Broken Grafana dashboards on the new stack
  • Any service complaining about a missing 192.168.6.237 or 192.168.6.249
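Two of these checks can be scripted during the waiting period. The sketch below is one possible approach: `find_old_ip_refs` and `down_targets` are hypothetical helper names, and `down_targets` assumes `jq` is installed and that the new Prometheus is reachable at the URL used in Step 4.

```shell
# find_old_ip_refs DIR: list files under DIR that still mention the
# retired Prometheus (LXC 108) or Grafana (LXC 112) addresses.
find_old_ip_refs() {
  grep -rlF -e '192.168.6.237' -e '192.168.6.249' "$1"
}

# down_targets: print instances the new Prometheus currently reports
# as down (up == 0).
down_targets() {
  curl -sG 'http://prometheus.infra.realemail.app/api/v1/query' \
    --data-urlencode 'query=up == 0' \
    | jq -r '.data.result[].metric.instance'
}
```

Running `find_old_ip_refs .` at the root of the infrastructure repo should print nothing once every reference has been re-pointed.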

3. Destroy the LXCs

Once confident nothing depends on them:

pct destroy 101 --purge
pct destroy 108 --purge
pct destroy 112 --purge

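pct destroy already removes the container and the volumes it owns. --purge additionally removes the VMID from related configuration such as backup jobs, replication jobs, and HA resources.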
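Since destruction is irreversible, an explicit status check before each destroy is cheap insurance. `pct destroy` refuses a running container unless `--force` is given, so the guard below mostly makes the intent visible in the shell history; `destroy_if_stopped` is a hypothetical helper name.

```shell
# destroy_if_stopped VMID: run pct destroy --purge only when the container
# is stopped; otherwise print a warning to stderr and return non-zero.
destroy_if_stopped() {
  if pct status "$1" | grep -q 'status: stopped'; then
    pct destroy "$1" --purge
  else
    echo "LXC $1 is not stopped; refusing to destroy" >&2
    return 1
  fi
}
```

Usage on the Proxmox host would be `for id in 101 108 112; do destroy_if_stopped "$id" || break; done`, which halts at the first container that is still running.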

4. Reclaim resources

Expected recovery: ~2.75GB RAM allocation, ~11GB disk on local-lvm.
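The RAM figure follows from the Targets table (256MB + 2GB + 512MB), and the disk figure from 1GB + 8GB + 2GB. A short sketch of the RAM arithmetic, taking 1 GB as 1024 MB; `sum_mb` and `mb_to_gb` are hypothetical helper names.

```shell
# sum_mb MB...: add up allocations given in megabytes.
sum_mb() {
  total=0
  for mb in "$@"; do
    total=$(( total + mb ))
  done
  echo "$total"
}

# mb_to_gb MB: convert megabytes to gigabytes, two decimal places.
mb_to_gb() {
  LC_ALL=C awk -v mb="$1" 'BEGIN { printf "%.2f\n", mb / 1024 }'
}
```

`mb_to_gb "$(sum_mb 256 2048 512)"` gives 2.75, matching the expected RAM recovery. After the destroy step, `pvesm status` on the Proxmox host shows actual storage usage for comparison.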

5. Update documentation

  • Mark LXC 101/108/112 as decommissioned in docs/architecture/lab-audit.md section 3.2, or remove their rows.
  • Remove the Phase 6 migration TODO row from README.md.
  • Add a note to docs/postmortems/index.md recording the decommission date.