Proxmox storage remediation — proposal
Status: Partially executed (owner decisions 2026-06-24)
Date: 2026-06-24
Host: prox (192.168.6.71) — ASUS PN64, 1 TB NVMe
Context: Post–Wave A/B decommission; backup target migrated to Whrrr volume6
This document consolidates the live storage audit, resolved items, open risks, and recommended decision paths. It is the action proposal companion to the frozen point-in-time audit in prox-storage-2026-06-24.md.
Related:
- Compute disposition review — keep / consolidate / retire matrix
- Proxmox consolidation — end-state guest target
- Synology capacity ntfy — Whrrr volume alerts
- Compute decommission queue — vzdump + destroy order
Executive summary
Prox is not in hypervisor-level crisis after the recent backup work, but two production-adjacent workloads face different failure modes:
| Workload | Primary risk | Root cause |
|---|---|---|
| nfs-monitoring (114) | Imminent — guest disk ~full | 98.7% thin LV, 91% in-guest, 53 GB Docker (ES stack) |
| saltierpoop (100) | Operational — media I/O | Local disk 81% (OK for now); Prawns/Movems NFS at 100% |
Three storage layers must be managed independently:
- pve-root — OS, ISOs, local scratch (improved; ISO prune still available)
- local-lvm thin pool — guest disks (~270 GB thin free; unevenly consumed)
- Whrrr NFS — Prawns/synorpn nearly full; infra-backups on volume6 healthy
Recommendation in one line: ~~Gut or shrink 114~~ 114 retired 2026-06-24; manage guest disks only (Whrrr NFS pool layout is fixed for now); keep infra-backups as vzdump target; saltierpoop alert relief via periodic log hygiene.
Owner decisions (2026-06-24)
| Topic | Decision |
|---|---|
| Whrrr NFS / Prawns cleanup | Out of scope — pool layout prevents changes; only VM/LXC actions |
| nfs-monitoring (114) | Backup + destroy — executed 2026-06-24 |
| saltierpoop (100) | Keep on prox — marginal in-guest cleanup for disk alert relief |
Executed (2026-06-24)
| Action | Result |
|---|---|
| LXC 114 manual backup | infra-backups/dump/manual-lxc-114-rootfs-2026_06_24.tar.zst (~53 GB) + lxc-114-pct.conf |
| LXC 114 destroy | VMID 114 removed; ~110 GB thin pool freed |
| Prometheus | Removed nfs-monitoring-* scrape jobs from prometheus.yml |
| saltierpoop hygiene | journalctl --vacuum-size=500M (~3.4 GB), /tmp clear, syslog.1 truncate → 78% used / ~22% free on / |
saltierpoop ongoing hygiene (recommended)
Disk alerts fire when root filesystem free space drops below 20%
(monitoring/prometheus/alerts/node.yml). Saltierpoop flapped at ~19% free before cleanup.
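The 20% threshold can be checked by hand before or after a cleanup pass. The following is a minimal sketch, not the alert rule itself: the function names `free_pct` and `below_threshold` are hypothetical, and the actual expression in node.yml may compute free space slightly differently (for example, against total size rather than used plus available).

```shell
#!/bin/sh
# Hypothetical helper: approximate free-space percentage of a mount point.
# `df -P` prints one line per filesystem; column 5 is the used percentage.
free_pct() {
    used=$(df -P "${1:-/}" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
    echo $((100 - used))
}

# Returns 0 (true) when the given free percentage is below the threshold.
# The default of 20 mirrors the alert threshold described above.
below_threshold() {
    [ "$1" -lt "${2:-20}" ]
}

# Example (not executed here):
#   if below_threshold "$(free_pct /)"; then echo "root fs would alert"; fi
```

With the figures above, ~19% free trips the check and the post-cleanup ~22% does not.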
Safe periodic maintenance (no Docker prune without review):
# On saltierpoop (guest agent or SSH)
sudo journalctl --vacuum-size=500M
sudo find /tmp -mindepth 1 -mtime +7 -delete
sudo logrotate -f /etc/logrotate.d/rsyslog # if syslog grows again
Consider a monthly cron job (DSM or Ansible) for the journal vacuum if alerts return.
vzdump + NFS note
Large LXC vzdump to infra-backups failed: PVE lxc-usernsexec cannot create .tmp
dirs on NFS with current Synology squash (Map root to admin only). Workarounds:
- DSM NFS rule → Map all users to admin (fixes native vzdump), or
- Manual mount + tar | zstd to NFS (used for 114), or
- tmpdir on storage pointing to local scratch (only if archive fits ~27 GB free on root)
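The manual mount plus tar | zstd workaround can be sketched roughly as follows. This is an illustration under stated assumptions, not a record of the exact commands run for 114: the function name, the `/mnt/pve/infra-backups/dump` destination, and the dry-run switch are hypothetical, and the container is assumed to be stopped first. By default the function only prints the commands.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1 (default); execute it otherwise.
_run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else sh -c "$*"; fi
}

# Hypothetical sketch of a manual LXC rootfs backup to an NFS-backed directory.
manual_lxc_backup() {
    vmid=$1
    dest=${2:-/mnt/pve/infra-backups/dump}   # assumed mount point of the NFS storage
    rootfs=/var/lib/lxc/$vmid/rootfs         # where `pct mount` exposes the rootfs
    archive=$dest/manual-lxc-$vmid-rootfs-$(date +%Y_%m_%d).tar.zst

    _run "pct config $vmid > $dest/lxc-$vmid-pct.conf"   # keep the container config
    _run "pct mount $vmid"
    _run "tar --numeric-owner -C $rootfs -cpf - . | zstd -T0 -o $archive"
    _run "pct unmount $vmid"
}

# Example (prints four commands, runs nothing):
#   manual_lxc_backup 114
```

Because tar and zstd run as root on the host rather than inside a user namespace, this path avoids the .tmp directory creation that fails under the current squash setting.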
Resolved (2026-06-24)
No owner decision required — already done.
| Item | Outcome |
|---|---|
| Broken CIFS vm-backups on Prawns | Removed from PVE |
| New backup target | NFS infra-backups → 192.168.6.215:/volume6/infra-backups (2 TB quota) |
| Retention | keep-last=3 on infra-backups |
| Legacy Wave A vzdump on prox local | Relocated to infra-backups/dump/ (~3.1 GB total) |
| Validation | Test vzdump 120 succeeded |
Future vzdump:
Raise the 2 TB DSM quota on infra-backups before scheduling full-VM dumps (e.g. saltierpoop).
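Whether a planned dump fits can be estimated with simple arithmetic before touching the quota. The sketch below is hypothetical (the function name and defaults are illustrative). It assumes retention is applied per guest and that pruning happens only after the new archive is written, so peak usage is keep-last plus one archive.

```shell
#!/bin/sh
# Returns 0 when (keep + 1) archives of est_gb each, plus current usage,
# stay within the quota. All values are in GB.
fits_quota() {
    est_gb=$1
    keep=${2:-3}          # keep-last=3 from the retention setting above
    quota_gb=${3:-2000}   # 2 TB quota, treated as 2000 GB
    used_gb=${4:-3}       # ~3 GB currently used
    [ $((est_gb * (keep + 1) + used_gb)) -le "$quota_gb" ]
}

# Example (not executed here):
#   fits_quota 150 && echo "fits" || echo "raise quota first"
```

This only covers a single guest; several guests dumping to the same target would need their estimates summed.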
Current state (live, 2026-06-24)
Hypervisor layers
| Layer | Size | Used | Free | Assessment |
|---|---|---|---|---|
| pve-root (local) | 94 GB | 63 GB (71%) | 27 GB | Improved after dump relocation; ~11 GB ISO prune available |
| local-lvm thin pool | 794 GB | 537 GB (67.5%) | ~270 GB | Adequate if 114 stops growing |
| infra-backups NFS | 2 TB quota | ~3 GB | ~2 TB | Healthy |
| synorpn NFS (Prawns) | 30 TB | 99.6% | ~112 GB | Critical upstream cap |
Guest thin-pool consumers (top)
| VMID | Guest | Prov. | Thin data % | ~Actual | In-guest / | Verdict |
|---|---|---|---|---|---|---|
| 100 | saltierpoop | 260 GB | 79.7% | ~207 GB | 206/258 GB (81%) | Largest anchor; local OK, NFS exposed |
| 114 | nfs-monitoring | 112 GB | 98.7% | ~110 GB | 96/111 GB (91%) | Urgent |
| 119 | harbor-registry | 80 GB | 87.8% | ~70 GB | 45/79 GB (61%) | Monitor; registry GC |
| 200 | haos | 32 GB | 95.3% | ~30 GB | — | HA retention review |
| 123 | infra-services | 30 GB | 65.0% | ~19 GB | — | Healthy |
| 111 | influxdb | 8 GB | 86.7% | ~7 GB | — | Retire after Grafana cutover |
| 116 | pulse | 4 GB | 99.5% | ~4 GB | 59% in-guest | Small but maxed LV |
12 guests remain on prox after Wave A/B (3 VMs, 9 LXCs).
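A table like the one above can be refreshed from `lvs` output. The filter below is a sketch: `flag_full_lvs` is a hypothetical name, the volume group name `pve` is assumed (the Proxmox default), and the sample LV names follow the usual `vm-<vmid>-disk-<n>` convention rather than anything confirmed on this host.

```shell
#!/bin/sh
# Reads "lv_name data_percent" pairs on stdin and prints the LVs whose
# thin data usage is at or above the limit (default 90).
# Lines without a data_percent column (non-thin LVs) are skipped.
flag_full_lvs() {
    awk -v limit="${1:-90}" \
        '$2 != "" && $2 + 0 >= limit { printf "%s %s%%\n", $1, $2 }'
}

# On the host (not executed here):
#   lvs --noheadings -o lv_name,data_percent pve | flag_full_lvs 90
```

Keeping the filter separate from the `lvs` call makes it easy to try against pasted output before running it on prox.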
flowchart TB
subgraph prox["prox NVMe"]
root["pve-root 27G free"]
thin["thin pool 270G free"]
s100["VM 100 · 207G actual"]
s114["LXC 114 · 110G actual · 98.7%"]
end
subgraph whrrr["Whrrr 192.168.6.215"]
v9["volume9 Prawns · 99.6%"]
v6["volume6 infra-backups · 0.15%"]
end
v9 --> s114
v9 --> s100
v6 --> vzdump["vzdump archives"]
thin --> s100
thin --> s114
Risk register
| ID | Risk | Likelihood | Impact | Mitigation owner |
|---|---|---|---|---|
| R1 | 114 rootfs fills → ES/Docker crash | High | Loss of legacy dashboards; possible corrupt indices | Owner: keep vs gut 114 |
| R2 | Prawns NFS full → saltierpoop *arr/download failures | Medium–High | Media pipeline stalls | Owner: Whrrr archive policy |
| R3 | Large vzdump to local during misconfigured job | Low (post-fix) | Repeat metrimon failure | Ops: always --storage infra-backups |
| R4 | saltierpoop thin growth without cleanup | Medium | Erodes pool headroom over months | Saltbox log/docker hygiene |
| R5 | pulse / haos / harbor thin max | Medium | Niche service outage | Per-guest cleanup or resize |
| R6 | 2 TB backup quota exceeded by full VM dump | Low until scheduled | Backup job failure | Raise quota when needed |
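The R3 mitigation (always pass --storage infra-backups) could be enforced with a small wrapper. This is a hypothetical sketch, not an existing script on prox: `safe_vzdump` and its dry-run default are illustrative, and the snapshot mode and zstd compression are assumed choices. Note that native LXC dumps to this NFS target still hit the squash issue described earlier.

```shell
#!/bin/sh
# Refuses to run vzdump without an explicit off-host storage.
# With DRY_RUN=1 (default) the command is printed instead of executed.
safe_vzdump() {
    vmid=$1
    storage=${2:-}
    case $storage in
        "" | local | local-lvm)
            echo "refusing: pass an off-host storage such as infra-backups" >&2
            return 1 ;;
    esac
    cmd="vzdump $vmid --storage $storage --mode snapshot --compress zstd"
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$cmd"; else $cmd; fi
}

# Example (prints the command only):
#   safe_vzdump 120 infra-backups
```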
nfs-monitoring (LXC 114) — detailed proposal
What it is today
- Role (original): NFS bridge + observability stack for Prawns/synorpn.
- Mounts: synorpn (Prawns NFS, 100%), legacy mp1 prawns bind.
- Running containers: elasticsearch, kibana, logstash, file_crawler, cadvisor, elasticsearch_exporter.
- Migration status: NFS dashboard metrics moved to Prometheus on infra-services (2026-06-24). Influx on LXC 111 still pending Grafana datasource removal.
Why it is urgent
- Thin LV at 98.7% — prox cannot grow the guest much without expansion or cleanup.
- 53 GB in /var/lib/docker on a 111 GB rootfs with 11 GB free.
- Upstream Prawns at 100% — file_crawler and any NFS writes are on a full volume.
Options
| Option | Actions | Pros | Cons | Recommended when |
|---|---|---|---|---|
| A — Gut legacy stack (default) | Stop ES/Kibana/Logstash/file_crawler; keep cadvisor + ES exporter if still scraped; prune Docker volumes | Frees 40–50+ GB quickly; reduces ops burden | Lose Kibana/Logstash if still used | No one opens Kibana weekly |
| B — Keep stack, shrink data | Curate ES indices; Docker prune; reduce retention | Preserves UI for debugging | Labor-intensive; Prawns still 100% | Short-term bridge only |
| C — Keep stack, expand disk | Resize LXC to 150 GB+; set ES ILM | Headroom for growth | Consumes thin pool; doesn't fix Prawns | Confirmed ongoing ES need |
| D — Full retire | Stop LXC; remove from Prometheus scrape; update inventory | Maximum prox relief | Lose NFS-side tooling entirely | Bridge no longer needed |
Proposal: Option A now, re-evaluate D after confirming no Grafana/ops dependency on Kibana. Keep cadvisor and elasticsearch_exporter only if Prometheus jobs on 114 remain valuable; otherwise migrate exporters or drop scrapes.
Owner decisions (114)
- [ ] D1: Confirm whether Kibana / Logstash / ES on 114 are still used (Y/N).
- [ ] D2: If No → approve Option A cleanup (agent executes with pre-stop snapshot optional).
- [ ] D3: If Yes → choose B (retention) or C (resize to ___ GB).
- [ ] D4: Pause file_crawler until Prawns has headroom? (Y/N)
saltierpoop (VM 100) — detailed proposal
What it is today
- Role: Production Saltbox media stack (~40 containers).
- Prox cost: 30 GB RAM, 260 GB thin disk (~207 GB actual) — ~37% of all thin usage.
- In-guest disk: 206 / 258 GB (81%) — ~51 GB free on local root.
- Docker loop: 34 / 54 GB — not the immediate problem.
- NFS mounts:
| Export | Volume | Use % | Free |
|---|---|---|---|
| Prawns | volume9 | 100% | ~112 GB |
| Movems | volume2 | 100% | ~87 GB |
| SerializedWatchables | volume1 | 98% | ~938 GB |
| OrderedWords | volume6 | 43% | ~8.3 TB |
Why it feels risky (but differently from 114)
- Prox local disk: Moderate — not imminently full.
- Media pipeline: High — new writes to Prawns/Movems can fail or stall while exports show 100%.
- Backup: Full vzdump fits the 2 TB quota; compressed size may be 80–150+ GB. Raise quota before making full-VM dumps routine.
Options
| Option | Actions | Pros | Cons | Horizon |
|---|---|---|---|---|
| A — Hygiene only | Saltbox log rotation, Docker prune, review / growth | Low risk; no architecture change | Doesn't fix Prawns | Now |
| B — Whrrr media policy | Prawns/Movems cleanup or cold-tier moves | Fixes upstream for 100 + 114 | Owner media/archive decisions | Soon |
| C — Migrate to Whrrr VMM | Move VM off prox per PLAN baseline | Frees ~207 GB thin + 30 GB RAM | VMM validation, cutover window | Phase 9+ |
| D — Resize / split disks | Expand LV or add data disk | More local room | Doesn't fix NFS; uses thin pool | Only if local fills |
Proposal: A + B in parallel — prox hygiene is maintenance; Prawns/Movems policy is the real saltierpoop decision. C remains the strategic end-game from PLAN.md § infra-services baseline; do not block on it for 114 urgency.
Owner decisions (100)
- [ ] D5: Accept that Prawns/Movems at 100% is the primary saltierpoop risk (Y/N — acknowledge).
- [ ] D6: Schedule Whrrr Prawns/Movems cleanup or tiering (owner timeline: ___).
- [ ] D7: Approve saltierpoop in-guest hygiene pass (logs, unused images) — agent or manual?
- [ ] D8: Whrrr VMM migration — defer / plan / prioritize (pick one).
Secondary guests — maintenance backlog
| VMID | Guest | Issue | Proposed action | Owner gate |
|---|---|---|---|---|
| 111 | influxdb | Superseded by Prometheus for thermals/NFS | Retire LXC after Grafana Influx datasources removed | D9: confirm datasources migrated |
| 109 | graylog | Stopped; Pattern E central syslog | Revive per central-syslog-graylog | Separate syslog decision |
| 116 | pulse | 99.5% thin on 4 GB LV | Expand to 8 GB or prune in-guest | Low urgency |
| 119 | harbor-registry | 87.8% thin | Registry GC, old tag cleanup | Low urgency |
| 200 | haos | 95.3% thin | Review HA snapshot/history retention | Low urgency |
| — | prox ISOs | ~11 GB redundant Ubuntu ISOs | Delete desktop + duplicate server ISOs; keep one 24.04 server | D10: pick keeper ISO |
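For the ISO row, a listing of deletion candidates can be produced without deleting anything. The sketch below is hypothetical: `iso_prune_candidates` is an illustrative name, and `/var/lib/vz/template/iso` is the default location for the `local` storage, assumed rather than confirmed for prox. It only prints paths; removal stays a manual, owner-approved step.

```shell
#!/bin/sh
# Prints every *.iso in a directory whose file name does not start with
# the keeper prefix. Nothing is deleted.
iso_prune_candidates() {
    dir=$1
    keep=$2
    for f in "$dir"/*.iso; do
        [ -e "$f" ] || continue            # directory has no ISOs
        case $(basename "$f") in
            "$keep"*) ;;                   # keeper: skip
            *) echo "$f" ;;
        esac
    done
}

# Example (not executed here):
#   iso_prune_candidates /var/lib/vz/template/iso ubuntu-24.04.1-live-server
```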
Whrrr upstream (shared dependency)
Owner 2026-06-24: Whrrr NFS / pool cleanup is out of scope — pool layout prevents changes for now. This proposal covers prox guests only.
Both 114 and 100 have historically depended on volume9 (Prawns), which is near capacity.
Operational backups use volume6 (infra-backups) — do not route vzdump to Prawns.
| Volume | Pool | Role | Use % | Proposal |
|---|---|---|---|---|
| volume9 (Prawns) | vg3 | Media + synorpn + saltierpoop paths | ~100% | Owner archive policy; not backup target |
| volume2 (Movems) | vg1 | Movies NFS | ~100% | Same |
| volume1 (SerializedWatchables) | vg1 | TV | ~98% | Monitor |
| volume6 (infra-backups) | vg2 | Proxmox vzdump | ~0% | Keep; raise quota on demand |
| volume6 (OrderedWords) | vg2 | Books | ~43% | Healthy |
Capacity alerts: synology-capacity-ntfy (deployed on Whrrr).
Phased execution plan
Phase 0 — Done
- [x] infra-backups NFS storage on prox
- [x] Retire vm-backups CIFS
- [x] Relocate legacy vzdump off pve-root
- [x] Document in prox-storage-2026-06-24.md
Phase 1 — Urgent (this week)
Depends on D1–D4.
- Owner confirms 114 stack usage.
- If unused → stop ES/Kibana/Logstash/file_crawler; prune Docker; verify disk drop.
- Optional: vzdump 114 --storage infra-backups before destructive cleanup.
- Pause file_crawler if Prawns unchanged.
Success criteria: LXC 114 in-guest / below 70%; thin LV below 85%.
Phase 2 — Near-term (this month)
Depends on D5–D10 and consolidation queue.
- Remove Grafana Influx datasources → retire LXC 111.
- ISO prune on prox (~11 GB).
- saltierpoop hygiene pass (logs/Docker).
- Owner Prawns/Movems capacity actions on Whrrr.
- harbor / haos / pulse disk passes as time allows.
Success criteria: pve-root above 35 GB free; Prawns below 98% or write policy documented; influxdb retired.
Phase 3 — Strategic (Phase 9+)
Depends on D8.
- Evaluate saltierpoop migration to Whrrr VMM (PLAN baseline).
- Revisit nfs-monitoring — full retire vs minimal NFS bridge.
- Graylog 109 revive for central syslog (orthogonal to storage, but on prox).
Owner decision checklist (summary)
| ID | Question | Options | Proposal |
|---|---|---|---|
| D1 | Still using Kibana/Logstash/ES on 114? | Y / N | Assume N until confirmed |
| D2 | Approve 114 legacy stack removal? | A / B / C / D | A if D1=N |
| D3 | Resize 114 if keeping stack? | GB: ___ | Only if D1=Y |
| D4 | Pause file_crawler until Prawns relief? | Y / N | Y |
| D5 | Acknowledge Prawns as saltierpoop blocker? | Y | Y |
| D6 | Prawns/Movems cleanup timeline? | date / defer | Owner sets |
| D7 | saltierpoop hygiene pass? | agent / manual / skip | agent with approval |
| D8 | Whrrr VMM migration for 100? | defer / plan / prioritize | defer |
| D9 | Influx datasources removed → retire 111? | Y / N | Y after verify |
| D10 | ISO prune — keep ubuntu-24.04.1-live-server only? | Y / N | Y |
Reply with decisions (e.g. D1=N, D2=A, D4=Y, D7=agent, D10=Y) to unblock Phase 1 execution.
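A reply in that form is easy to turn into one decision per line for logging or scripting. The parser below is a hypothetical sketch (`parse_decisions` is an illustrative name) that accepts IDs D1 through D10 and reports anything else on stderr.

```shell
#!/bin/sh
# Splits "D1=N, D2=A, ..." into "ID value" lines.
# Unknown IDs and empty values are reported on stderr and skipped.
parse_decisions() {
    echo "$1" | tr ',' '\n' | while IFS='=' read -r id value; do
        id=$(echo "$id" | tr -d ' ')
        value=$(echo "$value" | tr -d ' ')
        case $id in
            D[1-9] | D10)
                if [ -n "$value" ]; then
                    echo "$id $value"
                else
                    echo "ignored: $id has no value" >&2
                fi ;;
            *) echo "ignored: $id" >&2 ;;
        esac
    done
}

# Example (not executed here):
#   parse_decisions "D1=N, D2=A, D4=Y, D7=agent, D10=Y"
```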
What we are not proposing
- Deleting backup artifacts without explicit owner approval (see homelab agent rules).
- Moving vzdump back to Prawns or local as default.
- Destroying saltierpoop or nfs-monitoring without disposition review + vzdump gate.
- Raising infra-backups quota preemptively (owner: raise it only when a job requires it).
Changelog
| Date | Change |
|---|---|
| 2026-06-24 | Initial proposal from live prox audit + owner storage session |