
Proxmox storage remediation — proposal

Status: Partially executed (owner decisions 2026-06-24)
Date: 2026-06-24
Host: prox (192.168.6.71), ASUS PN64, 1 TB NVMe
Context: Post-Wave A/B decommission; backup target migrated to Whrrr volume6

This document consolidates the live storage audit, resolved items, open risks, and recommended decision paths. It is the action proposal companion to the frozen point-in-time audit in prox-storage-2026-06-24.md.


Executive summary

Prox is not in hypervisor-level crisis after recent backup work, but two production-adjacent workloads sit on different failure modes:

| Workload | Primary risk | Root cause |
|---|---|---|
| nfs-monitoring (114) | Imminent: guest disk ~full | 98.7% thin LV, 91% in-guest, 53 GB Docker (ES stack) |
| saltierpoop (100) | Operational: media I/O | Local disk 81% (OK for now); Prawns/Movems NFS at 100% |

Three storage layers must be managed independently:

  1. pve-root — OS, ISOs, local scratch (improved; ISO prune still available)
  2. local-lvm thin pool — guest disks (~270 GB thin free; unevenly consumed)
  3. Whrrr NFS — Prawns/synorpn nearly full; infra-backups on volume6 healthy

Recommendation in one line: ~~Gut or shrink 114~~ 114 retired 2026-06-24; manage guest disks only (Whrrr NFS pool layout is fixed for now); keep infra-backups as vzdump target; saltierpoop alert relief via periodic log hygiene.


Owner decisions (2026-06-24)

| Topic | Decision |
|---|---|
| Whrrr NFS / Prawns cleanup | Out of scope: pool layout prevents changes; only VM/LXC actions |
| nfs-monitoring (114) | Backup + destroy (executed 2026-06-24) |
| saltierpoop (100) | Keep on prox; marginal in-guest cleanup for disk alert relief |

Executed (2026-06-24)

| Action | Result |
|---|---|
| LXC 114 manual backup | infra-backups/dump/manual-lxc-114-rootfs-2026_06_24.tar.zst (~53 GB) + lxc-114-pct.conf |
| LXC 114 destroy | VMID 114 removed; ~110 GB thin pool freed |
| Prometheus | Removed nfs-monitoring-* scrape jobs from prometheus.yml |
| saltierpoop hygiene | journalctl --vacuum-size=500M (~3.4 GB), /tmp clear, syslog.1 truncate → 78% used / ~22% free on / |

Disk alerts fire when root filesystem free space drops below 20% (monitoring/prometheus/alerts/node.yml). Saltierpoop flapped at ~19% free before cleanup.
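The 20% threshold can be checked by hand before the alert flaps. The sketch below is only an approximation of the rule in node.yml, whose exact expression is not reproduced here; the function names and the `ALERT_FREE_PCT` variable are illustrative assumptions.

```shell
# Sketch: approximate the "root free below 20%" alert from df output.
# ALERT_FREE_PCT mirrors the threshold quoted above; names are illustrative.
ALERT_FREE_PCT=20

# Read `df -P <mount>` output on stdin and print free space as an integer
# percent of (used + available), truncated toward zero.
free_pct_from_df() {
  awk 'NR == 2 { printf "%d\n", ($4 * 100) / ($3 + $4) }'
}

# Succeed (exit 0) when the given free percent is under the alert threshold.
below_alert_threshold() {
  [ "$1" -lt "$ALERT_FREE_PCT" ]
}

# Usage on the guest:
#   pct_free=$(df -P / | free_pct_from_df)
#   below_alert_threshold "$pct_free" && echo "root at ${pct_free}% free"
```

Reading df output from stdin keeps the parsing step separate from the live system, so it can be tried against saved output as well.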

Safe periodic maintenance (no Docker prune without review):

# On saltierpoop (guest agent or SSH)
sudo journalctl --vacuum-size=500M
sudo find /tmp -mindepth 1 -mtime +7 -delete
sudo logrotate -f /etc/logrotate.d/rsyslog   # if syslog grows again

If alerts return, consider a monthly DSM or Ansible cron job for the journal vacuum.

vzdump + NFS note

A large LXC vzdump to infra-backups failed because PVE's lxc-usernsexec cannot create its .tmp directories on NFS under the current Synology squash setting (Map root to admin only). Workarounds:

  1. DSM NFS rule → Map all users to admin (fixes native vzdump), or
  2. Manual mount + tar | zstd to NFS (used for 114), or
  3. Set tmpdir to local scratch for dumps to this storage (only viable if the archive fits in the ~27 GB free on root)
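Workaround 2 could look roughly like the following. This is a sketch, not the exact commands run for 114: the `DEST` path assumes the storage is mounted at PVE's usual /mnt/pve/<storage> location, the function names are hypothetical, and the tar flags are one reasonable choice. It relies on bash for `PIPESTATUS`.

```shell
# Sketch of workaround 2: stream a stopped container's rootfs to the NFS share
# as host root, bypassing the user-namespace helper that vzdump uses.
# DEST and both function names are illustrative assumptions.
DEST="${DEST:-/mnt/pve/infra-backups/dump}"

# Build an archive name in the style used for 114, e.g.
# manual-lxc-114-rootfs-2026_06_24.tar.zst
archive_name() {
  local vmid="$1" stamp="${2:-$(date +%Y_%m_%d)}"
  printf 'manual-lxc-%s-rootfs-%s.tar.zst\n' "$vmid" "$stamp"
}

# Mount the rootfs on the host, archive it, then unmount.
# Run on the PVE host with the container stopped.
backup_lxc_rootfs() {
  local vmid="$1" out st
  out="$DEST/$(archive_name "$vmid")"
  pct mount "$vmid" || return 1
  tar --numeric-owner -C "/var/lib/lxc/$vmid/rootfs" -cf - . | zstd -T0 -q -o "$out"
  st=("${PIPESTATUS[@]}")          # keep both tar and zstd exit codes
  pct unmount "$vmid"
  [ "${st[0]}" -eq 0 ] && [ "${st[1]}" -eq 0 ]
}

# Usage: backup_lxc_rootfs 114 && pct config 114 > "$DEST/lxc-114-pct.conf"
```

Saving the `pct config` output next to the archive matches the lxc-114-pct.conf file listed under Executed, since a plain tar of the rootfs does not carry the container definition.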

Resolved (2026-06-24)

No owner decision required — already done.

| Item | Outcome |
|---|---|
| Broken CIFS vm-backups on Prawns | Removed from PVE |
| New backup target | NFS infra-backups → 192.168.6.215:/volume6/infra-backups (2 TB quota) |
| Retention | keep-last=3 on infra-backups |
| Legacy Wave A vzdump on prox local | Relocated to infra-backups/dump/ (~3.1 GB total) |
| Validation | Test vzdump 120 succeeded |

Future vzdump:

vzdump <vmid> --storage infra-backups --compress zstd

Raise the 2 TB DSM quota on infra-backups before scheduling full-VM dumps (e.g. saltierpoop).


Current state (live, 2026-06-24)

Hypervisor layers

| Layer | Size | Used | Free | Assessment |
|---|---|---|---|---|
| pve-root (local) | 94 GB | 63 GB (71%) | 27 GB | Improved after dump relocation; ~11 GB ISO prune available |
| local-lvm thin pool | 794 GB | 537 GB (67.5%) | ~270 GB | Adequate if 114 stops growing |
| infra-backups NFS | 2 TB quota | ~3 GB | ~2 TB | Healthy |
| synorpn NFS (Prawns) | 30 TB | 99.6% | ~112 GB | Critical upstream cap |

Guest thin-pool consumers (top)

| VMID | Guest | Prov. | Thin data % | ~Actual | In-guest / | Verdict |
|---|---|---|---|---|---|---|
| 100 | saltierpoop | 260 GB | 79.7% | ~207 GB | 206/258 GB (81%) | Largest anchor; local OK, NFS exposed |
| 114 | nfs-monitoring | 112 GB | 98.7% | ~110 GB | 96/111 GB (91%) | Urgent |
| 119 | harbor-registry | 80 GB | 87.8% | ~70 GB | 45/79 GB (61%) | Monitor; registry GC |
| 200 | haos | 32 GB | 95.3% | ~30 GB | | HA retention review |
| 123 | infra-services | 30 GB | 65.0% | ~19 GB | | Healthy |
| 111 | influxdb | 8 GB | 86.7% | ~7 GB | | Retire after Grafana cutover |
| 116 | pulse | 4 GB | 99.5% | ~4 GB | 59% | Small but maxed LV |

12 guests remain on prox after Wave A/B (3 VMs, 9 LXCs).
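The "~Actual" column is roughly provisioned size times thin data %. A minimal sketch of that arithmetic, plus a filter for nearly full thin LVs, follows; the three-column input format, the LV names in the usage line, and the volume group name `pve` are assumptions.

```shell
# Sketch: derive "~Actual" from provisioned size and thin data %, and flag
# volumes at or above a threshold. Input format and LV names are assumptions.

# actual_gb <provisioned_gb> <data_percent>  ->  rounded GB actually allocated
actual_gb() {
  awk -v size="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", size * pct / 100 }'
}

# Read "name size_gb data_percent" rows on stdin; print names at or over
# the percentage given as $1. Rows with no data_percent are ignored.
flag_full_lvs() {
  awk -v limit="$1" '$3 + 0 >= limit + 0 { print $1 }'
}

# Possible live input on the PVE host (volume group "pve" assumed):
#   lvs --noheadings --units g --nosuffix -o lv_name,lv_size,data_percent pve \
#     | flag_full_lvs 95
```

For example, `actual_gb 260 79.7` gives 207, in line with the saltierpoop row. Rounding means other rows can differ from the table by about 1 GB.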

flowchart TB
  subgraph prox["prox NVMe"]
    root["pve-root 27G free"]
    thin["thin pool 270G free"]
    s100["VM 100 · 207G actual"]
    s114["LXC 114 · 110G actual · 98.7%"]
  end
  subgraph whrrr["Whrrr 192.168.6.215"]
    v9["volume9 Prawns · 99.6%"]
    v6["volume6 infra-backups · 0.15%"]
  end
  v9 --> s114
  v9 --> s100
  v6 --> vzdump["vzdump archives"]
  thin --> s100
  thin --> s114

Risk register

| ID | Risk | Likelihood | Impact | Mitigation owner |
|---|---|---|---|---|
| R1 | 114 rootfs fills → ES/Docker crash | High | Loss of legacy dashboards; possible corrupt indices | Owner: keep vs gut 114 |
| R2 | Prawns NFS full → saltierpoop *arr/download failures | Medium–High | Media pipeline stalls | Owner: Whrrr archive policy |
| R3 | Large vzdump to local during misconfigured job | Low (post-fix) | Repeat metrimon failure | Ops: always --storage infra-backups |
| R4 | saltierpoop thin growth without cleanup | Medium | Erodes pool headroom over months | Saltbox log/docker hygiene |
| R5 | pulse / haos / harbor thin max | Medium | Niche service outage | Per-guest cleanup or resize |
| R6 | 2 TB backup quota exceeded by full VM dump | Low until scheduled | Backup job failure | Raise quota when needed |

nfs-monitoring (LXC 114) — detailed proposal

What it is today

  • Role (original): NFS bridge + observability stack for Prawns/synorpn.
  • Mounts: synorpn (Prawns NFS, 100%), legacy mp1 prawns bind.
  • Running containers: elasticsearch, kibana, logstash, file_crawler, cadvisor, elasticsearch_exporter.
  • Migration status: NFS dashboard metrics moved to Prometheus on infra-services (2026-06-24). Influx on LXC 111 still pending Grafana datasource removal.

Why it is urgent

  • Thin LV at 98.7% — prox cannot grow the guest much without expansion or cleanup.
  • 53 GB in /var/lib/docker on a 111 GB rootfs with 11 GB free.
  • Upstream Prawns at 100% — file_crawler and any NFS writes are on a full volume.

Options

| Option | Actions | Pros | Cons | Recommended when |
|---|---|---|---|---|
| A: Gut legacy stack (default) | Stop ES/Kibana/Logstash/file_crawler; keep cadvisor + ES exporter if still scraped; prune Docker volumes | Frees 40–50+ GB quickly; reduces ops burden | Lose Kibana/Logstash if still used | No one opens Kibana weekly |
| B: Keep stack, shrink data | Curate ES indices; Docker prune; reduce retention | Preserves UI for debugging | Labor-intensive; Prawns still 100% | Short-term bridge only |
| C: Keep stack, expand disk | Resize LXC to 150 GB+; set ES ILM | Headroom for growth | Consumes thin pool; doesn't fix Prawns | Confirmed ongoing ES need |
| D: Full retire | Stop LXC; remove from Prometheus scrape; update inventory | Maximum prox relief | Lose NFS-side tooling entirely | Bridge no longer needed |

Proposal: Option A now, re-evaluate D after confirming no Grafana/ops dependency on Kibana. Keep cadvisor and elasticsearch_exporter only if Prometheus jobs on 114 remain valuable; otherwise migrate exporters or drop scrapes.

Owner decisions (114)

  • [ ] D1: Confirm whether Kibana / Logstash / ES on 114 are still used (Y/N).
  • [ ] D2: If No → approve Option A cleanup (agent executes with pre-stop snapshot optional).
  • [ ] D3: If Yes → choose B (retention) or C (resize to ___ GB).
  • [ ] D4: Pause file_crawler until Prawns has headroom? (Y/N)

saltierpoop (VM 100) — detailed proposal

What it is today

  • Role: Production Saltbox media stack (~40 containers).
  • Prox cost: 30 GB RAM, 260 GB thin disk (~207 GB actual) — ~37% of all thin usage.
  • In-guest disk: 206 / 258 GB (81%) — ~51 GB free on local root.
  • Docker loop: 34 / 54 GB — not the immediate problem.
  • NFS mounts:
| Export | Volume | Use % | Free |
|---|---|---|---|
| Prawns | volume9 | 100% | ~112 GB |
| Movems | volume2 | 100% | ~87 GB |
| SerializedWatchables | volume1 | 98% | ~938 GB |
| OrderedWords | volume6 | 43% | ~8.3 TB |

Why it feels risky (but differently from 114)

  • Prox local disk: Moderate — not imminently full.
  • Media pipeline: High — new writes to Prawns/Movems can fail or stall while exports show 100%.
  • Backup: Full vzdump fits the 2 TB quota; compressed size may be 80–150+ GB. Raise quota before making full-VM dumps routine.
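The quota point can be put in numbers. The sketch below assumes the new archive is written before the oldest one is pruned, so a guest briefly holds keep-last + 1 archives. The 150 GB figure is the upper estimate quoted above, keep-last=3 and the 2 TB quota come from the infra-backups settings, and the function names are illustrative.

```shell
# Sketch: worst-case space one guest's dumps can hold on the backup share.
# Assumes pruning happens after the new archive lands (keep_last + 1 at peak).

# peak_gb <archive_gb> <keep_last>  ->  peak GB held for that guest
peak_gb() {
  echo $(( $1 * ($2 + 1) ))
}

# fits_quota <archive_gb> <keep_last> <quota_gb>  ->  exit 0 if the peak fits
fits_quota() {
  [ "$(peak_gb "$1" "$2")" -le "$3" ]
}

# Usage:
#   peak_gb 150 3            # 600
#   fits_quota 150 3 2000 && echo "one saltierpoop-sized guest fits"
```

This covers a single guest. Once several guests are scheduled, the sum of their peaks is the figure to compare against the quota.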

Options

| Option | Actions | Pros | Cons | Horizon |
|---|---|---|---|---|
| A: Hygiene only | Saltbox log rotation, Docker prune, review growth on / | Low risk; no architecture change | Doesn't fix Prawns | Now |
| B: Whrrr media policy | Prawns/Movems cleanup or cold-tier moves | Fixes upstream for 100 + 114 | Owner media/archive decisions | Soon |
| C: Migrate to Whrrr VMM | Move VM off prox per PLAN baseline | Frees ~207 GB thin + 30 GB RAM | VMM validation, cutover window | Phase 9+ |
| D: Resize / split disks | Expand LV or add data disk | More local room | Doesn't fix NFS; uses thin pool | Only if local fills |

Proposal: A + B in parallel — prox hygiene is maintenance; Prawns/Movems policy is the real saltierpoop decision. C remains the strategic end-game from PLAN.md § infra-services baseline; do not block on it for 114 urgency.

Owner decisions (100)

  • [ ] D5: Accept that Prawns/Movems at 100% is the primary saltierpoop risk (Y/N — acknowledge).
  • [ ] D6: Schedule Whrrr Prawns/Movems cleanup or tiering (owner timeline: ___).
  • [ ] D7: Approve saltierpoop in-guest hygiene pass (logs, unused images) — agent or manual?
  • [ ] D8: Whrrr VMM migration — defer / plan / prioritize (pick one).

Secondary guests — maintenance backlog

| VMID | Guest | Issue | Proposed action | Owner gate |
|---|---|---|---|---|
| 111 | influxdb | Superseded by Prometheus for thermals/NFS | Retire LXC after Grafana Influx datasources removed | D9: confirm datasources migrated |
| 109 | graylog | Stopped; Pattern E central syslog | Revive per central-syslog-graylog | Separate syslog decision |
| 116 | pulse | 99.5% thin on 4 GB LV | Expand to 8 GB or prune in-guest | Low urgency |
| 119 | harbor-registry | 87.8% thin | Registry GC, old tag cleanup | Low urgency |
| 200 | haos | 95.3% thin | Review HA snapshot/history retention | Low urgency |
| prox | ISOs | ~11 GB redundant Ubuntu ISOs | Delete desktop + duplicate server ISOs; keep one 24.04 server | D10: pick keeper ISO |
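For the ISO row, a dry-run listing makes the D10 choice easier to review before anything is deleted. This sketch assumes the ISOs live in the usual directory for PVE's `local` storage and matches the keeper by filename prefix. The function name is hypothetical and nothing is removed.

```shell
# Sketch: list ISOs that are NOT the keeper, without deleting anything.
# ISO_DIR is the usual path for PVE "local" storage; treat it as an assumption.
ISO_DIR="${ISO_DIR:-/var/lib/vz/template/iso}"

# iso_prune_candidates <keeper-prefix> [dir]
iso_prune_candidates() {
  local keeper="$1" dir="${2:-$ISO_DIR}" f
  for f in "$dir"/*.iso; do
    [ -e "$f" ] || continue          # directory has no ISOs
    case "$(basename "$f")" in
      "$keeper"*) ;;                 # keeper image: skip
      *) printf '%s\n' "$f" ;;
    esac
  done
}

# Review sizes first:
#   iso_prune_candidates ubuntu-24.04.1-live-server | xargs -r du -ch
```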

Whrrr upstream (shared dependency)

Owner 2026-06-24: Whrrr NFS / pool cleanup is out of scope — pool layout prevents changes for now. This proposal covers prox guests only.

114 and 100 both historically depended on volume9 Prawns near capacity. Operational backups use volume6 (infra-backups) — do not route vzdump to Prawns.

| Volume | Pool | Role | Use % | Proposal |
|---|---|---|---|---|
| volume9 (Prawns) | vg3 | Media + synorpn + saltierpoop paths | ~100% | Owner archive policy; not backup target |
| volume2 (Movems) | vg1 | Movies NFS | ~100% | Same |
| volume1 (SerializedWatchables) | vg1 | TV | ~98% | Monitor |
| volume6 (infra-backups) | vg2 | Proxmox vzdump | ~0% | Keep; raise quota on demand |
| volume6 (OrderedWords) | vg2 | Books | ~43% | Healthy |

Capacity alerts: synology-capacity-ntfy (deployed on Whrrr).


Phased execution plan

Phase 0 — Done

  • [x] infra-backups NFS storage on prox
  • [x] Retire vm-backups CIFS
  • [x] Relocate legacy vzdump off pve-root
  • [x] Document in prox-storage-2026-06-24.md

Phase 1 — Urgent (this week)

Depends on D1–D4.

  1. Owner confirms 114 stack usage.
  2. If unused → stop ES/Kibana/Logstash/file_crawler; prune Docker; verify disk drop.
  3. Optional: vzdump 114 --storage infra-backups before destructive cleanup.
  4. Pause file_crawler if Prawns unchanged.

Success criteria: LXC 114 in-guest / below 70%; thin LV below 85%.
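The two Phase 1 thresholds can be checked mechanically. This is a minimal sketch: the function name is illustrative, and the LV name in the usage comment is an assumption about how the container's disk is named.

```shell
# Sketch: test the Phase 1 success criteria (in-guest / below 70%, thin LV
# below 85%). Accepts decimal percentages; names are illustrative.

# phase1_ok <in_guest_root_used_pct> <thin_lv_data_pct>
phase1_ok() {
  awk -v g="$1" -v t="$2" 'BEGIN { exit !(g < 70 && t < 85) }'
}

# Possible way to collect the inputs on the PVE host (LV name assumed):
#   guest=$(pct exec 114 -- df -P / | awk 'NR == 2 { print $5 + 0 }')
#   thin=$(lvs --noheadings -o data_percent pve/vm-114-disk-0 | tr -d ' ')
#   phase1_ok "$guest" "$thin" && echo "Phase 1 criteria met"
```

With the audit values, `phase1_ok 91 98.7` fails, as expected for the pre-cleanup state.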

Phase 2 — Near-term (this month)

Depends on D5–D10 and consolidation queue.

  1. Remove Grafana Influx datasources → retire LXC 111.
  2. ISO prune on prox (~11 GB).
  3. saltierpoop hygiene pass (logs/Docker).
  4. Owner Prawns/Movems capacity actions on Whrrr.
  5. harbor / haos / pulse disk passes as time allows.

Success criteria: pve-root above 35 GB free; Prawns below 98% or write policy documented; influxdb retired.

Phase 3 — Strategic (Phase 9+)

Depends on D8.

  1. Evaluate saltierpoop migration to Whrrr VMM (PLAN baseline).
  2. Revisit nfs-monitoring — full retire vs minimal NFS bridge.
  3. Graylog 109 revive for central syslog (orthogonal to storage, but on prox).

Owner decision checklist (summary)

| ID | Question | Options | Proposal |
|---|---|---|---|
| D1 | Still using Kibana/Logstash/ES on 114? | Y / N | Assume N until confirmed |
| D2 | Approve 114 legacy stack removal? | A / B / C / D | A if D1=N |
| D3 | Resize 114 if keeping stack? | GB: ___ | Only if D1=Y |
| D4 | Pause file_crawler until Prawns relief? | Y / N | Y |
| D5 | Acknowledge Prawns as saltierpoop blocker? | Y | Y |
| D6 | Prawns/Movems cleanup timeline? | date / defer | Owner sets |
| D7 | saltierpoop hygiene pass? | agent / manual / skip | agent with approval |
| D8 | Whrrr VMM migration for 100? | defer / plan / prioritize | defer |
| D9 | Influx datasources removed → retire 111? | Y / N | Y after verify |
| D10 | ISO prune: keep ubuntu-24.04.1-live-server only? | Y / N | Y |

Reply with decisions (e.g. D1=N, D2=A, D4=Y, D7=agent, D10=Y) to unblock Phase 1 execution.


What we are not proposing

  • Deleting backup artifacts without explicit owner approval (see homelab agent rules).
  • Moving vzdump back to Prawns or local as default.
  • Destroying saltierpoop or nfs-monitoring without disposition review + vzdump gate.
  • Raising the infra-backups quota preemptively (owner: raise it only when a job requires it).

Changelog

| Date | Change |
|---|---|
| 2026-06-24 | Initial proposal from live prox audit + owner storage session |