Proxmox storage remediation — proposal

Status: Partially executed (owner decisions 2026-06-24)
Date: 2026-06-24
Host: prox (192.168.6.71), ASUS PN64, 1 TB NVMe
Context: Post–Wave A/B decommission; backup target migrated to Whrrr volume6

This document consolidates the live storage audit, resolved items, open risks, and recommended decision paths. It is the action proposal companion to the frozen point-in-time audit in prox-storage-2026-06-24.md.

Related:

Compute disposition review — keep / consolidate / retire matrix
Proxmox consolidation — end-state guest target
Synology capacity ntfy — Whrrr volume alerts
Compute decommission queue — vzdump + destroy order

Executive summary

Prox is not in a hypervisor-level crisis after the recent backup work, but two production-adjacent workloads face different failure modes:

Workload	Primary risk	Root cause
nfs-monitoring (114)	Imminent — guest disk ~full	98.7% thin LV, 91% in-guest, 53 GB Docker (ES stack)
saltierpoop (100)	Operational — media I/O	Local disk 81% (OK for now); Prawns/Movems NFS at 100%

Three storage layers must be managed independently:

pve-root — OS, ISOs, local scratch (improved; ISO prune still available)
local-lvm thin pool — guest disks (~270 GB thin free; unevenly consumed)
Whrrr NFS — Prawns/synorpn nearly full; infra-backups on volume6 healthy

Recommendation in one line: ~~Gut or shrink 114~~ 114 retired 2026-06-24; manage guest disks only (Whrrr NFS pool layout is fixed for now); keep infra-backups as vzdump target; saltierpoop alert relief via periodic log hygiene.

Owner decisions (2026-06-24)

Topic	Decision
Whrrr NFS / Prawns cleanup	Out of scope — pool layout prevents changes; only VM/LXC actions
nfs-monitoring (114)	Backup + destroy — executed 2026-06-24
saltierpoop (100)	Keep on prox — marginal in-guest cleanup for disk alert relief

Executed (2026-06-24)

Action	Result
LXC 114 manual backup	`infra-backups/dump/manual-lxc-114-rootfs-2026_06_24.tar.zst` (~53 GB) + `lxc-114-pct.conf`
LXC 114 destroy	VMID 114 removed; ~110 GB thin pool freed
Prometheus	Removed `nfs-monitoring-*` scrape jobs from `prometheus.yml`
saltierpoop hygiene	`journalctl --vacuum-size=500M` (~3.4 GB), `/tmp` clear, `syslog.1` truncate → 78% used / ~22% free on `/`

saltierpoop ongoing hygiene (recommended)

Disk alerts fire when root filesystem free space drops below 20% (monitoring/prometheus/alerts/node.yml). Saltierpoop flapped at ~19% free before cleanup.

Safe periodic maintenance (no Docker prune without review):

# On saltierpoop (guest agent or SSH)
sudo journalctl --vacuum-size=500M
sudo find /tmp -mindepth 1 -mtime +7 -delete
sudo logrotate -f /etc/logrotate.d/rsyslog   # if syslog grows again

If alerts return, consider a monthly cron job (DSM or Ansible) for the journal vacuum.
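
The 20% alert line described above can be turned into a small guard so the hygiene steps run only when free space on `/` is actually low. This is a minimal sketch, not existing tooling: the function names `free_pct`, `below_threshold`, and `root_free_pct` are hypothetical, and only the 20% figure comes from the alert rule cited above.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not part of any existing tooling.

# free_pct USED AVAIL: integer percent free, computed as avail / (used + avail).
free_pct() {
    echo $(( $2 * 100 / ($1 + $2) ))
}

# below_threshold FREE_PCT THRESHOLD: succeeds when free space is under the alert line.
below_threshold() {
    [ "$1" -lt "$2" ]
}

# root_free_pct: read used/available 1K-blocks for / from POSIX-format df output.
root_free_pct() {
    set -- $(df -P / | awk 'NR == 2 { print $3, $4 }')
    free_pct "$1" "$2"
}

# Example wiring (commented out so that sourcing this file has no side effects):
# if below_threshold "$(root_free_pct)" 20; then
#     sudo journalctl --vacuum-size=500M
# fi
```

Keeping the arithmetic in `free_pct` separate from the `df` call makes the threshold easy to check by hand against the 19% and 22% figures recorded above.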

vzdump + NFS note

Large LXC vzdump to infra-backups failed: PVE lxc-usernsexec cannot create .tmp dirs on NFS with current Synology squash (Map root to admin only). Workarounds:

DSM NFS rule → Map all users to admin (fixes native vzdump), or
Manual mount + tar | zstd to NFS (used for 114), or
vzdump tmpdir option pointing to local scratch (only if the archive fits in the ~27 GB free on root)
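
The second workaround (manual mount, then tar piped through zstd) might look roughly like the sketch below. It is not a record of the commands actually run for 114. The helper names are hypothetical, and the example paths assume that `pct mount` exposes the rootfs under `/var/lib/lxc/<vmid>/rootfs` and that the NFS storage is mounted at `/mnt/pve/infra-backups`.

```shell
#!/bin/sh
# Hypothetical helpers sketching the manual workaround.

# manual_backup_name VMID DATE: build a name in the same pattern as the 114 archive.
# DATE is expected as YYYY_MM_DD.
manual_backup_name() {
    echo "manual-lxc-$1-rootfs-$2.tar.zst"
}

# archive_rootfs SRC_DIR DEST_FILE: stream SRC_DIR through zstd into DEST_FILE.
# Note: in plain sh the pipeline status is zstd's, so watch tar's stderr as well.
archive_rootfs() {
    tar --numeric-owner --one-file-system -C "$1" -cf - . | zstd -T0 -q -o "$2"
}

# Example (paths are assumptions, see the note above this block):
# pct mount 114
# archive_rootfs /var/lib/lxc/114/rootfs \
#     "/mnt/pve/infra-backups/dump/$(manual_backup_name 114 2026_06_24)"
# pct unmount 114
```

Save the container config alongside the archive, as was done for 114, because a plain tar of the rootfs does not carry it.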

Resolved (2026-06-24)

No owner decision required — already done.

Item	Outcome
Broken CIFS `vm-backups` on Prawns	Removed from PVE
New backup target	NFS `infra-backups` → `192.168.6.215:/volume6/infra-backups` (2 TB quota)
Retention	`keep-last=3` on `infra-backups`
Legacy Wave A vzdump on prox `local`	Relocated to `infra-backups/dump/` (~3.1 GB total)
Validation	Test `vzdump 120` succeeded

Future vzdump:

vzdump <vmid> --storage infra-backups --compress zstd

Raise the 2 TB DSM quota on infra-backups before scheduling full-VM dumps (e.g. saltierpoop).

Current state (live, 2026-06-24)

Hypervisor layers

Layer	Size	Used	Free	Assessment
pve-root (`local`)	94 GB	63 GB (71%)	27 GB	Improved after dump relocation; ~11 GB ISO prune available
local-lvm thin pool	794 GB	537 GB (67.5%)	~270 GB	Adequate if 114 stops growing
infra-backups NFS	2 TB quota	~3 GB	~2 TB	Healthy
synorpn NFS (Prawns)	30 TB	99.6%	~112 GB	Critical upstream cap

Guest thin-pool consumers (top)

VMID	Guest	Prov.	Thin data %	~Actual	In-guest `/`	Verdict
100	saltierpoop	260 GB	79.7%	~207 GB	206/258 GB (81%)	Largest anchor; local OK, NFS exposed
114	nfs-monitoring	112 GB	98.7%	~110 GB	96/111 GB (91%)	Urgent
119	harbor-registry	80 GB	87.8%	~70 GB	45/79 GB (61%)	Monitor; registry GC
200	haos	32 GB	95.3%	~30 GB	—	HA retention review
123	infra-services	30 GB	65.0%	~19 GB	—	Healthy
111	influxdb	8 GB	86.7%	~7 GB	—	Retire after Grafana cutover
116	pulse	4 GB	99.5%	~4 GB	59% in-guest	Small but maxed LV

12 guests remain on prox after Wave A/B (3 VMs, 9 LXCs).
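
The "~Actual" column above is approximately provisioned size times thin data percent. The sketch below reproduces that arithmetic and flags LVs above a chosen data percentage. The function names are hypothetical, and the commented `lvs` call assumes the thin pool sits in the default `pve` volume group.

```shell
#!/bin/sh
# Hypothetical helpers for reading thin-pool usage.

# thin_actual_gb PROVISIONED_GB DATA_PERCENT: approximate allocated space, rounded.
thin_actual_gb() {
    awk -v size="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", size * pct / 100 }'
}

# flag_full_lvs THRESHOLD: read "name size data_percent" lines on stdin and
# print the names whose data_percent is at or above THRESHOLD.
flag_full_lvs() {
    awk -v limit="$1" '$3 + 0 >= limit + 0 { print $1 }'
}

# Example (assumption: volume group is named "pve"):
# lvs --noheadings -o lv_name,lv_size,data_percent pve | flag_full_lvs 95
```

For example, `thin_actual_gb 260 79.7` gives 207, matching the saltierpoop row.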

flowchart TB
  subgraph prox["prox NVMe"]
    root["pve-root 27G free"]
    thin["thin pool 270G free"]
    s100["VM 100 · 207G actual"]
    s114["LXC 114 · 110G actual · 98.7%"]
  end
  subgraph whrrr["Whrrr 192.168.6.215"]
    v9["volume9 Prawns · 99.6%"]
    v6["volume6 infra-backups · 0.15%"]
  end
  v9 --> s114
  v9 --> s100
  v6 --> vzdump["vzdump archives"]
  thin --> s100
  thin --> s114

Risk register

ID	Risk	Likelihood	Impact	Mitigation owner
R1	114 rootfs fills → ES/Docker crash	High	Loss of legacy dashboards; possible corrupt indices	Owner: keep vs gut 114
R2	Prawns NFS full → saltierpoop *arr/download failures	Medium–High	Media pipeline stalls	Owner: Whrrr archive policy
R3	Large vzdump to `local` during misconfigured job	Low (post-fix)	Repeat metrimon failure	Ops: always `--storage infra-backups`
R4	saltierpoop thin growth without cleanup	Medium	Erodes pool headroom over months	Saltbox log/docker hygiene
R5	pulse / haos / harbor thin max	Medium	Niche service outage	Per-guest cleanup or resize
R6	2 TB backup quota exceeded by full VM dump	Low until scheduled	Backup job failure	Raise quota when needed
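
One way to enforce the R3 mitigation is a thin wrapper that builds the vzdump command line and refuses any storage other than infra-backups. This is a sketch under assumptions: `vzdump_cmd` is a hypothetical name, and it only prints the command rather than running it.

```shell
#!/bin/sh
# vzdump_cmd VMID [STORAGE]: print the vzdump invocation for a guest, refusing
# anything except the approved backup storage. Hypothetical helper.
vzdump_cmd() {
    vmid=$1
    storage=${2:-infra-backups}
    case $vmid in
        ''|*[!0-9]*) echo "invalid VMID: $vmid" >&2; return 2 ;;
    esac
    if [ "$storage" != "infra-backups" ]; then
        echo "refusing storage '$storage': use infra-backups" >&2
        return 1
    fi
    echo "vzdump $vmid --storage $storage --compress zstd"
}
```

Printing instead of executing keeps the wrapper safe to use in reviews; pipe the output to `sh` only once the command looks right.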

nfs-monitoring (LXC 114) — detailed proposal

What it is today

Role (original): NFS bridge + observability stack for Prawns/synorpn.
Mounts: synorpn (Prawns NFS, 100%), legacy mp1 prawns bind.
Running containers: elasticsearch, kibana, logstash, file_crawler, cadvisor, elasticsearch_exporter.
Migration status: NFS dashboard metrics moved to Prometheus on infra-services (2026-06-24). Influx on LXC 111 still pending Grafana datasource removal.

Why it is urgent

Thin LV at 98.7%: the guest has almost no room to grow without expansion or cleanup.
53 GB in /var/lib/docker on a 111 GB rootfs with 11 GB free.
Upstream Prawns at 100% — file_crawler and any NFS writes are on a full volume.

Options

Option	Actions	Pros	Cons	Recommended when
A — Gut legacy stack (default)	Stop ES/Kibana/Logstash/file_crawler; keep cadvisor + ES exporter if still scraped; prune Docker volumes	Frees 40–50+ GB quickly; reduces ops burden	Lose Kibana/Logstash if still used	No one opens Kibana weekly
B — Keep stack, shrink data	Curate ES indices; Docker prune; reduce retention	Preserves UI for debugging	Labor-intensive; Prawns still 100%	Short-term bridge only
C — Keep stack, expand disk	Resize LXC to 150 GB+; set ES ILM	Headroom for growth	Consumes thin pool; doesn't fix Prawns	Confirmed ongoing ES need
D — Full retire	Stop LXC; remove from Prometheus scrape; update inventory	Maximum prox relief	Lose NFS-side tooling entirely	Bridge no longer needed

Proposal: Option A now, re-evaluate D after confirming no Grafana/ops dependency on Kibana. Keep cadvisor and elasticsearch_exporter only if Prometheus jobs on 114 remain valuable; otherwise migrate exporters or drop scrapes.

Owner decisions (114)

[ ] D1: Confirm whether Kibana / Logstash / ES on 114 are still used (Y/N).
[ ] D2: If No → approve Option A cleanup (agent executes with pre-stop snapshot optional).
[ ] D3: If Yes → choose B (retention) or C (resize to ___ GB).
[ ] D4: Pause file_crawler until Prawns has headroom? (Y/N)

saltierpoop (VM 100) — detailed proposal

What it is today

Role: Production Saltbox media stack (~40 containers).
Prox cost: 30 GB RAM, 260 GB thin disk (~207 GB actual), roughly 39% of all thin usage (207 of 537 GB).
In-guest disk: 206 / 258 GB (81%) — ~51 GB free on local root.
Docker loop: 34 / 54 GB — not the immediate problem.
NFS mounts:

Export	Volume	Use %	Free
Prawns	volume9	100%	~112 GB
Movems	volume2	100%	~87 GB
SerializedWatchables	volume1	98%	~938 GB
OrderedWords	volume6	43%	~8.3 TB

Why it feels risky (but differently from 114)

Prox local disk: Moderate — not imminently full.
Media pipeline: High — new writes to Prawns/Movems can fail or stall while exports show 100%.
Backup: Full vzdump fits the 2 TB quota; compressed size may be 80–150+ GB. Raise quota before making full-VM dumps routine.
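
The backup sizing above can be sanity-checked with a little arithmetic. The sketch below multiplies an estimated dump size by the keep-last count and compares the result with the quota. The function names are hypothetical, and the extra in-flight copy reflects an assumption that a new archive is written before old ones are pruned.

```shell
#!/bin/sh
# Hypothetical helpers for checking retention against the quota (all sizes in GB).

# retention_footprint_gb DUMP_GB KEEP: space held by KEEP retained archives.
retention_footprint_gb() {
    echo $(( $1 * $2 ))
}

# fits_quota DUMP_GB KEEP USED_GB QUOTA_GB: succeeds when the retained set plus
# one in-flight dump (assumed to be written before pruning) fits under the quota.
fits_quota() {
    needed=$(( $1 * ($2 + 1) + $3 ))
    [ "$needed" -le "$4" ]
}
```

With the upper estimate of 150 GB and keep-last=3, `retention_footprint_gb 150 3` gives 450 GB, and `fits_quota 150 3 3 2000` succeeds. This covers saltierpoop alone; dumps from other guests add to the total.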

Options

Option	Actions	Pros	Cons	Horizon
A — Hygiene only	Saltbox log rotation, Docker prune, review `/` growth	Low risk; no architecture change	Doesn't fix Prawns	Now
B — Whrrr media policy	Prawns/Movems cleanup or cold-tier moves	Fixes upstream for 100 + 114	Owner media/archive decisions	Soon
C — Migrate to Whrrr VMM	Move VM off prox per PLAN baseline	Frees ~207 GB thin + 30 GB RAM	VMM validation, cutover window	Phase 9+
D — Resize / split disks	Expand LV or add data disk	More local room	Doesn't fix NFS; uses thin pool	Only if local fills

Proposal: A + B in parallel — prox hygiene is maintenance; Prawns/Movems policy is the real saltierpoop decision. C remains the strategic end-game from PLAN.md § infra-services baseline; do not block on it for 114 urgency.

Owner decisions (100)

[ ] D5: Accept that Prawns/Movems at 100% is the primary saltierpoop risk (Y/N — acknowledge).
[ ] D6: Schedule Whrrr Prawns/Movems cleanup or tiering (owner timeline: ___).
[ ] D7: Approve saltierpoop in-guest hygiene pass (logs, unused images) — agent or manual?
[ ] D8: Whrrr VMM migration — defer / plan / prioritize (pick one).

Secondary guests — maintenance backlog

VMID	Guest	Issue	Proposed action	Owner gate
111	influxdb	Superseded by Prometheus for thermals/NFS	Retire LXC after Grafana Influx datasources removed	D9: confirm datasources migrated
109	graylog	Stopped; Pattern E central syslog	Revive per central-syslog-graylog	Separate syslog decision
116	pulse	99.5% thin on 4 GB LV	Expand to 8 GB or prune in-guest	Low urgency
119	harbor-registry	87.8% thin	Registry GC, old tag cleanup	Low urgency
200	haos	95.3% thin	Review HA snapshot/history retention	Low urgency
—	prox ISOs	~11 GB redundant Ubuntu ISOs	Delete desktop + duplicate server ISOs; keep one 24.04 server	D10: pick keeper ISO
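
For the ISO row, a dry-run listing helps the owner confirm the keeper before anything is deleted. The sketch below prints every ISO whose name does not start with the keeper prefix and removes nothing. `iso_prune_candidates` is a hypothetical name, and the example path assumes ISOs live in the default directory of the `local` storage.

```shell
#!/bin/sh
# iso_prune_candidates DIR KEEP_PREFIX: list ISOs in DIR whose file name does not
# start with KEEP_PREFIX. Prints only; nothing is deleted. Hypothetical helper.
iso_prune_candidates() {
    find "$1" -maxdepth 1 -type f -name '*.iso' ! -name "$2*" | sort
}

# Example (path is an assumption):
# iso_prune_candidates /var/lib/vz/template/iso ubuntu-24.04.1-live-server
```

Review the printed list against decision D10 before removing any file.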

Whrrr upstream (shared dependency)

Owner 2026-06-24: Whrrr NFS / pool cleanup is out of scope — pool layout prevents changes for now. This proposal covers prox guests only.

114 and 100 have both historically depended on volume9 (Prawns), which is near capacity. Operational backups use volume6 (infra-backups); do not route vzdump to Prawns.

Volume	Pool	Role	Use %	Proposal
volume9 (Prawns)	vg3	Media + synorpn + saltierpoop paths	~100%	Owner archive policy; not backup target
volume2 (Movems)	vg1	Movies NFS	~100%	Same
volume1 (SerializedWatchables)	vg1	TV	~98%	Monitor
volume6 (infra-backups)	vg2	Proxmox vzdump	~0%	Keep; raise quota on demand
volume6 (OrderedWords)	vg2	Books	~43%	Healthy

Capacity alerts: synology-capacity-ntfy (deployed on Whrrr).

Phased execution plan

Phase 0 — Done

[x] infra-backups NFS storage on prox
[x] Retire vm-backups CIFS
[x] Relocate legacy vzdump off pve-root
[x] Document in prox-storage-2026-06-24.md

Phase 1 — Urgent (this week)

Depends on D1–D4.

Owner confirms 114 stack usage.
If unused → stop ES/Kibana/Logstash/file_crawler; prune Docker; verify disk drop.
Optional: vzdump 114 --storage infra-backups before destructive cleanup.
Pause file_crawler if Prawns unchanged.

Success criteria: LXC 114 in-guest / below 70%; thin LV below 85%.
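
The Phase 1 success criteria reduce to two numeric comparisons, which can be sketched as a single check. `phase1_ok` is a hypothetical name; its inputs are the in-guest `/` usage and the thin LV data percentage, both read by hand or from monitoring.

```shell
#!/bin/sh
# phase1_ok GUEST_PCT THIN_PCT: succeeds when in-guest / usage is below 70%
# and thin LV data usage is below 85%. Hypothetical helper.
phase1_ok() {
    awk -v guest="$1" -v thin="$2" 'BEGIN { exit !(guest + 0 < 70 && thin + 0 < 85) }'
}
```

With the audited values, `phase1_ok 91 98.7` fails, as expected before cleanup.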

Phase 2 — Near-term (this month)

Depends on D5–D10 and consolidation queue.

Remove Grafana Influx datasources → retire LXC 111.
ISO prune on prox (~11 GB).
saltierpoop hygiene pass (logs/Docker).
Owner Prawns/Movems capacity actions on Whrrr.
harbor / haos / pulse disk passes as time allows.

Success criteria: pve-root above 35 GB free; Prawns below 98% or write policy documented; influxdb retired.

Phase 3 — Strategic (Phase 9+)

Depends on D8.

Evaluate saltierpoop migration to Whrrr VMM (PLAN baseline).
Revisit nfs-monitoring — full retire vs minimal NFS bridge.
Graylog 109 revive for central syslog (orthogonal to storage, but on prox).

Owner decision checklist (summary)

ID	Question	Options	Proposal
D1	Still using Kibana/Logstash/ES on 114?	Y / N	Assume N until confirmed
D2	Approve 114 legacy stack removal?	A / B / C / D	A if D1=N
D3	Resize 114 if keeping stack?	GB: ___	Only if D1=Y
D4	Pause file_crawler until Prawns relief?	Y / N	Y
D5	Acknowledge Prawns as saltierpoop blocker?	Y	Y
D6	Prawns/Movems cleanup timeline?	date / defer	Owner sets
D7	saltierpoop hygiene pass?	agent / manual / skip	agent with approval
D8	Whrrr VMM migration for 100?	defer / plan / prioritize	defer
D9	Influx datasources removed → retire 111?	Y / N	Y after verify
D10	ISO prune — keep `ubuntu-24.04.1-live-server` only?	Y / N	Y

Reply with decisions (e.g. D1=N, D2=A, D4=Y, D7=agent, D10=Y) to unblock Phase 1 execution.

What we are not proposing

Deleting backup artifacts without explicit owner approval (see homelab agent rules).
Moving vzdump back to Prawns or local as default.
Destroying saltierpoop or nfs-monitoring without disposition review + vzdump gate.
Raising infra-backups quota preemptively (owner: raise it only when a job requires it).

Changelog

Date	Change
2026-06-24	Initial proposal from live prox audit + owner storage session