infra-services capacity and resize¶
Capacity targets, operating guidelines, and Proxmox resize procedure for
infra-services (Proxmox VM 123, 192.168.6.17).
Inventory today (inventory/hosts/infra-services.yaml): 4 vCPU / 16 GiB RAM /
100 GiB disk (Proxmox), 75 GiB ext4 in guest, 4 GiB swap. Resized 2026-06-24
after Wazuh 4.14.5 exposed the original 30 GiB disk as too small.
Prometheus rules scoped to this host live in
monitoring/prometheus/alerts/infra-services.yml.
Target sizing¶
| Resource | Minimum (strict ops) | Recommended | Notes |
|---|---|---|---|
| RAM | 12 GiB | 16 GiB | Wazuh RSS ~2.1 GiB today; room for agents, pulls, patch restarts |
| Disk | 64 GiB | 80 GiB | Images + upgrade peak + indexer growth; 64 GiB needs aggressive prune + ILM |
| Swap | 4 GiB | 4 GiB swapfile | Safety net only — not a substitute for RAM |
| vCPU | 4 | 4 (keep) | Memory-bound at homelab scale |
Sweet spot: 16 GiB RAM / 80 GiB disk / 4 vCPU / 4 GiB swap.
Update inventory after the Proxmox resize (not before): set ram_gb: 16 and
storage.size_gb: 80 in inventory/hosts/infra-services.yaml.
Then run generators (render-ansible.py, render-doc-stubs.py, etc.) and commit.
Measured baseline (2026-06, post–Wazuh 4.14.5)¶
| Bucket | Size |
|---|---|
| Docker images | ~13 GiB (Wazuh ~7 GiB) |
| Container layers + volumes | ~6 GiB |
| OS + /usr | ~3 GiB |
| Root / total | ~28 GiB on 29 GiB VM |
| Container RSS (steady) | ~3.2–3.8 GiB |
| Host overhead | ~0.8–1.2 GiB |
Largest data volumes at snapshot: Prometheus ~1.5 GiB, AdGuard ~558 MiB, Mongo ~507 MiB, Loki ~320 MiB. Wazuh indexer data was still tiny; indexer indices are the long-term disk driver (plan 15–25 GiB in year one with FIM/vuln).
Safe operating guidelines¶
Disk¶
- Never run below 15% free on /: treat 85% used as warning, 90% as incident (pulls and indexer merges fail badly when full).
- After a Wazuh (or any large stack) upgrade: within 24h run docker image prune -f once the new stack is healthy. Budget ~7 GiB headroom during pull if old images are still present.
- Tight disk upgrade pattern: prune old images first, or pull one service at a time and prune between.
- Plan Wazuh indexer ILM/retention (e.g. 90–180d on alert indices) before disk becomes a surprise.
- Monitoring is bounded (Prometheus + Loki 30d retention) — budget ~6 GiB combined at steady state.
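The disk thresholds above can be sketched as a small classifier. This is only an illustration: the function name disk_state is hypothetical, and the commented usage line assumes GNU df on the host.

```shell
# Classify root-filesystem usage using the thresholds from this page:
# 85% used = warning, 90% used = incident.
disk_state() {
  used_pct=$1
  if [ "$used_pct" -ge 90 ]; then
    echo incident
  elif [ "$used_pct" -ge 85 ]; then
    echo warning
  else
    echo ok
  fi
}

# Example on the host (GNU df): keep only the digits of the "Use%" column.
# disk_state "$(df --output=pcent / | tail -n 1 | tr -dc '0-9')"
```

The Prometheus alerts remain the source of truth; a helper like this is only handy for ad hoc checks during a maintenance window.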
Memory¶
- Keep Wazuh indexer heap at 1 GiB until the VM has ≥16 GiB; only then consider 2 GiB (never >50% of VM RAM for heap alone).
- Do not add heavy stacks on infra-services without revisiting this budget.
- Enable the 4 GiB swapfile (Ansible) before or with the RAM resize — cheap insurance during pulls and patch windows.
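The heap rule above can be expressed as two small checks. This is a sketch under the stated guideline only; the function names are hypothetical and nothing here reads the actual Wazuh configuration.

```shell
# Upper bound for the Wazuh indexer heap (GiB) given VM RAM (GiB):
# 1 GiB until the VM has at least 16 GiB, then at most 2 GiB.
max_heap_gib() {
  ram_gib=$1
  if [ "$ram_gib" -ge 16 ]; then echo 2; else echo 1; fi
}

# Succeed only when the heap is at most 50% of VM RAM (both values in MiB).
heap_within_cap() {
  heap_mib=$1
  ram_mib=$2
  [ $((heap_mib * 2)) -le "$ram_mib" ]
}
```

For example, max_heap_gib 8 prints 1, and heap_within_cap 2048 16384 succeeds.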
Wazuh-specific¶
- All three Wazuh images must stay on the same tag.
- Run apply-security-config.sh after upgrades that touch indexer security YAML.
- Three agents (prox, infra-services, saltierpoop): budget +100–200 MiB manager RSS.
Emergency disk relief (before resize)¶
When / is critical and you need headroom immediately:
# On infra-services (SSH as someone)
docker image prune -f
docker builder prune -f # if build cache exists
sudo journalctl --vacuum-size=200M
df -h /
If pruning Wazuh old tags: only after the 4.14.5 stack is verified healthy (you need one full image set per component).
Resize procedure (Proxmox VM 123)¶
Schedule a maintenance window. Order: disk first (you may be at 100% now), then RAM. Swap can land via ansible-pull before or after RAM.
1. Pre-checks¶
# From your workstation
ssh infra-services-cursor 'df -h /; free -h; docker ps --format "{{.Names}}" | wc -l'
On prox (192.168.6.71): confirm local-lvm (or the VM’s datastore) has
enough thin free space for +50 GiB. See
prox storage remediation.
Optional: snapshot VM 123 in Proxmox before resize.
2. Extend disk (Proxmox)¶
Use GiB, not TiB
Proxmox Resize disk asks for the new total size. Enter 80
and confirm the unit is GiB. A typo of 34000 GiB (or selecting TiB)
produced a ~34 TiB virtio disk in the 2026-06-17 incident — the guest
partition grew to match while ext4 stayed at 29 GiB.
- Proxmox UI → VM 123 → Hardware → Hard Disk (scsi0) → Resize disk.
- Set total size to 80 GiB (not 80 TiB, not 34000).
- Apply — online grow works for virtio-scsi on Ubuntu 24.04 in most cases.
If / is full, free space before growpart (journal vacuum). Then grow the
filesystem with an explicit size so a mis-sized virtio disk cannot expand ext4
to terabytes:
ssh infra-services-cursor 'sudo journalctl --vacuum-size=200M' # if / is full
ssh infra-services-cursor 'lsblk /dev/sda'
ssh infra-services-cursor 'sudo growpart /dev/sda 1' # if partition < disk
ssh infra-services-cursor 'sudo resize2fs /dev/sda1 75G' # explicit target, not bare resize2fs
ssh infra-services-cursor 'df -h /'
Do not run bare resize2fs /dev/sda1 when lsblk shows a multi-terabyte
partition — always pass 75G (or your target minus ~5 GiB for boot slices).
Adjust partition/device if lsblk shows something other than sda1 (unlikely on
this VM).
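The explicit-size rule above could be wrapped in a guard that refuses to proceed when the disk is not the size you expect. This is a minimal sketch: resize_target is a hypothetical helper, and the 5 GiB reserve simply mirrors the "target minus ~5 GiB" guidance.

```shell
# Print an explicit resize2fs target ("<N>G") for a disk of known size, or
# fail when the disk is not the expected size (for example after a TiB typo).
# Arguments: disk size in bytes, expected disk size in GiB.
resize_target() {
  disk_bytes=$1
  expected_gib=$2
  gib=$((1024 * 1024 * 1024))
  disk_gib=$((disk_bytes / gib))
  if [ "$disk_gib" -ne "$expected_gib" ]; then
    echo "disk is ${disk_gib} GiB, expected ${expected_gib} GiB; refusing" >&2
    return 1
  fi
  # Leave ~5 GiB for the boot slices, as described above.
  echo "$((expected_gib - 5))G"
}

# Example on the guest (lsblk -b prints bytes, -d skips partitions):
# target=$(resize_target "$(lsblk -bdno SIZE /dev/sda)" 80) \
#   && sudo resize2fs /dev/sda1 "$target"
```

With an 80 GiB disk this yields 75G; with the ~34 TiB disk from the incident it returns an error instead of a target.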
3. Increase RAM (Proxmox)¶
- VM 123 → Hardware → Memory → set 16384 MiB (16 GiB).
- Apply — takes effect on next boot or immediately if hotplug is enabled (Proxmox usually requires a reboot for RAM on QEMU VMs).
4. Enable swap (Ansible)¶
Repo prep (already in tree when merged):
- common_swapfile_enabled: true in infra/ansible/inventory/host_vars/infra-services.yml
- common role task swapfile.yml
After git pull on the host, ansible-pull applies swap on next converge, or run
manually:
ssh infra-services-cursor 'sudo systemctl start ansible-pull-apply.service'
ssh infra-services-cursor 'swapon --show; free -h'
One-time manual fallback (if Ansible has not run yet):
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
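Rerunning the fallback appends a second identical /etc/fstab entry. One way to avoid that, sketched here with a hypothetical ensure_line helper, is to append only when the exact line is missing.

```shell
# Append a line to a file only when that exact line is not already present.
# grep -x matches whole lines, -F treats the pattern as a fixed string.
ensure_line() {
  file=$1
  line=$2
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (run as root on the guest):
# ensure_line /etc/fstab '/swapfile none swap sw 0 0'
```

This only guards the fstab step; the Ansible role remains the intended way to manage swap.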
5. Reload Prometheus rules¶
After the alert file is on the host (git pull under /var/lib/ansible-pull/homelab):
Or restart the prometheus container from /opt/homelab/services/monitoring.
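One possible reload command, as a sketch rather than the repo's own method: Prometheus re-reads its configuration and rule files on SIGHUP, so signalling the container avoids a restart. The container name prometheus and the helper names are assumptions; the wrapper prints the command by default so it can be reviewed first.

```shell
# Print commands by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Ask Prometheus to reload config and rule files by sending SIGHUP.
# The container name defaults to "prometheus" (an assumption); pass another as $1.
reload_prometheus() {
  run docker kill --signal=HUP "${1:-prometheus}"
}

# Example on the host:
# reload_prometheus              # prints the command
# DRY_RUN=0 reload_prometheus    # sends the signal
```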
6. Post-resize verification¶
ssh infra-services-cursor 'free -h && df -h / && docker system df'
ssh infra-services-cursor 'docker stats --no-stream --format "{{.Name}} {{.MemUsage}}" | sort'
Pass criteria:
- ≥10 GiB free on / at idle
- ≥4 GiB MemAvailable under normal load (on 16 GiB VM)
- Swap active (swapon --show)
- Prometheus: no firing InfraServices* alerts after 15 minutes
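The first three criteria lend themselves to a scripted check. This is a sketch with a hypothetical post_resize_ok helper; the Prometheus alert criterion still needs to be confirmed in Prometheus itself, and the commented usage assumes GNU df and Linux /proc/meminfo.

```shell
# Evaluate the disk, memory, and swap pass criteria. Arguments, all in KiB:
#   free space on /, MemAvailable, total swap.
post_resize_ok() {
  free_kib=$1
  memavail_kib=$2
  swap_kib=$3
  gib_kib=$((1024 * 1024))
  [ "$free_kib" -ge $((10 * gib_kib)) ] || { echo "FAIL: less than 10 GiB free on /"; return 1; }
  [ "$memavail_kib" -ge $((4 * gib_kib)) ] || { echo "FAIL: MemAvailable below 4 GiB"; return 1; }
  [ "$swap_kib" -gt 0 ] || { echo "FAIL: swap not active"; return 1; }
  echo PASS
}

# Example on the guest:
# post_resize_ok \
#   "$(df -k --output=avail / | tail -n 1 | tr -dc '0-9')" \
#   "$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)" \
#   "$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)"
```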
7. Update inventory and docs¶
- Edit inventory/hosts/infra-services.yaml (ram_gb: 16, storage.size_gb: 80).
- Run generators and commit.
- Optional: python scripts/proxmox-scan.py to refresh compute-live JSON.
Related¶
- Wazuh SIEM — stack on this host
- Adding a service — disk budget before new stacks
- Coordinated OS patching — patch window load
- Synology capacity ntfy — separate NAS alerts
Research updates (2026-06-17)¶
Live checks from infra-services-cursor plus frozen Proxmox storage audit
prox-storage-2026-06-24.md.
Use this section when scheduling the maintenance window — numbers drift daily while
Wazuh and Docker layers grow.
Live guest state (2026-06-17)¶
| Signal | Value | Implication |
|---|---|---|
| / usage | 29 GiB / 29 GiB (~100%, ~33 MiB free) | Blocker for docker pull, swapfile, apt |
| Block device | sda 30 GiB → sda1 29 GiB ext4 on / | Standard Ubuntu 24.04 cloud image layout |
| Boot partitions | sda15 EFI 106M, sda16 /boot 913M | Leave untouched during grow |
| RAM | 7.7 GiB total, ~4.1 GiB MemAvailable, no swap | Memory OK today; OOM risk on spikes |
| Containers | 17 running (was 13 pre-Wazuh in guest scan) | +Wazuh×3, +authentik-outpost |
| cloud-guest-utils | installed (growpart available) | Guest grow after Proxmox disk resize is one command |
| QEMU guest agent | running | Clean shutdown from Proxmox UI works |
| docker image prune -f | 0 B reclaimed | All 17 images referenced by running containers |
Docker disk (docker system df):
| Layer | Size | Notes |
|---|---|---|
| Images | 13.32 GiB (17 active) | Wazuh trio ~7 GiB (indexer 2.49 + manager 2.44 + dashboard 2.08 GiB) |
| Container writable layers | 3.00 GiB | wazuh-manager alone ~3 GiB — unusually large; watch on upgrades |
| Named volumes | 3.07 GiB | Prometheus 1.5 GiB, AdGuard 559 MiB, Mongo 507 MiB, Loki 318 MiB, Wazuh queue 270 MiB, indexer data 3 MiB |
| Build cache | 0 | — |
Top RSS (steady state today):
| Container | RSS |
|---|---|
| wazuh-indexer | 1.39 GiB |
| wazuh-manager | 515 MiB |
| ara | 287 MiB |
| wazuh-dashboard | 199 MiB |
Non-Docker disk hog: systemd journal /var/log/journal ~2.8–2.9 GiB — largest
single reclaimable chunk without stopping stacks.
Proxmox side (VM 123, audit 2026-06-24)¶
| Item | Value |
|---|---|
| VMID / name | 123 / infra-services |
| Datastore | local-lvm (thin LVM on prox NVMe) |
| Provisioned disk | 30 GiB (vm-123-disk-1; ~65% LVM data at audit → guest now ~100%) |
| Provisioned RAM | 8192 MiB |
| Thin pool headroom | ~271 GiB free on 794 GiB pool (67.5% used at audit) |
| Hypervisor RAM | prox 60 GiB total — +8 GiB for VM 123 is trivial |
Verdict: Proxmox storage is not the blocker; the guest filesystem is full.
Growing scsi0 by +50 GiB (30 → 80 GiB) consumes ~6% of the 794 GiB thin pool, or about 18% of the ~271 GiB still free.
Patch-controller SSH from infra-services → prox returned permission denied during this research pass — perform Proxmox changes from the Proxmox UI or a workstation with prox admin access, not via the patch-orchestrator key.
Proxmox resize — exact targets¶
Disk (UI): VM 123 → Hardware → Hard Disk (scsi0) on local-lvm → Resize
→ 80 GiB total (or +50G increment from 30).
Disk (CLI on prox):
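A minimal sketch of a CLI equivalent, assuming the qm disk resize subcommand referenced later on this page, which takes an absolute size when no + prefix is given. The helper name pve_resize_cmd and the 200 GiB sanity ceiling are assumptions, added to catch typos such as 34000; the helper only prints the command so it can be reviewed before running it on prox.

```shell
# Build the Proxmox CLI command for an absolute disk size in GiB, refusing
# non-numeric input and anything above a sanity ceiling (default 200 GiB,
# an arbitrary assumption).
pve_resize_cmd() {
  vmid=$1
  disk=$2
  size_gib=$3
  max_gib=${4:-200}
  case $size_gib in
    ''|*[!0-9]*) echo "size must be a whole number of GiB" >&2; return 1 ;;
  esac
  if [ "$size_gib" -gt "$max_gib" ]; then
    echo "refusing ${size_gib} GiB (ceiling ${max_gib} GiB)" >&2
    return 1
  fi
  echo "qm disk resize $vmid $disk ${size_gib}G"
}

# Example: pve_resize_cmd 123 scsi0 80
```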
Guest grow (after Proxmox resize, online OK for virtio-scsi):
ssh infra-services-cursor 'lsblk /dev/sda'
ssh infra-services-cursor 'sudo growpart /dev/sda 1 && sudo resize2fs /dev/sda1 && df -h /'
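Expected: sda shows 80G, sda1 grows to ~79G, / reports ~50+ GiB free
at current utilization. Bare resize2fs is safe here only if lsblk shows sda at
80G; if it shows anything larger, stop and pass an explicit size (for example
75G) as described in the resize procedure.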
RAM (UI): VM 123 → Hardware → Memory → 16384 MiB.
RAM (CLI on prox):
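A sketch of a CLI equivalent, assuming qm set with its --memory option (value in MiB). The helper name pve_memory_cmd is hypothetical; it converts GiB to MiB and prints the command for review rather than running it.

```shell
# Print the qm command that sets VM memory, converting GiB to MiB.
pve_memory_cmd() {
  vmid=$1
  ram_gib=$2
  echo "qm set $vmid --memory $((ram_gib * 1024))"
}

# Example: pve_memory_cmd 123 16
```

As noted in the resize procedure, the new value usually takes effect only after the VM is rebooted.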
Immediate relief before Proxmox (no stack stop)¶
If you need headroom today without a hypervisor change:
# ~2.5 GiB back from journal (safe on homelab; logs also in Loki)
ssh infra-services-cursor 'sudo journalctl --vacuum-size=200M && df -h /'
That may free enough space for a 4 GiB swapfile until the proper disk resize. Do not rely on journal vacuum long term — fix root size.
docker image prune still reclaims nothing meaningful while all 17 images are in use.
Monitoring¶
| Alert | Threshold | Firing today? |
|---|---|---|
| InfraServicesDiskCritical | <10% free on / | Yes (deployed via hotfix + in repo) |
| InfraServicesDiskPressure | <15% free | Yes |
| InfraServicesMemoryLow | MemAvailable <1.5 GiB | No (~4.1 GiB available) |
| Generic DiskSpaceLow (all hosts) | <20% free | Also fires for infra-services |
After resize + git pull, ansible-pull enables swap. If the alert file was only
hotfixed, reload Prometheus as in step 5 (Reload Prometheus rules).
Recommended maintenance order (condensed)¶
- Optional: journal vacuum if you need swap before Proxmox.
- Proxmox: disk 80 GiB → guest growpart + resize2fs.
- Proxmox: RAM 16 GiB → reboot VM.
- git pull on host → ansible-pull (swapfile) → verify swapon --show.
- Update inventory/hosts/infra-services.yaml + generators + commit.
- Post-check: df -h /, free -h, confirm InfraServices* alerts clear.
Open follow-ups (not blocking resize)¶
- Wazuh indexer ILM/retention — indexer data still tiny; plan before it becomes the next disk surprise.
- Wazuh agents on prox, infra-services, saltierpoop — budget +100–200 MiB manager RSS.
- Refresh prox live scan after resize: python scripts/proxmox-scan.py (updates compute-live/ JSON).
Troubleshooting: resize failures (2026-06-17 incident)¶
Symptoms¶
mkdir: cannot create directory '/tmp/growpart.…': No space left on device
resize2fs: Block bitmap checksum does not match bitmap
lsblk shows sda / sda1 as 34T
df shows / still 29G
Root causes¶
- / at 100%: growpart needs /tmp; vacuum journal first.
- Proxmox disk set to ~34 TiB instead of 80 GiB: partition auto-span matched the bogus virtio size; bare resize2fs then tried to grow toward TiB and failed on a full filesystem.
- resize2fs abort: read-only e2fsck -n reported clean after journal vacuum; no offline fsck required in this case.
Recovery (worked live)¶
sudo journalctl --vacuum-size=200M # freed ~2.5 GiB
sudo e2fsck -n /dev/sda1 # verify clean before retry
sudo resize2fs /dev/sda1 75G # explicit size — NOT bare resize2fs
df -h / # expect ~73G size, ~48G free
Still required on Proxmox (shrinking)¶
qm disk resize cannot shrink LVM-thin volumes (shrinking disks is not
supported). Guest partition must be smaller than the target before
hypervisor shrink. Cloud-init growpart will undo a manual partition shrink
on boot unless disabled (see below).
Completed fix (2026-06-25, VM 123 → 100 GiB):
1. Snapshot: qm snapshot 123 pre-disk-shrink --description "…"
2. Guest (online): disable autogrow → shrink partition → shutdown
3. Proxmox (VM stopped): lvresize -L 100G -f /dev/pve/vm-123-disk-1
4. Fix GPT backup header: sgdisk -e /dev/pve/vm-123-disk-1 (required after lvresize or the VM will not get network)
5. Reattach disk if needed: qm set 123 --scsi0 local-lvm:vm-123-disk-1,…,size=100G
6. Boot guest: growpart /dev/sda 1 && resize2fs /dev/sda1
7. Delete snapshot: qm delsnapshot 123 pre-disk-shrink
Disable cloud-init autogrow (Ansible: common_cloud_init_disable_autogrow: true):
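What the Ansible variable renders is not shown here. As a hedged sketch of the manual equivalent, cloud-init's growpart module can be switched off and root filesystem resizing disabled with a drop-in file. The helper name disable_autogrow and the file name 99-disable-autogrow.cfg are assumptions.

```shell
# Write a cloud-init drop-in that turns off partition and filesystem autogrow.
# $1 is the drop-in directory (on the guest: /etc/cloud/cloud.cfg.d).
# The file name 99-disable-autogrow.cfg is an assumption.
disable_autogrow() {
  dir=$1
  cat > "$dir/99-disable-autogrow.cfg" <<'EOF'
growpart:
  mode: "off"
resize_rootfs: false
EOF
}

# Example (run as root on the guest):
# disable_autogrow /etc/cloud/cloud.cfg.d
```

Taking the directory as an argument lets you render the file somewhere harmless and inspect it before placing it on the guest.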
Guest shrink (before hypervisor):
yes Yes | sudo parted ---pretend-input-tty /dev/sda unit GiB resizepart 1 85GiB
sudo partprobe /dev/sda
lsblk /dev/sda # sda1 ~84G while sda may still show 34T until step 3–4
sudo shutdown -h now
Proxmox shrink (from prox shell, VM stopped):
lvresize -L 100G -f /dev/pve/vm-123-disk-1
sgdisk -e /dev/pve/vm-123-disk-1
qm set 123 --scsi0 local-lvm:vm-123-disk-1,discard=on,ssd=1,size=100G
qm start 123
Until steps 3–4 complete, lsblk may still show a bogus 34T sda while
sda1 is already sane — fix the hypervisor LV, not just the partition.
Close-out (2026-06-25)¶
34 TiB typo disk fully corrected on VM 123:
| Check | Result |
|---|---|
| Proxmox scsi0 | 100 GiB (vm-123-disk-1) |
| Guest lsblk | sda 100G, sda1 99G |
| ext4 / | 96 GiB, ~62 GiB free |
| RAM | 16 GiB |
| Swap | 4 GiB /swapfile |
| Stacks | 17/17 containers healthy |
| Wazuh agents | prox, infra-services, saltierpoop Active |
| Snapshot pre-disk-shrink | removed (reclaimed thin-pool headroom) |
| cloud-init autogrow | disabled via Ansible |
Inventory storage.size_gb: 100 matches reality.