infra-services capacity and resize¶

Capacity targets, operating guidelines, and Proxmox resize procedure for infra-services (Proxmox VM 123, 192.168.6.17).

Inventory today (inventory/hosts/infra-services.yaml): 4 vCPU / 16 GiB RAM / 100 GiB disk (Proxmox), 75 GiB ext4 in guest, 4 GiB swap. Resized 2026-06-24 after Wazuh 4.14.5 exposed the original 30 GiB disk as too small.

Prometheus rules scoped to this host live in monitoring/prometheus/alerts/infra-services.yml.

Target sizing¶

Resource	Minimum (strict ops)	Recommended	Notes
RAM	12 GiB	16 GiB	Wazuh RSS ~2.1 GiB today; room for agents, pulls, patch restarts
Disk	64 GiB	80 GiB	Images + upgrade peak + indexer growth; 64 GiB needs aggressive prune + ILM
Swap	4 GiB	4 GiB swapfile	Safety net only — not a substitute for RAM
vCPU	4	4 (keep)	Memory-bound at homelab scale

Sweet spot: 16 GiB RAM / 80 GiB disk / 4 vCPU / 4 GiB swap.

Update inventory after the Proxmox resize (not before):

hardware:
  vcpu: 4
  ram_gb: 16
  storage:
    - device: scsi0
      size_gb: 80
      kind: virtio-scsi
      role: boot+data

Then run generators (render-ansible.py, render-doc-stubs.py, etc.) and commit.

Measured baseline (2026-06, post–Wazuh 4.14.5)¶

Bucket	Size
Docker images	~13 GiB (Wazuh ~7 GiB)
Container layers + volumes	~6 GiB
OS + `/usr`	~3 GiB
Root `/` total	~28 GiB used on a 29 GiB filesystem
Container RSS (steady)	~3.2–3.8 GiB
Host overhead	~0.8–1.2 GiB

Largest data volumes at snapshot: Prometheus ~1.5 GiB, AdGuard ~558 MiB, Mongo ~507 MiB, Loki ~320 MiB. Wazuh indexer data was still tiny; indexer indices are the long-term disk driver (plan 15–25 GiB in year one with FIM/vuln).
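The baseline figures can be rolled into a rough year-one projection. The sketch below is illustrative arithmetic only. It assumes the ~28 GiB root usage from the table, the upper 25 GiB indexer estimate, and one extra ~7 GiB Wazuh image set during an upgrade pull, then compares the total against an 85% warning line. The function names are hypothetical and not part of the repo.

```shell
#!/bin/sh
# Rough year-one disk budget in whole GiB (illustrative; inputs are the
# rounded figures from this page, not live measurements).

# peak_usage_gib BASELINE INDEXER_GROWTH UPGRADE_PEAK -> prints the sum
peak_usage_gib() {
    echo $(( $1 + $2 + $3 ))
}

# fits_under_warning FS_SIZE_GIB PEAK_GIB -> exit 0 if peak <= 85% of fs
fits_under_warning() {
    # Integer math: peak*100 <= fs*85 avoids fractions.
    [ $(( $2 * 100 )) -le $(( $1 * 85 )) ]
}

peak=$(peak_usage_gib 28 25 7)          # 28 + 25 + 7 = 60
echo "projected peak: ${peak} GiB"

for fs in 64 75; do
    if fits_under_warning "$fs" "$peak"; then
        echo "${fs} GiB filesystem: peak stays under the 85% warning line"
    else
        echo "${fs} GiB filesystem: peak crosses the 85% warning line"
    fi
done
```

With these inputs a 64 GiB filesystem crosses the warning line at peak while a 75 GiB one does not, which matches the note that 64 GiB needs aggressive prune + ILM.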

Safe operating guidelines¶

Disk¶

Never run below 15% free on / — treat 85% used as warning, 90% as incident (pulls and indexer merges fail badly when full).
After Wazuh (or any large stack) upgrade: within 24h run docker image prune -f once the new stack is healthy. Budget ~7 GiB headroom during pull if old images are still present.
Tight disk upgrade pattern: prune old images first, or pull one service at a time and prune between.
Plan Wazuh indexer ILM/retention (e.g. 90–180d on alert indices) before disk becomes a surprise.
Monitoring is bounded (Prometheus + Loki 30d retention) — budget ~6 GiB combined at steady state.
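The disk thresholds above can be expressed as a small classifier for ad-hoc checks or cron wrappers. This is a minimal sketch: disk_state and root_used_percent are hypothetical helper names, and the live reader assumes GNU df as shipped on Ubuntu.

```shell
#!/bin/sh
# disk_state USED_PERCENT -> prints ok | warning | incident
# Thresholds follow the guideline: 85% used = warning, 90% used = incident.
disk_state() {
    used=$1
    if [ "$used" -ge 90 ]; then
        echo incident
    elif [ "$used" -ge 85 ]; then
        echo warning
    else
        echo ok
    fi
}

# Live usage of / as a bare integer (GNU coreutils df; defined, not called).
root_used_percent() {
    df --output=pcent / | tail -n 1 | tr -dc '0-9'
}

disk_state 87    # prints: warning
```

On the host, disk_state "$(root_used_percent)" would classify the current state of /.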

Memory¶

Keep Wazuh indexer heap at 1 GiB until the VM has ≥16 GiB; only then consider 2 GiB (never >50% of VM RAM for heap alone).
Do not add heavy stacks on infra-services without revisiting this budget.
Enable the 4 GiB swapfile (Ansible) before or with the RAM resize — cheap insurance during pulls and patch windows.
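The heap rule above reduces to a simple lookup. The sketch below only encodes this page's guidance (1 GiB below 16 GiB of RAM, at most 2 GiB from 16 GiB, never more than half of RAM). max_indexer_heap_gib is a hypothetical helper and does not read or change any Wazuh configuration.

```shell
#!/bin/sh
# max_indexer_heap_gib VM_RAM_GIB -> largest heap (GiB) this page would allow
max_indexer_heap_gib() {
    ram=$1
    if [ "$ram" -ge 16 ]; then
        heap=2      # an upper bound to consider, not a mandate
    else
        heap=1
    fi
    # Hard ceiling: heap must never exceed 50% of VM RAM.
    if [ $(( heap * 2 )) -gt "$ram" ]; then
        heap=$(( ram / 2 ))
    fi
    echo "$heap"
}

max_indexer_heap_gib 8     # prints 1
max_indexer_heap_gib 16    # prints 2
```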

Wazuh-specific¶

All three Wazuh images must stay on the same tag.
Run apply-security-config.sh after upgrades that touch indexer security YAML.
Three agents (prox, infra-services, saltierpoop): budget +100–200 MiB manager RSS.
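The same-tag rule is easy to check mechanically before or after an upgrade. The sketch below compares the tag portion of image references passed as arguments. same_tag is a hypothetical helper, and the image names in the example are the public Docker Hub names, assumed here rather than taken from this repo's compose files.

```shell
#!/bin/sh
# same_tag IMAGE_REF... -> exit 0 if every ref carries the same tag
# Refs are expected as repo/name:tag; digest refs (@sha256:...) and
# untagged refs are not handled by this sketch.
same_tag() {
    first=""
    for ref in "$@"; do
        tag=${ref##*:}              # text after the last colon
        if [ -z "$first" ]; then
            first=$tag
        elif [ "$tag" != "$first" ]; then
            echo "tag mismatch: $ref (expected :$first)" >&2
            return 1
        fi
    done
    return 0
}

same_tag wazuh/wazuh-indexer:4.14.5 \
         wazuh/wazuh-manager:4.14.5 \
         wazuh/wazuh-dashboard:4.14.5 && echo "tags aligned"
```

On the host, the arguments could come from docker ps --format '{{.Image}}' filtered to the Wazuh containers.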

Emergency disk relief (before resize)¶

When / is critical and you need headroom immediately:

# On infra-services (SSH as a user with sudo and Docker access)
docker image prune -f
docker builder prune -f   # if build cache exists
sudo journalctl --vacuum-size=200M
df -h /

If pruning Wazuh old tags: only after the 4.14.5 stack is verified healthy (you need one full image set per component).

Resize procedure (Proxmox VM 123)¶

Schedule a maintenance window. Order: disk first (you may be at 100% now), then RAM. Swap can land via ansible-pull before or after RAM.

1. Pre-checks¶

# From your workstation
ssh infra-services-cursor 'df -h /; free -h; docker ps --format "{{.Names}}" | wc -l'

On prox (192.168.6.71): confirm local-lvm (or the VM’s datastore) has enough thin free space for +50 GiB. See prox storage remediation.

Optional: snapshot VM 123 in Proxmox before resize.

2. Extend disk (Proxmox)¶

Use GiB, not TiB

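Check what the Proxmox Resize disk field expects before applying. Current Proxmox VE releases label it Size Increment (GiB): the value is added to the existing size, so growing 30 → 80 GiB means entering 50. A typo of 34000 GiB (or a wrong unit) produced a ~34 TiB virtio disk in the 2026-06-17 incident; the guest partition grew to match while ext4 stayed at 29 GiB.

Proxmox UI → VM 123 → Hardware → Hard Disk (scsi0) → Resize disk.
Enter the increment that brings the total to 80 GiB (+50 from 30; not 80 TiB, not 34000), then confirm the new size on the Hardware tab.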
Apply — online grow works for virtio-scsi on Ubuntu 24.04 in most cases.

If / is full, free space before growpart (journal vacuum). Then grow the filesystem with an explicit size so a mis-sized virtio disk cannot expand ext4 to terabytes:

ssh infra-services-cursor 'sudo journalctl --vacuum-size=200M'   # if / is full
ssh infra-services-cursor 'lsblk /dev/sda'
ssh infra-services-cursor 'sudo growpart /dev/sda 1'             # if partition < disk
ssh infra-services-cursor 'sudo resize2fs /dev/sda1 75G'         # explicit target, not bare resize2fs
ssh infra-services-cursor 'df -h /'

Do not run bare resize2fs /dev/sda1 when lsblk shows a multi-terabyte partition — always pass 75G (or your target minus ~5 GiB for boot slices).

Adjust partition/device if lsblk shows something other than sda1 (unlikely on this VM).
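The "explicit target, with a sanity cap" rule can be wrapped in a helper so that a mis-sized disk is rejected before resize2fs runs. This is a sketch under assumptions: resize_target and disk_gib are hypothetical names, the 200 GiB default cap is an arbitrary ceiling chosen for illustration, and the 5 GiB margin is the figure suggested above.

```shell
#!/bin/sh
# resize_target DISK_GIB [MAX_GIB] -> prints e.g. "75G" for resize2fs
# Refuses (exit 1) when the reported disk is larger than MAX_GIB, which is
# the multi-terabyte case this page warns about.
resize_target() {
    disk=$1
    max=${2:-200}
    if [ "$disk" -gt "$max" ]; then
        echo "refusing: disk reports ${disk} GiB (> ${max} GiB cap)" >&2
        return 1
    fi
    echo "$(( disk - 5 ))G"     # leave ~5 GiB for boot slices and slack
}

# Whole-disk size in GiB from lsblk (bytes, disk row only; defined, not called).
disk_gib() {
    echo $(( $(lsblk -bdno SIZE /dev/sda) / 1073741824 ))
}

resize_target 80       # prints 75G
```

A guarded grow would then look like target=$(resize_target "$(disk_gib)") && sudo resize2fs /dev/sda1 "$target", which stops before touching the filesystem when lsblk reports terabytes.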

3. Increase RAM (Proxmox)¶

VM 123 → Hardware → Memory → set 16384 MiB (16 GiB).
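Apply. The change takes effect immediately only if memory hotplug is enabled; otherwise it stays pending until the VM is restarted from Proxmox (UI Reboot or qm reboot). A reboot issued inside the guest does not restart the QEMU process, so it does not apply pending hardware changes.

# On prox
qm reboot 123
# wait, then from your workstation:
ssh infra-services-cursor 'free -h'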

4. Enable swap (Ansible)¶

Repo prep (already in tree when merged):

common_swapfile_enabled: true in infra/ansible/inventory/host_vars/infra-services.yml
common role task swapfile.yml

After git pull on the host, ansible-pull applies swap on next converge, or run manually:

ssh infra-services-cursor 'sudo systemctl start ansible-pull-apply.service'
ssh infra-services-cursor 'swapon --show; free -h'

One-time manual fallback (if Ansible has not run yet):

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
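# Append the fstab entry only once, so re-running this block stays safe
grep -q '^/swapfile ' /etc/fstab || echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab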

5. Reload Prometheus rules¶

After the alert file is on the host (git pull under /var/lib/ansible-pull/homelab):

ssh infra-services-cursor 'curl -sf -X POST http://127.0.0.1:9090/-/reload'

Or restart the prometheus container from /opt/homelab/services/monitoring.

6. Post-resize verification¶

ssh infra-services-cursor 'free -h && df -h / && docker system df'
ssh infra-services-cursor 'docker stats --no-stream --format "{{.Name}} {{.MemUsage}}" | sort'

Pass criteria:

≥10 GiB free on / at idle
≥4 GiB MemAvailable under normal load (on 16 GiB VM)
Swap active (swapon --show)
Prometheus: no firing InfraServices* alerts after 15 minutes
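The first three pass criteria can be checked in one step. The sketch below takes the values as arguments so the logic stays testable. The helper names are hypothetical, the collectors assume GNU df and util-linux swapon as on Ubuntu 24.04, and the Prometheus alert criterion is not covered.

```shell
#!/bin/sh
# post_resize_ok FREE_GIB MEM_AVAIL_GIB SWAP_ACTIVE(0|1) -> exit 0 on pass
post_resize_ok() {
    rc=0
    [ "$1" -ge 10 ] || { echo "FAIL: only $1 GiB free on / (need >= 10)"; rc=1; }
    [ "$2" -ge 4 ]  || { echo "FAIL: MemAvailable $2 GiB (need >= 4)"; rc=1; }
    [ "$3" -eq 1 ]  || { echo "FAIL: swap not active"; rc=1; }
    return "$rc"
}

# Collectors for a live host (defined here, not called).
free_gib()      { df -BG --output=avail / | tail -n 1 | tr -dc '0-9'; }
mem_avail_gib() { awk '/^MemAvailable:/ { print int($2 / 1048576) }' /proc/meminfo; }
swap_active()   { [ -n "$(swapon --show --noheadings)" ] && echo 1 || echo 0; }

post_resize_ok 62 9 1 && echo "PASS"
```

On infra-services, post_resize_ok "$(free_gib)" "$(mem_avail_gib)" "$(swap_active)" would evaluate the live values.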

7. Update inventory and docs¶

Edit inventory/hosts/infra-services.yaml (ram_gb: 16, storage.size_gb: 80).
Run generators and commit.
Optional: python scripts/proxmox-scan.py to refresh compute-live JSON.

Wazuh SIEM — stack on this host
Adding a service — disk budget before new stacks
Coordinated OS patching — patch window load
Synology capacity ntfy — separate NAS alerts

Research updates (2026-06-17)¶

Sources: live checks run via infra-services-cursor, plus the frozen Proxmox storage audit prox-storage-2026-06-24.md. Use this section when scheduling the maintenance window; the numbers drift daily while Wazuh and Docker layers grow.

Live guest state (2026-06-17)¶

Signal	Value	Implication
`/` usage	29 GiB / 29 GiB (~100%, ~33 MiB free)	Blocker for `docker pull`, swapfile, apt
Block device	`sda` 30 GiB → `sda1` 29 GiB ext4 on `/`	Standard Ubuntu 24.04 cloud image layout
Boot partitions	`sda15` EFI 106M, `sda16` `/boot` 913M	Leave untouched during grow
RAM	7.7 GiB total, ~4.1 GiB MemAvailable, no swap	Memory OK today; OOM risk on spikes
Containers	17 running (was 13 pre-Wazuh in guest scan)	+Wazuh×3, +authentik-outpost
`cloud-guest-utils`	installed (`growpart` available)	Guest grow after Proxmox disk resize is one command
QEMU guest agent	running	Clean shutdown from Proxmox UI works
`docker image prune -f`	0 B reclaimed	All 17 images referenced by running containers

Docker disk (docker system df):

Layer	Size	Notes
Images	13.32 GiB (17 active)	Wazuh trio ~7 GiB (indexer 2.49 + manager 2.44 + dashboard 2.08 GiB)
Container writable layers	3.00 GiB	`wazuh-manager` alone ~3 GiB — unusually large; watch on upgrades
Named volumes	3.07 GiB	Prometheus 1.5 GiB, AdGuard 559 MiB, Mongo 507 MiB, Loki 318 MiB, Wazuh queue 270 MiB, indexer data 3 MiB
Build cache	0	—

Top RSS (steady state today):

Container	RSS
wazuh-indexer	1.39 GiB
wazuh-manager	515 MiB
ara	287 MiB
wazuh-dashboard	199 MiB

Non-Docker disk hog: systemd journal /var/log/journal ~2.8–2.9 GiB — largest single reclaimable chunk without stopping stacks.

Proxmox side (VM 123, audit 2026-06-24)¶

Item	Value
VMID / name	123 / `infra-services`
Datastore	`local-lvm` (thin LVM on prox NVMe)
Provisioned disk	30 GiB (`vm-123-disk-1`; ~65% LVM data at audit → guest now ~100%)
Provisioned RAM	8192 MiB
Thin pool headroom	~271 GiB free on 794 GiB pool (67.5% used at audit)
Hypervisor RAM	prox 60 GiB total — +8 GiB for VM 123 is trivial

Verdict: Proxmox storage is not the blocker; the guest filesystem is full. Growing scsi0 by +50 GiB (30 → 80 GiB) consumes at most ~18% of the ~271 GiB thin free space, or ~6% of the 794 GiB pool.

Patch-controller SSH from infra-services → prox returned permission denied during this research pass — perform Proxmox changes from the Proxmox UI or a workstation with prox admin access, not via the patch-orchestrator key.

Proxmox resize — exact targets¶

Disk (UI): VM 123 → Hardware → Hard Disk (scsi0) on local-lvm → Resize → +50 GiB increment (30 → 80 GiB total). Current Proxmox VE releases take a size increment in this dialog, not the new total.

Disk (CLI on prox):

qm resize 123 scsi0 +50G    # 30 → 80 GiB
# verify
qm config 123 | grep scsi
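Because qm resize treats a value with a leading + as an increment and a bare value as an absolute size, a small guard that computes the increment from current and target sizes can catch a typo before it reaches the hypervisor. This is a sketch: qm_increment is a hypothetical helper, the 200 GiB cap is an arbitrary assumption, and the example prints the command rather than running it.

```shell
#!/bin/sh
# qm_increment CURRENT_GIB TARGET_GIB [MAX_GIB] -> prints "+50G"-style arg
qm_increment() {
    cur=$1
    target=$2
    max=${3:-200}
    if [ "$target" -le "$cur" ]; then
        echo "refusing: target ${target} GiB is not larger than ${cur} GiB" >&2
        return 1
    fi
    if [ "$target" -gt "$max" ]; then
        echo "refusing: target ${target} GiB exceeds ${max} GiB cap" >&2
        return 1
    fi
    echo "+$(( target - cur ))G"
}

# Dry run: print the command instead of executing it.
inc=$(qm_increment 30 80) && echo "qm resize 123 scsi0 ${inc}"
```

A target of 34000 is rejected by the cap, so the bogus command is never printed.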

Guest grow (after Proxmox resize, online OK for virtio-scsi):

ssh infra-services-cursor 'lsblk /dev/sda'
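ssh infra-services-cursor 'sudo growpart /dev/sda 1 && sudo resize2fs /dev/sda1 75G && df -h /'

Expected: sda shows 80G, sda1 grows to ~79G, and df reports / at ~73G with roughly 45 GiB free at current utilization. Pass the explicit 75G so that a mis-sized disk cannot pull ext4 toward terabytes.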

RAM (UI): VM 123 → Hardware → Memory → 16384 MiB.

RAM (CLI on prox):

qm set 123 --memory 16384
qm reboot 123    # RAM change requires reboot on this VM type

Immediate relief before Proxmox (no stack stop)¶

If you need headroom today without a hypervisor change:

# ~2.5 GiB back from journal (safe on homelab; logs also in Loki)
ssh infra-services-cursor 'sudo journalctl --vacuum-size=200M && df -h /'

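This reclaims ~2.5 GiB on top of the ~33 MiB currently free, which is short of the full 4 GiB swapfile; at most a smaller temporary swapfile fits until the proper disk resize. Do not rely on journal vacuum long term; fix the root size.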

docker image prune still reclaims nothing while all 17 images are in use.

Monitoring¶

Alert	Threshold	Firing today?
`InfraServicesDiskCritical`	<10% free on `/`	Yes (deployed via hotfix + in repo)
`InfraServicesDiskPressure`	<15% free	Yes
`InfraServicesMemoryLow`	MemAvailable <1.5 GiB	No (~4.1 GiB available)
Generic `DiskSpaceLow` (all hosts)	<20% free	Also fires for infra-services

After resize + git pull, ansible-pull enables swap; reload Prometheus if alert file was only hotfixed:

ssh infra-services-cursor 'curl -sf -X POST http://127.0.0.1:9090/-/reload'

Recommended maintenance order (condensed)¶

Optional: journal vacuum if you need swap before Proxmox.
Proxmox: disk 80 GiB → guest growpart + resize2fs with an explicit size (75G).
Proxmox: RAM 16 GiB → reboot VM.
git pull on host → ansible-pull (swapfile) → verify swapon --show.
Update inventory/hosts/infra-services.yaml + generators + commit.
Post-check: df -h /, free -h, confirm InfraServices* alerts clear.

Open follow-ups (not blocking resize)¶

Wazuh indexer ILM/retention — indexer data still tiny; plan before it becomes the next disk surprise.
Wazuh agents on prox, infra-services, saltierpoop — budget +100–200 MiB manager RSS.
Refresh prox live scan after resize: python scripts/proxmox-scan.py (updates compute-live/ JSON).

Troubleshooting: resize failures (2026-06-17 incident)¶

Symptoms¶

mkdir: cannot create directory '/tmp/growpart.…': No space left on device
resize2fs: Block bitmap checksum does not match bitmap
lsblk shows sda / sda1 as 34T
df shows / still 29G

Root causes¶

/ at 100% — growpart needs /tmp; vacuum journal first.
Proxmox disk set to ~34 TiB instead of 80 GiB — partition auto-span matched the bogus virtio size; bare resize2fs then tried to grow toward TiB and failed on a full filesystem.
resize2fs abort — read-only e2fsck -n reported clean after journal vacuum; no offline fsck required in this case.

Recovery (worked live)¶

sudo journalctl --vacuum-size=200M          # freed ~2.5 GiB
sudo e2fsck -n /dev/sda1                    # verify clean before retry
sudo resize2fs /dev/sda1 75G                # explicit size, NOT bare resize2fs
df -h /                                     # expect ~73G size, ~48G free

Still required on Proxmox (shrinking)¶

qm disk resize cannot shrink disks (shrinking is not supported), so the LVM-thin volume has to be reduced directly with lvresize. The guest filesystem and partition must already end below the target size before the hypervisor-side shrink. Cloud-init growpart will undo a manual partition shrink on boot unless disabled (see below).

Completed fix (2026-06-25, VM 123 → 100 GiB):

1. Snapshot: qm snapshot 123 pre-disk-shrink --description "…"
2. Guest (online): disable autogrow → shrink partition → shutdown
3. Proxmox (VM stopped): lvresize -L 100G -f /dev/pve/vm-123-disk-1
4. Fix GPT backup header: sgdisk -e /dev/pve/vm-123-disk-1 (required after lvresize or the VM will not get network)
5. Reattach disk if needed: qm set 123 --scsi0 local-lvm:vm-123-disk-1,…,size=100G
6. Boot guest: growpart /dev/sda 1 && resize2fs /dev/sda1
7. Delete snapshot: qm delsnapshot 123 pre-disk-shrink

Disable cloud-init autogrow (Ansible: common_cloud_init_disable_autogrow: true):

# /etc/cloud/cloud.cfg.d/99-disable-autogrow.cfg
growpart:
  mode: off
resize_rootfs: false

Guest shrink (before hypervisor):

yes Yes | sudo parted ---pretend-input-tty /dev/sda unit GiB resizepart 1 85GiB
sudo partprobe /dev/sda
lsblk /dev/sda    # sda1 ~84G while sda may still show 34T until step 3–4
sudo shutdown -h now

Proxmox shrink (from prox shell, VM stopped):

lvresize -L 100G -f /dev/pve/vm-123-disk-1
sgdisk -e /dev/pve/vm-123-disk-1
qm set 123 --scsi0 local-lvm:vm-123-disk-1,discard=on,ssd=1,size=100G
qm start 123

Until steps 3–4 complete, lsblk may still show a bogus 34T sda while sda1 is already sane — fix the hypervisor LV, not just the partition.
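The shrink is only safe when the sizes nest correctly: filesystem within partition, and partition end within the target LV. The sketch below encodes that ordering as a pre-flight check using this incident's values (75 GiB ext4, ~84 GiB partition ending at 85 GiB, 100 GiB LV). shrink_order_ok is a hypothetical helper and does not inspect any device itself.

```shell
#!/bin/sh
# shrink_order_ok FS_GIB PART_SIZE_GIB PART_END_GIB LV_GIB
# Exit 0 only if the filesystem fits in the partition and the partition
# ends inside the target LV size.
shrink_order_ok() {
    if [ "$1" -gt "$2" ]; then
        echo "filesystem ($1 GiB) is larger than partition ($2 GiB)" >&2
        return 1
    fi
    if [ "$3" -gt "$4" ]; then
        echo "partition ends at $3 GiB, beyond target LV ($4 GiB)" >&2
        return 1
    fi
    return 0
}

shrink_order_ok 75 84 85 100 && echo "ordering OK for a 100 GiB LV"
```

Running the check with an 80 GiB target LV instead would fail on the second condition, which is the case where lvresize would cut into the partition.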

Close-out (2026-06-25)¶

34 TiB typo disk fully corrected on VM 123:

Check	Result
Proxmox `scsi0`	100 GiB (`vm-123-disk-1`)
Guest `lsblk`	`sda` 100G, `sda1` 99G
ext4 `/`	96 GiB, ~62 GiB free
RAM	16 GiB
Swap	4 GiB `/swapfile`
Stacks	17/17 containers healthy
Wazuh agents	prox, infra-services, saltierpoop Active
Snapshot `pre-disk-shrink`	removed (reclaimed thin-pool headroom)
cloud-init autogrow	disabled via Ansible

Inventory storage.size_gb: 100 matches reality.