
infra-services capacity and resize

Capacity targets, operating guidelines, and Proxmox resize procedure for infra-services (Proxmox VM 123, 192.168.6.17).

Inventory today (inventory/hosts/infra-services.yaml): 4 vCPU / 16 GiB RAM / 100 GiB disk (Proxmox), 75 GiB ext4 in guest, 4 GiB swap. Resized 2026-06-24 after Wazuh 4.14.5 exposed the original 30 GiB disk as too small.

Prometheus rules scoped to this host live in monitoring/prometheus/alerts/infra-services.yml.

Target sizing

| Resource | Minimum (strict ops) | Recommended | Notes |
|---|---|---|---|
| RAM | 12 GiB | 16 GiB | Wazuh RSS ~2.1 GiB today; room for agents, pulls, patch restarts |
| Disk | 64 GiB | 80 GiB | Images + upgrade peak + indexer growth; 64 GiB needs aggressive prune + ILM |
| Swap | 4 GiB | 4 GiB swapfile | Safety net only, not a substitute for RAM |
| vCPU | 4 | 4 (keep) | Memory-bound at homelab scale |

Sweet spot: 16 GiB RAM / 80 GiB disk / 4 vCPU / 4 GiB swap.

Update inventory after the Proxmox resize (not before):

hardware:
  vcpu: 4
  ram_gb: 16
  storage:
    - device: scsi0
      size_gb: 80
      kind: virtio-scsi
      role: boot+data

Then run generators (render-ansible.py, render-doc-stubs.py, etc.) and commit.

Measured baseline (2026-06, post–Wazuh 4.14.5)

| Bucket | Size |
|---|---|
| Docker images | ~13 GiB (Wazuh ~7 GiB) |
| Container layers + volumes | ~6 GiB |
| OS + /usr | ~3 GiB |
| Root / total | ~28 GiB on 29 GiB VM |
| Container RSS (steady) | ~3.2–3.8 GiB |
| Host overhead | ~0.8–1.2 GiB |

Largest data volumes at snapshot: Prometheus ~1.5 GiB, AdGuard ~558 MiB, Mongo ~507 MiB, Loki ~320 MiB. Wazuh indexer data was still tiny; indexer indices are the long-term disk driver (plan 15–25 GiB in year one with FIM/vuln).

Safe operating guidelines

Disk

  1. Never run below 15% free on / — treat 85% used as warning, 90% as incident (pulls and indexer merges fail badly when full).
  2. After Wazuh (or any large stack) upgrade: within 24h run docker image prune -f once the new stack is healthy. Budget ~7 GiB headroom during pull if old images are still present.
  3. Tight disk upgrade pattern: prune old images first, or pull one service at a time and prune between.
  4. Plan Wazuh indexer ILM/retention (e.g. 90–180d on alert indices) before disk becomes a surprise.
  5. Monitoring is bounded (Prometheus + Loki 30d retention) — budget ~6 GiB combined at steady state.
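
The thresholds in guideline 1 can be encoded as a small check, for example as a guard before a large pull. This is a minimal sketch: the helper names `disk_state` and `root_used_pct` are hypothetical, and `df --output=pcent` assumes GNU coreutils (as shipped with Ubuntu).

```shell
#!/usr/bin/env bash
# Classify root filesystem usage using the thresholds from guideline 1:
# 85% used = warning, 90% used = incident. disk_state is a hypothetical helper.
disk_state() {
  local used_pct=$1
  if [ "$used_pct" -ge 90 ]; then
    echo incident
  elif [ "$used_pct" -ge 85 ]; then
    echo warning
  else
    echo ok
  fi
}

# Read the current usage of / (GNU df) and keep only the digits.
root_used_pct() {
  df --output=pcent / | tail -n 1 | tr -dc '0-9'
}

# Example with a literal value; prints "warning".
disk_state 87
```

On the host, `disk_state "$(root_used_pct)"` would classify the live value; the Prometheus alerts remain the authoritative signal.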

Memory

  1. Keep Wazuh indexer heap at 1 GiB until the VM has ≥16 GiB; only then consider 2 GiB (never >50% of VM RAM for heap alone).
  2. Do not add heavy stacks on infra-services without revisiting this budget.
  3. Enable the 4 GiB swapfile (Ansible) before or with the RAM resize — cheap insurance during pulls and patch windows.

Wazuh-specific

  1. All three Wazuh images must stay on the same tag.
  2. Run apply-security-config.sh after upgrades that touch indexer security YAML.
  3. Three agents (prox, infra-services, saltierpoop): budget +100–200 MiB manager RSS.
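
The same-tag rule in item 1 is easy to check mechanically before and after an upgrade. The sketch below is illustrative only: `same_tag` is a hypothetical helper, and the repository names in the example are assumptions about how the images are referenced on this host.

```shell
#!/usr/bin/env bash
# Succeed only if every image reference passed in carries the same tag.
# same_tag is a hypothetical helper; references are expected as repo:tag.
same_tag() {
  local ref tag first=""
  for ref in "$@"; do
    tag=${ref##*:}            # text after the last colon
    if [ -z "$first" ]; then
      first=$tag
    elif [ "$tag" != "$first" ]; then
      echo "tag mismatch: $ref (expected :$first)" >&2
      return 1
    fi
  done
  return 0
}

# Example with literal references (repository names are assumptions).
same_tag wazuh/wazuh-manager:4.14.5 wazuh/wazuh-indexer:4.14.5 \
         wazuh/wazuh-dashboard:4.14.5 && echo "tags match"
```

On the host, the references could come from `docker ps --format '{{.Image}}'` filtered to the Wazuh containers.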

Emergency disk relief (before resize)

When / is critical and you need headroom immediately:

# On infra-services (SSH as a user with sudo and Docker access)
docker image prune -f
docker builder prune -f   # if build cache exists
sudo journalctl --vacuum-size=200M
df -h /

If pruning Wazuh old tags: only after the 4.14.5 stack is verified healthy (you need one full image set per component).

Resize procedure (Proxmox VM 123)

Schedule a maintenance window. Order: disk first (you may be at 100% now), then RAM. Swap can land via ansible-pull before or after RAM.

1. Pre-checks

# From your workstation
ssh infra-services-cursor 'df -h /; free -h; docker ps --format "{{.Names}}" | wc -l'

On prox (192.168.6.71): confirm local-lvm (or the VM’s datastore) has enough thin free space for +50 GiB. See prox storage remediation.

Optional: snapshot VM 123 in Proxmox before resize.

2. Extend disk (Proxmox)

Use GiB, not TiB

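Proxmox Resize disk takes a size in GiB. This runbook assumes the field is the new total size (enter 80). Some Proxmox VE releases label the field "Size Increment (GiB)"; if yours does, enter 50 to go from 30 to 80 GiB. Either way, confirm the unit is GiB before applying. A typo of 34000 GiB (or selecting TiB) produced a ~34 TiB virtio disk in the 2026-06-17 incident; the guest partition grew to match while ext4 stayed at 29 GiB.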

  1. Proxmox UI → VM 123 → Hardware → Hard Disk (scsi0) → Resize disk.
  2. Set total size to 80 GiB (not 80 TiB, not 34000).
  3. Apply — online grow works for virtio-scsi on Ubuntu 24.04 in most cases.
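
When the resize is done from the prox shell instead of the UI, a range check on the target catches a 34000-style typo before it reaches `qm resize`. This is a sketch under assumptions: `check_target_gib` is a hypothetical helper and the 500 GiB ceiling is an arbitrary sanity limit for this VM, not a Proxmox setting.

```shell
#!/usr/bin/env bash
# Refuse disk targets outside a sane range before they reach Proxmox.
# check_target_gib and the default 500 GiB ceiling are assumptions.
check_target_gib() {
  local target=$1 current=$2 max=${3:-500}
  case $target in
    ''|*[!0-9]*) echo "target must be a whole number of GiB" >&2; return 1 ;;
  esac
  if [ "$target" -le "$current" ]; then
    echo "target ${target} GiB is not larger than current ${current} GiB" >&2
    return 1
  fi
  if [ "$target" -gt "$max" ]; then
    echo "target ${target} GiB exceeds the ${max} GiB ceiling" >&2
    return 1
  fi
  echo "${target}G"
}

# Example: prints 80G. A target of 34000 would be rejected.
check_target_gib 80 30
# On prox, the checked value could feed the absolute-size form:
#   qm resize 123 scsi0 "$(check_target_gib 80 30)"
```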

If / is full, free space before growpart (journal vacuum). Then grow the filesystem with an explicit size so a mis-sized virtio disk cannot expand ext4 to terabytes:

ssh infra-services-cursor 'sudo journalctl --vacuum-size=200M'   # if / is full
ssh infra-services-cursor 'lsblk /dev/sda'
ssh infra-services-cursor 'sudo growpart /dev/sda 1'             # if partition < disk
ssh infra-services-cursor 'sudo resize2fs /dev/sda1 75G'         # explicit target, not bare resize2fs
ssh infra-services-cursor 'df -h /'

Do not run bare resize2fs /dev/sda1 when lsblk shows a multi-terabyte partition — always pass 75G (or your target minus ~5 GiB for boot slices).

Adjust partition/device if lsblk shows something other than sda1 (unlikely on this VM).
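
The "target minus ~5 GiB" rule can be computed rather than typed by hand, which keeps the explicit `resize2fs` size tied to the intended disk size. A minimal sketch, where `fs_target_arg` is a hypothetical helper and the 5 GiB default reserve is taken from the guidance above:

```shell
#!/usr/bin/env bash
# Derive the explicit resize2fs size from the intended disk size.
# fs_target_arg is a hypothetical helper; the reserve defaults to 5 GiB.
fs_target_arg() {
  local disk_gib=$1 reserve_gib=${2:-5}
  echo "$(( disk_gib - reserve_gib ))G"
}

# Example: an 80 GiB disk gives an explicit 75G filesystem target.
fs_target_arg 80
# On the guest this could be used as:
#   sudo resize2fs /dev/sda1 "$(fs_target_arg 80)"
```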

3. Increase RAM (Proxmox)

  1. VM 123 → Hardware → Memory → set 16384 MiB (16 GiB).
  2. Apply. The new size takes effect immediately only if memory hotplug is enabled; otherwise (the usual case for QEMU VMs on Proxmox) it takes effect on the next boot, so reboot and verify:
ssh infra-services-cursor 'sudo reboot'
# wait, then:
ssh infra-services-cursor 'free -h'

4. Enable swap (Ansible)

Repo prep (already in tree when merged):

  • common_swapfile_enabled: true in infra/ansible/inventory/host_vars/infra-services.yml
  • common role task swapfile.yml

After git pull on the host, ansible-pull applies swap on next converge, or run manually:

ssh infra-services-cursor 'sudo systemctl start ansible-pull-apply.service'
ssh infra-services-cursor 'swapon --show; free -h'

One-time manual fallback (if Ansible has not run yet):

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
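
The `tee -a` line in the fallback appends a duplicate fstab entry if it is run twice. One way to make that step repeatable is to append only when the exact line is missing. In this sketch `ensure_line` is a hypothetical helper, and the example runs against a scratch file instead of /etc/fstab:

```shell
#!/usr/bin/env bash
# Append a line to a file only if that exact line is not already present.
# ensure_line is a hypothetical helper, not part of the Ansible role.
ensure_line() {
  local file=$1 line=$2
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example against a scratch file: two calls still leave a single entry.
tmp=$(mktemp)
ensure_line "$tmp" '/swapfile none swap sw 0 0'
ensure_line "$tmp" '/swapfile none swap sw 0 0'
cat "$tmp"
rm -f "$tmp"
```

Against the real /etc/fstab the function would need to run as root.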

5. Reload Prometheus rules

After the alert file is on the host (git pull under /var/lib/ansible-pull/homelab):

ssh infra-services-cursor 'curl -sf -X POST http://127.0.0.1:9090/-/reload'

Or restart the prometheus container from /opt/homelab/services/monitoring.

6. Post-resize verification

ssh infra-services-cursor 'free -h && df -h / && docker system df'
ssh infra-services-cursor 'docker stats --no-stream --format "{{.Name}} {{.MemUsage}}" | sort'

Pass criteria:

  • ≥10 GiB free on / at idle
  • ≥4 GiB MemAvailable under normal load (on 16 GiB VM)
  • Swap active (swapon --show)
  • Prometheus: no firing InfraServices* alerts after 15 minutes
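
The first three pass criteria are numeric and can be checked in one step. The sketch below is an assumption-laden illustration: `post_resize_ok` is a hypothetical helper, the example values are illustrative, and the Prometheus criterion still has to be checked separately.

```shell
#!/usr/bin/env bash
# Evaluate the numeric pass criteria: >=10 GiB free on /, >=4 GiB
# MemAvailable, and at least one active swap device.
# post_resize_ok is a hypothetical helper.
post_resize_ok() {
  local free_gib=$1 mem_gib=$2 swap_count=$3 rc=0
  [ "$free_gib" -ge 10 ]  || { echo "FAIL: / free ${free_gib} GiB < 10"; rc=1; }
  [ "$mem_gib" -ge 4 ]    || { echo "FAIL: MemAvailable ${mem_gib} GiB < 4"; rc=1; }
  [ "$swap_count" -ge 1 ] || { echo "FAIL: no active swap"; rc=1; }
  [ "$rc" -eq 0 ] && echo "PASS"
  return "$rc"
}

# Example with illustrative values; prints PASS.
post_resize_ok 62 9 1

# On the guest the inputs could be gathered like this (GNU df, util-linux):
#   free_gib=$(df -BG --output=avail / | tail -n 1 | tr -dc '0-9')
#   mem_gib=$(awk '/^MemAvailable:/ {print int($2 / 1048576)}' /proc/meminfo)
#   swap_count=$(swapon --show --noheadings | wc -l)
```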

7. Update inventory and docs

  1. Edit inventory/hosts/infra-services.yaml (ram_gb: 16, storage.size_gb: 80).
  2. Run generators and commit.
  3. Optional: python scripts/proxmox-scan.py to refresh compute-live JSON.
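
Before committing, it can help to confirm that the inventory value matches what Proxmox reports. The sketch below parses the `size=` field of a `qm config` disk line; the helper names are hypothetical and the sample line mirrors the `qm set` options used elsewhere on this page.

```shell
#!/usr/bin/env bash
# Extract the size=NNG value from a qm config disk line.
# scsi_size_gib and inventory_matches are hypothetical helpers.
scsi_size_gib() {
  printf '%s\n' "$1" | sed -n 's/.*size=\([0-9][0-9]*\)G.*/\1/p'
}

# Succeed if the inventory size_gb equals the size Proxmox reports.
inventory_matches() {
  local inventory_gib=$1 config_line=$2
  [ "$(scsi_size_gib "$config_line")" = "$inventory_gib" ]
}

# Example with a literal line; on prox use: qm config 123 | grep '^scsi0:'
line='scsi0: local-lvm:vm-123-disk-1,discard=on,ssd=1,size=100G'
inventory_matches 100 "$line" && echo "inventory matches Proxmox"
```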

Research updates (2026-06-17)

Live checks from infra-services-cursor plus frozen Proxmox storage audit prox-storage-2026-06-24.md. Use this section when scheduling the maintenance window — numbers drift daily while Wazuh and Docker layers grow.

Live guest state (2026-06-17)

| Signal | Value | Implication |
|---|---|---|
| / usage | 29 GiB / 29 GiB (~100%, ~33 MiB free) | Blocker for docker pull, swapfile, apt |
| Block device | sda 30 GiB → sda1 29 GiB ext4 on / | Standard Ubuntu 24.04 cloud image layout |
| Boot partitions | sda15 EFI 106M, sda16 /boot 913M | Leave untouched during grow |
| RAM | 7.7 GiB total, ~4.1 GiB MemAvailable, no swap | Memory OK today; OOM risk on spikes |
| Containers | 17 running (was 13 pre-Wazuh in guest scan) | +Wazuh×3, +authentik-outpost |
| cloud-guest-utils | installed (growpart available) | Guest grow after Proxmox disk resize is one command |
| QEMU guest agent | running | Clean shutdown from Proxmox UI works |
| docker image prune -f | 0 B reclaimed | All 17 images referenced by running containers |

Docker disk (docker system df):

| Layer | Size | Notes |
|---|---|---|
| Images | 13.32 GiB (17 active) | Wazuh trio ~7 GiB (indexer 2.49 + manager 2.44 + dashboard 2.08 GiB) |
| Container writable layers | 3.00 GiB | wazuh-manager alone ~3 GiB; unusually large, watch on upgrades |
| Named volumes | 3.07 GiB | Prometheus 1.5 GiB, AdGuard 559 MiB, Mongo 507 MiB, Loki 318 MiB, Wazuh queue 270 MiB, indexer data 3 MiB |
| Build cache | 0 | |

Top RSS (steady state today):

| Container | RSS |
|---|---|
| wazuh-indexer | 1.39 GiB |
| wazuh-manager | 515 MiB |
| ara | 287 MiB |
| wazuh-dashboard | 199 MiB |

Non-Docker disk hog: systemd journal /var/log/journal ~2.8–2.9 GiB — largest single reclaimable chunk without stopping stacks.

Proxmox side (VM 123, audit 2026-06-24)

| Item | Value |
|---|---|
| VMID / name | 123 / infra-services |
| Datastore | local-lvm (thin LVM on prox NVMe) |
| Provisioned disk | 30 GiB (vm-123-disk-1; ~65% LVM data at audit → guest now ~100%) |
| Provisioned RAM | 8192 MiB |
| Thin pool headroom | ~271 GiB free on 794 GiB pool (67.5% used at audit) |
| Hypervisor RAM | prox 60 GiB total; +8 GiB for VM 123 is trivial |

Verdict: Proxmox storage is not the blocker; the guest filesystem is full. Growing scsi0 by +50 GiB (30 → 80 GiB) consumes ~18% of the remaining ~271 GiB thin free space (~6% of the 794 GiB pool).
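
The headroom arithmetic can be reproduced from the audit figures (50 GiB growth, ~271 GiB thin free, 794 GiB pool). A small sketch, with `pct` as a hypothetical helper:

```shell
#!/usr/bin/env bash
# Print PART as a percentage of WHOLE with one decimal place.
# pct is a hypothetical helper.
pct() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

pct 50 271   # share of the remaining thin free space: 18.5
pct 50 794   # share of the whole thin pool: 6.3
```

Relative to free space the growth is roughly 18%, and relative to the whole pool roughly 6%; both leave ample headroom.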

Patch-controller SSH from infra-services → prox returned permission denied during this research pass — perform Proxmox changes from the Proxmox UI or a workstation with prox admin access, not via the patch-orchestrator key.

Proxmox resize — exact targets

Disk (UI): VM 123 → Hardware → Hard Disk (scsi0) on local-lvm → Resize → 80 GiB total (or +50G increment from 30).

Disk (CLI on prox):

qm resize 123 scsi0 +50G    # 30 → 80 GiB
# verify
qm config 123 | grep scsi

Guest grow (after Proxmox resize, online OK for virtio-scsi):

ssh infra-services-cursor 'lsblk /dev/sda'
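# Bare resize2fs is acceptable only when lsblk confirms sda is 80G; otherwise pass an explicit size such as 75G
ssh infra-services-cursor 'sudo growpart /dev/sda 1 && sudo resize2fs /dev/sda1 && df -h /'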

Expected: sda shows 80G, sda1 grows to ~79G, / reports ~50+ GiB free at current utilization.

RAM (UI): VM 123 → Hardware → Memory → 16384 MiB.

RAM (CLI on prox):

qm set 123 --memory 16384
qm reboot 123    # RAM change requires reboot on this VM type

Immediate relief before Proxmox (no stack stop)

If you need headroom today without a hypervisor change:

# ~2.5 GiB back from journal (safe on homelab; logs also in Loki)
ssh infra-services-cursor 'sudo journalctl --vacuum-size=200M && df -h /'

That may free enough space for a 4 GiB swapfile until the proper disk resize. Do not rely on journal vacuum long term — fix root size.

docker image prune still reclaims nothing meaningful while all 17 images are in use.

Monitoring

| Alert | Threshold | Firing today? |
|---|---|---|
| InfraServicesDiskCritical | <10% free on / | Yes (deployed via hotfix + in repo) |
| InfraServicesDiskPressure | <15% free | Yes |
| InfraServicesMemoryLow | MemAvailable <1.5 GiB | No (~4.1 GiB available) |
| Generic DiskSpaceLow (all hosts) | <20% free | Also fires for infra-services |

After resize + git pull, ansible-pull enables swap; reload Prometheus if alert file was only hotfixed:

ssh infra-services-cursor 'curl -sf -X POST http://127.0.0.1:9090/-/reload'

Suggested order of operations:

  1. Optional: journal vacuum if you need swap before Proxmox.
  2. Proxmox: disk 80 GiB → guest growpart + resize2fs.
  3. Proxmox: RAM 16 GiB → reboot VM.
  4. git pull on host → ansible-pull (swapfile) → verify swapon --show.
  5. Update inventory/hosts/infra-services.yaml + generators + commit.
  6. Post-check: df -h /, free -h, confirm InfraServices* alerts clear.

Open follow-ups (not blocking resize)

  • Wazuh indexer ILM/retention — indexer data still tiny; plan before it becomes the next disk surprise.
  • Wazuh agents on prox, infra-services, saltierpoop — budget +100–200 MiB manager RSS.
  • Refresh prox live scan after resize: python scripts/proxmox-scan.py (updates compute-live/ JSON).

Troubleshooting: resize failures (2026-06-17 incident)

Symptoms

mkdir: cannot create directory '/tmp/growpart.…': No space left on device
resize2fs: Block bitmap checksum does not match bitmap
lsblk shows sda / sda1 as 34T
df shows / still 29G

Root causes

  1. / at 100% → growpart needs /tmp; vacuum journal first.
  2. Proxmox disk set to ~34 TiB instead of 80 GiB — partition auto-span matched the bogus virtio size; bare resize2fs then tried to grow toward TiB and failed on a full filesystem.
  3. resize2fs abort — read-only e2fsck -n reported clean after journal vacuum; no offline fsck required in this case.

Recovery (worked live)

sudo journalctl --vacuum-size=200M          # freed ~2.5 GiB
sudo e2fsck -n /dev/sda1                    # verify clean before retry
sudo resize2fs /dev/sda1 75G              # explicit size — NOT bare resize2fs
df -h /                                     # expect ~73G size, ~48G free
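
A size check on the partition before any filesystem grow would flag a 34T-style partition up front. This is a sketch, not part of the recovery that was run: `part_is_sane` is a hypothetical helper and the 100 GiB limit is an assumed ceiling for this VM.

```shell
#!/usr/bin/env bash
# Succeed only if a partition size in bytes is at or below a GiB ceiling.
# part_is_sane is a hypothetical helper.
part_is_sane() {
  local bytes=$1 max_gib=$2
  [ "$bytes" -le $(( max_gib * 1024 * 1024 * 1024 )) ]
}

# Example: a 34 TiB partition fails a 100 GiB ceiling.
bogus=$(( 34 * 1024 * 1024 * 1024 * 1024 ))
part_is_sane "$bogus" 100 || echo "partition too large: use an explicit resize2fs size"

# On the guest the size could be read with:
#   bytes=$(lsblk -bdno SIZE /dev/sda1)
```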

Still required on Proxmox (shrinking)

qm disk resize cannot shrink LVM-thin volumes (shrinking disks is not supported). Guest partition must be smaller than the target before hypervisor shrink. Cloud-init growpart will undo a manual partition shrink on boot unless disabled (see below).
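
The ordering constraint for a shrink can be stated as a simple invariant: the filesystem must fit inside the partition, and the partition must end below the new LV size. A minimal sketch with a hypothetical helper `shrink_order_ok`, using sizes in whole GiB:

```shell
#!/usr/bin/env bash
# Succeed only if filesystem < partition end < target LV size (all in GiB).
# shrink_order_ok is a hypothetical helper.
shrink_order_ok() {
  local fs_gib=$1 part_end_gib=$2 lv_gib=$3
  [ "$fs_gib" -lt "$part_end_gib" ] && [ "$part_end_gib" -lt "$lv_gib" ]
}

# Example with this page's figures: 75G ext4, partition end at 85 GiB,
# LV target of 100 GiB.
shrink_order_ok 75 85 100 && echo "safe to shrink the LV"
```

If the check fails, shrink the filesystem or the partition first; cutting the LV below the partition end truncates data.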

Completed fix (2026-06-25, VM 123 → 100 GiB):

  1. Snapshot: qm snapshot 123 pre-disk-shrink --description "…"
  2. Guest (online): disable autogrow → shrink partition → shutdown
  3. Proxmox (VM stopped): lvresize -L 100G -f /dev/pve/vm-123-disk-1
  4. Fix GPT backup header: sgdisk -e /dev/pve/vm-123-disk-1 (required after lvresize or the VM will not get network)
  5. Reattach disk if needed: qm set 123 --scsi0 local-lvm:vm-123-disk-1,…,size=100G
  6. Boot guest: growpart /dev/sda 1 && resize2fs /dev/sda1
  7. Delete snapshot: qm delsnapshot 123 pre-disk-shrink

Disable cloud-init autogrow (Ansible: common_cloud_init_disable_autogrow: true):

# /etc/cloud/cloud.cfg.d/99-disable-autogrow.cfg
growpart:
  mode: off
resize_rootfs: false

Guest shrink (before hypervisor):

yes Yes | sudo parted ---pretend-input-tty /dev/sda unit GiB resizepart 1 85GiB
sudo partprobe /dev/sda
lsblk /dev/sda    # sda1 ~84G while sda may still show 34T until step 3–4
sudo shutdown -h now

Proxmox shrink (from prox shell, VM stopped):

lvresize -L 100G -f /dev/pve/vm-123-disk-1
sgdisk -e /dev/pve/vm-123-disk-1
qm set 123 --scsi0 local-lvm:vm-123-disk-1,discard=on,ssd=1,size=100G
qm start 123

Until steps 3–4 complete, lsblk may still show a bogus 34T sda while sda1 is already sane — fix the hypervisor LV, not just the partition.


Close-out (2026-06-25)

34 TiB typo disk fully corrected on VM 123:

| Check | Result |
|---|---|
| Proxmox scsi0 | 100 GiB (vm-123-disk-1) |
| Guest lsblk | sda 100G, sda1 99G |
| ext4 / | 96 GiB, ~62 GiB free |
| RAM | 16 GiB |
| Swap | 4 GiB /swapfile |
| Stacks | 17/17 containers healthy |
| Wazuh agents (prox, infra-services, saltierpoop) | Active |
| Snapshot pre-disk-shrink | removed (reclaimed thin-pool headroom) |
| cloud-init autogrow | disabled via Ansible |

Inventory storage.size_gb: 100 matches reality.