Phase 7R Network & Infrastructure Audit — Owner Questionnaire

Audit date: 2026-06-18 (infra-services SSH verified same day)

Auditor context: Takeover-style review; documentation may be stale. Live data pulled where API access exists; gaps are called out explicitly.

Live scan: UDM SE UniFi Network 10.5.43 (gateway uptime ~31 hours at capture).

Branch: docs/phase-7r-network-audit (working notes; raw JSON in gitignored .scratch/audit/).

Answer inline ([ ] / free text) or reply in chat keyed by question ID (e.g. Q4.2). Sections are ordered: access gaps → network/ZBF → Home Assistant → compute → DNS → security → housekeeping.

0. What we could and could not see

| Source | Status | Notes |
| --- | --- | --- |
| UniFi API (`labctl unifi dump`) | OK | 46 clients, 6 devices, 119 ZBF policies, 8 zones |
| Proxmox API (`labctl proxmox`) | OK | 6 QEMU + 16 LXC; configs for key guests |
| Synology DSM API (`labctl synology`) | OK | System info, network, packages, shares |
| Inventory + `network-scan.py` | OK | 31 unknown clients; 10 inventory IPs not in `stat/sta` |
| SSH `infra-services` (WSL) | OK | Batch SSH via `infra-services` alias; 10 containers observed |
| SSH `proxbox` (WSL, 1Password) | OK | Host alias `proxbox` → `192.168.6.71`; shell + `pvesh` verified 2026-06-18 |
| SSH `saltierpoop` | Not attempted | Same key gap expected |
| Home Assistant API | OK | Long-lived token in `.env`; pulled 2026-06-18 (see §0.3) |
| Tailscale from this workstation | Not attempted | No tailnet session in Cursor shell |

graph TB
    subgraph observed["Observed live"]
        UDM["UDM SE · ZBF"]
        PVE["Proxmox prox · 8 running guests"]
        DSM["Synology whrrr · 3 LAN IPs"]
        INFRA["infra-services · 6 compose stacks"]
        PROX["proxbox · 8 running guests"]
        HA["Home Assistant · 54 integrations"]
        INV["Inventory YAML + docs"]
    end
    subgraph blind["Could not observe"]
        SALT["saltierpoop containers"]
    end
    observed --> blind

0.1 `infra-services` — compose stacks (observed 2026-06-18)

SSH: wsl → ssh -o BatchMode=yes infra-services (owner fixed WSL access).

Running containers (10)

| Stack | Container | Image | Notes |
| --- | --- | --- | --- |
| traefik | traefik | traefik:v3.6 | `:80`, `:443`, `:8080` on host |
| homepage | homepage | gethomepage/homepage | `:3000` on host (healthy) |
| ara | ara | recordsansible/ara-api | `:8000` on host |
| komodo | komodo-core, komodo-periphery, komodo-mongo | moghtech/komodo-* + mongo:7.0 | internal ports only |
| monitoring | prometheus, grafana, loki, promtail, alertmanager | prom/grafana stack | prometheus `:9090`, loki `:3100` |

Present on disk but not running

| Stack | Path | Notes |
| --- | --- | --- |
| adguard | `/opt/homelab/services/adguard/` | `compose.yml`, `dns-rewrites.yaml`, `unbound.conf` synced; no `compose.env` / `.env`; `docker compose ps` empty |
| backup-client | — | No `/opt/homelab/services/backup-client/` directory on host |

Host DNS on `:53`

Only systemd-resolved on 127.0.0.53 / 127.0.0.54 — not AdGuard. PiHole at 192.168.6.80 remains the network DNS path via UDM WAN settings.

Ansible pull

ansible-pull-apply.timer and ansible-pull-check.timer are active (not the generic ansible-pull.timer). Host checkout on main at 2e073fd with local untracked/modified files (backups/restore-test.sh, services/monitoring/.env.sops.yaml, homepage config drift).

Docker networks

traefik (external), monitoring_default, komodo_default, plus default bridge/host.

0.2 `proxbox` — Proxmox guests (observed 2026-06-18)

SSH: wsl → ssh proxbox (owner SSH config alias; 1Password agent for auth).

Hostname on host: prox. Matches Proxmox API inventory.

Running (8)

| ID | Type | Name | RAM (max) |
| --- | --- | --- | --- |
| 100 | qemu | saltierpoop | 30720 MB |
| 123 | qemu | infra-services | 8192 MB |
| 200 | qemu | haos | 6144 MB |
| 104 | lxc | blocktopus | 2048 MB |
| 105 | lxc | k6-loadtest | 6144 MB |
| 111 | lxc | influxdb | 2048 MB |
| 116 | lxc | pulse | 1024 MB |
| 119 | lxc | harbor-registry | 4096 MB |
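As a quick sanity check on the table above, the maximum RAM of the eight running guests can be totalled. This is a minimal sketch that uses only the values listed; the variable names are illustrative, and because host RAM capacity is not stated in this audit, no headroom figure is derived.

```python
# Max RAM (MB) per running guest, copied from the table above.
running_guests_mb = {
    "saltierpoop": 30720,
    "infra-services": 8192,
    "haos": 6144,
    "blocktopus": 2048,
    "k6-loadtest": 6144,
    "influxdb": 2048,
    "pulse": 1024,
    "harbor-registry": 4096,
}

total_mb = sum(running_guests_mb.values())
total_gib = total_mb / 1024  # Proxmox memory settings are expressed in MiB

print(f"{len(running_guests_mb)} guests, {total_mb} MB max RAM ({total_gib:.1f} GiB)")
```

saltierpoop alone accounts for roughly half of the total, which is relevant to Q5.5.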

Stopped (14)

aiProject, unmanic, metrimon, dnsproject, graylog, sqlserver2022, mysql, nfs-monitoring, ollama, caddy, reactive-resume, octoprint, penpot, netboot.xyz

Note: 1Password SSH works interactively; unattended agents may need a dedicated key later (same pattern as infra-services-cursor).

0.3 Home Assistant — integration health (2026-06-18)

Source: REST API via .env token (HA_URL / HA_TOKEN). Raw summary in .scratch/ha-audit.json (gitignored).

| Metric | Value |
| --- | --- |
| HA version | 2026.6.3 |
| Config entries | 54 |
| Entities (live states) | ~800+ |
| Unavailable entities | 394 |
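The unavailable-entity count above can be reproduced from the Home Assistant REST API. The following is a minimal sketch, assuming `HA_URL` and `HA_TOKEN` are exported from the repo-root `.env`; the function names are illustrative and this is not necessarily how the audit tooling pulled the data.

```python
import json
import urllib.request


def fetch_states(base_url: str, token: str) -> list:
    """GET /api/states from Home Assistant using a long-lived access token."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/states",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def count_unavailable(states: list) -> int:
    """Count entities whose state is the literal string 'unavailable'."""
    return sum(1 for entity in states if entity.get("state") == "unavailable")


# Example usage (requires network access to the HA instance):
#   import os
#   states = fetch_states(os.environ["HA_URL"], os.environ["HA_TOKEN"])
#   print(len(states), "entities,", count_unavailable(states), "unavailable")
```

Entities in state `unknown` are deliberately not counted, so the result should be comparable to the 394 figure only if that figure was derived the same way.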

Integrations not healthy (20 of 54)

| Integration | State | Error / target | Likely cause (ZBF / VLAN) |
| --- | --- | --- | --- |
| smlight SLZB-06M | `setup_retry` | Connection failed | Servers → IoT blocked (coordinator at `192.168.7.132`) → Q3.1 |
| homekit_controller Aqara-Hub-M2-7E74 | `setup_in_progress` | — | Hub on GenPop `192.168.1.82`, not IoT → Q2.1 |
| homekit_controller Doorbell Repeater | `setup_retry` | timeout `192.168.7.107:43507` | Servers → IoT blocked → Q3.1 |
| mqtt | `loaded` | zigbee2mqtt entities exist | Bridge likely dead while SLZB unreachable |
| tplink ×3 (EP10 plugs) | `setup_retry` | timeout `192.168.1.248/107/39:9999` | Plugs on GenPop; Servers → GenPop blocked → new rule or move plugs to IoT/Appliances |
| octoprint | `setup_retry` | `192.168.6.222:5000` connect failed | LXC 120 stopped → Q5.1 |
| otbr OpenThread Border Router | `setup_retry` | Unable to connect | Thread/Matter; may need IoT reachability |
| unifiprotect UDM SE | `setup_error` | Authentication failed | Re-auth in HA UI (not ZBF) |
| synology_dsm ×2 | `not_loaded` | stale `192.168.1.88` / `.105` | Remove or fix IPs (NAS is `192.168.6.215`) |
| ipp / syncthru Samsung printer | `setup_retry` | IPP timeout | Printer on GenPop `.167` → VLAN + reachability |
| tuya_local feeder | `migration_error` | — | Integration migration (not ZBF) |
| tuya cloud | `not_loaded` | — | Disabled / unused? |
| upnp UDM | `not_loaded` | — | Optional discovery |
| apple_tv Bedroom | `setup_in_progress` | — | May need Personal → IoT → Q3.2 |
| srp_energy | `setup_retry` | — | Utility API (unrelated) |

flowchart LR
    HA["HA 192.168.6.227<br/>Servers VLAN"]
    SLZB["SLZB .132<br/>IoT"]
    Aqara["Aqara hub .82<br/>GenPop"]
    TPL["TP-Link .1.x<br/>GenPop"]

    HA x--x|blocked| SLZB
    HA x--x|blocked| Aqara
    HA x--x|blocked| TPL

Takeaway: A large share of HA pain maps directly to missing east-west firewall allows (Servers→IoT, Servers→GenPop) and wrong VLAN placement (Aqara, printer, TP-Link). Fixing Q3.1 + Q2.1 should move the needle before chasing individual integrations.
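The "likely cause" reasoning above boils down to checking whether each integration target sits in the same VLAN as Home Assistant. A rough sketch of that check follows. The subnet map is inferred from addresses quoted in this audit: only Personal (`192.168.3.0/24`) is stated explicitly, and the `/24` masks for the other VLANs are assumptions.

```python
import ipaddress
from typing import Optional

# Subnets inferred from addresses quoted in this audit. Only Personal
# (192.168.3.0/24) is stated explicitly; the other /24 masks are assumptions.
VLAN_SUBNETS = {
    "GenPop": ipaddress.ip_network("192.168.1.0/24"),
    "Personal": ipaddress.ip_network("192.168.3.0/24"),
    "Appliances": ipaddress.ip_network("192.168.5.0/24"),
    "Servers": ipaddress.ip_network("192.168.6.0/24"),
    "IoT": ipaddress.ip_network("192.168.7.0/24"),
    "Security": ipaddress.ip_network("192.168.8.0/24"),
}

HA_IP = "192.168.6.227"


def vlan_of(ip: str) -> Optional[str]:
    """Return the VLAN name whose subnet contains ip, or None if unmapped."""
    address = ipaddress.ip_address(ip)
    for name, subnet in VLAN_SUBNETS.items():
        if address in subnet:
            return name
    return None


def crosses_vlan(source_ip: str, target_ip: str) -> bool:
    """True when the two addresses map to different VLANs."""
    return vlan_of(source_ip) != vlan_of(target_ip)


# Integration targets taken from the table above.
targets = {
    "SLZB-06M": "192.168.7.132",
    "Doorbell Repeater": "192.168.7.107",
    "Aqara Hub M2": "192.168.1.82",
    "EP10 plug": "192.168.1.248",
    "OctoPrint": "192.168.6.222",
}

for name, ip in targets.items():
    scope = "cross-VLAN, needs a ZBF allow" if crosses_vlan(HA_IP, ip) else "same VLAN"
    print(f"{name:18} {ip:15} {str(vlan_of(ip)):10} {scope}")
```

OctoPrint is the only target on the Servers VLAN, which matches the table: its failure comes from the stopped LXC rather than from the firewall.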

1. Access — resolved

All audit SSH/API paths are working except saltierpoop (not attempted).

Q1.1 — SSH to `infra-services` ✅ Resolved

2026-06-18: Owner fixed WSL SSH to infra-services. Agent verified batch SSH and enumerated compose stacks — see §0.1 above.

Q1.2 — SSH to Proxmox (`proxbox`) ✅ Resolved

2026-06-18: Owner confirmed SSH config alias proxbox (1Password auth). Agent verified ssh proxbox → hostname prox, full guest list in §0.2.

Q1.3 — Home Assistant long-lived access token ✅ Resolved

2026-06-18: Token stored in local .env. API pull complete — see §0.3.

Token creation steps (reference)

1. Open `http://192.168.6.227:8123` → profile (username, bottom-left sidebar).
2. **Security → Long-Lived Access Tokens → Create Token** (name e.g. `homelab-audit-cursor`).
3. Add `HA_URL` + `HA_TOKEN` to repo-root `.env` (see `.env.example`).

2. Network topology & VLAN placement

Live client counts (UniFi `stat/sta`)

| VLAN / network | Clients | Notes |
| --- | --- | --- |
| Servers (4) | 14 | prox, infra-services, saltierpoop, HA, PiHole, NAS NICs, harbor, influx, pulse, k6, VMs on NAS |
| IoT (5) | 14 | SLZB coordinator now has IP `.132` (was missing in June scan) |
| Personal (2) | 7 | Includes Fiio R7 `.44` |
| Appliances (3) | 5 | Fellow Aiden on Appliances not IoT |
| Security (6) | 3 | Rack cam `.11` live (inventory says `.10`) |
| GenPop (1) | 3 | See below — design says guests-only |

Q2.1 — GenPop has non-guest devices

Answers (2026-06-18):

| IP | Device | Decision |
| --- | --- | --- |
| 192.168.1.167 | Samsung printer | Stay GenPop — print from Personal; GenPop → Servers already allowed |
| 192.168.1.82 | Aqara Hub M2 | Move to IoT WiFi |
| 192.168.1.218 | OnePlus 8 Pro | Personal VLAN 2 — on IsThisTheKrustyKrab (owner confirmed) |

Printer note: CaptainKangapoo at 192.168.3.17 is Personal VLAN 2 (192.168.3.0/24), not GenPop. Policy 3 targets that network.

2026-06-18: Allow rule exists (Action=Allow, 527k+ hits) but sits below Block inter-VLAN — must reorder above block. See troubleshooting.

[x] Printer stays GenPop; Personal must reach .167 for printing
[x] Aqara → IoT WiFi
[x] OnePlus back on Personal SSID — confirmed
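The reorder note above depends on policy order. The sketch below illustrates why an allow placed under a broad block has no effect, assuming policies are evaluated top-down with the first match winning. The `Policy` class, rule names, and zone labels are all illustrative and are not taken from the UniFi API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Policy:
    name: str
    source: str       # source zone/network, or "*" for any
    destination: str  # destination zone/network, or "*" for any
    action: str       # "ALLOW" or "BLOCK"


def first_match(policies: List[Policy], source: str, destination: str) -> Optional[Policy]:
    """Return the first policy matching the flow (assumed first-match-wins)."""
    for policy in policies:
        if policy.source in ("*", source) and policy.destination in ("*", destination):
            return policy
    return None


# Hypothetical rules standing in for the real ones.
block_inter_vlan = Policy("Block inter-VLAN", "*", "*", "BLOCK")
allow_printing = Policy("Personal to printer", "Personal", "GenPop printer", "ALLOW")

current_order = [block_inter_vlan, allow_printing]   # allow sits below the block
reordered = [allow_printing, block_inter_vlan]       # allow moved above the block

for label, order in (("current", current_order), ("reordered", reordered)):
    hit = first_match(order, "Personal", "GenPop printer")
    print(f"{label:10} -> {hit.action} via '{hit.name}'")
```

With the allow moved up, unrelated flows such as IoT → Servers still fall through to the block.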

Follow-up (owner):

[ ] Move Aqara Hub M2 to IoT WiFi (in progress)
[ ] Move TP-Link plugs when not in active use → README Owner TODO

Q2.2 — WiFi SSID leakage

Observed: Several Personal and IoT devices still associate via The LAN Before Time (GenPop SSID) per client metadata in earlier scans; today many show correct network names but GenPop still has 3 clients.

[ ] Should The LAN Before Time be disabled except when guests visit?
[ ] Should UniFi client isolation or minimum RSSI be used to force trusted devices onto EAP SSIDs?

Q2.3 — Security camera IP drift

| Device | Inventory | Live (2026-06-18) |
| --- | --- | --- |
| Rack cam (G5 Flex) | 192.168.8.10 | 192.168.8.11 |

[ ] Update inventory to .11, or re-reserve .10 on the camera?
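Drift like the rack cam case can be detected mechanically by comparing inventory addresses against live ones. This is a minimal sketch; the device keys and the dict-based input shape are hypothetical and do not reflect the actual inventory YAML schema.

```python
def find_ip_drift(inventory: dict, live: dict) -> dict:
    """Return {device: (inventory_ip, live_ip)} for devices whose IPs differ.

    Devices that are absent from the live data are skipped rather than
    reported, since "offline" is a different finding from "moved".
    """
    return {
        device: (expected_ip, live[device])
        for device, expected_ip in inventory.items()
        if device in live and live[device] != expected_ip
    }


# Sample data using addresses quoted in this audit; keys are illustrative.
inventory = {"rack-cam": "192.168.8.10", "nas-lan1": "192.168.6.215"}
live = {"rack-cam": "192.168.8.11", "nas-lan1": "192.168.6.215"}

for device, (expected, actual) in find_ip_drift(inventory, live).items():
    print(f"{device}: inventory {expected}, live {actual}")
```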

3. Zone firewall (ZBF) — intent vs reality

Live user-defined policies (unchanged since design)

| Policy | Effect |
| --- | --- |
| GenPop → Servers | ALLOW |
| Personal → Servers | ALLOW |
| Personal → Appliances | ALLOW |
| Management → Internal | ALLOW |
| Block inter-VLAN (Internal) | BLOCK new (except above) |
| IoT/Security → Gateway mgmt ports | BLOCK 22/80/443 |

Not present: Servers → IoT, Personal → IoT, GenPop → IoT. Nothing is allowed into IoT VLAN 5; the only allow into the wider IoT zone is Personal → Appliances.

graph LR
    HA["HA 192.168.6.227<br/>Servers VLAN"]
    ZB["SLZB .132<br/>IoT VLAN"]
    PH["Phone<br/>Personal VLAN"]
    HP["HomePod .124<br/>IoT VLAN"]

    HA -.->|"BLOCKED"| ZB
    PH -.->|"mDNS only"| HP

Q3.1 — Home Assistant → IoT (critical for your symptoms)

Answer (2026-06-18): B — Narrow allow: 192.168.6.227 → IoT zone only. → Policy 1 in remediation runbook

Observed: HAOS VM 200 on Servers VLAN (192.168.6.227). Zigbee coordinator (SLZB-06M) on IoT (192.168.7.132). No firewall allow for Servers → IoT.

HA evidence (§0.3): smlight SLZB-06M Connection failed; HomeKit Doorbell Repeater timeout to 192.168.7.107; 394 unavailable entities.

[x] B. Narrow allow: source 192.168.6.227 only → IoT zone (Appliances + IoT VLANs)

Q3.1b — Home Assistant → GenPop (TP-Link plugs)

Answer (2026-06-18): B — Move smart plugs to IoT or Appliances WiFi and re-pair. → Owner actions in remediation runbook

[x] B. Move smart plugs to IoT or Appliances WiFi and re-pair?

Q3.2 — Personal → IoT (AirPlay / HomeKit / phones)

Answer (2026-06-18): Yes — AirPlay must work. → Policy 2 in remediation runbook

[x] Add Personal → IoT (VLAN 5) allow

Q3.3 — Personal → IoT VLAN 5 (not just Appliances)

Answer (2026-06-18): Not intentional — the Appliances-only allow was an accidental side effect of zone layout, not a deliberate “block smart speakers” rule. Covered by Q3.2 (add Personal → IoT VLAN 5).

Plain English: You have two “smart device” WiFis. The firewall already let your phone talk to Appliances WiFi (VLAN 3) but blocked IoT WiFi (VLAN 5) where HomePod and Apple TV live. mDNS made speakers visible; streaming was blocked. → Full explanation: phase-7r-zbf-remediation.md

[x] Not intentional — allow Personal → IoT VLAN 5 (same as Q3.2)

Q3.4 — IoT → DNS on Servers (future AdGuard)

Answer (2026-06-18): Yes — allow IoT → 192.168.6.17:53 only when AdGuard is deployed (defer until DNS cutover). → Deferred policy in remediation runbook

[x] When cutting over DNS, OK to add IoT → 192.168.6.17:53 only

Q3.5 — Fellow Aiden on Appliances VLAN

Observed: Coffee brewer at 192.168.5.39 (Appliances). Older mapping expected IoT.

[ ] Correct VLAN: Appliances or IoT?

4. DNS & Phase 7 checklist ordering

Superseded observations (2026-06-19)

Answers in §4.1–4.2 reflect 2026-06-18 SSH before AdGuard was running. Current state: AdGuard authoritative on 192.168.6.17; UDM cutover complete; PiHole soak pending decom. See AdGuard service page and 2026-06-19 documentation validation.

Q4.1 — AdGuard deploy status

Observed (SSH 2026-06-18):

/opt/homelab/services/adguard/ exists with compose.yml, dns-rewrites.yaml, unbound.conf.
No compose.env or .env on host — stack never bootstrapped.
docker compose ps for adguard: empty (not running).
Host :53 is systemd-resolved only; no AdGuard listener on 192.168.6.17:53.
Homepage already binds host :3000 (AdGuard initial-setup docs also mention :3000).

Confirmed: AdGuard is not deployed. PiHole LXC 104 (blocktopus, .80) is still the live DNS filter; UDM WAN DNS points there.
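The listener observations above can be re-checked from any host that is allowed to reach the Servers VLAN. The following is a small sketch that probes TCP only. DNS is served mostly over UDP, so a successful TCP connect confirms that something is listening but does not prove that queries resolve. The function name is illustrative.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example usage against the addresses discussed above:
#   tcp_port_open("192.168.6.17", 53)   # planned AdGuard listener
#   tcp_port_open("192.168.6.80", 53)   # PiHole (blocktopus)
```

A `False` result from another VLAN may indicate a ZBF block rather than a missing listener, so the probe is most meaningful when run from the Servers VLAN itself.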

[ ] Proceed with AdGuard deploy per Phase 7 runbook after ZBF remediation (Q3.4)?
[ ] For first-time setup: use Traefik route only (adguard.infra.realemail.app) and avoid binding AdGuard admin to host :3000 (conflicts with homepage)?

Q4.2 — PiHole still WAN upstream

Observed: wan_dns1 = 192.168.6.80 (Blocktopus). Entire house depends on LXC 104 PiHole for resolution.

[ ] Is this intentional interim, or oversight since migration?
[ ] OK to proceed with AdGuard only after Q3.4 answered?

5. Compute: Proxmox guests

Running (live)

| ID | Name | Role (inferred) | IP (UniFi) |
| --- | --- | --- | --- |
| 100 | saltierpoop | Saltbox media VM | .243 |
| 123 | infra-services | Docker/Komodo stack | .17 |
| 200 | haos | Home Assistant | .227 |
| 104 | blocktopus | PiHole LXC | .80 |
| 105 | k6-loadtest | Load testing | .223 |
| 111 | influxdb | Metrics TSDB | .132 |
| 116 | pulse | Proxmox pulse? | .199 |
| 119 | harbor-registry | Container registry | .119 |

Stopped but still allocated (selected)

| ID | Name | Inventory IP | onboot | Lab audit disposition |
| --- | --- | --- | --- | --- |
| 120 | octoprint | .222 | 0 | KEEP? (Prusa on Appliances WiFi) |
| 114 | nfs-monitoring | .107 | 0 | MIGRATE per nfs-monitoring doc |
| 117 | caddy | .244 | — | DECOMMISSION? |
| 109 | graylog | .197 | — | DECOMMISSION |
| 102–122 | various | many `null` | mostly 0 | CONSOLIDATE/DECOM per lab-audit |

Q5.1 — OctoPrint LXC stopped

Observed: LXC 120 stopped, onboot=0. Prusa (192.168.5.59) on Appliances WiFi.

[ ] Should OctoPrint be running? If yes: start LXC 120 and confirm .222 lease?
[ ] Or printer controlled another way now?

Q5.2 — nfs-monitoring LXC stopped

Observed: LXC 114 stopped, onboot=0, inventory .107, not on network.

[ ] Still needed for NFS monitoring, or safe to decommission per consolidation plan?

Q5.3 — k6-loadtest always on

Observed: LXC 105 running, on network at .223, onboot=1.

[ ] Intentional 24/7, or should it be stopped when not testing?

Q5.4 — harbor-registry

Observed: LXC 119 running at .119; in inventory; not in June device map.

[ ] What uses Harbor today? Keep / decom?

Q5.5 — saltierpoop memory pressure

Observed: VM 100 allocated 30720 MB, using ~30 GB of 30 GB (nearly maxed). Inventory says ballooning: true; live QEMU config has balloon: 0.

[ ] Is performance acceptable? Plan to reduce RAM or enable ballooning?
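The ballooning mismatch above is a case of inventory-versus-live drift that is easy to check in code. In a Proxmox QEMU config, `balloon: 0` disables the balloon driver. The sketch below encodes that rule; the inventory key follows the `ballooning: true` wording, but the surrounding schema and the treatment of a missing `balloon` key are assumptions.

```python
def ballooning_enabled(qemu_config: dict) -> bool:
    """Interpret the Proxmox 'balloon' option: 0 disables the balloon driver.

    A missing key is treated as enabled here, which is an assumption about
    how the inventory defines "ballooning: true".
    """
    return int(qemu_config.get("balloon", 1)) != 0


# Live values for VM 100 as observed above; inventory schema is hypothetical.
live_config = {"memory": 30720, "balloon": 0}
inventory_entry = {"ballooning": True}

allocated_gib = live_config["memory"] / 1024  # 30720 MiB expressed in GiB
drift = inventory_entry["ballooning"] != ballooning_enabled(live_config)

print(f"allocated {allocated_gib:.0f} GiB, ballooning drift: {drift}")
```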

Q5.6 — Stopped guest backlog

Observed: 13+ stopped guests per lab-audit; still present on disk.

[ ] Confirm destroy list from lab-audit.md §2 or provide updated keep/migrate/decom for: graylog, caddy, metrimon, dnsproject, penpot, reactive-resume, netboot.xyz, unmanic, ollama, aiProject, mysql, sqlserver2022.

6. Synology (`whrrr`)

Q6.1 — NAS DNS

Observed (DSM API): NAS uses DNS 192.168.6.1 (gateway), not PiHole directly.

[ ] Is NAS DNS behavior intentional?

Q6.2 — Customer VMs (Ubuncap / Recordurbate)

Observed: 192.168.6.100 (Ubuncap), 192.168.6.98 (Recordurbate) on Servers VLAN. Inventory: customer-app, hosted on whrrr via VMM.

[ ] Both VMs still active and correct?
[ ] Any firewall rules needed between them and homelab services?

Q6.3 — Multi-homed NICs

Observed: LAN1 .215, LAN2 .214, LAN3 .216 on Servers VLAN; ovs_eth3/4 link-local only.

[ ] Still plan per-VLAN IPs on separate NICs (deferred in lab-audit), or keep all on Servers VLAN?

7. Security & exposure

Q7.1 — WAN port forwards (unchanged)

Observed: TCP/UDP 80, 8080, 443 → 192.168.6.243 (saltierpoop).

[ ] Still required for all three ports?
[ ] Traefik on infra-services vs saltierpoop ingress: are two ingress stacks still intended by design?

Q7.2 — SEC-001 SMB forward

Security register: SEC-001 still Open (public SMB to NAS).

[ ] Confirm removed on UDM, or still present? (API port_forwards did not show SMB.)

8. Inventory & documentation hygiene

Q8.1 — `network-scan.py` “unknown” clients

Observed: 31 clients not in inventory — mostly personal phones, IoT gadgets, appliances (expected). Inventory is not meant to list every consumer device.

[ ] Should we extend inventory with an iot-devices.yaml / endpoints.yaml for smart home gear, or keep inventory infra-only?

Q8.2 — Missing from UniFi `stat/sta` but in inventory

Includes: octoprint (stopped), nfs-monitoring (stopped), caddy, graylog, sqlserver2022, plus UniFi gear (often not in stat/sta as clients).

[ ] Any of these should be online but aren't?
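Both Q8.1 and Q8.2 come down to a set difference between inventory and live client addresses. The sketch below shows that comparison on a handful of addresses quoted elsewhere in this audit. It is a hypothetical illustration and not the implementation of `network-scan.py`, whose internals are not shown here.

```python
import ipaddress


def reconcile(inventory_ips, live_ips):
    """Split two IP collections into (unknown, missing) lists.

    unknown: seen live but absent from inventory (the Q8.1 case)
    missing: in inventory but not seen live (the Q8.2 case)
    """
    inventory, live = set(inventory_ips), set(live_ips)
    by_address = ipaddress.ip_address  # numeric rather than string ordering
    unknown = sorted(live - inventory, key=by_address)
    missing = sorted(inventory - live, key=by_address)
    return unknown, missing


# Illustrative sample only, not the full scan data.
inventory_ips = ["192.168.6.17", "192.168.6.222", "192.168.6.107"]
live_ips = ["192.168.6.17", "192.168.1.218"]

unknown, missing = reconcile(inventory_ips, live_ips)
print("unknown:", unknown)
print("missing:", missing)
```

Stopped guests such as octoprint and nfs-monitoring land in `missing` by design, so the list needs to be cross-checked against guest state before anything is treated as an outage.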

Q8.3 — `infra-services` host drift vs repo