Check-in — 2026-06-17
A diary-style snapshot of the homelab monorepo: where we started, what is running today, what changed in the last few weeks, and what still needs an owner decision or a Saturday afternoon.
TL;DR
The spine is real: inventory YAML, generators, ansible-pull, Komodo,
Traefik, monitoring, and docs all run on infra-services. Phase 7 network
hardening is mostly applied (ZBF, WiFi VLANs, DSM forwards removed). DNS
cutover to AdGuard is in progress: the stack is live on 192.168.6.17, but
the UDM WAN + DHCP changes and the PiHole soak must finish before PiHole
is decommissioned. Backup credentials are committed to git, but
backup-client is not yet deployed on the host, so no scheduled restic run
has been confirmed. Stretch goals (Proxmox consolidation, SIEM, InfluxDB
merge) are documented but untouched.
Where we are by phase
| Phase | Theme | Status | Notes |
|---|---|---|---|
| 0 | Bootstrap + hotfixes | Mostly done | Docs deploy works; SEC-001 SMB forward still open in security register; ntfy capacity alerts not wired |
| 0.5 | Lab audit | Done | lab-audit.md — every entity has a disposition |
| 1 | Inventory spine | Done | 22 hosts, 14 appliances, 2 customer-apps, networks.yaml; generators in CI |
| 2 | Secrets (SOPS/age) | Done | 1Password + CI age key; per-tool SSH patterns documented |
| 3 | Ansible + pull loop | Done | ansible-pull timer, ARA callback, Tailscale role hardened for check mode |
| 4 | First services + Komodo | Done | traefik, komodo, ara, homepage on infra-services; webhook deploy deferred |
| 5 | Observability | Done | Prometheus/Grafana/Loki/Alertmanager; Discord alerts; old LXCs destroyed |
| 6 | Backup & DR | In progress | B2 bucket exists; backup.sops.yaml committed; backup-client deploy + verify still on owner TODO |
| 7 | Network + ACLs + DNS | In progress | ZBF + WiFi VLANs applied; AdGuard cutover, Tailscale API secrets, DSM-over-Tailscale verify open |
| 8+ | Cost/capacity, syslog/SIEM | Not started | Proxmox consolidation doc exists; 13 stopped guests waiting on owner decisions |
| 9+ | Renovate, polish | Not started | — |
| 10 | Stretch (NetBox, Headscale, …) | Not started | — |
```mermaid
gantt
    title Rollout phases (approximate)
    dateFormat YYYY-MM-DD
    axisFormat %b
    section Done
    Phases 0–5 core :done, 2026-05-11, 2026-05-16
    Phase 6 backup creds :done, 2026-05-13, 2026-06-10
    section In flight
    Phase 6 backup deploy :active, 2026-05-13, 2026-07-01
    Phase 7 network ZBF :done, 2026-05-14, 2026-05-15
    Phase 7 DNS AdGuard :active, 2026-05-14, 2026-07-15
    Phase 7 Tailscale ACL sync :active, 2026-05-14, 2026-07-15
    section Later
    Proxmox consolidation :2026-07-01, 2026-12-31
    Centralized syslog/SIEM :2026-08-01, 2026-12-31
```
What is live right now
| Layer | Reality check |
|---|---|
| GitOps | main → Komodo poll + ansible-pull every ~30 min on managed hosts |
| Ingress | Traefik on infra-services (*.infra.realemail.app); saltierpoop has public 80/443/8080 forwards |
| Monitoring | Grafana/Prometheus at infra-services; dashboards in repo (incl. new Proxbox thermals) |
| Network segmentation | 8 VLANs; ZBF with IoT + Security custom zones; 8 user firewall policies |
| DNS (actual) | AdGuard live on 192.168.6.17; UDM cutover + PiHole soak in progress — adguard.md |
| Remote access | Tailscale on several nodes; ACL in infra/tailscale/acl.json; GitHub Action sync blocked on missing TS_API_KEY |
| Docs | mkdocs → Cloudflare Pages; merge main to publish |
What we did recently
Network truth (2026-06-03 session)
Pulled a read-only live scan from the UDM SE (UniFi Network 10.4.57) and wrote the first as-built network documentation — not aspirational inventory, but what is actually on the wire:
- network-live.md — physical topology, VLAN map, DNS path, 48 clients by VLAN, port-forwards
- firewall-live.md — ZBF zones, effective matrix, VLAN-to-VLAN “who can talk to whom”
- network-observations-2026-06-03.md — ranked anomalies (DNS dependency on PiHole is #1)
- Extended `scripts/labctl/unifi.py` with `zones`, `firewall-policies`, and `dump` (single login; avoids UDM rate-limiting)
Also: roborock-iot-connectivity.md runbook for IoT VLAN + ZBF + DNS edge cases.
Ansible / Tailscale (May–June commits)
- Split Tailscale auth keys (Ansible vs manual SOPS paths)
- Documented all tag types for Phase 7
- Hardened `tailscale` and `node_exporter` roles for `ansible --check`
- Synology `accept-routes` health notice documented
Monitoring
- `proxbox_thermals.json` Grafana dashboard (InfluxDB thermal bucket) added to repo
Phase 7R closeout (2026-06-18)
ZBF remediation applied: HA → IoT, printer allow (reordered). SLZB-06M settled in HA. HomePod AirPlay still broken — deferred. Fiio override documented. Next: Phase 7 — AdGuard deployed on infra-services (containers up); owner setup wizard + DNS cutover remain.
Known gaps and drift
These are documented with evidence — not guesses about what broke for you.
| Item | Why it matters | Where |
|---|---|---|
| HomePod AirPlay | Personal → IoT rule still Block or stream path not verified (2026-06-18) | phase-7r-zbf-remediation.md |
| HA integration failures | SLZB settled; Aqara/TP-Link pending WiFi moves | §0.3 |
| DNS still on PiHole | AdGuard live; finish UDM WAN + DHCP → .17, then decom LXC 104 | AdGuard · §0.1 |
| backup-client missing on host | No /opt/homelab/services/backup-client/ yet | Q8.3 |
| SEC-001 SMB forward | Public SMB to Synology still listed open in security register | security-register.md |
| README Owner TODO | Some rows may lag git (e.g. backup.sops.yaml is committed but table still shows open) | README.md |
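The SEC-001 row above comes down to whether a TCP port answers from a given vantage point. A minimal sketch of such a probe follows, using only the standard library. The WAN hostname in the usage note is a placeholder rather than something from this repo, and port 445 is assumed because it is the standard SMB port; a closed result from one vantage point does not prove the forward is gone.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Run from outside the LAN, `port_is_open("wan.example.invalid", 445)` (placeholder hostname) should come back `False` once the forward is removed. The same helper could check the DSM-over-Tailscale port from an off-LAN tailnet device.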
```mermaid
flowchart LR
    subgraph today["DNS today"]
        C[Clients] --> GW[UDM gateway]
        GW --> PH[PiHole 192.168.6.80]
    end
    subgraph target["DNS target"]
        C2[Clients] --> ADG[AdGuard 192.168.6.17]
        ADG --> UB[Unbound]
    end
    today -.->|"Phase 7 cutover"| target
```
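Before and after the cutover it can help to ask both resolvers the same question and compare the replies. The sketch below is one way to do that with the standard library only: it hand-encodes a single A-record query in DNS wire format and reads back the reply header. The two addresses come from the diagram; the function names and the 2-second timeout are illustrative assumptions, not part of `labctl`.

```python
import secrets
import socket
import struct

PIHOLE = "192.168.6.80"   # current resolver, per the diagram
ADGUARD = "192.168.6.17"  # cutover target, per the diagram


def build_query(name: str, query_id: int) -> bytes:
    """Encode a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (only RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=A (1) and QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def parse_header(response: bytes) -> dict:
    """Return the id, response code, and answer count from a DNS reply header."""
    query_id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return {"id": query_id, "rcode": flags & 0x000F, "answers": ancount}


def probe(server: str, name: str, timeout: float = 2.0) -> dict:
    """Send one UDP query to `server` on port 53 and summarize the reply."""
    query_id = secrets.randbelow(0x10000)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, query_id), (server, 53))
        data, _ = sock.recvfrom(512)
    summary = parse_header(data)
    summary["id_matches"] = summary["id"] == query_id
    return summary
```

Calling `probe(PIHOLE, "example.com")` and `probe(ADGUARD, "example.com")` from a host on the Servers VLAN should both return `rcode` 0 with at least one answer. A timeout surfaces as an `OSError`, which is itself a useful signal that a firewall zone is blocking port 53.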
What remains (owner-focused)
Grouped by “do this next” vs “schedule when bored.”
Do this next
- Verify DSM over Tailscale (`https://100.71.93.130:5001` off-LAN); see phase-7 §4
- Finish DNS cutover: UDM Internet → DNS → `192.168.6.17`, all VLAN DHCP → `.17`, verify from Servers VLAN (saltierpoop) → then decommission PiHole
- Run `ansible-pull` / deploy `backup-client` on infra-services and confirm a successful restic run to B2
- Close SEC-001: remove SMB port forward (hotfix runbook exists)
- Add `TS_API_KEY` + `TS_TAILNET` GitHub secrets so ACL GitOps syncs
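For the backup item in the list above, "green" can be made concrete by checking the age of the newest snapshot. The sketch below parses the JSON array printed by `restic snapshots --json` and relies only on each entry's `time` field. The 26-hour threshold assumes a daily schedule, which this check-in does not specify, and the function names are illustrative.

```python
import json
import re
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_restic_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp, normalizing the fraction to microseconds."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    # restic may emit 1 to 9 fractional digits; pad or trim to exactly 6 so
    # datetime.fromisoformat accepts the string on older Python versions too.
    value = re.sub(
        r"\.(\d+)",
        lambda m: "." + m.group(1)[:6].ljust(6, "0"),
        value,
        count=1,
    )
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: treat a timestamp without an offset as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def latest_snapshot_age(snapshots_json: str, now: datetime) -> Optional[timedelta]:
    """Return the age of the newest snapshot, or None if there are no snapshots."""
    snapshots = json.loads(snapshots_json)
    if not snapshots:
        return None
    newest = max(parse_restic_time(s["time"]) for s in snapshots)
    return now - newest


def is_green(
    snapshots_json: str,
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed daily schedule plus slack
) -> bool:
    """True when at least one snapshot exists and the newest is recent enough."""
    age = latest_snapshot_age(snapshots_json, now)
    return age is not None and age <= max_age
```

One way to use it is to capture the output of `restic snapshots --json` on infra-services and pass it to `is_green(output, datetime.now(timezone.utc))`. An empty repository counts as not green.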
Schedule when you have a block
- Reconcile inventory with network-live device table (nfs-monitoring, harbor-registry, IP/MAC fixes)
- Tailscale on all seven Phase 7 hosts (three-key model documented)
- InfluxDB consolidation (LXC 111 → monitoring stack)
- Proxmox consolidation: 13 stopped guests; owner decisions on keep/migrate/decom
- Centralized syslog + SIEM (Phase 8+)
- Cloudflare Access on docs site; Komodo webhook via tunnel
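The inventory reconciliation in the first item above is essentially a three-way diff keyed by MAC address. The sketch below shows the idea on plain dictionaries. The record shape (`name`, `ip`, `mac`) is an assumption for illustration, not the actual schema of the inventory YAML or of the `labctl unifi dump` output.

```python
def normalize_mac(mac: str) -> str:
    """Lower-case a MAC address and use ':' separators so both sources compare equal."""
    return mac.strip().lower().replace("-", ":")


def reconcile(inventory: list, live: list) -> dict:
    """Diff inventory records against live clients, matching on MAC address.

    Each record is assumed to be a dict with "name", "ip", and "mac" keys.
    """
    inv = {normalize_mac(h["mac"]): h for h in inventory}
    seen = {normalize_mac(c["mac"]): c for c in live}
    return {
        # On the wire but absent from inventory.
        "missing_from_inventory": sorted(
            seen[m]["name"] for m in seen.keys() - inv.keys()
        ),
        # In inventory but not observed in the live scan.
        "not_seen_on_network": sorted(
            inv[m]["name"] for m in inv.keys() - seen.keys()
        ),
        # Same device, different address: (name, inventory IP, live IP).
        "ip_mismatch": sorted(
            (inv[m]["name"], inv[m]["ip"], seen[m]["ip"])
            for m in inv.keys() & seen.keys()
            if inv[m]["ip"] != seen[m]["ip"]
        ),
    }
```

Matching on MAC rather than hostname is a deliberate choice here: hostnames drift between DHCP leases and inventory naming, while the MAC is what the gateway actually reports.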
Automation / agent-friendly follow-ups
- Run `network-scan.py` after inventory fixes; commit generator output
- Refresh device-vlan-mapping.md from a new `labctl unifi dump`
- Update README Owner TODO table to match committed `backup.sops.yaml` status
Doc map at this milestone
| Need | Read |
|---|---|
| Master plan | PLAN.md |
| Open owner tasks | README Owner TODO |
| Live network | network-live.md |
| Live firewall | firewall-live.md |
| Design / intent | network.md, firewall-policy.md |
| Phase 7 checklist | phase-7-owner-actions.md |
| Security findings | security-register.md |
| Connectivity triage | network-observations-2026-06-03.md |
Closing note
The project crossed from “bootstrap the repo” to “operate a documented lab.” The uncomfortable part, things that stopped talking, is most likely explained by the DNS chain still pointing at a host you plan to retire, plus normal fallout from VLAN migration and ZBF (documented in observations). The next check-in should happen after AdGuard cutover or when backup runs are green, whichever comes first.
Previous entry: (none — first journal check-in).