Check-in — 2026-06-17

A diary-style snapshot of the homelab monorepo: where we started, what is running today, what changed in the last few weeks, and what still needs an owner decision or a Saturday afternoon.


TL;DR

The spine is real: inventory YAML, generators, ansible-pull, Komodo, Traefik, monitoring, and docs all run on infra-services. Phase 7 network hardening is mostly applied (ZBF, WiFi VLANs, DSM forwards removed). The DNS cutover to AdGuard is in progress: the stack is live on 192.168.6.17, and what remains is the UDM WAN DNS and DHCP change plus the PiHole soak before decommissioning. Backup credentials are committed to git (backup.sops.yaml), but the scheduled restic job on the host may still need a clean converge. Stretch goals (Proxmox consolidation, SIEM, InfluxDB merge) are documented but untouched.


Where we are by phase

| Phase | Theme | Status | Notes |
|---|---|---|---|
| 0 | Bootstrap + hotfixes | Mostly done | Docs deploy works; SEC-001 SMB forward still open in security register; ntfy capacity alerts not wired |
| 0.5 | Lab audit | Done | lab-audit.md: every entity has a disposition |
| 1 | Inventory spine | Done | 22 hosts, 14 appliances, 2 customer-apps, networks.yaml; generators in CI |
| 2 | Secrets (SOPS/age) | Done | 1Password + CI age key; per-tool SSH patterns documented |
| 3 | Ansible + pull loop | Done | ansible-pull timer, ARA callback, Tailscale role hardened for check mode |
| 4 | First services + Komodo | Done | traefik, komodo, ara, homepage on infra-services; webhook deploy deferred |
| 5 | Observability | Done | Prometheus/Grafana/Loki/Alertmanager; Discord alerts; old LXCs destroyed |
| 6 | Backup & DR | In progress | B2 bucket exists; backup.sops.yaml committed; backup-client deploy + verify still on owner TODO |
| 7 | Network + ACLs + DNS | In progress | ZBF + WiFi VLANs applied; AdGuard cutover, Tailscale API secrets, DSM-over-Tailscale verify open |
| 8+ | Cost/capacity, syslog/SIEM | Not started | Proxmox consolidation doc exists; 13 stopped guests waiting on owner decisions |
| 9+ | Renovate, polish | Not started | |
| 10 | Stretch (NetBox, Headscale, …) | Not started | |

gantt
    title Rollout phases (approximate)
    dateFormat YYYY-MM-DD
    axisFormat %b

    section Done
    Phases 0–5 core           :done, 2026-05-11, 2026-05-16
    Phase 6 backup creds      :done, 2026-05-13, 2026-06-10

    section In flight
    Phase 6 backup deploy     :active, 2026-05-13, 2026-07-01
    Phase 7 network ZBF       :done, 2026-05-14, 2026-05-15
    Phase 7 DNS AdGuard         :active, 2026-05-14, 2026-07-15
    Phase 7 Tailscale ACL sync  :active, 2026-05-14, 2026-07-15

    section Later
    Proxmox consolidation     :2026-07-01, 2026-12-31
    Centralized syslog/SIEM     :2026-08-01, 2026-12-31

What is live right now

| Layer | Reality check |
|---|---|
| GitOps | main → Komodo poll + ansible-pull every ~30 min on managed hosts |
| Ingress | Traefik on infra-services (*.infra.realemail.app); saltierpoop has public 80/443/8080 forwards |
| Monitoring | Grafana/Prometheus at infra-services; dashboards in repo (incl. new Proxbox thermals) |
| Network segmentation | 8 VLANs; ZBF with IoT + Security custom zones; 8 user firewall policies |
| DNS (actual) | AdGuard live on 192.168.6.17, but clients still resolve through PiHole until the UDM cutover and soak finish (see adguard.md) |
| Remote access | Tailscale on several nodes; ACL in infra/tailscale/acl.json; GitHub Action sync blocked on missing TS_API_KEY |
| Docs | mkdocs → Cloudflare Pages; merge to main to publish |

What we did recently

Network truth (2026-06-03 session)

Pulled a read-only live scan from the UDM SE (UniFi Network 10.4.57) and wrote the first as-built network documentation — not aspirational inventory, but what is actually on the wire:

  • network-live.md — physical topology, VLAN map, DNS path, 48 clients by VLAN, port-forwards
  • firewall-live.md — ZBF zones, effective matrix, VLAN-to-VLAN “who can talk to whom”
  • network-observations-2026-06-03.md — ranked anomalies (DNS dependency on PiHole is #1)
  • Extended scripts/labctl/unifi.py with zones, firewall-policies, and dump (single login; avoids UDM rate-limiting)

Also: roborock-iot-connectivity.md runbook for IoT VLAN + ZBF + DNS edge cases.
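
The single-login note on the dump subcommand reflects a general pattern: authenticate once, then reuse that session for every resource instead of logging in per call. The sketch below illustrates only that pattern. It is not the code in scripts/labctl/unifi.py, and the class name and the login and fetch callables are hypothetical stand-ins for whatever the real client does.

```python
class SingleLoginClient:
    """Log in lazily, once, and reuse the session for every later request."""

    def __init__(self, login, fetch):
        # Both callables are hypothetical: login() returns a session object,
        # fetch(session, resource) returns the data for one resource.
        self._login = login
        self._fetch = fetch
        self._session = None
        self.login_count = 0

    def _ensure_session(self):
        # Only the first call reaches the login endpoint.
        if self._session is None:
            self._session = self._login()
            self.login_count += 1
        return self._session

    def get(self, resource):
        return self._fetch(self._ensure_session(), resource)

    def dump(self, resources):
        """Fetch several resources over the same authenticated session."""
        return {name: self.get(name) for name in resources}
```

Because every resource in a dump shares one session, the number of logins stays at one no matter how many resources are requested, which is what keeps the controller's login rate limit out of the picture.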

Ansible / Tailscale (May–June commits)

  • Split Tailscale auth keys (Ansible vs manual SOPS paths)
  • Documented all tag types for Phase 7
  • Hardened tailscale and node_exporter roles for ansible --check
  • Synology accept-routes health notice documented

Monitoring

  • proxbox_thermals.json Grafana dashboard (InfluxDB thermal bucket) added to repo

Phase 7R closeout (2026-06-18)

ZBF remediation applied: HA → IoT, printer allow (reordered). SLZB-06M settled in HA. HomePod AirPlay still broken — deferred. Fiio override documented. Next: Phase 7 — AdGuard deployed on infra-services (containers up); owner setup wizard + DNS cutover remain.


Known gaps and drift

Each item below is documented with evidence; none of them is a guess about what broke for you.

| Item | Why it matters | Where |
|---|---|---|
| HomePod AirPlay | Personal → IoT rule still Block or stream path not verified (2026-06-18) | phase-7r-zbf-remediation.md |
| HA integration failures | SLZB settled; Aqara/TP-Link pending WiFi moves | §0.3 |
| DNS still on PiHole | AdGuard live; finish UDM WAN + DHCP → .17, then decom LXC 104 | AdGuard · §0.1 |
| backup-client missing on host | No /opt/homelab/services/backup-client/ yet | Q8.3 |
| SEC-001 SMB forward | Public SMB to Synology still listed open in security register | security-register.md |
| README Owner TODO | Some rows may lag git (e.g. backup.sops.yaml is committed but table still shows open) | README.md |

flowchart LR
    subgraph today["DNS today"]
        C[Clients] --> GW[UDM gateway]
        GW --> PH[PiHole 192.168.6.80]
    end
    subgraph target["DNS target"]
        C2[Clients] --> ADG[AdGuard 192.168.6.17]
        ADG --> UB[Unbound]
    end
    today -.->|"Phase 7 cutover"| target
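
One way to watch the cutover is to send the same query to both resolvers in the diagram and compare the replies. The sketch below uses only the Python standard library and hand-builds a plain A-record query. The function names and any hostname passed in are illustrative, and it assumes both resolvers answer plain UDP on port 53.

```python
import socket
import struct

# Resolver addresses as shown in the DNS diagram.
RESOLVERS = {"pihole": "192.168.6.80", "adguard": "192.168.6.17"}


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query (RFC 1035) with recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # A zero byte ends the name, followed by QTYPE=A (1) and QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def parse_header(response: bytes) -> tuple:
    """Return (rcode, answer_count) from the 12-byte DNS response header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return flags & 0x000F, ancount


def ask(server: str, name: str, timeout: float = 2.0) -> tuple:
    """Send one UDP query to server:53 and summarize the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        response, _ = sock.recvfrom(4096)
    return parse_header(response)


def compare_resolvers(names):
    """Print one line per (resolver, name) so differences are easy to spot."""
    for name in names:
        for label, server in RESOLVERS.items():
            try:
                rcode, answers = ask(server, name)
                print(f"{label:8} {name}: rcode={rcode} answers={answers}")
            except (OSError, struct.error) as exc:
                print(f"{label:8} {name}: no usable reply ({exc})")
```

Calling compare_resolvers with a handful of names that matter on each VLAN, from a host on that VLAN, shows whether AdGuard answers wherever PiHole does. It compares only the response code and answer count, so blocklist differences between the two resolvers still need a manual look.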

What remains (owner-focused)

Grouped by “do this next” vs “schedule when bored.”

Do this next

  1. Verify DSM over Tailscale (https://100.71.93.130:5001 off-LAN) — phase-7 §4
  2. Finish DNS cutover: set UDM Internet → DNS to 192.168.6.17, point every VLAN's DHCP DNS at .17, verify from the Servers VLAN (saltierpoop), then decommission PiHole
  3. Run ansible-pull / deploy backup-client on infra-services and confirm a successful restic run to B2
  4. Close SEC-001 — remove SMB port forward (hotfix runbook exists)
  5. Add TS_API_KEY + TS_TAILNET GitHub secrets so ACL GitOps syncs
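
For step 3, a successful run can be confirmed by checking that the newest snapshot is recent. The sketch below assumes restic is on the PATH and that the repository location and credentials are already exported in the environment. It parses the output of restic snapshots --json and treats 26 hours as the freshness window. That threshold and the function names are assumptions, since the actual schedule is not stated here.

```python
import json
import re
import subprocess
from datetime import datetime, timedelta, timezone


def parse_restic_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp whose fraction may have 1 to 9 digits."""
    value = value.replace("Z", "+00:00")
    # datetime keeps microseconds, so normalize the fraction to 6 digits.
    value = re.sub(
        r"\.(\d+)", lambda m: "." + m.group(1)[:6].ljust(6, "0"), value, count=1
    )
    parsed = datetime.fromisoformat(value)
    # Treat a timestamp without an offset as UTC (an assumption).
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def latest_snapshot_age(snapshots, now):
    """Return the age of the newest snapshot, or None if there are none."""
    if not snapshots:
        return None
    newest = max(parse_restic_time(snap["time"]) for snap in snapshots)
    return now - newest


def backup_is_fresh(snapshots, now, max_age=timedelta(hours=26)):
    """True when at least one snapshot is younger than max_age."""
    age = latest_snapshot_age(snapshots, now)
    return age is not None and age <= max_age


def load_snapshots():
    """Run restic and return its snapshot list (empty if it reports none)."""
    result = subprocess.run(
        ["restic", "snapshots", "--json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout) or []
```

On the host, backup_is_fresh(load_snapshots(), datetime.now(timezone.utc)) returns a boolean that could feed an alert or a check-in note. Keeping the parsing separate from the subprocess call means the freshness logic can be exercised without a repository.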

Schedule when you have a block

  • Reconcile inventory with network-live device table (nfs-monitoring, harbor-registry, IP/MAC fixes)
  • Tailscale on all seven Phase 7 hosts (three-key model documented)
  • InfluxDB consolidation (LXC 111 → monitoring stack)
  • Proxmox consolidation: 13 stopped guests; owner decisions on keep/migrate/decom
  • Centralized syslog + SIEM (Phase 8+)
  • Cloudflare Access on docs site; Komodo webhook via tunnel

Automation / agent-friendly follow-ups

  • Run network-scan.py after inventory fixes; commit generator output
  • Refresh device-vlan-mapping.md from a new labctl unifi dump
  • Update README Owner TODO table to match committed backup.sops.yaml status
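
The inventory reconciliation mentioned above can be treated as a set comparison keyed on MAC address. A minimal sketch follows, assuming both sides can be exported as lists of records with name, mac, and ip fields. Those field names are hypothetical, not the schema of the inventory YAML or of the labctl unifi dump.

```python
def normalize_mac(mac: str) -> str:
    """Lower-case a MAC and use colons so AA-BB-.. and aa:bb:.. compare equal."""
    return mac.strip().lower().replace("-", ":")


def reconcile(inventory, live):
    """Compare inventory hosts with live clients.

    Both arguments are lists of dicts with 'name', 'mac', and 'ip' keys
    (hypothetical field names chosen for this sketch).
    """
    inv = {normalize_mac(host["mac"]): host for host in inventory}
    seen = {normalize_mac(client["mac"]): client for client in live}
    return {
        # On the wire but not declared in inventory.
        "missing_from_inventory": sorted(
            seen[mac]["name"] for mac in seen.keys() - inv.keys()
        ),
        # Declared in inventory but not observed in the scan.
        "not_seen_live": sorted(
            inv[mac]["name"] for mac in inv.keys() - seen.keys()
        ),
        # Same device, different address: (name, inventory IP, live IP).
        "ip_mismatch": sorted(
            (inv[mac]["name"], inv[mac]["ip"], seen[mac]["ip"])
            for mac in inv.keys() & seen.keys()
            if inv[mac]["ip"] != seen[mac]["ip"]
        ),
    }
```

Keying on MAC rather than hostname keeps renamed devices from showing up as both missing and new. Devices that randomize their MAC would need a different key.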

Doc map at this milestone

| Need | Read |
|---|---|
| Master plan | PLAN.md |
| Open owner tasks | README Owner TODO |
| Live network | network-live.md |
| Live firewall | firewall-live.md |
| Design / intent | network.md, firewall-policy.md |
| Phase 7 checklist | phase-7-owner-actions.md |
| Security findings | security-register.md |
| Connectivity triage | network-observations-2026-06-03.md |

Closing note

The project crossed from “bootstrap the repo” to “operate a documented lab.” The uncomfortable part, things that stopped talking, is most likely explained by the DNS chain still pointing at a host you plan to retire, plus normal fallout from VLAN migration and ZBF (documented in observations). The next check-in should happen after the AdGuard cutover or once backup runs are green, whichever comes first.

Previous entry: (none — first journal check-in).