Operations And Platform Assessment

Summary

The platform direction is appropriate: one infra-services Docker host for homelab-owned operations, Saltbox left as a managed appliance, Proxmox reduced to the hypervisor and selected guests, and service deployment through Komodo. The service estate is also reasonably small: Traefik, Komodo, ARA, Homepage, monitoring, AdGuard/Unbound, Authentik outpost, and Wazuh.

The main operational weaknesses are runtime exposure, incomplete backup/restore coverage, weak readiness controls, and gaps between repo-driven expectations and live state.

Service Topology

flowchart LR
  clients[LAN And Tailscale Clients] --> adguard[AdGuard DNS]
  adguard --> traefik[Traefik TLS Router]
  traefik --> authentikOutpost[Authentik Outpost]
  traefik --> komodo[Komodo Native OIDC]
  traefik --> ara[ARA]
  traefik --> grafana[Grafana]
  traefik --> homepage[Homepage]
  traefik --> wazuhDashboard[Wazuh Dashboard]
  monitoring[Prometheus Loki Alertmanager] --> grafana
  komodo --> composeStacks[Compose Stacks]

Runtime Validation Results

Read-only live checks were run against infra-services using the documented Cursor SSH alias. They showed:

| Check | Result |
| --- | --- |
| Container status | Most containers up; wazuh-dashboard restarting. |
| Listening ports | 8000, 8080, 8081, 3100, and 9090 listening on all interfaces. |
| Workstation TCP reachability | 8000, 8080, 8081, 3100, and 9090 reachable at 192.168.6.17. |
| Backup manifests | Live manifests exist for _template, AdGuard, ARA, Authentik outpost, Homepage, Komodo, monitoring, and Traefik; Wazuh is missing. |
| DNS rewrites | komodo.infra.realemail.app and adguard.infra.realemail.app resolve to 192.168.6.17 via AdGuard. |

The direct-port findings are therefore confirmed at runtime, not just inferred from the Compose files.
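A listener check of this kind can be repeated mechanically. The sketch below assumes the listener list comes from `ss -ltnH` on the host (the assessment does not say which tool was used) and that the sample text is illustrative rather than captured output. The function name `wildcard_listeners` is hypothetical.

```python
# Minimal sketch: find TCP ports listening on all interfaces in `ss -ltnH` output.
# Assumption: input lines look like
#   LISTEN 0 4096 0.0.0.0:9090 0.0.0.0:*
# The sample below is illustrative, not captured from infra-services.

WILDCARD_HOSTS = {"0.0.0.0", "[::]", "::", "*"}

SAMPLE = """\
LISTEN 0 4096   0.0.0.0:9090   0.0.0.0:*
LISTEN 0 4096      [::]:9090      [::]:*
LISTEN 0 4096   0.0.0.0:3100   0.0.0.0:*
LISTEN 0 128  127.0.0.1:5432   0.0.0.0:*
"""


def wildcard_listeners(ss_output: str) -> list[int]:
    """Return the sorted ports whose local address is a wildcard address."""
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue  # skip blank lines and anything that is not a listener row
        # The fourth column is "local-address:port"; split on the last colon
        # so bracketed IPv6 addresses such as [::] stay intact.
        host, _, port = fields[3].rpartition(":")
        if host in WILDCARD_HOSTS and port.isdigit():
            ports.add(int(port))
    return sorted(ports)


if __name__ == "__main__":
    print(wildcard_listeners(SAMPLE))
```

Loopback-only listeners are ignored, so the result lists only ports that a workstation on the LAN could reach.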

Service Exposure

The intended ingress model is clear: LAN DNS rewrites point *.infra.realemail.app to infra-services; Traefik terminates TLS; Authentik protects user-facing services, except where native OIDC is documented. The Compose files partly undermine this by also publishing several internal ports to the host.

| Service | Repo Evidence | Live Risk |
| --- | --- | --- |
| ARA | ara/compose.yml, lines 7-13 | Direct 8000 listener; permissive hosts and CORS. |
| Prometheus | monitoring/compose.yml, lines 7-17 | Direct 9090 listener and lifecycle enabled. |
| cAdvisor | monitoring/compose.yml, lines 94-112 | Direct 8081 listener; privileged container with host mounts. |
| Loki | monitoring/compose.yml, lines 114-123 | Direct 3100 listener. |
| Traefik dashboard/API | traefik/compose.yml, lines 9-12 | Direct 8080 listener in addition to protected router. |

Required Revision

Contractors should produce a port exposure matrix for every service:

| Port | Consumer | Required Source | Binding | Auth/ACL |
| --- | --- | --- | --- | --- |
| 8000 ARA | Managed hosts / browser | Hosts only or Traefik only | Loopback, management IP, or no host bind | ACL and explicit ARA hosts |
| 9090 Prometheus | Grafana / operators | Docker network / Traefik | Prefer no host bind | Authentik route only |
| 8081 cAdvisor | Prometheus | Docker network | Prefer no host bind | Prometheus only |
| 3100 Loki | Grafana / Promtail | Docker network | Prefer no host bind | Docker network only |
| 8080 Traefik API | Operators | Traefik route | Prefer no host bind | Authentik route only |
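The matrix can be seeded from the Compose files themselves. The following is a minimal sketch, assuming services use the Compose short port syntax (`[HOST_IP:][HOST_PORT:]CONTAINER_PORT[/PROTOCOL]`) and that the port lists have already been loaded from YAML. The function names and the example mapping are hypothetical and are not copied from the repo.

```python
# Minimal sketch: flag Compose port entries that publish beyond loopback.
# Assumption: entries use the short string syntax; long-syntax mappings
# (target/published/host_ip keys) would need separate handling.

LOOPBACK = {"127.0.0.1", "[::1]", "::1"}


def published_host_ip(port_spec: str) -> str:
    """Return the host IP of a short-syntax entry, or 0.0.0.0 if none is given."""
    spec = port_spec.split("/", 1)[0]   # drop an optional /tcp or /udp suffix
    parts = spec.rsplit(":", 2)         # split from the right so [::1] survives
    return parts[0] if len(parts) == 3 else "0.0.0.0"


def flag_exposed(services: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each service to its port entries that are not loopback-bound."""
    findings = {}
    for name, ports in services.items():
        exposed = [p for p in ports if published_host_ip(p) not in LOOPBACK]
        if exposed:
            findings[name] = exposed
    return findings


# Illustrative input only; the real entries live in the service compose.yml files.
EXAMPLE = {
    "ara": ["8000:8000"],
    "prometheus": ["127.0.0.1:9090:9090"],
    "loki": ["3100:3100/tcp"],
}

if __name__ == "__main__":
    for service, ports in flag_exposed(EXAMPLE).items():
        print(service, ports)
```

An entry bound to a management IP is still flagged, which is deliberate: anything that is not loopback-only should have a row in the matrix with a named consumer and ACL.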

Backup And Restore

The backup policy model is a strong start. backups/policies.yaml defines irreplaceable tier-1 services, external critical sources, tier-2 regeneratable services, and tier-3 media exclusions. See policies.yaml, lines 6-99.

The gaps are implementation completeness and restore confidence:

  • Wazuh has many persistent volumes in wazuh/compose.yml, lines 17-27 and 55-70, but no backup manifest exists.
  • The Traefik backup manifest records ACME data in the Docker volume, but the restore docs copy acme.json into the service directory, so the documented restore does not match what is backed up. See traefik/backup.yml, lines 4-10, and restore.md, lines 70-98.
  • AdGuard backup references /opt/homelab/services/adguard/unbound.conf, while Compose uses the default Unbound image config and the live check did not find that file. See adguard/backup.yml, lines 1-6, and adguard/compose.yml, lines 29-33.
  • External critical backup jobs in backups/policies.yaml need restore-test evidence, not just policy presence.

Required Revision

Add a backup confidence table to the runbooks and keep it current:

| Data | Tier | Snapshot Evidence | Restore Evidence | Status |
| --- | --- | --- | --- | --- |
| Komodo | 1 | Required | Required | Do not close without restore drill. |
| Traefik certs/config | 1 | Required | Required | Fix Docker volume restore docs. |
| Authentik DB | 1 external | Required | Required | Include in restore-test coverage. |
| HAOS config | 1 external | Required | Required | Include API fetch and restore validation. |
| Harbor registry | 1 external | Required | Required | Automate or mark owner-manual. |
| Wazuh | TBD | Missing | Missing | Assign tier and implement. |
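Keeping the table current is easier if it is held as data and checked. The sketch below is one possible shape, not the repo's format: the `BackupRow` fields and `open_items` are hypothetical names, and the `example-service` row is a placeholder. Only the Wazuh row reflects the state recorded above.

```python
# Minimal sketch: a backup confidence table as data, with a check for open items.
# Assumption: evidence is tracked as simple booleans; a real table would more
# likely link to snapshot IDs and restore-drill records.
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupRow:
    data: str
    tier: str                 # "1", "1 external", "2", ... or "TBD"
    snapshot_evidence: bool
    restore_evidence: bool


def open_items(rows: list[BackupRow]) -> list[str]:
    """Describe every row that lacks a tier, a snapshot, or a restore test."""
    problems = []
    for row in rows:
        missing = []
        if row.tier == "TBD":
            missing.append("tier")
        if not row.snapshot_evidence:
            missing.append("snapshot evidence")
        if not row.restore_evidence:
            missing.append("restore evidence")
        if missing:
            problems.append(f"{row.data}: missing {', '.join(missing)}")
    return problems


ROWS = [
    BackupRow("Wazuh", "TBD", snapshot_evidence=False, restore_evidence=False),
    BackupRow("example-service", "1", snapshot_evidence=True, restore_evidence=True),  # placeholder
]

if __name__ == "__main__":
    for item in open_items(ROWS):
        print(item)
```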

Observability

Prometheus, Grafana, Loki, Alertmanager, Promtail, cAdvisor, and node targets exist. The architecture is suitable for the environment. The concern is operational quality:

  • Prometheus is directly published and lifecycle is enabled.
  • Grafana grants Admin to every auth-proxy user.
  • Promtail stores positions in /tmp, which can duplicate or miss logs after restart. See promtail-config.yml, lines 5-9.
  • The Wazuh dashboard container is restarting on the live host, and Wazuh retention and disk controls are not encoded in the repo.

Better Alternative

Keep the current stack, but move toward declarative operational controls:

  • Make Prometheus and Loki reachable only through the Docker network and protected Traefik routes unless machine access is explicitly justified.
  • Alert on direct-port exposure and unexpected listeners on infra-services.
  • Persist Promtail positions in a named volume.
  • Add Wazuh index lifecycle management and disk alerts.
  • Add a dashboard/service health panel sourced from Docker healthchecks and Prometheus up targets.
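For the health panel, the Prometheus `up` metric already says which scrape targets are failing. The sketch below parses the JSON returned by an instant query for `up` on the Prometheus HTTP API (`/api/v1/query`). The job and instance labels in the sample are hypothetical, and fetching the response is left out so the example stays offline.

```python
# Minimal sketch: list scrape targets that are down from an instant-query response.
# Assumption: the response is the standard instant-vector JSON for the query `up`.
# The sample labels are placeholders, not the targets configured on infra-services.

SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "cadvisor", "instance": "cadvisor:8080"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "node", "instance": "node-a:9100"},
             "value": [1700000000.0, "0"]},
        ],
    },
}


def down_targets(query_response: dict) -> list[str]:
    """Return 'job/instance' for every sample whose value is 0."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for sample in query_response["data"]["result"]:
        _timestamp, value = sample["value"]   # the value arrives as a string
        if float(value) == 0.0:
            labels = sample["metric"]
            down.append(f'{labels.get("job", "?")}/{labels.get("instance", "?")}')
    return sorted(down)


if __name__ == "__main__":
    print(down_targets(SAMPLE_RESPONSE))
```

A target that has disappeared from the scrape configuration produces no `up` sample at all, so this check complements, rather than replaces, an expected-targets list.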

Healthchecks And Readiness

Some live containers report Docker health, but coverage is inconsistent and stateful dependencies still use startup ordering rather than readiness. Compose depends_on appears in AdGuard, Komodo, and Wazuh stacks, but readiness is not consistently enforced. See adguard/compose.yml, lines 7-8, komodo/compose.yml, lines 22-23 and 46-47, and wazuh/compose.yml, lines 36-37 and 85-87.

Required Revision

Add explicit healthchecks for:

  • Mongo and Komodo Core.
  • ARA API.
  • AdGuard and Unbound.
  • Prometheus, Alertmanager, Grafana, Loki, cAdvisor.
  • Wazuh indexer, manager, and dashboard.
  • Authentik outpost.

Then update the runbooks to distinguish "container running" from "service healthy".
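The difference between the two states is visible in the `State` object that `docker inspect` returns for a container. A minimal sketch, assuming the caller has already parsed that JSON. The function name `readiness` is hypothetical.

```python
# Minimal sketch: classify a container from the "State" object of `docker inspect`.
# Assumption: State.Status holds the lifecycle state and State.Health is only
# populated (with a Status of starting/healthy/unhealthy) when a healthcheck exists.


def readiness(state: dict) -> str:
    """Summarise whether a container is merely running or reports healthy."""
    status = state.get("Status")
    if status != "running":
        return f"not running ({status})"
    health = state.get("Health")
    if health is None or health.get("Status") in (None, "none"):
        return "running, no healthcheck"
    return f"running, {health['Status']}"


if __name__ == "__main__":
    print(readiness({"Status": "running", "Health": {"Status": "healthy"}}))
    print(readiness({"Status": "running"}))
    print(readiness({"Status": "restarting"}))
```

A runbook step that only checks for "running" would pass the second case, which is exactly the gap the healthchecks are meant to close.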

Deployment Flow

Komodo deployment is well aligned with the repo model. The GitHub Actions relay exists because infra hostnames are LAN-only and GitHub cloud webhooks cannot reach Komodo. See komodo-deploy.yml, lines 1-5 and 24-37.

One workflow exception is risky: monitoring/grafana/** is ignored, even though dashboards are file-provisioned. See komodo-deploy.yml, lines 10-16.

Required Revision

Remove the ignore or add a sync/reload job for dashboard-only changes. A dashboard PR should not require an unrelated service change to reach the host.
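The effect of the ignore rule can be reasoned about offline. The sketch below assumes the exclusion is a GitHub Actions `paths-ignore` filter, under which a push triggers the workflow only if at least one changed path is not ignored. `fnmatchcase` is used as an approximation of GitHub's glob matching, and the file names are hypothetical.

```python
# Minimal sketch: would a push trigger the deploy workflow under a paths-ignore rule?
# Assumptions: the rule is a paths-ignore entry, and fnmatch-style globbing is a
# close enough stand-in for GitHub's own pattern matching for this one pattern.
from fnmatch import fnmatchcase

IGNORED_PATTERNS = ["monitoring/grafana/**"]


def is_ignored(path: str, patterns: list[str] = IGNORED_PATTERNS) -> bool:
    """True if the path matches any ignore pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def workflow_would_run(changed_paths: list[str]) -> bool:
    """The workflow runs only if some changed path escapes the ignore list."""
    return any(not is_ignored(path) for path in changed_paths)


if __name__ == "__main__":
    dashboard_only = ["monitoring/grafana/dashboards/host.json"]   # hypothetical file
    mixed = dashboard_only + ["monitoring/compose.yml"]
    print(workflow_would_run(dashboard_only))
    print(workflow_would_run(mixed))
```

Under these assumptions a dashboard-only change never reaches Komodo, while the same change bundled with any other file does.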

Automation And CI

The generator model is valuable, but several checks should be hardened:

  • render-discovery-inventory.py --check should fail if generated files are missing and should not create directories in check mode.
  • Discovery inventory should not hardcode ansible_user: someone for managed targets when prox is root-only. See render-discovery-inventory.py, lines 71-80, and prox.yml, lines 1-4.
  • CI installs several tools without consistent pinning, while pre-commit pins some versions. Toolchain drift can break CI without repo changes.
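The check-mode behaviour described in the first point can be sketched as a read-only comparison. This is not the code in render-discovery-inventory.py: `check_generated` and its inputs are hypothetical, with rendered output passed in as a mapping of relative path to expected text.

```python
# Minimal sketch: a check mode that reports missing or stale generated files
# and never touches the filesystem (no mkdir, no writes).
# Assumption: the generator can render its outputs in memory as {relative_path: text}.
import sys
import tempfile
from pathlib import Path


def check_generated(expected: dict[str, str], root: Path) -> list[str]:
    """Compare rendered output with files under root; return a list of problems."""
    problems = []
    for rel, content in sorted(expected.items()):
        target = root / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")        # a missing file is a failure
        elif target.read_bytes() != content.encode("utf-8"):
            problems.append(f"stale: {rel}")          # byte comparison also catches CRLF drift
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        found = check_generated({"inventory/hosts.yml": "all: {}\n"}, Path(tmp))
        for line in found:
            print(line)
        sys.exit(1 if found else 0)                   # non-zero exit fails CI
```

Because the function only reads, running it from a clean clone leaves the working tree untouched whether or not the check passes.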

Better Alternative

Create a shared inventory/generator Python module with unit tests for:

  • Status filtering for retired/decommissioning entities.
  • Host connection var resolution.
  • Missing generated output behavior.
  • LF-only writes on Windows.
  • Schema validation of all generator-consumed nested keys.
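Two of these behaviours are small enough to sketch together with their tests. The entity shape (a `status` key on each record) and the helper names are assumptions for illustration. The real module would follow whatever schema the inventory already uses.

```python
# Minimal sketch: status filtering and LF-only writes for a shared generator module.
# Assumption: each inventory entity is a dict with an optional "status" key.
import tempfile
from pathlib import Path

EXCLUDED_STATUSES = {"retired", "decommissioning"}


def active_entities(entities: list[dict]) -> list[dict]:
    """Drop entities that are retired or being decommissioned."""
    return [e for e in entities if e.get("status") not in EXCLUDED_STATUSES]


def write_lf(path: Path, text: str) -> None:
    """Write text with LF line endings regardless of platform."""
    normalized = text.replace("\r\n", "\n")
    # newline="\n" disables the platform newline translation that would
    # otherwise turn "\n" into "\r\n" on Windows.
    with path.open("w", encoding="utf-8", newline="\n") as handle:
        handle.write(normalized)


if __name__ == "__main__":
    hosts = [
        {"name": "host-a", "status": "active"},       # hypothetical entities
        {"name": "host-b", "status": "retired"},
    ]
    assert [h["name"] for h in active_entities(hosts)] == ["host-a"]

    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "hosts.yml"
        write_lf(out, "all:\r\n  hosts: {}\r\n")
        assert out.read_bytes() == b"all:\n  hosts: {}\n"
    print("ok")
```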

Platform Acceptance Checklist

Before this platform is accepted from the contractors:

  • No unintended 0.0.0.0 listeners remain on infra-services.
  • Every published port has an owner, source allowlist, and documented consumer.
  • Wazuh dashboard is stable and backed up.
  • Tier-1 and external backups have restore evidence.
  • Grafana dashboard changes deploy through the GitOps path.
  • Service healthchecks cover all stateful and routed services.
  • Generated docs and generated inventory pass check mode from a clean clone.