Operations And Platform Assessment

Summary

The platform direction is appropriate: one infra-services Docker host for homelab-owned operations, Saltbox left as a managed appliance, Proxmox reduced to the hypervisor and selected guests, and service deployment through Komodo. The service estate is also reasonably small: Traefik, Komodo, ARA, Homepage, monitoring, AdGuard/Unbound, Authentik outpost, and Wazuh.

The main operational weaknesses are runtime exposure, incomplete backup/restore coverage, weak readiness controls, and gaps between repo-driven expectations and live state.

Service Topology

flowchart LR
  clients[LAN And Tailscale Clients] --> adguard[AdGuard DNS]
  adguard --> traefik[Traefik TLS Router]
  traefik --> authentikOutpost[Authentik Outpost]
  traefik --> komodo[Komodo Native OIDC]
  traefik --> ara[ARA]
  traefik --> grafana[Grafana]
  traefik --> homepage[Homepage]
  traefik --> wazuhDashboard[Wazuh Dashboard]
  monitoring[Prometheus Loki Alertmanager] --> grafana
  komodo --> composeStacks[Compose Stacks]

Runtime Validation Results

Read-only live checks were run against infra-services using the documented Cursor SSH alias. They showed:

Check	Result
Container status	Most containers up; `wazuh-dashboard` restarting.
Listening ports	`8000`, `8080`, `8081`, `3100`, and `9090` listening on all interfaces.
Workstation TCP reachability	`8000`, `8080`, `8081`, `3100`, and `9090` reachable at `192.168.6.17`.
Backup manifests	Live manifests exist for `_template`, AdGuard, ARA, Authentik outpost, Homepage, Komodo, monitoring, and Traefik; Wazuh is missing.
DNS rewrites	`komodo.infra.realemail.app` and `adguard.infra.realemail.app` resolve to `192.168.6.17` via AdGuard.

The direct-port findings are therefore confirmed at runtime, not just theoretical concerns raised by the Compose files.
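The listening-port check can be repeated mechanically. The following is a minimal sketch, assuming the input is the text output of `ss -ltnH` (TCP, listening, numeric, no header) captured from the host. The function name and the sample lines are illustrative and are not taken from the live checks.

```python
# Addresses that `ss` prints for sockets bound to every interface.
WILDCARD_HOSTS = {"0.0.0.0", "*", "[::]", "::"}


def wildcard_listeners(ss_output: str) -> list[int]:
    """Return the sorted ports that listen on all interfaces.

    Expects the text of `ss -ltnH`, where the fourth column is the
    local address written as HOST:PORT.
    """
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        # Split on the last colon so IPv6 hosts such as [::] stay intact.
        host, _, port = fields[3].rpartition(":")
        if host in WILDCARD_HOSTS and port.isdigit():
            ports.add(int(port))
    return sorted(ports)


# Illustrative sample only; real output comes from the host.
SAMPLE = """\
LISTEN 0 4096 0.0.0.0:9090 0.0.0.0:*
LISTEN 0 4096 127.0.0.1:5432 0.0.0.0:*
LISTEN 0 4096 [::]:3100 [::]:*
LISTEN 0 4096 *:8080 *:*
"""

if __name__ == "__main__":
    print(wildcard_listeners(SAMPLE))  # [3100, 8080, 9090]
```

Comparing the returned list against an allowlist of intentionally published ports gives a simple pass/fail signal for later re-checks.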

Service Exposure

The intended ingress model is clear: LAN DNS rewrites point `*.infra.realemail.app` to infra-services; Traefik terminates TLS; Authentik protects user-facing services, except where native OIDC is documented. The Compose files partly undermine this by also publishing several internal ports on the host, which lets clients reach those services without passing through Traefik or Authentik.

Service	Repo Evidence	Live Risk
ARA	`ara/compose.yml`, lines 7-13	Direct `8000` listener; permissive hosts and CORS.
Prometheus	`monitoring/compose.yml`, lines 7-17	Direct `9090` listener and lifecycle enabled.
cAdvisor	`monitoring/compose.yml`, lines 94-112	Direct `8081` listener; privileged container with host mounts.
Loki	`monitoring/compose.yml`, lines 114-123	Direct `3100` listener.
Traefik dashboard/API	`traefik/compose.yml`, lines 9-12	Direct `8080` listener in addition to protected router.

Required Revision

Contractors should produce a port exposure matrix for every service:

Port / Service	Consumer	Required Source	Binding	Auth/ACL
`8000` ARA	Managed hosts / browser	Hosts only or Traefik only	Loopback, management IP, or no host bind	ACL and explicit ARA hosts
`9090` Prometheus	Grafana / operators	Docker network / Traefik	Prefer no host bind	Authentik route only
`8081` cAdvisor	Prometheus	Docker network	Prefer no host bind	Prometheus only
`3100` Loki	Grafana / Promtail	Docker network	Prefer no host bind	Docker network only
`8080` Traefik API	Operators	Traefik route	Prefer no host bind	Authentik route only
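The Binding column can be pre-filled from the Compose files. The sketch below assumes the stacks use Compose short-syntax port entries (`CONTAINER`, `HOST:CONTAINER`, or `IP:HOST:CONTAINER`) and classifies each entry by where it binds on the host. The function name and the example specs are hypothetical, and bracketed IPv6 host addresses are not handled.

```python
import ipaddress


def host_binding(port_spec: str) -> str:
    """Classify a Compose short-syntax port entry by its host binding.

    Returns "all-interfaces", "loopback", or "specific-ip".
    """
    # Drop an optional protocol suffix such as "/tcp" or "/udp".
    spec = port_spec.split("/", 1)[0]
    parts = spec.split(":")
    if len(parts) < 3:
        # No host IP given: Docker publishes the port on every interface.
        return "all-interfaces"
    address = ipaddress.ip_address(parts[0])
    if address.is_unspecified:  # 0.0.0.0
        return "all-interfaces"
    if address.is_loopback:  # 127.0.0.0/8
        return "loopback"
    return "specific-ip"


def exposed_entries(ports_by_service: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only the entries that bind to all interfaces, per service."""
    exposed = {}
    for service, specs in ports_by_service.items():
        wide = [spec for spec in specs if host_binding(spec) == "all-interfaces"]
        if wide:
            exposed[service] = wide
    return exposed
```

For example, `host_binding("8000:8000")` yields `"all-interfaces"`, while `host_binding("127.0.0.1:8000:8000")` yields `"loopback"`. Entries that need no host consumer at all are better removed than rebound.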

Backup And Restore

The backup policy model is a strong start. backups/policies.yaml defines irreplaceable tier-1 services, external critical sources, tier-2 regeneratable services, and tier-3 media exclusions. See policies.yaml, lines 6-99.

The gaps are implementation completeness and restore confidence:

Wazuh has many persistent volumes in wazuh/compose.yml, lines 17-27 and 55-70, but no backup manifest exists.
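Traefik backup records ACME data in the Docker volume, but the restore docs copy acme.json into the service directory, so a restore that follows the docs would not put the certificates back in the volume they were backed up from. See traefik/backup.yml, lines 4-10, and restore.md, lines 70-98.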
AdGuard backup references /opt/homelab/services/adguard/unbound.conf, while Compose uses the default Unbound image config and the live check did not find that file. See adguard/backup.yml, lines 1-6, and adguard/compose.yml, lines 29-33.
External critical backup jobs in backups/policies.yaml need restore-test evidence, not just policy presence.

Required Revision

Add a backup confidence table to the runbooks and keep it current:

Data	Tier	Snapshot Evidence	Restore Evidence	Status
Komodo	1	Required	Required	Do not close without restore drill.
Traefik certs/config	1	Required	Required	Fix Docker volume restore docs.
Authentik DB	1 external	Required	Required	Include in restore-test coverage.
HAOS config	1 external	Required	Required	Include API fetch and restore validation.
Harbor registry	1 external	Required	Required	Automate or mark owner-manual.
Wazuh	TBD	Missing	Missing	Assign tier and implement.
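A missing manifest such as Wazuh's can be caught automatically. The following is a rough sketch, assuming each service lives in its own directory containing `compose.yml` and, when backed up, a sibling `backup.yml`, as the file references above suggest. The function names, the demo layout, and the placeholder file contents are hypothetical.

```python
import tempfile
from pathlib import Path


def services_without_backup(services_root) -> list[str]:
    """List service directories that have compose.yml but no backup.yml."""
    root = Path(services_root)
    missing = []
    for compose in sorted(root.glob("*/compose.yml")):
        if not (compose.parent / "backup.yml").is_file():
            missing.append(compose.parent.name)
    return missing


def demo() -> list[str]:
    """Build a throwaway layout with one covered and one uncovered service."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name, has_backup in [("traefik", True), ("wazuh", False)]:
            (root / name).mkdir()
            (root / name / "compose.yml").write_text("# placeholder\n")
            if has_backup:
                (root / name / "backup.yml").write_text("# placeholder\n")
        return services_without_backup(root)


if __name__ == "__main__":
    print(demo())  # ['wazuh']
```

A check like this only proves that a manifest exists. Snapshot and restore evidence still have to be recorded separately in the table.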

Observability

Prometheus, Grafana, Loki, Alertmanager, Promtail, cAdvisor, and node targets exist. The architecture is suitable for the environment. The concern is operational quality:

Prometheus is published directly on the host with the lifecycle API enabled, so any client that can reach port `9090` can trigger a reload or shutdown.
Grafana grants Admin to every auth-proxy user.
Promtail stores positions in /tmp, which can duplicate or miss logs after restart. See promtail-config.yml, lines 5-9.
The Wazuh dashboard is in a restarting state on the live host, and Wazuh retention and disk controls are not encoded in the repo.

Better Alternative

Keep the current stack, but move toward declarative operational controls:

Make Prometheus and Loki reachable only through the Docker network and protected Traefik routes, unless machine access is explicitly justified.
Alert on direct-port exposure and unexpected listeners on infra-services.
Persist Promtail positions in a named volume.
Add Wazuh index lifecycle management and disk alerts.
Add a dashboard/service health panel sourced from Docker healthchecks and Prometheus up targets.

Healthchecks And Readiness

Some live containers report Docker health, but coverage is inconsistent and stateful dependencies still use startup ordering rather than readiness. Compose depends_on appears in AdGuard, Komodo, and Wazuh stacks, but readiness is not consistently enforced. See adguard/compose.yml, lines 7-8, komodo/compose.yml, lines 22-23 and 46-47, and wazuh/compose.yml, lines 36-37 and 85-87.

Required Revision

Add explicit healthchecks for:

Mongo and Komodo Core.
ARA API.
AdGuard and Unbound.
Prometheus, Alertmanager, Grafana, Loki, cAdvisor.
Wazuh indexer, manager, and dashboard.
Authentik outpost.

Then update runbooks to distinguish "container running" from "service healthy".
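The difference between the two states is visible in `docker inspect` output, where `State.Status` reports the container lifecycle and `State.Health` is present only when a healthcheck is defined. A minimal sketch of a classifier a runbook script could use is shown here. The function name and the label strings other than Docker's own health values are illustrative.

```python
import json


def readiness(container: dict) -> str:
    """Classify one object from the JSON array printed by `docker inspect`."""
    state = container.get("State", {})
    if state.get("Status") != "running":
        # Covers "restarting", "exited", "created", and similar states.
        return "not-running"
    health = state.get("Health")
    if not health:
        # Running, but Docker has no healthcheck to report on.
        return "running-no-healthcheck"
    # Docker reports "starting", "healthy", or "unhealthy" here.
    return health.get("Status", "unknown")


# Illustrative input only; real data comes from `docker inspect <name>`.
SAMPLE = json.loads("""
[
  {"Name": "/a", "State": {"Status": "running", "Health": {"Status": "healthy"}}},
  {"Name": "/b", "State": {"Status": "running"}},
  {"Name": "/c", "State": {"Status": "restarting"}}
]
""")

if __name__ == "__main__":
    for item in SAMPLE:
        print(item["Name"], readiness(item))
```

Treating `running-no-healthcheck` as its own outcome keeps coverage gaps visible instead of counting them as healthy.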

Deployment Flow

Komodo deployment is well aligned with the repo model. The GitHub Actions relay exists because infra hostnames are LAN-only and GitHub cloud webhooks cannot reach Komodo. See komodo-deploy.yml, lines 1-5 and 24-37.

One workflow exception is risky: monitoring/grafana/** is ignored, even though dashboards are file-provisioned. See komodo-deploy.yml, lines 10-16.

Required Revision

Remove the ignore or add a sync/reload job for dashboard-only changes. A dashboard PR should not require an unrelated service change to reach the host.
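The effect of the ignore rule can be illustrated with a small model. The sketch assumes the workflow skips a run only when every changed file matches an ignored pattern, and it approximates the workflow's glob matching with Python's `fnmatchcase`, which is not identical to GitHub's matcher. The function name and the dashboard file path are hypothetical.

```python
from fnmatch import fnmatchcase


def triggers_deploy(changed_files: list[str], ignored_patterns: list[str]) -> bool:
    """True if at least one changed file falls outside every ignored pattern."""
    return any(
        not any(fnmatchcase(path, pattern) for pattern in ignored_patterns)
        for path in changed_files
    )


IGNORED = ["monitoring/grafana/**"]

# A dashboard-only change never reaches the host under this model.
dashboard_only = triggers_deploy(["monitoring/grafana/dashboards/node.json"], IGNORED)

# The same change deploys only when it rides along with another file.
with_compose = triggers_deploy(
    ["monitoring/grafana/dashboards/node.json", "monitoring/compose.yml"], IGNORED
)

if __name__ == "__main__":
    print(dashboard_only, with_compose)  # False True
```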

Automation And CI

The generator model is valuable, but several checks should be hardened:

render-discovery-inventory.py --check should fail if generated files are missing and should not create directories in check mode.
Discovery inventory should not hardcode ansible_user: someone for managed targets when prox is root-only. See render-discovery-inventory.py, lines 71-80, and prox.yml, lines 1-4.
CI installs several tools without consistent pinning, while pre-commit pins some versions. Toolchain drift can break CI without repo changes.

Better Alternative

Create a shared inventory/generator Python module with unit tests for:

Status filtering for retired/decommissioning entities.
Host connection var resolution.
Missing generated output behavior.
LF-only writes on Windows.
Schema validation of all generator-consumed nested keys.
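Two of these behaviours, missing generated output and LF-only writes, can be sketched as below. This is an illustration under assumptions, not the code in render-discovery-inventory.py: the function names, the `generated/hosts.yml` path, and the sample content are hypothetical.

```python
import tempfile
from pathlib import Path


def write_lf(path: Path, text: str) -> None:
    """Render mode: create parent directories and write with LF endings."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="\n" stops Python translating "\n" to "\r\n" on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)


def check_generated(path: Path, expected: str) -> list[str]:
    """Check mode: report problems without creating files or directories."""
    if not path.is_file():
        return [f"missing: {path}"]
    # Compare raw bytes so CRLF line endings count as a difference.
    if path.read_bytes().decode("utf-8") != expected:
        return [f"stale: {path}"]
    return []


def demo():
    """Run check, render, check again in a throwaway directory."""
    content = "all: {}\n"
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "generated" / "hosts.yml"
        before = check_generated(target, content)
        dir_created_by_check = target.parent.exists()
        write_lf(target, content)
        after = check_generated(target, content)
        raw = target.read_bytes()
    return before, dir_created_by_check, after, raw
```

A caller in check mode would exit non-zero whenever `check_generated` returns a non-empty list, which makes a missing file fail CI instead of passing silently.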

Platform Acceptance Checklist

Before this platform should be accepted from contractors:

No unintended 0.0.0.0 listeners remain on infra-services.
Every published port has an owner, source allowlist, and documented consumer.
Wazuh dashboard is stable and backed up.
Tier-1 and external backups have restore evidence.
Grafana dashboard changes deploy through the GitOps path.
Service healthchecks cover all stateful and routed services.
Generated docs and generated inventory pass check mode from a clean clone.