Operations And Platform Assessment
Summary
The platform direction is appropriate: one infra-services Docker host for
homelab-owned operations, Saltbox left as a managed appliance, Proxmox reduced
to the hypervisor and selected guests, and service deployment through Komodo.
The service estate is also reasonably small: Traefik, Komodo, ARA, Homepage,
monitoring, AdGuard/Unbound, Authentik outpost, and Wazuh.
The main operational weaknesses are runtime exposure, incomplete backup/restore coverage, weak readiness controls, and gaps between repo-driven expectations and live state.
Service Topology
    flowchart LR
        clients[LAN And Tailscale Clients] --> adguard[AdGuard DNS]
        adguard --> traefik[Traefik TLS Router]
        traefik --> authentikOutpost[Authentik Outpost]
        traefik --> komodo[Komodo Native OIDC]
        traefik --> ara[ARA]
        traefik --> grafana[Grafana]
        traefik --> homepage[Homepage]
        traefik --> wazuhDashboard[Wazuh Dashboard]
        monitoring[Prometheus Loki Alertmanager] --> grafana
        komodo --> composeStacks[Compose Stacks]
Runtime Validation Results
Read-only live checks were run against infra-services using the documented
Cursor SSH alias. They showed:
| Check | Result |
|---|---|
| Container status | Most containers up; wazuh-dashboard restarting. |
| Listening ports | 8000, 8080, 8081, 3100, and 9090 listening on all interfaces. |
| Workstation TCP reachability | 8000, 8080, 8081, 3100, and 9090 reachable at 192.168.6.17. |
| Backup manifests | Live manifests exist for _template, AdGuard, ARA, Authentik outpost, Homepage, Komodo, monitoring, and Traefik; Wazuh is missing. |
| DNS rewrites | komodo.infra.realemail.app and adguard.infra.realemail.app resolve to 192.168.6.17 via AdGuard. |
The direct port findings are therefore runtime-confirmed, not only theoretical Compose concerns.
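The listener check can be repeated after each remediation. The following is a minimal sketch, assuming the input is the text produced by `ss -ltnH` on the host. The function name, the `expected` allowlist, and the sample lines are illustrative assumptions, not output captured during this assessment.

```python
# Hosts that `ss` prints for "listening on every interface".
WILDCARD_HOSTS = {"0.0.0.0", "[::]", "*"}


def wildcard_listeners(ss_output: str, expected: frozenset = frozenset()) -> list:
    """Return sorted TCP ports bound to all interfaces and not in `expected`."""
    found = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Columns: State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        host, _, port = fields[3].rpartition(":")
        if host in WILDCARD_HOSTS and port.isdigit() and int(port) not in expected:
            found.add(int(port))
    return sorted(found)


# Hypothetical sample, shaped like `ss -ltnH` output.
SAMPLE_SS_OUTPUT = """\
LISTEN 0 4096 0.0.0.0:9090 0.0.0.0:*
LISTEN 0 4096 127.0.0.1:9000 0.0.0.0:*
LISTEN 0 4096 [::]:3100 [::]:*
LISTEN 0 4096 *:443 *:*
"""

# Port 443 is allowlisted here on the assumption that it is Traefik's intended entrypoint.
print(wildcard_listeners(SAMPLE_SS_OUTPUT, expected=frozenset({443})))  # prints [3100, 9090]
```

A non-empty result would mean an unexpected all-interface listener remains.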
Service Exposure
The intended ingress model is clear: LAN DNS rewrites point
*.infra.realemail.app to infra-services; Traefik terminates TLS; Authentik
protects user-facing services, except where native OIDC is documented. The
Compose files partly undermine this by also publishing several internal ports to
the host.
| Service | Repo Evidence | Live Risk |
|---|---|---|
| ARA | ara/compose.yml, lines 7-13 | Direct 8000 listener; permissive hosts and CORS. |
| Prometheus | monitoring/compose.yml, lines 7-17 | Direct 9090 listener and lifecycle enabled. |
| cAdvisor | monitoring/compose.yml, lines 94-112 | Direct 8081 listener; privileged container with host mounts. |
| Loki | monitoring/compose.yml, lines 114-123 | Direct 3100 listener. |
| Traefik dashboard/API | traefik/compose.yml, lines 9-12 | Direct 8080 listener in addition to protected router. |
Required Revision
Contractors should produce a port exposure matrix for every service:
| Port | Consumer | Required Source | Binding | Auth/ACL |
|---|---|---|---|---|
| 8000 ARA | Managed hosts / browser | Hosts only or Traefik only | Loopback, management IP, or no host bind | ACL and explicit ARA hosts |
| 9090 Prometheus | Grafana / operators | Docker network / Traefik | Prefer no host bind | Authentik route only |
| 8081 cAdvisor | Prometheus | Docker network | Prefer no host bind | Prometheus only |
| 3100 Loki | Grafana / Promtail | Docker network | Prefer no host bind | Docker network only |
| 8080 Traefik API | Operators | Traefik route | Prefer no host bind | Authentik route only |
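The Binding column can be filled in mechanically from the Compose `ports` entries. The sketch below is one possible approach, assuming the stacks use the Compose short string syntax (`[HOST_IP:]HOST_PORT:CONTAINER_PORT[/PROTOCOL]`); the long mapping syntax is not handled, and the function name and labels are hypothetical.

```python
import ipaddress


def classify_binding(port_spec: str) -> str:
    """Classify a Compose short-syntax port string by the host address it binds."""
    spec = port_spec.split("/", 1)[0]  # drop an optional "/tcp" or "/udp" suffix
    if spec.startswith("["):
        # Bracketed IPv6 host address, e.g. "[::1]:8081:8080"
        host_ip = spec[1:spec.index("]")]
    else:
        parts = spec.split(":")
        # Only the three-part form carries a host IP; otherwise Docker binds all interfaces.
        host_ip = parts[0] if len(parts) == 3 else ""
    if not host_ip:
        return "all-interfaces"
    address = ipaddress.ip_address(host_ip)
    if address.is_unspecified:
        return "all-interfaces"
    if address.is_loopback:
        return "loopback"
    return "specific-ip"


# Illustrative entries only; the real values live in each stack's compose.yml.
for entry in ("8000:8000", "127.0.0.1:9090:9090", "192.168.6.17:8080:8080"):
    print(entry, "->", classify_binding(entry))
```

Any entry classified as all-interfaces would then need either a documented consumer in the matrix or removal of the host bind.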
Backup And Restore
The backup policy model is a strong start. backups/policies.yaml defines
irreplaceable tier-1 services, external critical sources, tier-2 regeneratable
services, and tier-3 media exclusions. See policies.yaml, lines 6-99.
The gaps are implementation completeness and restore confidence:
- Wazuh has many persistent volumes in wazuh/compose.yml, lines 17-27 and 55-70, but no backup manifest exists.
- Traefik backup records ACME data in the Docker volume, but restore docs copy acme.json into the service directory. See traefik/backup.yml, lines 4-10, and restore.md, lines 70-98.
- AdGuard backup references /opt/homelab/services/adguard/unbound.conf, while Compose uses the default Unbound image config and the live check did not find that file. See adguard/backup.yml, lines 1-6, and adguard/compose.yml, lines 29-33.
- External critical backup jobs in backups/policies.yaml need restore-test evidence, not just policy presence.
Required Revision
Add a backup confidence table to the runbooks and keep it current:
| Data | Tier | Snapshot Evidence | Restore Evidence | Status |
|---|---|---|---|---|
| Komodo | 1 | Required | Required | Do not close without restore drill. |
| Traefik certs/config | 1 | Required | Required | Fix Docker volume restore docs. |
| Authentik DB | 1 external | Required | Required | Include in restore-test coverage. |
| HAOS config | 1 external | Required | Required | Include API fetch and restore validation. |
| Harbor registry | 1 external | Required | Required | Automate or mark owner-manual. |
| Wazuh | TBD | Missing | Missing | Assign tier and implement. |
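A missing manifest such as the Wazuh one can be caught automatically. The sketch below is a hedged example, assuming each service lives in its own directory containing compose.yml and, when backed up, backup.yml (as the traefik/backup.yml and adguard/backup.yml references suggest). The function name and the demo layout are hypothetical.

```python
import tempfile
from pathlib import Path


def services_missing_backup(services_root: Path) -> list:
    """Return names of service directories that have compose.yml but no backup.yml."""
    missing = []
    for service_dir in sorted(p for p in services_root.iterdir() if p.is_dir()):
        has_compose = (service_dir / "compose.yml").is_file()
        has_backup = (service_dir / "backup.yml").is_file()
        if has_compose and not has_backup:
            missing.append(service_dir.name)
    return missing


# Demo against a throwaway directory tree, not the real repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, files in {"traefik": ["compose.yml", "backup.yml"], "wazuh": ["compose.yml"]}.items():
        (root / name).mkdir()
        for filename in files:
            (root / name / filename).write_text("# placeholder\n", encoding="utf-8")
    print(services_missing_backup(root))  # prints ['wazuh']
```

Running a check of this kind in CI would only prove that a manifest exists; restore evidence still has to come from an actual restore drill.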
Observability
Prometheus, Grafana, Loki, Alertmanager, Promtail, cAdvisor, and node targets exist. The architecture is suitable for the environment. The concern is operational quality:
- Prometheus is directly published and lifecycle is enabled.
- Grafana grants Admin to every auth-proxy user.
- Promtail stores positions in /tmp, which can duplicate or miss logs after restart. See promtail-config.yml, lines 5-9.
- Wazuh dashboard is restarting live and Wazuh retention/disk controls are not encoded in repo.
Better Alternative
Keep the current stack, but move toward declarative operational controls:
- Make Prometheus and Loki reachable only through the Docker network and protected Traefik routes unless machine access is explicitly justified.
- Alert on direct-port exposure and unexpected listeners on infra-services.
- Persist Promtail positions in a named volume.
- Add Wazuh index lifecycle management and disk alerts.
- Add a dashboard/service health panel sourced from Docker healthchecks and Prometheus up targets.
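For the health panel, the Prometheus side can be derived from an instant query for `up`. The sketch below assumes the input is the decoded JSON body returned by the Prometheus HTTP API endpoint `/api/v1/query?query=up`. The function name and the job and instance labels in the sample are hypothetical.

```python
def down_targets(query_response: dict) -> list:
    """Return sorted (job, instance) pairs whose `up` sample is 0."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for sample in query_response["data"]["result"]:
        # Instant vectors carry [unix_timestamp, "value-as-string"].
        _timestamp, value = sample["value"]
        if float(value) == 0.0:
            metric = sample["metric"]
            down.append((metric.get("job", ""), metric.get("instance", "")))
    return sorted(down)


# Hypothetical response in the documented instant-vector shape.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "node", "instance": "infra-services:9100"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "cadvisor", "instance": "cadvisor:8080"},
             "value": [1700000000.0, "0"]},
        ],
    },
}

print(down_targets(SAMPLE_RESPONSE))  # prints [('cadvisor', 'cadvisor:8080')]
```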
Healthchecks And Readiness
Some live containers report Docker health, but coverage is inconsistent and
stateful dependencies still use startup ordering rather than readiness. Compose
depends_on appears in AdGuard, Komodo, and Wazuh stacks, but readiness is not
consistently enforced. See adguard/compose.yml, lines 7-8; komodo/compose.yml,
lines 22-23 and 46-47; and wazuh/compose.yml, lines 36-37 and 85-87.
Required Revision
Add explicit healthchecks for:
- Mongo and Komodo Core.
- ARA API.
- AdGuard and Unbound.
- Prometheus, Alertmanager, Grafana, Loki, cAdvisor.
- Wazuh indexer, manager, and dashboard.
- Authentik outpost.
Then update runbooks to distinguish container running from service healthy.
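The difference between running and healthy is visible in `docker inspect` output. The sketch below assumes each input is one decoded object from that JSON output, where `State.Status` is always present and `State.Health` exists only when a healthcheck is defined. The function name and the returned labels are illustrative assumptions.

```python
def service_state(inspect_entry: dict) -> str:
    """Summarise one `docker inspect` object as a runbook-friendly state string."""
    state = inspect_entry.get("State", {})
    status = state.get("Status", "unknown")
    if status != "running":
        # e.g. "restarting", "exited", "created"
        return status
    health = state.get("Health")
    if health is None:
        # Running, but Docker has no healthcheck to consult.
        return "running-no-healthcheck"
    # Docker reports "starting", "healthy", or "unhealthy".
    return "running-" + health.get("Status", "unknown")


# Hypothetical inspect fragments.
print(service_state({"State": {"Status": "restarting"}}))                           # prints restarting
print(service_state({"State": {"Status": "running"}}))                              # prints running-no-healthcheck
print(service_state({"State": {"Status": "running", "Health": {"Status": "healthy"}}}))  # prints running-healthy
```

Under this scheme a runbook would treat only running-healthy as ready, and running-no-healthcheck as a coverage gap rather than a pass.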
Deployment Flow
Komodo deployment is well aligned with the repo model. The GitHub Actions relay
exists because infra hostnames are LAN-only and GitHub cloud webhooks cannot
reach Komodo. See komodo-deploy.yml, lines 1-5 and 24-37.
One workflow exception is risky: monitoring/grafana/** is ignored, even though
dashboards are file-provisioned. See komodo-deploy.yml, lines 10-16.
Required Revision
Remove the ignore or add a sync/reload job for dashboard-only changes. A dashboard PR should not require an unrelated service change to reach the host.
Automation And CI
The generator model is valuable, but several checks should be hardened:
- render-discovery-inventory.py --check should fail if generated files are missing and should not create directories in check mode.
- Discovery inventory should not hardcode ansible_user: someone for managed targets when prox is root-only. See render-discovery-inventory.py, lines 71-80, and prox.yml, lines 1-4.
- CI installs several tools without consistent pinning, while pre-commit pins some versions. Toolchain drift can break CI without repo changes.
Better Alternative
Create a shared inventory/generator Python module with unit tests for:
- Status filtering for retired/decommissioning entities.
- Host connection var resolution.
- Missing generated output behavior.
- LF-only writes on Windows.
- Schema validation of all generator-consumed nested keys.
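Two of these behaviors, missing generated output and LF-only writes, can be captured in one small helper that is easy to unit test. This is a minimal sketch of what such a shared helper might look like, not the existing generator code; the name `write_or_check` and its signature are assumptions.

```python
from pathlib import Path


def write_or_check(path: Path, content: str, check: bool) -> bool:
    """Write `content` with LF line endings, or in check mode report whether it is current.

    Check mode never creates files or directories; a missing file counts as stale.
    """
    normalized = content.replace("\r\n", "\n")
    if check:
        if not path.is_file():
            return False
        return path.read_bytes() == normalized.encode("utf-8")
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="\n" disables newline translation, so Windows does not emit CRLF.
    with path.open("w", encoding="utf-8", newline="\n") as handle:
        handle.write(normalized)
    return True
```

A `--check` entry point could call this for every generated file and exit non-zero if any call returns False, which keeps check mode free of side effects on a clean clone.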
Platform Acceptance Checklist
Before this platform is accepted from contractors:
- No unintended 0.0.0.0 listeners remain on infra-services.
- Every published port has an owner, source allowlist, and documented consumer.
- Wazuh dashboard is stable and backed up.
- Tier-1 and external backups have restore evidence.
- Grafana dashboard changes deploy through the GitOps path.
- Service healthchecks cover all stateful and routed services.
- Generated docs and generated inventory pass check mode from a clean clone.