Contractor Assessment Report
Date: 2026-06-28
Repository: notarealemail/homelab
Assessment stance: senior engineering due diligence for owner acceptance of
contracted work.
Executive Summary
The project has a strong foundation: clear GitOps intent, inventory-driven generation, SOPS-encrypted secrets, Ansible convergence, Komodo-managed Compose stacks, Authentik-centered ingress, and a substantial runbook culture. The implementation is more mature than a typical homelab repo, but it is not yet at the level where the docs, automation, runtime posture, and recovery evidence can be treated as uniformly production-ready.
The highest-priority contractor revisions are operational security and recovery
hardening. Live validation confirmed that several internal observability and ARA
ports are bound on all interfaces and reachable from the operator workstation,
creating paths around the intended Traefik/Auth layer. Wazuh is deployed and
marked complete in project status, but the live dashboard is restarting and no
live backup.yml exists for its stateful volumes. Several secret-handling
scripts stage decrypted data in process arguments or predictable /tmp paths.
Pull-request CI also executes repository code on self-hosted runners, which is a
material trust-boundary issue.
The largest documentation risk is source-of-truth drift. PLAN.md still calls
itself the source of truth and contains early unchecked acceptance items, while
the README, security register, live network docs, and owner TODO table show many
of those items as complete. Network, firewall, host, and service docs also mix
historical snapshots with current operational truth. This makes the repo harder
to hand to contractors because multiple files can justify different actions.
Table Of Contents
- Findings register
- Architecture and documentation assessment
- Operations and platform assessment
- Security and risk assessment
- Appendix and evidence index
Top Revisions To Send Back
- Close direct service exposure on infra-services: remove unnecessary host publishes, bind machine-only APIs to loopback or a management IP, and add DOCKER-USER allowlists for ports that must remain reachable.
- Treat Wazuh as unfinished until dashboard stability, backup policy, restore procedure, retention, demo-user cleanup, and disk-watermark controls are in place.
- Move PR validation off persistent self-hosted homelab runners or isolate it with ephemeral, unprivileged runner pools that have no LAN reach.
- Split broad SOPS automation access into per-host or per-domain recipients, then rotate the current shared automation key.
- Fix secret-rendering scripts so decrypted material is passed through stdin or restrictive temp files and outputs are written atomically with mode 0600.
- Reconcile project source-of-truth docs: make PLAN.md historical or add a live status overlay, refresh DNS/firewall docs, regenerate host indexes, and expose all mature service docs in MkDocs.
- Harden CI and generators: make SOPS checks detect plaintext values, make generator --check modes fail on missing outputs, and pin CI tool versions.
- Add restore-test evidence for tier-1 and external backups, including Authentik DB, HAOS, Harbor, Komodo, Traefik, and any Wazuh tier assignment.
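As an illustration of the secret-rendering revision above, the following is a minimal Python sketch, not the repository's actual script. It assumes the decrypted material arrives as bytes (for example, read from stdin) and that the destination directory already exists. The function name write_secret_atomically and the temp-file prefix are hypothetical.

```python
import os
import tempfile


def write_secret_atomically(data: bytes, dest: str) -> None:
    """Write data to dest with mode 0600 via a temp file in the same directory."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # mkstemp creates the file readable and writable only by the current user,
    # under an unpredictable name, so nothing lands in a guessable /tmp path.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".secret-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Renaming within one filesystem is atomic on POSIX, so readers see
        # either the old file or the complete new one, never a partial write.
        os.replace(tmp_path, dest)
    except BaseException:
        # Do not leave decrypted material behind if anything failed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A caller would typically pass sys.stdin.buffer.read() as data, so the decrypted value is piped in rather than placed in process arguments where it can show up in process listings.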
Evidence Confidence
| Area | Confidence | Basis |
|---|---|---|
| Repo architecture and docs | High | Static read of planning docs, ADRs, inventory, generated docs, and runbooks. |
| Compose/service posture | High | Static read plus live docker ps and listening-port checks on infra-services. |
| Direct port exposure | High | Live ss -ltnp showed 0.0.0.0/[::] listeners; workstation TCP tests succeeded for 8000, 8080, 8081, 3100, and 9090. |
| Backup manifest gaps | High | Static repo check plus live /opt/homelab/services/*/backup.yml listing confirmed no Wazuh manifest. |
| GitHub settings and branch protection | Medium | Workflow files and gh workflow list reviewed; repository settings were not inspected. |
| UniFi firewall and VLAN state | Medium | Repo docs and prior scan outputs reviewed; no fresh UniFi API scan was run in this assessment. |
| Tailscale ACL runtime state | Medium | ACL file reviewed; live tailnet policy was not queried. |
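The listener evidence in the table can be reduced to a repeatable check. The sketch below assumes the usual ss -ltn column layout, with the local address and port in the fourth column. It returns the ports bound on every interface. The function name and the sample text are illustrative, not captured output from infra-services.

```python
def wildcard_listeners(ss_output: str) -> list[int]:
    """Return sorted ports that listen on every interface in `ss -ltn` output."""
    wildcard_hosts = {"0.0.0.0", "[::]", "*"}
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a listening socket.
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        # Split on the last colon so IPv6 hosts such as [::] stay intact.
        host, _, port = fields[3].rpartition(":")
        if host in wildcard_hosts and port.isdigit():
            ports.add(int(port))
    return sorted(ports)


# Illustrative sample only; the loopback listener is a made-up contrast case.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0      4096   0.0.0.0:9090       0.0.0.0:*
LISTEN 0      4096   [::]:9090          [::]:*
LISTEN 0      4096   127.0.0.1:5432     0.0.0.0:*
LISTEN 0      4096   0.0.0.0:3100       0.0.0.0:*
"""

print(wildcard_listeners(SAMPLE))  # [3100, 9090]
```

A remediation PR could attach this list before and after the change; loopback-bound listeners are deliberately excluded from the result.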
System Overview
flowchart LR
inventoryYaml[Inventory YAML] --> generators[Generators]
generators --> ansibleInventory[Ansible Inventory]
generators --> prometheusTargets[Prometheus Targets]
generators --> homepageConfig[Homepage Config]
gitRepo[GitHub Repo] --> ansiblePull[Ansible Pull]
gitRepo --> komodoRelay[GitHub Actions Relay]
komodoRelay --> komodo[Komodo]
komodo --> composeStacks[Compose Stacks]
composeStacks --> traefik[Traefik]
traefik --> authentikOutpost[Authentik Outpost]
composeStacks --> monitoring[Monitoring Stack]
composeStacks --> backups[Restic Backups]
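The generator stage in the flowchart writes derived files from the inventory, and the report recommends that generator --check modes fail when an output is missing. A minimal sketch of that behaviour follows. It assumes the generator can render its expected outputs in memory as a mapping of relative path to content. The function name and the "missing"/"stale" labels are hypothetical and are not taken from the repository's generators.

```python
from pathlib import Path


def check_generated(expected: dict[str, str], root: Path) -> list[str]:
    """Compare expected generator output with files on disk; return problems."""
    problems = []
    for rel_path, content in sorted(expected.items()):
        target = root / rel_path
        if not target.is_file():
            # A missing output is a failure, not something to skip silently.
            problems.append(f"missing: {rel_path}")
        elif target.read_text() != content:
            problems.append(f"stale: {rel_path}")
    return problems
```

A --check entry point would exit non-zero whenever the returned list is non-empty, so CI fails on an absent file as well as on a changed one.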
Risk Heatmap
| Domain | Current Risk | Main Driver |
|---|---|---|
| Runtime service exposure | High | Direct host-published observability and ARA ports are live and reachable. |
| CI and runner trust | High | PR workflows run repo code on self-hosted runners. |
| Secret blast radius | High | Broad age recipients and global host automation key. |
| Backup and DR confidence | Medium-High | Tier policies exist, but Wazuh and external restore tests are incomplete. |
| Documentation correctness | Medium-High | Multiple source-of-truth layers contradict each other. |
| Observability quality | Medium | Stack exists, but health/readiness, Wazuh, and log-position durability need work. |
| Network segmentation | Medium | Good design intent; stale docs and broad Tailscale ACLs weaken assurance. |
Recommended Contractor Acceptance Criteria
Contracted work on this phase should not be considered complete until every
High and Critical finding in findings.md has a merged remediation PR, an updated
runbook or ADR wherever operational truth changes, and a validation note showing
either live proof or a documented reason that validation is owner-blocked.
For runtime changes, require evidence from the host, not only repo diffs. For example, a port-hardening fix should include both Compose changes and a live listener/reachability check after Komodo deploy. A backup fix should include the manifest, timer status, a successful restic snapshot, and at least a scoped restore drill.
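The reachability half of that evidence could be captured with a small script run from the operator workstation after the Komodo deploy. This is a sketch under stated assumptions: the function name and the two-second timeout are illustrative, and it tests plain TCP connects only, not authentication behaviour.

```python
import socket


def reachable_ports(host: str, ports: list[int], timeout: float = 2.0) -> list[int]:
    """Return the subset of ports that accept a TCP connection from here."""
    open_ports = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            # Refused, filtered, or timed out: treat as not reachable.
            pass
    return open_ports
```

For the ports listed under Evidence Confidence, a call such as reachable_ports("infra-services", [8000, 8080, 8081, 3100, 9090]) should return an empty list once the hardening is effective, assuming that hostname resolves from the workstation and none of those ports is intentionally allowlisted for it.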