Security And Risk Assessment

Summary

The security model has good building blocks: SOPS-encrypted repo secrets, secret-scanning workflows, Authentik as the default user-facing access layer, Traefik with Cloudflare DNS-01, Tailscale for administrative paths, restic for encrypted backups, and documented VLAN segmentation.

The main risks stem from trust boundaries that are wider than necessary. Persistent self-hosted runners execute PR code, a broad SOPS automation key can decrypt too many domains, direct host ports bypass Authentik, Tailscale server ACLs permit broad lateral movement, and some scripts handle decrypted secret material carelessly.

Security Architecture

flowchart TD
  github[GitHub Repo] --> ci[GitHub Actions]
  ci --> selfHostedRunners[Self Hosted Runners]
  github --> sopsFiles[SOPS Files]
  sopsFiles --> hostAgeKeys[Host Age Keys]
  clients[Users] --> dns[AdGuard DNS]
  dns --> traefik[Traefik]
  traefik --> authentik[Authentik Outpost]
  authentik --> services[Infra Services]
  admins[Admins] --> tailscale[Tailscale]
  tailscale --> servers[Tagged Servers]

The architecture is defensible if each edge is least-privilege. Several edges are currently broader than they need to be.

Critical Trust Boundaries

GitHub Actions Runners

The repo has explicit permissions blocks in some workflows, which is positive, but the larger risk is where untrusted code runs. lint.yml runs PR jobs on self-hosted homelab runners. secrets-scan.yml, dependency security, and Semgrep follow the same broad pattern. See lint.yml, lines 3-15, and secrets-scan.yml, lines 7-27.

Risk: A malicious PR can run on hosts with local network reachability and potential access to caches, Docker, filesystems, or credentials outside the GitHub secret model.

Revision: PR checks should run on GitHub-hosted or ephemeral isolated runners. Persistent LAN runners should be used only for trusted push/manual operations. If that is not acceptable, require an explicit threat model and documented compensating controls.
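A check along these lines could flag workflows that still route PR jobs to persistent runners. This is a minimal sketch, not the repo's tooling: it assumes each workflow file has already been parsed into a dict (for example with a YAML library), and it only recognises the literal `self-hosted` label, so runners selected purely by custom labels or runner groups would need extra rules. The function names are hypothetical.

```python
SELF_HOSTED = "self-hosted"
PR_EVENTS = {"pull_request", "pull_request_target"}


def workflow_events(workflow: dict) -> set:
    """Return the event names that trigger a parsed workflow."""
    # YAML 1.1 parsers such as PyYAML read the bare key `on` as boolean True.
    trigger = workflow.get("on", workflow.get(True)) or {}
    if isinstance(trigger, str):
        return {trigger}
    return set(trigger)  # works for both the list and the mapping form


def runner_labels(job: dict) -> set:
    """Normalise a job's runs-on value to a set of labels."""
    runs_on = job.get("runs-on", [])
    if isinstance(runs_on, str):
        return {runs_on}
    if isinstance(runs_on, dict):  # {group: ..., labels: [...]} form
        labels = runs_on.get("labels", [])
        return {labels} if isinstance(labels, str) else set(labels)
    return set(runs_on)


def pr_jobs_on_self_hosted(workflow: dict) -> list:
    """List job ids that run PR-triggered code on a self-hosted runner."""
    if not workflow_events(workflow) & PR_EVENTS:
        return []
    return sorted(
        job_id
        for job_id, job in workflow.get("jobs", {}).items()
        if SELF_HOSTED in runner_labels(job)
    )
```

A non-empty result for any workflow would fail the check; trusted push or manual workflows pass because they have no PR trigger.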

SOPS And Host Decrypt Keys

.sops.yaml uses the same recipients for broad secret categories. Managed hosts store the automation age key at /etc/homelab/age-key.txt; docs state that docker-group users can read it. See .sops.yaml, lines 8-25, and secrets.md, lines 109-126.

Risk: Any host compromise that reaches the automation age key can become a repo-wide secret compromise.

Revision: Split SOPS recipients by service and host role. For example:

| Recipient Group | Example Files |
| --- | --- |
| CI validation only | Schema/test fixtures that need decrypt validation. |
| `infra-services` runtime | Traefik, Komodo, monitoring, AdGuard, Wazuh env files. |
| Saltbox OS layer | Saltbox accounts/settings SOPS files only. |
| Backup controller | Restic, B2, and backup-fetch credentials only. |
| Network automation | UniFi/Tailscale API secrets only. |

Rotate the current global automation key after the split.
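Once recipients are split, a small check can guard against the scopes drifting back together. The sketch below assumes `.sops.yaml` has been parsed into a dict and that rules use the flat `age` field (a comma-separated recipient string); rules that use `key_groups` are not handled. The function name is hypothetical.

```python
from collections import defaultdict


def shared_age_recipients(sops_config: dict) -> dict:
    """Map each age recipient used by more than one rule to those rules."""
    rules_by_recipient = defaultdict(list)
    for index, rule in enumerate(sops_config.get("creation_rules", [])):
        label = rule.get("path_regex", f"rule[{index}]")
        # In .sops.yaml the `age` field is a comma-separated recipient string.
        for recipient in (rule.get("age") or "").split(","):
            recipient = recipient.strip()
            if recipient:
                rules_by_recipient[recipient].append(label)
    return {
        recipient: labels
        for recipient, labels in rules_by_recipient.items()
        if len(labels) > 1
    }
```

A human operator key may legitimately appear in every rule, so in practice the result would be compared against a short allowlist of keys that are expected to be shared.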

Direct Published Ports

Direct host ports are both an operational and a security problem. Runtime checks confirmed that ARA (8000), Traefik (8080), cAdvisor (8081), Loki (3100), and Prometheus (9090) are reachable from the operator workstation. Compose evidence is cited in CA-003.

Risk: These ports are exposed outside the intended Authentik browser path. Prometheus has its lifecycle API enabled, which allows reload and shutdown requests over HTTP; ARA has permissive CORS and allowed-hosts settings; cAdvisor runs privileged and mounts host paths.

Revision: Remove, restrict, or firewall every direct publish. Add automated listener checks so regressions are caught in CI or post-deploy validation.
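An automated listener check can be as simple as attempting a TCP connection to each port that should be closed. The following is a minimal sketch using only the standard library; the function name is hypothetical, and the host to probe is left to the caller.

```python
import socket


def open_ports(host: str, ports: list, timeout: float = 1.0) -> list:
    """Return the subset of ports that accept a TCP connection on host."""
    reachable = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            pass  # refused, filtered, or timed out: treated as not reachable
    return reachable
```

Run from a vantage point outside the Docker host, a call such as `open_ports(host, [8000, 8080, 8081, 3100, 9090])` should return an empty list after remediation. A result from one network position says nothing about others, so the check is worth repeating from each VLAN that should be denied.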

Identity And Access

Authentik And Traefik

ADR-002 sets the correct default: all user-facing HTTPS should be protected by Authentik, with Plex as the known exception and Komodo using native OIDC. The service labels mostly follow this pattern. The issue is not the Traefik label model; it is bypass paths through direct host ports and broad downstream app roles.

Grafana

Grafana enables auth proxy auto-signup and assigns Admin to auto-created users. See monitoring/compose.yml, lines 69-78.

Risk: Any Authentik principal admitted by the Grafana provider can administer datasources, dashboards, and query interfaces.

Revision: Default to Viewer or Editor and map Authentik groups to Grafana roles. Verify Authentik provider scoping is ops-only.
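The intended mapping logic can be sketched as a pure function: unknown principals fall back to the least-privileged role, and only explicit group membership raises it. The group names below are hypothetical placeholders, and this illustrates the policy rather than Grafana's or Authentik's configuration syntax.

```python
ROLE_RANK = {"Viewer": 0, "Editor": 1, "Admin": 2}

# Hypothetical Authentik group names; replace with the real ones.
GROUP_TO_ROLE = {
    "grafana-admins": "Admin",
    "grafana-editors": "Editor",
}


def grafana_role(groups, mapping=GROUP_TO_ROLE, default="Viewer"):
    """Return the highest role granted by any group, or the default."""
    granted = [mapping[group] for group in groups if group in mapping]
    return max(granted, key=ROLE_RANK.__getitem__, default=default)
```

The key property is that a principal with no mapped group never receives more than the default role.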

Wazuh

Wazuh keeps demo users and basic-auth fallback paths. See internal_users.yml, lines 11-56. Dashboard proxy trust should also be narrowed to exact Traefik or outpost networks rather than broad Docker ranges.

Risk: Known demo users and broad proxy trust increase the blast radius if a container is compromised or proxy headers are spoofed.

Revision: Remove unused demo users, rotate hashes for retained break-glass accounts, document break-glass use, and narrow proxy trust.

Network And Remote Access

VLAN Firewall

The documented VLAN model is sensible, but firewall policy docs need a fresh validation pass. firewall-live.md is explicitly designated the source of truth when it differs from the design doc, but it is based on a dated scan, and later AirPlay updates were added manually. See firewall-live.md, lines 1-11 and 180-187.

Revision: Run a fresh read-only UniFi scan and update the live matrix before contractor closeout.

Tailscale

infra/tailscale/acl.json grants admin full reachability, allows tag:server to reach tag:server:*, and permits root SSH to tagged servers. See acl.json, lines 12-45.

Risk: Tailscale becomes a parallel flat management network. It can bypass VLAN segmentation and amplify a single tagged-server compromise.

Revision: Split tags by role, replace tag:server:* with explicit flows, and restrict root SSH to hosts that require it. Add ACL tests for denied paths, not only accepted paths.
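A lint step over the policy file could enforce both points before the ACL is pushed. This sketch assumes the policy has been loaded into a dict with the usual `acls` and `tests` sections (a file containing comments would need a HuJSON-aware parser rather than plain `json`); the function name and finding strings are hypothetical.

```python
def acl_findings(policy: dict) -> list:
    """Flag tag-to-self wildcard rules and a test suite without deny cases."""
    findings = []
    for index, rule in enumerate(policy.get("acls", [])):
        sources = rule.get("src", [])
        for dst in rule.get("dst", []):
            # Destinations look like "tag:server:*" or "tag:db:5432".
            host, _, ports = dst.rpartition(":")
            if host.startswith("tag:") and host in sources and ports == "*":
                findings.append(f"acls[{index}]: {host} reaches itself on all ports")
    if not any(test.get("deny") for test in policy.get("tests", [])):
        findings.append("tests: no deny assertions defined")
    return findings
```

An empty result only shows that the two patterns are absent; the deny entries in `tests` still have to name the specific lateral paths that must stay closed.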

Secret Handling

Komodo Env Rendering

render-komodo-compose-env.sh passes decrypted JSON to Python via argv, where it is visible in the process list, and writes the output before all validation is complete. See render-komodo-compose-env.sh, lines 29-56.

Revision: Decrypt to stdin or a restrictive mktemp file, validate all secrets first, write to a 0600 temp output, then atomically replace compose.env.
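The validate-then-replace sequence might look like the sketch below. It is an illustration under assumptions, not the script's actual logic: the function names are hypothetical, the secrets are assumed to arrive as a flat mapping (for example from `json.load(sys.stdin)`), and POSIX file semantics are assumed.

```python
import os
import tempfile


def render_env(secrets: dict, required: list) -> str:
    """Validate every required key first, then render KEY=value lines."""
    missing = [key for key in required if not secrets.get(key)]
    if missing:
        # Report key names only, never values.
        raise ValueError(f"missing secrets: {', '.join(missing)}")
    return "".join(f"{key}={secrets[key]}\n" for key in required)


def write_secret_atomically(path: str, content: str) -> None:
    """Write content to path with mode 0600, replacing any old file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    prefix = "." + os.path.basename(path) + "."
    # mkstemp creates the file readable and writable only by the owner (0600).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=prefix)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Because validation raises before anything is written, a failed render leaves the previous compose.env untouched, and the temp file lives in the target directory so the final rename never crosses filesystems.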

Backup SOPS Update

update-backup-sops.sh writes decrypted backup secrets and SSH private key material to the fixed path /tmp/backup-dec.yaml. See update-backup-sops.sh, lines 9-24.

Revision: Use mktemp, umask 077, cleanup traps, and avoid storing private keys on disk unless unavoidable.
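The same pattern (unpredictable name, owner-only mode, guaranteed cleanup) can be expressed as a context manager. This is a sketch of the idea rather than a drop-in replacement for the shell script; `private_workfile` is a hypothetical name.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def private_workfile(suffix: str = ".yaml"):
    """Yield the path of an owner-only temp file and always delete it afterwards."""
    old_umask = os.umask(0o077)  # files created meanwhile are owner-only too
    fd, path = tempfile.mkstemp(suffix=suffix)  # unpredictable name, mode 0600
    os.close(fd)
    try:
        yield path
    finally:
        os.umask(old_umask)
        with contextlib.suppress(FileNotFoundError):
            os.unlink(path)
```

The `finally` clause covers normal exit and exceptions. It does not run if the process is killed outright (SIGKILL, or SIGTERM without a handler), so a signal handler or a tmpfs-backed directory is still worth considering for key material.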

SOPS Check

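check-sops-encryption.py only checks for the presence of SOPS metadata, so a file that carries a sops block but still contains plaintext values would pass. See check-sops-encryption.py, lines 13-29.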

Revision: Require all non-metadata scalar values to be encrypted or validate with a SOPS-native status command.
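A stricter value-level check could walk the parsed file and report every scalar that does not carry the `ENC[` prefix SOPS puts on encrypted values. The sketch below assumes the file has already been parsed into dicts and lists; it deliberately ignores SOPS options that leave selected keys in plaintext (such as an unencrypted suffix or an encrypted-regex rule), so files that rely on those would need an exemption list. The function name is hypothetical.

```python
def plaintext_paths(node, path=""):
    """Yield dotted paths of scalar values that do not look SOPS-encrypted."""
    if isinstance(node, dict):
        for key, value in node.items():
            if path == "" and key == "sops":
                continue  # the top-level SOPS metadata block is stored in the clear
            yield from plaintext_paths(value, f"{path}.{key}" if path else str(key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from plaintext_paths(value, f"{path}[{index}]")
    elif not (isinstance(node, str) and node.startswith("ENC[")):
        yield path
```

Any path yielded for a file that is supposed to be fully encrypted would fail the check, and the output names the offending key without printing its value.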

Public Documentation Exposure

The public docs site is an accepted owner decision, but it materially changes the threat model. The docs include internal IPs, VLANs, hostnames, runner labels, SSH aliases, service names, DR steps, and security-control descriptions. cloudflare-pages.md records that Cloudflare Access was declined. See cloudflare-pages.md, lines 64-70.

Risk: The public site reduces attacker reconnaissance cost.

Revision: Either put the whole docs site behind Cloudflare Access or split public and private docs. If public docs remain accepted, add an accepted-risk entry with a review date and a redaction policy.
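If public docs remain, part of a redaction policy can be automated. As one hedged example, the sketch below scans text for internal IPv4 addresses using the standard `ipaddress` module. It also treats the 100.64.0.0/10 shared address space as internal, because Tailscale assigns node addresses from that range and Python's `is_private` does not cover it. Hostnames, SSH aliases, and runner labels would need separate rules; the function name is hypothetical.

```python
import ipaddress
import re

IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
CGNAT = ipaddress.ip_network("100.64.0.0/10")  # range used for Tailscale node IPs


def internal_ips(text: str) -> list:
    """Return the sorted, de-duplicated internal IPv4 addresses found in text."""
    found = set()
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # not a valid address, e.g. 999.1.1.1
        if address.is_private or address in CGNAT:
            found.add(candidate)
    return sorted(found)
```

Note that `is_private` also matches loopback and documentation ranges, so a docs build using this check may need a small allowlist for intentional examples.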

Backup Security

Restic encryption is the right baseline. The concerns are plaintext staging before restic runs, missing Wazuh coverage, and missing restore proof for external critical data.

Revision: For every external fetch script and backup pre-hook:

- Create staging directories with 0700.
- Write sensitive files with 0600.
- Delete remote plaintext dumps after transfer.
- Use cleanup traps.
- Record restore-test evidence for tier-1 data.
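The permission requirements in this list can be both applied and audited with a few standard-library calls. The sketch below is illustrative: the helper names are hypothetical, POSIX permissions are assumed, and remote cleanup and restore testing are out of scope.

```python
import os
import stat
import tempfile


def make_staging_dir(prefix: str = "backup-staging-") -> str:
    """Create a fresh staging directory; mkdtemp makes it owner-only (0700)."""
    return tempfile.mkdtemp(prefix=prefix)


def write_sensitive(path: str, data: bytes) -> None:
    """Create path with mode 0600, refusing to reuse an existing file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)


def loose_permissions(staging_root: str) -> list:
    """Return paths under staging_root that group or others can access."""
    offenders = []
    for current, _dirnames, filenames in os.walk(staging_root):
        for path in [current] + [os.path.join(current, name) for name in filenames]:
            if stat.S_IMODE(os.lstat(path).st_mode) & 0o077:
                offenders.append(path)
    return sorted(offenders)
```

A backup pre-hook could call `loose_permissions` just before restic runs and abort on a non-empty result, with `shutil.rmtree` on the staging directory in a `finally` block serving as the cleanup trap.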

Security Acceptance Criteria

Contractor remediation should be accepted only after:

- PR workflows no longer execute untrusted code on persistent homelab runners.
- SOPS recipient scope is segmented and the old shared automation key is rotated.
- Direct ports are removed, loopback-bound, or protected with explicit firewall allowlists.
- Wazuh demo users and broad proxy trust are removed or tightly documented.
- Tailscale ACLs include deny tests for unintended lateral movement.
- Public docs exposure is either reduced or formally accepted with a review date.
- Secret-handling scripts pass a tabletop interruption test without leaving plaintext material behind.