Komodo reliability audit (2026-07-03)

Trigger: Owner frustration with recurring Komodo problems on effectively every merge to main ("some regressions, some mistakes, some unaccounted requirements"), with a live example (komodo.infra.realemail.app/repos/6a04081f4c98985723c7d087) still broken at request time. Ask: a thorough review, with documentation and recommendations, so that the next Komodo problem is "absolutely unique, unavoidable, and unpredictable" rather than a repeat of something already seen and fixed once.

Method: Four agents worked in parallel against independent evidence sources so the conclusions triangulate rather than rely on a single vantage point:

Agent	Scope
History mining	Every git commit, journal entry, and PR touching Komodo since Phase 4
Live audit	SSH into `infra-services`, inspect running containers, Mongo, `git status`, Komodo API/UI state
Structural review	Read every Komodo script, compose file, and `resources.toml` for logic/idempotency gaps
External research	Upstream Komodo docs/issues, known footguns in Mongo + GitHub Actions that apply here

This document synthesizes all four, plus a live incident (see postmortem) that the live-audit pass found and this session resolved in real time.

Executive summary

Komodo itself is not unreliable. The same three failure classes have each been fixed independently, between four and seven or more times since Phase 4 (see the per-bucket counts in the taxonomy below), plus one fix (PR #40) that broke an invariant a previous fix had established and was undone one day later by PR #43. None of these was ever written up as a postmortem, so each recurrence was rediscovered from scratch instead of prevented. The root cause is structural, not incidental:

Secrets/state live in up to four disconnected stores (SOPS → compose.env → running container env → live MongoDB) with no atomic reconciliation and no automated check that they agree.
resources.toml has zero CI validation despite this repo having a proven, reusable pattern for exactly this (schema-validation / generators-check jobs in lint.yml).
Komodo has no configured Alerter. A Procedure stage that fails on every run for five days produces no notification anywhere.
The one built-in guardrail against UI/git drift (KOMODO_UI_WRITE_DISABLED) was never set.

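None of these require replacing Komodo or redesigning the architecture. CI validation and KOMODO_UI_WRITE_DISABLED are closed in this change, and secret drift is mitigated by Fix 1 and Fix 4 (see What changed in this audit); the Alerter is a P0 owner action, and the rest is a prioritized, owned backlog below.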

Recurring incident taxonomy

Three buckets account for essentially every Komodo incident in the project's history (commit hashes are on main, oldest first):

1. Secret / credential drift (7+ occurrences)

Commit / PR	What drifted
`a3adea5`	SOPS placeholders (`GENERATE_ON_HOST`) left in `compose.env`, mistaken for real secrets
`602e616`	First occurrence of Mongo password drift from `compose.env` → wrote `fix-komodo-mongo-auth.sh`
`b662319`	Moved secrets to SOPS after a `git pull` wiped OIDC creds that were only ever host-local
2026-06-26 journal	Rotated all Komodo secrets, hit the same Mongo-drift pattern the rescue script was written for
PR #49 / this session	GitHub Actions `KOMODO_WEBHOOK_SECRET` silently held an empty value for over a week despite being "set" in Settings
This session	Second live occurrence of Mongo password drift → `komodo-core` crash loop, resolved with the same rescue script
Ongoing structural gap	`rotate-komodo-secrets.sh` had no pre-flight check that its assumed "old" Mongo password actually still worked (fixed in this audit)

Why it keeps recurring: four stores (SOPS ciphertext, rendered compose.env, the live container's env, and MongoDB's own user table) can each be updated independently, and nothing checks they agree until something crashes. GitHub Actions secrets add a fifth store for the webhook secret specifically, and GitHub secrets have a specific, undocumented footgun (see External research findings).
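The missing agreement check could look roughly like the sketch below. It is an illustration only: the function names are hypothetical and not part of this repo, and it covers only stores that can be dumped as KEY=VALUE text (for example the rendered compose.env and the running container's env). MongoDB's own user table cannot be read this way and needs a real login attempt instead. Values are compared by fingerprint so the check never prints a secret.

```python
import hashlib


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def fingerprint(value: str) -> str:
    """Short SHA-256 prefix, so values can be compared without logging them."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]


def find_drift(stores: dict[str, dict[str, str]], keys: list[str]) -> dict[str, dict[str, str]]:
    """Return {key: {store_name: fingerprint}} for each key whose stores disagree.

    A key that is missing from a store is treated as the empty string,
    so "unset in one place" also counts as drift.
    """
    drift = {}
    for key in keys:
        seen = {name: fingerprint(env.get(key, "")) for name, env in stores.items()}
        if len(set(seen.values())) > 1:
            drift[key] = seen
    return drift
```

In use, one store would be `parse_env(Path("compose.env").read_text())` and another the output of `docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' komodo-core` passed through the same parser; a non-empty result from `find_drift` is the signal that something disagrees before anything crashes.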

2. ResourceSync / source-of-truth drift (7+ occurrences)

Commit / PR	What drifted
`7d39cb0`	Had to mount `resources.toml` into the Core container — it wasn't visible at all
`1302a3d`	Switched Sync from file-on-host to git-repo mode
`cd83fd9`	Marked the ResourceSync TODO "done" — later reopened
`1dc5342` (#25)	`/opt/homelab` wasn't mounted into `komodo-periphery`, breaking `PullRepo` and file-on-server stacks
`669715b`	Repo pull targeted the wrong path and needed SSH periphery doesn't have
`0732a26`	Fixed ResourceSync "source of truth" for the homelab repo pull a second time
`4bad455`	Stacks weren't `files_on_host`, so Komodo's view of "running" didn't match `docker ps`
This session	New stacks (`litellm`) sat in `resources.toml` for days without ever being created — `BatchDeployStackIfChanged` only touches resources Komodo already knows about

Why it keeps recurring: resources.toml is treated as the single source of truth in every README and rule in this repo, but Komodo's own webhook model (moghtech/komodo#1120) can target a Procedure, Build, Repo, or Stack, not a Sync, so nothing in the push-to-deploy path ever applied new resource declarations automatically. Manual "Execute Sync" clicks in the UI were the only path, and skipping that step (easy to do, since nothing forces it) is indistinguishable from "the sync silently didn't work."

3. Container lifecycle / stack-specific quirks (4+ occurrences in one 36h window)

Commit / PR	Issue
`7575c55` (#39)	Wazuh post-deploy 401s after `deploy-infra` runs
`6750801` → PR #40	Wired Wazuh pre/post-deploy hooks — introduced a regression
PR #42	Follow-up fix attempt for the same Wazuh hook wiring
PR #43 (`0811ddd`)	"Keep Komodo stack file paths compose-only" — reverted the part of #40 that broke stacks by adding non-compose paths to `file_paths`

Why it keeps recurring: Wazuh is the one stack with pre_deploy / post_deploy hooks, making it structurally different from every other [[stack]] block. That asymmetry is exactly where a copy-paste or "helpful" addition (like non-compose paths in file_paths) slips through, because nothing validated resources.toml's structure before this audit.

Live confirmation found while writing this audit

While verifying the RunSync fix below, a direct Mongo query turned up an already-live instance of bucket 2: the live wazuh Stack resource in Komodo's database still has the pre-PR-#43 broken config — file_paths includes 7 script paths (scripts/fix-cert-permissions.sh, etc.) mixed into the compose list, and env_file_path is set to .env, a file that doesn't exist (Wazuh's real env file is compose.env). Git's resources.toml has been correct since PR #43 merged; the live Sync was simply never re-executed afterward, so Komodo has been silently unable to redeploy wazuh for as long as that gap has existed. The Wazuh containers themselves are healthy and unaffected (running 40+ hours at time of writing) — this is the same "GitOps arrow silently broken" shape as the litellm postmortem, just on a second, previously-undiscovered stack. The RunSync stage added in Fix 3 is expected to self-heal this on the next deploy-infra run after merge; verifying that is tracked in the PR's test plan.

Aggregate

An independent commit/CI-history pass (the history-mining agent) counted 24 distinct Komodo-related incidents since Phase 4, with roughly a 61% CI failure rate on Komodo-touching pushes. That is consistent with three buckets each getting "fixed" four to seven or more times instead of once.

Structural root cause

Every bucket in the taxonomy above traces back to two design properties of the current setup, both fixable without replacing Komodo:

State fan-out with no reconciliation check. The webhook secret lives in four stores (SOPS, compose.env, the komodo-core env, and the GitHub Actions secret), and until this audit nothing read any of them consistently; Mongo credentials also span four (SOPS, compose.env, the container env, and MongoDB's own user table). Every incident in bucket 1 is some subset of these disagreeing.
No CI gate on the one file everything else depends on. services/komodo/resources.toml gets the same "write TOML, hope it's right, find out on the live host" treatment that inventory/ explicitly moved away from months ago (schema-validation job in lint.yml). Every incident in bucket 3, and part of bucket 2, is a resources.toml mistake that a five-line CI check would have caught before merge.

Two items from the contractor assessment (findings.md) are directly relevant and still open:

CA-015 (Docker socket access overused) — komodo-periphery is one of the five services with a raw /var/run/docker.sock mount. Not the cause of any incident above, but raises the blast radius of any Komodo compromise.
CA-016 (Komodo deploy ignores Grafana dashboard changes) — the paths-ignore in komodo-deploy.yml still excludes monitoring/grafana/**; still accurate as of this audit.

External research findings

Confirmed against upstream Komodo docs/source and general platform behavior, not homelab-specific speculation:

MONGO_INITDB_ROOT_PASSWORD only applies on first volume init — this is standard mongo image behavior, not a Komodo bug, and it is the exact mechanism behind every "Mongo password drift" incident above. There is no way to make Mongo "just pick up" a new compose.env password on restart; db.changeUserPassword() is mandatory for any rotation.
KOMODO_UI_WRITE_DISABLED is a real, documented Komodo Core env var that rejects resource mutations from the UI/API, forcing all changes through git + Sync. It defaults to unset (writes allowed), and was unset here — meaning any UI click (deliberate or accidental) could silently diverge from resources.toml with nothing to catch it. This audit sets it.
Komodo webhooks can target a Procedure, Build, Repo, or Stack — not a Sync directly (moghtech/komodo#1120, also documented in this repo's own resources.toml header comment). The existing deploy-infra Procedure correctly works around this for known stacks via BatchDeployStackIfChanged, but nothing applied new resource declarations — the gap this audit closes with a RunSync stage.
GitHub Actions repository secrets can hold an empty string with zero validation or warning — Settings → Secrets shows the secret exists by name; it does not show whether the value is empty. Nothing about this is Komodo-specific, but it was the direct cause of the most recent CI failure in this project and is worth calling out because it will bite other secrets in this repo the same way if they're ever manually re-synced.

What changed in this audit

Concrete fixes from this session, not just recommendations (the repo changes take effect once this change merges):

Fix 1: Stop depending on a GitHub Secret for a value the runner already has on disk

scripts/trigger-komodo-deploy.py now reads KOMODO_WEBHOOK_SECRET from /opt/homelab/services/komodo/compose.env first, and only falls back to the $KOMODO_WEBHOOK_SECRET environment variable (the GitHub Actions secret) if that file doesn't exist. The self-hosted runner is infra-services, so it already has the authoritative value on disk. This removes the "empty/stale GitHub secret" failure class from the automated deploy path; the GitHub secret remains only as a fallback for manual/portable triggering.

Fix 2: `KOMODO_UI_WRITE_DISABLED=true`

Set directly in services/komodo/compose.yml (not a secret, so it's visible in every diff, not routed through SOPS/render). Any future UI mutation now fails loudly instead of silently drifting from resources.toml.

Fix 3: `RunSync` stage in `deploy-infra`

Added a stage between PullRepo and BatchDeployStackIfChanged that runs RunSync against the infra-services-01 sync. New resources.toml declarations (a new stack, a changed file_paths, a new procedure) now apply automatically on the next push instead of requiring a manual "Execute" click in the UI that's easy to forget.

Fix 4: Pre-flight auth check before secret rotation

scripts/rotate-komodo-secrets.sh now verifies the current compose.env Mongo password actually authenticates against the live komodo-mongo container before attempting changeUserPassword. If it doesn't (i.e. drift already happened), the script aborts with a pointer to fix-komodo-mongo-auth.sh instead of compounding the drift.
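The pre-flight idea is shown here as a Python sketch; the real script is shell, so this is an illustration of the logic, not its implementation. The function names and the injectable `run` parameter are hypothetical. It relies on mongosh authenticating at connect time, so bad credentials produce a non-zero exit code. Passing the password on the command line exposes it in the process list, which a real script may want to avoid.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_quiet(cmd: Sequence[str]) -> int:
    """Run a command, discard its output, and return the exit code."""
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode


def mongo_password_works(user: str, password: str,
                         container: str = "komodo-mongo",
                         run: Runner = run_quiet) -> bool:
    """True if the given credentials authenticate against the live container."""
    cmd = [
        "docker", "exec", container, "mongosh", "--quiet",
        "--username", user, "--password", password,
        "--authenticationDatabase", "admin",
        "--eval", "db.adminCommand('ping')",
    ]
    return run(cmd) == 0


def preflight(user: str, password: str, run: Runner = run_quiet) -> None:
    """Abort before rotation if the assumed current password is already stale."""
    if not mongo_password_works(user, password, run=run):
        raise SystemExit(
            "compose.env Mongo password does not authenticate against komodo-mongo; "
            "run fix-komodo-mongo-auth.sh before rotating"
        )
```

Calling `preflight(...)` as the first step of a rotation means a drifted password stops the run before `changeUserPassword` is attempted.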

Fix 5: CI validation for `resources.toml`

New scripts/validate-komodo-resources.py, wired into lint.yml as komodo-resources-check. Validates TOML syntax, that every run_directory and file_paths entry actually exists in the repo, that file_paths only contains compose-looking files (the exact PR #40 regression class), and that pre_deploy/post_deploy scripts exist. This is the same pattern this repo already uses for inventory/ (schema-validation job) applied to the one other declarative config file that had none.

Fix 6: Live incident resolved

The 5d21h-blocked git pull and the resulting komodo-core Mongo auth crash loop found during the live audit were fixed in real time — see the postmortem for the full timeline, root cause, and action items (several of which are the fixes above).

Path forward

P0: Stop the bleeding (owner action required)

Item	Why it's P0	Owner
Configure a Komodo Alerter (Discord/ntfy/Slack) for Procedure and Stack failures	The 5-day outage was invisible without a human opening the UI. Every fix above still requires something to notice a failure eventually. Alerter creation isn't exposed via TOML in the version deployed here — needs one UI click.	Owner
Confirm `KOMODO_UI_WRITE_DISABLED=true` didn't break any workflow that assumed UI writes worked (e.g. manual stack edits)	New behavior change — worth a quick sanity pass after deploy	Owner (verify)

P1: Close the structural gaps (agent-executable, tracked as follow-up work)

Item	Why	Owner
Register Phoenix (LXC 124) as a full Komodo-managed `Server` + install periphery	Phoenix is a `host`-class entity per `homelab-project.mdc` and should get full Ansible + Komodo management, not remain outside GitOps	Agent
Declare the `litellm` and `phoenix` stacks' full lifecycle (not just `resources.toml` presence) once Phoenix has a Server	Currently `litellm` is declared but was never `RunSync`'d before this audit; `phoenix` isn't declared at all pending its own Komodo `Server`	Agent
Host-level drift metric: alert if `/opt/homelab` has untracked files matching an incoming `main` diff before the next deploy	Directly prevents a repeat of the 2026-07-03 postmortem	Agent (design), Owner (approve alerting channel)
Extend `validate-komodo-resources.py` to check for duplicate/orphaned `[[procedure]]` and `[[repo]]` blocks as the file grows	Prevention scales with the file; current checks cover today's incident classes	Agent
CA-015: introduce `docker-socket-proxy` for `komodo-periphery` (and the other 4 services)	Reduces blast radius; not a reliability fix, but tracked here since Komodo is in scope	Agent (implementation), Owner (approve)
CA-016: decide whether Grafana dashboard changes should redeploy the `monitoring` stack, or document why not	Still open from the 2026-06-28 contractor assessment	Owner (decision), Agent (implement)
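The host-level drift item in the table above could start from something like this sketch: intersect the untracked files in the checkout with the paths the incoming `main` diff touches. The function names are hypothetical, the design is not decided, and it assumes `git fetch` has already run so that `origin/main` is current.

```python
import subprocess


def git_lines(repo: str, *args: str) -> set[str]:
    """Run a git command in `repo` and return its non-empty output lines."""
    out = subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    ).stdout
    return {line for line in out.splitlines() if line}


def blocking_untracked(untracked: set[str], incoming: set[str]) -> list[str]:
    """Untracked paths that the incoming diff also touches, sorted."""
    return sorted(untracked & incoming)


def check_repo(repo: str = "/opt/homelab", remote_ref: str = "origin/main") -> list[str]:
    """Paths that would collide with the next pull; empty means no collision."""
    untracked = git_lines(repo, "ls-files", "--others", "--exclude-standard")
    incoming = git_lines(repo, "diff", "--name-only", f"HEAD...{remote_ref}")
    return blocking_untracked(untracked, incoming)
```

A non-empty result from `check_repo()` is what would be exported as a metric or sent to the alerting channel before the next deploy.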

P2: Longer-term architecture questions (owner decision needed before agent work)

Item	Trade-off
Should Komodo secrets collapse to fewer stores (e.g. Core reads SOPS directly instead of via rendered `compose.env`)?	Fewer moving parts vs. losing the "plain env file for `docker compose --env-file`" simplicity Komodo's compose model expects
Should the webhook secret rotate on a schedule (forcing the reconciliation muscle to be exercised regularly) instead of only on manual rotation?	Catches drift proactively vs. adds a recurring task that itself needs monitoring
Should Komodo's own MongoDB be backed up on a schedule independent of Proxmox/vzdump, given how central Komodo is to GitOps?	Not currently in any backup runbook; worth a decision, not urgent given `resources.toml` is git-recoverable