Komodo reliability audit (2026-07-03)¶
Trigger: Owner frustration with recurring Komodo problems on effectively
every merge to main — "some regressions, some mistakes, some unaccounted
requirements" — with a live example
(komodo.infra.realemail.app/repos/6a04081f4c98985723c7d087)
still broken at request time. Ask: a thorough review, documentation, and
recommendation operation so that the next Komodo problem is "absolutely
unique, unavoidable, and unpredictable" — not a repeat of something already
seen and fixed once.
Method: Four agents worked in parallel against independent evidence sources so the conclusions triangulate rather than rely on a single vantage point:
| Agent | Scope |
|---|---|
| History mining | Every git commit, journal entry, and PR touching Komodo since Phase 4 |
| Live audit | SSH into infra-services, inspect running containers, Mongo, git status, Komodo API/UI state |
| Structural review | Read every Komodo script, compose file, and resources.toml for logic/idempotency gaps |
| External research | Upstream Komodo docs/issues, known footguns in Mongo + GitHub Actions that apply here |
This document synthesizes all four, plus a live incident (see postmortem) that the live-audit pass found and this session resolved in real time.
Executive summary¶
Komodo itself is not unreliable — the same three failure classes have been fixed, independently, six to seven times each since Phase 4, plus one fix (PR #40) that broke an invariant a previous fix had established, undone one day later by PR #43. None of these were ever written up as a postmortem, so each recurrence was rediscovered from scratch instead of prevented. The root cause is structural, not incidental:
- Secrets/state live in up to four disconnected stores (SOPS →
compose.env→ running container env → live MongoDB) with no atomic reconciliation and no automated check that they agree. resources.tomlhas zero CI validation despite this repo having a proven, reusable pattern for exactly this (schema-validation/generators-checkjobs inlint.yml).- Komodo has no configured Alerter. A Procedure stage that fails on every run for five days produces no notification anywhere.
- The one built-in guardrail against UI/git drift
(
KOMODO_UI_WRITE_DISABLED) was never set.
None of these require replacing Komodo or redesigning the architecture. All four are closed or well underway in this change (see What changed in this audit); the rest are a prioritized, owned backlog below.
Recurring incident taxonomy¶
Three buckets account for essentially every Komodo incident in the project's
history (commit hashes are on main, oldest first):
1. Secret / credential drift (7+ occurrences)¶
| Commit / PR | What drifted |
|---|---|
a3adea5 |
SOPS placeholders (GENERATE_ON_HOST) left in compose.env, mistaken for real secrets |
602e616 |
First occurrence of Mongo password drift from compose.env → wrote fix-komodo-mongo-auth.sh |
b662319 |
Moved secrets to SOPS after a git pull wiped OIDC creds that were only ever host-local |
| 2026-06-26 journal | Rotated all Komodo secrets, hit the same Mongo-drift pattern the rescue script was written for |
| PR #49 / this session | GitHub Actions KOMODO_WEBHOOK_SECRET silently held an empty value for over a week despite being "set" in Settings |
| This session | Second live occurrence of Mongo password drift → komodo-core crash loop, resolved with the same rescue script |
| Ongoing structural gap | rotate-komodo-secrets.sh had no pre-flight check that its assumed "old" Mongo password actually still worked (fixed in this audit) |
Why it keeps recurring: four stores (SOPS ciphertext, rendered
compose.env, the live container's env, and MongoDB's own user table) can
each be updated independently, and nothing checks they agree until something
crashes. GitHub Actions secrets add a fifth store for the webhook secret
specifically, and GitHub secrets have a specific, undocumented footgun (see
External research findings).
2. ResourceSync / source-of-truth drift (7+ occurrences)¶
| Commit / PR | What drifted |
|---|---|
7d39cb0 |
Had to mount resources.toml into the Core container — it wasn't visible at all |
1302a3d |
Switched Sync from file-on-host to git-repo mode |
cd83fd9 |
Marked the ResourceSync TODO "done" — later reopened |
1dc5342 (#25) |
/opt/homelab wasn't mounted into komodo-periphery, breaking PullRepo and file-on-server stacks |
669715b |
Repo pull targeted the wrong path and needed SSH periphery doesn't have |
0732a26 |
Fixed ResourceSync "source of truth" for the homelab repo pull a second time |
4bad455 |
Stacks weren't files_on_host, so Komodo's view of "running" didn't match docker ps |
| This session | New stacks (litellm) sat in resources.toml for days without ever being created — BatchDeployStackIfChanged only touches resources Komodo already knows about |
Why it keeps recurring: resources.toml is treated as the single source
of truth in every README and rule in this repo, but Komodo's own webhook
model (docs moghtech/komodo#1120) can only target a Procedure or a Stack —
not a Sync — so nothing in the push-to-deploy path was ever applying new
resource declarations automatically. Manual "Execute Sync" clicks in the UI
were the only path, and skipping that step (easy to do, since nothing forces
it) is indistinguishable from "the sync silently didn't work."
3. Container lifecycle / stack-specific quirks (4+ occurrences in one 36h window)¶
| Commit / PR | Issue |
|---|---|
7575c55 (#39) |
Wazuh post-deploy 401s after deploy-infra runs |
6750801 → PR #40 |
Wired Wazuh pre/post-deploy hooks — introduced a regression |
| PR #42 | Follow-up fix attempt for the same Wazuh hook wiring |
PR #43 (0811ddd) |
"Keep Komodo stack file paths compose-only" — reverted the part of #40 that broke stacks by adding non-compose paths to file_paths |
Why it keeps recurring: Wazuh is the one stack with pre_deploy /
post_deploy hooks, making it structurally different from every other
[[stack]] block. That asymmetry is exactly where a copy-paste or
"helpful" addition (like non-compose paths in file_paths) slips through,
because nothing validated resources.toml's structure before this audit.
Live confirmation found while writing this audit¶
While verifying the RunSync fix below, a direct Mongo query turned up an
already-live instance of bucket 2: the live wazuh Stack resource in
Komodo's database still has the pre-PR-#43 broken config — file_paths
includes 7 script paths (scripts/fix-cert-permissions.sh, etc.) mixed into
the compose list, and env_file_path is set to .env, a file that doesn't
exist (Wazuh's real env file is compose.env). Git's resources.toml has
been correct since PR #43 merged; the live Sync was simply never
re-executed afterward, so Komodo has been silently unable to redeploy
wazuh for as long as that gap has existed. The Wazuh containers themselves
are healthy and unaffected (running 40+ hours at time of writing) — this is
the same "GitOps arrow silently broken" shape as the litellm postmortem,
just on a second, previously-undiscovered stack. The RunSync stage added
in Fix 3 is expected to self-heal
this on the next deploy-infra run after merge; verifying that is tracked
in the PR's test plan.
Aggregate¶
An independent commit/CI-history pass (the history-mining agent) counted 24 distinct Komodo-related incidents since Phase 4 with roughly a 61% CI failure rate on Komodo-touching pushes — consistent with three buckets each getting "fixed" 5–7 times instead of once.
Structural root cause¶
The taxonomy above all traces back to two design properties of the current setup, both fixable without replacing Komodo:
- State fan-out with no reconciliation check. Five potential sources of
truth exist for one webhook secret (SOPS,
compose.env,komodo-coreenv, GitHub Actions secret, and — until this audit — nothing reading any of them consistently), and four for Mongo credentials. Every incident in bucket 1 is some subset of these disagreeing. - No CI gate on the one file everything else depends on.
services/komodo/resources.tomlgets the same "write TOML, hope it's right, find out on the live host" treatment thatinventory/explicitly moved away from months ago (schema-validationjob inlint.yml). Every incident in bucket 3, and part of bucket 2, is aresources.tomlmistake that a five-line CI check would have caught before merge.
Two items from the contractor assessment (findings.md) are directly relevant and still open:
- CA-015 (Docker socket access overused) —
komodo-peripheryis one of the five services with a raw/var/run/docker.sockmount. Not the cause of any incident above, but raises the blast radius of any Komodo compromise. - CA-016 (Komodo deploy ignores Grafana dashboard changes) — the
paths-ignoreinkomodo-deploy.ymlstill excludesmonitoring/grafana/**; still accurate as of this audit.
External research findings¶
Confirmed against upstream Komodo docs/source and general platform behavior, not homelab-specific speculation:
MONGO_INITDB_ROOT_PASSWORDonly applies on first volume init — this is standardmongoimage behavior, not a Komodo bug, and it is the exact mechanism behind every "Mongo password drift" incident above. There is no way to make Mongo "just pick up" a newcompose.envpassword on restart;db.changeUserPassword()is mandatory for any rotation.KOMODO_UI_WRITE_DISABLEDis a real, documented Komodo Core env var that rejects resource mutations from the UI/API, forcing all changes through git + Sync. It defaults to unset (writes allowed), and was unset here — meaning any UI click (deliberate or accidental) could silently diverge fromresources.tomlwith nothing to catch it. This audit sets it.- Komodo webhooks can target a Procedure, Build, Repo, or Stack — not a
Sync directly (
moghtech/komodo#1120, also documented in this repo's ownresources.tomlheader comment). The existingdeploy-infraProcedure correctly works around this for known stacks viaBatchDeployStackIfChanged, but nothing applied new resource declarations — the gap this audit closes with aRunSyncstage. - GitHub Actions repository secrets can hold an empty string with zero
validation or warning —
Settings → Secretsshows the secret exists by name; it does not show whether the value is empty. Nothing about this is Komodo-specific, but it was the direct cause of the most recent CI failure in this project and is worth calling out because it will bite other secrets in this repo the same way if they're ever manually re-synced.
What changed in this audit¶
Concrete, merged fixes (not just recommendations) from this session:
Fix 1: Stop depending on a GitHub Secret for a value the runner already has on disk¶
scripts/trigger-komodo-deploy.py now reads KOMODO_WEBHOOK_SECRET from
/opt/homelab/services/komodo/compose.env first, and only falls back to the
$KOMODO_WEBHOOK_SECRET environment variable (the GitHub Actions secret) if
that file doesn't exist. The self-hosted runner is infra-services — it
already has the authoritative value on disk. This permanently eliminates the
"empty/stale GitHub secret" failure class for the automated deploy path; the
GitHub secret remains only as a fallback for manual/portable triggering.
Fix 2: KOMODO_UI_WRITE_DISABLED=true¶
Set directly in services/komodo/compose.yml (not a secret, so it's
visible in every diff, not routed through SOPS/render). Any future UI
mutation now fails loudly instead of silently drifting from
resources.toml.
Fix 3: RunSync stage in deploy-infra¶
Added a stage between PullRepo and BatchDeployStackIfChanged that runs
RunSync against the infra-services-01 sync. New resources.toml
declarations (a new stack, a changed file_paths, a new procedure) now
apply automatically on the next push instead of requiring a manual "Execute"
click in the UI that's easy to forget.
Fix 4: Pre-flight auth check before secret rotation¶
scripts/rotate-komodo-secrets.sh now verifies the current
compose.env Mongo password actually authenticates against the live
komodo-mongo container before attempting changeUserPassword. If it
doesn't (i.e. drift already happened), the script aborts with a pointer to
fix-komodo-mongo-auth.sh instead of compounding the drift.
Fix 5: CI validation for resources.toml¶
New scripts/validate-komodo-resources.py, wired into lint.yml as
komodo-resources-check. Validates TOML syntax, that every run_directory
and file_paths entry actually exists in the repo, that file_paths only
contains compose-looking files (the exact PR #40 regression class), and that
pre_deploy/post_deploy scripts exist. This is the same pattern this repo
already uses for inventory/ (schema-validation job) applied to the one
other declarative config file that had none.
Fix 6: Live incident resolved¶
The 5d21h-blocked git pull and the resulting komodo-core Mongo auth
crash loop found during the live audit were fixed in real time — see the
postmortem for
the full timeline, root cause, and action items (several of which are the
fixes above).
Path forward¶
P0: Stop the bleeding (owner action required)¶
| Item | Why it's P0 | Owner |
|---|---|---|
| Configure a Komodo Alerter (Discord/ntfy/Slack) for Procedure and Stack failures | The 5-day outage was invisible without a human opening the UI. Every fix above still requires something to notice a failure eventually. Alerter creation isn't exposed via TOML in the version deployed here — needs one UI click. | Owner |
Confirm KOMODO_UI_WRITE_DISABLED=true didn't break any workflow that assumed UI writes worked (e.g. manual stack edits) |
New behavior change — worth a quick sanity pass after deploy | Owner (verify) |
P1: Close the structural gaps (agent-executable, tracked as follow-up work)¶
| Item | Why | Owner |
|---|---|---|
Register Phoenix (LXC 124) as a full Komodo-managed Server + install periphery |
Phoenix is a host-class entity per homelab-project.mdc and should get full Ansible + Komodo management, not remain outside GitOps |
Agent |
Declare the litellm and phoenix stacks' full lifecycle (not just resources.toml presence) once Phoenix has a Server |
Currently litellm is declared but was never RunSync'd before this audit; phoenix isn't declared at all pending its own Komodo Server |
Agent |
Host-level drift metric: alert if /opt/homelab has untracked files matching an incoming main diff before the next deploy |
Directly prevents a repeat of the 2026-07-03 postmortem | Agent (design), Owner (approve alerting channel) |
Extend validate-komodo-resources.py to check for duplicate/orphaned [[procedure]] and [[repo]] blocks as the file grows |
Prevention scales with the file; current checks cover today's incident classes | Agent |
CA-015 — introduce docker-socket-proxy for komodo-periphery (and the other 4 services) |
Reduces blast radius, not urgency-linked to reliability but tracked here since Komodo is in scope | Agent (implementation), Owner (approve) |
CA-016 — decide whether Grafana dashboard changes should redeploy monitoring stack, or document why not |
Still open from 2026-06-28 contractor assessment | Owner (decision), Agent (implement) |
P2: Longer-term architecture questions (owner decision needed before agent work)¶
| Item | Trade-off |
|---|---|
Should Komodo secrets collapse to fewer stores (e.g. Core reads SOPS directly instead of via rendered compose.env)? |
Fewer moving parts vs. losing the "plain env file for docker compose --env-file" simplicity Komodo's compose model expects |
| Should the webhook secret rotate on a schedule (forcing the reconciliation muscle to be exercised regularly) instead of only on manual rotation? | Catches drift proactively vs. adds a recurring task that itself needs monitoring |
| Should Komodo's own MongoDB back up on a schedule independent of Proxmox/vzdump, given how central Komodo is to GitOps? | Not currently in any backup runbook — worth a decision, not urgent given resources.toml is git-recoverable |