Komodo reliability audit (2026-07-03)

Trigger: Owner frustration with recurring Komodo problems on effectively every merge to main — "some regressions, some mistakes, some unaccounted requirements" — with a live example (komodo.infra.realemail.app/repos/6a04081f4c98985723c7d087) still broken at request time. Ask: a thorough review, documentation, and recommendation operation so that the next Komodo problem is "absolutely unique, unavoidable, and unpredictable" — not a repeat of something already seen and fixed once.

Method: Four agents worked in parallel against independent evidence sources so the conclusions triangulate rather than rely on a single vantage point:

| Agent | Scope |
| --- | --- |
| History mining | Every git commit, journal entry, and PR touching Komodo since Phase 4 |
| Live audit | SSH into infra-services, inspect running containers, Mongo, git status, Komodo API/UI state |
| Structural review | Read every Komodo script, compose file, and resources.toml for logic/idempotency gaps |
| External research | Upstream Komodo docs/issues, known footguns in Mongo + GitHub Actions that apply here |

This document synthesizes all four, plus a live incident (see postmortem) that the live-audit pass found and this session resolved in real time.

Executive summary

Komodo itself is not the unreliable part. The same three failure classes have each been fixed independently, over and over, since Phase 4 (7+, 7+, and 4+ times respectively; see the taxonomy below), and one fix (PR #40) broke an invariant that a previous fix had established and was undone a day later by PR #43. None of these incidents was ever written up as a postmortem, so each recurrence was rediscovered from scratch instead of prevented. The root cause is structural, not incidental:

  1. Secrets/state live in up to four disconnected stores (SOPS → compose.env → running container env → live MongoDB) with no atomic reconciliation and no automated check that they agree.
  2. resources.toml has zero CI validation despite this repo having a proven, reusable pattern for exactly this (schema-validation / generators-check jobs in lint.yml).
  3. Komodo has no configured Alerter. A Procedure stage that fails on every run for five days produces no notification anywhere.
  4. The one built-in guardrail against UI/git drift (KOMODO_UI_WRITE_DISABLED) was never set.

None of these requires replacing Komodo or redesigning the architecture. Three of the four are closed or well underway in this change (see What changed in this audit); the Alerter still needs an owner action and sits at the top of the prioritized, owned backlog below.

Recurring incident taxonomy

Three buckets account for essentially every Komodo incident in the project's history (commit hashes are on main, oldest first):

1. Secret / credential drift (7+ occurrences)

| Commit / PR | What drifted |
| --- | --- |
| a3adea5 | SOPS placeholders (GENERATE_ON_HOST) left in compose.env, mistaken for real secrets |
| 602e616 | First occurrence of Mongo password drift from compose.env → wrote fix-komodo-mongo-auth.sh |
| b662319 | Moved secrets to SOPS after a git pull wiped OIDC creds that were only ever host-local |
| 2026-06-26 journal | Rotated all Komodo secrets, hit the same Mongo-drift pattern the rescue script was written for |
| PR #49 / this session | GitHub Actions KOMODO_WEBHOOK_SECRET silently held an empty value for over a week despite being "set" in Settings |
| This session | Second live occurrence of Mongo password drift → komodo-core crash loop, resolved with the same rescue script |
| Ongoing structural gap | rotate-komodo-secrets.sh had no pre-flight check that its assumed "old" Mongo password actually still worked (fixed in this audit) |

Why it keeps recurring: four stores (SOPS ciphertext, rendered compose.env, the live container's env, and MongoDB's own user table) can each be updated independently, and nothing checks they agree until something crashes. GitHub Actions secrets add a fifth store for the webhook secret specifically, and GitHub secrets have a specific, undocumented footgun (see External research findings).
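A reconciliation check of the kind this paragraph says is missing could look something like the minimal sketch below. It assumes each store's value has already been read by some other means (decrypting SOPS, parsing compose.env, inspecting the container); the function names are hypothetical and not part of this repo. It compares fingerprints rather than raw values so that nothing secret reaches a log.

```python
import hashlib
from typing import Mapping


def fingerprint(value: str) -> str:
    """Short SHA-256 prefix, so values can be compared without printing the secret."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def find_drift(stores: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means every store agrees.

    `stores` maps a store label (hypothetical, e.g. "sops", "compose.env")
    to the value that store currently holds.
    """
    # An empty value is reported on its own: it is the GitHub-secret failure mode.
    problems = [f"{name}: value is empty" for name, value in stores.items() if not value]
    prints = {name: fingerprint(value) for name, value in stores.items() if value}
    if len(set(prints.values())) > 1:
        detail = ", ".join(f"{name}={fp}" for name, fp in sorted(prints.items()))
        problems.append(f"stores disagree: {detail}")
    return problems
```

Run before a deploy or a rotation, a non-empty result would be the signal to stop and reconcile rather than to proceed.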

2. ResourceSync / source-of-truth drift (7+ occurrences)

| Commit / PR | What drifted |
| --- | --- |
| 7d39cb0 | Had to mount resources.toml into the Core container; it wasn't visible at all |
| 1302a3d | Switched Sync from file-on-host to git-repo mode |
| cd83fd9 | Marked the ResourceSync TODO "done", later reopened |
| 1dc5342 (#25) | /opt/homelab wasn't mounted into komodo-periphery, breaking PullRepo and file-on-server stacks |
| 669715b | Repo pull targeted the wrong path and needed SSH periphery doesn't have |
| 0732a26 | Fixed ResourceSync "source of truth" for the homelab repo pull a second time |
| 4bad455 | Stacks weren't files_on_host, so Komodo's view of "running" didn't match docker ps |
| This session | New stacks (litellm) sat in resources.toml for days without ever being created. BatchDeployStackIfChanged only touches resources Komodo already knows about |

Why it keeps recurring: every README and rule in this repo treats resources.toml as the single source of truth, but Komodo's own webhook model (moghtech/komodo#1120) can target a Procedure, Build, Repo, or Stack, not a Sync, so nothing in the push-to-deploy path ever applied new resource declarations automatically. Manual "Execute Sync" clicks in the UI were the only path, and a skipped click (easy to do, since nothing forces it) is indistinguishable from "the sync silently didn't work."
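The "declared but never created" gap is a plain set difference, which the following sketch illustrates. It assumes resources.toml has already been parsed into a dict with a top-level list of `[[stack]]` tables that each carry a `name`, and that the list of stacks Komodo actually knows about has been fetched separately (via the API, or a direct Mongo query as this audit used). The function name is hypothetical.

```python
def never_created(resources: dict, known_to_komodo: set[str]) -> list[str]:
    """Stacks declared in resources.toml that Komodo has no record of.

    `resources` is the parsed TOML (assumed shape: {"stack": [{"name": ...}, ...]}).
    `known_to_komodo` is the set of stack names that exist in the live instance.
    """
    declared = {stack["name"] for stack in resources.get("stack", [])}
    return sorted(declared - known_to_komodo)
```

With the situation described above, `never_created({"stack": [{"name": "wazuh"}, {"name": "litellm"}]}, {"wazuh"})` would report `litellm` as declared in git but absent from Komodo.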

3. Container lifecycle / stack-specific quirks (4+ occurrences in one 36h window)

| Commit / PR | Issue |
| --- | --- |
| 7575c55 (#39) | Wazuh post-deploy 401s after deploy-infra runs |
| 6750801 → PR #40 | Wired Wazuh pre/post-deploy hooks; introduced a regression |
| PR #42 | Follow-up fix attempt for the same Wazuh hook wiring |
| PR #43 (0811ddd) | "Keep Komodo stack file paths compose-only"; reverted the part of #40 that broke stacks by adding non-compose paths to file_paths |

Why it keeps recurring: Wazuh is the one stack with pre_deploy / post_deploy hooks, making it structurally different from every other [[stack]] block. That asymmetry is exactly where a copy-paste or "helpful" addition (like non-compose paths in file_paths) slips through, because nothing validated resources.toml's structure before this audit.

Live confirmation found while writing this audit

While verifying the RunSync fix below, a direct Mongo query turned up an already-live instance of bucket 2: the live wazuh Stack resource in Komodo's database still has the pre-PR-#43 broken config — file_paths includes 7 script paths (scripts/fix-cert-permissions.sh, etc.) mixed into the compose list, and env_file_path is set to .env, a file that doesn't exist (Wazuh's real env file is compose.env). Git's resources.toml has been correct since PR #43 merged; the live Sync was simply never re-executed afterward, so Komodo has been silently unable to redeploy wazuh for as long as that gap has existed. The Wazuh containers themselves are healthy and unaffected (running 40+ hours at time of writing) — this is the same "GitOps arrow silently broken" shape as the litellm postmortem, just on a second, previously-undiscovered stack. The RunSync stage added in Fix 3 is expected to self-heal this on the next deploy-infra run after merge; verifying that is tracked in the PR's test plan.

Aggregate

An independent commit/CI-history pass (the history-mining agent) counted 24 distinct Komodo-related incidents since Phase 4, with roughly a 61% CI failure rate on Komodo-touching pushes. That is consistent with three buckets each being "fixed" repeatedly (4 to 7+ times in the tables above) instead of once.

Structural root cause

Every bucket in the taxonomy above traces back to two design properties of the current setup, both fixable without replacing Komodo:

  1. State fan-out with no reconciliation check. The one webhook secret lives in four places (SOPS, compose.env, komodo-core env, and the GitHub Actions secret) and, until this audit, nothing read them consistently. Mongo credentials likewise live in four (SOPS, compose.env, komodo-core env, and MongoDB's own user table). Every incident in bucket 1 is some subset of these disagreeing.
  2. No CI gate on the one file everything else depends on. services/komodo/resources.toml gets the same "write TOML, hope it's right, find out on the live host" treatment that inventory/ explicitly moved away from months ago (schema-validation job in lint.yml). Every incident in bucket 3, and part of bucket 2, is a resources.toml mistake that a five-line CI check would have caught before merge.

Two items from the contractor assessment (findings.md) are directly relevant and still open:

  • CA-015 (Docker socket access overused) — komodo-periphery is one of the five services with a raw /var/run/docker.sock mount. Not the cause of any incident above, but raises the blast radius of any Komodo compromise.
  • CA-016 (Komodo deploy ignores Grafana dashboard changes) — the paths-ignore in komodo-deploy.yml still excludes monitoring/grafana/**; still accurate as of this audit.

External research findings

Confirmed against upstream Komodo docs/source and general platform behavior, not homelab-specific speculation:

  • MONGO_INITDB_ROOT_PASSWORD only applies on first volume init — this is standard mongo image behavior, not a Komodo bug, and it is the exact mechanism behind every "Mongo password drift" incident above. There is no way to make Mongo "just pick up" a new compose.env password on restart; db.changeUserPassword() is mandatory for any rotation.
  • KOMODO_UI_WRITE_DISABLED is a real, documented Komodo Core env var that rejects resource mutations from the UI/API, forcing all changes through git + Sync. It defaults to unset (writes allowed), and was unset here — meaning any UI click (deliberate or accidental) could silently diverge from resources.toml with nothing to catch it. This audit sets it.
  • Komodo webhooks can target a Procedure, Build, Repo, or Stack — not a Sync directly (moghtech/komodo#1120, also documented in this repo's own resources.toml header comment). The existing deploy-infra Procedure correctly works around this for known stacks via BatchDeployStackIfChanged, but nothing applied new resource declarations — the gap this audit closes with a RunSync stage.
  • GitHub Actions repository secrets can hold an empty string with zero validation or warning. Settings → Secrets shows that the secret exists by name; it does not show whether the value is empty. Nothing about this is Komodo-specific, but it was the direct cause of the most recent CI failure in this project, and it will bite other secrets in this repo the same way if they're ever manually re-synced.
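Because the empty-secret case in the last bullet is invisible in the Settings UI, the only place to catch it is inside the job that consumes the secret. A minimal sketch of such a guard follows; the function name is hypothetical, and it assumes the secrets are exposed to the job as environment variables.

```python
import os
from typing import Iterable, Mapping, Optional


def empty_secrets(
    names: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return the names whose value is unset, empty, or whitespace-only."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name, "").strip()]
```

A CI step could call `empty_secrets(["KOMODO_WEBHOOK_SECRET"])` first and fail with the offending names if the list is non-empty, turning a silent empty value into an explicit error on the first run.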

What changed in this audit

Concrete, merged fixes (not just recommendations) from this session:

Fix 1: Stop depending on a GitHub Secret for a value the runner already has on disk

scripts/trigger-komodo-deploy.py now reads KOMODO_WEBHOOK_SECRET from /opt/homelab/services/komodo/compose.env first, and only falls back to the $KOMODO_WEBHOOK_SECRET environment variable (the GitHub Actions secret) if that file doesn't exist. The self-hosted runner is infra-services — it already has the authoritative value on disk. This permanently eliminates the "empty/stale GitHub secret" failure class for the automated deploy path; the GitHub secret remains only as a fallback for manual/portable triggering.
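The lookup order can be sketched as below. This is an illustration, not the contents of scripts/trigger-komodo-deploy.py: the function names and the simple KEY=value parser are assumptions, and the explicit empty-value check is an addition motivated by the empty GitHub secret described earlier.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

COMPOSE_ENV = Path("/opt/homelab/services/komodo/compose.env")
KEY = "KOMODO_WEBHOOK_SECRET"


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_webhook_secret(
    env_file: Path = COMPOSE_ENV,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Prefer the on-disk value; use the environment only if the file is absent."""
    environ = os.environ if environ is None else environ
    if env_file.exists():
        secret = read_env_file(env_file).get(KEY, "")
        source = str(env_file)
    else:
        secret = environ.get(KEY, "")
        source = f"${KEY}"
    if not secret:
        # Fail loudly instead of sending an empty secret to the webhook.
        raise RuntimeError(f"{KEY} is missing or empty in {source}")
    return secret
```

Note that the fallback triggers only when the file does not exist, matching the description above; a file that exists but lacks the key is treated as an error rather than silently falling through to the environment.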

Fix 2: KOMODO_UI_WRITE_DISABLED=true

Set directly in services/komodo/compose.yml (not a secret, so it's visible in every diff, not routed through SOPS/render). Any future UI mutation now fails loudly instead of silently drifting from resources.toml.

Fix 3: RunSync stage in deploy-infra

Added a stage between PullRepo and BatchDeployStackIfChanged that runs RunSync against the infra-services-01 sync. New resources.toml declarations (a new stack, a changed file_paths, a new procedure) now apply automatically on the next push instead of requiring a manual "Execute" click in the UI that's easy to forget.
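The ordering this fix relies on (pull, then sync, then deploy) is an invariant that could itself be checked. The sketch below is one hypothetical way to do so; it assumes the execution types of the deploy-infra stages have already been extracted into a flat list in stage order, and the function name is not part of this repo.

```python
REQUIRED_ORDER = ("PullRepo", "RunSync", "BatchDeployStackIfChanged")


def stage_order_problems(execution_types: list[str]) -> list[str]:
    """Check that the three executions exist and appear in the required order."""
    problems = [f"missing stage: {t}" for t in REQUIRED_ORDER if t not in execution_types]
    if problems:
        return problems
    # Position of the first occurrence of each required execution type.
    positions = [execution_types.index(t) for t in REQUIRED_ORDER]
    if positions != sorted(positions):
        problems.append("stages out of order: expected " + " -> ".join(REQUIRED_ORDER))
    return problems
```

Running RunSync after the deploy stage would recreate the original gap for one push, since a newly declared stack would not yet exist when BatchDeployStackIfChanged runs.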

Fix 4: Pre-flight auth check before secret rotation

scripts/rotate-komodo-secrets.sh now verifies the current compose.env Mongo password actually authenticates against the live komodo-mongo container before attempting changeUserPassword. If it doesn't (i.e. drift already happened), the script aborts with a pointer to fix-komodo-mongo-auth.sh instead of compounding the drift.
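The actual script is shell; the sketch below expresses the same pre-flight idea in Python for illustration only. It assumes mongosh is available inside the komodo-mongo container and that the user authenticates against the admin database; the user name is whatever the deployment uses, and the function names are hypothetical. The command runner is injectable so the decision logic can be exercised without Docker.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def auth_check_command(container: str, user: str, password: str) -> list[str]:
    """Build a docker exec call that only succeeds if mongosh can authenticate."""
    # Caveat: a password passed as an argument is visible in the process list.
    return [
        "docker", "exec", container,
        "mongosh", "--quiet",
        "--username", user,
        "--password", password,
        "--authenticationDatabase", "admin",
        "--eval", "db.runCommand({ ping: 1 })",
    ]


def run_quietly(cmd: Sequence[str]) -> int:
    """Run the command, discard its output, and return the exit code."""
    return subprocess.run(cmd, capture_output=True).returncode


def password_still_works(
    container: str, user: str, password: str, runner: Runner = run_quietly
) -> bool:
    return runner(auth_check_command(container, user, password)) == 0


def preflight_or_abort(
    container: str, user: str, password: str, runner: Runner = run_quietly
) -> None:
    """Stop before rotation if the assumed current password no longer authenticates."""
    if not password_still_works(container, user, password, runner):
        raise SystemExit(
            "compose.env Mongo password does not authenticate; "
            "run fix-komodo-mongo-auth.sh before rotating"
        )
```

The point of the design is ordering: the check runs before any password is changed, so a drifted state is reported rather than made worse.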

Fix 5: CI validation for resources.toml

New scripts/validate-komodo-resources.py, wired into lint.yml as komodo-resources-check. Validates TOML syntax, that every run_directory and file_paths entry actually exists in the repo, that file_paths only contains compose-looking files (the exact PR #40 regression class), and that pre_deploy/post_deploy scripts exist. This is the same pattern this repo already uses for inventory/ (schema-validation job) applied to the one other declarative config file that had none.
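A reduced sketch of the file_paths and run_directory checks is shown below. It is not the contents of scripts/validate-komodo-resources.py. It assumes each `[[stack]]` has a `name` and a `config` table with `run_directory` and `file_paths`, that run_directory resolves relative to the repo root, and that "compose-looking" means a .yml or .yaml suffix. Hook-script checks are omitted.

```python
from pathlib import Path

COMPOSE_SUFFIXES = (".yml", ".yaml")  # assumption: what counts as compose-looking


def load_resources(path: Path) -> dict:
    """Parse resources.toml; a syntax error raises tomllib.TOMLDecodeError."""
    import tomllib  # standard library from Python 3.11

    with path.open("rb") as handle:
        return tomllib.load(handle)


def validate_stacks(resources: dict, repo_root: Path) -> list[str]:
    """Return one message per problem; an empty list means the stacks look sane."""
    errors: list[str] = []
    for stack in resources.get("stack", []):
        name = stack.get("name", "<unnamed>")
        config = stack.get("config", {})
        run_dir = repo_root / config.get("run_directory", ".")
        if not run_dir.is_dir():
            errors.append(f"{name}: run_directory {run_dir} does not exist")
            continue  # file checks are meaningless without the directory
        for rel in config.get("file_paths", []):
            if not rel.endswith(COMPOSE_SUFFIXES):
                # The PR #40 regression class: scripts mixed into the compose list.
                errors.append(f"{name}: file_paths entry {rel!r} is not a compose file")
            elif not (run_dir / rel).is_file():
                errors.append(f"{name}: file_paths entry {rel!r} not found in {run_dir}")
    return errors
```

A CI job would call `validate_stacks(load_resources(path), repo_root)` and fail the build if the returned list is non-empty.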

Fix 6: Live incident resolved

The 5d21h-blocked git pull and the resulting komodo-core Mongo auth crash loop found during the live audit were fixed in real time — see the postmortem for the full timeline, root cause, and action items (several of which are the fixes above).

Path forward

P0: Stop the bleeding (owner action required)

| Item | Why it's P0 | Owner |
| --- | --- | --- |
| Configure a Komodo Alerter (Discord/ntfy/Slack) for Procedure and Stack failures | The 5-day outage was invisible without a human opening the UI. Every fix above still requires something to notice a failure eventually. Alerter creation isn't exposed via TOML in the version deployed here; it needs one UI click. | Owner |
| Confirm KOMODO_UI_WRITE_DISABLED=true didn't break any workflow that assumed UI writes worked (e.g. manual stack edits) | New behavior change, worth a quick sanity pass after deploy | Owner (verify) |

P1: Close the structural gaps (agent-executable, tracked as follow-up work)

| Item | Why | Owner |
| --- | --- | --- |
| Register Phoenix (LXC 124) as a full Komodo-managed Server + install periphery | Phoenix is a host-class entity per homelab-project.mdc and should get full Ansible + Komodo management, not remain outside GitOps | Agent |
| Declare the litellm and phoenix stacks' full lifecycle (not just resources.toml presence) once Phoenix has a Server | Currently litellm is declared but was never RunSync'd before this audit; phoenix isn't declared at all pending its own Komodo Server | Agent |
| Host-level drift metric: alert if /opt/homelab has untracked files matching an incoming main diff before the next deploy | Directly prevents a repeat of the 2026-07-03 postmortem | Agent (design), Owner (approve alerting channel) |
| Extend validate-komodo-resources.py to check for duplicate/orphaned [[procedure]] and [[repo]] blocks as the file grows | Prevention scales with the file; current checks cover today's incident classes | Agent |
| CA-015: introduce docker-socket-proxy for komodo-periphery (and the other 4 services) | Reduces blast radius; not urgency-linked to reliability but tracked here since Komodo is in scope | Agent (implementation), Owner (approve) |
| CA-016: decide whether Grafana dashboard changes should redeploy monitoring stack, or document why not | Still open from 2026-06-28 contractor assessment | Owner (decision), Agent (implement) |
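The host-level drift item in the table above reduces to an intersection between two path lists. The sketch below is one possible shape for it, not an agreed design: the function names are hypothetical, and it assumes origin/main has already been fetched on the host.

```python
import subprocess
from typing import Iterable


def git_lines(repo: str, *args: str) -> list[str]:
    """Run a git command inside `repo` and return its non-empty output lines."""
    result = subprocess.run(
        ["git", "-C", repo, *args], check=True, capture_output=True, text=True
    )
    return [line for line in result.stdout.splitlines() if line]


def blocking_paths(untracked: Iterable[str], incoming: Iterable[str]) -> list[str]:
    """Untracked host files that the incoming diff also touches."""
    return sorted(set(untracked) & set(incoming))


def check_host(repo: str = "/opt/homelab") -> list[str]:
    """Compare untracked files on the host against what origin/main would change."""
    untracked = git_lines(repo, "ls-files", "--others", "--exclude-standard")
    incoming = git_lines(repo, "diff", "--name-only", "HEAD", "origin/main")
    return blocking_paths(untracked, incoming)
```

A non-empty result from `check_host()` would be the condition to alert on before the next deploy attempts its pull.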

P2: Longer-term architecture questions (owner decision needed before agent work)

| Item | Trade-off |
| --- | --- |
| Should Komodo secrets collapse to fewer stores (e.g. Core reads SOPS directly instead of via rendered compose.env)? | Fewer moving parts vs. losing the "plain env file for docker compose --env-file" simplicity Komodo's compose model expects |
| Should the webhook secret rotate on a schedule (forcing the reconciliation muscle to be exercised regularly) instead of only on manual rotation? | Catches drift proactively vs. adds a recurring task that itself needs monitoring |
| Should Komodo's own MongoDB back up on a schedule independent of Proxmox/vzdump, given how central Komodo is to GitOps? | Not currently in any backup runbook; worth a decision, not urgent given resources.toml is git-recoverable |