Komodo reliability audit (2026-07-03)
TL;DR
Owner frustration with recurring Komodo problems triggered a four-agent
parallel audit (history, live state, structural review, external research).
Findings converged on two structural root causes — secret/state fan-out
across up to four stores, and zero CI validation on resources.toml — behind
three incident buckets that had each been "fixed" independently 5-7+ times.
A live, previously-undetected 5d21h deploy-pipeline outage was found and
fixed during the audit itself.
What shipped
- First formal Komodo postmortem: `2026-07-03-komodo-deploy-pipeline-outage.md`, covering an untracked-file `git pull` conflict and Mongo credential drift, both resolved live.
- Comprehensive audit doc: `komodo-reliability-audit-2026-07-03.md`, with the full incident taxonomy, structural root cause, external-research findings, and a P0/P1/P2 path forward with explicit owner-vs-agent ownership.
- Five concrete fixes merged, not just recommended:
  - `scripts/trigger-komodo-deploy.py` reads the webhook secret from the host's `compose.env` directly (the runner IS `infra-services`) instead of depending on a GitHub Actions secret staying in sync. This eliminates the exact failure class that caused the most recent CI break.
  - `KOMODO_UI_WRITE_DISABLED=true` set in `services/komodo/compose.yml`: Komodo's own built-in guardrail against UI/`resources.toml` drift, previously unset.
  - A `RunSync` stage added to the `deploy-infra` Procedure, before `BatchDeployStackIfChanged`. New `resources.toml` stacks now apply automatically instead of needing a manual "Execute Sync" click that's easy to forget (this is why `litellm` sat undeployed for days).
  - `scripts/rotate-komodo-secrets.sh` now pre-flight-checks that the current Mongo password actually authenticates before attempting `changeUserPassword`, instead of assuming `compose.env` is live-accurate.
  - New `scripts/validate-komodo-resources.py`, wired into `lint.yml` as `komodo-resources-check`. It validates `resources.toml` TOML syntax, path/file existence, and compose-only `file_paths`, mirroring the existing `schema-validation` pattern for `inventory/`.
- `docs/runbooks/komodo-github-webhook.md` updated with the new secret resolution order and an "Untracked file conflict" troubleshooting section.
Phase status
Not a PLAN.md phase boundary — this is an operational-reliability
initiative layered on top of already-complete Phase 4/5 Komodo deployment.
Tracking is entirely in the audit doc's P0/P1/P2 table now.
Known gaps / drift
- P0, owner action required: No Komodo Alerter is configured. The 5-day outage was invisible without a human opening the Komodo UI; every fix in this session still assumes something eventually notices a failure. Alerter creation isn't exposed via `resources.toml` in the deployed version; needs a UI action.
- P1, tracked as follow-up: Phoenix (LXC 124) isn't yet a Komodo-managed `Server`. It's a `host`-class entity per policy but currently sits outside GitOps. CA-015 (Docker socket exposure) and CA-016 (Grafana dashboard changes don't trigger Komodo deploy) remain open from the 2026-06-28 contractor assessment.
- P2, owner decision needed: whether to collapse Komodo's secret stores further, schedule webhook-secret rotation proactively, and whether Komodo's MongoDB needs an independent backup schedule.
What remains
| Item | Owner | Priority |
|---|---|---|
| Configure Komodo Alerter (Discord/ntfy/Slack) | Owner | P0 |
| Verify `KOMODO_UI_WRITE_DISABLED` didn't break an assumed UI workflow | Owner | P0 |
| Register Phoenix as full Komodo-managed Server + periphery | Agent | P1 |
| Host-level untracked-file drift metric before next deploy | Agent (design), Owner (approve alert channel) | P1 |
| CA-015 docker-socket-proxy, CA-016 Grafana redeploy decision | Agent (impl), Owner (approve/decide) | P1 |
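
One possible shape for the host-level untracked-file drift metric listed in the table: count untracked paths in the deploy checkout before a pull runs. This is a sketch only. The function names and the choice to report a plain count are assumptions, not a design the audit settled on. It relies on `git status --porcelain`, which prefixes untracked entries with `??`.

```python
import subprocess


def count_untracked(porcelain_output: str) -> int:
    """Count untracked entries in `git status --porcelain` output.

    An untracked directory is reported by git as a single `?? dir/` entry,
    so this counts entries, not individual files.
    """
    return sum(1 for line in porcelain_output.splitlines() if line.startswith("?? "))


def untracked_in_repo(repo_dir: str) -> int:
    """Run git against repo_dir (a hypothetical deploy checkout path)."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "status", "--porcelain"],
        capture_output=True,
        text=True,
        check=True,  # raise if git itself fails, rather than reporting zero
    )
    return count_untracked(result.stdout)
```

A non-zero count before a deploy would be the signal worth alerting on, since an untracked file is what blocked `git pull` in the outage described above.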
Mermaid: incident taxonomy → root cause → fix
```mermaid
flowchart TD
    A1["Secret drift (7+x)<br/>SOPS / compose.env / container / Mongo / GH secret"] --> R1["No reconciliation across secret stores"]
    A2["ResourceSync drift (7+x)<br/>new stacks never applied"] --> R2["Webhook can't target Sync directly (#1120);<br/>no auto-apply path existed"]
    A3["Container lifecycle quirks (4+x in 36h)<br/>Wazuh pre/post-deploy hooks"] --> R3["No CI validation on resources.toml"]
    R1 --> F1["Fix: read webhook secret from host<br/>compose.env; rotate script preflight check"]
    R2 --> F2["Fix: RunSync stage in deploy-infra procedure"]
    R3 --> F3["Fix: validate-komodo-resources.py in CI"]
    F1 --> G["KOMODO_UI_WRITE_DISABLED=true<br/>(closes the drift source, not just symptoms)"]
    F2 --> G
    F3 --> G
```
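
The F1 fix prefers the host's `compose.env` over a CI-provided secret. A minimal sketch of that resolution order, assuming a hypothetical key name `KOMODO_WEBHOOK_SECRET`, a simple `KEY=value` file format, and an environment-variable fallback; the actual order is documented in the runbook and the logic in `scripts/trigger-komodo-deploy.py` may differ.

```python
import os
from pathlib import Path

SECRET_KEY = "KOMODO_WEBHOOK_SECRET"  # hypothetical name, not taken from the repo


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_webhook_secret(env_file: Path, environ=os.environ) -> str:
    """Prefer the host's compose.env; fall back to the process environment."""
    if env_file.is_file():
        secret = parse_env_file(env_file.read_text()).get(SECRET_KEY)
        if secret:
            return secret
    secret = environ.get(SECRET_KEY)  # assumed fallback, e.g. a CI-provided secret
    if not secret:
        raise RuntimeError(f"{SECRET_KEY} not found in {env_file} or environment")
    return secret
```

Reading the file Komodo itself consumes first means there is one fewer copy of the secret that has to be kept in sync by hand.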