Komodo reliability audit (2026-07-03)

TL;DR

Owner frustration with recurring Komodo problems triggered a four-agent parallel audit (history, live state, structural review, external research). Findings converged on two structural root causes behind three incident buckets that had each been "fixed" independently 5-7+ times: secret/state fan-out across up to four stores, and zero CI validation on resources.toml. A live, previously undetected deploy-pipeline outage (5d21h) was found and fixed during the audit itself.

What shipped

  • First formal Komodo postmortem: 2026-07-03-komodo-deploy-pipeline-outage.md — untracked-file git pull conflict + Mongo credential drift, both resolved live.
  • Comprehensive audit doc: komodo-reliability-audit-2026-07-03.md — full incident taxonomy, structural root cause, external-research findings, and a P0/P1/P2 path forward with explicit owner-vs-agent ownership.
  • Five concrete fixes merged, not just recommended:
      • scripts/trigger-komodo-deploy.py reads the webhook secret from the host's compose.env directly (the runner IS infra-services) instead of depending on a GitHub Actions secret staying in sync. This eliminates the exact failure class that caused the most recent CI break.
      • KOMODO_UI_WRITE_DISABLED=true is now set in services/komodo/compose.yml. This is Komodo's own built-in guardrail against UI/resources.toml drift and was previously unset.
      • A RunSync stage was added to the deploy-infra Procedure, before BatchDeployStackIfChanged. New resources.toml stacks now apply automatically instead of needing a manual "Execute Sync" click that is easy to forget (this is why litellm sat undeployed for days).
      • scripts/rotate-komodo-secrets.sh now pre-flight-checks that the current Mongo password actually authenticates before attempting changeUserPassword, instead of assuming compose.env is live-accurate.
      • New scripts/validate-komodo-resources.py, wired into lint.yml as komodo-resources-check, validates resources.toml TOML syntax, path/file existence, and compose-only file_paths, mirroring the existing schema-validation pattern for inventory/.
  • docs/runbooks/komodo-github-webhook.md updated with the new secret resolution order and an "Untracked file conflict" troubleshooting section.

Phase status

Not a PLAN.md phase boundary — this is an operational-reliability initiative layered on top of already-complete Phase 4/5 Komodo deployment. Tracking is entirely in the audit doc's P0/P1/P2 table now.

Known gaps / drift

  • P0, owner action required: No Komodo Alerter is configured. The 5-day outage was invisible without a human opening the Komodo UI — every fix in this session still assumes something eventually notices a failure. Alerter creation isn't exposed via resources.toml in the deployed version; needs a UI action.
  • P1, tracked as follow-up: Phoenix (LXC 124) isn't yet a Komodo-managed Server — it's a host-class entity per policy but currently sits outside GitOps. CA-015 (Docker socket exposure) and CA-016 (Grafana dashboard changes don't trigger Komodo deploy) remain open from the 2026-06-28 contractor assessment.
  • P2, owner decision needed: whether to collapse Komodo's secret stores further, schedule webhook-secret rotation proactively, and whether Komodo's MongoDB needs an independent backup schedule.

What remains

| Item | Owner | Priority |
| --- | --- | --- |
| Configure Komodo Alerter (Discord/ntfy/Slack) | Owner | P0 |
| Verify KOMODO_UI_WRITE_DISABLED didn't break an assumed UI workflow | Owner | P0 |
| Register Phoenix as full Komodo-managed Server + periphery | Agent | P1 |
| Host-level untracked-file drift metric before next deploy | Agent (design), Owner (approve alert channel) | P1 |
| CA-015 docker-socket-proxy, CA-016 Grafana redeploy decision | Agent (impl), Owner (approve/decide) | P1 |
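
For the untracked-file drift item, one possible starting point is to count untracked paths in the host checkout before a deploy runs git pull. The sketch below is an assumption about how such a metric could be gathered, not an agreed design. The function names are hypothetical. It relies only on the `??` prefix that `git status --porcelain` uses for untracked entries.

```python
import subprocess


def count_untracked(porcelain_output: str) -> int:
    """Count untracked entries in `git status --porcelain` output.

    In the porcelain v1 format, untracked paths are reported as '?? <path>'.
    """
    return sum(1 for line in porcelain_output.splitlines() if line.startswith("?? "))


def untracked_in_checkout(repo_dir: str) -> int:
    """Run git in repo_dir and return the number of untracked files there."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "status", "--porcelain", "--untracked-files=all"],
        capture_output=True,
        text=True,
        check=True,  # raise if git itself fails, rather than reporting zero drift
    )
    return count_untracked(result.stdout)
```

A non-zero count before a deploy would be the signal to alert on, since an untracked file that collides with an incoming tracked file is what blocks git pull. Where that number gets exported (and which alert channel it feeds) is left open, matching the owner-approval note in the table.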

Mermaid: incident taxonomy → root cause → fix

flowchart TD
    A1["Secret drift (7+x)<br/>SOPS / compose.env / container / Mongo / GH secret"] --> R1["No reconciliation across secret stores"]
    A2["ResourceSync drift (7+x)<br/>new stacks never applied"] --> R2["Webhook can't target Sync directly (#1120);<br/>no auto-apply path existed"]
    A3["Container lifecycle quirks (4+x in 36h)<br/>Wazuh pre/post-deploy hooks"] --> R3["No CI validation on resources.toml"]

    R1 --> F1["Fix: read webhook secret from host<br/>compose.env; rotate script preflight check"]
    R2 --> F2["Fix: RunSync stage in deploy-infra procedure"]
    R3 --> F3["Fix: validate-komodo-resources.py in CI"]

    F1 --> G["KOMODO_UI_WRITE_DISABLED=true<br/>(closes the drift source, not just symptoms)"]
    F2 --> G
    F3 --> G
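
The F1 fix (read the webhook secret from the host's compose.env rather than relying on a synced CI secret) can be sketched as follows. This is an illustration under stated assumptions, not the contents of trigger-komodo-deploy.py: the key name KOMODO_WEBHOOK_SECRET is hypothetical, and the order shown (host file first, process environment as fallback) is a guess at the resolution order documented in the runbook.

```python
import os
from pathlib import Path


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: dict[str, str] = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_webhook_secret(env_file: Path, key: str = "KOMODO_WEBHOOK_SECRET", environ=None) -> str:
    """Prefer the host's env file; fall back to the process environment.

    The key name and the fallback order are assumptions for illustration.
    """
    environ = os.environ if environ is None else environ
    if env_file.is_file():
        value = read_env_file(env_file).get(key)
        if value:
            return value
    value = environ.get(key)
    if value:
        return value
    raise RuntimeError(f"{key} not found in {env_file} or the environment")
```

Reading the file on the runner means a secret rotated on the host takes effect without a second copy needing an update. Raising when both sources are empty makes a misconfiguration fail loudly instead of sending an unsigned request.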