Postmortem: Komodo deploy pipeline outage (2026-07-03)

Status: Resolved
Duration: ~5 days 21 hours of blocked deploys (untracked-file conflict), plus a same-day komodo-core crash loop (Mongo auth) discovered during the fix.
Impact: Every push to main that should have triggered a Komodo deploy during the window silently failed to update /opt/homelab on infra-services. New services (litellm, phoenix) and a Wazuh config fix never reached the host. No customer-facing outage: the drift was between git main and the deployed state, not a service crash, until the Mongo auth issue below.

This is the first formal postmortem for a Komodo incident despite the project having accumulated more than a dozen Komodo-related fix commits since Phase 4. See Komodo reliability audit for the broader pattern this incident belongs to.

Timeline

  • Ongoing (~5d21h before detection): git pull inside the deploy-infra Procedure's PullRepo stage began failing on infra-services because /opt/homelab had untracked files/directories (services/litellm/, services/phoenix/, monitoring/grafana/dashboards/litellm.json, services/traefik/config/dynamic/phoenix.yml) that collided with the same paths being added on main. Git refuses a fast-forward pull when it would overwrite untracked local files, so every subsequent webhook-triggered pull failed the same way. Komodo's UI showed this as a stuck/pending repo state.
  • 2026-07-03, during a live audit: A subagent dispatched to audit live Komodo state found the pull failure and reported it as the active root blocker for all GitOps deploys.
  • Diagnosis: Confirmed via SSH that every conflicting untracked path was byte-identical to its origin/main counterpart. This suggests that either someone had manually copied the files onto the host outside of git at some point, or a prior partial deploy had left them on disk without git tracking them.
  • Fix attempt 1: Removed the untracked paths with rm -rf and re-ran git pull origin main. The pull succeeded.
  • Regression introduced by the fix: The rm -rf also deleted the gitignored, host-local services/litellm/.env (it lived inside the services/litellm/ directory being cleaned up). This file is never in git — it's the live secrets file for the litellm container (LITELLM_MASTER_KEY, POSTGRES_PASSWORD, API keys) — so git pull could not restore it.
  • Recovery: Extracted LITELLM_MASTER_KEY and POSTGRES_PASSWORD from the still-running litellm and litellm-db containers via docker inspect (the container environment held the last-known-good values even though the file on disk was gone), and recreated services/litellm/.env on the host with 0600 permissions.
  • Second failure discovered: With the pull unblocked, forcing a fresh Komodo deploy revealed that komodo-core was crash-looping with the error "SCRAM failure: Authentication failed" when connecting to komodo-mongo. The MongoDB root password in effect for komodo-mongo (set once at first volume init via MONGO_INITDB_ROOT_PASSWORD) had drifted from the value currently in services/komodo/compose.env.
  • Fix: Used scripts/fix-komodo-mongo-auth.sh with the previous Mongo password (recovered via docker inspect komodo-mongo) to run db.changeUserPassword() and reconcile Mongo to match compose.env, then force-recreated core and periphery.
  • Verification: Triggered a fresh komodo-deploy.yml run end to end — green. git status on /opt/homelab clean and up to date with origin/main.
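
The Recovery step above can be sketched in Python. This is a minimal illustration, not the procedure actually run during the incident: it assumes the output of `docker inspect <container>` has been captured as a JSON string, and the helper names `env_from_inspect` and `write_env_file` are hypothetical. Values are written verbatim, so anything that needs quoting for the compose env-file parser would have to be handled separately.

```python
import json
import os


def env_from_inspect(inspect_json: str, wanted: list[str]) -> dict[str, str]:
    """Pick selected variables out of `docker inspect <container>` JSON output."""
    containers = json.loads(inspect_json)
    # docker inspect prints a JSON array; Config.Env is a list of "KEY=VALUE" strings.
    env_list = containers[0]["Config"]["Env"] or []
    found: dict[str, str] = {}
    for entry in env_list:
        key, _, value = entry.partition("=")  # split on the first "=" only
        if key in wanted:
            found[key] = value
    return found


def write_env_file(path: str, values: dict[str, str]) -> None:
    """Create an env file that only its owner can read or write (mode 0600)."""
    # O_EXCL makes this fail instead of silently overwriting an existing file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        for key, value in values.items():
            handle.write(f"{key}={value}\n")
    os.chmod(path, 0o600)  # explicit, in case the umask altered the mode
```

The container environment only holds the values the container was created with, so this recovers the last deployed secrets, not necessarily every key that was in the lost file.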

Root cause

Two independent, previously-known failure classes compounded:

  1. No detection for a stuck PullRepo stage. Komodo has no configured Alerter, so a Procedure stage that fails on every run for days produces no notification anywhere; it is only visible if a human opens the Komodo UI. The trigger itself, untracked files reaching /opt/homelab outside of a clean git pull (most likely a manual docker compose test or an interrupted earlier deploy), is exactly the kind of one-off drift that a host checkout under active GitOps will eventually see.
  2. MongoDB root credentials and compose.env are two separate stores with no reconciliation. MONGO_INITDB_ROOT_PASSWORD only takes effect on first volume init (documented in fix-komodo-mongo-auth.sh itself, added 2026-06-26 after an earlier occurrence of this same drift). Any path that updates compose.env's password without also calling db.changeUserPassword() — including a manually edited SOPS file, a partially-run rotation, or restoring compose.env from an older backup — leaves Mongo and the env file disagreeing, and komodo-core crash-loops until someone notices and runs the rescue script.

Both are cataloged as recurring incident classes, not one-offs — see the reliability audit.
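
The second failure class is cheap to test for before it causes a crash loop. The sketch below compares the password in an env file with the MONGO_INITDB_ROOT_PASSWORD value recorded in `docker inspect komodo-mongo` output. It is an illustration under assumptions: the function names are hypothetical, `file_key` is a parameter because the variable name used in compose.env is not stated here, and the parser assumes plain unquoted KEY=VALUE lines. It never prints either secret.

```python
import json


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def container_env(inspect_json: str) -> dict[str, str]:
    """Turn Config.Env from `docker inspect` JSON into a dict."""
    env_list = json.loads(inspect_json)[0]["Config"]["Env"] or []
    result: dict[str, str] = {}
    for entry in env_list:
        key, _, value = entry.partition("=")
        result[key] = value
    return result


def password_drifted(
    env_file_text: str,
    inspect_json: str,
    file_key: str,
    container_key: str = "MONGO_INITDB_ROOT_PASSWORD",
) -> bool:
    """Return True when the env file and the container disagree."""
    file_values = parse_env_file(env_file_text)
    live_values = container_env(inspect_json)
    if file_key not in file_values or container_key not in live_values:
        raise ValueError("password variable missing on one side")
    return file_values[file_key] != live_values[container_key]
```

A mismatch here is only a hint. The container environment reflects what the container was created with, while the password Mongo actually enforces lives in the data volume, so a real pre-flight check would still attempt an authenticated connection.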

What went well

  • The byte-identical check before deleting untracked files prevented actual data loss on the tracked-content side.
  • fix-komodo-mongo-auth.sh (written after the 2026-06-26 occurrence of the same Mongo drift) worked exactly as designed on the second occurrence — the runbook investment paid off.
  • Recovering .env values from the running container's live environment avoided a LiteLLM outage on top of the Komodo one.
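
One way to run a byte-identical check like the one credited above is to compare git blob IDs instead of diffing file contents. This is a sketch of that idea, not necessarily how the check was done during the incident. It assumes a SHA-1 repository with no clean/smudge or line-ending filters on the paths involved, and that the tracked blob ID was obtained separately, for example with `git rev-parse origin/main:<path>`. The function names are hypothetical.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Object ID git assigns to a blob with this content (SHA-1 repositories)."""
    # Git hashes the header "blob <size>\0" followed by the raw file bytes.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def is_byte_identical(local_bytes: bytes, tracked_blob_id: str) -> bool:
    """True when the on-disk bytes hash to the blob ID recorded on the branch."""
    return git_blob_id(local_bytes) == tracked_blob_id
```

Only paths that pass this comparison are safe to delete and let git pull restore. It says nothing about gitignored files that sit next to them.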

What went poorly

  • The outage was only found because an owner-directed audit happened to include a live-state check. Nothing in CI, Komodo, or monitoring would have surfaced it otherwise, which is the very pattern the audit was asked to eliminate.
  • The fix itself (rm -rf on untracked paths) caused a second, avoidable regression (deleting .env) because the untracked-file check didn't first distinguish "byte-identical tracked content" from "gitignored host state that happens to live in the same directory."
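
The missing distinction can be checked mechanically before any cleanup. The sketch below scans the output of `git status --porcelain --ignored`, where `??` marks untracked paths and `!!` marks ignored ones, and lists every gitignored path inside the directory about to be deleted. The function name is hypothetical, and the parser assumes paths without spaces or special characters (git quotes those in porcelain output, and `-z` would be the more robust input format).

```python
def ignored_paths_under(porcelain: str, target: str) -> list[str]:
    """List gitignored paths at or below `target`.

    `porcelain` is the text printed by `git status --porcelain --ignored`.
    """
    root = target.rstrip("/")
    prefix = root + "/"
    hits: list[str] = []
    for line in porcelain.splitlines():
        # Porcelain v1 lines are "XY <path>": two status characters, a space, the path.
        status, path = line[:2], line[3:]
        if status != "!!":
            continue
        if path.startswith(prefix) or path.rstrip("/") == root:
            hits.append(path)
    return hits
```

A non-empty result for services/litellm would have flagged services/litellm/.env as host state that needed a backup before the rm -rf.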

Action items

| Action | Owner | Status |
|---|---|---|
| Add a Komodo Alerter (Slack/Discord/ntfy) for Procedure/Stack failures | Owner (Komodo UI/API access) | Tracked: audit P0 |
| Pre-flight Mongo auth check before any password rotation | Agent | Done: scripts/rotate-komodo-secrets.sh now verifies the live password before calling changeUserPassword |
| RunSync stage before BatchDeployStackIfChanged in deploy-infra so new stacks apply automatically | Agent | Done: services/komodo/resources.toml |
| Runbook/checklist step: never rm -rf an untracked directory on /opt/homelab without first checking for gitignored files inside it | Agent | Done: see komodo-github-webhook.md |
| Recurring host git status drift check (untracked files outside .gitignore) surfaced as a metric/alert | Owner + Agent | Tracked: audit P1 |