Postmortem: Komodo deploy pipeline outage (2026-07-03)
Status: Resolved
Duration: ~5 days 21 hours of blocked deploys (untracked-file conflict) plus
a same-day komodo-core crash loop (Mongo auth) discovered during the fix.
Impact: Every push to main that should have triggered a Komodo deploy
during the window silently failed to update /opt/homelab on
infra-services. New services (litellm, phoenix) and a Wazuh config fix
never reached the host. No customer-facing outage — the drift was between
git main and the deployed state, not a service crash, until the Mongo
auth issue below.
This is the first formal postmortem for a Komodo incident despite the project having accumulated more than a dozen Komodo-related fix commits since Phase 4. See Komodo reliability audit for the broader pattern this incident belongs to.
Timeline
- Ongoing (~5d21h before detection): `git pull` inside the `deploy-infra` Procedure's `PullRepo` stage began failing on `infra-services` because `/opt/homelab` had untracked files/directories (`services/litellm/`, `services/phoenix/`, `monitoring/grafana/dashboards/litellm.json`, `services/traefik/config/dynamic/phoenix.yml`) that collided with the same paths being added on `main`. Git refuses a fast-forward pull when it would overwrite untracked local files, so every subsequent webhook-triggered pull failed the same way. Komodo's UI showed this as a stuck/pending repo state.
- 2026-07-03, during a live audit (this session): A dispatched subagent auditing live Komodo state found the pull failure and reported it as the active root blocker for all GitOps deploys.
- Diagnosis: Confirmed via SSH that every conflicting untracked path was byte-identical to its `origin/main` counterpart (i.e. someone had manually copied the files onto the host outside of git at some point, or a prior partial deploy left them uncommitted-but-present).
- Fix attempt 1: Removed the untracked paths with `rm -rf` and re-ran `git pull origin main`. The pull succeeded.
- Regression introduced by the fix: The `rm -rf` also deleted the gitignored, host-local `services/litellm/.env` (it lived inside the `services/litellm/` directory being cleaned up). This file is never in git: it is the live secrets file for the `litellm` container (`LITELLM_MASTER_KEY`, `POSTGRES_PASSWORD`, API keys), so `git pull` could not restore it.
- Recovery: Extracted `LITELLM_MASTER_KEY` and `POSTGRES_PASSWORD` from the still-running `litellm` and `litellm-db` containers via `docker inspect` (the container environment held the last-known-good values even though the file on disk was gone), and recreated `services/litellm/.env` on the host with `0600` permissions.
- Second failure discovered: With the pull unblocked, forcing a fresh Komodo deploy surfaced that `komodo-core` was crash-looping with `SCRAM failure: Authentication failed` against `komodo-mongo`. The MongoDB root password baked into the `komodo-mongo` container (set once at first volume init via `MONGO_INITDB_ROOT_PASSWORD`) had drifted from the value currently in `services/komodo/compose.env`.
- Fix: Used `scripts/fix-komodo-mongo-auth.sh` with the previous Mongo password (recovered via `docker inspect komodo-mongo`) to run `db.changeUserPassword()` and reconcile Mongo to match `compose.env`, then force-recreated `core` and `periphery`.
- Verification: Triggered a fresh `komodo-deploy.yml` run end to end: green. `git status` on `/opt/homelab` was clean and up to date with `origin/main`.
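
The regression in fix attempt 1 came from treating everything under an untracked directory as disposable. As a minimal sketch (not the project's actual tooling), the check below parses the text produced by `git status --porcelain --ignored --untracked-files=all`, where `??` marks untracked paths and `!!` marks ignored ones, and lists the ignored files that an `rm -rf` of the given directories would destroy. The function name, the `compose.yaml` entries, and `services/example/.env` are hypothetical, and the parser assumes paths without spaces or quoting.

```python
def ignored_files_at_risk(porcelain_output, dirs_to_remove):
    """Return ignored paths that live under any directory slated for removal.

    porcelain_output: text from
        git status --porcelain --ignored --untracked-files=all
    dirs_to_remove: directory paths relative to the repository root.
    """
    # Normalise every directory to end in exactly one slash so that
    # "services/litellm" does not also match "services/litellm-db/".
    prefixes = [d.rstrip("/") + "/" for d in dirs_to_remove]
    at_risk = []
    for line in porcelain_output.splitlines():
        # Porcelain v1 lines look like "XY path"; "!!" marks an ignored path.
        if not line.startswith("!! "):
            continue
        path = line[3:]
        if any(path.startswith(prefix) for prefix in prefixes):
            at_risk.append(path)
    return at_risk


# Hypothetical output resembling the host state before the cleanup.
sample = "\n".join([
    "?? services/litellm/compose.yaml",
    "?? services/phoenix/compose.yaml",
    "!! services/litellm/.env",
    "!! services/example/.env",
])

at_risk = ignored_files_at_risk(sample, ["services/litellm/", "services/phoenix/"])
print(at_risk)
```

A non-empty result is a signal to back those files up (or delete only the byte-identical tracked paths) before removing the directory.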
Root cause
Two independent, previously-known failure classes compounded:
- No detection for a stuck `PullRepo` stage. Komodo has no configured Alerter, so a Procedure stage that fails on every run for days produces no notification anywhere; it is only visible if a human opens the Komodo UI. Untracked files reaching `/opt/homelab` outside of a clean `git pull` (most likely a manual `docker compose` test or an interrupted earlier deploy) is exactly the kind of one-off drift a repo layered under active GitOps will eventually see.
- MongoDB root credentials and `compose.env` are two separate stores with no reconciliation. `MONGO_INITDB_ROOT_PASSWORD` only takes effect on first volume init (documented in `fix-komodo-mongo-auth.sh` itself, added 2026-06-26 after an earlier occurrence of this same drift). Any path that updates `compose.env`'s password without also calling `db.changeUserPassword()` (including a manually edited SOPS file, a partially-run rotation, or restoring `compose.env` from an older backup) leaves Mongo and the env file disagreeing, and `komodo-core` crash-loops until someone notices and runs the rescue script.
Both are cataloged as recurring incident classes, not one-offs — see the reliability audit.
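
To illustrate the second failure class, here is a hedged sketch of a drift check that compares the password a running container was created with against the value in an env file. It relies on `docker inspect` returning a JSON list whose `Config.Env` field holds `KEY=VALUE` strings. The helper names and the env-file key `KOMODO_DB_PASSWORD` are assumptions for illustration, and the passwords are placeholders. This only compares two recorded values; the password actually stored in the Mongo volume can only be confirmed by authenticating against it.

```python
import json


def container_env(inspect_json, key):
    """Read one variable from `docker inspect <container>` JSON output."""
    containers = json.loads(inspect_json)
    # docker inspect returns a list; Config.Env holds "KEY=VALUE" strings.
    for entry in containers[0]["Config"]["Env"]:
        name, _, value = entry.partition("=")
        if name == key:
            return value
    return None


def env_file_value(env_text, key):
    """Read one variable from simple KEY=VALUE env-file text."""
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip()
    return None


def password_drifted(inspect_json, env_text, container_key, file_key):
    """True only when both values are present and they differ."""
    running = container_env(inspect_json, container_key)
    on_disk = env_file_value(env_text, file_key)
    return running is not None and on_disk is not None and running != on_disk


# Placeholder data shaped like the incident: container created with an old value.
sample_inspect = json.dumps(
    [{"Config": {"Env": ["MONGO_INITDB_ROOT_PASSWORD=old-secret", "PATH=/usr/bin"]}}]
)
sample_env = "# hypothetical compose.env\nKOMODO_DB_PASSWORD=new-secret\n"

drifted = password_drifted(
    sample_inspect, sample_env, "MONGO_INITDB_ROOT_PASSWORD", "KOMODO_DB_PASSWORD"
)
print(drifted)
```

Returning `False` when either value is missing keeps the check from raising false alarms; a stricter variant could treat a missing value as its own error.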
What went well
- The byte-identical check before deleting untracked files prevented actual data loss on the tracked-content side.
- `fix-komodo-mongo-auth.sh` (written after the 2026-06-26 occurrence of the same Mongo drift) worked exactly as designed on the second occurrence; the runbook investment paid off.
- Recovering `.env` values from the running container's live environment avoided a LiteLLM outage on top of the Komodo one.
What went poorly
- The outage was only found because an owner-directed audit happened to include a live-state check. Nothing in CI, Komodo, or monitoring would have surfaced it otherwise — the pattern the audit was specifically asked to eliminate.
- The fix itself (`rm -rf` on untracked paths) caused a second, avoidable regression (deleting `.env`) because the untracked-file check didn't first distinguish "byte-identical tracked content" from "gitignored host state that happens to live in the same directory."
Action items
| Action | Owner | Status |
|---|---|---|
| Add a Komodo Alerter (Slack/Discord/ntfy) for Procedure/Stack failures | Owner (Komodo UI/API access) | Tracked (audit P0) |
| Pre-flight Mongo auth check before any password rotation | Agent | Done: `scripts/rotate-komodo-secrets.sh` now verifies the live password before calling `changeUserPassword` |
| `RunSync` stage before `BatchDeployStackIfChanged` in `deploy-infra` so new stacks apply automatically | Agent | Done: `services/komodo/resources.toml` |
| Runbook/checklist step: never `rm -rf` an untracked directory on `/opt/homelab` without first checking for gitignored files inside it | Agent | Done: see `komodo-github-webhook.md` |
| Recurring host `git status` drift check (untracked files outside `.gitignore`) surfaced as a metric/alert | Owner + Agent | Tracked (audit P1) |
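
One possible shape for the tracked drift check is sketched below, assuming a periodic job on the host that first runs `git fetch` and then collects committer timestamps with `git log --format=%ct HEAD..origin/main`. The function name and the one-hour threshold are hypothetical. The check measures from the oldest undeployed commit, so a steady stream of new pushes to `main` cannot keep resetting the clock.

```python
def deploy_is_stale(undeployed_commit_times, now, max_lag_seconds):
    """Return True when the host has lagged origin/main for too long.

    undeployed_commit_times: Unix timestamps (seconds) of commits that are on
        origin/main but not in the host checkout; empty when up to date.
    now: current Unix time in seconds.
    max_lag_seconds: how long an undeployed commit may wait before alerting.
    """
    if not undeployed_commit_times:
        return False  # nothing is waiting to be deployed
    oldest = min(undeployed_commit_times)
    return now - oldest > max_lag_seconds


now = 1_800_000_000                 # arbitrary reference time for the example
five_days_21h = 5 * 86400 + 21 * 3600
one_hour = 3600                     # hypothetical alert threshold

# A commit stuck for about as long as this incident, plus a fresh one.
stuck = deploy_is_stale([now - five_days_21h, now - 60], now, one_hour)
# Only a fresh commit that the next webhook run should pick up.
fresh = deploy_is_stale([now - 60], now, one_hour)
print(stuck, fresh)
```

Committer time is only a proxy for push time, so a threshold of this kind should leave some slack for commits that were authored well before they reached `main`.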