
Public edge incident response (Cloudflare + Traefik + Authentik)

Runbook for SEC-009: credential leak, certificate failure, or database corruption on the public-facing chain that serves *.realemail.app and *.infra.realemail.app.

Related: Restore (per-service restic), DR from Zero, Secrets compromise, Traefik README.


Dependency chain

flowchart LR
    CF[Cloudflare DNS + proxy]
    TR[Traefik infra-services + saltierpoop]
    AK[Authentik on saltierpoop]
    APPS[Protected apps behind Traefik]

    CF -->|DNS-01 ACME| TR
    CF -->|orange-cloud optional| TR
    TR -->|forward auth| AK
    AK --> APPS

| Layer | What breaks if it fails | Typical symptom |
|---|---|---|
| Cloudflare API token | Traefik cannot renew Let's Encrypt certs via DNS-01 | TLS errors, cert expiry alerts |
| Traefik acme.json | Same: lost or corrupt ACME account state | Sudden TLS failures after restart |
| Authentik DB | SSO / forward-auth broken | 401/502 on protected routes, login loops |
| Traefik config | Routing wrong or down | 404/502 for all hostnames |

Tier-1 backups cover Traefik state, Komodo, and Authentik DB (external fetch from saltierpoop). Cloudflare account credentials are not in restic — they live in SOPS, GitHub Actions secrets, and 1Password.


Scenario A — Cloudflare API token leak or rotation

When to use

  • Token committed to git, pasted in a ticket, or Cloudflare audit shows misuse.
  • Planned rotation (good hygiene every 6–12 months).
  • Traefik logs: Cloudflare API errors during cert renewal.

Where tokens live today

| Location | Key name | Scope needed |
|---|---|---|
| services/traefik/.env.sops.yaml | CF_DNS_API_TOKEN | Zone DNS Edit for realemail.app |
| GitHub Actions docs.yml | CLOUDFLARE_API_TOKEN, CLOUDFLARE_ACCOUNT_ID | Pages deploy (separate token recommended) |
| Saltbox inventory on saltierpoop | Per-app DNS roles | Optional; many apps use Cloudflare DNS records |

Use separate tokens per consumer when possible (Traefik DNS-01 vs CI vs Saltbox).

Response steps

  1. Cloudflare dashboard → My Profile → API Tokens → revoke the compromised token.
  2. Create a replacement with minimum scope:
       • Traefik: Zone → DNS → Edit on realemail.app (and any other zones Traefik manages).
       • Docs CI: Account → Cloudflare Pages → Edit only.
  3. Update secrets (never commit plaintext):
# On dev machine with age key / sops
sops services/traefik/.env.sops.yaml
# Edit CF_DNS_API_TOKEN, save encrypted
git add services/traefik/.env.sops.yaml && git commit && git push
  4. Redeploy Traefik on infra-services:
ssh infra-services
cd /opt/homelab/services/traefik
# Decrypt, rename the YAML key to the env var Traefik reads, and write .env
SOPS_AGE_KEY_FILE=/etc/homelab/age-key.txt sops -d .env.sops.yaml \
  | sed 's/CF_DNS_API_TOKEN: /CLOUDFLARE_DNS_API_TOKEN=/' > .env
docker compose up -d
  5. Saltbox Traefik (if the token is shared): update inventory / accounts.yml as applicable, then run sb install traefik on saltierpoop.
  6. GitHub: Settings → Secrets → update CLOUDFLARE_API_TOKEN if the CI token was rotated.
  7. Verify renewal (force if needed):
docker logs traefik 2>&1 | tail -50 | grep -i acme
curl -sI https://traefik.infra.realemail.app | head -5
  8. Post-incident: Cloudflare Audit Logs → confirm the old token shows no further API calls.
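
Before redeploying, it can help to confirm that the replacement token is actually active. The following is a minimal sketch, not part of the repo: it assumes curl is installed and that the token was created under My Profile (a user-owned token), and it calls Cloudflare's token verification endpoint. The helper name verify_cf_token is hypothetical.

```shell
# Sketch: check that a user-owned Cloudflare API token is active.
# verify_cf_token is a hypothetical helper name, not something in the repo.
verify_cf_token() {
  local token="$1"
  local body
  body="$(curl -sS -H "Authorization: Bearer ${token}" \
    "https://api.cloudflare.com/client/v4/user/tokens/verify")" || return 2
  # The API reports "status":"active" for a usable token.
  if printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"active"'; then
    echo "token active"
  else
    echo "token NOT active: ${body}" >&2
    return 1
  fi
}
```

Usage: verify_cf_token "$CF_DNS_API_TOKEN" for the new token should succeed. The same call against the revoked token should fail, which doubles as a check that step 1 took effect.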

If certs already expired before rotation

Traefik will request new certs once the token is valid. Browsers may need a hard refresh. If acme.json is corrupt, restore Traefik backup first (Scenario B) then rotate token.
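
To see whether a hostname is still serving an expired or soon-to-expire certificate, a check along these lines may be useful. It is a sketch that assumes openssl is installed; fetch_cert and cert_valid_for are hypothetical helper names.

```shell
# fetch_cert HOST: print the PEM leaf certificate served on HOST:443.
fetch_cert() {
  echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
    | openssl x509
}

# cert_valid_for SECONDS: read a PEM certificate on stdin and succeed only
# if it will still be valid SECONDS from now.
cert_valid_for() {
  openssl x509 -noout -checkend "$1" >/dev/null
}
```

For example, `fetch_cert traefik.infra.realemail.app | cert_valid_for 604800 || echo "expires within 7 days"` flags a certificate that Traefik should already have renewed.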


Scenario B — Traefik acme.json loss or corruption

When to use

  • acme.json deleted, zeroed, or permissions wrong (acme.json must be 600).
  • Traefik fails to start with ACME / lock errors.
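
The conditions above can be checked in one pass before deciding to restore. This is a sketch under assumptions: GNU stat and python3 are available on the host, and check_acme_json is a hypothetical helper name.

```shell
# check_acme_json PATH: report the first problem found with an acme.json file.
check_acme_json() {
  local f="$1"
  local mode
  [ -e "$f" ] || { echo "missing"; return 1; }
  [ -s "$f" ] || { echo "empty"; return 1; }
  mode="$(stat -c '%a' "$f")"   # GNU stat; BSD/macOS would need stat -f '%Lp'
  [ "$mode" = "600" ] || { echo "bad mode ${mode}"; return 1; }
  # Any JSON parser works here; python3 is an assumption about the host.
  python3 -m json.tool "$f" >/dev/null 2>&1 || { echo "invalid JSON"; return 1; }
  echo "ok"
}
```

Usage: check_acme_json /opt/homelab/services/traefik/acme.json. A result of "bad mode" only needs chmod 600; the other failures point to the recovery steps that follow.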

Recovery

  1. Restore from tier-1 restic (preferred):
ssh infra-services
restic -r /var/backups/restic/traefik snapshots
restic -r /var/backups/restic/traefik restore latest \
  --target /tmp/restore-traefik
cd /opt/homelab/services/traefik
docker compose stop traefik
cp -a /tmp/restore-traefik/opt/homelab/services/traefik/acme.json ./acme.json
chmod 600 ./acme.json
docker compose up -d traefik
  2. If no snapshot exists: a valid CF_DNS_API_TOKEN plus an empty acme.json lets Traefik re-issue all certs. This may hit Let's Encrypt rate limits, so stagger re-issuance or use the staging CA first.
  3. Verify HTTPS on traefik.infra.realemail.app and one *.realemail.app hostname.
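
For the no-snapshot path, one way to start from an empty acme.json without losing the damaged file is sketched below. It assumes Traefik is stopped first and that GNU coreutils install is available; reset_acme_json is a hypothetical helper name.

```shell
# reset_acme_json DIR: keep the damaged file for later inspection, then create
# an empty acme.json with mode 600 so Traefik can re-issue certificates.
reset_acme_json() {
  local dir="$1"
  local stamp
  stamp="$(date +%Y%m%d-%H%M%S)"
  if [ -e "${dir}/acme.json" ]; then
    mv "${dir}/acme.json" "${dir}/acme.json.bad-${stamp}"
  fi
  # Copying /dev/null yields an empty file with the requested mode.
  install -m 600 /dev/null "${dir}/acme.json"
}
```

Usage: run docker compose stop traefik, then reset_acme_json /opt/homelab/services/traefik, then docker compose up -d traefik.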

See also restore.md § Traefik.


Scenario C — Authentik PostgreSQL corruption or loss

When to use

  • authentik container crash loops; Postgres errors on startup.
  • Accidental docker volume rm or failed upgrade.
  • Need point-in-time recovery after bad config change.

Backups available

| Source | Path / job | Schedule |
|---|---|---|
| Tier-1 external fetch | backups/external/authentik-db.yaml → fetch-authentik-db.sh | Daily ~02:08 PDT |
| Restic local + B2 | restic-backup-external-authentik-db.timer | After fetch |

Staging on infra-services: /var/backups/external-staging/authentik-db/.
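
Before relying on the staging copy, it is worth checking that a recent dump exists and is a readable gzip file. A minimal sketch, assuming GNU find and gzip; latest_dump_ok and the 26-hour threshold in the usage note are illustrative choices, not values from the backup config.

```shell
# latest_dump_ok DIR MAX_AGE_MINUTES: succeed if DIR holds a *.sql.gz modified
# within the last MAX_AGE_MINUTES that passes a gzip integrity test.
latest_dump_ok() {
  local dir="$1"
  local max_age="$2"
  local newest
  newest="$(find "$dir" -maxdepth 1 -name '*.sql.gz' -mmin "-${max_age}" | head -n 1)"
  [ -n "$newest" ] || { echo "no dump newer than ${max_age} minutes in ${dir}"; return 1; }
  gzip -t "$newest" || { echo "corrupt gzip: ${newest}"; return 1; }
  echo "ok: ${newest}"
}
```

Usage: latest_dump_ok /var/backups/external-staging/authentik-db 1560 allows roughly 26 hours, which leaves slack around a daily fetch.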

Recovery on saltierpoop

  1. Stop Authentik (keep Postgres running for the restore):
ssh someone@192.168.6.243
docker stop authentik authentik-worker 2>/dev/null || true
  2. Get the latest dump, either from restic on infra-services or by re-fetching:
# On infra-services: restore the staging copy from restic
restic -r /var/backups/restic/external-authentik-db restore latest \
  --target /tmp/restore-ak-db
scp /tmp/restore-ak-db/var/backups/external-staging/authentik-db/*.sql.gz \
  someone@192.168.6.243:/tmp/

Or trigger the fetch manually from infra-services, if the SSH key is deployed on saltierpoop:

# From infra-services as backup user / script
/opt/homelab/backups/external/fetch-authentik-db.sh
  3. Restore into Postgres. For a clean restore, drop and recreate the DB first (destructive):
# Connect to the default "postgres" maintenance DB (assumed to exist): a
# database cannot be dropped from a session connected to it, and DROP DATABASE
# cannot share a single -c string with another statement.
docker exec -i authentik-postgres psql -U authentik -d postgres \
  -c "DROP DATABASE authentik WITH (FORCE);" \
  -c "CREATE DATABASE authentik;"

Then load the dump:

gunzip -c /tmp/authentik.sql.gz | docker exec -i authentik-postgres \
  psql -U authentik -d authentik
  4. Start Authentik:
docker start authentik authentik-worker
docker logs authentik --tail 30
  5. Verify:
       • Log in at https://auth.realemail.app (or your Authentik hostname).
       • Open a Traefik-protected app; forward-auth should succeed.
       • Check that Authentik admin → Applications / Providers are still present.
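
Alongside the manual checks, a quick database-level sanity check can catch a restore that loaded nothing. This is a sketch, assuming the container and role names used in this runbook; public_table_count and restore_looks_sane are hypothetical helpers, and a non-zero table count only shows that the schema loaded, not that the data is current.

```shell
# public_table_count: number of tables in the public schema of the authentik DB.
public_table_count() {
  docker exec authentik-postgres psql -U authentik -d authentik -Atc \
    "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';"
}

# restore_looks_sane: fail if the schema is empty, for example after a
# drop/recreate where the dump never loaded.
restore_looks_sane() {
  local n
  n="$(public_table_count)" || return 2
  [ "${n:-0}" -gt 0 ] || { echo "no tables restored"; return 1; }
  echo "tables: ${n}"
}
```

Usage: run restore_looks_sane on saltierpoop after loading the dump and before starting the Authentik containers.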

See restore.md § Authentik DB.

If backup is stale or missing

  • Re-bootstrap Authentik from Saltbox (sb install authentik) and re-create providers/applications manually — last resort.
  • Export provider config periodically (Authentik blueprints) as future hardening.

Scenario D — Full edge down (DNS + TLS + auth)

Work top-down:

  1. DNS resolves? dig +short grafana.infra.realemail.app → 192.168.6.17 (LAN) or Cloudflare proxy IP (WAN).
  2. TLS? curl -vI https://… — cert name, expiry, issuer.
  3. Traefik routing? Traefik dashboard / access logs.
  4. Authentik? Direct hit to auth URL bypassing apps.
  5. App container? Komodo stack status.

Use DR from Zero only for total loss; partial incidents use sections above.
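
The first layers of the top-down check can be scripted so that the failing layer is obvious from the exit code. A minimal sketch, assuming dig and curl are installed; edge_check and its exit codes are hypothetical, and it stops before the Authentik and container checks, which still need the dashboard or Komodo.

```shell
# edge_check HOST: walk DNS, then TLS/connect, then HTTP status for one
# hostname and stop at the first failing layer.
# Exit codes: 1 = DNS, 2 = TLS or connection, 3 = HTTP 5xx, 0 = reachable.
edge_check() {
  local host="$1"
  local ip
  local code
  ip="$(dig +short "$host" | tail -n 1)"
  [ -n "$ip" ] || { echo "DNS: no answer for ${host}"; return 1; }
  echo "DNS: ${host} -> ${ip}"
  # curl exits non-zero on connection or certificate errors, zero on any HTTP reply.
  code="$(curl -sS -o /dev/null --max-time 10 -w '%{http_code}' "https://${host}" 2>/dev/null)" \
    || { echo "TLS/connect: handshake or connection failed"; return 2; }
  echo "HTTP: status ${code}"
  case "$code" in
    5*) return 3 ;;
  esac
  return 0
}
```

Usage: edge_check grafana.infra.realemail.app. An exit code of 3 with a 502 usually points past Traefik, toward Authentik or the app container.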


Validation log

| Date | Check | Result |
|---|---|---|
| 2026-06-21 | restic-backup-traefik.timer | active |
| 2026-06-21 | restic-backup-external-authentik-db.timer | active |
| 2026-06-21 | Authentik staging dump | authentik.sql.gz ~23 MB |
| 2026-06-21 | Runbook scenarios A–C | documented |
| 2026-06-21 | Tabletop (owner) | proven: token rotation, Traefik redeploy, backup restore paths exercised |

Prevention checklist

  • [x] Traefik + Authentik tier-1 timers green on infra-services
  • [x] Cloudflare tokens scoped per service; rotation date in 1Password
  • [x] acme.json backed up daily (Traefik restic timer)
  • [x] Authentik admin 2FA enabled
  • [x] This runbook exercised once (tabletop): token revoke + restore Authentik from restic

Closure

SEC-009 closed 2026-06-21 — see security-register.md.