Public edge incident response (Cloudflare + Traefik + Authentik)
Runbook for SEC-009: credential leak, certificate failure, or database corruption
on the public-facing chain that serves *.realemail.app and *.infra.realemail.app.
Related: Restore (per-service restic), DR from Zero, Secrets compromise, Traefik README.
Dependency chain
flowchart LR
CF[Cloudflare DNS + proxy]
TR[Traefik infra-services + saltierpoop]
AK[Authentik on saltierpoop]
APPS[Protected apps behind Traefik]
CF -->|DNS-01 ACME| TR
CF -->|orange-cloud optional| TR
TR -->|forward auth| AK
AK --> APPS
| Layer | What breaks if it fails | Typical symptom |
|---|---|---|
| Cloudflare API token | Traefik cannot renew Let's Encrypt certs via DNS-01 | TLS errors, cert expiry alerts |
| Traefik acme.json | Same as above; lost or corrupt ACME account state | Sudden TLS failures after restart |
| Authentik DB | SSO / forward-auth broken | 401/502 on protected routes, login loops |
| Traefik config | Routing wrong or down | 404/502 for all hostnames |
Tier-1 backups cover Traefik state, Komodo, and Authentik DB (external fetch from saltierpoop). Cloudflare account credentials are not in restic — they live in SOPS, GitHub Actions secrets, and 1Password.
Scenario A: Cloudflare API token leak or rotation
When to use
- Token committed to git, pasted in a ticket, or Cloudflare audit shows misuse.
- Planned rotation (good hygiene every 6–12 months).
- Traefik logs: Cloudflare API errors during cert renewal.
Where tokens live today
| Location | Key name | Scope needed |
|---|---|---|
| services/traefik/.env.sops.yaml | CF_DNS_API_TOKEN | Zone DNS Edit for realemail.app |
| GitHub Actions docs.yml | CLOUDFLARE_API_TOKEN, CLOUDFLARE_ACCOUNT_ID | Pages deploy (separate token recommended) |
| Saltbox inventory on saltierpoop | Per-app DNS roles | Optional — many apps use Cloudflare DNS records |
Use separate tokens per consumer when possible (Traefik DNS-01 vs CI vs Saltbox).
Response steps
- Cloudflare dashboard → My Profile → API Tokens → revoke the compromised token.
- Create replacement with minimum scope:
- Traefik: Zone → DNS → Edit on realemail.app (and any other zones Traefik manages).
- Docs CI: Account → Cloudflare Pages → Edit only.
- Update secrets (never commit plaintext):
# On dev machine with age key / sops
sops services/traefik/.env.sops.yaml
# Edit CF_DNS_API_TOKEN, save encrypted
git add services/traefik/.env.sops.yaml && git commit && git push
- Redeploy Traefik on infra-services:
ssh infra-services
cd /opt/homelab/services/traefik
SOPS_AGE_KEY_FILE=/etc/homelab/age-key.txt sops -d .env.sops.yaml \
| sed 's/CF_DNS_API_TOKEN: /CLOUDFLARE_DNS_API_TOKEN=/' > .env
docker compose up -d
- Saltbox Traefik (if token shared): update inventory / accounts.yml as applicable, then run sb install traefik on saltierpoop.
- GitHub: Settings → Secrets → update CLOUDFLARE_API_TOKEN if CI token was rotated.
- Verify renewal (force if needed):
docker logs traefik 2>&1 | tail -50 | grep -i acme
curl -sI https://traefik.infra.realemail.app | head -5
- Post-incident: Cloudflare Audit Logs → confirm old token shows no further API calls.
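Before redeploying, it can be useful to confirm that the replacement token is actually active. The sketch below assumes Cloudflare's token verification endpoint (`GET /client/v4/user/tokens/verify`, which applies to user tokens created under My Profile) and parses the response with `sed` so that `jq` is not required. The function names `cf_token_status_from_json` and `cf_token_verify` are illustrative and are not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not existing repo scripts.

cf_token_status_from_json() {
  # Reads a token-verify JSON response on stdin and prints the value of
  # its "status" field (for example: active).
  tr -d '\n' | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

cf_token_verify() {
  # $1 = API token. Needs curl and network access to api.cloudflare.com.
  # -f makes curl exit non-zero on HTTP errors such as a rejected token.
  curl -fsS -H "Authorization: Bearer $1" \
    https://api.cloudflare.com/client/v4/user/tokens/verify \
    | cf_token_status_from_json
}
```

Running `cf_token_verify "$CF_DNS_API_TOKEN"` should print `active` for a working token. This only shows that the token is live; it does not prove that the token carries the DNS Edit scope Traefik needs, so the ACME log check in the steps above is still required.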
If certs already expired before rotation
Traefik will request new certs once the token is valid. Browsers may need a hard refresh.
If acme.json is corrupt, restore Traefik backup first (Scenario B) then rotate token.
Scenario B: Traefik acme.json loss or corruption
When to use
- acme.json deleted, zeroed, or permissions wrong (acme.json must be 600).
- Traefik fails to start with ACME / lock errors.
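The three file-level conditions above (missing, zeroed, wrong permissions) can be checked in one pass. This is a minimal sketch that assumes GNU `stat` (the `-c` flag; BSD and macOS `stat` use different options). The function name `check_acme_file` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: check_acme_file is an illustrative helper name.

check_acme_file() {
  # $1 = path to acme.json.
  # Returns 0 only if the file exists, is non-empty, and has mode 600.
  local f="$1"
  local mode
  [ -f "$f" ] || { echo "missing: $f"; return 1; }
  [ -s "$f" ] || { echo "empty: $f"; return 1; }
  mode=$(stat -c '%a' "$f") || return 1
  [ "$mode" = "600" ] || { echo "bad mode $mode: $f"; return 1; }
  echo "ok: $f"
}
```

For example, `check_acme_file /opt/homelab/services/traefik/acme.json` on infra-services. An "empty" result is not always an error: an intentionally empty file is the starting point for re-issuing all certificates when no snapshot exists.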
Recovery
- Restore from tier-1 restic (preferred):
ssh infra-services
restic -r /var/backups/restic/traefik snapshots
restic -r /var/backups/restic/traefik restore latest \
--target /tmp/restore-traefik
cd /opt/homelab/services/traefik
docker compose stop traefik
cp -a /tmp/restore-traefik/opt/homelab/services/traefik/acme.json ./acme.json
chmod 600 ./acme.json
docker compose up -d traefik
- If no snapshot: a valid CF_DNS_API_TOKEN plus an empty acme.json lets Traefik re-issue all certs (may hit Let's Encrypt rate limits; stagger or use staging first).
- Verify HTTPS on traefik.infra.realemail.app and one *.realemail.app hostname.
See also restore.md § Traefik.
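To make the HTTPS verification step less of a visual check, the served certificate's expiry can be turned into a number of days remaining. The sketch below assumes `openssl` is installed and that `date` is GNU date (the `-d` flag). Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: cert_not_after and cert_days_left are illustrative names.

cert_not_after() {
  # $1 = hostname. Prints the notAfter date of the certificate served on 443.
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2
}

cert_days_left() {
  # $1 = notAfter date as printed by openssl, e.g. "Jan 11 00:00:00 2026 GMT"
  # $2 = optional "now" in epoch seconds (defaults to the current time)
  local not_after="$1"
  local now="${2:-$(date +%s)}"
  local end
  end=$(date -u -d "$not_after" +%s) || return 1
  # Whole days remaining; negative once the certificate has expired.
  echo $(( (end - now) / 86400 ))
}
```

A possible invocation is `cert_days_left "$(cert_not_after traefik.infra.realemail.app)"`. A freshly issued Let's Encrypt certificate should report a value close to its full lifetime, which is a quick way to tell a re-issued cert from a restored one.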
Scenario C: Authentik PostgreSQL corruption or loss
When to use
- authentik container crash loops; Postgres errors on startup.
- Accidental docker volume rm or failed upgrade.
- Need to roll back after a bad config change (granularity is the daily dump, not true point-in-time recovery).
Backups available
| Source | Path / job | Schedule |
|---|---|---|
| Tier-1 external fetch | backups/external/authentik-db.yaml → fetch-authentik-db.sh | Daily ~02:08 PDT |
| Restic local + B2 | restic-backup-external-authentik-db.timer | After fetch |
Staging on infra-services: /var/backups/external-staging/authentik-db/.
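Because the fetch runs once a day, a staged dump older than roughly a day suggests the job has stopped. The following sketch reports a dump's age and compares it with a threshold. It assumes GNU `stat`; the 26-hour default is an assumed tolerance for a daily job, not a value taken from the backup configuration. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: dump_age_hours and dump_is_fresh are illustrative names.

dump_age_hours() {
  # $1 = file path, $2 = optional "now" in epoch seconds.
  # Prints whole hours since the file was last modified.
  local f="$1"
  local now="${2:-$(date +%s)}"
  local mtime
  mtime=$(stat -c '%Y' "$f") || return 1
  echo $(( (now - mtime) / 3600 ))
}

dump_is_fresh() {
  # $1 = file path, $2 = max age in hours (assumed default: 26).
  local age
  age=$(dump_age_hours "$1") || return 1
  [ "$age" -le "${2:-26}" ]
}
```

For instance, `dump_is_fresh /var/backups/external-staging/authentik-db/authentik.sql.gz || echo "stale dump"` could run before a restore to decide whether re-fetching is worthwhile.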
Recovery on saltierpoop
- Stop Authentik (keep Postgres running for restore):
- Get latest dump — from restic on infra-services or re-fetch:
# On infra-services — restore staging copy
restic -r /var/backups/restic/external-authentik-db restore latest \
--target /tmp/restore-ak-db
scp /tmp/restore-ak-db/var/backups/external-staging/authentik-db/*.sql.gz \
someone@192.168.6.243:/tmp/
Or on saltierpoop trigger fetch manually if SSH key is deployed:
- Restore into Postgres:
gunzip -c /tmp/authentik.sql.gz | docker exec -i authentik-postgres \
psql -U authentik -d authentik
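For a clean restore, drop and recreate the DB before loading the dump (destructive). Connect to the default postgres maintenance database, because a database cannot be dropped from a session connected to it, and pass each statement in its own -c so that DROP DATABASE does not run inside a multi-statement transaction. WITH (FORCE) requires PostgreSQL 13 or later:
docker exec -i authentik-postgres psql -U authentik -d postgres \
-c "DROP DATABASE authentik WITH (FORCE);" \
-c "CREATE DATABASE authentik;"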
- Start Authentik:
- Verify:
- Login at https://auth.realemail.app (or your Authentik hostname).
- Open a Traefik-protected app; forward-auth should succeed.
- Check Authentik admin → Applications / Providers still present.
See restore.md § Authentik DB.
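Piping a truncated or corrupt dump into `psql` can leave the database half-restored, so a quick integrity check before the restore step is cheap insurance. This sketch relies only on `gzip -t`, which validates the compressed stream without extracting it. The function name `check_sql_dump` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: check_sql_dump is an illustrative helper name.

check_sql_dump() {
  # $1 = path to a gzip-compressed SQL dump.
  # Returns 0 only if the file is non-empty and the gzip stream is intact.
  local f="$1"
  [ -s "$f" ] || { echo "missing or empty: $f"; return 1; }
  gzip -t "$f" 2>/dev/null || { echo "corrupt gzip: $f"; return 1; }
  echo "ok: $f"
}
```

For example, `check_sql_dump /tmp/authentik.sql.gz && echo "safe to restore"`. The check confirms the archive is readable, not that the SQL inside is complete, so the post-restore verification steps still apply.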
If backup is stale or missing
- Re-bootstrap Authentik from Saltbox (sb install authentik) and re-create providers/applications manually. This is a last resort.
- Export provider config periodically (Authentik blueprints) as future hardening.
Scenario D: Full edge down (DNS + TLS + auth)
Work top-down:
- DNS resolves? dig +short grafana.infra.realemail.app → 192.168.6.17 (LAN) or Cloudflare proxy IP (WAN).
- TLS? curl -vI https://… (check cert name, expiry, issuer).
- Traefik routing? Traefik dashboard / access logs.
- Authentik? Direct hit to auth URL bypassing apps.
- App container? Komodo stack status.
Use DR from Zero only for total loss; partial incidents use sections above.
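The top-down order can be expressed as a small helper that runs one check per layer and stops at the first failure, which is the layer to investigate. This is a sketch under assumptions: the function name `first_failing_layer` is hypothetical, and the check commands passed to it are examples rather than an established monitoring setup.

```shell
#!/usr/bin/env bash
# Sketch only: first_failing_layer is an illustrative helper name.

first_failing_layer() {
  # Arguments come in pairs: <layer-name> <shell command>.
  # Runs each command in order; prints the first layer whose command
  # fails and returns 1, or prints "none" if every check passes.
  local name
  local check
  while [ "$#" -ge 2 ]; do
    name="$1"
    check="$2"
    shift 2
    if ! sh -c "$check" >/dev/null 2>&1; then
      echo "$name"
      return 1
    fi
  done
  echo "none"
}
```

A possible invocation, using hostnames from this runbook:

```shell
first_failing_layer \
  dns  'dig +short grafana.infra.realemail.app | grep -q .' \
  tls  'curl -sI --max-time 5 https://traefik.infra.realemail.app' \
  auth 'curl -fsS --max-time 5 -o /dev/null https://auth.realemail.app'
```

Stopping at the first failure matters because lower layers cannot be judged while an upper one is down: a TLS error is meaningless if DNS does not resolve.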
Validation log
| Date | Check | Result |
|---|---|---|
| 2026-06-21 | restic-backup-traefik.timer | active |
| 2026-06-21 | restic-backup-external-authentik-db.timer | active |
| 2026-06-21 | Authentik staging dump | authentik.sql.gz ~23 MB |
| 2026-06-21 | Runbook scenarios A–C | documented |
| 2026-06-21 | Tabletop (owner) | proven — token rotation, Traefik redeploy, backup restore paths exercised |
Prevention checklist
- [x] Traefik + Authentik tier-1 timers green on infra-services
- [x] Cloudflare tokens scoped per service; rotation date in 1Password
- [x] acme.json backed up daily (Traefik restic timer)
- [x] Authentik admin 2FA enabled
- [x] This runbook exercised once (tabletop): token revoke + restore Authentik from restic
Closure
SEC-009 closed 2026-06-21 — see security-register.md.