Common failures
Purpose
The gated operator playbook for recovering from the most frequent production incidents in VibeSwitch. This page assumes you're on call, you can SSH (or the equivalent) to the server, you can reach the provider consoles (Firebase, NewsAPI.ai, Meta, OpenAI, Anthropic), and you're authorized to rotate secrets and restart the service. If you're a user rather than an operator, see the public Common failures page instead — it covers the same symptoms without the operator-level steps.
Prerequisites
- Required: Access to server logs (stdout / platform log viewer) and the ability to restart the service.
- Required: Access to the secret manager or env configuration surface so you can rotate keys.
- Required: Admin access to the Firebase / Identity Platform project.
- Required: Admin access to the WhatsApp Business account in the Meta Developer console.
- Recommended: A runbook where you record incident timelines and post-mortems.
Inputs
- The symptom the user or alerting reported: a UI message, a fired alert, a missing report, or a cost-cap hit.
- The date affected. Almost every incident is date-scoped; capture it before doing anything else.
- Your role in the response. Are you triaging, fixing, or recording? Don't do all three at once.
Outputs
- A deterministic action sequence: what you checked, what you changed, what you verified afterwards.
- A short post-mortem entry: symptom, root cause, fix, prevention. Even one sentence per field is enough.
- (When relevant) a user-facing update posted to whatever channel your users watch — especially when the report is stale or unavailable.
Constraints
- Gating. This page is access-controlled by default: `intent: playbooks` makes it gated in the in-app Docs panel, and `playbooks/` pages are hidden from unauthenticated callers. Don't paste the content into public channels.
- One change at a time. Most incidents are single-cause. If you rotate a key, rebuild the client, and restart the server in one go, you won't know which action mattered.
- Prefer deterministic checks over narrative guessing. HTTP status codes, log lines, and files on disk are ground truth.
- Don't suppress errors as shortcuts. Commenting out a validation step to "make it work" is how silent drift gets introduced. Fix the underlying issue or escalate.
- Preserve the evidence. Snapshot the failing dated exports, signals JSON, and log lines before you re-run the pipeline. You'll want them for the post-mortem.
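One way to preserve evidence before a re-run is to copy the dated artifacts into a timestamped folder. The following is a minimal sketch, not part of VibeSwitch: the `snapshot_evidence` function name and the `incident-evidence` default destination are hypothetical, and it only covers the `signals/` and `reports/` directories named on this page. Add your export directory and a log capture step to match your deployment.

```shell
#!/usr/bin/env bash
# snapshot_evidence DATE [DEST_ROOT]
# Copies every signals/ and reports/ entry whose name contains DATE into
# DEST_ROOT/DATE-<timestamp>/ and prints the snapshot directory.
# Hypothetical helper: names and layout are assumptions for illustration.
snapshot_evidence() {
  local day="$1" root="${2:-incident-evidence}" dest f
  dest="${root}/${day}-$(date +%Y%m%dT%H%M%S)"
  mkdir -p "$dest" || return 1
  for f in signals/*"${day}"* reports/*"${day}"*; do
    # An unmatched glob stays literal, so skip anything that does not exist.
    [ -e "$f" ] || continue
    mkdir -p "${dest}/$(dirname "$f")"
    # -R handles dated directories, -p keeps the original timestamps.
    cp -Rp "$f" "${dest}/$f"
  done
  echo "$dest"
}
```

Run it from the directory that contains `signals/` and `reports/`, for example `snapshot_evidence "$(date +%F)"`, and record the printed path in the post-mortem entry.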
Examples
Fast triage sequence
Run these first, in order:
1. `curl -sS -o /dev/null -w "%{http_code}\n" http://SERVER/api/openapi.json`: is the server running?
2. `curl -sS http://SERVER/api/auth/config`: what's the auth posture?
3. `curl -sS -o /dev/null -w "%{http_code}\n" http://SERVER/api/docs/index`: is `product_docs/` deployed?
4. `curl -sS -o /dev/null -w "%{http_code}\n" http://SERVER/api/report/today`: is today's report cached?
5. `ls signals/ reports/ business_modules/news-sites/articles_extracted/`: which stages produced artifacts today?

Each step narrows the set of possible causes. Don't skip ahead.
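The triage checks can be wrapped in a small script so they always run in the same order. This is a sketch under stated assumptions: `SERVER` is an environment variable holding the same host placeholder used in the commands, the helper names `interpret_status`, `probe`, and `triage` are hypothetical, and the verdict attached to each status code is a reasonable reading rather than documented VibeSwitch behavior.

```shell
#!/usr/bin/env bash
# Triage sketch. SERVER (host or host:port) must be set by the caller.

# Map an HTTP status code to a coarse verdict (assumed interpretation).
interpret_status() {
  case "$1" in
    2??)     echo "ok" ;;
    401|403) echo "auth" ;;
    404)     echo "missing" ;;
    5??)     echo "server-error" ;;
    000)     echo "unreachable" ;;  # curl writes 000 when it got no HTTP response
    *)       echo "unexpected" ;;
  esac
}

# Probe one path and print "<path> <status> <verdict>".
probe() {
  local path="$1" code
  code="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 \
    "http://${SERVER}${path}" 2>/dev/null)" || true
  echo "${path} ${code:-000} $(interpret_status "${code:-000}")"
}

# Run the checks in the documented order.
triage() {
  probe /api/openapi.json
  curl -sS --max-time 10 "http://${SERVER}/api/auth/config"; echo
  probe /api/docs/index
  probe /api/report/today
  ls signals/ reports/ business_modules/news-sites/articles_extracted/
}
```

After sourcing the file, `SERVER=localhost:8080 triage` prints one line per endpoint (the port here is only an example). Stop at the first line whose verdict is not `ok` and go to the matching playbook.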
Incident playbooks
P1 — "API key is required" at startup
Symptom: server refuses to start, logs say API key is required.
- Check: secret manager / env has `NEWSAPI_API_KEY`; the latest deploy actually read it (not a stale revision).
- Fix: set the key in the runtime env and restart. Confirm `/api/openapi.json` returns `200`. If the key was rotated, also update the value in your secrets vault and any CI that reads it.
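To confirm the key is present without echoing it into a terminal or log, a presence-only check is enough. The `require_env` helper below is a hypothetical sketch. It is only meaningful when run in the same environment the service starts in (same container, same unit file, same revision).

```shell
#!/usr/bin/env bash
# require_env NAME...
# Reports whether each named variable is set and non-empty.
# Never prints the value itself. Returns non-zero if any are missing.
# Only pass trusted variable names: the lookup uses eval for portability.
require_env() {
  local name value missing=0
  for name in "$@"; do
    eval "value=\${$name-}"
    if [ -z "$value" ]; then
      echo "MISSING ${name}" >&2
      missing=1
    else
      echo "set ${name}"
    fi
  done
  return "$missing"
}
```

For this incident the call is `require_env NEWSAPI_API_KEY`. A `MISSING` line means the runtime env never received the key, which points at the deploy configuration rather than the key itself.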
P2 — Report is stale (yellow "outdated" banner)
Symptom: UI Report tab shows yesterday's date with a yellow banner.
- Check: `ls signals/signals-*-$(date +%F).json reports/*$(date +%F)*`. Did today's pipeline run?
- Fix: run `./scripts/daily-pipeline.sh`. If it failed mid-stage, run the missing stage explicitly. Hard-refresh the UI after the assessment completes.
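The artifact check can be turned into a yes/no answer that also says which side is missing. This sketch uses the same file patterns as the `ls` check; the `pipeline_ran_for` function name is hypothetical.

```shell
#!/usr/bin/env bash
# pipeline_ran_for DATE
# Succeeds only if both a signals file and a report exist for DATE (YYYY-MM-DD).
# Prints one line for each artifact type that is missing.
pipeline_ran_for() {
  local day="$1" status=0
  ls signals/signals-*-"${day}".json >/dev/null 2>&1 \
    || { echo "no signals for ${day}"; status=1; }
  ls reports/*"${day}"* >/dev/null 2>&1 \
    || { echo "no report for ${day}"; status=1; }
  return "$status"
}
```

Run `pipeline_ran_for "$(date +%F)"` from the directory that holds `signals/` and `reports/`. If only the report is missing, the earlier stages likely completed, so re-running just the report stage is usually enough.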
P3 — Every /api/* returns 401
Symptom: UI can't load data; unauthenticated curl returns 401.
- Check: `/api/auth/config`. Is `authRequired: true`? If yes, does the client have Firebase config baked in?
- Fix: if auth should be on, confirm `FIREBASE_PROJECT_ID` and runtime service account are correct (`firebase-admin` logs the verification failure reason on server stdout). If auth should be off, set `AUTH_REQUIRED=false` and restart.
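Reading the auth posture can be scripted. The sketch below assumes the `/api/auth/config` response is JSON with a boolean `authRequired` field, which this page implies but does not specify. The `auth_posture` helper is hypothetical and uses a pattern match rather than a JSON parser, so treat an `unknown` result as "look at the raw body".

```shell
#!/usr/bin/env bash
# auth_posture
# Reads a response body on stdin and prints "on", "off", or "unknown".
# Assumes a JSON body containing "authRequired": true|false.
auth_posture() {
  local body
  body="$(cat)"
  if printf '%s' "$body" | grep -Eq '"authRequired"[[:space:]]*:[[:space:]]*true'; then
    echo "on"
  elif printf '%s' "$body" | grep -Eq '"authRequired"[[:space:]]*:[[:space:]]*false'; then
    echo "off"
  else
    echo "unknown"
  fi
}
```

Usage: `curl -sS http://SERVER/api/auth/config | auth_posture`. If the result is `on` and every call still returns 401, the problem is on the token verification path (project ID, service account, or client config), not the auth toggle.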
P4 — Firebase sign-in closes with "unauthorized domain"
Symptom: users can't sign in; popup closes with an unauthorized-domain error.
- Check: Firebase Console → Authentication → Settings → Authorized domains.
- Fix: add the production domain (and any preview domains). Propagation is near-instant. No client rebuild required.
P5 — Upstream provider (NewsAPI / OpenAI / Anthropic) outage
Symptom: ingestion or extraction fails with 5xx or connection errors.
- Check: the provider's status page and your account dashboard (quota, billing).
- Fix: pause the affected stage, wait for upstream recovery, then replay the date. Don't spin retries indefinitely — each failed attempt burns cost. See Cost controls.
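A bounded retry with exponential backoff keeps a replay from looping against a provider that is still down. This is a generic sketch, not the pipeline's own retry logic: the `retry_bounded` function name is hypothetical, and you choose the attempt limit and delay to suit your cost cap.

```shell
#!/usr/bin/env bash
# retry_bounded MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# The delay doubles after every failed attempt.
retry_bounded() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_bounded 3 60 ./scripts/daily-pipeline.sh` makes at most three attempts, waiting 60 and then 120 seconds between them. When it gives up, leave the stage paused and check the provider status page again instead of raising the limit.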