specweave 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/INSTALL.md +848 -0
- package/LICENSE +21 -0
- package/README.md +675 -0
- package/SPECWEAVE.md +665 -0
- package/bin/install-agents.sh +57 -0
- package/bin/install-all.sh +49 -0
- package/bin/install-commands.sh +56 -0
- package/bin/install-skills.sh +57 -0
- package/bin/specweave.js +81 -0
- package/dist/adapters/adapter-base.d.ts +50 -0
- package/dist/adapters/adapter-base.d.ts.map +1 -0
- package/dist/adapters/adapter-base.js +146 -0
- package/dist/adapters/adapter-base.js.map +1 -0
- package/dist/adapters/adapter-interface.d.ts +108 -0
- package/dist/adapters/adapter-interface.d.ts.map +1 -0
- package/dist/adapters/adapter-interface.js +9 -0
- package/dist/adapters/adapter-interface.js.map +1 -0
- package/dist/adapters/claude/adapter.d.ts +54 -0
- package/dist/adapters/claude/adapter.d.ts.map +1 -0
- package/dist/adapters/claude/adapter.js +184 -0
- package/dist/adapters/claude/adapter.js.map +1 -0
- package/dist/adapters/copilot/adapter.d.ts +42 -0
- package/dist/adapters/copilot/adapter.d.ts.map +1 -0
- package/dist/adapters/copilot/adapter.js +239 -0
- package/dist/adapters/copilot/adapter.js.map +1 -0
- package/dist/adapters/cursor/adapter.d.ts +42 -0
- package/dist/adapters/cursor/adapter.d.ts.map +1 -0
- package/dist/adapters/cursor/adapter.js +297 -0
- package/dist/adapters/cursor/adapter.js.map +1 -0
- package/dist/adapters/generic/adapter.d.ts +40 -0
- package/dist/adapters/generic/adapter.d.ts.map +1 -0
- package/dist/adapters/generic/adapter.js +155 -0
- package/dist/adapters/generic/adapter.js.map +1 -0
- package/dist/cli/commands/init.d.ts +6 -0
- package/dist/cli/commands/init.d.ts.map +1 -0
- package/dist/cli/commands/init.js +247 -0
- package/dist/cli/commands/init.js.map +1 -0
- package/dist/cli/commands/install.d.ts +7 -0
- package/dist/cli/commands/install.d.ts.map +1 -0
- package/dist/cli/commands/install.js +160 -0
- package/dist/cli/commands/install.js.map +1 -0
- package/dist/cli/commands/list.d.ts +6 -0
- package/dist/cli/commands/list.d.ts.map +1 -0
- package/dist/cli/commands/list.js +154 -0
- package/dist/cli/commands/list.js.map +1 -0
- package/package.json +90 -0
- package/src/adapters/README.md +312 -0
- package/src/adapters/adapter-base.ts +146 -0
- package/src/adapters/adapter-interface.ts +120 -0
- package/src/adapters/claude/README.md +241 -0
- package/src/adapters/claude/adapter.ts +157 -0
- package/src/adapters/copilot/.github/copilot/instructions.md +376 -0
- package/src/adapters/copilot/README.md +200 -0
- package/src/adapters/copilot/adapter.ts +210 -0
- package/src/adapters/cursor/.cursor/context/docs-context.md +62 -0
- package/src/adapters/cursor/.cursor/context/increments-context.md +71 -0
- package/src/adapters/cursor/.cursor/context/strategy-context.md +73 -0
- package/src/adapters/cursor/.cursor/context/tests-context.md +89 -0
- package/src/adapters/cursor/.cursorrules +325 -0
- package/src/adapters/cursor/README.md +243 -0
- package/src/adapters/cursor/adapter.ts +268 -0
- package/src/adapters/generic/README.md +277 -0
- package/src/adapters/generic/SPECWEAVE-MANUAL.md +676 -0
- package/src/adapters/generic/adapter.ts +159 -0
- package/src/adapters/registry.yaml +126 -0
- package/src/agents/architect/AGENT.md +416 -0
- package/src/agents/devops/AGENT.md +1738 -0
- package/src/agents/docs-writer/AGENT.md +239 -0
- package/src/agents/performance/AGENT.md +228 -0
- package/src/agents/pm/AGENT.md +751 -0
- package/src/agents/qa-lead/AGENT.md +150 -0
- package/src/agents/security/AGENT.md +179 -0
- package/src/agents/sre/AGENT.md +582 -0
- package/src/agents/sre/modules/backend-diagnostics.md +481 -0
- package/src/agents/sre/modules/database-diagnostics.md +509 -0
- package/src/agents/sre/modules/infrastructure.md +561 -0
- package/src/agents/sre/modules/monitoring.md +439 -0
- package/src/agents/sre/modules/security-incidents.md +421 -0
- package/src/agents/sre/modules/ui-diagnostics.md +302 -0
- package/src/agents/sre/playbooks/01-high-cpu-usage.md +204 -0
- package/src/agents/sre/playbooks/02-database-deadlock.md +241 -0
- package/src/agents/sre/playbooks/03-memory-leak.md +252 -0
- package/src/agents/sre/playbooks/04-slow-api-response.md +269 -0
- package/src/agents/sre/playbooks/05-ddos-attack.md +293 -0
- package/src/agents/sre/playbooks/06-disk-full.md +314 -0
- package/src/agents/sre/playbooks/07-service-down.md +333 -0
- package/src/agents/sre/playbooks/08-data-corruption.md +337 -0
- package/src/agents/sre/playbooks/09-cascade-failure.md +430 -0
- package/src/agents/sre/playbooks/10-rate-limit-exceeded.md +464 -0
- package/src/agents/sre/scripts/health-check.sh +230 -0
- package/src/agents/sre/scripts/log-analyzer.py +213 -0
- package/src/agents/sre/scripts/metrics-collector.sh +294 -0
- package/src/agents/sre/scripts/trace-analyzer.js +257 -0
- package/src/agents/sre/templates/incident-report.md +249 -0
- package/src/agents/sre/templates/mitigation-plan.md +375 -0
- package/src/agents/sre/templates/post-mortem.md +418 -0
- package/src/agents/sre/templates/runbook-template.md +412 -0
- package/src/agents/tech-lead/AGENT.md +263 -0
- package/src/commands/add-tasks.md +176 -0
- package/src/commands/close-increment.md +347 -0
- package/src/commands/create-increment.md +223 -0
- package/src/commands/create-project.md +528 -0
- package/src/commands/generate-docs.md +623 -0
- package/src/commands/list-increments.md +180 -0
- package/src/commands/review-docs.md +331 -0
- package/src/commands/start-increment.md +139 -0
- package/src/commands/sync-github.md +115 -0
- package/src/commands/validate-increment.md +800 -0
- package/src/hooks/README.md +252 -0
- package/src/hooks/docs-changed.sh +59 -0
- package/src/hooks/human-input-required.sh +55 -0
- package/src/hooks/post-task-completion.sh +57 -0
- package/src/hooks/pre-implementation.sh +47 -0
- package/src/skills/ado-sync/README.md +449 -0
- package/src/skills/ado-sync/SKILL.md +245 -0
- package/src/skills/ado-sync/test-cases/test-1.yaml +9 -0
- package/src/skills/ado-sync/test-cases/test-2.yaml +8 -0
- package/src/skills/ado-sync/test-cases/test-3.yaml +9 -0
- package/src/skills/bmad-method-expert/SKILL.md +628 -0
- package/src/skills/bmad-method-expert/scripts/analyze-project.js +318 -0
- package/src/skills/bmad-method-expert/scripts/check-setup.js +208 -0
- package/src/skills/bmad-method-expert/scripts/generate-template.js +1149 -0
- package/src/skills/bmad-method-expert/scripts/validate-documents.js +340 -0
- package/src/skills/bmad-method-expert/test-cases/test-1-placeholder.yaml +12 -0
- package/src/skills/bmad-method-expert/test-cases/test-2-placeholder.yaml +12 -0
- package/src/skills/bmad-method-expert/test-cases/test-3-placeholder.yaml +12 -0
- package/src/skills/brownfield-analyzer/SKILL.md +523 -0
- package/src/skills/brownfield-analyzer/test-cases/test-1-basic-analysis.yaml +48 -0
- package/src/skills/brownfield-analyzer/test-cases/test-2-placeholder.yaml +12 -0
- package/src/skills/brownfield-analyzer/test-cases/test-3-placeholder.yaml +12 -0
- package/src/skills/brownfield-onboarder/SKILL.md +625 -0
- package/src/skills/brownfield-onboarder/test-cases/test-1-placeholder.yaml +12 -0
- package/src/skills/brownfield-onboarder/test-cases/test-2-placeholder.yaml +12 -0
- package/src/skills/brownfield-onboarder/test-cases/test-3-placeholder.yaml +12 -0
- package/src/skills/calendar-system/test-cases/test-1-placeholder.yaml +12 -0
- package/src/skills/calendar-system/test-cases/test-2-placeholder.yaml +12 -0
- package/src/skills/calendar-system/test-cases/test-3-placeholder.yaml +12 -0
- package/src/skills/context-loader/SKILL.md +734 -0
- package/src/skills/context-loader/test-cases/test-1-basic-loading.yaml +39 -0
- package/src/skills/context-loader/test-cases/test-2-token-budget-exceeded.yaml +44 -0
- package/src/skills/context-loader/test-cases/test-3-section-anchors.yaml +45 -0
- package/src/skills/context-optimizer/SKILL.md +618 -0
- package/src/skills/context-optimizer/test-cases/test-1-bug-fix-narrow.yaml +97 -0
- package/src/skills/context-optimizer/test-cases/test-2-feature-focused.yaml +109 -0
- package/src/skills/context-optimizer/test-cases/test-3-architecture-broad.yaml +98 -0
- package/src/skills/cost-optimizer/SKILL.md +190 -0
- package/src/skills/cost-optimizer/test-cases/test-1-basic-comparison.yaml +75 -0
- package/src/skills/cost-optimizer/test-cases/test-2-budget-constraint.yaml +52 -0
- package/src/skills/cost-optimizer/test-cases/test-3-scale-requirement.yaml +63 -0
- package/src/skills/cost-optimizer/test-results/README.md +46 -0
- package/src/skills/design-system-architect/SKILL.md +107 -0
- package/src/skills/design-system-architect/test-cases/test-1-token-structure.yaml +23 -0
- package/src/skills/design-system-architect/test-cases/test-2-component-hierarchy.yaml +24 -0
- package/src/skills/design-system-architect/test-cases/test-3-accessibility-checklist.yaml +23 -0
- package/src/skills/diagrams-architect/SKILL.md +763 -0
- package/src/skills/diagrams-generator/SKILL.md +25 -0
- package/src/skills/diagrams-generator/test-cases/test-1.yaml +9 -0
- package/src/skills/diagrams-generator/test-cases/test-2.yaml +9 -0
- package/src/skills/diagrams-generator/test-cases/test-3.yaml +8 -0
- package/src/skills/docs-updater/README.md +48 -0
- package/src/skills/docs-updater/test-cases/test-1-placeholder.yaml +12 -0
- package/src/skills/docs-updater/test-cases/test-2-placeholder.yaml +12 -0
- package/src/skills/docs-updater/test-cases/test-3-placeholder.yaml +12 -0
- package/src/skills/dotnet-backend/SKILL.md +250 -0
- package/src/skills/e2e-playwright/README.md +506 -0
- package/src/skills/e2e-playwright/SKILL.md +457 -0
- package/src/skills/e2e-playwright/execute.js +373 -0
- package/src/skills/e2e-playwright/lib/utils.js +514 -0
- package/src/skills/e2e-playwright/package.json +33 -0
- package/src/skills/e2e-playwright/test-cases/TC-001-basic-navigation.yaml +54 -0
- package/src/skills/e2e-playwright/test-cases/TC-002-form-interaction.yaml +64 -0
- package/src/skills/e2e-playwright/test-cases/TC-003-specweave-integration.yaml +74 -0
- package/src/skills/e2e-playwright/test-cases/TC-004-accessibility-check.yaml +98 -0
- package/src/skills/figma-designer/SKILL.md +149 -0
- package/src/skills/figma-implementer/SKILL.md +148 -0
- package/src/skills/figma-mcp-connector/SKILL.md +136 -0
- package/src/skills/figma-mcp-connector/test-cases/test-1-read-file-desktop.yaml +22 -0
- package/src/skills/figma-mcp-connector/test-cases/test-2-read-file-framelink.yaml +21 -0
- package/src/skills/figma-mcp-connector/test-cases/test-3-error-handling.yaml +18 -0
- package/src/skills/figma-to-code/SKILL.md +128 -0
- package/src/skills/figma-to-code/test-cases/test-1-token-generation.yaml +29 -0
- package/src/skills/figma-to-code/test-cases/test-2-component-generation.yaml +27 -0
- package/src/skills/figma-to-code/test-cases/test-3-typescript-generation.yaml +28 -0
- package/src/skills/frontend/SKILL.md +177 -0
- package/src/skills/github-sync/SKILL.md +252 -0
- package/src/skills/github-sync/test-cases/test-1-placeholder.yaml +12 -0
- package/src/skills/github-sync/test-cases/test-2-placeholder.yaml +12 -0
- package/src/skills/github-sync/test-cases/test-3-placeholder.yaml +12 -0
- package/src/skills/hetzner-provisioner/README.md +308 -0
- package/src/skills/hetzner-provisioner/SKILL.md +251 -0
- package/src/skills/hetzner-provisioner/test-cases/test-1-basic-provision.yaml +71 -0
- package/src/skills/hetzner-provisioner/test-cases/test-2-postgres-provision.yaml +85 -0
- package/src/skills/hetzner-provisioner/test-cases/test-3-ssl-config.yaml +126 -0
- package/src/skills/hetzner-provisioner/test-results/README.md +259 -0
- package/src/skills/increment-planner/SKILL.md +889 -0
- package/src/skills/increment-planner/scripts/feature-utils.js +250 -0
- package/src/skills/increment-planner/test-cases/test-1-basic-feature.yaml +27 -0
- package/src/skills/increment-planner/test-cases/test-2-complex-feature.yaml +30 -0
- package/src/skills/increment-planner/test-cases/test-3-auto-numbering.yaml +24 -0
- package/src/skills/increment-quality-judge/SKILL.md +566 -0
- package/src/skills/increment-quality-judge/test-cases/test-1-good-spec.yaml +95 -0
- package/src/skills/increment-quality-judge/test-cases/test-2-poor-spec.yaml +108 -0
- package/src/skills/increment-quality-judge/test-cases/test-3-export-suggestions.yaml +87 -0
- package/src/skills/jira-sync/README.md +328 -0
- package/src/skills/jira-sync/SKILL.md +209 -0
- package/src/skills/jira-sync/test-cases/test-1.yaml +9 -0
- package/src/skills/jira-sync/test-cases/test-2.yaml +9 -0
- package/src/skills/jira-sync/test-cases/test-3.yaml +10 -0
- package/src/skills/nextjs/SKILL.md +176 -0
- package/src/skills/nodejs-backend/SKILL.md +181 -0
- package/src/skills/notification-system/test-cases/test-1-placeholder.yaml +12 -0
- package/src/skills/notification-system/test-cases/test-2-placeholder.yaml +12 -0
- package/src/skills/notification-system/test-cases/test-3-placeholder.yaml +12 -0
- package/src/skills/python-backend/SKILL.md +226 -0
- package/src/skills/role-orchestrator/README.md +197 -0
- package/src/skills/role-orchestrator/SKILL.md +1184 -0
- package/src/skills/role-orchestrator/test-cases/test-1-simple-product.yaml +98 -0
- package/src/skills/role-orchestrator/test-cases/test-2-quality-gate-failure.yaml +73 -0
- package/src/skills/role-orchestrator/test-cases/test-3-security-workflow.yaml +121 -0
- package/src/skills/role-orchestrator/test-cases/test-4-parallel-execution.yaml +145 -0
- package/src/skills/role-orchestrator/test-cases/test-5-feedback-loops.yaml +149 -0
- package/src/skills/skill-creator/LICENSE.txt +202 -0
- package/src/skills/skill-creator/SKILL.md +209 -0
- package/src/skills/skill-creator/scripts/init_skill.py +303 -0
- package/src/skills/skill-creator/scripts/package_skill.py +110 -0
- package/src/skills/skill-creator/scripts/quick_validate.py +65 -0
- package/src/skills/skill-creator/test-cases/test-1-placeholder.yaml +12 -0
- package/src/skills/skill-creator/test-cases/test-2-placeholder.yaml +12 -0
- package/src/skills/skill-creator/test-cases/test-3-placeholder.yaml +12 -0
- package/src/skills/skill-router/SKILL.md +497 -0
- package/src/skills/skill-router/test-cases/test-1-basic-routing.yaml +33 -0
- package/src/skills/skill-router/test-cases/test-2-ambiguous-request.yaml +42 -0
- package/src/skills/skill-router/test-cases/test-3-nested-orchestration.yaml +50 -0
- package/src/skills/spec-driven-brainstorming/README.md +264 -0
- package/src/skills/spec-driven-brainstorming/SKILL.md +439 -0
- package/src/skills/spec-driven-brainstorming/test-cases/TC-001-simple-idea-to-design.yaml +148 -0
- package/src/skills/spec-driven-brainstorming/test-cases/TC-002-complex-ultrathink-design.yaml +190 -0
- package/src/skills/spec-driven-brainstorming/test-cases/TC-003-unclear-requirements-socratic.yaml +233 -0
- package/src/skills/spec-driven-debugging/README.md +479 -0
- package/src/skills/spec-driven-debugging/SKILL.md +652 -0
- package/src/skills/spec-driven-debugging/test-cases/TC-001-simple-auth-bug.yaml +212 -0
- package/src/skills/spec-driven-debugging/test-cases/TC-002-race-condition-ultrathink.yaml +461 -0
- package/src/skills/spec-driven-debugging/test-cases/TC-003-brownfield-missing-spec.yaml +366 -0
- package/src/skills/spec-kit-expert/SKILL.md +1012 -0
- package/src/skills/spec-kit-expert/test-cases/test-1-placeholder.yaml +12 -0
- package/src/skills/spec-kit-expert/test-cases/test-2-placeholder.yaml +12 -0
- package/src/skills/spec-kit-expert/test-cases/test-3-placeholder.yaml +12 -0
- package/src/skills/specweave-ado-mapper/SKILL.md +501 -0
- package/src/skills/specweave-detector/SKILL.md +420 -0
- package/src/skills/specweave-detector/test-cases/test-1-basic-detection.yaml +37 -0
- package/src/skills/specweave-detector/test-cases/test-2-missing-config.yaml +37 -0
- package/src/skills/specweave-detector/test-cases/test-3-non-specweave-project.yaml +34 -0
- package/src/skills/specweave-jira-mapper/SKILL.md +500 -0
- package/src/skills/stripe-integrator/test-cases/test-1-placeholder.yaml +12 -0
- package/src/skills/stripe-integrator/test-cases/test-2-placeholder.yaml +12 -0
- package/src/skills/stripe-integrator/test-cases/test-3-placeholder.yaml +12 -0
- package/src/skills/task-builder/README.md +90 -0
- package/src/skills/task-builder/test-cases/test-1-placeholder.yaml +12 -0
- package/src/skills/task-builder/test-cases/test-2-placeholder.yaml +12 -0
- package/src/skills/task-builder/test-cases/test-3-placeholder.yaml +12 -0
- package/src/templates/.env.example +144 -0
- package/src/templates/.gitignore.template +81 -0
- package/src/templates/CLAUDE.md.template +383 -0
- package/src/templates/README.md.template +240 -0
- package/src/templates/config.yaml +333 -0
- package/src/templates/docs/README.md +124 -0
- package/src/templates/docs/adr-template.md +118 -0
- package/src/templates/docs/hld-template.md +220 -0
- package/src/templates/docs/lld-template.md +580 -0
- package/src/templates/docs/prd-template.md +132 -0
- package/src/templates/docs/rfc-template.md +229 -0
- package/src/templates/docs/runbook-template.md +298 -0
- package/src/templates/environments/minimal/.env.production +16 -0
- package/src/templates/environments/minimal/README.md +54 -0
- package/src/templates/environments/minimal/deploy-production.yml +52 -0
- package/src/templates/environments/progressive/.env.qa +28 -0
- package/src/templates/environments/progressive/README.md +129 -0
- package/src/templates/environments/progressive/deploy-production.yml +93 -0
- package/src/templates/environments/progressive/deploy-qa.yml +62 -0
- package/src/templates/environments/progressive/deploy-staging.yml +67 -0
- package/src/templates/environments/standard/.env.development +20 -0
- package/src/templates/environments/standard/.env.production +30 -0
- package/src/templates/environments/standard/.env.staging +23 -0
- package/src/templates/environments/standard/README.md +97 -0
- package/src/templates/environments/standard/deploy-production.yml +68 -0
- package/src/templates/environments/standard/deploy-staging.yml +61 -0
- package/src/templates/environments/standard/docker-compose.yml +43 -0
- package/src/templates/increment-metadata-template.yaml +138 -0
@@ -0,0 +1,430 @@
# Playbook: Cascade Failure

## Symptoms

- Multiple services failing simultaneously
- Failures spreading across services
- Dependency services timing out
- Error rate increasing exponentially
- Monitoring alert: "Multiple services degraded", "Cascade detected"

## Severity

- **SEV1** - Cascade affecting production services

## What is a Cascade Failure?

**Definition**: One service failure triggers failures in dependent services, spreading through the system.

**Example**:
```
Database slow (7s queries)
  ↓
API times out waiting for database (5s timeout)
  ↓
Frontend times out waiting for API (10s timeout)
  ↓
Load balancer marks frontend unhealthy
  ↓
Traffic routes to other frontends (overloading them)
  ↓
All frontends fail → Complete outage
```

---

## Diagnosis

### Step 1: Identify Initial Failure Point

**Check Service Dependencies**:
```
Frontend → API → Database
            ↓
         Cache (Redis)
            ↓
         Queue (RabbitMQ)
            ↓
         External API
```

**Find the root**:
```bash
# Check service health (start with leaf dependencies)
# 1. Database
psql -c "SELECT 1"

# 2. Cache
redis-cli PING

# 3. Queue
rabbitmqctl status

# 4. External API (-f: fail on HTTP errors; --max-time: do not hang)
curl -fsS --max-time 5 https://api.external.com/health

# First failure = likely root cause
```
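The leaf-first check above can also be scripted. The following is a minimal sketch, assuming each dependency can be probed by some async function; `findFirstFailure`, the `probes` array, and the stand-in probe results are hypothetical names for illustration and are not part of the playbook. Each probe gets its own time budget so that a hung dependency cannot stall the diagnosis itself.

```javascript
// Hypothetical helper: run health probes in dependency order (leaf first)
// and report the first one that fails or exceeds its time budget.
async function findFirstFailure(probes, timeoutMs = 2000) {
  for (const { name, check } of probes) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('probe timed out')), timeoutMs);
    });
    try {
      // Whichever settles first wins: the probe or the timeout
      await Promise.race([check(), timeout]);
    } catch (error) {
      return { name, reason: error.message }; // first failure = likely root cause
    } finally {
      clearTimeout(timer);
    }
  }
  return null; // every probe passed
}

// Usage with stand-in probes (replace with real checks for your stack)
const probes = [
  { name: 'database', check: async () => 'ok' },
  { name: 'cache', check: async () => { throw new Error('connection refused'); } },
  { name: 'queue', check: async () => 'ok' },
];

const firstFailure = findFirstFailure(probes);
firstFailure.then((result) => console.log(result));
// logs: { name: 'cache', reason: 'connection refused' }
```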

---

### Step 2: Trace Failure Propagation

**Check Service Logs** (in order):
```bash
# Database logs (first)
tail -n 100 /var/log/postgresql/postgresql.log

# API logs (second)
tail -n 100 /var/log/api/error.log

# Frontend logs (third)
tail -n 100 /var/log/frontend/error.log
```

**Look for timestamps**:
```
14:00:00 - Database: Slow query (7s)  ← ROOT CAUSE
14:00:05 - API: Timeout error
14:00:10 - Frontend: API unavailable
14:00:15 - Load balancer: All frontends unhealthy
```
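Ordering the first error seen per service gives the propagation path directly. The sketch below assumes the error events have already been extracted from the logs into objects with a service name and an ISO timestamp; `traceCascade`, the event shape, and the sample date are hypothetical. It also assumes the hosts' clocks are reasonably synchronized, since skew between machines can reorder events that are only seconds apart.

```javascript
// Hypothetical helper: given error events collected from several services,
// return the services ordered by their earliest error (earliest first).
function traceCascade(events) {
  const firstSeen = new Map();
  for (const { service, time } of events) {
    const t = Date.parse(time);
    // Keep only the earliest error per service
    if (!firstSeen.has(service) || t < firstSeen.get(service)) {
      firstSeen.set(service, t);
    }
  }
  return [...firstSeen.entries()]
    .sort((a, b) => a[1] - b[1])
    .map(([service]) => service);
}

// Sample events mirroring the timeline above (the date is made up)
const events = [
  { service: 'frontend', time: '2024-01-01T14:00:10Z' },
  { service: 'api', time: '2024-01-01T14:00:05Z' },
  { service: 'database', time: '2024-01-01T14:00:00Z' },
  { service: 'api', time: '2024-01-01T14:00:07Z' },
  { service: 'load-balancer', time: '2024-01-01T14:00:15Z' },
];

const cascadePath = traceCascade(events);
// The first entry is the likely root cause
console.log(cascadePath.join(' → ')); // database → api → frontend → load-balancer
```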

---

### Step 3: Assess Cascade Depth

**How many layers affected?**
- **1 layer**: Database only (isolated failure)
- **2-3 layers**: Database → API → Frontend (cascade)
- **4+ layers**: Full system cascade (critical)
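
The depth categories above can be encoded so that an alerting script reports them consistently. This is a small sketch; `classifyCascadeDepth` and its return strings are hypothetical, and the thresholds simply restate the list above.

```javascript
// Hypothetical helper mapping the number of affected layers to the
// categories listed above.
function classifyCascadeDepth(affectedLayers) {
  const depth = new Set(affectedLayers).size; // ignore duplicate entries
  if (depth === 0) return 'none';
  if (depth === 1) return 'isolated failure';
  if (depth <= 3) return 'cascade';
  return 'full system cascade (critical)';
}

console.log(classifyCascadeDepth(['database']));
// isolated failure
console.log(classifyCascadeDepth(['database', 'api', 'frontend']));
// cascade
console.log(classifyCascadeDepth(['database', 'api', 'frontend', 'load-balancer']));
// full system cascade (critical)
```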

---

## Mitigation

### Immediate (Now - 5 min)

**PRIORITY: Stop the cascade from spreading**

**Option A: Circuit Breaker** (if not already enabled)
```javascript
// Enable circuit breaker manually
// Prevents API from overwhelming database

const CircuitBreaker = require('opossum');

// queryDatabase: your existing async function that runs the query
const dbQuery = new CircuitBreaker(queryDatabase, {
  timeout: 3000,                 // 3s timeout
  errorThresholdPercentage: 50,  // Open after 50% failures
  resetTimeout: 30000            // Try again after 30s
});

dbQuery.on('open', () => {
  console.log('Circuit breaker OPEN - using fallback');
});

// Use fallback when circuit open
dbQuery.fallback(() => {
  return cachedData;  // Return cached data instead
});

// Route calls through the breaker instead of calling queryDatabase directly
// const rows = await dbQuery.fire(sql);
```

**Option B: Rate Limiting** (protect downstream)
```nginx
# Limit requests to the API per client IP (nginx)
# limit_req_zone belongs in the http {} context
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
  limit_req zone=api burst=20 nodelay;
  proxy_pass http://api-backend;
}
```
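To make the `rate` and `burst` parameters concrete, here is a minimal token-bucket sketch. It is an illustration of the general idea only: nginx documents its own `limit_req` algorithm as a leaky bucket, and the `TokenBucket` class, its method name, and the injectable clock are hypothetical. The fake clock keeps the example deterministic.

```javascript
// Minimal token-bucket limiter (illustrative; not how nginx is implemented).
class TokenBucket {
  constructor(ratePerSec, burst, now = () => Date.now()) {
    this.ratePerSec = ratePerSec; // refill rate
    this.capacity = burst;        // maximum stored tokens
    this.tokens = burst;          // start full
    this.now = now;               // injectable clock, handy for tests
    this.last = now();
  }

  tryRemoveToken() {
    const current = this.now();
    const elapsedSec = (current - this.last) / 1000;
    this.last = current;
    // Refill in proportion to elapsed time, capped at capacity
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;  // request allowed
    }
    return false;   // request should be rejected
  }
}

// Usage: 10 requests/second with a burst of 20, driven by a fake clock
let fakeTime = 0;
const bucket = new TokenBucket(10, 20, () => fakeTime);
let allowed = 0;
for (let i = 0; i < 25; i++) {
  if (bucket.tryRemoveToken()) allowed++;
}
console.log(allowed); // 20: the burst is used up, the other 5 are rejected
fakeTime += 1000;     // one second later, 10 tokens have been refilled
console.log(bucket.tryRemoveToken()); // true
```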

**Option C: Shed Load** (reject non-critical requests)
```javascript
// Reject non-critical requests when overloaded
app.use((req, res, next) => {
  const load = getCurrentLoad(); // CPU, memory, queue depth

  if (load > 0.8 && !isCriticalEndpoint(req.path)) {
    return res.status(503).json({
      error: 'Service overloaded, try again later'
    });
  }

  next();
});

function isCriticalEndpoint(path) {
  return ['/api/health', '/api/payment'].includes(path);
}
```
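The middleware above leaves `getCurrentLoad()` undefined. One possible implementation, sketched here as an assumption rather than the playbook's own choice, is the 1-minute load average divided by the number of CPU cores, using Node's built-in `os` module. Values near or above 1.0 suggest the host is saturated. Memory pressure and queue depth, which the original comment also mentions, would need separate signals.

```javascript
const os = require('os');

// One possible getCurrentLoad(): 1-minute load average per CPU core.
// Note: os.loadavg() always returns zeros on Windows.
function getCurrentLoad() {
  const cores = os.cpus().length || 1; // guard against an empty CPU list
  return os.loadavg()[0] / cores;
}

console.log(getCurrentLoad().toFixed(2));
```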

**Option D: Isolate Failure** (take failing service offline)
```bash
# Remove failing service from load balancer
# AWS ELB:
aws elbv2 deregister-targets \
  --target-group-arn <arn> \
  --targets Id=i-1234567890abcdef0

# nginx:
# Comment out failing backend in upstream block
# upstream api {
#   server api1.example.com;  # Healthy
#   # server api2.example.com;  # FAILING - commented out
# }

# Impact: Prevents failing service from affecting others
# Risk: Reduced capacity
```

---

### Short-term (5 min - 1 hour)

**Option A: Fix Root Cause**

**If database slow**:
```sql
-- Add missing index
-- (PostgreSQL: CONCURRENTLY avoids blocking writes, but cannot run inside a transaction)
CREATE INDEX CONCURRENTLY idx_users_last_login ON users(last_login_at);
```

**If external API slow**:
```javascript
// Add timeout + fallback
// Note: standard fetch() has no `timeout` option; use an AbortSignal instead
try {
  const response = await fetch('https://api.external.com', {
    signal: AbortSignal.timeout(2000)  // 2s timeout
  });

  if (!response.ok) {
    return fallbackData;  // Don't cascade failure
  }
  // ... use response
} catch (error) {
  return fallbackData;  // Timeout or network error: don't cascade failure
}
```

**If service overloaded**:
```bash
# Scale horizontally (add more instances)
# AWS Auto Scaling:
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 10  # Was 5
```

---

**Option B: Add Timeouts** (prevent indefinite waiting)
```javascript
// Database query timeout (the option name depends on your database client)
const result = await db.query('SELECT * FROM users', {
  timeout: 3000  // 3 second timeout
});

// API call timeout
const response = await fetch('/api/data', {
  signal: AbortSignal.timeout(5000)  // 5 second timeout
});

// Impact: Fail fast instead of cascading
// Risk: Low (better to timeout than cascade)
```

---

**Option C: Add Bulkheads** (isolate critical paths)
```javascript
// Separate connection pools for critical vs non-critical
// (assumes node-postgres; other pool libraries work similarly)
const { Pool } = require('pg');

const criticalPool = new Pool({ max: 10 });     // Payments, auth
const nonCriticalPool = new Pool({ max: 5 });   // Analytics, reports

// Critical requests get priority
app.post('/api/payment', async (req, res) => {
  const conn = await criticalPool.connect();
  try {
    // ...
  } finally {
    conn.release();  // Always return the connection to its pool
  }
});

// Non-critical requests use separate pool
app.get('/api/analytics', async (req, res) => {
  const conn = await nonCriticalPool.connect();
  try {
    // ...
  } finally {
    conn.release();
  }
});

// Impact: Critical paths protected from non-critical load
// Risk: Low (pool sizes need tuning for each path)
```

---

### Long-term (1 hour+)

**Architecture Improvements**:

- [ ] **Circuit Breakers** (all external dependencies)
- [ ] **Timeouts** (every network call, database query)
- [ ] **Retries with exponential backoff** (transient failures)
- [ ] **Bulkheads** (isolate critical paths)
- [ ] **Rate limiting** (protect downstream services)
- [ ] **Graceful degradation** (fallback data, cached responses)
- [ ] **Health checks** (detect failures early)
- [ ] **Auto-scaling** (handle load spikes)
- [ ] **Chaos engineering** (test cascade scenarios)

---

## Cascade Prevention Patterns

### 1. Circuit Breaker Pattern
```javascript
const CircuitBreaker = require('opossum');

const breaker = new CircuitBreaker(riskyOperation, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

breaker.fallback(() => cachedData);
```

**Benefits**:
- Fast failure (an open circuit rejects calls immediately instead of waiting for a timeout)
- Automatic recovery (a trial request is allowed after `resetTimeout`)
- Fallback data (graceful degradation)
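
To show what a breaker does internally, here is a simplified state machine with the three usual states (closed, open, half-open). It is a teaching sketch under stated assumptions, not opossum's implementation: it trips on a count of consecutive failures, whereas opossum uses an error percentage over a rolling window. `SimpleBreaker`, `failureThreshold`, and the injectable `now` clock are hypothetical names.

```javascript
// Simplified circuit breaker: closed -> open -> half-open -> closed/open
class SimpleBreaker {
  constructor(action, { failureThreshold = 3, resetTimeout = 30000, now = () => Date.now() } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.now = now;
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
  }

  async fire(...args) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeout) {
        throw new Error('circuit open'); // fail fast, the action is not called
      }
      this.state = 'half-open'; // reset timeout elapsed: allow one trial call
    }
    try {
      const result = await this.action(...args);
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}

// Usage with a fake clock and an always-failing action
let clock = 0;
let calls = 0;
const breaker = new SimpleBreaker(
  async () => { calls++; throw new Error('db down'); },
  { failureThreshold: 2, resetTimeout: 1000, now: () => clock }
);

async function demo() {
  for (let i = 0; i < 4; i++) {
    await breaker.fire().catch(() => {});
  }
  return { state: breaker.state, calls };
}

const demoResult = demo();
demoResult.then((summary) => console.log(summary));
// logs: { state: 'open', calls: 2 } (the last two calls never reached the action)
```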

---

### 2. Timeout Pattern
```javascript
// ALWAYS set timeouts
const response = await fetch('/api', {
  signal: AbortSignal.timeout(5000)
});
```

**Benefits**:
- Fail fast (don't cascade indefinite waits)
- Predictable behavior
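
Not every client library accepts an `AbortSignal`. For those cases, a generic wrapper built on `Promise.race` can bound how long the caller waits. This is a sketch under that assumption; `withTimeout` and its parameters are hypothetical names. Note the limitation stated in the code: the wrapper stops waiting but does not cancel the underlying work.

```javascript
// Hypothetical helper: reject if `promise` does not settle within `ms`.
// This stops the caller from waiting; it does not cancel the underlying work.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage: a slow stand-in operation (200ms) against a 50ms budget
const slow = new Promise((resolve) => setTimeout(() => resolve('done'), 200));
const timedOut = withTimeout(slow, 50, 'slow query').catch((error) => error.message);
timedOut.then((message) => console.log(message)); // slow query timed out after 50ms

// A fast operation resolves normally
const fast = withTimeout(Promise.resolve('ok'), 50);
fast.then((value) => console.log(value)); // ok
```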

---

### 3. Bulkhead Pattern
```javascript
// Separate resource pools
const criticalPool = new Pool({ max: 10 });
const nonCriticalPool = new Pool({ max: 5 });
```

**Benefits**:
- Critical paths protected
- Non-critical load can't exhaust resources
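
The same idea applies to any shared resource, not only database connections. The sketch below caps concurrent calls per compartment with a simple counter; `Bulkhead`, `run`, and the stand-in tasks are hypothetical. As a deliberate simplification, it rejects immediately when a compartment is full instead of queueing.

```javascript
// Hypothetical bulkhead: caps concurrent calls per compartment and rejects
// immediately when the compartment is full (no queueing in this sketch).
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
  }

  async run(task) {
    if (this.active >= this.maxConcurrent) {
      throw new Error('bulkhead full');
    }
    this.active += 1;
    try {
      return await task();
    } finally {
      this.active -= 1; // always free the slot, even on failure
    }
  }
}

// Usage: hung non-critical calls saturate only their own compartment
const critical = new Bulkhead(10);
const nonCritical = new Bulkhead(2);
const never = () => new Promise(() => {}); // stand-in for a hung call

nonCritical.run(never);
nonCritical.run(never);
const rejected = nonCritical.run(never).catch((error) => error.message);
const paid = critical.run(async () => 'payment ok');

const bulkheadDemo = Promise.all([rejected, paid]);
bulkheadDemo.then((results) => console.log(results));
// logs: [ 'bulkhead full', 'payment ok' ]
```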

---

### 4. Retry with Backoff
```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;  // Last attempt: give up
      await sleep(Math.pow(2, i) * 1000);  // 1s, then 2s (no wait after the final attempt)
    }
  }
}
```

**Benefits**:
- Handles transient failures
- Exponential backoff reduces retry pressure on a struggling service (add jitter so clients do not retry in lockstep)
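
When many clients fail at the same moment, plain exponential backoff makes them all retry at the same moments too. Randomizing each delay ("jitter") spreads the retries out. The following is a sketch of the commonly used "full jitter" variant; `backoffDelay`, `retryWithJitter`, and their options are hypothetical names, and the `random` and `sleep` parameters are injectable only to keep the example deterministic.

```javascript
// Hypothetical helper: "full jitter" delay for attempt n (0-based).
// The delay is uniform in [0, min(capMs, baseMs * 2^n)).
function backoffDelay(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * exponential);
}

async function retryWithJitter(fn, { retries = 3, sleep, ...delayOptions } = {}) {
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === retries - 1) throw error; // out of attempts
      await wait(backoffDelay(attempt, delayOptions));
    }
  }
}

// Usage: fails twice, then succeeds; sleep is stubbed to record the delays
const delays = [];
let attempts = 0;
const outcome = retryWithJitter(
  async () => {
    attempts += 1;
    if (attempts < 3) throw new Error('transient');
    return 'ok';
  },
  { sleep: async (ms) => { delays.push(ms); }, random: () => 0.5 }
);
outcome.then((value) => console.log(value, delays)); // ok [ 500, 1000 ]
```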

---

### 5. Load Shedding
```javascript
// Reject requests when overloaded
if (queueDepth > threshold) {
  return res.status(503).send('Overloaded');
}
```

**Benefits**:
- Prevent overload
- Protect downstream services
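
Shedding does not have to be all-or-nothing. One option is to drop the least important traffic first as load rises. The sketch below is an assumption for illustration: the priority names, thresholds, and `shouldShed` function are hypothetical and would need tuning against real traffic.

```javascript
// Hypothetical policy: the higher the load, the more request classes are shed.
// Thresholds and priority names are illustrative, not from the playbook.
const SHED_THRESHOLDS = [
  { priority: 'low', shedAbove: 0.7 },           // analytics, reports
  { priority: 'normal', shedAbove: 0.85 },       // regular API traffic
  { priority: 'critical', shedAbove: Infinity }, // payments, health checks
];

function shouldShed(priority, load) {
  const rule = SHED_THRESHOLDS.find((r) => r.priority === priority);
  if (!rule) return load > 0.7; // unknown classes are treated as low priority
  return load > rule.shedAbove;
}

console.log(shouldShed('low', 0.75));      // true
console.log(shouldShed('normal', 0.75));   // false
console.log(shouldShed('normal', 0.9));    // true
console.log(shouldShed('critical', 0.99)); // false
```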

---

## Escalation

**Escalate to architecture team if**:
- System-wide cascade
- Architectural changes needed

**Escalate to all service owners if**:
- Multiple teams affected
- Need coordinated response

**Escalate to management if**:
- Complete outage
- Large customer impact

---

## Prevention Checklist

- [ ] Circuit breakers on all external calls
- [ ] Timeouts on all network operations
- [ ] Retries with exponential backoff
- [ ] Bulkheads for critical paths
- [ ] Rate limiting (protect downstream)
- [ ] Health checks (detect failures early)
- [ ] Auto-scaling (handle load)
- [ ] Graceful degradation (fallback data)
- [ ] Chaos engineering (test failure scenarios)
- [ ] Load testing (find breaking points)

---

## Related Runbooks

- [04-slow-api-response.md](04-slow-api-response.md) - API performance
- [07-service-down.md](07-service-down.md) - Service failures
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Backend troubleshooting

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (MANDATORY for cascade failures)
- [ ] Draw cascade diagram (which services failed in order)
- [ ] Identify missing safeguards (circuit breakers, timeouts)
- [ ] Implement prevention patterns
- [ ] Test cascade scenarios (chaos engineering)
- [ ] Update this runbook if needed

---

## Cascade Failure Examples

**Netflix Outage (2012)**:
- AWS load balancer (ELB) failure in one region → API requests could not be routed → Streaming unavailable on many devices
- **Fix**: Multi-region redundancy, failover between regions

**AWS S3 Outage (2017)**:
- S3 down → Websites using S3 fail → Status dashboards fail (also on S3)
- **Fix**: Multi-region redundancy, fallback to different regions

**Google Cloud Outage (2019)**:
- Network misconfiguration → Internal services fail → External services cascade
- **Fix**: Network configuration validation, staged rollouts

---

## Key Takeaways

1. **Cascades happen when failures propagate unchecked** (missing circuit breakers and timeouts)
2. **Fix the root cause first** (not the symptoms)
3. **Fail fast, don't cascade waits** (timeouts everywhere)
4. **Graceful degradation** (fallback > failure)
5. **Test failure scenarios** (chaos engineering)