agentic-qe 1.9.4 → 2.1.0
- package/.claude/agents/qe-api-contract-validator.md +95 -1336
- package/.claude/agents/qe-chaos-engineer.md +152 -1211
- package/.claude/agents/qe-code-complexity.md +144 -707
- package/.claude/agents/qe-coverage-analyzer.md +147 -743
- package/.claude/agents/qe-deployment-readiness.md +143 -1496
- package/.claude/agents/qe-flaky-test-hunter.md +132 -1529
- package/.claude/agents/qe-fleet-commander.md +12 -12
- package/.claude/agents/qe-performance-tester.md +150 -886
- package/.claude/agents/qe-production-intelligence.md +155 -1396
- package/.claude/agents/qe-quality-analyzer.md +6 -6
- package/.claude/agents/qe-quality-gate.md +151 -648
- package/.claude/agents/qe-regression-risk-analyzer.md +132 -1150
- package/.claude/agents/qe-requirements-validator.md +149 -932
- package/.claude/agents/qe-security-scanner.md +157 -797
- package/.claude/agents/qe-test-data-architect.md +96 -1365
- package/.claude/agents/qe-test-executor.md +8 -8
- package/.claude/agents/qe-test-generator.md +145 -1540
- package/.claude/agents/qe-visual-tester.md +153 -1257
- package/.claude/agents/qx-partner.md +248 -0
- package/.claude/agents/subagents/qe-code-reviewer.md +40 -136
- package/.claude/agents/subagents/qe-coverage-gap-analyzer.md +40 -480
- package/.claude/agents/subagents/qe-data-generator.md +41 -125
- package/.claude/agents/subagents/qe-flaky-investigator.md +55 -411
- package/.claude/agents/subagents/qe-integration-tester.md +53 -141
- package/.claude/agents/subagents/qe-performance-validator.md +54 -130
- package/.claude/agents/subagents/qe-security-auditor.md +56 -114
- package/.claude/agents/subagents/qe-test-data-architect-sub.md +57 -548
- package/.claude/agents/subagents/qe-test-implementer.md +58 -551
- package/.claude/agents/subagents/qe-test-refactorer.md +65 -722
- package/.claude/agents/subagents/qe-test-writer.md +63 -726
- package/.claude/skills/accessibility-testing/SKILL.md +144 -692
- package/.claude/skills/agentic-quality-engineering/SKILL.md +176 -529
- package/.claude/skills/api-testing-patterns/SKILL.md +180 -560
- package/.claude/skills/brutal-honesty-review/SKILL.md +113 -603
- package/.claude/skills/bug-reporting-excellence/SKILL.md +116 -517
- package/.claude/skills/chaos-engineering-resilience/SKILL.md +127 -72
- package/.claude/skills/cicd-pipeline-qe-orchestrator/SKILL.md +209 -404
- package/.claude/skills/code-review-quality/SKILL.md +158 -608
- package/.claude/skills/compatibility-testing/SKILL.md +148 -38
- package/.claude/skills/compliance-testing/SKILL.md +132 -63
- package/.claude/skills/consultancy-practices/SKILL.md +114 -446
- package/.claude/skills/context-driven-testing/SKILL.md +117 -381
- package/.claude/skills/contract-testing/SKILL.md +176 -141
- package/.claude/skills/database-testing/SKILL.md +137 -130
- package/.claude/skills/exploratory-testing-advanced/SKILL.md +160 -629
- package/.claude/skills/holistic-testing-pact/SKILL.md +140 -188
- package/.claude/skills/localization-testing/SKILL.md +145 -33
- package/.claude/skills/mobile-testing/SKILL.md +132 -448
- package/.claude/skills/mutation-testing/SKILL.md +147 -41
- package/.claude/skills/performance-testing/SKILL.md +200 -546
- package/.claude/skills/quality-metrics/SKILL.md +164 -519
- package/.claude/skills/refactoring-patterns/SKILL.md +132 -699
- package/.claude/skills/regression-testing/SKILL.md +120 -926
- package/.claude/skills/risk-based-testing/SKILL.md +157 -660
- package/.claude/skills/security-testing/SKILL.md +199 -538
- package/.claude/skills/sherlock-review/SKILL.md +163 -699
- package/.claude/skills/shift-left-testing/SKILL.md +161 -465
- package/.claude/skills/shift-right-testing/SKILL.md +161 -519
- package/.claude/skills/six-thinking-hats/SKILL.md +175 -1110
- package/.claude/skills/skills-manifest.json +683 -0
- package/.claude/skills/tdd-london-chicago/SKILL.md +131 -448
- package/.claude/skills/technical-writing/SKILL.md +103 -154
- package/.claude/skills/test-automation-strategy/SKILL.md +166 -772
- package/.claude/skills/test-data-management/SKILL.md +126 -910
- package/.claude/skills/test-design-techniques/SKILL.md +179 -89
- package/.claude/skills/test-environment-management/SKILL.md +136 -91
- package/.claude/skills/test-reporting-analytics/SKILL.md +169 -92
- package/.claude/skills/testability-scoring/README.md +71 -0
- package/.claude/skills/testability-scoring/SKILL.md +245 -0
- package/.claude/skills/testability-scoring/resources/templates/config.template.js +84 -0
- package/.claude/skills/testability-scoring/resources/templates/testability-scoring.spec.template.js +532 -0
- package/.claude/skills/testability-scoring/scripts/generate-html-report.js +1007 -0
- package/.claude/skills/testability-scoring/scripts/run-assessment.sh +70 -0
- package/.claude/skills/visual-testing-advanced/SKILL.md +155 -78
- package/.claude/skills/xp-practices/SKILL.md +151 -587
- package/CHANGELOG.md +110 -0
- package/README.md +55 -21
- package/dist/agents/QXPartnerAgent.d.ts +146 -0
- package/dist/agents/QXPartnerAgent.d.ts.map +1 -0
- package/dist/agents/QXPartnerAgent.js +1831 -0
- package/dist/agents/QXPartnerAgent.js.map +1 -0
- package/dist/agents/index.d.ts +1 -0
- package/dist/agents/index.d.ts.map +1 -1
- package/dist/agents/index.js +82 -2
- package/dist/agents/index.js.map +1 -1
- package/dist/agents/lifecycle/AgentLifecycleManager.d.ts.map +1 -1
- package/dist/agents/lifecycle/AgentLifecycleManager.js +34 -31
- package/dist/agents/lifecycle/AgentLifecycleManager.js.map +1 -1
- package/dist/cli/commands/debug/agent.d.ts.map +1 -1
- package/dist/cli/commands/debug/agent.js +19 -6
- package/dist/cli/commands/debug/agent.js.map +1 -1
- package/dist/cli/commands/debug/health-check.js +20 -7
- package/dist/cli/commands/debug/health-check.js.map +1 -1
- package/dist/cli/commands/init-claude-md-template.d.ts +1 -0
- package/dist/cli/commands/init-claude-md-template.d.ts.map +1 -1
- package/dist/cli/commands/init-claude-md-template.js +18 -3
- package/dist/cli/commands/init-claude-md-template.js.map +1 -1
- package/dist/cli/commands/workflow/cancel.d.ts.map +1 -1
- package/dist/cli/commands/workflow/cancel.js +4 -3
- package/dist/cli/commands/workflow/cancel.js.map +1 -1
- package/dist/cli/commands/workflow/list.d.ts.map +1 -1
- package/dist/cli/commands/workflow/list.js +4 -3
- package/dist/cli/commands/workflow/list.js.map +1 -1
- package/dist/cli/commands/workflow/pause.d.ts.map +1 -1
- package/dist/cli/commands/workflow/pause.js +4 -3
- package/dist/cli/commands/workflow/pause.js.map +1 -1
- package/dist/cli/init/claude-config.d.ts.map +1 -1
- package/dist/cli/init/claude-config.js +3 -8
- package/dist/cli/init/claude-config.js.map +1 -1
- package/dist/cli/init/claude-md.d.ts.map +1 -1
- package/dist/cli/init/claude-md.js +44 -2
- package/dist/cli/init/claude-md.js.map +1 -1
- package/dist/cli/init/database-init.js +1 -1
- package/dist/cli/init/index.d.ts.map +1 -1
- package/dist/cli/init/index.js +13 -6
- package/dist/cli/init/index.js.map +1 -1
- package/dist/cli/init/skills.d.ts.map +1 -1
- package/dist/cli/init/skills.js +2 -1
- package/dist/cli/init/skills.js.map +1 -1
- package/dist/core/SwarmCoordinator.d.ts +180 -0
- package/dist/core/SwarmCoordinator.d.ts.map +1 -0
- package/dist/core/SwarmCoordinator.js +473 -0
- package/dist/core/SwarmCoordinator.js.map +1 -0
- package/dist/core/memory/AgentDBIntegration.d.ts +24 -6
- package/dist/core/memory/AgentDBIntegration.d.ts.map +1 -1
- package/dist/core/memory/AgentDBIntegration.js +66 -10
- package/dist/core/memory/AgentDBIntegration.js.map +1 -1
- package/dist/core/memory/UnifiedMemoryCoordinator.d.ts +341 -0
- package/dist/core/memory/UnifiedMemoryCoordinator.d.ts.map +1 -0
- package/dist/core/memory/UnifiedMemoryCoordinator.js +986 -0
- package/dist/core/memory/UnifiedMemoryCoordinator.js.map +1 -0
- package/dist/core/memory/index.d.ts +5 -0
- package/dist/core/memory/index.d.ts.map +1 -1
- package/dist/core/memory/index.js +23 -1
- package/dist/core/memory/index.js.map +1 -1
- package/dist/core/metrics/MetricsAggregator.d.ts +228 -0
- package/dist/core/metrics/MetricsAggregator.d.ts.map +1 -0
- package/dist/core/metrics/MetricsAggregator.js +482 -0
- package/dist/core/metrics/MetricsAggregator.js.map +1 -0
- package/dist/core/metrics/index.d.ts +5 -0
- package/dist/core/metrics/index.d.ts.map +1 -0
- package/dist/core/metrics/index.js +11 -0
- package/dist/core/metrics/index.js.map +1 -0
- package/dist/core/optimization/SwarmOptimizer.d.ts +190 -0
- package/dist/core/optimization/SwarmOptimizer.d.ts.map +1 -0
- package/dist/core/optimization/SwarmOptimizer.js +648 -0
- package/dist/core/optimization/SwarmOptimizer.js.map +1 -0
- package/dist/core/optimization/index.d.ts +9 -0
- package/dist/core/optimization/index.d.ts.map +1 -0
- package/dist/core/optimization/index.js +25 -0
- package/dist/core/optimization/index.js.map +1 -0
- package/dist/core/optimization/types.d.ts +53 -0
- package/dist/core/optimization/types.d.ts.map +1 -0
- package/dist/core/optimization/types.js +6 -0
- package/dist/core/optimization/types.js.map +1 -0
- package/dist/core/orchestration/AdaptiveScheduler.d.ts +190 -0
- package/dist/core/orchestration/AdaptiveScheduler.d.ts.map +1 -0
- package/dist/core/orchestration/AdaptiveScheduler.js +460 -0
- package/dist/core/orchestration/AdaptiveScheduler.js.map +1 -0
- package/dist/core/orchestration/PriorityQueue.d.ts +54 -0
- package/dist/core/orchestration/PriorityQueue.d.ts.map +1 -0
- package/dist/core/orchestration/PriorityQueue.js +122 -0
- package/dist/core/orchestration/PriorityQueue.js.map +1 -0
- package/dist/core/orchestration/WorkflowOrchestrator.d.ts +189 -0
- package/dist/core/orchestration/WorkflowOrchestrator.d.ts.map +1 -0
- package/dist/core/orchestration/WorkflowOrchestrator.js +845 -0
- package/dist/core/orchestration/WorkflowOrchestrator.js.map +1 -0
- package/dist/core/orchestration/index.d.ts +7 -0
- package/dist/core/orchestration/index.d.ts.map +1 -0
- package/dist/core/orchestration/index.js +11 -0
- package/dist/core/orchestration/index.js.map +1 -0
- package/dist/core/orchestration/types.d.ts +96 -0
- package/dist/core/orchestration/types.d.ts.map +1 -0
- package/dist/core/orchestration/types.js +6 -0
- package/dist/core/orchestration/types.js.map +1 -0
- package/dist/core/recovery/CircuitBreaker.d.ts +176 -0
- package/dist/core/recovery/CircuitBreaker.d.ts.map +1 -0
- package/dist/core/recovery/CircuitBreaker.js +382 -0
- package/dist/core/recovery/CircuitBreaker.js.map +1 -0
- package/dist/core/recovery/RecoveryOrchestrator.d.ts +186 -0
- package/dist/core/recovery/RecoveryOrchestrator.d.ts.map +1 -0
- package/dist/core/recovery/RecoveryOrchestrator.js +476 -0
- package/dist/core/recovery/RecoveryOrchestrator.js.map +1 -0
- package/dist/core/recovery/RetryStrategy.d.ts +127 -0
- package/dist/core/recovery/RetryStrategy.d.ts.map +1 -0
- package/dist/core/recovery/RetryStrategy.js +314 -0
- package/dist/core/recovery/RetryStrategy.js.map +1 -0
- package/dist/core/recovery/index.d.ts +8 -0
- package/dist/core/recovery/index.d.ts.map +1 -0
- package/dist/core/recovery/index.js +27 -0
- package/dist/core/recovery/index.js.map +1 -0
- package/dist/core/skills/DependencyResolver.d.ts +99 -0
- package/dist/core/skills/DependencyResolver.d.ts.map +1 -0
- package/dist/core/skills/DependencyResolver.js +260 -0
- package/dist/core/skills/DependencyResolver.js.map +1 -0
- package/dist/core/skills/DynamicSkillLoader.d.ts +96 -0
- package/dist/core/skills/DynamicSkillLoader.d.ts.map +1 -0
- package/dist/core/skills/DynamicSkillLoader.js +353 -0
- package/dist/core/skills/DynamicSkillLoader.js.map +1 -0
- package/dist/core/skills/ManifestGenerator.d.ts +114 -0
- package/dist/core/skills/ManifestGenerator.d.ts.map +1 -0
- package/dist/core/skills/ManifestGenerator.js +449 -0
- package/dist/core/skills/ManifestGenerator.js.map +1 -0
- package/dist/core/skills/index.d.ts +9 -0
- package/dist/core/skills/index.d.ts.map +1 -0
- package/dist/core/skills/index.js +24 -0
- package/dist/core/skills/index.js.map +1 -0
- package/dist/core/skills/types.d.ts +118 -0
- package/dist/core/skills/types.d.ts.map +1 -0
- package/dist/core/skills/types.js +7 -0
- package/dist/core/skills/types.js.map +1 -0
- package/dist/core/transport/QUICTransport.d.ts +320 -0
- package/dist/core/transport/QUICTransport.d.ts.map +1 -0
- package/dist/core/transport/QUICTransport.js +711 -0
- package/dist/core/transport/QUICTransport.js.map +1 -0
- package/dist/core/transport/index.d.ts +40 -0
- package/dist/core/transport/index.d.ts.map +1 -0
- package/dist/core/transport/index.js +46 -0
- package/dist/core/transport/index.js.map +1 -0
- package/dist/core/transport/quic-loader.d.ts +123 -0
- package/dist/core/transport/quic-loader.d.ts.map +1 -0
- package/dist/core/transport/quic-loader.js +293 -0
- package/dist/core/transport/quic-loader.js.map +1 -0
- package/dist/core/transport/quic.d.ts +154 -0
- package/dist/core/transport/quic.d.ts.map +1 -0
- package/dist/core/transport/quic.js +214 -0
- package/dist/core/transport/quic.js.map +1 -0
- package/dist/mcp/server.d.ts +9 -9
- package/dist/mcp/server.d.ts.map +1 -1
- package/dist/mcp/server.js +1 -2
- package/dist/mcp/server.js.map +1 -1
- package/dist/mcp/services/AgentRegistry.d.ts.map +1 -1
- package/dist/mcp/services/AgentRegistry.js +4 -1
- package/dist/mcp/services/AgentRegistry.js.map +1 -1
- package/dist/types/index.d.ts +2 -1
- package/dist/types/index.d.ts.map +1 -1
- package/dist/types/index.js +2 -0
- package/dist/types/index.js.map +1 -1
- package/dist/types/qx.d.ts +429 -0
- package/dist/types/qx.d.ts.map +1 -0
- package/dist/types/qx.js +71 -0
- package/dist/types/qx.js.map +1 -0
- package/dist/visualization/api/RestEndpoints.js +2 -2
- package/dist/visualization/api/RestEndpoints.js.map +1 -1
- package/dist/visualization/api/WebSocketServer.d.ts +44 -0
- package/dist/visualization/api/WebSocketServer.d.ts.map +1 -1
- package/dist/visualization/api/WebSocketServer.js +144 -23
- package/dist/visualization/api/WebSocketServer.js.map +1 -1
- package/dist/visualization/core/DataTransformer.d.ts +10 -0
- package/dist/visualization/core/DataTransformer.d.ts.map +1 -1
- package/dist/visualization/core/DataTransformer.js +60 -5
- package/dist/visualization/core/DataTransformer.js.map +1 -1
- package/dist/visualization/emit-event.d.ts +75 -0
- package/dist/visualization/emit-event.d.ts.map +1 -0
- package/dist/visualization/emit-event.js +213 -0
- package/dist/visualization/emit-event.js.map +1 -0
- package/dist/visualization/index.d.ts +1 -0
- package/dist/visualization/index.d.ts.map +1 -1
- package/dist/visualization/index.js +7 -1
- package/dist/visualization/index.js.map +1 -1
- package/docs/reference/skills.md +63 -1
- package/package.json +16 -58
|
@@ -1,1242 +1,183 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: qe-chaos-engineer
|
|
3
|
-
description: Resilience testing
|
|
3
|
+
description: Resilience testing with controlled fault injection and blast radius management
|
|
4
4
|
---
|
|
5
5
|
|
|
6
|
-
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
|
|
22
|
-
-
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
-
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
|
|
51
|
-
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
```
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
|
|
81
|
-
'p99_latency < 500ms',
|
|
82
|
-
'error_rate < 0.01',
|
|
83
|
-
'cpu_utilization < 0.70'
|
|
84
|
-
],
|
|
85
|
-
duration: '5m'
|
|
86
|
-
});
|
|
87
|
-
|
|
88
|
-
if (!steadyState.healthy) {
|
|
89
|
-
throw new Error('System not in steady state - aborting experiment');
|
|
90
|
-
}
|
|
91
|
-
|
|
92
|
-
// Setup monitoring and observability
|
|
93
|
-
await setupExperimentMonitoring({
|
|
94
|
-
metrics: ['latency', 'error_rate', 'throughput', 'resource_usage'],
|
|
95
|
-
alerts: ['critical_errors', 'cascading_failures'],
|
|
96
|
-
sampling_rate: '1s'
|
|
97
|
-
});
|
|
98
|
-
|
|
99
|
-
// Create rollback plan
|
|
100
|
-
const rollbackPlan = {
|
|
101
|
-
trigger_conditions: [
|
|
102
|
-
'error_rate > 0.05',
|
|
103
|
-
'p99_latency > 5000ms',
|
|
104
|
-
'cascading_failures_detected'
|
|
105
|
-
],
|
|
106
|
-
rollback_steps: [
|
|
107
|
-
'stop_fault_injection',
|
|
108
|
-
'restore_connection_pool',
|
|
109
|
-
'verify_recovery'
|
|
110
|
-
],
|
|
111
|
-
max_rollback_time: '30s'
|
|
112
|
-
};
|
|
113
|
-
```
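
The removed rollback plan above expresses its trigger conditions as plain strings such as `'error_rate > 0.05'`. As a rough sketch of how such strings could be checked against a metrics snapshot, the block below parses each condition and reports which ones fire. The helpers `parseCondition` and `triggeredConditions`, and the shape of the `snapshot` object, are assumptions made for illustration and are not part of agentic-qe.

```javascript
// Hypothetical helpers, not part of agentic-qe: evaluate string trigger
// conditions against a plain metrics snapshot.

// Parse 'metric > 0.05', 'metric < 500ms', or a bare boolean flag name.
function parseCondition(condition) {
  const text = condition.trim();
  const match = /^(\w+)\s*([<>])\s*([\d.]+)\s*(ms)?$/.exec(text);
  if (!match) {
    // No comparison operator: treat the whole string as a boolean flag.
    return { metric: text, operator: 'flag', threshold: null };
  }
  return { metric: match[1], operator: match[2], threshold: Number(match[3]) };
}

// Return the subset of conditions that the snapshot currently satisfies.
function triggeredConditions(conditions, metrics) {
  return conditions.filter((condition) => {
    const { metric, operator, threshold } = parseCondition(condition);
    const value = metrics[metric];
    if (operator === 'flag') return value === true;
    // Assumption: a missing or non-numeric metric does not trigger a rollback.
    if (typeof value !== 'number') return false;
    return operator === '>' ? value > threshold : value < threshold;
  });
}

const rollbackTriggers = [
  'error_rate > 0.05',
  'p99_latency > 5000ms',
  'cascading_failures_detected'
];

// Example snapshot; latency is assumed to be reported in milliseconds.
const snapshot = {
  error_rate: 0.08,
  p99_latency: 1200,
  cascading_failures_detected: false
};

const fired = triggeredConditions(rollbackTriggers, snapshot);
console.log(fired); // [ 'error_rate > 0.05' ]
```

A real safety check would probably fail closed when a metric is missing; the sketch ignores missing metrics only to keep the example short.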
|
|
114
|
-
|
|
115
|
-
### Phase 3: Fault Injection Execution
|
|
116
|
-
```javascript
|
|
117
|
-
// Gradually inject fault
|
|
118
|
-
const faultInjection = {
|
|
119
|
-
target: 'postgres-connection-pool',
|
|
120
|
-
method: 'gradual-exhaustion',
|
|
121
|
-
timeline: [
|
|
122
|
-
{ time: '0s', connections_available: 100, percentage: 100 },
|
|
123
|
-
{ time: '30s', connections_available: 75, percentage: 75 },
|
|
124
|
-
{ time: '60s', connections_available: 50, percentage: 50 },
|
|
125
|
-
{ time: '90s', connections_available: 25, percentage: 25 },
|
|
126
|
-
{ time: '120s', connections_available: 10, percentage: 10 },
|
|
127
|
-
{ time: '150s', connections_available: 0, percentage: 0 }
|
|
128
|
-
]
|
|
129
|
-
};
|
|
130
|
-
|
|
131
|
-
// Execute fault injection with real-time monitoring
|
|
132
|
-
await executeFaultInjection({
|
|
133
|
-
config: faultInjection,
|
|
134
|
-
monitoring: true,
|
|
135
|
-
auto_rollback: rollbackPlan,
|
|
136
|
-
safety_checks: 'continuous'
|
|
137
|
-
});
|
|
138
|
-
```
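
The removed `timeline` above describes a stepwise reduction of available connections. One way to read it is as a step function: at any elapsed time, the most recent step whose start time has passed is in effect. The sketch below illustrates that reading; `parseSeconds` and `connectionsAt` are hypothetical helpers, and the assumption that the timeline is sorted by ascending time is mine rather than the package's.

```javascript
// Hypothetical helpers, not part of agentic-qe: resolve the active step of a
// gradual-exhaustion timeline for a given elapsed time.

// Convert a duration string such as '30s' into a number of seconds.
function parseSeconds(text) {
  const match = /^(\d+)s$/.exec(text);
  if (!match) throw new Error(`Unsupported time format: ${text}`);
  return Number(match[1]);
}

// Return the connection budget in effect after `elapsedSeconds`.
// Assumes `timeline` is non-empty and sorted by ascending `time`.
function connectionsAt(timeline, elapsedSeconds) {
  let current = timeline[0];
  for (const step of timeline) {
    // Keep the last step whose start time has already passed.
    if (parseSeconds(step.time) <= elapsedSeconds) current = step;
  }
  return current.connections_available;
}

const timeline = [
  { time: '0s', connections_available: 100 },
  { time: '30s', connections_available: 75 },
  { time: '60s', connections_available: 50 },
  { time: '90s', connections_available: 25 },
  { time: '120s', connections_available: 10 },
  { time: '150s', connections_available: 0 }
];

console.log(connectionsAt(timeline, 45));  // 75
console.log(connectionsAt(timeline, 150)); // 0
```

Holding each step until the next one starts keeps the escalation predictable, which makes it easier to relate an observed failure to a specific capacity level.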
|
|
139
|
-
|
|
140
|
-
### Phase 4: Observability & Analysis
|
|
141
|
-
```javascript
|
|
142
|
-
// Collect experiment telemetry
|
|
143
|
-
const telemetry = {
|
|
144
|
-
system_metrics: collectSystemMetrics(),
|
|
145
|
-
application_logs: collectApplicationLogs(),
|
|
146
|
-
distributed_traces: collectDistributedTraces(),
|
|
147
|
-
user_impact: measureUserImpact()
|
|
148
|
-
};
|
|
149
|
-
|
|
150
|
-
// Analyze system behavior under chaos
|
|
151
|
-
const analysis = {
|
|
152
|
-
hypothesis_validated: telemetry.error_rate < 0.05,
|
|
153
|
-
recovery_time: calculateRecoveryTime(telemetry),
|
|
154
|
-
blast_radius_contained: telemetry.affected_services.length === 1,
|
|
155
|
-
graceful_degradation: telemetry.partial_functionality_maintained
|
|
156
|
-
};
|
|
157
|
-
|
|
158
|
-
// Generate insights
|
|
159
|
-
const insights = generateResilienceInsights({
|
|
160
|
-
telemetry,
|
|
161
|
-
analysis,
|
|
162
|
-
experiment
|
|
163
|
-
});
|
|
164
|
-
```
|
|
165
|
-
|
|
166
|
-
## Integration Points
|
|
167
|
-
|
|
168
|
-
### Memory Coordination
|
|
169
|
-
```typescript
|
|
170
|
-
// Store experiment configuration
|
|
171
|
-
await this.memoryStore.store(`aqe/chaos/experiments/${experimentId}`, experimentConfig, {
|
|
172
|
-
partition: 'coordination',
|
|
173
|
-
ttl: 86400 // 24 hours
|
|
174
|
-
});
|
|
175
|
-
|
|
176
|
-
// Store safety constraints
|
|
177
|
-
await this.memoryStore.store('aqe/chaos/safety/constraints', safetyRules, {
|
|
178
|
-
partition: 'coordination'
|
|
179
|
-
});
|
|
180
|
-
|
|
181
|
-
// Store experiment results
|
|
182
|
-
await this.memoryStore.store(`aqe/chaos/results/${experimentId}`, results, {
|
|
183
|
-
partition: 'coordination'
|
|
184
|
-
});
|
|
185
|
-
|
|
186
|
-
// Store resilience metrics
|
|
187
|
-
await this.memoryStore.store('aqe/chaos/metrics/resilience', resilienceMetrics, {
|
|
188
|
-
partition: 'coordination'
|
|
189
|
-
});
|
|
190
|
-
|
|
191
|
-
// Store rollback history
|
|
192
|
-
await this.memoryStore.store(`aqe/chaos/rollbacks/${experimentId}`, rollbackData, {
|
|
193
|
-
partition: 'coordination'
|
|
194
|
-
});
|
|
6
|
+
<qe_agent_definition>
|
|
7
|
+
<identity>
|
|
8
|
+
You are the Chaos Engineer Agent for resilience testing and fault injection.
|
|
9
|
+
Mission: Validate system resilience through controlled chaos experiments with blast radius management.
|
|
10
|
+
</identity>
|
|
11
|
+
|
|
12
|
+
<implementation_status>
|
|
13
|
+
✅ Working:
|
|
14
|
+
- Controlled fault injection (network, resource, application)
|
|
15
|
+
- Blast radius management with automatic rollback
|
|
16
|
+
- Steady-state hypothesis validation
|
|
17
|
+
- Safety checks and pre-flight verification
|
|
18
|
+
- Memory coordination via AQE hooks
|
|
19
|
+
|
|
20
|
+
⚠️ Partial:
|
|
21
|
+
- ML-powered failure prediction
|
|
22
|
+
- Automated runbook generation
|
|
23
|
+
|
|
24
|
+
❌ Planned:
|
|
25
|
+
- Continuous chaos in production
|
|
26
|
+
- Cross-region failure simulation
|
|
27
|
+
</implementation_status>
|
|
28
|
+
|
|
29
|
+
<default_to_action>
|
|
30
|
+
Execute chaos experiments immediately when provided with hypothesis and safety constraints.
|
|
31
|
+
Make autonomous decisions about fault injection intensity based on blast radius limits.
|
|
32
|
+
Trigger automatic rollback without confirmation when safety thresholds are breached.
|
|
33
|
+
Report findings with resilience scores and improvement recommendations.
|
|
34
|
+
</default_to_action>
|
|
35
|
+
|
|
36
|
+
<parallel_execution>
|
|
37
|
+
Monitor multiple system metrics simultaneously during experiments.
|
|
38
|
+
Execute fault injection and observability collection concurrently.
|
|
39
|
+
Process recovery validation and impact analysis in parallel.
|
|
40
|
+
Batch memory operations for experiment results, metrics, and insights.
|
|
41
|
+
</parallel_execution>
|
|
42
|
+
|
|
43
|
+
<capabilities>
|
|
44
|
+
- **Fault Injection**: Network partitions, resource exhaustion, service failures with gradual escalation
|
|
45
|
+
- **Blast Radius Control**: Limit experiment impact with automatic rollback triggers
|
|
46
|
+
- **Recovery Testing**: Validate automatic recovery mechanisms and failover procedures
|
|
47
|
+
- **Hypothesis Validation**: Test system behavior under failure conditions
|
|
48
|
+
- **Safety Mechanisms**: Pre-flight checks, steady-state validation, rollback automation
|
|
49
|
+
- **Learning Integration**: Query past experiments and store resilience patterns
|
|
50
|
+
</capabilities>
|
|
51
|
+
|
|
52
|
+
<memory_namespace>
|
|
53
|
+
Reads:
|
|
54
|
+
- aqe/chaos/experiments/queue - Pending chaos experiments
|
|
55
|
+
- aqe/chaos/safety/constraints - Safety rules and blast radius limits
|
|
56
|
+
- aqe/system/health - Current system health status
|
|
57
|
+
- aqe/learning/patterns/chaos-testing/* - Learned resilience strategies
|
|
58
|
+
|
|
59
|
+
Writes:
|
|
60
|
+
- aqe/chaos/experiments/results - Experiment outcomes and analysis
|
|
61
|
+
- aqe/chaos/metrics/resilience - Resilience scores and trends
|
|
62
|
+
- aqe/chaos/failures/discovered - Newly discovered failure modes
|
|
63
|
+
- aqe/chaos/rollbacks/history - Rollback events and reasons
|
|
64
|
+
|
|
65
|
+
Coordination:
|
|
66
|
+
- aqe/chaos/status - Current experiment status
|
|
67
|
+
- aqe/chaos/alerts - Real-time chaos alerts
|
|
68
|
+
- aqe/chaos/blast-radius - Live blast radius tracking
|
|
69
|
+
</memory_namespace>
|
|
70
|
+
|
|
71
|
+
<learning_protocol>
|
|
72
|
+
Query before experiment:
|
|
73
|
+
```javascript
|
|
74
|
+
mcp__agentic_qe__learning_query({
|
|
75
|
+
agentId: "qe-chaos-engineer",
|
|
76
|
+
taskType: "chaos-testing",
|
|
77
|
+
minReward: 0.8,
|
|
78
|
+
queryType: "all",
|
|
79
|
+
limit: 10
|
|
80
|
+
})
|
|
195
81
|
```
|
|
196
82
|
|
|
197
|
-
|
|
83
|
+
Store after completion:
|
|
198
84
|
```javascript
|
|
199
|
-
// Subscribe to chaos events
|
|
200
|
-
eventBus.subscribe('chaos:experiment-started', (event) => {
|
|
201
|
-
monitoringAgent.increaseAlertSensitivity();
|
|
202
|
-
});
|
|
203
|
-
|
|
204
|
-
eventBus.subscribe('chaos:fault-injected', (event) => {
|
|
205
|
-
loggingAgent.captureDetailedLogs(event.target);
|
|
206
|
-
});
|
|
207
|
-
|
|
208
|
-
eventBus.subscribe('chaos:rollback-triggered', (event) => {
|
|
209
|
-
alertingAgent.notifyOnCall(event.reason);
|
|
210
|
-
});
|
|
211
|
-
|
|
212
|
-
// Broadcast chaos events
|
|
213
|
-
eventBus.publish('chaos:steady-state-violated', {
|
|
214
|
-
experiment_id: 'exp-123',
|
|
215
|
-
metric: 'error_rate',
|
|
216
|
-
threshold: 0.05,
|
|
217
|
-
actual: 0.08,
|
|
218
|
-
action: 'auto-rollback'
|
|
219
|
-
});
|
|
220
|
-
```
|
|
221
|
-
|
|
222
|
-
### Agent Collaboration
|
|
223
|
-
- **QE Test Executor**: Coordinates chaos experiments with test execution
|
|
224
|
-
- **QE Performance Tester**: Validates performance under chaos conditions
|
|
225
|
-
- **QE Security Scanner**: Tests security resilience during failures
|
|
226
|
-
- **QE Coverage Analyzer**: Measures chaos experiment coverage
|
|
227
|
-
- **Fleet Commander**: Reports chaos experiment impact on fleet health
|
|
228
|
-
|
|
229
|
-
## Coordination Protocol
|
|
230
|
-
|
|
231
|
-
This agent uses **AQE hooks (Agentic QE native hooks)** for coordination (zero external dependencies, 100-500x faster).
|
|
232
|
-
|
|
233
|
-
**Automatic Lifecycle Hooks:**
|
|
234
|
-
```typescript
|
|
235
|
-
// Called automatically by BaseAgent
|
|
236
|
-
protected async onPreTask(data: { assignment: TaskAssignment }): Promise<void> {
|
|
237
|
-
// Load experiment queue and safety constraints
|
|
238
|
-
const experiments = await this.memoryStore.retrieve('aqe/chaos/experiments/queue');
|
|
239
|
-
const safetyRules = await this.memoryStore.retrieve('aqe/chaos/safety/constraints');
|
|
240
|
-
const systemHealth = await this.memoryStore.retrieve('aqe/system/health');
|
|
241
|
-
|
|
242
|
-
// Verify environment for chaos testing
|
|
243
|
-
const verification = await this.hookManager.executePreTaskVerification({
|
|
244
|
-
task: 'chaos-experiment',
|
|
245
|
-
context: {
|
|
246
|
-
requiredVars: ['CHAOS_ENABLED', 'BLAST_RADIUS_MAX'],
|
|
247
|
-
minMemoryMB: 1024,
|
|
248
|
-
requiredKeys: ['aqe/chaos/safety/constraints', 'aqe/system/health']
|
|
249
|
-
}
|
|
250
|
-
});
|
|
251
|
-
|
|
252
|
-
// Emit chaos experiment starting event
|
|
253
|
-
this.eventBus.emit('chaos:experiment-starting', {
|
|
254
|
-
agentId: this.agentId,
|
|
255
|
-
experimentName: data.assignment.task.metadata.experimentName,
|
|
256
|
-
blastRadius: data.assignment.task.metadata.blastRadius
|
|
257
|
-
});
|
|
258
|
-
|
|
259
|
-
this.logger.info('Chaos experiment initialized', {
|
|
260
|
-
pendingExperiments: experiments?.length || 0,
|
|
261
|
-
systemHealthy: systemHealth?.healthy || false,
|
|
262
|
-
verification: verification.passed
|
|
263
|
-
});
|
|
264
|
-
}
|
|
265
|
-
|
|
266
|
-
protected async onPostTask(data: { assignment: TaskAssignment; result: any }): Promise<void> {
|
|
267
|
-
// Store experiment results and resilience metrics
|
|
268
|
-
await this.memoryStore.store('aqe/chaos/experiments/results', data.result.experimentOutcomes, {
|
|
269
|
-
partition: 'agent_results',
|
|
270
|
-
ttl: 86400 // 24 hours
|
|
271
|
-
});
|
|
272
|
-
|
|
273
|
-
await this.memoryStore.store('aqe/chaos/metrics/resilience', data.result.resilienceMetrics, {
|
|
274
|
-
partition: 'metrics',
|
|
275
|
-
ttl: 604800 // 7 days
|
|
276
|
-
});
|
|
277
|
-
|
|
278
|
-
// Store chaos experiment metrics
|
|
279
|
-
await this.memoryStore.store('aqe/chaos/metrics/experiment', {
|
|
280
|
-
timestamp: Date.now(),
|
|
281
|
-
experimentName: data.result.experimentName,
|
|
282
|
-
passed: data.result.steadyStateValidated,
|
|
283
|
-
rollbackTriggered: data.result.rollbackTriggered,
|
|
284
|
-
recoveryTime: data.result.recoveryTime
|
|
285
|
-
}, {
|
|
286
|
-
partition: 'metrics',
|
|
287
|
-
ttl: 604800 // 7 days
|
|
288
|
-
});
|
|
289
|
-
|
|
290
|
-
// Emit completion event with chaos experiment results
|
|
291
|
-
this.eventBus.emit('chaos:experiment-completed', {
|
|
292
|
-
agentId: this.agentId,
|
|
293
|
-
experimentId: data.assignment.id,
|
|
294
|
-
passed: data.result.steadyStateValidated,
|
|
295
|
-
rollbackTriggered: data.result.rollbackTriggered
|
|
296
|
-
});
|
|
297
|
-
|
|
298
|
-
// Validate chaos experiment results
|
|
299
|
-
const validation = await this.hookManager.executePostTaskValidation({
|
|
300
|
-
task: 'chaos-experiment',
|
|
301
|
-
result: {
|
|
302
|
-
output: data.result,
|
|
303
|
-
passed: data.result.steadyStateValidated,
|
|
304
|
-
metrics: {
|
|
305
|
-
recoveryTime: data.result.recoveryTime,
|
|
306
|
-
blastRadius: data.result.blastRadius
|
|
307
|
-
}
|
|
308
|
-
}
|
|
309
|
-
});
|
|
310
|
-
|
|
311
|
-
this.logger.info('Chaos experiment completed', {
|
|
312
|
-
experimentName: data.result.experimentName,
|
|
313
|
-
passed: data.result.steadyStateValidated,
|
|
314
|
-
validated: validation.passed
|
|
315
|
-
});
|
|
316
|
-
}
|
|
317
|
-
|
|
318
|
-
protected async onTaskError(data: { assignment: TaskAssignment; error: Error }): Promise<void> {
|
|
319
|
-
// Store error for fleet analysis
|
|
320
|
-
await this.memoryStore.store(`aqe/errors/${data.assignment.task.id}`, {
|
|
321
|
-
error: data.error.message,
|
|
322
|
-
timestamp: Date.now(),
|
|
323
|
-
agent: this.agentId,
|
|
324
|
-
taskType: 'chaos-engineering',
|
|
325
|
-
experimentName: data.assignment.task.metadata.experimentName
|
|
326
|
-
}, {
|
|
327
|
-
partition: 'errors',
|
|
328
|
-
ttl: 604800 // 7 days
|
|
329
|
-
});
|
|
330
|
-
|
|
331
|
-
// Emit error event for fleet coordination
|
|
332
|
-
this.eventBus.emit('chaos:experiment-error', {
|
|
333
|
-
agentId: this.agentId,
|
|
334
|
-
error: data.error.message,
|
|
335
|
-
taskId: data.assignment.task.id
|
|
336
|
-
});
|
|
337
|
-
|
|
338
|
-
this.logger.error('Chaos experiment failed', {
|
|
339
|
-
error: data.error.message,
|
|
340
|
-
stack: data.error.stack
|
|
341
|
-
});
|
|
342
|
-
}
|
|
343
|
-
```
|
|
344
|
-
|
|
345
|
-
**Advanced Verification (Optional):**
|
|
346
|
-
```typescript
|
|
347
|
-
// Use VerificationHookManager for comprehensive validation
|
|
348
|
-
const hookManager = new VerificationHookManager(this.memoryStore);
|
|
349
|
-
const verification = await hookManager.executePreTaskVerification({
|
|
350
|
-
task: 'chaos-experiment',
|
|
351
|
-
context: {
|
|
352
|
-
requiredVars: ['CHAOS_ENABLED', 'BLAST_RADIUS_MAX'],
|
|
353
|
-
minMemoryMB: 1024,
|
|
354
|
-
requiredKeys: ['aqe/chaos/safety/constraints', 'aqe/system/health']
|
|
355
|
-
}
|
|
356
|
-
});
|
|
357
|
-
```
|
|
358
|
-
|
|
359
|
-
## Learning Protocol (Phase 6 - Option C Implementation)
|
|
360
|
-
|
|
361
|
-
**⚠️ MANDATORY**: When executed via Claude Code Task tool, you MUST call learning MCP tools to persist learning data.
|
|
362
|
-
|
|
363
|
-
### Required Learning Actions (Call AFTER Task Completion)
|
|
364
|
-
|
|
365
|
-
**1. Store Learning Experience:**
|
|
366
|
-
```typescript
|
|
367
|
-
// Call this MCP tool after completing your task
|
|
368
85
|
mcp__agentic_qe__learning_store_experience({
|
|
369
86
|
agentId: "qe-chaos-engineer",
|
|
370
87
|
taskType: "chaos-testing",
|
|
371
|
-
reward: 0.95,
|
|
88
|
+
reward: 0.95,
|
|
372
89
|
outcome: {
|
|
373
|
-
// Your actual results
|
|
374
90
|
experimentsRun: 5,
|
|
375
91
|
vulnerabilitiesFound: 3,
|
|
376
92
|
recoveryTime: 23,
|
|
377
93
|
executionTime: 8000
|
|
378
94
|
},
|
|
379
95
|
metadata: {
|
|
380
|
-
// Additional context
|
|
381
96
|
blastRadiusManagement: true,
|
|
382
|
-
faultTypes: ["network-partition", "pod-kill"
|
|
97
|
+
faultTypes: ["network-partition", "pod-kill"],
|
|
383
98
|
controlledRollback: true
|
|
384
99
|
}
|
|
385
100
|
})
|
|
386
101
|
```
|
|
387
102
|
|
|
388
|
-
|
|
389
|
-
```
|
|
390
|
-
// Store Q-value for the strategy you used
|
|
391
|
-
mcp__agentic_qe__learning_store_qvalue({
|
|
392
|
-
agentId: "qe-chaos-engineer",
|
|
393
|
-
stateKey: "chaos-testing-state",
|
|
394
|
-
actionKey: "controlled-fault-injection",
|
|
395
|
-
qValue: 0.85, // Expected value of this approach (based on results)
|
|
396
|
-
metadata: {
|
|
397
|
-
// Strategy details
|
|
398
|
-
injectionStrategy: "gradual-escalation",
|
|
399
|
-
safetyLevel: 0.95,
|
|
400
|
-
effectiveness: 0.90
|
|
401
|
-
}
|
|
402
|
-
})
|
|
403
|
-
```
|
|
404
|
-
|
|
405
|
-
**3. Store Successful Patterns:**
|
|
406
|
-
```typescript
|
|
407
|
-
// If you discovered a useful pattern, store it
|
|
103
|
+
Store patterns when discovered:
|
|
104
|
+
```javascript
|
|
408
105
|
mcp__agentic_qe__learning_store_pattern({
|
|
409
|
-
|
|
410
|
-
|
|
411
|
-
confidence: 0.95, // How confident you are (0-1)
|
|
106
|
+
pattern: "Gradual fault injection with blast radius monitoring prevents cascading failures while discovering vulnerabilities",
|
|
107
|
+
confidence: 0.95,
|
|
412
108
|
domain: "resilience",
|
|
413
109
|
metadata: {
|
|
414
|
-
|
|
415
|
-
resiliencePatterns: ["circuit-breaker", "bulkhead", "timeout"],
|
|
110
|
+
resiliencePatterns: ["circuit-breaker", "bulkhead"],
|
|
416
111
|
predictionAccuracy: 0.92
|
|
417
112
|
}
|
|
418
113
|
})
|
|
419
114
|
```
|
|
420
115
|
|
|
421
|
-
|
|
422
|
-
|
|
423
|
-
|
|
424
|
-
|
|
425
|
-
|
|
426
|
-
|
|
427
|
-
|
|
428
|
-
|
|
429
|
-
|
|
430
|
-
|
|
431
|
-
|
|
432
|
-
|
|
433
|
-
|
|
434
|
-
|
|
435
|
-
|
|
436
|
-
|
|
437
|
-
|
|
438
|
-
|
|
439
|
-
|
|
440
|
-
|
|
441
|
-
|
|
442
|
-
|
|
443
|
-
|
|
444
|
-
|
|
445
|
-
|
|
446
|
-
|
|
447
|
-
|
|
448
|
-
|
|
449
|
-
|
|
450
|
-
|
|
451
|
-
|
|
452
|
-
|
|
453
|
-
|
|
454
|
-
|
|
455
|
-
|
|
456
|
-
|
|
457
|
-
|
|
458
|
-
|
|
459
|
-
|
|
460
|
-
-
|
|
461
|
-
-
|
|
462
|
-
-
|
|
463
|
-
-
|
|
464
|
-
-
|
|
465
|
-
|
|
466
|
-
|
|
467
|
-
|
|
468
|
-
|
|
469
|
-
|
|
470
|
-
|
|
471
|
-
|
|
472
|
-
|
|
473
|
-
|
|
474
|
-
|
|
475
|
-
|
|
476
|
-
|
|
477
|
-
|
|
478
|
-
|
|
479
|
-
|
|
480
|
-
|
|
481
|
-
|
|
482
|
-
|
|
483
|
-
|
|
484
|
-
|
|
485
|
-
|
|
486
|
-
|
|
487
|
-
|
|
488
|
-
|
|
489
|
-
discountFactor: 0.95
|
|
490
|
-
});
|
|
491
|
-
|
|
492
|
-
await learningEngine.initialize();
|
|
493
|
-
|
|
494
|
-
// Record chaos experiment episode
|
|
495
|
-
await learningEngine.recordEpisode({
|
|
496
|
-
state: {
|
|
497
|
-
experimentType: 'network-partition',
|
|
498
|
-
target: 'database-cluster',
|
|
499
|
-
systemHealth: 'healthy',
|
|
500
|
-
blastRadius: 'controlled'
|
|
501
|
-
},
|
|
502
|
-
action: {
|
|
503
|
-
faultType: 'network-partition',
|
|
504
|
-
duration: 120,
|
|
505
|
-
intensity: 'gradual',
|
|
506
|
-
autoRollback: true
|
|
507
|
-
},
|
|
508
|
-
reward: hypothesisValidated ? 1.0 : (systemRecovered ? 0.5 : -1.0),
|
|
509
|
-
nextState: {
|
|
510
|
-
steadyStateValidated: true,
|
|
511
|
-
recoveryTime: 23,
|
|
512
|
-
rollbackTriggered: false
|
|
513
|
-
}
|
|
514
|
-
});
|
|
515
|
-
|
|
516
|
-
// Learn from chaos experiment outcomes
|
|
517
|
-
await learningEngine.learn();
|
|
518
|
-
|
|
519
|
-
// Get learned experiment parameters
|
|
520
|
-
const prediction = await learningEngine.predict({
|
|
521
|
-
experimentType: 'network-partition',
|
|
522
|
-
target: 'database-cluster',
|
|
523
|
-
systemHealth: 'healthy'
|
|
524
|
-
});
|
|
525
|
-
```
|
|
526
|
-
|
|
527
|
-
### Reward Function
|
|
528
|
-
|
|
529
|
-
```typescript
|
|
530
|
-
function calculateChaosReward(outcome: ChaosExperimentOutcome): number {
|
|
531
|
-
let reward = 0;
|
|
532
|
-
|
|
533
|
-
// Base reward for hypothesis validation
|
|
534
|
-
if (outcome.hypothesisValidated) {
|
|
535
|
-
reward += 1.0;
|
|
536
|
-
} else {
|
|
537
|
-
reward -= 0.5;
|
|
538
|
-
}
|
|
539
|
-
|
|
540
|
-
// Reward for controlled blast radius
|
|
541
|
-
if (outcome.blastRadiusContained) {
|
|
542
|
-
reward += 0.5;
|
|
543
|
-
} else {
|
|
544
|
-
reward -= 2.0; // Large penalty for uncontrolled chaos
|
|
545
|
-
}
|
|
546
|
-
|
|
547
|
-
// Reward for quick recovery
|
|
548
|
-
const recoveryBonus = Math.max(0, (60 - outcome.recoveryTime) / 60);
|
|
549
|
-
reward += recoveryBonus * 0.5;
|
|
550
|
-
|
|
551
|
-
// Penalty for needing rollback (but less than uncontrolled)
|
|
552
|
-
if (outcome.rollbackTriggered) {
|
|
553
|
-
reward -= 0.3;
|
|
554
|
-
}
|
|
555
|
-
|
|
556
|
-
// Bonus for discovering new failure modes
|
|
557
|
-
if (outcome.newFailureModeDiscovered) {
|
|
558
|
-
reward += 1.0;
|
|
559
|
-
}
|
|
560
|
-
|
|
561
|
-
// Penalty for zero learning (experiment too safe or trivial)
|
|
562
|
-
if (outcome.steadyStateNeverDisturbed) {
|
|
563
|
-
reward -= 0.2;
|
|
564
|
-
}
|
|
565
|
-
|
|
566
|
-
return reward;
|
|
567
|
-
}
|
|
568
|
-
```
|
|
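As a worked check of the removed `calculateChaosReward` function, the sketch below restates its terms and applies them to an outcome shaped like the sample experiment report in this file (hypothesis validated, blast radius contained, 23 s recovery, no rollback). The `ChaosExperimentOutcome` interface is an assumption inferred from the fields the function reads; the package's own type may differ.

```typescript
// Assumed shape: only the fields the removed function reads.
interface ChaosExperimentOutcome {
  hypothesisValidated: boolean;
  blastRadiusContained: boolean;
  recoveryTime: number; // seconds
  rollbackTriggered: boolean;
  newFailureModeDiscovered: boolean;
  steadyStateNeverDisturbed: boolean;
}

// Same terms as the removed function, in the same order.
function calculateChaosReward(outcome: ChaosExperimentOutcome): number {
  let reward = outcome.hypothesisValidated ? 1.0 : -0.5;
  reward += outcome.blastRadiusContained ? 0.5 : -2.0;
  reward += Math.max(0, (60 - outcome.recoveryTime) / 60) * 0.5;
  if (outcome.rollbackTriggered) reward -= 0.3;
  if (outcome.newFailureModeDiscovered) reward += 1.0;
  if (outcome.steadyStateNeverDisturbed) reward -= 0.2;
  return reward;
}

// Values taken from the sample report; the last two flags are assumed false.
const reportOutcome: ChaosExperimentOutcome = {
  hypothesisValidated: true,
  blastRadiusContained: true,
  recoveryTime: 23,
  rollbackTriggered: false,
  newFailureModeDiscovered: false,
  steadyStateNeverDisturbed: false,
};

const chaosReward = calculateChaosReward(reportOutcome);
console.log(chaosReward.toFixed(3)); // 1.5 + (37 / 60) * 0.5 = 1.808
```

The result exceeds 1.0, so this function's scale does not appear to be directly comparable with the 0 to 1 `reward` value passed to `mcp__agentic_qe__learning_store_experience`.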
569
|
-
|
|
570
|
-
### Learning Metrics
|
|
571
|
-
|
|
572
|
-
Track learning progress:
|
|
573
|
-
- **Hypothesis Validation Rate**: Percentage of experiments that validate hypotheses
|
|
574
|
-
- **Blast Radius Control**: Success rate of blast radius containment
|
|
575
|
-
- **Recovery Time**: Average and p95 recovery time
|
|
576
|
-
- **Rollback Rate**: Percentage of experiments requiring rollback
|
|
577
|
-
- **Failure Mode Discovery**: Rate of discovering new failure modes
|
|
578
|
-
|
|
579
|
-
```bash
|
|
580
|
-
# View learning metrics
|
|
581
|
-
aqe learn status --agent qe-chaos-engineer
|
|
582
|
-
|
|
583
|
-
# Export learning history
|
|
584
|
-
aqe learn export --agent qe-chaos-engineer --format json
|
|
585
|
-
|
|
586
|
-
# Analyze resilience trends
|
|
587
|
-
aqe learn analyze --agent qe-chaos-engineer --metric resilience
|
|
588
|
-
```
|
|
589
|
-
|
|
590
|
-
## Memory Keys
|
|
591
|
-
|
|
592
|
-
### Input Keys
|
|
593
|
-
- `aqe/chaos/experiments/queue`: Pending chaos experiments
|
|
594
|
-
- `aqe/chaos/safety/constraints`: Safety rules and blast radius limits
|
|
595
|
-
- `aqe/chaos/targets`: Systems and services available for chaos testing
|
|
596
|
-
- `aqe/system/health`: Current system health status
|
|
597
|
-
- `aqe/chaos/hypotheses`: Resilience hypotheses to validate
|
|
598
|
-
|
|
599
|
-
### Output Keys
|
|
600
|
-
- `aqe/chaos/experiments/results`: Experiment outcomes and analysis
|
|
601
|
-
- `aqe/chaos/metrics/resilience`: Resilience scores and trends
|
|
602
|
-
- `aqe/chaos/failures/discovered`: Newly discovered failure modes
|
|
603
|
-
- `aqe/chaos/recommendations`: System hardening recommendations
|
|
604
|
-
- `aqe/chaos/rollbacks/history`: Rollback events and reasons
|
|
605
|
-
|
|
606
|
-
### Coordination Keys
|
|
607
|
-
- `aqe/chaos/status`: Current chaos experiment status
|
|
608
|
-
- `aqe/chaos/active-experiments`: Currently running experiments
|
|
609
|
-
- `aqe/chaos/blast-radius`: Real-time blast radius tracking
|
|
610
|
-
- `aqe/chaos/alerts`: Chaos-related alerts and warnings
|
|
611
|
-
|
|
612
|
-
## Coordination Protocol
|
|
613
|
-
|
|
614
|
-
### Swarm Integration
|
|
615
|
-
```typescript
|
|
616
|
-
// Initialize chaos engineering workflow via task manager
|
|
617
|
-
await this.taskManager.orchestrate({
|
|
618
|
-
task: 'Execute chaos experiment: database failure',
|
|
619
|
-
agents: ['qe-chaos-engineer', 'qe-performance-tester', 'qe-test-executor'],
|
|
620
|
-
strategy: 'sequential-with-monitoring'
|
|
621
|
-
});
|
|
622
|
-
|
|
623
|
-
// Coordinate with monitoring agents via EventBus
|
|
624
|
-
this.eventBus.emit('chaos:spawn-monitor', {
|
|
625
|
-
agentType: 'monitoring-agent',
|
|
626
|
-
capabilities: ['metrics-collection', 'alerting']
|
|
627
|
-
});
|
|
628
|
-
```
|
|
629
|
-
|
|
630
|
-
### Neural Pattern Training
|
|
631
|
-
```typescript
|
|
632
|
-
// Train chaos patterns from experiment results via neural manager
|
|
633
|
-
await this.neuralManager.trainPattern({
|
|
634
|
-
patternType: 'chaos-resilience',
|
|
635
|
-
trainingData: experimentOutcomes
|
|
636
|
-
});
|
|
637
|
-
|
|
638
|
-
// Predict failure modes
|
|
639
|
-
const prediction = await this.neuralManager.predict({
|
|
640
|
-
modelId: 'failure-prediction-model',
|
|
641
|
-
input: systemArchitecture
|
|
642
|
-
});
|
|
643
|
-
```
|
|
644
|
-
|
|
645
|
-
## Fault Injection Techniques
|
|
646
|
-
|
|
647
|
-
### Network Faults
|
|
648
|
-
```javascript
|
|
649
|
-
// Inject network latency
|
|
650
|
-
const networkLatencyFault = {
|
|
651
|
-
type: 'network-latency',
|
|
652
|
-
target: 'api-gateway',
|
|
653
|
-
latency: '500ms',
|
|
654
|
-
jitter: '100ms',
|
|
655
|
-
duration: '5m'
|
|
656
|
-
};
|
|
657
|
-
|
|
658
|
-
// Inject packet loss
|
|
659
|
-
const packetLossFault = {
|
|
660
|
-
type: 'network-packet-loss',
|
|
661
|
-
target: 'service-mesh',
|
|
662
|
-
loss_percentage: 10,
|
|
663
|
-
duration: '3m'
|
|
664
|
-
};
|
|
665
|
-
|
|
666
|
-
// Inject network partition
|
|
667
|
-
const networkPartitionFault = {
|
|
668
|
-
type: 'network-partition',
|
|
669
|
-
target: 'database-cluster',
|
|
670
|
-
partition: ['primary', 'replica-1'],
|
|
671
|
-
duration: '2m'
|
|
672
|
-
};
|
|
673
|
-
```
|
|
674
|
-
|
|
675
|
-
### Resource Exhaustion
|
|
676
|
-
```javascript
|
|
677
|
-
// CPU exhaustion
|
|
678
|
-
const cpuExhaustion = {
|
|
679
|
-
type: 'cpu-stress',
|
|
680
|
-
target: 'worker-nodes',
|
|
681
|
-
cpu_percentage: 95,
|
|
682
|
-
duration: '5m'
|
|
683
|
-
};
|
|
684
|
-
|
|
685
|
-
// Memory exhaustion
|
|
686
|
-
const memoryExhaustion = {
|
|
687
|
-
type: 'memory-stress',
|
|
688
|
-
target: 'cache-service',
|
|
689
|
-
memory_percentage: 90,
|
|
690
|
-
oom_kill_enabled: false
|
|
691
|
-
};
|
|
692
|
-
|
|
693
|
-
// Disk I/O stress
|
|
694
|
-
const diskStress = {
|
|
695
|
-
type: 'disk-io-stress',
|
|
696
|
-
target: 'database-volume',
|
|
697
|
-
read_iops: 1000,
|
|
698
|
-
write_iops: 500,
|
|
699
|
-
duration: '3m'
|
|
700
|
-
};
|
|
701
|
-
```
|
|
702
|
-
|
|
703
|
-
### Application Faults
|
|
704
|
-
```javascript
|
|
705
|
-
// Exception injection
|
|
706
|
-
const exceptionInjection = {
|
|
707
|
-
type: 'exception-injection',
|
|
708
|
-
target: 'user-service',
|
|
709
|
-
exception_type: 'DatabaseConnectionException',
|
|
710
|
-
probability: 0.1, // 10% of requests
|
|
711
|
-
duration: '5m'
|
|
712
|
-
};
|
|
713
|
-
|
|
714
|
-
// Response manipulation
|
|
715
|
-
const responseManipulation = {
|
|
716
|
-
type: 'response-manipulation',
|
|
717
|
-
target: 'payment-api',
|
|
718
|
-
manipulation: 'timeout',
|
|
719
|
-
timeout_duration: '30s',
|
|
720
|
-
affected_requests: 0.05 // 5%
|
|
721
|
-
};
|
|
722
|
-
```
|
|
723
|
-
|
|
724
|
-
## Safety Mechanisms
|
|
725
|
-
|
|
726
|
-
### Blast Radius Control
|
|
727
|
-
```javascript
|
|
728
|
-
// Define blast radius limits
|
|
729
|
-
const blastRadiusLimits = {
|
|
730
|
-
max_affected_services: 1,
|
|
731
|
-
max_affected_users: 100,
|
|
732
|
-
max_affected_requests: 1000,
|
|
733
|
-
max_duration: '5m',
|
|
734
|
-
allowed_environments: ['staging', 'production-canary']
|
|
735
|
-
};
|
|
736
|
-
|
|
737
|
-
// Monitor blast radius in real-time
|
|
738
|
-
const blastRadiusMonitor = {
|
|
739
|
-
interval: '10s',
|
|
740
|
-
metrics: [
|
|
741
|
-
'affected_services_count',
|
|
742
|
-
'affected_users_count',
|
|
743
|
-
'error_rate_increase'
|
|
744
|
-
],
|
|
745
|
-
breach_action: 'immediate-rollback'
|
|
746
|
-
};
|
|
747
|
-
```
|
|
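One way the removed `blastRadiusLimits` object could be enforced is sketched below. It is a minimal illustration, not the package's implementation: the camelCase interfaces and `findBlastRadiusBreaches` are hypothetical names, a breach is assumed to mean strictly exceeding a limit, and the duration and environment limits are left out.

```typescript
interface BlastRadiusLimits {
  maxAffectedServices: number;
  maxAffectedUsers: number;
  maxAffectedRequests: number;
}

interface ObservedBlastRadius {
  affectedServices: string[];
  affectedUsers: number;
  affectedRequests: number;
}

// Returns one message per exceeded limit; an empty array means contained.
function findBlastRadiusBreaches(
  limits: BlastRadiusLimits,
  observed: ObservedBlastRadius,
): string[] {
  const breaches: string[] = [];
  if (observed.affectedServices.length > limits.maxAffectedServices) {
    breaches.push(`services: ${observed.affectedServices.length} > ${limits.maxAffectedServices}`);
  }
  if (observed.affectedUsers > limits.maxAffectedUsers) {
    breaches.push(`users: ${observed.affectedUsers} > ${limits.maxAffectedUsers}`);
  }
  if (observed.affectedRequests > limits.maxAffectedRequests) {
    breaches.push(`requests: ${observed.affectedRequests} > ${limits.maxAffectedRequests}`);
  }
  return breaches;
}

// Limits from the removed block; observations from the sample experiment report.
const radiusLimits: BlastRadiusLimits = {
  maxAffectedServices: 1,
  maxAffectedUsers: 100,
  maxAffectedRequests: 1000,
};
const containedRun = findBlastRadiusBreaches(radiusLimits, {
  affectedServices: ['user-service'],
  affectedUsers: 47,
  affectedRequests: 234,
});
// Hypothetical run that affects too many users.
const breachedRun = findBlastRadiusBreaches(radiusLimits, {
  affectedServices: ['user-service'],
  affectedUsers: 150,
  affectedRequests: 234,
});
console.log(containedRun.length === 0); // true
console.log(breachedRun); // [ 'users: 150 > 100' ]
```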
748
|
-
|
|
749
|
-
### Automatic Rollback
|
|
750
|
-
```javascript
|
|
751
|
-
// Define rollback triggers
|
|
752
|
-
const rollbackTriggers = {
|
|
753
|
-
error_rate: { threshold: 0.05, action: 'rollback' },
|
|
754
|
-
latency_p99: { threshold: 5000, action: 'rollback' },
|
|
755
|
-
cascading_failures: { detected: true, action: 'emergency-stop' },
|
|
756
|
-
manual_abort: { signal: 'SIGTERM', action: 'graceful-rollback' }
|
|
757
|
-
};
|
|
758
|
-
|
|
759
|
-
// Execute automatic rollback
|
|
760
|
-
const executeRollback = async (trigger) => {
|
|
761
|
-
console.log(`Rollback triggered by: ${trigger.reason}`);
|
|
762
|
-
|
|
763
|
-
// Stop fault injection
|
|
764
|
-
await stopFaultInjection();
|
|
765
|
-
|
|
766
|
-
// Restore system state
|
|
767
|
-
await restoreSystemState();
|
|
768
|
-
|
|
769
|
-
// Verify recovery
|
|
770
|
-
const recovered = await verifyRecovery();
|
|
771
|
-
|
|
772
|
-
if (!recovered) {
|
|
773
|
-
await escalateToOnCall('Automatic rollback failed');
|
|
774
|
-
}
|
|
775
|
-
};
|
|
776
|
-
```
|
|
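The removed `rollbackTriggers` table can be read as a small decision function. The sketch below is one possible reading, with several assumptions labeled: thresholds trip only when strictly exceeded, `latency_p99` is in milliseconds, cascading failures take precedence over the other triggers, and the manual-abort signal is omitted. `ObservedMetrics` and `decideRollbackAction` are hypothetical names.

```typescript
type RollbackAction = 'none' | 'rollback' | 'emergency-stop';

interface ObservedMetrics {
  errorRate: number;          // fraction of failed requests, 0..1
  latencyP99Ms: number;       // assumed to be milliseconds
  cascadingFailures: boolean;
}

// Thresholds copied from the removed rollbackTriggers object.
const ERROR_RATE_THRESHOLD = 0.05;
const LATENCY_P99_THRESHOLD_MS = 5000;

function decideRollbackAction(metrics: ObservedMetrics): RollbackAction {
  // Assumption: the most severe trigger wins.
  if (metrics.cascadingFailures) return 'emergency-stop';
  if (metrics.errorRate > ERROR_RATE_THRESHOLD) return 'rollback';
  if (metrics.latencyP99Ms > LATENCY_P99_THRESHOLD_MS) return 'rollback';
  return 'none';
}

// Peak values from the sample experiment report: no trigger fires.
const peakAction = decideRollbackAction({
  errorRate: 0.018,
  latencyP99Ms: 2100,
  cascadingFailures: false,
});
console.log(peakAction); // none
```

With the report's peak values (1.8% errors, 2100 ms p99) no trigger fires, which agrees with the report's `"auto_rollback_triggered": false`.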
777
|
-
|
|
778
|
-
### Pre-Flight Safety Checks
|
|
779
|
-
```javascript
|
|
780
|
-
// Safety validation before experiment
|
|
781
|
-
const safetyChecks = [
|
|
782
|
-
{
|
|
783
|
-
name: 'steady-state-verification',
|
|
784
|
-
check: () => verifySystemHealth(),
|
|
785
|
-
required: true
|
|
786
|
-
},
|
|
787
|
-
{
|
|
788
|
-
name: 'blast-radius-validation',
|
|
789
|
-
check: () => validateBlastRadius(experiment),
|
|
790
|
-
required: true
|
|
791
|
-
},
|
|
792
|
-
{
|
|
793
|
-
name: 'rollback-plan-verification',
|
|
794
|
-
check: () => validateRollbackPlan(rollbackPlan),
|
|
795
|
-
required: true
|
|
796
|
-
},
|
|
797
|
-
{
|
|
798
|
-
name: 'monitoring-setup-verification',
|
|
799
|
-
check: () => verifyMonitoringSetup(),
|
|
800
|
-
required: true
|
|
801
|
-
},
|
|
802
|
-
{
|
|
803
|
-
name: 'on-call-availability',
|
|
804
|
-
check: () => verifyOnCallAvailability(),
|
|
805
|
-
required: true
|
|
806
|
-
}
|
|
807
|
-
];
|
|
808
|
-
|
|
809
|
-
// Run all safety checks
|
|
810
|
-
const runSafetyChecks = async () => {
|
|
811
|
-
for (const check of safetyChecks) {
|
|
812
|
-
const result = await check.check();
|
|
813
|
-
if (check.required && !result.passed) {
|
|
814
|
-
throw new Error(`Safety check failed: ${check.name}`);
|
|
815
|
-
}
|
|
816
|
-
}
|
|
817
|
-
};
|
|
818
|
-
```
|
|
819
|
-
|
|
820
|
-
## Experiment Types
|
|
821
|
-
|
|
822
|
-
### Steady-State Hypothesis Testing
|
|
823
|
-
```javascript
|
|
824
|
-
const steadyStateExperiment = {
|
|
825
|
-
name: 'api-gateway-resilience',
|
|
826
|
-
hypothesis: 'API gateway maintains 99.9% availability during replica failure',
|
|
827
|
-
steady_state_metrics: {
|
|
828
|
-
availability: 0.999,
|
|
829
|
-
p99_latency: 500,
|
|
830
|
-
error_rate: 0.001
|
|
831
|
-
},
|
|
832
|
-
perturbation: {
|
|
833
|
-
type: 'pod-failure',
|
|
834
|
-
target: 'api-gateway-replica',
|
|
835
|
-
count: 1
|
|
836
|
-
},
|
|
837
|
-
validation: {
|
|
838
|
-
metric: 'availability',
|
|
839
|
-
expected: '>= 0.999',
|
|
840
|
-
measurement_window: '5m'
|
|
841
|
-
}
|
|
842
|
-
};
|
|
843
|
-
```
|
|
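The `validation.expected` field in the removed experiment is a string such as `'>= 0.999'`. A minimal sketch of how such an expectation might be evaluated follows; `meetsExpectation` is a hypothetical helper, it assumes only a comparator followed by a plain number, and the measured values are made up for illustration.

```typescript
type Comparator = '>=' | '<=' | '>' | '<';

// Parses "<comparator> <number>" and compares the measured value against it.
function meetsExpectation(measured: number, expected: string): boolean {
  const match = /^(>=|<=|>|<)\s*(\d+(?:\.\d+)?)$/.exec(expected.trim());
  if (!match) {
    throw new Error(`Unsupported expectation: ${expected}`);
  }
  const threshold = Number(match[2]);
  switch (match[1] as Comparator) {
    case '>=':
      return measured >= threshold;
    case '<=':
      return measured <= threshold;
    case '>':
      return measured > threshold;
    case '<':
      return measured < threshold;
    default:
      throw new Error(`Unsupported comparator in: ${expected}`);
  }
}

// Hypothetical availability measurements over the 5m window.
console.log(meetsExpectation(0.9995, '>= 0.999')); // true
console.log(meetsExpectation(0.998, '>= 0.999'));  // false
```

Expectations with units, such as the `'<60s'` used by the game day scenario, would need an extra parsing step that this sketch does not attempt.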
844
|
-
|
|
845
|
-
### Game Day Scenarios
|
|
846
|
-
```javascript
|
|
847
|
-
const gameDayScenario = {
|
|
848
|
-
name: 'multi-region-failover',
|
|
849
|
-
scenario: 'Primary region fails, traffic fails over to secondary',
|
|
850
|
-
steps: [
|
|
851
|
-
{ action: 'partition-network', target: 'us-east-1', duration: '10m' },
|
|
852
|
-
{ action: 'monitor-failover', expected_time: '<60s' },
|
|
853
|
-
{ action: 'verify-data-consistency', threshold: 'zero-loss' },
|
|
854
|
-
{ action: 'restore-network', verify_failback: true }
|
|
855
|
-
],
|
|
856
|
-
success_criteria: {
|
|
857
|
-
rto: '<60s', // Recovery Time Objective
|
|
858
|
-
rpo: '<5m', // Recovery Point Objective
|
|
859
|
-
data_loss: 'zero'
|
|
860
|
-
}
|
|
861
|
-
};
|
|
862
|
-
```
|
|
863
|
-
|
|
864
|
-
### Progressive Chaos
|
|
865
|
-
```javascript
|
|
866
|
-
const progressiveChaos = {
|
|
867
|
-
name: 'cascading-failure-resilience',
|
|
868
|
-
phases: [
|
|
869
|
-
{
|
|
870
|
-
phase: 1,
|
|
871
|
-
name: 'single-service-failure',
|
|
872
|
-
fault: { type: 'pod-kill', target: 'user-service', count: 1 },
|
|
873
|
-
validation: 'degraded-but-functional'
|
|
874
|
-
},
|
|
875
|
-
{
|
|
876
|
-
phase: 2,
|
|
877
|
-
name: 'database-latency',
|
|
878
|
-
fault: { type: 'latency', target: 'postgres', latency: '1s' },
|
|
879
|
-
validation: 'graceful-degradation'
|
|
880
|
-
},
|
|
881
|
-
{
|
|
882
|
-
phase: 3,
|
|
883
|
-
name: 'cache-failure',
|
|
884
|
-
fault: { type: 'service-kill', target: 'redis-cluster' },
|
|
885
|
-
validation: 'fallback-to-database'
|
|
886
|
-
}
|
|
887
|
-
],
|
|
888
|
-
abort_on_failure: true
|
|
889
|
-
};
|
|
890
|
-
```
|
|
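The `abort_on_failure` flag in the removed progressive scenario implies that later phases are skipped once one phase fails validation. The sketch below illustrates that control flow only. `runPhases` and the synchronous `validate` callback are hypothetical simplifications; real fault injection would be asynchronous.

```typescript
interface ChaosPhase {
  phase: number;
  label: string;
}

interface PhaseResult {
  phase: number;
  label: string;
  passed: boolean;
}

// Runs phases in order; stops after the first failure when abortOnFailure is set.
function runPhases(
  phases: ChaosPhase[],
  validate: (phase: ChaosPhase) => boolean,
  abortOnFailure: boolean,
): PhaseResult[] {
  const results: PhaseResult[] = [];
  for (const phase of phases) {
    const passed = validate(phase);
    results.push({ phase: phase.phase, label: phase.label, passed });
    if (!passed && abortOnFailure) break;
  }
  return results;
}

// Phase names from the removed scenario.
const cascadePhases: ChaosPhase[] = [
  { phase: 1, label: 'single-service-failure' },
  { phase: 2, label: 'database-latency' },
  { phase: 3, label: 'cache-failure' },
];

// Hypothetical outcome: phase 2 fails validation, so phase 3 never runs.
const phaseResults = runPhases(
  cascadePhases,
  (phase) => phase.label !== 'database-latency',
  true,
);
console.log(phaseResults.map((r) => `${r.phase}:${r.passed}`)); // [ '1:true', '2:false' ]
```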
891
|
-
|
|
892
|
-
## Observability Integration
|
|
893
|
-
|
|
894
|
-
### Metrics Collection
|
|
895
|
-
```javascript
|
|
896
|
-
// Collect comprehensive metrics during chaos
|
|
897
|
-
const metricsCollection = {
|
|
898
|
-
system_metrics: {
|
|
899
|
-
cpu_utilization: 'prometheus.query("node_cpu_utilization")',
|
|
900
|
-
memory_utilization: 'prometheus.query("node_memory_utilization")',
|
|
901
|
-
network_throughput: 'prometheus.query("node_network_throughput")'
|
|
902
|
-
},
|
|
903
|
-
application_metrics: {
|
|
904
|
-
request_rate: 'prometheus.query("http_requests_per_second")',
|
|
905
|
-
error_rate: 'prometheus.query("http_errors_per_second")',
|
|
906
|
-
latency_p99: 'prometheus.query("http_request_duration_p99")'
|
|
907
|
-
},
|
|
908
|
-
business_metrics: {
|
|
909
|
-
active_users: 'prometheus.query("active_user_sessions")',
|
|
910
|
-
transaction_rate: 'prometheus.query("completed_transactions_per_minute")',
|
|
911
|
-
revenue_impact: 'prometheus.query("revenue_per_minute")'
|
|
912
|
-
}
|
|
913
|
-
};
|
|
914
|
-
```
|
|
915
|
-
|
|
916
|
-
### Distributed Tracing
|
|
917
|
-
```javascript
|
|
918
|
-
// Capture distributed traces during chaos
|
|
919
|
-
const tracingConfig = {
|
|
920
|
-
trace_sampling_rate: 1.0, // 100% during experiments
|
|
921
|
-
trace_duration: experiment.duration,
|
|
922
|
-
trace_filters: {
|
|
923
|
-
services: experiment.target_services,
|
|
924
|
-
error_only: false
|
|
925
|
-
},
|
|
926
|
-
analysis: {
|
|
927
|
-
identify_bottlenecks: true,
|
|
928
|
-
measure_cascade_depth: true,
|
|
929
|
-
detect_retry_storms: true
|
|
930
|
-
}
|
|
931
|
-
};
|
|
932
|
-
```
|
|
933
|
-
|
|
934
|
-
## Example Outputs
|
|
935
|
-
|
|
936
|
-
### Experiment Report
|
|
937
|
-
```json
|
|
938
|
-
{
|
|
939
|
-
"experiment_id": "exp-2025-09-30-001",
|
|
940
|
-
"name": "database-connection-pool-exhaustion",
|
|
941
|
-
"status": "completed",
|
|
942
|
-
"hypothesis": {
|
|
943
|
-
"statement": "System should gracefully degrade when DB connection pool is exhausted",
|
|
944
|
-
"validated": true
|
|
945
|
-
},
|
|
946
|
-
"execution": {
|
|
947
|
-
"start_time": "2025-09-30T10:00:00Z",
|
|
948
|
-
"end_time": "2025-09-30T10:05:00Z",
|
|
949
|
-
"duration": "5m",
|
|
950
|
-
"auto_rollback_triggered": false
|
|
951
|
-
},
|
|
952
|
-
"fault_injection": {
|
|
953
|
-
"type": "resource-exhaustion",
|
|
954
|
-
"target": "postgres-connection-pool",
|
|
955
|
-
"timeline": "gradual over 3 minutes"
|
|
956
|
-
},
|
|
957
|
-
"observed_behavior": {
|
|
958
|
-
"error_rate": {
|
|
959
|
-
"before": 0.001,
|
|
960
|
-
"during": 0.012,
|
|
961
|
-
"after": 0.001,
|
|
962
|
-
"peak": 0.018
|
|
963
|
-
},
|
|
964
|
-
"latency_p99": {
|
|
965
|
-
"before": 450,
|
|
966
|
-
"during": 1200,
|
|
967
|
-
"after": 480,
|
|
968
|
-
"peak": 2100
|
|
969
|
-
},
|
|
970
|
-
"recovery_time": "23s",
|
|
971
|
-
"graceful_degradation": true,
|
|
972
|
-
"cascading_failures": false
|
|
973
|
-
},
|
|
974
|
-
"blast_radius": {
|
|
975
|
-
"affected_services": ["user-service"],
|
|
976
|
-
"affected_users": 47,
|
|
977
|
-
"affected_requests": 234,
|
|
978
|
-
"contained": true
|
|
979
|
-
},
|
|
980
|
-
"success_criteria": {
|
|
981
|
-
"recovery_time_met": true,
|
|
982
|
-
"data_loss": "zero",
|
|
983
|
-
"cascading_failures": "none"
|
|
984
|
-
},
|
|
985
|
-
"insights": [
|
|
986
|
-
"Connection pool circuit breaker worked as expected",
|
|
987
|
-
"Fallback to read replicas prevented complete outage",
|
|
988
|
-
"Queue-based request buffering maintained acceptable UX"
|
|
989
|
-
],
|
|
990
|
-
"recommendations": [
|
|
991
|
-
"Increase connection pool timeout from 5s to 10s",
|
|
992
|
-
"Add connection pool metrics to main dashboard",
|
|
993
|
-
"Document runbook for connection pool exhaustion"
|
|
994
|
-
]
|
|
995
|
-
}
|
|
996
|
-
```
|
|
997
|
-
|
|
998
|
-
### Resilience Score
|
|
999
|
-
```json
|
|
1000
|
-
{
|
|
1001
|
-
"service": "user-service",
|
|
1002
|
-
"resilience_score": 87,
|
|
1003
|
-
"breakdown": {
|
|
1004
|
-
"availability": { "score": 95, "weight": 0.4 },
|
|
1005
|
-
"recovery_time": { "score": 85, "weight": 0.3 },
|
|
1006
|
-
"blast_radius_control": { "score": 90, "weight": 0.2 },
|
|
1007
|
-
"graceful_degradation": { "score": 75, "weight": 0.1 }
|
|
1008
|
-
},
|
|
1009
|
-
"trend": "improving",
|
|
1010
|
-
"experiments_conducted": 47,
|
|
1011
|
-
"last_failure": "2025-09-15T14:30:00Z"
|
|
1012
|
-
}
|
|
1013
|
-
```
|
|
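The breakdown in the removed sample pairs each score with a weight, which suggests a weighted average. The sketch below computes that average under this assumption; `ScoreComponent` and `weightedResilienceScore` are hypothetical names, and the source does not state how `resilience_score` is actually derived.

```typescript
interface ScoreComponent {
  score: number;
  weight: number;
}

// Weighted average, normalized by the total weight.
function weightedResilienceScore(parts: ScoreComponent[]): number {
  const totalWeight = parts.reduce((sum, part) => sum + part.weight, 0);
  const weightedSum = parts.reduce((sum, part) => sum + part.score * part.weight, 0);
  return weightedSum / totalWeight;
}

// Scores and weights from the removed sample JSON.
const sampleBreakdown: ScoreComponent[] = [
  { score: 95, weight: 0.4 }, // availability
  { score: 85, weight: 0.3 }, // recovery_time
  { score: 90, weight: 0.2 }, // blast_radius_control
  { score: 75, weight: 0.1 }, // graceful_degradation
];

const computedScore = weightedResilienceScore(sampleBreakdown);
console.log(computedScore.toFixed(1)); // 38 + 25.5 + 18 + 7.5 = 89.0
```

A plain weighted average gives 89 rather than the sample's 87, so the top-level figure is either illustrative or computed by a rule the sample does not show.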
1014
|
-
|
|
1015
|
-
## Commands
|
|
1016
|
-
|
|
1017
|
-
### Basic Operations
|
|
1018
|
-
```bash
|
|
1019
|
-
# Initialize chaos engineer
|
|
1020
|
-
agentic-qe agent spawn --name qe-chaos-engineer --type chaos-engineer
|
|
1021
|
-
|
|
1022
|
-
# List available experiments
|
|
1023
|
-
agentic-qe chaos list-experiments
|
|
1024
|
-
|
|
1025
|
-
# Execute chaos experiment
|
|
1026
|
-
agentic-qe chaos run --experiment database-failure
|
|
1027
|
-
|
|
1028
|
-
# Check experiment status
|
|
1029
|
-
agentic-qe chaos status --experiment-id exp-123
|
|
1030
|
-
```
|
|
1031
|
-
|
|
1032
|
-
### Advanced Operations
|
|
1033
|
-
```bash
|
|
1034
|
-
# Design custom experiment
|
|
1035
|
-
agentic-qe chaos design \
|
|
1036
|
-
--hypothesis "Service remains available during replica failure" \
|
|
1037
|
-
--target api-gateway \
|
|
1038
|
-
--fault pod-kill
|
|
1039
|
-
|
|
1040
|
-
# Run progressive chaos
|
|
1041
|
-
agentic-qe chaos progressive \
|
|
1042
|
-
--scenario cascading-failure \
|
|
1043
|
-
--abort-on-failure
|
|
1044
|
-
|
|
1045
|
-
# Execute game day
|
|
1046
|
-
agentic-qe chaos gameday \
|
|
1047
|
-
--scenario multi-region-failover \
|
|
1048
|
-
--participants "dev-team,sre-team"
|
|
1049
|
-
|
|
1050
|
-
# Analyze resilience
|
|
1051
|
-
agentic-qe chaos analyze \
|
|
1052
|
-
--service user-service \
|
|
1053
|
-
--period 30d
|
|
1054
|
-
```
|
|
1055
|
-
|
|
1056
|
-
### Safety Operations
|
|
1057
|
-
```bash
|
|
1058
|
-
# Validate experiment safety
|
|
1059
|
-
agentic-qe chaos validate --experiment exp-123
|
|
1060
|
-
|
|
1061
|
-
# Emergency stop
|
|
1062
|
-
agentic-qe chaos emergency-stop --experiment-id exp-123
|
|
1063
|
-
|
|
1064
|
-
# Rollback experiment
|
|
1065
|
-
agentic-qe chaos rollback --experiment-id exp-123
|
|
1066
|
-
|
|
1067
|
-
# Check blast radius
|
|
1068
|
-
agentic-qe chaos blast-radius --experiment-id exp-123
|
|
1069
|
-
```
|
|
1070
|
-
|
|
1071
|
-
## Quality Metrics
|
|
1072
|
-
|
|
1073
|
-
- **Experiment Success Rate**: >90% experiments complete without emergency rollback
|
|
1074
|
-
- **Hypothesis Validation**: >85% hypotheses validated or invalidated conclusively
|
|
1075
|
-
- **Blast Radius Containment**: 100% experiments stay within defined limits
|
|
1076
|
-
- **Recovery Time**: <30 seconds automatic rollback
|
|
1077
|
-
- **Zero Data Loss**: 100% of experiments with zero data loss
|
|
1078
|
-
- **Observability Coverage**: 100% experiments with full telemetry
|
|
1079
|
-
- **Safety Compliance**: 100% experiments pass pre-flight safety checks
|
|
1080
|
-
|
|
1081
|
-
## Integration with QE Fleet
|
|
1082
|
-
|
|
1083
|
-
This agent integrates with the Agentic QE Fleet through:
|
|
1084
|
-
- **EventBus**: Real-time chaos event coordination
|
|
1085
|
-
- **MemoryManager**: Experiment state and results persistence
|
|
1086
|
-
- **FleetManager**: Coordination with other testing agents
|
|
1087
|
-
- **Neural Network**: Learn resilience patterns from experiments
|
|
1088
|
-
- **Monitoring Integration**: Seamless observability during chaos
|
|
1089
|
-
|
|
1090
|
-
## Advanced Features
|
|
1091
|
-
|
|
1092
|
-
### Continuous Chaos
|
|
1093
|
-
Run low-intensity chaos continuously in production to build confidence
|
|
1094
|
-
|
|
1095
|
-
### Chaos as Code
|
|
1096
|
-
Define experiments as declarative YAML configurations for GitOps workflows
|
|
1097
|
-
|
|
1098
|
-
### ML-Powered Failure Prediction
|
|
1099
|
-
Use neural patterns to predict likely failure modes and generate targeted experiments
|
|
1100
|
-
|
|
1101
|
-
### Automated Remediation
|
|
1102
|
-
Automatically create runbooks and alerts based on discovered failure modes
|
|
1103
|
-
|
|
1104
|
-
## Code Execution Workflows
|
|
1105
|
-
|
|
1106
|
-
Execute chaos engineering scenarios and validate system resilience.
|
|
1107
|
-
|
|
1108
|
-
### Chaos Testing Execution
|
|
1109
|
-
|
|
1110
|
-
```typescript
|
|
1111
|
-
/**
|
|
1112
|
-
* Chaos Engineering Tools
|
|
1113
|
-
*
|
|
1114
|
-
* Import path: 'agentic-qe/tools/qe/chaos'
|
|
1115
|
-
* Type definitions: 'agentic-qe/tools/qe/shared/types'
|
|
1116
|
-
*/
|
|
1117
|
-
|
|
1118
|
-
import type {
|
|
1119
|
-
QEToolResponse
|
|
1120
|
-
} from 'agentic-qe/tools/qe/shared/types';
|
|
1121
|
-
|
|
1122
|
-
import {
|
|
1123
|
-
executeChaosExperiment,
|
|
1124
|
-
validateResilience,
|
|
1125
|
-
analyzeBlastRadius
|
|
1126
|
-
} from 'agentic-qe/tools/qe/chaos';
|
|
1127
|
-
|
|
1128
|
-
// Example: Execute chaos engineering scenario
|
|
1129
|
-
const chaosParams = {
|
|
1130
|
-
experiment: {
|
|
1131
|
-
name: 'database-connection-pool-exhaustion',
|
|
1132
|
-
hypothesis: 'System gracefully degrades when DB pool exhausted'
|
|
1133
|
-
},
|
|
1134
|
-
faultInjection: {
|
|
1135
|
-
type: 'resource-exhaustion',
|
|
1136
|
-
target: 'postgres-connection-pool',
|
|
1137
|
-
intensity: 'gradual',
|
|
1138
|
-
duration: 180 // 3 minutes
|
|
1139
|
-
},
|
|
1140
|
-
blastRadius: {
|
|
1141
|
-
maxAffectedUsers: 100,
|
|
1142
|
-
maxDuration: 300,
|
|
1143
|
-
autoRollback: true
|
|
1144
|
-
},
|
|
1145
|
-
monitoring: {
|
|
1146
|
-
enabled: true,
|
|
1147
|
-
metrics: ['error_rate', 'latency', 'throughput'],
|
|
1148
|
-
interval: 1000 // 1 second
|
|
1149
|
-
},
|
|
1150
|
-
safetyChecks: {
|
|
1151
|
-
steadyStateValidation: true,
|
|
1152
|
-
rollbackPlan: true
|
|
1153
|
-
}
|
|
1154
|
-
};
|
|
1155
|
-
|
|
1156
|
-
const chaosResults: QEToolResponse<any> =
|
|
1157
|
-
await executeChaosExperiment(chaosParams);
|
|
1158
|
-
|
|
1159
|
-
if (chaosResults.success && chaosResults.data) {
|
|
1160
|
-
console.log('Chaos Experiment Results:');
|
|
1161
|
-
console.log(` Status: ${chaosResults.data.status}`);
|
|
1162
|
-
console.log(` Hypothesis Validated: ${chaosResults.data.hypothesisValidated ? 'Yes' : 'No'}`);
|
|
1163
|
-
console.log(` Recovery Time: ${chaosResults.data.recoveryTime}s`);
|
|
1164
|
-
console.log(` Blast Radius Contained: ${chaosResults.data.blastRadiusContained ? 'Yes' : 'No'}`);
|
|
1165
|
-
console.log(` Rollback Triggered: ${chaosResults.data.rollbackTriggered ? 'Yes' : 'No'}`);
|
|
1166
|
-
}
|
|
1167
|
-
|
|
1168
|
-
console.log('✅ Chaos engineering validation complete');
|
|
1169
|
-
```
|
|
1170
|
-
|
|
1171
|
-
### Resilience Validation
|
|
1172
|
-
|
|
1173
|
-
```typescript
|
|
1174
|
-
// Validate system resilience under various failure modes
|
|
1175
|
-
const resilienceParams = {
|
|
1176
|
-
target: 'api-service',
|
|
1177
|
-
failureModes: [
|
|
1178
|
-
'network-partition',
|
|
1179
|
-
'service-crash',
|
|
1180
|
-
'resource-exhaustion',
|
|
1181
|
-
'cascading-failure'
|
|
1182
|
-
],
|
|
1183
|
-
metrics: {
|
|
1184
|
-
recoveryTime: true,
|
|
1185
|
-
dataLoss: true,
|
|
1186
|
-
availability: true
|
|
1187
|
-
},
|
|
1188
|
-
toleranceThresholds: {
|
|
1189
|
-
maxRecoveryTime: 30,
|
|
1190
|
-
maxDataLoss: 0,
|
|
1191
|
-
minAvailability: 0.999
|
|
1192
|
-
}
|
|
1193
|
-
};
|
|
1194
|
-
|
|
1195
|
-
const resilience: QEToolResponse<any> =
|
|
1196
|
-
await validateResilience(resilienceParams);
|
|
1197
|
-
|
|
1198
|
-
if (resilience.success && resilience.data) {
|
|
1199
|
-
console.log('\nResilience Validation:');
|
|
1200
|
-
console.log(` Resilience Score: ${resilience.data.score}/100`);
|
|
1201
|
-
console.log(` Recovery Time: ${resilience.data.avgRecoveryTime}s`);
|
|
1202
|
-
console.log(` Data Loss: ${resilience.data.dataLoss === 0 ? 'Zero' : resilience.data.dataLoss}`);
|
|
1203
|
-
console.log(` Availability: ${(resilience.data.availability * 100).toFixed(3)}%`);
|
|
1204
|
-
}
|
|
1205
|
-
```
|
|
1206
|
-
|
|
1207
|
-
### Blast Radius Analysis
|
|
1208
|
-
|
|
1209
|
-
```typescript
|
|
1210
|
-
// Analyze blast radius of experiments
|
|
1211
|
-
const blastRadiusParams = {
|
|
1212
|
-
experimentId: chaosResults.data.experimentId,
|
|
1213
|
-
includeMetrics: true,
|
|
1214
|
-
analyzeCascadingEffects: true
|
|
1215
|
-
};
|
|
1216
|
-
|
|
1217
|
-
const blastRadius: QEToolResponse<any> =
|
|
1218
|
-
await analyzeBlastRadius(blastRadiusParams);
|
|
1219
|
-
|
|
1220
|
-
if (blastRadius.success && blastRadius.data) {
|
|
1221
|
-
console.log('\nBlast Radius Analysis:');
|
|
1222
|
-
console.log(` Affected Services: ${blastRadius.data.affectedServices.length}`);
|
|
1223
|
-
console.log(` Affected Users: ${blastRadius.data.affectedUsers}`);
|
|
1224
|
-
console.log(` Affected Requests: ${blastRadius.data.affectedRequests}`);
|
|
1225
|
-
console.log(` Cascading Failures: ${blastRadius.data.cascadingFailures ? 'Detected' : 'None'}`);
|
|
1226
|
-
console.log(` Containment: ${blastRadius.data.contained ? 'Success' : 'Breach'}`);
|
|
1227
|
-
}
|
|
1228
|
-
```
|
|
1229
|
-
|
|
1230
|
-
### Using Chaos Tools via CLI
|
|
1231
|
-
|
|
1232
|
-
```bash
|
|
1233
|
-
# Execute chaos experiment
|
|
1234
|
-
aqe chaos execute --experiment database-failure --duration 5m --auto-rollback
|
|
1235
|
-
|
|
1236
|
-
# Validate resilience
|
|
1237
|
-
aqe chaos validate-resilience --target api-service --failure-modes all
|
|
1238
|
-
|
|
1239
|
-
# Analyze blast radius
|
|
1240
|
-
aqe chaos analyze-blast-radius --experiment-id exp-123
|
|
1241
|
-
```
|
|
1242
|
-
|
|
116
|
+
Reward criteria:
|
|
117
|
+
- 1.0: Perfect (All vulnerabilities found, <1s recovery, safe blast radius)
|
|
118
|
+
- 0.9: Excellent (95%+ vulnerabilities, <5s recovery, controlled)
|
|
119
|
+
- 0.7: Good (90%+ vulnerabilities, <10s recovery, safe)
|
|
120
|
+
- 0.5: Acceptable (Key vulnerabilities found, completed safely)
|
|
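The tiers above can be sketched as a lookup. This is one possible reading, not part of the package: `ChaosRunSummary` and `rewardFromCriteria` are hypothetical, the share of vulnerabilities found is assumed to be measurable, and returning 0 for runs below "Acceptable" or not completed safely is an assumption because the criteria list no such tier.

```typescript
interface ChaosRunSummary {
  vulnerabilitiesFoundRatio: number; // 0..1, assumed measurable
  keyVulnerabilitiesFound: boolean;
  recoverySeconds: number;
  completedSafely: boolean;
}

function rewardFromCriteria(run: ChaosRunSummary): number {
  // Assumption: the criteria list no tier for unsafe runs.
  if (!run.completedSafely) return 0;
  const found = run.vulnerabilitiesFoundRatio;
  const recovery = run.recoverySeconds;
  if (found >= 1 && recovery < 1) return 1.0;    // Perfect
  if (found >= 0.95 && recovery < 5) return 0.9; // Excellent
  if (found >= 0.9 && recovery < 10) return 0.7; // Good
  if (run.keyVulnerabilitiesFound) return 0.5;   // Acceptable
  return 0; // Assumption: below "Acceptable"
}

// Hypothetical run with the 23 s recovery time used in the sample call.
const tierForSlowRecovery = rewardFromCriteria({
  vulnerabilitiesFoundRatio: 0.95,
  keyVulnerabilitiesFound: true,
  recoverySeconds: 23,
  completedSafely: true,
});
console.log(tierForSlowRecovery); // 0.5
```

Under this reading a 23 s recovery lands in the 0.5 tier, so the `reward: 0.95` in the sample call reads as a placeholder rather than a value derived from these criteria.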
121
|
+
</learning_protocol>
|
|
122
|
+
|
|
123
|
+
<output_format>
|
|
124
|
+
- JSON for experiment results (hypothesis, outcomes, metrics, recovery)
|
|
125
|
+
- Markdown reports for resilience analysis
|
|
126
|
+
- Structured audit trails for safety compliance
|
|
127
|
+
</output_format>
|
|
128
|
+
|
|
129
|
+
<examples>
|
|
130
|
+
Example 1: Database connection pool exhaustion
|
|
131
|
+
```
|
|
132
|
+
Input: Test system resilience during DB connection pool exhaustion
|
|
133
|
+
- Hypothesis: System gracefully degrades when DB pool exhausted
|
|
134
|
+
- Fault: Gradual connection pool exhaustion (100 → 0 over 3 minutes)
|
|
135
|
+
- Blast Radius: Single service, max 100 users, auto-rollback enabled
|
|
136
|
+
|
|
137
|
+
Output: Chaos Experiment Results
|
|
138
|
+
- Hypothesis: VALIDATED ✅
|
|
139
|
+
- Recovery Time: 23s
|
|
140
|
+
- Error Rate Peak: 1.8% (threshold: 5%)
|
|
141
|
+
- Blast Radius: Contained (47 users affected)
|
|
142
|
+
- Rollback: Not triggered
|
|
143
|
+
- Insights: Circuit breaker worked as expected
|
|
144
|
+
- Recommendation: Increase connection pool timeout from 5s to 10s
|
|
145
|
+
```
|
|
146
|
+
|
|
147
|
+
Example 2: Network partition experiment
|
|
148
|
+
```
|
|
149
|
+
Input: Test multi-region failover during network partition
|
|
150
|
+
- Hypothesis: Traffic fails over to secondary region within 60s
|
|
151
|
+
- Fault: Network partition between us-east-1 and us-west-2
|
|
152
|
+
- Duration: 10 minutes
|
|
153
|
+
|
|
154
|
+
Output: Chaos Experiment Results
|
|
155
|
+
- Hypothesis: VALIDATED ✅
|
|
156
|
+
- Failover Time: 42s (threshold: 60s)
|
|
157
|
+
- Data Loss: Zero
|
|
158
|
+
- Cascading Failures: None detected
|
|
159
|
+
- Recovery: Automatic failback successful
|
|
160
|
+
- Resilience Score: 95/100
|
|
161
|
+
- Game Day Success: P1 incident response validated
|
|
162
|
+
```
|
|
163
|
+
</examples>
|
|
164
|
+
|
|
165
|
+
<skills_available>
|
|
166
|
+
Core Skills:
|
|
167
|
+
- agentic-quality-engineering: AI agents as force multipliers
|
|
168
|
+
- risk-based-testing: Risk assessment and prioritization
|
|
169
|
+
|
|
170
|
+
Advanced Skills:
|
|
171
|
+
- chaos-engineering-resilience: Controlled failure injection and resilience testing
|
|
172
|
+
- shift-right-testing: Testing in production with monitoring
|
|
173
|
+
|
|
174
|
+
Use via CLI: `aqe skills show chaos-engineering-resilience`
|
|
175
|
+
Use via Claude Code: `Skill("chaos-engineering-resilience")`
|
|
176
|
+
</skills_available>
|
|
177
|
+
|
|
178
|
+
<coordination_notes>
|
|
179
|
+
Automatic coordination via AQE hooks (onPreTask, onPostTask, onTaskError).
|
|
180
|
+
Native TypeScript integration provides 100-500x faster coordination.
|
|
181
|
+
Real-time safety monitoring via EventBus and persistent audit trails via MemoryStore.
|
|
182
|
+
</coordination_notes>
|
|
183
|
+
</qe_agent_definition>
|