compound-agent 1.7.6 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (43)
  1. package/CHANGELOG.md +45 -1
  2. package/README.md +70 -47
  3. package/bin/ca +32 -0
  4. package/package.json +19 -78
  5. package/scripts/postinstall.cjs +221 -0
  6. package/dist/cli.d.ts +0 -1
  7. package/dist/cli.js +0 -13158
  8. package/dist/cli.js.map +0 -1
  9. package/dist/index.d.ts +0 -3730
  10. package/dist/index.js +0 -3240
  11. package/dist/index.js.map +0 -1
  12. package/docs/research/AgenticAiCodebaseGuide.md +0 -1206
  13. package/docs/research/BuildingACCompilerAnthropic.md +0 -116
  14. package/docs/research/HarnessEngineeringOpenAi.md +0 -220
  15. package/docs/research/code-review/systematic-review-methodology.md +0 -409
  16. package/docs/research/index.md +0 -76
  17. package/docs/research/learning-systems/knowledge-compounding-for-agents.md +0 -695
  18. package/docs/research/property-testing/property-based-testing-and-invariants.md +0 -742
  19. package/docs/research/scenario-testing/advanced-and-emerging.md +0 -470
  20. package/docs/research/scenario-testing/core-foundations.md +0 -507
  21. package/docs/research/scenario-testing/domain-specific-and-human-factors.md +0 -474
  22. package/docs/research/security/auth-patterns.md +0 -138
  23. package/docs/research/security/data-exposure.md +0 -185
  24. package/docs/research/security/dependency-security.md +0 -91
  25. package/docs/research/security/injection-patterns.md +0 -249
  26. package/docs/research/security/overview.md +0 -81
  27. package/docs/research/security/secrets-checklist.md +0 -92
  28. package/docs/research/security/secure-coding-failure.md +0 -297
  29. package/docs/research/software_architecture/01-science-of-decomposition.md +0 -615
  30. package/docs/research/software_architecture/02-architecture-under-uncertainty.md +0 -649
  31. package/docs/research/software_architecture/03-emergent-behavior-in-composed-systems.md +0 -644
  32. package/docs/research/spec_design/decision_theory_specifications_and_multi_criteria_tradeoffs.md +0 -0
  33. package/docs/research/spec_design/design_by_contract.md +0 -251
  34. package/docs/research/spec_design/domain_driven_design_strategic_modeling.md +0 -183
  35. package/docs/research/spec_design/formal_specification_methods.md +0 -161
  36. package/docs/research/spec_design/logic_and_proof_theory_under_the_curry_howard_correspondence.md +0 -250
  37. package/docs/research/spec_design/natural_language_formal_semantics_abuguity_in_specifications.md +0 -259
  38. package/docs/research/spec_design/requirements_engineering.md +0 -234
  39. package/docs/research/spec_design/systems_engineering_specifications_emergent_behavior_interface_contracts.md +0 -149
  40. package/docs/research/spec_design/what_is_this_about.md +0 -305
  41. package/docs/research/tdd/test-driven-development-methodology.md +0 -547
  42. package/docs/research/test-optimization-strategies.md +0 -401
  43. package/scripts/postinstall.mjs +0 -102
@@ -1,470 +0,0 @@
- # Scenario Testing for CS Development: Advanced & Emerging Techniques
-
- *2026-03-07*
-
- ---
-
- ## Abstract
-
- This survey examines four advanced and emerging paradigms that push scenario-based software testing beyond the foundational techniques of behavioral specification, model-based generation, and combinatorial selection covered in Part 1 of this series. The paradigms examined are: mutation testing as a quality oracle for scenario suites; formal verification and model checking as a rigorous bridge between scenario testing and correctness proofs; AI- and machine-learning-driven scenario generation, including large language model synthesis and search-based software testing; and chaos engineering as a disciplined methodology for resilience scenario design and production-grade fault injection.
-
- Each paradigm addresses a distinct inadequacy of classical scenario testing. Mutation testing answers the question of whether a given scenario suite is sensitive enough to detect realistic faults. Formal verification asks whether the scenarios provably cover all reachable states that violate specified properties. AI-driven generation attacks the combinatorial intractability of scenario space by using intelligent search and generative models to synthesize scenarios that a human engineer would not enumerate. Chaos engineering rejects the closed-world assumption of pre-production testing entirely, treating the live production system as the most faithful simulation of itself.
-
- The paper identifies persistent open problems -- the equivalent mutant problem, state-space explosion, the oracle problem for machine-generated tests, and the absence of standardized resilience metrics -- and situates each technique within a unified taxonomy based on automation level, fault model, theoretical grounding, and practical maturity. Readers of Part 3 of this series will find the theoretical foundations for domain-specific adaptations of all four paradigms developed here.
-
- ---
-
- ## 1. Introduction
-
- ### 1.1 The Inadequacy of Foundational Scenario Testing
-
- Behavior-driven development (BDD), model-based test generation, and combinatorial test design -- the core of Part 1 of this series -- share a common architectural assumption: the test designer knows, or can enumerate, the important classes of system behavior before testing begins. This assumption is workable for specifying nominal behavior from requirements, but it breaks down along at least four fault lines.
-
- First, it provides no feedback on the quality of the scenario suite itself. A suite of one hundred Gherkin scenarios may exercise only a fraction of the fault-revealing paths through an implementation, and classical coverage metrics (line coverage, branch coverage) are, as is well documented, poor proxies for fault detection capability (Inozemtseva & Holmes, 2014). The tester needs an external quality oracle -- something that can measure whether the scenarios are sensitive to realistic faults. Mutation testing provides precisely this.
-
- Second, classical scenario testing operates in the domain of examples: particular input sequences, particular state configurations. Formal correctness properties, by contrast, are universal: "for all reachable states, property P holds." The gap between an example-based test suite and a universal correctness guarantee is unbridgeable by testing alone, but formal verification -- model checking in particular -- can serve as a complementary discipline that either closes the gap or precisely characterizes where it remains open.
-
- Third, the space of possible scenarios for any non-trivial system is combinatorially intractable. Human engineers select scenarios heuristically, constrained by time and imagination. Search-based software testing (SBST) and, more recently, large language model (LLM)-based generation reframe scenario synthesis as an optimization problem over a fitness landscape, enabling automated discovery of scenarios that maximize structural coverage, fault exposure, or semantic diversity in ways that manual enumeration cannot match.
-
- Fourth, the system-under-test in classical scenario testing is a simplified model of production: a controlled environment, mocked dependencies, and the absence of the concurrent, distributed, and stochastic behaviors that characterize real deployed systems. Chaos engineering is the disciplined practice of injecting realistic failures into production (or near-production) systems to discover resilience gaps that no pre-production scenario can reliably expose.
-
- ### 1.2 Scope
-
- This paper covers the following four topics, each treated in depth:
-
- 1. **Mutation Testing for Scenario Adequacy** -- mutation operators, mutation score as a quality metric, the equivalent mutant problem, weak versus strong mutation, and the major mutation testing tools.
- 2. **Formal Verification and Scenario Properties** -- Kripke structures, CTL and LTL temporal logic, model checking algorithms, bounded model checking, CEGAR, runtime verification, and the principal tool implementations.
- 3. **AI/ML-Driven Scenario Generation** -- search-based software testing fitness functions, genetic and evolutionary algorithms for test generation, reinforcement learning for adversarial scenario discovery, LLM-based test synthesis, and the current tool landscape.
- 4. **Chaos Engineering and Resilience Scenarios** -- steady-state hypothesis, failure injection taxonomy, blast radius control, game days, fuzzing as scenario generation, and the major platform implementations.
-
- This paper explicitly excludes the foundational techniques of BDD, model-based generation fundamentals, and combinatorial testing (Part 1), and the domain-specific applications to safety-critical systems, security testing, embedded systems, and distributed systems (Part 3).
-
- ### 1.3 Key Definitions
-
- **Mutation score**: The ratio of mutants killed (detected by at least one failing test) to the total number of non-equivalent mutants generated. A score of 1.0 indicates that every generated non-equivalent mutant is detected by the suite.
-
- **Model checking**: An automated formal verification technique that exhaustively explores the reachable state space of a finite-state model of a system to determine whether a temporal logic property holds.
-
- **Search-based software testing (SBST)**: The application of metaheuristic search algorithms (genetic algorithms, simulated annealing, hill climbing) to software test generation, guided by a fitness function that quantifies test quality.
-
- **Chaos engineering**: "The discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production" (Basiri et al., 2016). It proceeds by defining a steady-state hypothesis, designing controlled experiments that inject failure, and observing deviations from steady state.
-
- ### 1.4 Relationship to Part 1
-
- The four paradigms in this paper are not replacements for the foundational techniques but amplifiers and quality assurance mechanisms for them. A scenario suite designed using BDD and combinatorial selection (Part 1) is an appropriate input to mutation analysis, which then reveals its adequacy gaps. Model checking verifies the temporal properties that BDD scenarios implicitly assert but cannot prove universally. SBST and LLM generation can be used to fill the coverage gaps that mutation analysis identifies. And chaos engineering stress-tests the scenarios that survive all prior validation by deploying them against the uncontrolled complexity of live infrastructure.
-
- ---
-
- ## 2. Theoretical Foundations
-
- ### 2.1 Fault-Based Testing Theory
-
- The theoretical justification for mutation testing rests on two foundational hypotheses articulated by DeMillo, Lipton, and Sayward (1978) in their landmark *Computer* article introducing the mutation testing concept:
-
- **The Competent Programmer Hypothesis (CPH)**: Real programmers, while imperfect, are "competent" in the sense that the programs they write are close to correct. Faults in real programs are typically small, local syntactic deviations from a correct program -- an off-by-one in a loop bound, a wrong comparison operator, a missing negation -- rather than wholesale algorithmic errors. This hypothesis justifies restricting the fault model to simple syntactic transformations (mutation operators) rather than arbitrary program modifications.
-
- **The Coupling Effect**: Test cases that detect simple (first-order) mutants -- programs with a single fault introduced -- tend also to detect complex (higher-order) mutants composed of multiple simultaneous faults, because the failure effects couple through the program. DeMillo et al. (1978) argued informally that this coupling implies a suite adequate for first-order mutants is likely adequate for the realistic faults that programs actually contain. Empirical evidence for the coupling effect at scale was provided by Jia and Harman (2011) in their comprehensive survey of the mutation testing literature.
-
- The formal connection between mutation score and fault detection probability was placed on firmer empirical ground by Andrews, Briand, and Labiche (2005), who demonstrated in controlled experiments that mutation-adequate test suites detect real faults at higher rates than coverage-adequate ones. This result was replicated and extended by Papadakis, Kintis, Zhang, Jia, Le Traon, and Harman (2019) in a large-scale study examining 27,000 real faults across multiple Java programs, finding that mutation score was a substantially stronger predictor of real fault detection than any structural coverage criterion.
-
- ### 2.2 Formal Methods Origins
-
- The theoretical foundations of model checking trace to three independent but converging research traditions in the late 1970s and early 1980s, recognized by the 2007 ACM Turing Award to Edmund M. Clarke, E. Allen Emerson, and Joseph Sifakis.
-
- Clarke and Emerson (1981) developed the framework of Computation Tree Logic (CTL), a branching-time temporal logic that can express properties of the tree of all possible execution paths of a reactive system, and demonstrated that CTL model checking could be performed in polynomial time in the number of states. Independently, Queille and Sifakis (1982) developed a verification system (CESAR) for concurrent programs that could verify temporal logic properties by state-space exploration. Emerson and Clarke (1982) then showed the connection between the two approaches and gave the first algorithmic treatment of CTL model checking via fixed-point computation over an explicitly represented state space. (Symbolic model checking, which represents the state space with binary decision diagrams (BDDs) and scales to far larger models, followed with McMillan's work in the early 1990s.)
-
- Linear Temporal Logic (LTL), formulated by Pnueli (1977), takes a different approach: it reasons about a single linear sequence of states (a single execution trace) rather than the full computation tree. LTL model checking is PSPACE-complete (Sistla & Clarke, 1985) and is implemented via automata-theoretic methods: the negation of the LTL property is converted to a Büchi automaton, and model checking reduces to checking whether the product of the system automaton and the property automaton has an accepting run.
-
- Leslie Lamport's TLA+ (2002), grounded in TLA (the Temporal Logic of Actions), a combination of temporal logic and predicate logic, provides a practical specification language used extensively in industry to model and verify concurrent and distributed system designs (e.g., Amazon Web Services internal use; Newcombe et al., 2015).
-
- ### 2.3 Search-Based Software Engineering
-
- The application of metaheuristic search to software engineering problems was systematized by Harman and Jones (2001) in their foundational paper coining the term "search-based software engineering" (SBSE). The central insight is that most software engineering problems can be reformulated as optimization problems: given a representation of the solution space and a fitness function measuring solution quality, apply a metaheuristic search -- genetic algorithm, simulated annealing, particle swarm, tabu search -- to find high-quality solutions.
-
- For test generation specifically, the representation is a test case (an input sequence and expected outputs), and the fitness function measures test quality according to some objective: branch coverage, mutation score, code complexity traversal. Fraser and Arcuri's EvoSuite (2011) operationalized this framework for Java using a whole-suite genetic algorithm that simultaneously evolves an entire test suite toward a multi-objective fitness function combining branch coverage and mutation score.
-
- The SBSE framework was formally connected to scenario testing by Ali, Briand, Hemmati, and Panesar-Walawege (2010), who showed that scenario-based test generation for UML-specified systems could be posed as a multi-objective optimization problem and solved with NSGA-II, a standard evolutionary multi-objective algorithm.
-
- ### 2.4 Resilience Engineering
-
- Chaos engineering's intellectual ancestry lies in resilience engineering, a discipline originating in nuclear power safety (Perrow, 1984; "Normal Accidents") and aviation human factors, then systematized by Erik Hollnagel and colleagues (Hollnagel, Woods, & Leveson, 2006; "Resilience Engineering: Concepts and Precepts"). Resilience engineering distinguishes between "work as imagined" (the system behavior anticipated by designers) and "work as done" (actual behavior under real operational conditions), and holds that the gap between these is the primary source of serious system failures.
-
- Chaos engineering translates this insight into an empirical methodology: the only way to know how a distributed system behaves under failure is to inject failure and observe. This is distinct from fault injection testing in the traditional sense (which operates against pre-production test environments) in that it targets production or production-faithful environments, recognizing that the complexity, load, and coupling of real production systems cannot be faithfully simulated in a test harness. The seminal Netflix chaos engineering paper (Basiri et al., 2016) and the O'Reilly "Chaos Engineering" book (Rosenthal, Jones, et al., 2020) formalized the methodology and defined the steady-state hypothesis as the organizing principle.
-
- ---
-
- ## 3. Taxonomy of Approaches
-
- **Table 1. Taxonomy of Advanced Scenario Testing Approaches**
-
- | Dimension | Mutation Testing | Formal Verification | AI/ML Generation | Chaos Engineering |
- |---|---|---|---|---|
- | **Automation Level** | High (tool-automated mutation + test execution) | Medium-High (model building manual; checking automated) | High (generation automated; oracle partially manual) | Medium (experiment design manual; injection automated) |
- | **Theoretical Foundation** | Fault-based testing (CPH, coupling effect) | Mathematical logic (temporal logic, fixed-point theory) | Metaheuristic optimization / deep learning | Resilience engineering, empirical systems science |
- | **Fault Model** | Code-level syntactic faults (logic errors, off-by-ones) | Behavioral property violations (safety, liveness) | Coverage-gap faults, boundary conditions, oracle failures | Infrastructure faults (network partitions, node failures, latency) |
- | **Scalability** | Moderate (cost proportional to mutant x test count; mitigated by incremental/selective mutation) | Poor for large state spaces (state explosion); bounded MC improves this | High for generation; oracle validation scales poorly | Scales to production; blast radius control required |
- | **Maturity Level** | High (40+ year research history; commercial tools) | High in hardware/protocols; Low-Medium in general software | Rapidly emerging (LLMs: 2022-present) | Medium (formalized ~2010-2016; growing adoption) |
- | **Prerequisite Knowledge** | Basic understanding of mutation operators; CI pipeline integration | Formal specification skill; temporal logic literacy | Varies: SBST tools low-barrier; LLM prompting moderate; RL high | Systems operations expertise; organizational risk tolerance |
- | **Primary Artifact** | Mutated program variants | Formal model (Kripke structure, TLA+ spec) | Generated test cases / scenarios | Failure injection experiments |
- | **Output** | Mutation score; suite adequacy gap analysis | Counterexamples (failing traces) or proof of correctness | Test suite augmentation; coverage-increasing scenarios | Resilience metrics; blast-radius-bounded failure behavior |
-
- This taxonomy reveals two orthogonal axes of significance. The first axis is **pre-production versus production**: mutation testing, formal verification, and AI-driven generation all operate on a model or artifact of the system before deployment; chaos engineering deliberately operates on the deployed system. The second axis is **fault discovery versus fault prevention**: mutation testing and chaos engineering are fundamentally about discovering whether a system or its test suite is deficient; formal verification provides proofs that specific properties cannot be violated (within the model's abstractions); AI generation sits between these, discovering gaps and synthesizing scenarios to fill them.
-
- Within AI-driven generation, a further sub-taxonomy is warranted:
-
- - **Evolutionary/genetic algorithms**: White-box, coverage-driven, mature (EvoSuite, Pynguin).
- - **Reinforcement learning**: Environment-modeled exploration, adversarial, emerging.
- - **LLM-based synthesis**: Black-box, semantics-aware, rapid but oracle-dependent, very recent.
- - **Fuzzing**: Grey-box or black-box, input mutation focused, overlaps with chaos engineering at the interface boundary.
-
- ---
-
- ## 4. Analysis
-
- ### 4.1 Mutation Testing for Scenario Adequacy
-
- #### 4.1.1 Theory and Mechanism
-
- Mutation testing works by programmatically generating a large set of slightly modified versions of the program under test -- *mutants* -- and then executing the existing test suite against each mutant. A mutant is *killed* if at least one test produces a different result on the mutant than on the original program. A mutant that survives all tests is *alive* and indicates either a gap in the scenario suite or an equivalent mutant.
-
- The **mutation score** is defined as:
-
- ```
- Mutation Score = (Killed Mutants) / (Total Mutants - Equivalent Mutants)
- ```
-
- **Mutation operators** are the rules that transform a program into a mutant. Standard operator families include (a minimal sketch of the mechanism follows the list):
-
- - *Statement deletion operators* (SDL): Remove individual statements.
- - *Arithmetic operator replacement* (AOR): Replace `+` with `-`, `*` with `/`, etc.
- - *Relational operator replacement* (ROR): Replace `<` with `<=`, `==` with `!=`, etc.
- - *Logical operator replacement* (LOR): Replace `&&` with `||`, `!` with the identity, etc.
- - *Constant replacement* (CR): Replace numeric constants with 0, 1, or -1.
- - *Variable replacement* (VR): Replace a variable reference with a different in-scope variable of compatible type.
- - *Return value mutation* (RV): Replace a return expression with a fixed value.
-
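- To make the mechanism concrete, the sketch below generates ROR mutants of a toy `clamp` function by AST rewriting and scores a deliberately weak suite against them. It is a minimal illustration under stated assumptions, not how production tools work (PIT mutates bytecode; mutmut drives a real test runner). Note that two of the four mutants (`<` -> `<=` and `>` -> `>=`) survive every check and are in fact equivalent for all inputs satisfying the implicit precondition `lo <= hi` -- exactly the judgment call that makes equivalence detection hard.
-
- ```python
- # Minimal first-order mutation via AST rewriting (toy illustration only).
- import ast, copy
-
- SRC = """
- def clamp(x, lo, hi):
-     if x < lo:
-         return lo
-     if x > hi:
-         return hi
-     return x
- """
-
- # ROR variants to try for each relational operator found.
- ROR = {ast.Lt: (ast.LtE, ast.Gt), ast.Gt: (ast.GtE, ast.Lt)}
-
- def mutants(tree):
-     """Yield first-order mutants, one operator swap per mutant."""
-     for i, node in enumerate(ast.walk(tree)):
-         if isinstance(node, ast.Compare) and type(node.ops[0]) in ROR:
-             for replacement in ROR[type(node.ops[0])]:
-                 m = copy.deepcopy(tree)
-                 list(ast.walk(m))[i].ops[0] = replacement()  # same walk order as original
-                 yield ast.fix_missing_locations(m)
-
- def suite_passes(ns):
-     """The scenario suite under assessment: three example-based checks."""
-     clamp = ns["clamp"]
-     try:
-         return clamp(5, 0, 10) == 5 and clamp(-1, 0, 10) == 0 and clamp(99, 0, 10) == 10
-     except Exception:
-         return False
-
- killed = total = 0
- for m in mutants(ast.parse(SRC)):
-     ns = {}
-     exec(compile(m, "<mutant>", "exec"), ns)
-     total += 1
-     killed += 0 if suite_passes(ns) else 1   # killed = at least one check fails
- print(f"killed {killed} of {total} mutants")  # 2 of 4; the survivors are the
-                                               # (conditionally) equivalent mutants
- ```
-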
- **First-order versus higher-order mutants**: A first-order mutant (FOM) contains exactly one syntactic change. A higher-order mutant (HOM) contains two or more changes. The coupling effect justifies focusing on FOMs, but Jia and Harman (2009) showed that certain HOMs, called "strongly subsuming HOMs," are harder to kill than any component first-order mutant.
-
- **Weak versus strong mutation**: Strong mutation killing requires that the test case propagate the fault's effects to an observable output. Weak mutation killing requires only that the mutated code be *reached and executed* with an immediately corrupted internal state. Weak mutation is cheaper to compute, but because the corrupted state may still be masked before it reaches an observable output, weak killing over-approximates strong killing.
-
- #### 4.1.2 Literature Evidence
-
- The foundational paper by DeMillo, Lipton, and Sayward (1978), "Hints on Test Data Selection: Help for the Practicing Programmer," introduced mutation testing as a formal concept. Offutt and colleagues extended the theoretical framework through the 1990s, with Offutt (1992) providing the first rigorous empirical evidence for the coupling effect and Offutt and Untch (2001) providing a retrospective survey of the field and a roadmap for its future.
-
- Jia and Harman (2011), "An Analysis and Survey of the Development of Mutation Analysis," covered 390 papers from 1977 to 2009. Papadakis et al. (2019), "Mutation Testing Advances: An Analysis and Survey," examined the relationship between mutation score and real fault detection across 27 Java open-source programs, finding that mutation-adequate suites detected on average 33.5% more real faults than branch-coverage-adequate suites of similar size.
-
- Schuler and Zeller (2012) proposed automated equivalent mutant detection using dynamic impact analysis. Kintis et al. (2018) achieved precision above 80% by detecting trivial mutant equivalences through compiler optimisations, though the problem remains fundamentally undecidable.
-
- #### 4.1.3 Implementations and Benchmarks
-
- **PIT (PITest)** is the dominant mutation testing tool for Java. PIT's key innovation is bytecode-level mutation, which eliminates the per-mutant recompilation step and enables incremental mutation -- rerunning analysis only for changed code paths. PIT integrates with Maven and Gradle.
-
- **Stryker Mutator** is the standard for JavaScript and TypeScript (with ports to .NET and Scala). Stryker generates mutants at the AST level and provides mutation score trending over time via its dashboard.
-
- **mutmut** is the primary Python mutation testing tool, integrating with pytest. **cosmic-ray** is an alternative with a plugin architecture for custom operators.
-
- **Mull** targets C and C++ via LLVM IR mutation.
-
- **Major** is a research-oriented Java mutation framework developed by René Just and colleagues, used in many academic studies, including the large-scale empirical work of Just, Jalali, and Ernst (2014), who created the Defects4J benchmark.
-
- #### 4.1.4 Strengths and Limitations
-
- **Strengths**: Quantitative quality metric correlated with real fault detection. Reveals specific coverage gaps. Integration with CI/CD via incremental analysis (PIT).
-
- **Limitations**: Computational cost O(N x T) for N mutants and T tests. Equivalent mutant problem (undecidable in general). Scalability to scenario-level tests requires careful operator selection. Operator redundancy -- a small set of 5-10 selectively chosen operators achieves near-full adequacy, and redundant mutants can inflate scores and distort test assessment (Papadakis et al., 2016).
-
- ---
-
- ### 4.2 Formal Verification and Scenario Properties
-
- #### 4.2.1 Theory and Mechanism
-
- **Kripke Structures**: M = (S, S0, R, L), where S is a finite set of states, S0 is the set of initial states, R is the transition relation over S x S, and L: S -> 2^AP assigns atomic propositions to states.
-
- **CTL**: Formulas use path quantifiers (A: "for all paths", E: "there exists a path") with temporal operators (X: next, F: eventually, G: globally, U: until). CTL model checking is O(|M| x |phi|). Example: `AG(request_received -> AF response_sent)` -- "whenever a request is received, a response is always eventually sent."
-
- **LTL**: Evaluated over individual linear traces. LTL model checking proceeds by constructing the product automaton of the system and the negation of phi (as a Büchi automaton), then checking the product for emptiness.
-
- **Bounded Model Checking (BMC)** (Biere et al., 1999): Encodes "does a counterexample of length at most k exist?" as a SAT problem. BMC generates concrete counterexample scenarios -- execution traces that violate the specified property.
-
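- The counterexample-as-scenario idea fits in a few lines. The sketch below is a toy explicit-state checker, not any real tool's API: it breadth-first-searches the reachable states of a hypothetical lock service whose lease timeout (buggily) re-grants a held lock, checks the safety property AG(at most one holder), and returns the shortest violating trace -- the same kind of trace a BMC engine emits as a ready-made test scenario.
-
- ```python
- # Toy explicit-state safety check with shortest-counterexample extraction.
- # Hypothetical model: a lock service whose lease timeout re-grants a held lock.
- from collections import deque
-
- INIT = (frozenset(), ("a", "b"), False)   # (holders, waiting queue, timeout fired?)
-
- def successors(s):
-     holders, waiting, timed_out = s
-     out = []
-     if waiting and not holders:                       # normal grant
-         out.append((holders | {waiting[0]}, waiting[1:], False))
-     if holders and not timed_out:                     # lease timeout fires
-         out.append((holders, waiting, True))
-     if timed_out and waiting:                         # BUG: re-grant without revoking
-         out.append((holders | {waiting[0]}, waiting[1:], False))
-     for h in holders:                                 # a holder releases
-         out.append((holders - {h}, waiting, False))
-     return out
-
- def bad(s):
-     return len(s[0]) > 1                              # two simultaneous holders
-
- def shortest_counterexample(init):
-     parent, queue = {init: None}, deque([init])
-     while queue:                                      # BFS => minimal-length trace
-         s = queue.popleft()
-         if bad(s):
-             trace = []
-             while s is not None:
-                 trace.append(s)
-                 s = parent[s]
-             return trace[::-1]                        # the violating scenario
-         for t in successors(s):
-             if t not in parent:
-                 parent[t] = s
-                 queue.append(t)
-     return None                                       # AG(!bad) holds for this model
-
- for state in shortest_counterexample(INIT) or ["property holds"]:
-     print(state)
- ```
-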
- **CEGAR** (Clarke et al., 2000): Counterexample-Guided Abstraction Refinement computes properties on an abstract model and refines the abstraction when spurious counterexamples are found.
-
- **Runtime Verification** (Havelund & Roșu, 2001): Monitors the executing system and checks whether the current trace satisfies or violates a property. Trades completeness for scalability.
-
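- In miniature, a runtime monitor is just an automaton run alongside the system. Because an unbounded "eventually" can never be refuted by a finite trace, monitors typically check a bounded variant of response properties. The hand-written sketch below is an assumption-laden toy, not the output of a framework like JavaMOP (which synthesizes such monitors from temporal-logic specifications):
-
- ```python
- # Minimal runtime monitor for the bounded-response property
- # "every 'request' is followed by a 'response' within WINDOW events".
- WINDOW = 3
-
- def monitor(events):
-     pending = []                      # event indices of unanswered requests
-     for i, ev in enumerate(events):
-         if ev == "request":
-             pending.append(i)
-         elif ev == "response" and pending:
-             pending.pop(0)            # satisfies the oldest open request
-         if pending and i - pending[0] >= WINDOW:
-             return f"violation: request at event {pending[0]} unanswered within {WINDOW} events"
-     return "no violation observed" if not pending else "inconclusive: requests still open"
-
- print(monitor(["request", "tick", "response", "request", "tick", "tick", "tick"]))
- ```
-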
- #### 4.2.2 Literature Evidence
-
- Clarke and Emerson (1981) established CTL model checking. Holzmann's SPIN model checker (1997, 2004) demonstrated practical model checking of concurrent protocols. Lamport (2002) defined TLA+, and Newcombe et al. (2015) documented AWS's use of it for verifying distributed system designs, finding critical bugs that had escaped all testing.
-
- Williams, Marre, Mouy, and Roger (2005) showed that BMC over UML state-machine models generates test scenarios covering temporal property violations with high efficiency. Chen and Roșu (2007) extended runtime verification with JavaMOP.
-
- #### 4.2.3 Implementations and Benchmarks
-
- **SPIN** -- LTL model checker for concurrent protocols (Promela language). 2001 ACM Software System Award.
-
- **NuSMV** -- Symbolic model checker supporting CTL and LTL with BDD-based and SAT-based checking.
-
- **TLA+/TLC** -- Specification language for concurrent/distributed systems. Used at Amazon, Microsoft, Oracle.
-
- **CBMC** -- C Bounded Model Checker. Applied to device drivers, automotive software, cryptographic implementations.
-
- **Java PathFinder (JPF)** -- Model checker for Java bytecode. Symbolic PathFinder extension adds symbolic execution for test input generation.
-
- **SV-COMP** -- Annual competition providing standardized verification benchmarks.
-
- #### 4.2.4 Strengths and Limitations
-
- **Strengths**: Completeness within the model. Concrete counterexample traces for suite augmentation. Temporal property expressiveness (safety, liveness, fairness). Increasing tool maturity.
-
- **Limitations**: State explosion. Abstraction gap between model and implementation. Specification learning curve. Limited support for dynamic languages.
-
- ---
-
- ### 4.3 AI/ML-Driven Scenario Generation
-
- #### 4.3.1 Theory and Mechanism
-
- **SBST Framework**: Representation (test case as point in search space), Fitness Function (branch coverage distance, mutation score, sequence complexity), Search Algorithm (genetic algorithms, simulated annealing, hill climbing).
-
- **EvoSuite's Whole-Suite Approach** (Fraser & Arcuri, 2011): Evolves an entire test suite simultaneously using a multi-objective fitness function combining coverage and mutation objectives.
-
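- The whole loop fits on a page. The toy sketch below (hypothetical target function and a crude fitness of "branches covered plus branch distance"; EvoSuite and Pynguin use fine-grained branch distances, suite-level chromosomes, archives, and real crossover) evolves integer test inputs toward full branch coverage of a function whose deepest branch is hard to hit by random sampling:
-
- ```python
- # Toy search-based test generation: mutation-only evolutionary search.
- import random
-
- BRANCHES = set()
-
- def target(x):
-     """Toy system under test; its nested branch is hard to hit at random."""
-     BRANCHES.add("entry")
-     if x > 1000:
-         BRANCHES.add("big")
-         if x % 7 == 0:
-             BRANCHES.add("big7")      # needs x > 1000 AND divisible by 7
-     else:
-         BRANCHES.add("small")
-
- def fitness(test):
-     """Primary objective: branches covered. Secondary: crude branch
-     distance toward the nearest uncovered branch, scaled into (0, 1)."""
-     BRANCHES.clear()
-     for x in test:
-         target(x)
-     score = len(BRANCHES)
-     if "big" not in BRANCHES:
-         score += 1.0 / (1.0 + min(abs(1001 - x) for x in test))
-     elif "big7" not in BRANCHES:
-         score += 1.0 / (1.0 + min(x % 7 for x in test if x > 1000))
-     return score
-
- def evolve(pop_size=20, generations=300):
-     population = [[random.randint(-10, 10) for _ in range(2)] for _ in range(pop_size)]
-     for gen in range(generations):
-         population.sort(key=fitness, reverse=True)
-         if fitness(population[0]) == 4:               # all four branches covered
-             return gen, population[0]
-         parents = population[: pop_size // 2]         # truncation selection
-         children = [[x + random.randint(-100, 100) for x in p] for p in parents]
-         population = parents + children               # mutate survivors' offspring
-     return generations, population[0]
-
- gen, best = evolve()
- print(f"generation {gen}: inputs {best} cover {fitness(best):.0f} branches")
- ```
-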
- **Reinforcement Learning**: Frames test generation as a sequential decision problem. Agent takes actions, receives coverage/crash rewards. Effective for interactive systems and autonomous vehicle perception testing (Koren et al., 2018).
-
- **LLM-Based Test Synthesis**: Generates test code by predicting plausible tokens given code context. Leverages training on massive code corpora encoding conventions, bug patterns, and testing idioms.
-
- **CodaMosa** (Lemieux et al., 2023): Hybridizes SBST and LLM -- uses Pynguin's MOSA-style genetic algorithm as the primary driver and queries an LLM when the search stalls. Achieves significantly higher coverage than either approach alone.
-
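- Stripped of all machinery, the control flow looks like the schematic below. Both `ga_step` and `ask_llm` are toy simulations standing in for a MOSA iteration and a Codex-style completion; the point is the plateau-triggered hand-off, not the numbers.
-
- ```python
- # Schematic of a CodaMosa-style hybrid loop with simulated components.
- import random
-
- def ga_step(coverage):
-     """Stand-in for one search iteration: progress that plateaus around 60%."""
-     return min(coverage + (random.random() * 0.02 if coverage < 0.6 else 0.0), 1.0)
-
- def ask_llm():
-     """Stand-in for LLM-generated tests that unlock new, uncovered code."""
-     return random.random() * 0.15
-
- def hybrid(budget=300, stall_limit=15):
-     coverage = best = 0.0
-     stalled = 0
-     for _ in range(budget):
-         coverage = ga_step(coverage)
-         if coverage > best:
-             best, stalled = coverage, 0
-         else:
-             stalled += 1
-         if stalled >= stall_limit:        # coverage plateau: hand off to the LLM,
-             coverage = min(1.0, coverage + ask_llm())
-             stalled = 0                   # then resume the evolutionary search
-     return coverage
-
- random.seed(1)
- print(f"final simulated coverage: {hybrid():.2f}")
- ```
-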
- #### 4.3.2 Literature Evidence
-
- Fraser and Arcuri (2011, 2013) established EvoSuite as state-of-the-art. Lukasczyk et al. (2022) adapted the approach for Python with Pynguin. Lemieux et al. (2023) demonstrated CodaMosa's hybrid superiority.
-
- Meta's TestGen-LLM reported 25% of LLM-generated tests accepted without modification in production and 32% accepted after minor edits; Schäfer et al. (2024) provide a complementary large-scale empirical evaluation of LLM-based unit test generation.
-
- Böhme et al. (2016) connected coverage-guided fuzzing to Markov chain theory. Barr et al. (2015) established the oracle problem as the primary limitation of automated test generation.
-
- #### 4.3.3 Implementations and Benchmarks
-
- **EvoSuite** (Java) -- Most mature SBST tool. Maven/Gradle plugin. Multiple fitness functions.
-
- **Pynguin** (Python) -- EvoSuite equivalent implementing DYNAMOSA and MIO algorithms.
-
- **CodaMosa** -- Pynguin + LLM escape hatch.
-
- **Randoop** (Java) -- Feedback-directed random test generation. Simpler and faster than EvoSuite.
-
- **AFL++** -- Leading coverage-guided fuzzer. AFLGo extends with directed greybox fuzzing.
-
- **LLM tools** -- GitHub Copilot `/tests`, Qodo (formerly CodiumAI), Microsoft SymPrompt.
-
- **Benchmarks**: SF110, SBST Java Tool Competition (annual at ICSE), CodaMosa benchmark (486 Python functions).
-
- #### 4.3.4 Strengths and Limitations
-
- **Strengths**: Coverage of intractable scenario spaces. SBST and LLM are complementary. Reduces boilerplate effort. Continuously improving.
-
- **Limitations**: Oracle problem (generated assertions may be incorrect). Flaky test generation. Coverage vs. meaningfulness tension. Computational cost. Domain knowledge absence -- high-level scenario intent must come from humans.
-
- ---
-
- ### 4.4 Chaos Engineering and Resilience Scenarios
-
- #### 4.4.1 Theory and Mechanism
-
- Five principles (Basiri et al., 2016): (1) Build hypothesis around steady-state behavior. (2) Vary real-world events. (3) Run experiments in production. (4) Automate continuously. (5) Minimize blast radius.
-
- **Failure Injection Taxonomy**:
- - **Layer**: Infrastructure, network, application, data
- - **Scope**: Process, service, zone, region
- - **Temporal pattern**: Instantaneous, sustained, intermittent, correlated
-
- **Fault-Tolerance Pattern Testing**: Each resilience pattern (retry, circuit breaker, bulkhead, timeout, fallback) has a corresponding chaos experiment design.
-
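- The experiment discipline reduces to a small loop: verify the steady-state hypothesis, inject, observe, and always roll back. The self-contained sketch below simulates this end to end -- the "service" is a hypothetical latency distribution and the fault injector a global toggle; platforms such as Chaos Toolkit and Litmus express the same verify/inject/rollback structure declaratively against real infrastructure.
-
- ```python
- # Minimal steady-state chaos experiment against a simulated service.
- import random
-
- FAULT = {"extra_latency_ms": 0}        # stand-in for a real injector (tc, ToxiProxy)
-
- def sample_latency_ms():
-     """Toy 'service': one latency sample, plus any injected dependency latency."""
-     return random.gauss(40, 5) + FAULT["extra_latency_ms"]
-
- def steady_state_ok(p95_budget_ms=80, n=200):
-     """Steady-state hypothesis: p95 latency stays under budget."""
-     samples = sorted(sample_latency_ms() for _ in range(n))
-     return samples[int(0.95 * n)] < p95_budget_ms
-
- def run_experiment():
-     if not steady_state_ok():
-         raise SystemExit("abort: not in steady state before injection")
-     FAULT["extra_latency_ms"] = 60     # inject: +60ms on a dependency call
-     try:
-         held = steady_state_ok()       # observe the hypothesis under turbulence
-     finally:
-         FAULT["extra_latency_ms"] = 0  # roll back no matter what (bound the blast radius)
-     print("steady state held under fault:", held)   # False here: no timeout/fallback
-
- run_experiment()
- ```
-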
- **Fuzzing as Scenario Generation**: Protocol fuzzing (AFL++, RESTler) targets input-handling resilience.
-
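- A fuzzer in miniature keeps only the mutate-execute-detect loop, as in the toy sketch below (hypothetical parser target; AFL++ adds compile-time instrumentation, coverage-driven corpus growth, and mutation stacking):
-
- ```python
- # Toy mutation-based fuzzing loop against a hypothetical parser.
- import random
-
- def parse_header(data: bytes):
-     """Toy protocol parser; its error handling is the system under test."""
-     if len(data) >= 4 and data[:2] == b"OK":
-         length = data[2]
-         return data[4 : 4 + length].decode("ascii")  # latent crash: non-ASCII payload
-     return None
-
- def mutate(seed: bytes) -> bytes:
-     """Random byte overwrites -- the simplest mutation operator."""
-     b = bytearray(seed)
-     for _ in range(random.randint(1, 3)):
-         b[random.randrange(len(b))] = random.randrange(256)
-     return bytes(b)
-
- corpus = [b"OK\x02\x00hi"]                 # one well-formed seed input
- crashes = []
- for _ in range(10_000):
-     candidate = mutate(random.choice(corpus))
-     try:
-         parse_header(candidate)
-     except Exception as exc:               # each crash is a resilience scenario
-         crashes.append((candidate, repr(exc)))
-         # A real fuzzer would also feed coverage-increasing inputs back to the corpus.
-
- print(f"{len(crashes)} crashing inputs; e.g. {crashes[0] if crashes else None}")
- ```
-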
- **Game Days**: Structured exercises for practicing incident response and validating runbooks.
-
- #### 4.4.2 Literature Evidence
-
- Basiri et al. (2016) formalized the Netflix methodology. Rosenthal et al. (2020) provided the comprehensive reference. Gunawi et al. (2011) introduced systematic fault injection for cloud recovery testing (FATE/DESTINI). Alvaro et al. (2015) introduced Lineage-Driven Fault Injection (LDFI) -- using data provenance to identify minimal failure sets.
-
- Nygard (2018) synthesized stability patterns and their chaos testing strategies. Heorhiadi et al. (2016) proposed Gremlin, a framework for systematic resilience testing of microservices.
-
- #### 4.4.3 Implementations and Benchmarks
-
- **Chaos Monkey** (Netflix) -- Random instance termination in production.
-
- **Chaos Toolkit** -- Language-agnostic, YAML-based experiment definitions.
-
- **Litmus Chaos** (CNCF) -- Kubernetes-native chaos with ChaosExperiment CRDs and ChaosHub.
-
- **Gremlin** (commercial) -- GUI-driven with strong blast radius control.
-
- **Chaos Mesh** (CNCF) -- Kubernetes chaos emphasizing network fault simulation via `tc`.
-
- **ToxiProxy** (Shopify) -- TCP proxy for deterministic network condition simulation in pre-production.
-
- **Pumba** -- Docker container chaos tool.
-
- No standardized benchmark suite exists -- resilience is measured by system-specific SLA metrics.
-
- #### 4.4.4 Strengths and Limitations
-
- **Strengths**: Production fidelity. Discovery of unknown unknowns. Organizational resilience building. Continuous validation.
-
- **Limitations**: Production risk. Organizational maturity requirement. No formal completeness guarantees. Absence of standardized resilience metrics. Limited applicability to monolithic/batch systems.
-
- ---
-
- ## 5. Comparative Synthesis
-
- **Table 2. Cross-Cutting Comparative Trade-Off Analysis**
-
- | Dimension | Mutation Testing | Formal Verification | AI/ML Generation | Chaos Engineering |
- |---|---|---|---|---|
- | **Automation Level** | High | Medium | High (generation) / Medium (oracle) | Medium-High |
- | **Fault Detection Capability** | High for code-level logic faults | Complete within model | High for structural coverage; variable for semantic faults | High for infrastructure/resilience faults |
- | **Scalability** | Moderate (mitigated by sampling, incremental) | Poor (state explosion) | High generation; poor oracle validation | Scales to production with blast radius control |
- | **Theoretical Foundation** | CPH, coupling effect (empirically validated) | Mathematical logic (sound and complete) | Optimization theory / empirical (LLMs) | Resilience engineering (empirical) |
- | **Tool Maturity** | High (PIT, Stryker) | High for protocols (SPIN); Medium for software | Medium (EvoSuite, Pynguin); Rapidly emerging (LLMs) | Medium (Litmus, Chaos Toolkit) |
- | **Primary Output** | Mutation score; surviving mutant list | Proof certificate or counterexample trace | Augmented test suite | Resilience metric; failure mode catalog |
- | **Evidence Base** | Strong (30+ years; Defects4J) | Strong for hardware/protocols; moderate for software | Strong for SBST; recent for LLMs | Strong in industry; limited academic RCTs |
- | **Primary Adoption Barrier** | Computational cost; equivalent mutants | Specification learning curve; abstraction gap | Oracle problem; flakiness | Production risk; observability prerequisites |
-
- **The Complementarity Cycle**: The most productive deployment is sequential and complementary. An initial BDD suite (Part 1) is subjected to mutation analysis to identify gaps. Surviving mutants become targets for AI/ML generation (CodaMosa, EvoSuite). Critical temporal properties are verified by model checking, and counterexample traces augment the suite. The suite is then validated in production via chaos engineering. This workflow maximizes fault-space coverage at each level.
-
- ---
-
- ## 6. Open Problems and Research Gaps
-
- ### 6.1 The Equivalent Mutant Problem
-
- The equivalent mutant problem is theoretically undecidable. Practical automated approaches achieve 75-85% precision. The research gap is a classifier with >95% precision suitable for CI pipelines without human review. LLM-powered semantic reasoning shows promise but lacks systematic evaluation at scale.
-
- ### 6.2 State Explosion in Real-World Model Checking
-
- Enterprise applications with hundreds of threads have state spaces far beyond the reach of current model checkers. Promising directions: compositional verification, assume-guarantee reasoning, and integration with static analysis for state space pruning.
-
- ### 6.3 LLM-Generated Tests: Meaningfulness vs. Coverage
-
- LLMs can generate high-coverage tests that are structurally reachable but semantically trivial. The oracle problem compounds this: LLM-generated assertions may encode incorrect expected behavior. Research gaps include automated meaningfulness evaluation, reliable oracle generation, and longitudinal fault detection studies.
-
- ### 6.4 Chaos Engineering Metrics and Maturity Models
-
- No standardized "chaos adequacy" metric exists analogous to mutation score. LDFI (Alvaro et al., 2015) provides a theoretical starting point but has not been generalized.
-
- ### 6.5 Integration of Formal Methods with Practical Workflows
-
- Lightweight formal methods -- verification techniques applicable incrementally within agile workflows -- remain a gap. Promising directions: property-based testing, runtime verification, and LLM-synthesized formal specifications from natural language requirements.
-
- ### 6.6 Adversarial Machine Learning for Scenario Generation
-
- ML-based systems require scenario testing against adversarial inputs that no conventional methodology addresses. The SBST and chaos engineering frameworks have not been fully adapted to the adversarial ML threat model.
-
- ---
-
- ## 7. Conclusion
-
- This survey has examined four advanced paradigms for scenario-based testing. Each addresses a fundamental inadequacy of classical scenario testing.
-
- Mutation testing is the most mature and empirically grounded, with production-ready tooling and more than four decades of evidence. Formal verification delivers correctness guarantees unavailable from testing, but scalability constraints limit it to protocol-level and model-level verification; its contribution to scenario testing is most valuable as a counterexample generator (BMC) and runtime monitor. AI/ML-driven generation is the most rapidly evolving area: SBST is mature and LLM generation is demonstrating industrial utility, but the oracle problem and meaningfulness challenge remain unresolved. Chaos engineering has proven its value at scale but lacks theoretical foundations and standardized metrics.
-
- No verdict on primacy is appropriate: each paradigm addresses a different slice of the testing problem space. The mutation-verification-generation-chaos cycle, when implemented with mature tooling and appropriate organizational prerequisites, provides substantially more complete assurance than any single paradigm alone.
-
- ---
-
- ## References
-
- **Mutation Testing**
-
- 1. DeMillo, R. A., Lipton, R. J., & Sayward, F. G. (1978). Hints on test data selection: Help for the practicing programmer. *Computer*, 11(4), 34-41.
- 2. Offutt, A. J. (1992). Investigations of the software testing coupling effect. *ACM TOSEM*, 1(1), 5-20.
- 3. Offutt, A. J., & Untch, R. H. (2001). Mutation 2000: Uniting the orthogonal. In *Mutation Testing for the New Century* (pp. 34-44). Springer.
- 4. Jia, Y., & Harman, M. (2009). Higher order mutation testing. *IST*, 51(10), 1379-1393.
- 5. Jia, Y., & Harman, M. (2011). An analysis and survey of the development of mutation analysis. *IEEE TSE*, 37(5), 649-678.
- 6. Andrews, J. H., Briand, L. C., & Labiche, Y. (2005). Is mutation an appropriate tool for testing experiments? *ICSE 2005*, 402-411.
- 7. Papadakis, M., et al. (2019). Mutation testing advances: An analysis and survey. *Advances in Computers*, 112, 275-378.
- 8. Just, R., Jalali, D., & Ernst, M. D. (2014). Defects4J: A database of existing faults to enable controlled testing studies for Java programs. *ISSTA 2014*, 437-440.
- 9. Schuler, D., & Zeller, A. (2012). Reassessing automatic evaluation of test suites. *ICSE 2012*, 1112-1121.
- 10. Kintis, M., et al. (2018). Detecting trivial mutant equivalences via compiler optimisations. *IEEE TSE*, 44(4), 308-333.
- 11. Papadakis, M., et al. (2016). Threats to the validity of mutation-based test assessment. *ISSTA 2016*, 354-365.
-
- **Formal Verification**
-
- 12. Clarke, E. M., & Emerson, E. A. (1981). Design and synthesis of synchronization skeletons using branching time temporal logic. *Logic of Programs* (LNCS 131), 52-71.
- 13. Queille, J. P., & Sifakis, J. (1982). Specification and verification of concurrent systems in CESAR. *5th International Symposium on Programming* (LNCS 137), 337-351.
- 14. Pnueli, A. (1977). The temporal logic of programs. *FOCS 1977*, 46-57.
- 15. Clarke, E. M., Grumberg, O., & Peled, D. (1999). *Model Checking*. MIT Press.
- 16. Holzmann, G. J. (1997). The model checker SPIN. *IEEE TSE*, 23(5), 279-295.
- 17. Holzmann, G. J. (2004). *The SPIN Model Checker*. Addison-Wesley.
- 18. Biere, A., et al. (1999). Symbolic model checking without BDDs. *TACAS 1999* (LNCS 1579), 193-207.
- 19. Clarke, E. M., et al. (2000). Counterexample-guided abstraction refinement. *CAV 2000* (LNCS 1855), 154-169.
- 20. Lamport, L. (2002). *Specifying Systems: The TLA+ Language and Tools*. Addison-Wesley.
- 21. Newcombe, C., et al. (2015). How Amazon Web Services uses formal methods. *CACM*, 58(4), 66-73.
- 22. Havelund, K., & Roșu, G. (2001). Synthesizing monitors for safety properties. *TACAS 2001* (LNCS 2031), 342-356.
- 23. Cimatti, A., et al. (2002). NuSMV 2. *CAV 2002* (LNCS 2404), 359-364.
- 24. Visser, W., et al. (2003). Model checking programs. *ASE*, 10(2), 203-232.
-
- **Search-Based and AI/ML-Driven Generation**
-
- 25. Harman, M., & Jones, B. F. (2001). Search-based software engineering. *IST*, 43(14), 833-839.
- 26. McMinn, P. (2004). Search-based software test data generation: A survey. *STVR*, 14(2), 105-156.
- 27. Fraser, G., & Arcuri, A. (2011). EvoSuite: Automatic test suite generation. *ESEC/FSE 2011*, 416-419.
- 28. Fraser, G., & Arcuri, A. (2013). Whole test suite generation. *IEEE TSE*, 39(2), 276-291.
- 29. Lukasczyk, S., et al. (2022). An empirical study of automated unit test generation for Python. *IST*, 147, 106794.
- 30. Lemieux, C., et al. (2023). CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models. *ICSE 2023*, 919-931.
- 31. Schäfer, M., et al. (2024). An empirical evaluation of using large language models for automated unit test generation. *IEEE TSE*, 50(1), 85-105.
- 32. Barr, E. T., et al. (2015). The oracle problem in software testing: A survey. *IEEE TSE*, 41(5), 507-525.
- 33. Böhme, M., et al. (2016). Coverage-based greybox fuzzing as Markov chain. *CCS 2016*, 1032-1043.
- 34. Pradel, M., & Sen, K. (2018). DeepBugs: A learning approach to name-based bug detection. *PACMPL*, 2(OOPSLA), 147.
- 35. Inozemtseva, L., & Holmes, R. (2014). Coverage is not strongly correlated with test suite effectiveness. *ICSE 2014*, 435-445.
-
- **Chaos Engineering**
-
- 36. Basiri, A., et al. (2016). Chaos engineering. *IEEE Software*, 33(3), 35-41.
- 37. Rosenthal, C., et al. (2020). *Chaos Engineering: System Resiliency in Practice*. O'Reilly.
- 38. Gunawi, H. S., et al. (2011). FATE and DESTINI: A framework for cloud recovery testing. *NSDI 2011*, 238-252.
- 39. Alvaro, P., et al. (2015). Lineage-driven fault injection. *SIGMOD 2015*, 331-346.
- 40. Hollnagel, E., Woods, D. D., & Leveson, N. (2006). *Resilience Engineering: Concepts and Precepts*. Ashgate.
- 41. Nygard, M. T. (2018). *Release It!* (2nd ed.). Pragmatic Bookshelf.
- 42. Heorhiadi, V., et al. (2016). Gremlin: Systematic resilience testing of microservices. *ICDCS 2016*, 57-66.
-
- ---
-
- ## Practitioner Resources
-
- ### Mutation Testing
-
- - **PIT**: pitest.org -- Production-grade Java mutation testing. Start with Maven plugin and default operators.
- - **Stryker Mutator**: stryker-mutator.io -- JavaScript/TypeScript/Scala/.NET. Dashboard provides trend tracking.
- - **mutmut**: github.com/boxed/mutmut -- Python, integrates with pytest.
- - **cosmic-ray**: github.com/sixty-north/cosmic-ray -- Python, plugin-extensible.
- - **Mull**: github.com/mull-project/mull -- C/C++ via LLVM IR.
- - Key reading: Papadakis et al. (2019) survey; Just et al. (2014) Defects4J.
-
- ### Formal Verification
-
- - **TLA+/PlusCal**: lamport.azurewebsites.net/tla/tla.html -- Start with PlusCal. Free video course.
- - **SPIN**: spinroot.com -- For concurrent protocol verification.
- - **CBMC**: cprover.org/cbmc/ -- C bounded model checker.
- - **Java PathFinder**: github.com/javapathfinder/jpf-core -- Java concurrency verification.
- - **Alloy Analyzer**: alloytools.org -- Declarative relational modeling, lower learning curve than TLA+.
- - Key reading: Newcombe et al. (2015) AWS paper; Clarke et al. (1999) textbook.
-
- ### AI/ML-Driven Generation
-
- - **EvoSuite**: github.com/EvoSuite/evosuite -- Java, Maven/Gradle plugin.
- - **Pynguin**: github.com/se2p/pynguin -- Python. MIO algorithm recommended.
- - **CodaMosa**: github.com/microsoft/codamosa -- Python + LLM escape hatch.
- - **AFL++**: github.com/AFLplusplus/AFLplusplus -- Coverage-guided fuzzer for C/C++.
- - **Hypothesis**: hypothesis.works -- Property-based testing for Python. Accessible first step toward automated generation.
- - **Qodo**: qodo.ai -- Commercial LLM-based test generation.
- - Key reading: Fraser & Arcuri (2013); Lemieux et al. (2023); Barr et al. (2015).
-
- ### Chaos Engineering
-
- - **Litmus Chaos**: litmuschaos.io -- Kubernetes-native. Start with pod-delete and network-latency.
- - **Chaos Toolkit**: chaostoolkit.org -- YAML-driven, multi-cloud.
- - **Gremlin**: gremlin.com -- Commercial, strongest blast-radius controls.
- - **Chaos Mesh**: chaos-mesh.org -- Kubernetes network fault simulation.
- - **ToxiProxy**: github.com/Shopify/toxiproxy -- Pre-production network simulation.
- - **Principles of Chaos Engineering**: principlesofchaos.org
- - Key reading: Basiri et al. (2016); Alvaro et al. (2015) LDFI; Rosenthal et al. (2020).