compound-agent 1.7.6 → 2.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +45 -1
- package/README.md +70 -47
- package/bin/ca +32 -0
- package/package.json +19 -78
- package/scripts/postinstall.cjs +221 -0
- package/dist/cli.d.ts +0 -1
- package/dist/cli.js +0 -13158
- package/dist/cli.js.map +0 -1
- package/dist/index.d.ts +0 -3730
- package/dist/index.js +0 -3240
- package/dist/index.js.map +0 -1
- package/docs/research/AgenticAiCodebaseGuide.md +0 -1206
- package/docs/research/BuildingACCompilerAnthropic.md +0 -116
- package/docs/research/HarnessEngineeringOpenAi.md +0 -220
- package/docs/research/code-review/systematic-review-methodology.md +0 -409
- package/docs/research/index.md +0 -76
- package/docs/research/learning-systems/knowledge-compounding-for-agents.md +0 -695
- package/docs/research/property-testing/property-based-testing-and-invariants.md +0 -742
- package/docs/research/scenario-testing/advanced-and-emerging.md +0 -470
- package/docs/research/scenario-testing/core-foundations.md +0 -507
- package/docs/research/scenario-testing/domain-specific-and-human-factors.md +0 -474
- package/docs/research/security/auth-patterns.md +0 -138
- package/docs/research/security/data-exposure.md +0 -185
- package/docs/research/security/dependency-security.md +0 -91
- package/docs/research/security/injection-patterns.md +0 -249
- package/docs/research/security/overview.md +0 -81
- package/docs/research/security/secrets-checklist.md +0 -92
- package/docs/research/security/secure-coding-failure.md +0 -297
- package/docs/research/software_architecture/01-science-of-decomposition.md +0 -615
- package/docs/research/software_architecture/02-architecture-under-uncertainty.md +0 -649
- package/docs/research/software_architecture/03-emergent-behavior-in-composed-systems.md +0 -644
- package/docs/research/spec_design/decision_theory_specifications_and_multi_criteria_tradeoffs.md +0 -0
- package/docs/research/spec_design/design_by_contract.md +0 -251
- package/docs/research/spec_design/domain_driven_design_strategic_modeling.md +0 -183
- package/docs/research/spec_design/formal_specification_methods.md +0 -161
- package/docs/research/spec_design/logic_and_proof_theory_under_the_curry_howard_correspondence.md +0 -250
- package/docs/research/spec_design/natural_language_formal_semantics_abuguity_in_specifications.md +0 -259
- package/docs/research/spec_design/requirements_engineering.md +0 -234
- package/docs/research/spec_design/systems_engineering_specifications_emergent_behavior_interface_contracts.md +0 -149
- package/docs/research/spec_design/what_is_this_about.md +0 -305
- package/docs/research/tdd/test-driven-development-methodology.md +0 -547
- package/docs/research/test-optimization-strategies.md +0 -401
- package/scripts/postinstall.mjs +0 -102
|
@@ -1,507 +0,0 @@
|
|
|
1
|
-
# Scenario Testing for CS Development: Core Foundations
|
|
2
|
-
*2026-03-07*
|
|
3
|
-
|
|
4
|
-
---
|
|
5
|
-
|
|
6
|
-
## Abstract
|
|
7
|
-
|
|
8
|
-
Scenario-based testing is one of the most consequential families of software testing techniques, grounding verification and validation in concrete, human-readable descriptions of system behavior rather than in abstract structural properties of code. This survey examines the core foundations of scenario testing across four interconnected areas: scenario specification languages and frameworks (most prominently Behavior-Driven Development and the Gherkin/Cucumber ecosystem); model-based scenario generation using finite automata, UML diagrams, and Markov chains; combinatorial and interaction testing based on covering arrays and t-way interaction theory; and the complementary problem of scenario explosion and selection, addressed through risk-based prioritization, equivalence partitioning, and search-based techniques.
|
|
9
|
-
|
|
10
|
-
Taken together, these four areas span the full lifecycle of a scenario: how it is expressed and made executable (Section 4.1), how it can be generated systematically from behavioral models (Section 4.2), how the combinatorial space of parameter interactions is managed (Section 4.3), and how a tractable and high-value subset of all possible scenarios is identified and maintained (Section 4.4). Each area rests on distinct theoretical foundations -- formal language theory, graph traversal algorithms, combinatorics, and multi-objective optimization respectively -- yet they share the central premise that testing gains power when anchored to observable, stakeholder-meaningful system behavior.
|
|
11
|
-
|
|
12
|
-
A key finding of this survey is that no single approach dominates across all quality dimensions. BDD-style specification maximizes stakeholder communication and executable documentation but imposes significant maintenance costs; model-based generation maximizes systematic coverage but requires model fidelity that is expensive to establish; combinatorial testing provides strong fault-detection guarantees at tractable cost but is most suited to parameter-rich, interaction-sensitive systems; and search-based selection optimizes scenario portfolios against cost-coverage trade-offs but depends heavily on the quality of the fitness function. The paper identifies six open research problems and provides a practitioner resource guide.
|
|
13
|
-
|
|
14
|
-
---
|
|
15
|
-
|
|
16
|
-
## 1. Introduction
|
|
17
|
-
|
|
18
|
-
### 1.1 Problem Statement
|
|
19
|
-
|
|
20
|
-
Software systems fail in scenarios. When a payment gateway rejects a valid credit card during a promotional sale, when a medical device displays a contradictory alert sequence, or when a web application silently corrupts data upon concurrent user actions, the failure mode is invariably a particular combination of inputs, states, and timing -- that is, a scenario. Unit tests, which verify isolated functions, and code-coverage metrics, which measure structural traversal, are necessary but not sufficient to catch these failures. The 2002 NIST report on the cost of software bugs estimated that inadequate testing infrastructure costs the US economy between $22.2 billion and $59.5 billion annually (Tassey, 2002); subsequent analyses have consistently found that integration and system-level scenario failures account for the majority of costs that escape to production.
|
|
21
|
-
|
|
22
|
-
The concept of testing through scenarios is not new, but it has undergone significant formalization over the past three decades. Ivar Jacobson's introduction of use cases in object-oriented software engineering (Jacobson, 1992) provided the first widely adopted vocabulary for describing system behavior from an external, actor-centric perspective. Carroll's scenario-based design literature (Carroll, 1995, 2000) established the cognitive and communicative value of narrative scenarios in software development. The testing community absorbed and extended these ideas, producing a rich family of techniques that now span natural-language specifications, formal behavioral models, combinatorial mathematics, and meta-heuristic optimization.
|
|
23
|
-
|
|
24
|
-
This survey addresses a gap in the existing literature. While several excellent surveys address individual sub-areas -- Utting, Pretschner, and Legeard's taxonomy of model-based testing (2012), Nie and Leung's survey of combinatorial testing (2011), and Yoo and Harman's survey of regression test optimization (2012) -- no single work synthesizes the four foundational areas of scenario testing into a unified framework that clarifies their theoretical relationships, comparative strengths, and integration potential.
|
|
25
|
-
|
|
26
|
-
### 1.2 Scope and Boundaries
|
|
27
|
-
|
|
28
|
-
This paper, the first in a three-part series, covers the foundational concepts and techniques of scenario-based testing:
|
|
29
|
-
|
|
30
|
-
- **Paper 1 (this work):** Scenario specification languages and frameworks; model-based scenario generation; combinatorial and interaction testing; scenario explosion and selection.
|
|
31
|
-
- **Paper 2 (Advanced and Emerging):** Mutation testing applied to scenarios; formal verification and model checking; AI/ML-driven scenario generation; chaos engineering and resilience testing.
|
|
32
|
-
- **Paper 3 (Domain-Specific Applications):** Safety-critical systems (ISO 26262, DO-178C); security and penetration scenario testing; embedded and real-time systems; distributed and microservice architectures.
|
|
33
|
-
|
|
34
|
-
The boundary between Paper 1 and Paper 2 is intentional: we cover the use of state machines and Markov chains as generative models for test scenarios (a classical, well-understood technique) but defer coverage of neural-network-based test generation, property-based testing with shrinking, and formal LTL/CTL model checking to Paper 2. The boundary between Paper 1 and Paper 3 reflects the distinction between general-purpose techniques and their domain-specific instantiations.
|
|
35
|
-
|
|
36
|
-
### 1.3 Key Definitions
|
|
37
|
-
|
|
38
|
-
**Scenario:** A concrete, sequenced description of an interaction between an actor (human or system) and the system under test, resulting in an observable outcome. A scenario is characterized by a precondition, a sequence of actions or events, and a postcondition (Jacobson, 1992; Wiegers, 2003).
|
|
39
|
-
|
|
40
|
-
**Test scenario:** A scenario that has been made executable -- i.e., the preconditions, actions, and expected postconditions are expressed with sufficient precision that test execution can produce a pass or fail verdict.
|
|
41
|
-
|
|
42
|
-
**Scenario specification:** A formal or semi-formal artifact that defines one or more test scenarios, including the language or notation used, the coverage criteria it addresses, and the level of abstraction maintained.
|
|
43
|
-
|
|
44
|
-
**Scenario coverage:** A measure of the extent to which a test suite exercises the scenario space defined by a specification or model. Coverage criteria for scenarios include state coverage, transition coverage, path coverage in model-based testing, and interaction strength in combinatorial testing.
|
|
45
|
-
|
|
46
|
-
**Scenario explosion:** The combinatorial phenomenon in which the number of distinct scenarios that could be derived from a specification grows exponentially with the number of parameters, states, or actors, making exhaustive testing infeasible.
|
|
47
|
-
|
|
48
|
-
### 1.4 Relationship to Papers 2 and 3
|
|
49
|
-
|
|
50
|
-
The techniques in this paper are preconditions to the advanced approaches of Paper 2 and the domain applications of Paper 3. A safety-critical system engineer applying Paper 3's ISO 26262 guidance will use the BDD specification frameworks and model-based generation techniques of this paper as their foundation. An AI-driven test generation tool of Paper 2 will typically use covering arrays (Section 4.3) as a baseline comparison or as a seed for neural generation. Understanding the theoretical grounding here -- why certain coverage criteria are defined the way they are, what the mathematical guarantees of combinatorial testing actually are -- is essential for evaluating the claims made in the more advanced and specialized literature.
|
|
51
|
-
|
|
52
|
-
---
|
|
53
|
-
|
|
54
|
-
## 2. Foundations
|
|
55
|
-
|
|
56
|
-
### 2.1 Testing Theory: The Impossibility of Exhaustive Testing
|
|
57
|
-
|
|
58
|
-
The intellectual foundation for all scenario selection techniques is Dijkstra's famous observation that "program testing can be used to show the presence of bugs, but never to show their absence" (Dijkstra, 1972). This is not merely a philosophical point; it follows directly from the combinatorial explosion of possible inputs and execution paths in any non-trivial program. Howden (1976) formalized this as the "infeasibility of complete testing," demonstrating that even for a program accepting two 32-bit integers, exhaustive testing requires on the order of 10^19 test cases. Myers (1979), in "The Art of Software Testing," introduced the concept of test design as a disciplined engineering activity -- choosing test cases according to rational criteria rather than intuition -- and identified equivalence partitioning and boundary value analysis as the primary systematic techniques.
|
|
59
|
-
|
|
60
|
-
Beizer (1990), in "Software Testing Techniques," extended this into a taxonomy of test design methods organized by the level of abstraction at which they operate: from domain-level functional testing through data-flow analysis to path testing. Beizer's work is notable for its early recognition that the most important bugs cluster not in individual statements but at the interfaces between functional components -- exactly the domain in which scenario testing excels.
|
|
61
|
-
|
|
62
|
-
The concept of a **coverage criterion** is central to testing theory. A coverage criterion C defines a set of test requirements TR, and a test suite T is said to satisfy C if every element of TR is exercised by at least one test in T. Ammann and Offutt (2008, 2017) provide the most comprehensive modern treatment of coverage criteria, unified under a logic coverage framework that subsumes clause coverage, predicate coverage, and mutation coverage. Scenario testing introduces coverage criteria at a higher level of abstraction -- transition coverage, use-case coverage, scenario path coverage -- that map to structural coverage criteria only under specific model assumptions.
|
|
63
|
-
|
|
64
|
-
### 2.2 Origins of the Scenario Concept
|
|
65
|
-
|
|
66
|
-
The scenario as a unit of software specification has two intellectual lineages: the use-case tradition and the scenario-based design tradition.
|
|
67
|
-
|
|
68
|
-
**Use cases** were introduced by Ivar Jacobson in his work on the Objectory method (Jacobson, 1987, formally published in Jacobson et al., 1992, "Object-Oriented Software Engineering: A Use Case Driven Approach"). Jacobson's insight was that systems should be specified from the perspective of actors pursuing goals, and that each use case describes a family of related scenarios (main success scenario plus alternative and exception flows). The OMG's adoption of use cases into the Unified Modeling Language (UML 1.0, 1997; UML 2.5.1, 2017) institutionalized this vocabulary across the software engineering community.
|
|
69
|
-
|
|
70
|
-
**Scenario-based design** emerged from the human-computer interaction community through John Carroll's work at IBM Research and Virginia Tech (Carroll, 1995, "Scenario-Based Design: Envisioning Work and Technology in System Development"). Carroll's framework treats scenarios as a design medium -- concrete stories about system use that drive requirements elicitation, design exploration, and ultimately testing. Rolland et al. (1998) formalized the connection between requirements scenarios and test scenarios, establishing what they called the "scenario principle": that scenarios both describe desired behavior and constitute a natural acceptance test specification.
|
|
71
|
-
|
|
72
|
-
**The specification-testing duality** -- the observation that a complete behavioral specification of a system is, in principle, also a complete test specification -- was articulated most clearly by Meyer (1992) in the context of Design by Contract. Meyer argued that preconditions and postconditions specified as part of the design could be directly instrumented as runtime assertions, collapsing the gap between specification and test. This duality is the theoretical underpinning of BDD's "executable specifications" claim: a Gherkin scenario that specifies system behavior is simultaneously an executable test.
|
|
73
|
-
|
|
74
|
-
### 2.3 Scenario Testing in the Software Engineering Canon
|
|
75
|
-
|
|
76
|
-
The integration of scenario testing into mainstream software engineering methodologies accelerated through three developments. First, the Rational Unified Process (RUP) (Jacobson, Booch, Rumbaugh, 1999) placed use-case-driven development at the center of its lifecycle, explicitly connecting use-case scenarios to system testing. Second, the Agile movement's emphasis on user stories (Cohn, 2004; Beck et al., 2001, the Agile Manifesto) created a lightweight scenario vocabulary aligned with sprint-level acceptance criteria. Third, Dan North's formalization of BDD in 2006 (North, 2006, "Introducing BDD," Better Software Magazine) provided the connective tissue between Agile user stories and executable test scenarios, directly birthing the Gherkin language and Cucumber tool.
|
|
77
|
-
|
|
78
|
-
---
|
|
79
|
-
|
|
80
|
-
## 3. Taxonomy of Approaches
|
|
81
|
-
|
|
82
|
-
The four approaches surveyed in this paper can be classified along four orthogonal dimensions: **specification method** (how scenarios are expressed), **generation strategy** (how test cases are derived from the specification), **selection criteria** (how the test suite is constrained to a tractable subset), and **tool support maturity** (the readiness of available tooling for industrial use). Table 1 presents this classification.
|
|
83
|
-
|
|
84
|
-
**Table 1: Taxonomy of Core Scenario Testing Approaches**
|
|
85
|
-
|
|
86
|
-
| Dimension | 4.1 BDD/Specification | 4.2 Model-Based Generation | 4.3 Combinatorial Testing | 4.4 Explosion & Selection |
|
|
87
|
-
|---|---|---|---|---|
|
|
88
|
-
| **Specification method** | Natural language (Gherkin), semi-formal, structured prose | Formal behavioral models (FSM, UML, Markov chains) | Parameter-value tables, constraint models | Existing scenario suites, risk registers, change history |
|
|
89
|
-
| **Generation strategy** | Manual authoring with step binding | Algorithmic model traversal (path generation, random walk) | Mathematical array construction (greedy, ILP, metaheuristic) | Search-based selection, equivalence partitioning, risk scoring |
|
|
90
|
-
| **Selection criteria** | Business value, acceptance criteria, story-level coverage | Transition coverage, state coverage, path coverage, MC/DC | t-way interaction coverage (t=2: pairwise, t=3: 3-way, etc.) | Risk, change impact, execution cost, historical fault rate |
|
|
91
|
-
| **Tool support maturity** | High (Cucumber, SpecFlow, Behave) | Medium (GraphWalker, Spec Explorer, ModelJUnit) | Medium-High (ACTS, PICT, CAgen, AllPairs) | Medium (commercial tools, research prototypes) |
|
|
92
|
-
| **Stakeholder accessibility** | High (readable by non-engineers) | Low (requires modeling expertise) | Low-Medium (requires parameter abstraction) | Low (primarily for test engineers) |
|
|
93
|
-
| **Formal guarantees** | None (coverage is behavioral, not formal) | Traversal completeness given model | Fault detection bounds given interaction strength | Approximation bounds (NP-hard problem) |
|
|
94
|
-
| **Primary failure modes caught** | Integration, acceptance, regression | Behavioral, state-transition, protocol | Interaction faults, configuration bugs | High-risk regressions, changed code paths |
|
|
95
|
-
|
|
96
|
-
The classification reveals a clear trade-off between stakeholder accessibility and formal rigor: BDD occupies the high-accessibility, low-formality end of the spectrum, while model-based generation provides the strongest formal guarantees but demands significant modeling investment. Combinatorial testing and search-based selection occupy the middle, providing mathematical guarantees within a more constrained problem formulation than full model-based testing.
|
|
97
|
-
|
|
98
|
-
A second cross-cutting dimension is **where in the development lifecycle each approach is most effective**. BDD operates from requirements through acceptance testing; model-based generation from design through system testing; combinatorial testing at the configuration and integration level; and selection/prioritization throughout the lifecycle as a portfolio management concern.
|
|
99
|
-
|
|
100
|
-
---
|
|
101
|
-
|
|
102
|
-
## 4. Analysis
|
|
103
|
-
|
|
104
|
-
### 4.1 Scenario Specification Languages and Frameworks
|
|
105
|
-
|
|
106
|
-
#### 4.1.1 Theory and Mechanism
|
|
107
|
-
|
|
108
|
-
Behavior-Driven Development emerged as a response to a practical failure mode in test-driven development: the tendency for unit tests to drift into implementation-focused micro-tests that neither communicate system behavior to non-technical stakeholders nor serve as reliable acceptance criteria. Dan North identified this problem in 2003 while coaching teams at Thoughtworks, and articulated the solution in his seminal 2006 article: tests should be written in the language of the business domain, structured as descriptions of expected system behavior, and understandable without programming knowledge (North, 2006).
|
|
109
|
-
|
|
110
|
-
The **Given-When-Then** structure that defines BDD scenarios is directly derived from the structure of use-case scenarios (Jacobson, 1992) and from condition-action-result patterns in formal specification. The tripartite structure maps precisely onto the pre/post-condition structure of Hoare logic (Hoare, 1969): Given establishes the precondition, When specifies the state transformation (an action or event), and Then asserts the postcondition. This formal correspondence is underappreciated in the practitioner literature but is essential for understanding why BDD scenarios can function as both documentation and executable tests.
|
|
111
|
-
|
|
112
|
-
**Gherkin** (Wynne and Hellesoy, 2012) is the domain-specific language that implements this structure. It is intentionally minimal: keywords (Given, When, Then, And, But, Background, Scenario, Scenario Outline, Examples, Feature) provide structural scaffolding while the actual step text is free-form natural language. The language is parsed by Cucumber (JVM/Ruby/JavaScript), SpecFlow (.NET), Behave (Python), and Karate (API testing), each of which binds Gherkin step text to executable code through pattern matching (regular expressions or Cucumber Expression syntax).
|
|
113
|
-
|
|
114
|
-
**Living documentation** is the theoretical ideal that BDD scenarios realize: a test suite that simultaneously serves as system documentation, where the documentation is guaranteed up-to-date by the requirement that it passes. Adzic (2011) in "Specification by Example" provides the most thorough treatment of this concept, introducing the term "Specification by Example" (SBE) as a synonym for BDD practiced at the acceptance level. Adzic's case studies demonstrate that teams practicing SBE showed measurable reductions in defect escape rates and requirements misunderstandings.
|
|
115
|
-
|
|
116
|
-
**FIT (Framework for Integrated Tests)** and **FitNesse**, developed by Ward Cunningham and Robert Martin (Cunningham, 2002; Martin and Micah Martin, 2006), represent an alternative scenario specification paradigm where acceptance scenarios are expressed as wiki-formatted tables rather than prose. FIT tables map directly to decision tables (a concept discussed further in Section 4.2), and FitNesse extends this with a wiki-based collaboration environment. While BDD/Gherkin has largely superseded FIT/FitNesse in contemporary practice, FitNesse remains in active use in organizations with large legacy FIT suites, particularly in finance and insurance (Adzic, 2011).
|
|
117
|
-
|
|
118
|
-
**Scenario Outlines and Data Tables** in Gherkin address a specific form of the scenario explosion problem (Section 4.4): when the same behavioral sequence must be verified for multiple input combinations, the Scenario Outline keyword allows parameterized scenarios with an Examples table. This is a limited form of combinatorial testing embedded within the specification framework, though without the formal coverage guarantees of the dedicated combinatorial tools discussed in Section 4.3.
|
|
119
|
-
|
|
120
|
-
**User stories as scenarios** in Agile development represent the requirements-level precursor to test scenarios. Cohn's (2004) "User Stories Applied" defines the classic three-part structure: "As a [role], I want [goal], so that [benefit]." The connection to test scenarios is mediated by acceptance criteria, which North (2006) proposed should themselves be written as scenarios in Given-When-Then format. This closes the loop: user stories carry acceptance scenarios that become, without transformation, executable BDD tests.
|
|
121
|
-
|
|
122
|
-
#### 4.1.2 Literature Evidence
|
|
123
|
-
|
|
124
|
-
The academic literature on BDD has grown substantially since North's original article, though it remains more practitioner-oriented than the model-based testing literature. Key academic contributions include:
|
|
125
|
-
|
|
126
|
-
Solis and Wang (2011, "A Study of the Characteristics of Behaviour Driven Development," IEEE SEAA) conducted one of the first empirical studies of BDD adoption, analyzing open-source BDD projects and finding that the primary driver of BDD adoption was improved communication rather than defect detection, a finding that has been consistently replicated.
|
|
127
|
-
|
|
128
|
-
Binamungu et al. (2016, "Characterising the Quality of Behaviour Driven Development Specifications," XP Conference) developed a taxonomy of Gherkin anti-patterns, classifying problems such as "conjunctive steps" (multiple actions in a single When step), "incidental details" (irrelevant context in Given steps), and "imperative style" (implementation-level instructions rather than behavioral assertions). Their analysis of 66 open-source Cucumber projects found that anti-patterns were present in the majority of step definitions, undermining the readability goals of BDD.
|
|
129
|
-
|
|
130
|
-
Soares et al. (2021, "BDD in Practice: An Analysis of 40 Repositories," IEEE ICST) extended this empirical work, finding that BDD projects exhibited significantly higher test suite size and scenario coverage than non-BDD projects, but also significantly higher test maintenance costs, confirming the theoretical prediction.
|
|
131
|
-
|
|
132
|
-
Carvalho et al. (2018) conducted a systematic literature review of BDD, analyzing 45 primary studies and finding that the most frequently reported benefits were improved communication (reported in 71% of studies) and living documentation (reported in 62%), while the most frequently reported challenges were step definition maintenance (67%) and the learning curve for non-technical stakeholders writing Gherkin (54%).
|
|
133
|
-
|
|
134
|
-
Smart (2023, "BDD in Action, Second Edition," Manning) provides the most comprehensive practitioner treatment, with substantial empirical grounding from industry case studies. Smart documents the "Cucumber anti-pattern": organizations that adopt Gherkin as a test scripting language without the stakeholder collaboration practice end up with verbose, brittle step definitions that provide none of the communication benefits and all of the maintenance costs.
|
|
135
|
-
|
|
136
|
-
#### 4.1.3 Implementations and Benchmarks
|
|
137
|
-
|
|
138
|
-
**Cucumber** (originally RSpec Stories by Aslak Hellesoy and Dan North, renamed Cucumber in 2008) is the reference implementation. The Cucumber project maintains implementations for Java/Kotlin/Scala (Cucumber-JVM), Ruby (the original), JavaScript/TypeScript (Cucumber.js), Python (via Behave as the de-facto standard, though cucumber-python exists), .NET (via SpecFlow, which is an independent reimplementation), and Go (Godog). As of 2024, the Cucumber-JVM GitHub repository reports over 6,000 stars and is downloaded from Maven Central approximately 4 million times per month, making it among the most widely used test frameworks in the JVM ecosystem.
|
|
139
|
-
|
|
140
|
-
**SpecFlow** (.NET), acquired by Tricentis and rebranded as SpecFlow+ in its commercial variant, is the dominant BDD framework on the .NET platform. Tricentis's 2022 State of Testing Report cited SpecFlow as used by approximately 38% of .NET development teams surveyed.
|
|
141
|
-
|
|
142
|
-
**Behave** (Python) is the most widely used Python BDD framework. It follows the Gherkin specification closely and integrates with pytest fixtures. An alternative, pytest-bdd, allows Gherkin scenarios to be run as pytest tests, enabling tighter integration with the Python testing ecosystem.
|
|
143
|
-
|
|
144
|
-
**Karate** (Peter Thomas, 2017) is noteworthy as a BDD framework specifically designed for API testing, where Gherkin steps are interpreted as HTTP request/response operations without requiring separate step definition code. This reduces the implementation burden significantly for API-level scenarios.
|
|
145
|
-
|
|
146
|
-
Benchmarks comparing BDD-specified test suites to other specification styles are sparse in the academic literature, as they are difficult to control for confounding variables. Wiklund et al. (2017, "Technical Debt in Test Automation," IEEE ICSE WAPI) found that BDD-style specifications exhibited higher "test debt" accumulation rates (scenarios becoming unmaintained over time) than record-and-playback tests but lower rates than manual test case documents -- a nuanced empirical finding that complicates simple advocacy for BDD.
|
|
147
|
-
|
|
148
|
-
#### 4.1.4 Strengths and Limitations
|
|
149
|
-
|
|
150
|
-
**Strengths:** The primary validated strength of BDD is improved communication between technical and non-technical stakeholders. When practiced as intended -- with three-amigos sessions (business analyst, developer, tester) co-authoring scenarios before implementation -- BDD has been shown to reduce requirement ambiguity and defect escape rates (Adzic, 2011; Smart, 2023). The executable specification property provides an automatic consistency check: the moment code diverges from the specified behavior, tests fail.
|
|
151
|
-
|
|
152
|
-
**Limitations:** The Achilles heel of BDD is maintenance. Step definitions are code that must be maintained in parallel with the feature code they test. The natural language matching layer (regular expressions or Cucumber Expressions) creates a fragile dependency between the prose of the scenario and the implementation of the step definition. As Binamungu et al. (2016) documented, the "feature file rot" problem -- scenario files that pass because step definitions have silently become too permissive -- is endemic in mature BDD suites. A second structural limitation is the scenario explosion problem discussed in Section 4.4: BDD provides no principled mechanism for managing the combinatorial growth of scenarios in parameterized systems.
|
|
153
|
-
|
|
154
|
-
---
|
|
155
|
-
|
|
156
|
-
### 4.2 Model-Based Scenario Generation
|
|
157
|
-
|
|
158
|
-
#### 4.2.1 Theory and Mechanism
|
|
159
|
-
|
|
160
|
-
Model-based testing (MBT) is the practice of deriving test cases algorithmically from a formal behavioral model of the system under test. The model serves as a precise, machine-readable specification of the expected behavior, and test generation reduces to a graph traversal or coverage problem on that model. MBT for scenario generation is concerned specifically with the question: given a model of system behavior, what sequences of actions (scenarios) should be executed to achieve a specified level of confidence?
|
|
161
|
-
|
|
162
|
-
The theoretical foundations of MBT trace to the automata-theoretic tradition. A **finite state machine (FSM)** is a 5-tuple M = (S, I, O, d, l), where S is a finite set of states, I is an input alphabet, O is an output alphabet, d: S x I -> S is the state transition function, and l: S x I -> O is the output function (Moore, 1956; Mealy, 1955). Test generation from FSMs is the problem of finding a set of input sequences that, when applied to the FSM, achieves a specified coverage criterion.
|
|
163
|
-
|
|
164
|
-
Coverage criteria for FSM-based test generation are hierarchically ordered by subsumption:
|
|
165
|
-
- **State coverage**: every state is visited by at least one test.
|
|
166
|
-
- **Transition coverage (0-switch)**: every transition is traversed by at least one test.
|
|
167
|
-
- **1-switch coverage**: every pair of consecutive transitions is traversed.
|
|
168
|
-
- **n-switch coverage (Chow, 1978)**: every sequence of n+1 consecutive transitions is traversed. Chow's W-method (1978) produces transition tours achieving n-switch coverage and is one of the foundational algorithms of MBT.
|
|
169
|
-
|
|
170
|
-
**Markov chains** extend FSMs by annotating transitions with probabilities, enabling **statistical testing** (Whittaker and Thomason, 1994; Whittaker and Poore, 1993). In **Operational Profile testing** (Musa, 1993), the transition probabilities reflect the actual usage distribution of the system, and test scenarios are generated by sampling from the Markov chain. The theoretical justification is that reliability, defined as the probability of failure-free operation given a usage profile, is best estimated by testing scenarios in proportion to their operational frequency.
|
|
171
|
-
|
|
172
|
-
**UML behavioral diagrams** provide a more expressive family of models than plain FSMs:
|
|
173
|
-
- **Activity diagrams** represent workflow-level scenario flows with fork/join constructs for concurrent activities, decision nodes, and object flows. Test generation from activity diagrams has been studied by Briand et al. (2002) and Gnesi et al. (2004).
|
|
174
|
-
- **Sequence diagrams** and **collaboration diagrams** represent interaction scenarios between objects, specifying the order of message exchanges. Storrle (2004) and Apfelbaum and Doyle (1997) examined scenario generation from sequence diagrams.
|
|
175
|
-
- **Statecharts** (Harel, 1987) extend FSMs with hierarchical states and concurrent regions, dramatically increasing expressive power while complicating test generation. Hartmann et al. (2000, "UML-Based Integration Testing") developed one of the first systematic approaches to generating integration test scenarios from UML statecharts.
|
|
176
|
-
|
|
177
|
-
**Extended Finite State Machines (EFSMs)** (Cheng and Krishnakumar, 1993) add data variables and guards to transitions, enabling the specification of data-dependent behavior. Test generation from EFSMs is more powerful than from plain FSMs but computationally harder -- path feasibility is undecidable in general, requiring heuristic or bounded approaches.
|
|
178
|
-
|
|
179
|
-
**Decision tables** represent the relationship between condition combinations and actions in tabular form. Ammann and Offutt (2008) characterize decision table coverage criteria in terms of the number of distinct condition combinations exercised, connecting to the combinatorial testing framework of Section 4.3.
|
|
180
|
-
|
|
181
|
-
#### 4.2.2 Literature Evidence
|
|
182
|
-
|
|
183
|
-
The foundational taxonomy for model-based testing is Utting, Pretschner, and Legeard (2012, "A Taxonomy of Model-Based Testing Approaches," Software Testing, Verification and Reliability), which classifies MBT approaches along five dimensions: model subject, model paradigm, test generation technology, test execution technology, and coverage criterion.
|
|
184
|
-
|
|
185
|
-
Tretmans (1996, 2008) developed the theory of **conformance testing** from labeled transition systems (LTS), which formalizes the question of when a system implementation "conforms to" a specification. The ioco (input-output conformance) relation of Tretmans (2008) is the semantic foundation for conformance-based MBT tools.
|
|
186
|
-
|
|
187
|
-
Broy, Jonsson, Katoen, Leucker, and Pretschner (2005, "Model-Based Testing of Reactive Systems," Springer LNCS 3472) provide a comprehensive treatment of reactive system testing from behavioral models. This volume remains the definitive reference for MBT theory.
|
|
188
|
-
|
|
189
|
-
Empirical evidence for MBT effectiveness comes from several industrial case studies. Pretschner et al. (2005, "One Evaluation of Model-Based Testing and Its Automation," ICSE) reported on MBT application at ABB Research, finding that model-based tests detected defects that had escaped manual review. Dalal et al. (1999, "Model-Based Testing in Practice," ICSE) described MBT application at Lucent Technologies, with a reported 40% reduction in testing effort and improved defect detection rates.
|
|
190
|
-
|
|
191
|
-
#### 4.2.3 Implementations and Benchmarks
|
|
192
|
-
|
|
193
|
-
**GraphWalker** (Altom, open-source) is the most widely used open-source MBT tool for graph-model-based generation. It accepts models as directed graphs and generates test sequences according to pluggable stop criteria: edge coverage, vertex coverage, random walk with edge coverage target. GraphWalker has been applied in case studies at Spotify (de Oliveira Neto et al., 2019).
|
|
194
|
-
|
|
195
|
-
**Spec Explorer** (Microsoft Research, Veanes et al., 2008, 2010) implements MBT using Abstract State Machines (ASMs). Grieskamp et al. (2011) reported that Spec Explorer-generated tests detected over 200 unique bugs in the Windows protocol stack that had not been found by prior testing.
|
|
196
|
-
|
|
197
|
-
**ModelJUnit** (Utting, 2007) is an academic MBT framework for Java allowing test models to be written as annotated Java classes.
|
|
198
|
-
|
|
199
|
-
**CONFORMIQ** (commercial) is the leading commercial MBT tool, supporting UML statecharts and sequence diagrams with integration into enterprise testing workflows.
|
|
200
|
-
|
|
201
|
-
#### 4.2.4 Strengths and Limitations
|
|
202
|
-
|
|
203
|
-
**Strengths:** MBT provides the strongest formal guarantees of any scenario generation technique: given a model that accurately captures the system's specification, coverage criteria provide mathematically defined bounds on the scenarios that will be exercised. For protocol and interface testing, where the specification is naturally expressed as a state machine, MBT is demonstrably superior to manual test design (Tretmans, 2008; Pretschner et al., 2005).
|
|
204
|
-
|
|
205
|
-
**Limitations:** The primary limitation is the **model maintenance problem**: a test suite is only as good as the model it was generated from, and models of real industrial systems are expensive to create, require specialized expertise, and drift from the implementation as the system evolves. The "oracle problem" (Barr et al., 2014) is particularly acute in MBT. The **state explosion problem** (Clarke et al., 1999) limits scalability: for systems with large data spaces, the number of states in an EFSM is exponential in the number of data variables.
|
|
206
|
-
|
|
207
|
-
---
|
|
208
|
-
|
|
209
|
-
### 4.3 Combinatorial and Interaction Testing
|
|
210
|
-
|
|
211
|
-
#### 4.3.1 Theory and Mechanism
|
|
212
|
-
|
|
213
|
-
Combinatorial testing is grounded in the empirical observation that the vast majority of software failures are triggered by the interaction of a small number of input parameters. This observation, which Kuhn, Wallace, and Gallo (2004) established through analysis of real bug databases, motivates the key insight: rather than testing all possible combinations of inputs (exhaustive testing), it is sufficient -- with high probability -- to test all combinations of some small number t of parameters (t-way testing).
|
|
214
|
-
|
|
215
|
-
**Formal definition:** Let a system under test have k parameters P1, P2, ..., Pk, each taking values from a finite domain Di (|Di| = vi). A **t-way covering array** CA(N; t, k, v) is an N x k array over a v-symbol alphabet such that every N x t subarray contains all v^t possible t-tuples at least once. When the parameters have different domain sizes, the generalization is a **mixed covering array** MCA(N; t, k, (v1, v2, ..., vk)).
|
|
216
|
-
|
|
217
|
-
The **fault detection claim** of combinatorial testing (Kuhn, Wallace, Gallo, 2004; NIST IR 7007, 2003) is that a t-way covering array is guaranteed to detect any fault triggered by an interaction of at most t parameters, regardless of how many parameters the system has. The empirical evidence:
|
|
218
|
-
|
|
219
|
-
- Kuhn, Wallace, and Gallo (2004, IEEE TSE) analyzed bug databases from NASA, Linux, and commercial products and found that 67% of bugs were triggered by a single parameter value, and 97% required the interaction of at most 4 parameters.
|
|
220
|
-
- Kuhn, Kacker, and Lei (2013, CRC Press) extended this analysis: t=2 covered 50-75% of faults, t=3 covered 75-90%, and t=6 covered virtually all faults.
|
|
221
|
-
|
|
222
|
-
**Mathematical foundations:**
|
|
223
|
-
|
|
224
|
-
**Orthogonal arrays (OAs)**, introduced by Rao (1947), are a special case of covering arrays where exact balance is required: every t-tuple appears exactly N/v^t times.
|
|
225
|
-
|
|
226
|
-
**Latin squares** are OA(v^2; 2, v, v): v x v arrays where every symbol appears exactly once in each row and column. The problem of finding the maximum number of mutually orthogonal Latin squares (MOLS) of order n is a deep unsolved problem in combinatorics.
|
|
227
|
-
|
|
228
|
-
**Covering array construction algorithms** address the computational problem of finding minimum-size covering arrays (NP-hard in general, Colbourn, 2004):
|
|
229
|
-
|
|
230
|
-
- **In-parameter-order (IPO)** algorithm (Lei and Tai, 1998): greedy horizontal extension building the array column by column.
|
|
231
|
-
- **Simulated annealing** (Cohen et al., 1997, AETG algorithm): the first widely deployed algorithmic approach.
|
|
232
|
-
- **Integer Linear Programming (ILP)** approaches (Grindal, Offutt, Andler, 2005): produce optimal arrays for small parameter counts.
|
|
233
|
-
|
|
234
|
-
**Constraint handling** is critical. Real systems have constraints between parameters. The **forbidden tuple** model (Cohen et al., 2003) and the **conditional constraint** model (Czerwonka, 2006) extend covering arrays to exclude infeasible combinations.
|
|
235
|
-
|
|
236
|
-
#### 4.3.2 Literature Evidence
|
|
237
|
-
|
|
238
|
-
- **Mandl (1985)** applied OAs to software testing at AT&T Bell Labs, establishing the first industrial application.
|
|
239
|
-
- **Cohen, Dalal, Fredman, and Patton (1997, IEEE TSE)** introduced AETG, demonstrating that pairwise testing reduces test suite sizes by 75-95% compared to exhaustive testing.
|
|
240
|
-
- **Nie and Leung (2011, ACM Computing Surveys)** is the definitive survey of the field (36 pages).
|
|
241
|
-
- **Grindal, Offutt, and Andler (2005, STVR)** surveyed combination testing strategies from a software engineering perspective.
|
|
242
|
-
- **Yilmaz, Cohen, and Porter (2006, IEEE TSE)** applied covering arrays to Linux kernel configurations.
|
|
243
|
-
|
|
244
|
-
#### 4.3.3 Implementations and Benchmarks
|
|
245
|
-
|
|
246
|
-
**ACTS (Automated Combinatorial Testing for Software)** -- NIST's reference implementation; supports 2-6 way covering arrays with mixed parameters and constraints.
|
|
247
|
-
|
|
248
|
-
**PICT (Pairwise Independent Combinatorial Testing)** -- Microsoft's tool; open-sourced; Czerwonka (2006) reports that pairwise testing detected 80-100% of defects found by exhaustive testing in several case studies.
|
|
249
|
-
|
|
250
|
-
**CAgen** -- Academic tool for covering array construction algorithm research.
|
|
251
|
-
|
|
252
|
-
**AllPairs** -- Lightweight alternative implementations.
|
|
253
|
-
|
|
254
|
-
Hartman (2006) found that AETG-based tools consistently produced arrays within 10-20% of theoretical minimum sizes.
|
|
255
|
-
|
|
256
|
-
#### 4.3.4 Strengths and Limitations
|
|
257
|
-
|
|
258
|
-
**Strengths:** Most economical path to high fault-detection confidence for parameter-rich systems. Mathematical guarantee independent of implementation. Fully automated generation.
|
|
259
|
-
|
|
260
|
-
**Limitations:** Addresses only parameter interaction faults; insensitive to sequential, temporal, or state-dependent faults. Constraint handling can become complex. Parameter model construction requires engineering judgment.
|
|
261
|
-
|
|
262
|
-
---
|
|
263
|
-
|
|
264
|
-
### 4.4 Scenario Explosion and Selection
|
|
265
|
-
|
|
266
|
-
#### 4.4.1 Theory and Mechanism
|
|
267
|
-
|
|
268
|
-
The scenario explosion problem is the manifestation of the fundamental infeasibility of exhaustive testing. For a system with n use cases, each with k alternative flows, the number of distinct scenario paths is O(k^n) or worse. The problem of scenario selection is a multi-objective optimization problem.
|
|
269
|
-
|
|
270
|
-
**Equivalence partitioning (EP)** applied to scenarios (Myers, 1979; Ammann and Offutt, 2008): the scenario space is partitioned into equivalence classes such that all scenarios within a class exercise the same program behavior. Testing one representative per class is theoretically sufficient.
|
|
271
|
-
|
|
272
|
-
**Boundary value analysis (BVA)** (Myers, 1979): select scenarios at boundaries between equivalence classes, exploiting the observation that faults cluster at boundaries.
|
|
273
|
-
|
|
274
|
-
**Risk-based testing (RBT)** (Amland, 1999; Van Veenendaal, 2012): select scenarios in proportion to risk (likelihood x impact). Felderer and Ramler (2014) showed risk-based prioritization consistently outperforms random prioritization.
|
|
275
|
-
|
|
276
|
-
**Regression test selection (RTS)** (Rothermel and Harrold, 1994, 1996): given a modified version, which existing scenarios need re-execution? Uses control-flow and program dependence graphs for provably safe selection.
|
|
277
|
-
|
|
278
|
-
**Test case prioritization (TCP)** (Rothermel et al., 2001): order scenarios to maximize fault detection rate. The **Average Percentage of Faults Detected (APFD)** metric: for n tests detecting m faults, APFD = 1 - (sum of first-fault-detection positions) / (n * m) + 1/(2n).
|
|
279
|
-
|
|
280
|
-
**Search-based test case selection (SBSE)** (Harman and Jones, 2001; Harman, 2007): applies metaheuristic optimization (genetic algorithms, simulated annealing) to scenario selection. Yoo and Harman (2012) survey over 100 papers and find genetic algorithms consistently outperform greedy algorithms in multi-objective formulations.
|
|
281
|
-
|
|
282
|
-
**History-based selection** (Elbaum et al., 2002; Kim and Porter, 2002): uses fault history to predict which scenarios will detect faults. Elbaum et al. found 30-50% improvements in fault detection rate.
|
|
283
|
-
|
|
284
|
-
#### 4.4.2 Literature Evidence
|
|
285
|
-
|
|
286
|
-
- **Rothermel and Harrold (1996, IEEE TSE)**: foundational safe RTS framework; 40-75% test suite reduction with zero fault miss rate.
|
|
287
|
-
- **Rothermel et al. (2001, IEEE TSE)**: introduced APFD metric; coverage-based prioritization consistently outperforms random.
|
|
288
|
-
- **Elbaum et al. (2002, IEEE TSE)**: extended empirical study establishing history-based prioritization baseline.
|
|
289
|
-
- **Yoo and Harman (2012, STVR 22(2))**: definitive survey of 210 papers; multi-objective approaches outperform single-objective.
|
|
290
|
-
- **Harman and McMinn (2010, IEEE TSE)**: landscape smoothness as key predictor of SBSE performance.
|
|
291
|
-
- **Felderer et al. (2014, EASE)**: risk-based prioritization detected 80% of high-priority faults in first 30% of execution.
|
|
292
|
-
|
|
293
|
-
#### 4.4.3 Implementations and Benchmarks
|
|
294
|
-
|
|
295
|
-
**STARTS** (Gligoric et al., 2015, ISSTA): academic RTS for Java using class-level dependency analysis; 80% average test suite reduction.
|
|
296
|
-
|
|
297
|
-
**Ekstazi** (Gligoric et al., 2014, FSE): file-level dependency-based RTS for Maven/Gradle; 32% average execution time reduction.
|
|
298
|
-
|
|
299
|
-
**Launchable** (commercial): ML-based predictive test selection for CI/CD.
|
|
300
|
-
|
|
301
|
-
**ATM tools** (Jira Xray, Zephyr, TestRail): risk-based prioritization through manual risk scoring.
|
|
302
|
-
|
|
303
|
-
**SBST benchmarks** at ICSE (annual since 2013) provide standardized evaluation environments.
|
|
304
|
-
|
|
305
|
-
#### 4.4.4 Strengths and Limitations
|
|
306
|
-
|
|
307
|
-
**Strengths:** Empirically validated techniques with demonstrated industrial benefit. APFD provides principled comparison metric. Search-based approaches provide strong multi-objective performance.
|
|
308
|
-
|
|
309
|
-
**Limitations:** Scenario selection (test suite minimization) is NP-hard; all approaches are approximations. The "adequate test suite" problem is formally intractable (Hamlet, 1990). Best-performing techniques require significant instrumentation (fault tracking, build system integration, coverage instrumentation).
|
|
310
|
-
|
|
311
|
-
---
|
|
312
|
-
|
|
313
|
-
## 5. Comparative Synthesis
|
|
314
|
-
|
|
315
|
-
**Table 2: Comparative Analysis of Core Scenario Testing Approaches**
|
|
316
|
-
|
|
317
|
-
| Dimension | BDD / Specification (4.1) | Model-Based Generation (4.2) | Combinatorial Testing (4.3) | Selection & Prioritization (4.4) |
|
|
318
|
-
|---|---|---|---|---|
|
|
319
|
-
| **Primary purpose** | Specify and communicate scenarios | Generate scenarios from models | Generate interaction-covering tests | Select high-value subset of scenarios |
|
|
320
|
-
| **Expressiveness** | High (prose-natural language) | Medium (formal model language) | Low (parameter-value tables) | N/A (selects from existing scenarios) |
|
|
321
|
-
| **Automation level** | Low (authoring) / High (execution) | High (generation + execution) | High (generation) / Medium (execution) | Medium-High (selection) / High (execution) |
|
|
322
|
-
| **Scalability** | Poor for large systems (explosion) | Limited by state explosion | Excellent (sublinear in parameters) | Good (polynomial algorithms available) |
|
|
323
|
-
| **Stakeholder accessibility** | Excellent | Poor | Poor | Not applicable |
|
|
324
|
-
| **Maintenance cost** | High (step definitions, feature files) | High (model maintenance) | Low-Medium (parameter model) | Low (parasitic on existing suite) |
|
|
325
|
-
| **Formal fault coverage guarantee** | None | Transition/path completeness (given model) | t-way interaction completeness | Approximation (NP-hard exact problem) |
|
|
326
|
-
| **Empirical fault detection** | Qualitative evidence | Strong (protocol/state faults) | Strong (interaction/configuration faults) | Strong (compared to random ordering) |
|
|
327
|
-
| **Tool maturity** | High (industry standard) | Medium (specialized tools) | Medium-High (ACTS, PICT) | Medium (research + commercial) |
|
|
328
|
-
| **Entry cost** | Medium (BDD culture change) | High (modeling expertise) | Medium (parameter abstraction) | Medium (instrumentation) |
|
|
329
|
-
| **Best-fit domain** | Feature testing, acceptance testing | Protocol, UI workflow, embedded | Configuration, API, highly parameterized | CI/CD regression, large test suites |
|
|
330
|
-
| **Worst-fit domain** | Configuration spaces, protocol | Exploratory, early-stage | Sequential logic, stateful behavior | Small suites, novel systems |
|
|
331
|
-
|
|
332
|
-
Three key integration opportunities:
|
|
333
|
-
|
|
334
|
-
1. **BDD + Combinatorial:** Scenario Outlines in Gherkin represent an implicit form of combinatorial testing. Fusing ACTS/PICT's parameter models with Gherkin's step framework would allow combinatorial coverage guarantees in readable scenarios.
|
|
335
|
-
|
|
336
|
-
2. **MBT + Selection:** Model-based generation produces large scenario sets; applying risk-based or history-based selection combines model completeness with economic feasibility. Cartaxo et al. (2011) explored this with promising results.
|
|
337
|
-
|
|
338
|
-
3. **BDD + MBT:** The "executable model" pattern, where a BDD feature file implicitly defines a state machine, can be exploited to generate MBT scenarios from BDD specifications. Aichernig et al. (2015) demonstrated this for reactive systems.
|
|
339
|
-
|
|
340
|
-
---
|
|
341
|
-
|
|
342
|
-
## 6. Open Problems and Gaps
|
|
343
|
-
|
|
344
|
-
### 6.1 Scenario Specification Drift
|
|
345
|
-
|
|
346
|
-
Scenario drift -- the divergence of scenario specifications from actual system behavior over time -- is the most practically significant open problem. In BDD projects, feature files describing behavior that no longer matches the implementation are pervasive (Smart, 2023; Binamungu et al., 2016). The "living documentation" guarantee holds locally for exercised behavior but does not prevent scenarios from becoming too permissive or failing to cover new behavior. A formal solution would require semantic equivalence checking (undecidable in general); the research opportunity is in practical, sound approximations.
|
|
347
|
-
|
|
348
|
-
### 6.2 Model-to-Code Traceability
|
|
349
|
-
|
|
350
|
-
Maintaining **bidirectional traceability** between model elements and code artifacts is unsolved. Static analysis can extract partial models from code (Amalfitano et al., 2012; Mesbah and van Deursen, 2009), but resulting models are implementation-level rather than specification-level.
|
|
351
|
-
|
|
352
|
-
### 6.3 Optimal Interaction Strength Selection
|
|
353
|
-
|
|
354
|
-
The choice of t in t-way testing is theoretically underdetermined. No general method exists for determining the right t without prior fault data. Research directions include static analysis to bound interaction depth (Ghandehari et al., 2012) and adaptive strategies.
|
|
355
|
-
|
|
356
|
-
### 6.4 Integration of Specification and Generation Approaches
|
|
357
|
-
|
|
358
|
-
The four approaches are largely practiced in isolation. A unified framework allowing specifications (BDD), generation (MBT, combinatorial), and selection to be applied in coordination does not exist.
|
|
359
|
-
|
|
360
|
-
### 6.5 Metrics for Scenario Quality Beyond Coverage
|
|
361
|
-
|
|
362
|
-
Coverage measures extent of scenario space exercised but not quality of individual scenarios. Research on **scenario mutation testing** (applying mutation operators to specifications) has begun to address this, but no standardized operator suite or tooling exists.
|
|
363
|
-
|
|
364
|
-
### 6.6 Scalability of Search-Based Selection to Continuous Integration
|
|
365
|
-
|
|
366
|
-
Computing APFD-optimal orderings is expensive for large suites in time-constrained CI pipelines. Recent work on lightweight selection (Luo et al., 2019; Ziftci and Reardon, 2017 at Google) has attempted to address this, but the theoretical relationship to safe RTS guarantees is not established.
|
|
367
|
-
|
|
368
|
-
---
|
|
369
|
-
|
|
370
|
-
## 7. Conclusion
|
|
371
|
-
|
|
372
|
-
Scenario testing occupies a foundational role in software quality assurance: it is the bridge between the abstract properties that specification languages and formal methods can capture and the concrete, observable behavior that matters to users and system stakeholders.
|
|
373
|
-
|
|
374
|
-
**Scenario specification languages and frameworks** (Section 4.1) provide the vocabulary and tooling for expressing system behavior in human-readable, executable form. BDD and Gherkin have become the de-facto standard, supported by a mature ecosystem of tools and a documented promise of living documentation and improved stakeholder communication.
|
|
375
|
-
|
|
376
|
-
**Model-based scenario generation** (Section 4.2) provides the strongest formal guarantees when behavioral models of sufficient fidelity can be constructed. Tools such as GraphWalker and Spec Explorer have demonstrated substantial industrial value in protocol and interface testing domains.
|
|
377
|
-
|
|
378
|
-
**Combinatorial and interaction testing** (Section 4.3) offers a mathematically grounded approach to the parameter interaction problem particularly powerful for configuration-sensitive systems. The empirical evidence that most faults require at most 4-way interactions provides a practical basis for t-way covering arrays.
|
|
379
|
-
|
|
380
|
-
**Scenario explosion and selection** (Section 4.4) provides the portfolio management layer: evidence-based techniques from equivalence partitioning through search-based optimization allow practitioners to invest testing effort where it has the highest probability of detecting faults.
|
|
381
|
-
|
|
382
|
-
The appropriate choice of technique depends on the nature of the system under test, the composition of the development team, the stage of the development lifecycle, and the resources available for testing infrastructure. The four foundational approaches are theoretically well-grounded, empirically validated to meaningful degrees, and practically available through mature or maturing tooling -- making the decision of how to combine them an engineering judgment rather than a scientific question.
|
|
383
|
-
|
|
384
|
-
---
|
|
385
|
-
|
|
386
|
-
## References
|
|
387
|
-
|
|
388
|
-
1. Adzic, G. (2011). *Specification by Example*. Manning Publications.
|
|
389
|
-
2. Amland, S. (1999). Risk Based Testing. *Journal of Systems and Software*, 53(3), 287-295.
|
|
390
|
-
3. Ammann, P., and Offutt, J. (2008, 2017). *Introduction to Software Testing*. Cambridge University Press.
|
|
391
|
-
4. Apfelbaum, L., and Doyle, J. (1997). Model Based Testing. *Quality Week Conference*.
|
|
392
|
-
5. Barr, E. T., et al. (2014). The Oracle Problem in Software Testing. *IEEE TSE*, 41(5), 507-525.
|
|
393
|
-
6. Beck, K., et al. (2001). *Manifesto for Agile Software Development*.
|
|
394
|
-
7. Beizer, B. (1990). *Software Testing Techniques* (2nd ed.). Van Nostrand Reinhold.
|
|
395
|
-
8. Binamungu, L. P., et al. (2016). Characterising the Quality of BDD Specifications. *XP 2016*.
|
|
396
|
-
9. Briand, L. C., et al. (2002). Toward the Reverse Engineering of UML Sequence Diagrams. *IEEE TSE*, 32(9).
|
|
397
|
-
10. Broy, M., et al. (2005). *Model-Based Testing of Reactive Systems*. Springer LNCS 3472.
|
|
398
|
-
11. Carroll, J. M. (1995). *Scenario-Based Design*. John Wiley and Sons.
|
|
399
|
-
12. Carroll, J. M. (2000). *Making Use: Scenario-Based Design of Human-Computer Interactions*. MIT Press.
|
|
400
|
-
13. Cartaxo, E. G., et al. (2011). Test Case Selection via Similarity in MBT. *SEKE 2011*.
|
|
401
|
-
14. Carvalho, L. V., et al. (2018). Investigating BDD: A Systematic Literature Review. *EASE 2018*.
|
|
402
|
-
15. Cheng, K.-T., and Krishnakumar, A. S. (1993). Automatic Functional Test Generation. *DAC*, 86-91.
|
|
403
|
-
16. Chow, T. S. (1978). Testing Software Design Modeled by FSMs. *IEEE TSE*, 4(3), 178-187.
|
|
404
|
-
17. Clarke, E., et al. (1999). *Model Checking*. MIT Press.
|
|
405
|
-
18. Cohen, D. M., et al. (1997). The AETG System. *IEEE TSE*, 23(7), 437-444.
|
|
406
|
-
19. Cohn, M. (2004). *User Stories Applied*. Addison-Wesley.
|
|
407
|
-
20. Colbourn, C. J. (2004). Combinatorial Aspects of Covering Arrays. *Le Matematiche*, 59(1-2).
|
|
408
|
-
21. Cunningham, W. (2002). Framework for Integrated Tests (FIT).
|
|
409
|
-
22. Czerwonka, J. (2006). Pairwise Testing in the Real World. *PNSQC 2006*.
|
|
410
|
-
23. Dalal, S. R., et al. (1999). Model-Based Testing in Practice. *ICSE 1999*, 285-294.
|
|
411
|
-
24. Dijkstra, E. W. (1972). The Humble Programmer. *CACM*, 15(10), 859-866.
|
|
412
|
-
25. Elbaum, S., et al. (2002). Test Case Prioritization: A Family of Empirical Studies. *IEEE TSE*, 28(2), 159-182.
|
|
413
|
-
26. Felderer, M., and Ramler, R. (2014). Risk Orientation in Software Testing. *Software Quality Journal*, 24(3).
|
|
414
|
-
27. Ghandehari, L. S. J., et al. (2012). Identifying Failure-Inducing Combinations. *ICST 2012*.
|
|
415
|
-
28. Gligoric, M., et al. (2015). Practical Regression Test Selection. *ISSTA 2015*, 211-222.
|
|
416
|
-
29. Grieskamp, W., et al. (2011). Generating FSMs from ASM Models. *ENTCS*, 82(6).
|
|
417
|
-
30. Grindal, M., et al. (2005). Combination Testing Strategies: A Survey. *STVR*, 15(3), 167-199.
|
|
418
|
-
31. Hamlet, D. G. (1990). Probable Correctness Theory. *IPL*, 25(1), 17-25.
|
|
419
|
-
32. Harel, D. (1987). Statecharts. *Science of Computer Programming*, 8(3), 231-274.
|
|
420
|
-
33. Harman, M., and Jones, B. F. (2001). Search-Based Software Engineering. *IST*, 43(14), 833-839.
|
|
421
|
-
34. Harman, M., and McMinn, P. (2010). A Theoretical and Empirical Study of Search-Based Testing. *IEEE TSE*, 36(2), 226-247.
|
|
422
|
-
35. Howden, W. E. (1976). Reliability of Path Analysis Testing. *IEEE TSE*, 2(3), 208-215.
|
|
423
|
-
36. Jacobson, I., et al. (1992). *Object-Oriented Software Engineering: A Use Case Driven Approach*. Addison-Wesley.
|
|
424
|
-
37. Jacobson, I., et al. (1999). *The Unified Software Development Process*. Addison-Wesley.
|
|
425
|
-
38. Kim, J. M., and Porter, A. (2002). A History-Based Test Prioritization Technique. *ICSE 2002*, 119-129.
|
|
426
|
-
39. Kuhn, D. R., et al. (2004). Software Fault Interactions. *IEEE TSE*, 30(6), 418-421.
|
|
427
|
-
40. Kuhn, D. R., et al. (2013). *Introduction to Combinatorial Testing*. CRC Press.
|
|
428
|
-
41. Lei, Y., and Tai, K.-C. (1998). In-Parameter-Order. *HASE 1998*, 254-261.
|
|
429
|
-
42. Mandl, R. (1985). Orthogonal Latin Squares. *CACM*, 28(10), 1054-1058.
|
|
430
|
-
43. Meyer, B. (1992). Applying "Design by Contract." *IEEE Computer*, 25(10), 40-51.
|
|
431
|
-
44. Musa, J. D. (1993). Operational Profiles. *IEEE Software*, 10(2), 14-32.
|
|
432
|
-
45. Myers, G. J. (1979). *The Art of Software Testing*. John Wiley and Sons.
|
|
433
|
-
46. Nie, C., and Leung, H. (2011). A Survey of Combinatorial Testing. *ACM Computing Surveys*, 43(2).
|
|
434
|
-
47. North, D. (2006). Introducing BDD. *Better Software Magazine*.
|
|
435
|
-
48. de Oliveira Neto, F. G., et al. (2019). A Research Agenda for Test Automation at Spotify. *ICST 2019 Workshops*.
|
|
436
|
-
49. Pretschner, A., et al. (2005). One Evaluation of Model-Based Testing. *ICSE 2005*, 392-401.
|
|
437
|
-
50. Rao, C. R. (1947). Factorial Experiments. *JRSS*, 9, 128-139.
|
|
438
|
-
51. Rolland, C., et al. (1998). A Proposal for a Scenario Classification Framework. *RE Journal*, 3(1).
|
|
439
|
-
52. Rothermel, G., and Harrold, M. J. (1996). Analyzing Regression Test Selection Techniques. *IEEE TSE*, 22(8), 529-551.
|
|
440
|
-
53. Rothermel, G., et al. (2001). Prioritizing Test Cases for Regression Testing. *IEEE TSE*, 27(10), 929-948.
|
|
441
|
-
54. Smart, J. F. (2023). *BDD in Action* (2nd ed.). Manning Publications.
|
|
442
|
-
55. Soares, G., et al. (2021). BDD in Practice. *ICST 2021*.
|
|
443
|
-
56. Solis, C., and Wang, X. (2011). A Study of BDD Characteristics. *SEAA 2011*, 383-387.
|
|
444
|
-
57. Tassey, G. (2002). *The Economic Impacts of Inadequate Infrastructure for Software Testing*. NIST Planning Report 02-3.
|
|
445
|
-
58. Tretmans, G. J. (2008). Model-Based Testing with LTS. *Springer LNCS 4949*, 1-38.
|
|
446
|
-
59. Utting, M., et al. (2012). A Taxonomy of Model-Based Testing. *STVR*, 22(5), 297-312.
|
|
447
|
-
60. Veanes, M., et al. (2008). Online Testing with Model Programs. *ESEC/FSE 2008*, 273-283.
|
|
448
|
-
61. Whittaker, J. A., and Thomason, M. G. (1994). A Markov Chain Model for Statistical Software Testing. *IEEE TSE*, 20(10), 812-824.
|
|
449
|
-
62. Wiklund, K., et al. (2017). Technical Debt in Test Automation. *ICST 2017*.
|
|
450
|
-
63. Wynne, M., and Hellesoy, A. (2012). *The Cucumber Book*. Pragmatic Bookshelf.
|
|
451
|
-
64. Yilmaz, C., et al. (2006). Covering Arrays for Efficient Fault Characterization. *IEEE TSE*, 32(1), 20-34.
|
|
452
|
-
65. Yoo, S., and Harman, M. (2012). Regression Testing Minimization, Selection and Prioritization: A Survey. *STVR*, 22(2), 67-120.
|
|
453
|
-
|
|
454
|
-
---
|
|
455
|
-
|
|
456
|
-
## Practitioner Resources
|
|
457
|
-
|
|
458
|
-
### BDD and Scenario Specification
|
|
459
|
-
|
|
460
|
-
**Tools**
|
|
461
|
-
- **Cucumber** (multi-language) -- The reference BDD framework; JVM, JavaScript, Ruby, Go.
|
|
462
|
-
- **SpecFlow** (.NET/C#) -- The leading BDD framework for .NET; acquired by Tricentis.
|
|
463
|
-
- **Behave** (Python) -- Most widely adopted Python BDD framework; integrates with pytest.
|
|
464
|
-
- **Karate** (API + UI testing) -- BDD framework eliminating separate step definitions for API testing.
|
|
465
|
-
- **FitNesse** -- Wiki-based acceptance testing with tabular scenario specification.
|
|
466
|
-
|
|
467
|
-
**Key Texts**
|
|
468
|
-
- Smart, J. F. (2023). *BDD in Action* (2nd ed.). Manning. -- Most current comprehensive treatment.
|
|
469
|
-
- Wynne, M., and Hellesoy, A. (2012). *The Cucumber Book*. Pragmatic Bookshelf. -- Canonical reference.
|
|
470
|
-
- Adzic, G. (2011). *Specification by Example*. Manning. -- Theoretical and organizational framework.
|
|
471
|
-
|
|
472
|
-
### Model-Based Testing
|
|
473
|
-
|
|
474
|
-
**Tools**
|
|
475
|
-
- **GraphWalker** -- Open-source graph-model-based test generation; JUnit/TestNG integration.
|
|
476
|
-
- **Spec Explorer** -- ASM-based MBT for .NET/COM systems (Microsoft Research).
|
|
477
|
-
- **CONFORMIQ** (commercial) -- Enterprise-grade MBT for UML/statechart models.
|
|
478
|
-
- **ModelJUnit** -- Academic Java MBT framework for research and teaching.
|
|
479
|
-
- **UPPAAL** -- Timed automata model checker with test generation extension.
|
|
480
|
-
|
|
481
|
-
**Key Texts**
|
|
482
|
-
- Utting, M., and Legeard, B. (2006). *Practical Model-Based Testing*. Elsevier.
|
|
483
|
-
- Broy et al. (2005). *Model-Based Testing of Reactive Systems*. Springer LNCS 3472.
|
|
484
|
-
|
|
485
|
-
### Combinatorial Testing
|
|
486
|
-
|
|
487
|
-
**Tools**
|
|
488
|
-
- **ACTS** -- NIST's reference implementation; supports 2-6 way covering arrays.
|
|
489
|
-
- **PICT** -- Microsoft's pairwise tool; open-sourced.
|
|
490
|
-
- **CAgen** -- Academic covering array generator.
|
|
491
|
-
- **AllPairs** (Python) -- Lightweight pairwise tool.
|
|
492
|
-
|
|
493
|
-
**Key Texts**
|
|
494
|
-
- Kuhn, D. R., et al. (2013). *Introduction to Combinatorial Testing*. CRC Press.
|
|
495
|
-
- Nie, C., and Leung, H. (2011). "A Survey of Combinatorial Testing." *ACM Computing Surveys*, 43(2).
|
|
496
|
-
|
|
497
|
-
### Scenario Selection and Prioritization
|
|
498
|
-
|
|
499
|
-
**Tools**
|
|
500
|
-
- **Ekstazi** -- File-level dependency-based RTS for Maven/Gradle.
|
|
501
|
-
- **STARTS** -- Static RTS for Maven Java projects.
|
|
502
|
-
- **Launchable** (commercial) -- ML-based predictive test selection for CI/CD.
|
|
503
|
-
- **Xray for Jira** (commercial) -- Risk-based test prioritization.
|
|
504
|
-
|
|
505
|
-
**Key Texts**
|
|
506
|
-
- Yoo, S., and Harman, M. (2012). "Regression Testing Survey." *STVR*, 22(2).
|
|
507
|
-
- Ammann, P., and Offutt, J. (2017). *Introduction to Software Testing* (2nd ed.). Cambridge University Press.
|