langwatch-scenario 0.2.0__tar.gz → 0.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (40)
  1. langwatch_scenario-0.4.0/PKG-INFO +363 -0
  2. langwatch_scenario-0.4.0/README.md +325 -0
  3. langwatch_scenario-0.4.0/langwatch_scenario.egg-info/PKG-INFO +363 -0
  4. {langwatch_scenario-0.2.0 → langwatch_scenario-0.4.0}/langwatch_scenario.egg-info/SOURCES.txt +10 -4
  5. {langwatch_scenario-0.2.0 → langwatch_scenario-0.4.0}/langwatch_scenario.egg-info/requires.txt +6 -1
  6. {langwatch_scenario-0.2.0 → langwatch_scenario-0.4.0}/pyproject.toml +22 -9
  7. langwatch_scenario-0.4.0/scenario/__init__.py +250 -0
  8. langwatch_scenario-0.4.0/scenario/agent_adapter.py +111 -0
  9. langwatch_scenario-0.4.0/scenario/cache.py +186 -0
  10. langwatch_scenario-0.4.0/scenario/config.py +183 -0
  11. langwatch_scenario-0.4.0/scenario/error_messages.py +104 -0
  12. langwatch_scenario-0.4.0/scenario/judge_agent.py +435 -0
  13. langwatch_scenario-0.4.0/scenario/pytest_plugin.py +385 -0
  14. langwatch_scenario-0.4.0/scenario/scenario_executor.py +758 -0
  15. langwatch_scenario-0.4.0/scenario/scenario_state.py +205 -0
  16. langwatch_scenario-0.4.0/scenario/script.py +361 -0
  17. langwatch_scenario-0.4.0/scenario/types.py +269 -0
  18. langwatch_scenario-0.4.0/scenario/user_simulator_agent.py +249 -0
  19. langwatch_scenario-0.4.0/scenario/utils.py +514 -0
  20. {langwatch_scenario-0.2.0 → langwatch_scenario-0.4.0}/setup.py +1 -1
  21. langwatch_scenario-0.4.0/tests/test_scenario.py +419 -0
  22. langwatch_scenario-0.4.0/tests/test_scenario_agent.py +38 -0
  23. langwatch_scenario-0.4.0/tests/test_scenario_executor.py +202 -0
  24. langwatch_scenario-0.2.0/PKG-INFO +0 -254
  25. langwatch_scenario-0.2.0/README.md +0 -221
  26. langwatch_scenario-0.2.0/langwatch_scenario.egg-info/PKG-INFO +0 -254
  27. langwatch_scenario-0.2.0/scenario/__init__.py +0 -26
  28. langwatch_scenario-0.2.0/scenario/cache.py +0 -62
  29. langwatch_scenario-0.2.0/scenario/config.py +0 -28
  30. langwatch_scenario-0.2.0/scenario/error_messages.py +0 -76
  31. langwatch_scenario-0.2.0/scenario/pytest_plugin.py +0 -177
  32. langwatch_scenario-0.2.0/scenario/result.py +0 -74
  33. langwatch_scenario-0.2.0/scenario/scenario.py +0 -123
  34. langwatch_scenario-0.2.0/scenario/scenario_executor.py +0 -204
  35. langwatch_scenario-0.2.0/scenario/testing_agent.py +0 -262
  36. langwatch_scenario-0.2.0/scenario/utils.py +0 -121
  37. {langwatch_scenario-0.2.0 → langwatch_scenario-0.4.0}/langwatch_scenario.egg-info/dependency_links.txt +0 -0
  38. {langwatch_scenario-0.2.0 → langwatch_scenario-0.4.0}/langwatch_scenario.egg-info/entry_points.txt +0 -0
  39. {langwatch_scenario-0.2.0 → langwatch_scenario-0.4.0}/langwatch_scenario.egg-info/top_level.txt +0 -0
  40. {langwatch_scenario-0.2.0 → langwatch_scenario-0.4.0}/setup.cfg +0 -0
langwatch_scenario-0.4.0/PKG-INFO (new file)
@@ -0,0 +1,363 @@
+ Metadata-Version: 2.4
+ Name: langwatch-scenario
+ Version: 0.4.0
+ Summary: The end-to-end agent testing library
+ Author-email: LangWatch Team <support@langwatch.ai>
+ License: MIT
+ Project-URL: Homepage, https://github.com/langwatch/scenario
+ Project-URL: Bug Tracker, https://github.com/langwatch/scenario/issues
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Requires-Python: >=3.9
+ Description-Content-Type: text/markdown
+ Requires-Dist: pytest>=8.1.1
+ Requires-Dist: litellm>=1.49.0
+ Requires-Dist: python-dotenv>=1.0.1
+ Requires-Dist: termcolor>=2.4.0
+ Requires-Dist: pydantic>=2.7.0
+ Requires-Dist: joblib>=1.4.2
+ Requires-Dist: wrapt>=1.17.2
+ Requires-Dist: pytest-asyncio>=0.26.0
+ Requires-Dist: rich<15.0.0,>=13.3.3
+ Requires-Dist: pksuid>=1.1.2
+ Provides-Extra: dev
+ Requires-Dist: black; extra == "dev"
+ Requires-Dist: isort; extra == "dev"
+ Requires-Dist: pytest-cov; extra == "dev"
+ Requires-Dist: pre-commit; extra == "dev"
+ Requires-Dist: commitizen; extra == "dev"
+ Requires-Dist: pyright; extra == "dev"
+ Requires-Dist: pydantic-ai; extra == "dev"
+ Requires-Dist: function-schema; extra == "dev"
+
+ ![scenario](https://github.com/langwatch/scenario/raw/main/assets/scenario-wide.webp)
+
+ <div align="center">
+ <!-- Discord, PyPI, Docs, etc links -->
+ </div>
+
+ # Scenario
+
+ Scenario is an Agent Testing Framework for testing AI agents through Simulation Testing.
+
+ You define the conversation scenario and let it play out: the simulator will keep chatting back and forth with _your_ agent until it reaches the desired goal or detects an unexpected behavior, based on the criteria you defined.
+
+ - Test your agents' end-to-end conversations with specified scenarios to capture both happy paths and edge cases
+ - Full flexibility in how much you want to guide the conversation, from fully scripted scenarios to completely automated simulations
+ - Run evaluations at any point of the conversation, designed for multi-turn
+ - Works in combination with any testing and LLM evaluation framework, completely agnostic
+ - Works with any LLM and Agent Framework, easy integration
+
+ [📺 Video Tutorial](https://www.youtube.com/watch?v=f8NLpkY0Av4)
+
+ ### See also
+
+ - [Scenario TypeScript](https://github.com/langwatch/scenario-ts/)
+ - [Scenario Go](https://github.com/langwatch/scenario-go/)
+
+ ## Example
+
+ ```python
+ @pytest.mark.agent_test
+ @pytest.mark.asyncio
+ async def test_weather_agent():
+     # Integrate with your agent
+     class WeatherAgent(scenario.AgentAdapter):
+         async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
+             return weather_agent(input.messages)
+
+     # Define any custom assertions
+     def check_for_weather_tool_call(state: scenario.ScenarioState):
+         assert state.has_tool_call("get_current_weather")
+
+     # Run the scenario
+     result = await scenario.run(
+         name="checking the weather",
+         description="""
+             The user is planning a boat trip from Barcelona to Rome,
+             and is wondering what the weather will be like.
+         """,
+         agents=[
+             WeatherAgent(),
+             scenario.UserSimulatorAgent(model="openai/gpt-4.1-mini"),
+         ],
+         script=[
+             scenario.user(),
+             scenario.agent(),
+             check_for_weather_tool_call,  # check for the tool call after the first agent response
+             scenario.succeed(),
+         ],
+     )
+
+     # Assert the simulation was successful
+     assert result.success
+ ```
+
+ > [!NOTE]
+ > This is a very basic example; keep reading to see how to run a simulation completely script-free, using a Judge Agent to evaluate in real time.
+
+ Check out more examples in the [examples folder](./examples/).
+
+ ## Getting Started
+
+ Install pytest and scenario:
+
+ ```bash
+ pip install pytest langwatch-scenario
+ ```
+
+ Now create your first scenario and save it as `tests/test_vegetarian_recipe_agent.py`, copying the full working example below:
+
+ ```python
+ import pytest
+ import scenario
+ import litellm
+
+ scenario.configure(default_model="openai/gpt-4.1-mini")
+
+
+ @pytest.mark.agent_test
+ @pytest.mark.asyncio
+ async def test_vegetarian_recipe_agent():
+     class Agent(scenario.AgentAdapter):
+         async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
+             return vegetarian_recipe_agent(input.messages)
+
+     # Run a simulation scenario
+     result = await scenario.run(
+         name="dinner idea",
+         description="""
+             It's Saturday evening, the user is very hungry and tired,
+             but has no money to order out, so they are looking for a recipe.
+         """,
+         agents=[
+             Agent(),
+             scenario.UserSimulatorAgent(),
+             scenario.JudgeAgent(
+                 criteria=[
+                     "Agent should not ask more than two follow-up questions",
+                     "Agent should generate a recipe",
+                     "Recipe should include a list of ingredients",
+                     "Recipe should include step-by-step cooking instructions",
+                     "Recipe should be vegetarian and not include any sort of meat",
+                 ]
+             ),
+         ],
+     )
+
+     # Assert for pytest to know whether the test passed
+     assert result.success
+
+
+ # Example agent implementation
+ @scenario.cache()
+ def vegetarian_recipe_agent(messages) -> scenario.AgentReturnTypes:
+     response = litellm.completion(
+         model="openai/gpt-4.1-mini",
+         messages=[
+             {
+                 "role": "system",
+                 "content": """
+                     You are a vegetarian recipe agent.
+                     Given the user request, ask AT MOST ONE follow-up question,
+                     then provide a complete recipe. Keep your responses concise and focused.
+                 """,
+             },
+             *messages,
+         ],
+     )
+
+     return response.choices[0].message  # type: ignore
+ ```
+
+ Create a `.env` file and put your OpenAI API key in it:
+
+ ```bash
+ OPENAI_API_KEY=<your-api-key>
+ ```
+
+ Now run it with pytest:
+
+ ```bash
+ pytest -s tests/test_vegetarian_recipe_agent.py
+ ```
+
+ This is how it will look:
+
+ [![asciicast](https://asciinema.org/a/nvO5GWGzqKTTCd8gtNSezQw11.svg)](https://asciinema.org/a/nvO5GWGzqKTTCd8gtNSezQw11)
+
+ You can find the same code example in [examples/test_vegetarian_recipe_agent.py](examples/test_vegetarian_recipe_agent.py).
+
+ ## Script-free Simulation
+
+ By providing a User Simulator Agent and a description of the Scenario, the simulated user will automatically generate messages to the agent until the scenario is successful or the maximum number of turns is reached.
+
+ You can then use a Judge Agent to evaluate the scenario in real time against certain criteria: at every turn, the Judge Agent decides whether to let the simulation proceed or end it with a verdict.
+
+ You can combine it with a script, for example to control the beginning of the conversation, or simply let it run scriptless; this is very useful for testing an open-ended case like a vibe coding assistant:
+
+ ```python
+ result = await scenario.run(
+     name="dog walking startup landing page",
+     description="""
+         the user wants to create a new landing page for their dog walking startup
+
+         send the first message to generate the landing page, then a single follow-up request to extend it, then give your final verdict
+     """,
+     agents=[
+         LovableAgentAdapter(template_path=template_path),
+         scenario.UserSimulatorAgent(),
+         scenario.JudgeAgent(
+             criteria=[
+                 "agent reads the files before going ahead and making changes",
+                 "agent modified the index.css file, not only the Index.tsx file",
+                 "agent created a comprehensive landing page",
+                 "agent extended the landing page with a new section",
+                 "agent should NOT say it can't read the file",
+                 "agent should NOT produce incomplete code or be too lazy to finish",
+             ],
+         ),
+     ],
+     max_turns=5,  # optional
+ )
+ ```
+
+ Check out the fully working Lovable Clone example in [examples/test_lovable_clone.py](examples/test_lovable_clone.py).
+
+ ## Full Control of the Conversation
+
+ You can specify a script for guiding the scenario by passing a list of steps to the `script` field. Those steps are simply arbitrary functions that take the current state of the scenario as an argument, so you can do things like:
+
+ - Control what the user says, or let it be generated automatically
+ - Control what the agent says, or let it be generated automatically
+ - Add custom assertions, for example making sure a tool was called
+ - Add a custom evaluation, from an external library
+ - Let the simulation proceed for a certain number of turns, and evaluate at each new turn
+ - Trigger the judge agent to decide on a verdict
+ - Add arbitrary messages, like mock tool calls, in the middle of the conversation
+
+ Everything is possible, using the same simple structure:
+
+ ```python
+ @pytest.mark.agent_test
+ @pytest.mark.asyncio
+ async def test_ai_assistant_agent():
+     # Custom assertion function to verify a tool was called
+     def check_if_tool_was_called(state: scenario.ScenarioState) -> None:
+         assert state.has_tool_call("web_search")
+
+     result = await scenario.run(
+         name="false assumptions",
+         description="""
+             The agent makes the false assumption that the user is talking about an ATM bank,
+             and the user corrects it that they actually mean river banks.
+         """,
+         agents=[
+             AiAssistantAgentAdapter(),
+             scenario.UserSimulatorAgent(),
+             scenario.JudgeAgent(
+                 criteria=[
+                     "user should get good recommendations on river crossing",
+                     "agent should NOT follow up about ATM recommendations after the user has corrected them that they are just hiking",
+                 ],
+             ),
+         ],
+         max_turns=5,
+         script=[
+             # Define an existing history of messages
+             scenario.user("how do I safely approach a bank?"),
+
+             # Or let it be generated automatically
+             scenario.agent(),
+
+             # Add custom assertions, for example making sure a tool was called
+             check_if_tool_was_called,
+
+             # Another user message
+             scenario.user(),
+
+             # Let the simulation proceed for 2 more turns, print at every turn
+             scenario.proceed(
+                 turns=2,
+                 on_turn=lambda state: print(f"Turn {state.current_turn}: {state.messages}"),
+             ),
+
+             # Time to make a judgment call
+             scenario.judge(),
+         ],
+     )
+
+     assert result.success
+ ```
+
+ ## Debug mode
+
+ You can enable debug mode by setting the `debug` field to `True` in `scenario.configure` or in the specific scenario you are running, or by passing the `--debug` flag to pytest.
+
+ Debug mode allows you to see the messages in slow motion, step by step, and intervene with your own inputs to debug your agent from the middle of the conversation.
+
+ ```python
+ scenario.configure(default_model="openai/gpt-4o-mini", debug=True)
+ ```
+
+ or
+
+ ```bash
+ pytest -s tests/test_vegetarian_recipe_agent.py --debug
+ ```
+
+ ## Cache
+
+ Each time the scenario runs, the user simulator agent might choose a different input to start. This is good for covering the variance of real users, but the non-deterministic nature of it can make tests less repeatable, more costly, and harder to debug. To solve this, you can set the `cache_key` field in `scenario.configure` or in the specific scenario you are running; this will make the simulated agents give the same inputs when given the same scenario:
+
+ ```python
+ scenario.configure(default_model="openai/gpt-4o-mini", cache_key="42")
+ ```
+
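+ The paragraph above also mentions setting the cache key on a specific scenario rather than globally; a minimal sketch of that, assuming `cache_key` can be passed straight to `scenario.run` (a hypothetical parameter placement, check the library docs for the exact option):
+
+ ```python
+ # Hypothetical per-scenario override: pin the cache key for this simulation only
+ result = await scenario.run(
+     name="dinner idea",
+     description="The user is looking for a vegetarian recipe.",
+     agents=[Agent(), scenario.UserSimulatorAgent(), scenario.JudgeAgent(criteria=["Agent should generate a recipe"])],
+     cache_key="42",  # assumption: per-run cache key, as suggested by the paragraph above
+ )
+ ```
+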
+ To bust the cache, you can simply pass a different `cache_key`, disable it, or delete the cache files located at `~/.scenario/cache`.
+
+ To go a step further and fully cache the test end-to-end, you can also wrap the LLM calls, or any other non-deterministic functions on your application side, with the `@scenario.cache` decorator:
+
+ ```python
+ class MyAgent:
+     @scenario.cache(ignore=["self"])
+     def invoke(self, message, context):
+         return client.chat.completions.create(
+             # ...
+         )
+ ```
+
+ This will cache any function call you decorate when running the tests and make them repeatable, hashed by the function arguments, the scenario being executed, and the `cache_key` you provided. You can exclude arguments that should not be part of the cache key by naming them in the `ignore` argument.
+
+ ## Disable Output
+
+ You can remove the `-s` flag from pytest to hide the output during the test; it will then only show up if the test fails. Alternatively, you can set `verbose=False` in `scenario.configure` or in the specific scenario you are running.
+
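+ For example, a minimal sketch of silencing the output globally, assuming `verbose` is accepted by `scenario.configure` alongside the other options shown above:
+
+ ```python
+ import scenario
+
+ # Hide the scenario conversation output during the test run;
+ # pytest will still show it when a test fails
+ scenario.configure(default_model="openai/gpt-4.1-mini", verbose=False)
+ ```
+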
+ ## Running in parallel
+
+ As the number of your scenarios grows, you might want to run them in parallel to speed up your whole test suite. We suggest using the [pytest-asyncio-concurrent](https://pypi.org/project/pytest-asyncio-concurrent/) plugin to do so.
+
+ Simply install the plugin from the link above, then replace the `@pytest.mark.asyncio` annotation in your tests with `@pytest.mark.asyncio_concurrent`, adding a group name to mark the group of scenarios that should run in parallel together, e.g.:
+
+ ```python
+ @pytest.mark.agent_test
+ @pytest.mark.asyncio_concurrent(group="vegetarian_recipe_agent")
+ async def test_vegetarian_recipe_agent():
+     # ...
+
+
+ @pytest.mark.agent_test
+ @pytest.mark.asyncio_concurrent(group="vegetarian_recipe_agent")
+ async def test_user_is_very_hungry():
+     # ...
+ ```
+
+ Those two scenarios will now run in parallel.
+
+ ## License
+
+ MIT License