langwatch-scenario 0.4.0__tar.gz → 0.6.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/PKG-INFO +93 -71
  2. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/README.md +87 -70
  3. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/langwatch_scenario.egg-info/PKG-INFO +93 -71
  4. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/langwatch_scenario.egg-info/SOURCES.txt +13 -2
  5. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/langwatch_scenario.egg-info/requires.txt +5 -0
  6. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/pyproject.toml +9 -3
  7. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/scenario/__init__.py +11 -114
  8. langwatch_scenario-0.6.0/scenario/_utils/__init__.py +32 -0
  9. langwatch_scenario-0.6.0/scenario/_utils/ids.py +58 -0
  10. langwatch_scenario-0.6.0/scenario/_utils/message_conversion.py +103 -0
  11. {langwatch_scenario-0.4.0/scenario → langwatch_scenario-0.6.0/scenario/_utils}/utils.py +21 -110
  12. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/scenario/agent_adapter.py +8 -4
  13. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/scenario/cache.py +4 -3
  14. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/scenario/config.py +7 -5
  15. langwatch_scenario-0.6.0/scenario/events/__init__.py +66 -0
  16. langwatch_scenario-0.6.0/scenario/events/event_bus.py +175 -0
  17. langwatch_scenario-0.6.0/scenario/events/event_reporter.py +83 -0
  18. langwatch_scenario-0.6.0/scenario/events/events.py +169 -0
  19. langwatch_scenario-0.6.0/scenario/events/messages.py +84 -0
  20. langwatch_scenario-0.6.0/scenario/events/utils.py +86 -0
  21. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/scenario/judge_agent.py +7 -28
  22. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/scenario/pytest_plugin.py +2 -47
  23. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/scenario/scenario_executor.py +268 -84
  24. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/scenario/scenario_state.py +6 -6
  25. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/scenario/script.py +9 -9
  26. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/scenario/types.py +10 -6
  27. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/scenario/user_simulator_agent.py +4 -11
  28. langwatch_scenario-0.6.0/tests/test_event_reporter.py +53 -0
  29. langwatch_scenario-0.6.0/tests/test_scenario_event_bus.py +186 -0
  30. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/tests/test_scenario_executor.py +28 -1
  31. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/langwatch_scenario.egg-info/dependency_links.txt +0 -0
  32. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/langwatch_scenario.egg-info/entry_points.txt +0 -0
  33. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/langwatch_scenario.egg-info/top_level.txt +0 -0
  34. /langwatch_scenario-0.4.0/scenario/error_messages.py → /langwatch_scenario-0.6.0/scenario/_error_messages.py +0 -0
  35. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/setup.cfg +0 -0
  36. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/setup.py +0 -0
  37. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/tests/test_scenario.py +0 -0
  38. {langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/tests/test_scenario_agent.py +0 -0
{langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: langwatch-scenario
- Version: 0.4.0
+ Version: 0.6.0
  Summary: The end-to-end agent testing library
  Author-email: LangWatch Team <support@langwatch.ai>
  License: MIT
@@ -26,6 +26,11 @@ Requires-Dist: wrapt>=1.17.2
  Requires-Dist: pytest-asyncio>=0.26.0
  Requires-Dist: rich<15.0.0,>=13.3.3
  Requires-Dist: pksuid>=1.1.2
+ Requires-Dist: pdoc3>=0.11.6
+ Requires-Dist: ag-ui-protocol>=0.1.0
+ Requires-Dist: httpx>=0.27.0
+ Requires-Dist: rx>=3.2.0
+ Requires-Dist: respx>=0.22.0
  Provides-Extra: dev
  Requires-Dist: black; extra == "dev"
  Requires-Dist: isort; extra == "dev"
@@ -44,65 +49,59 @@ Requires-Dist: function-schema; extra == "dev"

  # Scenario

- Scenario is an Agent Testing Framework for testing AI agents through Simulation Testing.
+ Scenario is an Agent Testing Framework based on simulations, it can:

- You define the conversation scenario and let it play out, it will keep chatting back and forth with _your_ agent until it reaches the desired goal or detects an unexpected behavior based on the criteria you defined.
-
- - Test your agents end-to-end conversations with specified scenarios to capture both happy paths and edge cases
- - Full flexibility of how much you want to guide the conversation, from fully scripted scenarios to completely automated simulations
- - Run evaluations at any point of the conversation, designed for multi-turn
- - Works in combination with any testing and LLM evaluation frameworks, completely agnostic
- - Works with any LLM and Agent Framework, easy integration
+ - Test real agent behavior by simulating users in different scenarios and edge cases
+ - Evaluate and judge at any point of the conversation, powerful multi-turn control
+ - Combine it with any LLM eval framework or custom evals, agnostic by design
+ - Integrate your Agent by implementing just one `call()` method
+ - Available in Python, TypeScript and Go

  [📺 Video Tutorial](https://www.youtube.com/watch?v=f8NLpkY0Av4)

- ### See also
+ ### In other languages

  - [Scenario TypeScript](https://github.com/langwatch/scenario-ts/)
  - [Scenario Go](https://github.com/langwatch/scenario-go/)

  ## Example

+ This is how a simple simulation with tool check looks like with Scenario:
+
  ```python
- @pytest.mark.agent_test
- @pytest.mark.asyncio
- async def test_weather_agent():
-     # Integrate with your agent
-     class WeatherAgent(scenario.AgentAdapter):
-         async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
-             return weather_agent(input.messages)
+ # Define any custom assertions
+ def check_for_weather_tool_call(state: scenario.ScenarioState):
+     assert state.has_tool_call("get_current_weather")

-     # Define any custom assertions
-     def check_for_weather_tool_call(state: scenario.ScenarioState):
-         assert state.has_tool_call("get_current_weather")
+ result = await scenario.run(
+     name="checking the weather",

-     # Run the scenario
-     result = await scenario.run(
-         name="checking the weather",
-         description="""
-             The user is planning a boat trip from Barcelona to Rome,
-             and is wondering what the weather will be like.
-         """,
-         agents=[
-             WeatherAgent(),
-             scenario.UserSimulatorAgent(model="openai/gpt-4.1-mini"),
-         ],
-         script=[
-             scenario.user(),
-             scenario.agent(),
-             check_for_weather_tool_call, # check for tool call after the first agent response
-             scenario.succeed(),
-         ],
-     )
+     # Define the prompt to guide the simulation
+     description="""
+         The user is planning a boat trip from Barcelona to Rome,
+         and is wondering what the weather will be like.
+     """,

-     # Assert the simulation was successful
-     assert result.success
+     # Define the agents that will play this simulation
+     agents=[
+         WeatherAgent(),
+         scenario.UserSimulatorAgent(model="openai/gpt-4.1-mini"),
+     ],
+
+     # (Optional) Control the simulation
+     script=[
+         scenario.user(), # let the user simulator generate a user message
+         scenario.agent(), # agent responds
+         check_for_weather_tool_call, # check for tool call after the first agent response
+         scenario.succeed(), # simulation ends successfully
+     ],
+ )
+
+ assert result.success
  ```

  > [!NOTE]
- > This is a very basic example, keep reading to see how to run a simulation completely script-free, using a Judge Agent to evaluate in real-time.
-
- Check out more examples in the [examples folder](./examples/).
+ > Check out full examples in the [examples folder](./examples/).

  ## Getting Started

@@ -193,17 +192,17 @@ pytest -s tests/test_vegetarian_recipe_agent.py

  This is how it will look like:

- [![asciicast](https://asciinema.org/a/nvO5GWGzqKTTCd8gtNSezQw11.svg)](https://asciinema.org/a/nvO5GWGzqKTTCd8gtNSezQw11)
+ [![asciicast](./assets/ascii-cinema.svg)](https://asciinema.org/a/nvO5GWGzqKTTCd8gtNSezQw11)

  You can find the same code example in [examples/test_vegetarian_recipe_agent.py](examples/test_vegetarian_recipe_agent.py).

- ## Script-free Simulation
+ ## Simulation on Autopilot

- By providing a User Simulator Agent and a description of the Scenario, the simulated user will automatically generate messages to the agent until the scenario is successful or the maximum number of turns is reached.
+ By providing a User Simulator Agent and a description of the Scenario without a script, the simulated user will automatically generate messages to the agent until the scenario is successful or the maximum number of turns is reached.

  You can then use a Judge Agent to evaluate the scenario in real-time given certain criteria, at every turn, the Judge Agent will decide if it should let the simulation proceed or end it with a verdict.

- You can combine it with a script, to control for example the beginning of the conversation, or simply let it run scriptless, this is very useful to test an open case like a vibe coding assistant:
+ For example, here is a scenario that tests a vibe coding assistant:

  ```python
  result = await scenario.run(
@@ -233,6 +232,8 @@ result = await scenario.run(

  Check out the fully working Lovable Clone example in [examples/test_lovable_clone.py](examples/test_lovable_clone.py).

+ You can also combine it with a partial script too! By for example controlling only the beginning of the conversation, and let the rest proceed on autopilot, see the next section.
+
  ## Full Control of the Conversation

  You can specify a script for guiding the scenario by passing a list of steps to the `script` field, those steps are simply arbitrary functions that take the current state of the scenario as an argument, so you can do things like:
@@ -250,35 +251,35 @@ Everything is possible, using the same simple structure:
  ```python
  @pytest.mark.agent_test
  @pytest.mark.asyncio
- async def test_ai_assistant_agent():
-     scenario = Scenario(
-         name="false assumptions",
+ async def test_early_assumption_bias():
+     result = await scenario.run(
+         name="early assumption bias",
          description="""
              The agent makes false assumption that the user is talking about an ATM bank, and user corrects it that they actually mean river banks
          """,
-         agent=AiAssistantAgentAdapter,
-         criteria=[
-             "user should get good recommendations on river crossing",
-             "agent should NOT follow up about ATM recommendation after user has corrected them they are just hiking",
+         agents=[
+             Agent(),
+             scenario.UserSimulatorAgent(),
+             scenario.JudgeAgent(
+                 criteria=[
+                     "user should get good recommendations on river crossing",
+                     "agent should NOT keep following up about ATM recommendation after user has corrected them that they are actually just hiking",
+                 ],
+             ),
          ],
-         max_turns=5,
-     )
-
-     def check_if_tool_was_called(state: ScenarioExecutor) -> None:
-         assert state.has_tool_call("web_search")
-
-     result = await scenario.script(
-         [
-             # Define existing history of messages
+         max_turns=10,
+         script=[
+             # Define hardcoded messages
+             scenario.agent("Hello, how can I help you today?"),
              scenario.user("how do I safely approach a bank?"),

-             # Or let it be generate automatically
+             # Or let it be generated automatically
              scenario.agent(),

              # Add custom assertions, for example making sure a tool was called
              check_if_tool_was_called,

-             # Another user message
+             # Generate a user follow-up message
              scenario.user(),

              # Let the simulation proceed for 2 more turns, print at every turn
@@ -289,8 +290,8 @@ async def test_ai_assistant_agent():

              # Time to make a judgment call
              scenario.judge(),
-         ]
-     ).run()
+         ],
+     )

      assert result.success
  ```
@@ -302,7 +303,7 @@ You can enable debug mode by setting the `debug` field to `True` in the `Scenari
  Debug mode allows you to see the messages in slow motion step by step, and intervene with your own inputs to debug your agent from the middle of the conversation.

  ```python
- Scenario.configure(testing_agent=TestingAgent(model="openai/gpt-4o-mini"), debug=True)
+ scenario.configure(default_model="openai/gpt-4.1-mini", debug=True)
  ```

  or
@@ -316,16 +317,17 @@ pytest -s tests/test_vegetarian_recipe_agent.py --debug
  Each time the scenario runs, the testing agent might chose a different input to start, this is good to make sure it covers the variance of real users as well, however we understand that the non-deterministic nature of it might make it less repeatable, costly and harder to debug. To solve for it, you can use the `cache_key` field in the `Scenario.configure` method or in the specific scenario you are running, this will make the testing agent give the same input for given the same scenario:

  ```python
- Scenario.configure(testing_agent=TestingAgent(model="openai/gpt-4o-mini"), cache_key="42")
+ scenario.configure(default_model="openai/gpt-4.1-mini", cache_key="42")
  ```

  To bust the cache, you can simply pass a different `cache_key`, disable it, or delete the cache files located at `~/.scenario/cache`.

- To go a step further and fully cache the test end-to-end, you can also wrap the LLM calls or any other non-deterministic functions in your application side with the `@scenario_cache` decorator:
+ To go a step further and fully cache the test end-to-end, you can also wrap the LLM calls or any other non-deterministic functions in your application side with the `@scenario.cache` decorator:

  ```python
+ # Inside your actual agent implementation
  class MyAgent:
-     @scenario_cache(ignore=["self"])
+     @scenario.cache()
      def invoke(self, message, context):
          return client.chat.completions.create(
              # ...
@@ -358,6 +360,26 @@ async def test_user_is_very_hungry():

  Those two scenarios should now run in parallel.

+ ## Events System
+
+ Scenario automatically publishes events during execution for monitoring and observability. You can enable event reporting by setting environment variables:
+
+ ```bash
+ # Enable automatic event reporting
+ export LANGWATCH_ENDPOINT="https://api.langwatch.ai"
+ export LANGWATCH_API_KEY="your-api-key"
+ ```
+
+ With these variables set, Scenario will automatically:
+
+ - Publish events when scenarios start, finish, and when messages are added
+ - Handle retries and error handling automatically
+ - Process events asynchronously without blocking your tests
+
+ The events include timing information, conversation history, and success/failure metrics for analysis.
+
+ For advanced customization, see the event classes in the codebase for detailed documentation.
+
  ## License

  MIT License
{langwatch_scenario-0.4.0 → langwatch_scenario-0.6.0}/README.md

@@ -6,65 +6,59 @@

  # Scenario

- Scenario is an Agent Testing Framework for testing AI agents through Simulation Testing.
+ Scenario is an Agent Testing Framework based on simulations, it can:

- You define the conversation scenario and let it play out, it will keep chatting back and forth with _your_ agent until it reaches the desired goal or detects an unexpected behavior based on the criteria you defined.
-
- - Test your agents end-to-end conversations with specified scenarios to capture both happy paths and edge cases
- - Full flexibility of how much you want to guide the conversation, from fully scripted scenarios to completely automated simulations
- - Run evaluations at any point of the conversation, designed for multi-turn
- - Works in combination with any testing and LLM evaluation frameworks, completely agnostic
- - Works with any LLM and Agent Framework, easy integration
+ - Test real agent behavior by simulating users in different scenarios and edge cases
+ - Evaluate and judge at any point of the conversation, powerful multi-turn control
+ - Combine it with any LLM eval framework or custom evals, agnostic by design
+ - Integrate your Agent by implementing just one `call()` method
+ - Available in Python, TypeScript and Go

  [📺 Video Tutorial](https://www.youtube.com/watch?v=f8NLpkY0Av4)

- ### See also
+ ### In other languages

  - [Scenario TypeScript](https://github.com/langwatch/scenario-ts/)
  - [Scenario Go](https://github.com/langwatch/scenario-go/)

  ## Example

+ This is how a simple simulation with tool check looks like with Scenario:
+
  ```python
- @pytest.mark.agent_test
- @pytest.mark.asyncio
- async def test_weather_agent():
-     # Integrate with your agent
-     class WeatherAgent(scenario.AgentAdapter):
-         async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
-             return weather_agent(input.messages)
+ # Define any custom assertions
+ def check_for_weather_tool_call(state: scenario.ScenarioState):
+     assert state.has_tool_call("get_current_weather")

-     # Define any custom assertions
-     def check_for_weather_tool_call(state: scenario.ScenarioState):
-         assert state.has_tool_call("get_current_weather")
+ result = await scenario.run(
+     name="checking the weather",

-     # Run the scenario
-     result = await scenario.run(
-         name="checking the weather",
-         description="""
-             The user is planning a boat trip from Barcelona to Rome,
-             and is wondering what the weather will be like.
-         """,
-         agents=[
-             WeatherAgent(),
-             scenario.UserSimulatorAgent(model="openai/gpt-4.1-mini"),
-         ],
-         script=[
-             scenario.user(),
-             scenario.agent(),
-             check_for_weather_tool_call, # check for tool call after the first agent response
-             scenario.succeed(),
-         ],
-     )
+     # Define the prompt to guide the simulation
+     description="""
+         The user is planning a boat trip from Barcelona to Rome,
+         and is wondering what the weather will be like.
+     """,

-     # Assert the simulation was successful
-     assert result.success
+     # Define the agents that will play this simulation
+     agents=[
+         WeatherAgent(),
+         scenario.UserSimulatorAgent(model="openai/gpt-4.1-mini"),
+     ],
+
+     # (Optional) Control the simulation
+     script=[
+         scenario.user(), # let the user simulator generate a user message
+         scenario.agent(), # agent responds
+         check_for_weather_tool_call, # check for tool call after the first agent response
+         scenario.succeed(), # simulation ends successfully
+     ],
+ )
+
+ assert result.success
  ```

  > [!NOTE]
- > This is a very basic example, keep reading to see how to run a simulation completely script-free, using a Judge Agent to evaluate in real-time.
-
- Check out more examples in the [examples folder](./examples/).
+ > Check out full examples in the [examples folder](./examples/).

  ## Getting Started

@@ -155,17 +149,17 @@ pytest -s tests/test_vegetarian_recipe_agent.py

  This is how it will look like:

- [![asciicast](https://asciinema.org/a/nvO5GWGzqKTTCd8gtNSezQw11.svg)](https://asciinema.org/a/nvO5GWGzqKTTCd8gtNSezQw11)
+ [![asciicast](./assets/ascii-cinema.svg)](https://asciinema.org/a/nvO5GWGzqKTTCd8gtNSezQw11)

  You can find the same code example in [examples/test_vegetarian_recipe_agent.py](examples/test_vegetarian_recipe_agent.py).

- ## Script-free Simulation
+ ## Simulation on Autopilot

- By providing a User Simulator Agent and a description of the Scenario, the simulated user will automatically generate messages to the agent until the scenario is successful or the maximum number of turns is reached.
+ By providing a User Simulator Agent and a description of the Scenario without a script, the simulated user will automatically generate messages to the agent until the scenario is successful or the maximum number of turns is reached.

  You can then use a Judge Agent to evaluate the scenario in real-time given certain criteria, at every turn, the Judge Agent will decide if it should let the simulation proceed or end it with a verdict.

- You can combine it with a script, to control for example the beginning of the conversation, or simply let it run scriptless, this is very useful to test an open case like a vibe coding assistant:
+ For example, here is a scenario that tests a vibe coding assistant:

  ```python
  result = await scenario.run(
@@ -195,6 +189,8 @@ result = await scenario.run(

  Check out the fully working Lovable Clone example in [examples/test_lovable_clone.py](examples/test_lovable_clone.py).

+ You can also combine it with a partial script too! By for example controlling only the beginning of the conversation, and let the rest proceed on autopilot, see the next section.
+
  ## Full Control of the Conversation

  You can specify a script for guiding the scenario by passing a list of steps to the `script` field, those steps are simply arbitrary functions that take the current state of the scenario as an argument, so you can do things like:
@@ -212,35 +208,35 @@ Everything is possible, using the same simple structure:
  ```python
  @pytest.mark.agent_test
  @pytest.mark.asyncio
- async def test_ai_assistant_agent():
-     scenario = Scenario(
-         name="false assumptions",
+ async def test_early_assumption_bias():
+     result = await scenario.run(
+         name="early assumption bias",
          description="""
              The agent makes false assumption that the user is talking about an ATM bank, and user corrects it that they actually mean river banks
          """,
-         agent=AiAssistantAgentAdapter,
-         criteria=[
-             "user should get good recommendations on river crossing",
-             "agent should NOT follow up about ATM recommendation after user has corrected them they are just hiking",
+         agents=[
+             Agent(),
+             scenario.UserSimulatorAgent(),
+             scenario.JudgeAgent(
+                 criteria=[
+                     "user should get good recommendations on river crossing",
+                     "agent should NOT keep following up about ATM recommendation after user has corrected them that they are actually just hiking",
+                 ],
+             ),
          ],
-         max_turns=5,
-     )
-
-     def check_if_tool_was_called(state: ScenarioExecutor) -> None:
-         assert state.has_tool_call("web_search")
-
-     result = await scenario.script(
-         [
-             # Define existing history of messages
+         max_turns=10,
+         script=[
+             # Define hardcoded messages
+             scenario.agent("Hello, how can I help you today?"),
              scenario.user("how do I safely approach a bank?"),

-             # Or let it be generate automatically
+             # Or let it be generated automatically
              scenario.agent(),

              # Add custom assertions, for example making sure a tool was called
              check_if_tool_was_called,

-             # Another user message
+             # Generate a user follow-up message
              scenario.user(),

              # Let the simulation proceed for 2 more turns, print at every turn
@@ -251,8 +247,8 @@ async def test_ai_assistant_agent():

              # Time to make a judgment call
              scenario.judge(),
-         ]
-     ).run()
+         ],
+     )

      assert result.success
  ```
@@ -264,7 +260,7 @@ You can enable debug mode by setting the `debug` field to `True` in the `Scenari
  Debug mode allows you to see the messages in slow motion step by step, and intervene with your own inputs to debug your agent from the middle of the conversation.

  ```python
- Scenario.configure(testing_agent=TestingAgent(model="openai/gpt-4o-mini"), debug=True)
+ scenario.configure(default_model="openai/gpt-4.1-mini", debug=True)
  ```

  or
@@ -278,16 +274,17 @@ pytest -s tests/test_vegetarian_recipe_agent.py --debug
  Each time the scenario runs, the testing agent might chose a different input to start, this is good to make sure it covers the variance of real users as well, however we understand that the non-deterministic nature of it might make it less repeatable, costly and harder to debug. To solve for it, you can use the `cache_key` field in the `Scenario.configure` method or in the specific scenario you are running, this will make the testing agent give the same input for given the same scenario:

  ```python
- Scenario.configure(testing_agent=TestingAgent(model="openai/gpt-4o-mini"), cache_key="42")
+ scenario.configure(default_model="openai/gpt-4.1-mini", cache_key="42")
  ```

  To bust the cache, you can simply pass a different `cache_key`, disable it, or delete the cache files located at `~/.scenario/cache`.

- To go a step further and fully cache the test end-to-end, you can also wrap the LLM calls or any other non-deterministic functions in your application side with the `@scenario_cache` decorator:
+ To go a step further and fully cache the test end-to-end, you can also wrap the LLM calls or any other non-deterministic functions in your application side with the `@scenario.cache` decorator:

  ```python
+ # Inside your actual agent implementation
  class MyAgent:
-     @scenario_cache(ignore=["self"])
+     @scenario.cache()
      def invoke(self, message, context):
          return client.chat.completions.create(
              # ...
@@ -320,6 +317,26 @@ async def test_user_is_very_hungry():

  Those two scenarios should now run in parallel.

+ ## Events System
+
+ Scenario automatically publishes events during execution for monitoring and observability. You can enable event reporting by setting environment variables:
+
+ ```bash
+ # Enable automatic event reporting
+ export LANGWATCH_ENDPOINT="https://api.langwatch.ai"
+ export LANGWATCH_API_KEY="your-api-key"
+ ```
+
+ With these variables set, Scenario will automatically:
+
+ - Publish events when scenarios start, finish, and when messages are added
+ - Handle retries and error handling automatically
+ - Process events asynchronously without blocking your tests
+
+ The events include timing information, conversation history, and success/failure metrics for analysis.
+
+ For advanced customization, see the event classes in the codebase for detailed documentation.
+
  ## License

  MIT License
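As a usage note for the new Events System section, the two environment variables shown with `export` in the diff can also be set from Python before the test session starts. A minimal sketch, assuming a `conftest.py` next to your tests and that the variables are read at run time:

```python
# conftest.py (assumed location) — mirrors the README's `export` commands so
# Scenario's automatic event reporting is enabled for the whole test session.
import os

os.environ.setdefault("LANGWATCH_ENDPOINT", "https://api.langwatch.ai")
os.environ.setdefault("LANGWATCH_API_KEY", "your-api-key")  # replace with a real key
```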