PyPI: eval-protocol 0.2.1 (tar.gz) → 0.2.3 (tar.gz)

Files changed (301)

{eval_protocol-0.2.1/eval_protocol.egg-info → eval_protocol-0.2.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-protocol
-Version: 0.2.1
+Version: 0.2.3
 Summary: The official Python SDK for Eval Protocol (EP.) EP is an open protocol that standardizes how developers author evals for large language model (LLM) applications.
 Author-email: Fireworks AI <info@fireworks.ai>
 License-Expression: Apache-2.0
@@ -96,8 +96,53 @@ Dynamic: license-file
 [![PyPI - Version](https://img.shields.io/pypi/v/eval-protocol)](https://pypi.org/project/eval-protocol/)
-EP is an open protocol that standardizes how developers author evals for large
-language model (LLM) applications.
+EP is an open specification, Python SDK, and pytest wrapper that provides a
+standardized way to write evaluations for large language model (LLM)
+applications. Start with simple single-turn evals for model selection and prompt
+engineering, then scale up to complex multi-turn reinforcement learning (RL) for
+agents using Model Context Protocol (MCP). EP ensures consistent patterns for
+writing evals, storing traces, and saving results—enabling you to build
+sophisticated agent evaluations that work across real-world scenarios, from
+markdown generation tasks to customer service agents with tool calling
+capabilities.
+## Quick Example
+Here's a simple test function that checks if a model's response contains **bold** text formatting:
+```python test_bold_format.py
+from eval_protocol.models import EvaluateResult, EvaluationRow
+from eval_protocol.pytest import default_single_turn_rollout_processor, evaluation_test
+@evaluation_test(
+    input_messages=[
+        [
+            Message(role="system", content="You are a helpful assistant. Use bold text to highlight important information."),
+            Message(role="user", content="Explain why **evaluations** matter for building AI agents. Make it dramatic!"),
+        ],
+    ],
+    model=["accounts/fireworks/models/llama-v3p1-8b-instruct"],
+    rollout_processor=default_single_turn_rollout_processor,
+    mode="pointwise",
+)
+def test_bold_format(row: EvaluationRow) -> EvaluationRow:
+    """
+    Simple evaluation that checks if the model's response contains bold text.
+    """
+    assistant_response = row.messages[-1].content
+    # Check if response contains **bold** text
+    has_bold = "**" in assistant_response
+    if has_bold:
+        result = EvaluateResult(score=1.0, reason="✅ Response contains bold text")
+    else:
+        result = EvaluateResult(score=0.0, reason="❌ No bold text found")
+    row.evaluation_result = result
+    return row
+```
 ## Documentation

eval_protocol-0.2.3/README.md ADDED

@@ -0,0 +1,69 @@
+# Eval Protocol (EP)
+[![PyPI - Version](https://img.shields.io/pypi/v/eval-protocol)](https://pypi.org/project/eval-protocol/)
+EP is an open specification, Python SDK, and pytest wrapper that provides a
+standardized way to write evaluations for large language model (LLM)
+applications. Start with simple single-turn evals for model selection and prompt
+engineering, then scale up to complex multi-turn reinforcement learning (RL) for
+agents using Model Context Protocol (MCP). EP ensures consistent patterns for
+writing evals, storing traces, and saving results—enabling you to build
+sophisticated agent evaluations that work across real-world scenarios, from
+markdown generation tasks to customer service agents with tool calling
+capabilities.
+## Quick Example
+Here's a simple test function that checks if a model's response contains **bold** text formatting:
+```python test_bold_format.py
+from eval_protocol.models import EvaluateResult, EvaluationRow
+from eval_protocol.pytest import default_single_turn_rollout_processor, evaluation_test
+@evaluation_test(
+    input_messages=[
+        [
+            Message(role="system", content="You are a helpful assistant. Use bold text to highlight important information."),
+            Message(role="user", content="Explain why **evaluations** matter for building AI agents. Make it dramatic!"),
+        ],
+    ],
+    model=["accounts/fireworks/models/llama-v3p1-8b-instruct"],
+    rollout_processor=default_single_turn_rollout_processor,
+    mode="pointwise",
+)
+def test_bold_format(row: EvaluationRow) -> EvaluationRow:
+    """
+    Simple evaluation that checks if the model's response contains bold text.
+    """
+    assistant_response = row.messages[-1].content
+    # Check if response contains **bold** text
+    has_bold = "**" in assistant_response
+    if has_bold:
+        result = EvaluateResult(score=1.0, reason="✅ Response contains bold text")
+    else:
+        result = EvaluateResult(score=0.0, reason="❌ No bold text found")
+    row.evaluation_result = result
+    return row
+```
+## Documentation
+See our [documentation](https://evalprotocol.io) for more details.
+## Installation
+**This library requires Python >= 3.10.**
+Install with pip:
+```
+pip install eval-protocol
+```
+## License
+[MIT](LICENSE)
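
Review note on the Quick Example added above: the snippet calls `Message(...)` without importing it. It presumably comes from `eval_protocol.models`, but that is an assumption the diff does not confirm. The check itself, `"**" in assistant_response`, also passes for an unpaired `**` such as the one in `2 ** 3`. The sketch below shows a slightly stricter variant of the same scoring logic. It is not the package's implementation: `Message` and `EvaluateResult` here are hypothetical stand-in dataclasses so the example runs without the SDK, and `score_bold` and `BOLD_PATTERN` are names invented for illustration.

```python
import re
from dataclasses import dataclass


# Hypothetical stand-ins for illustration only; the real classes live in eval_protocol.
@dataclass
class Message:
    role: str
    content: str


@dataclass
class EvaluateResult:
    score: float
    reason: str


# Matches **text** where the span starts and ends with a non-space character,
# so a bare "**" used as an operator (e.g. "2 ** 3") does not count as bold.
BOLD_PATTERN = re.compile(r"\*\*\S(?:.*?\S)?\*\*", re.DOTALL)


def score_bold(messages: list[Message]) -> EvaluateResult:
    """Return score 1.0 if the last message contains a closed **bold** span."""
    response = messages[-1].content or ""
    if BOLD_PATTERN.search(response):
        return EvaluateResult(score=1.0, reason="Response contains bold text")
    return EvaluateResult(score=0.0, reason="No bold text found")
```

Whether the stricter pattern is worth it depends on the eval: the substring check in the README is simpler and is likely good enough for a first example.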

{eval_protocol-0.2.1 → eval_protocol-0.2.3}/eval_protocol/_version.py RENAMED

@@ -8,11 +8,11 @@ import json
 version_json = '''
 {
- "date": "2025-08-04T14:28:02-0700",
+ "date": "2025-08-04T20:35:33-0700",
  "dirty": false,
  "error": null,
- "full-revisionid": "07fda02490d1a09c7ab92595d6397622cb64230d",
- "version": "0.2.1"
+ "full-revisionid": "52b46a7d3f8455d848d8d5138ec4ca4d6343d3d2",
+ "version": "0.2.3"
 }
 '''  # END VERSION_JSON

{eval_protocol-0.2.1 → eval_protocol-0.2.3}/eval_protocol/pytest/default_agent_rollout_processor.py RENAMED

@@ -1,8 +1,10 @@
+import asyncio
 import json
 import os
-from typing import Any, List, Optional
+from typing import Any, List, Optional, Union
 from mcp.types import CallToolResult
+from openai import NOT_GIVEN, NotGiven
 from openai.types.chat import ChatCompletionMessage, ChatCompletionToolParam
 from openai.types.chat.chat_completion_message_param import ChatCompletionMessageParam
@@ -22,27 +24,43 @@ class Agent:
         self.messages: list[Message] = initial_messages
         self._policy = LiteLLMPolicy(model_id=model)
         self.mcp_client = MCPMultiClient(config_path=config_path) if config_path else None
+        self.tools: Union[List[ChatCompletionToolParam], NotGiven] = NOT_GIVEN
     async def setup(self):
         if self.mcp_client:
             await self.mcp_client.connect_to_servers()
+    async def _get_tools(self) -> Optional[List[ChatCompletionToolParam]]:
+        if self.tools is NOT_GIVEN:
+            self.tools = await self.mcp_client.get_available_tools() if self.mcp_client else None
+        return self.tools
     async def call_agent(self) -> str:
         """
         Call the assistant with the user query.
         """
-        tools = await self.mcp_client.get_available_tools() if self.mcp_client else None
+        tools = await self._get_tools() if self.mcp_client else None
         message = await self._call_model(self.messages, tools)
         self.messages.append(message)
         if message["tool_calls"]:
+            # Create tasks for all tool calls to run them in parallel
+            tool_tasks = []
             for tool_call in message["tool_calls"]:
                 tool_call_id = tool_call["id"]
                 tool_name = tool_call["function"]["name"]
                 tool_args = tool_call["function"]["arguments"]
                 tool_args_dict = json.loads(tool_args)
-                tool_result = await self.mcp_client.call_tool(tool_name, tool_args_dict)
-                content = self._get_content_from_tool_result(tool_result)
+                # Create a task for each tool call
+                task = self._execute_tool_call(tool_call_id, tool_name, tool_args_dict)
+                tool_tasks.append(task)
+            # Execute all tool calls in parallel
+            tool_results = await asyncio.gather(*tool_tasks)
+            # Add all tool results to messages (they will be in the same order as tool_calls)
+            for tool_call, (tool_call_id, content) in zip(message["tool_calls"], tool_results):
                 self.messages.append(
                     {
                         "role": "tool",
@@ -50,18 +68,26 @@ class Agent:
                         "tool_call_id": tool_call_id,
                     }
                 )
+            return await self.call_agent()
         return message["content"]
     async def _call_model(
         self, messages: list[Message], tools: Optional[list[ChatCompletionToolParam]]
     ) -> ChatCompletionMessage:
         messages = [message.model_dump() if hasattr(message, "model_dump") else message for message in messages]
-        response = await self._policy._make_llm_call(
-            messages=messages,
-            tools=tools,
-        )
+        tools = [{"function": tool["function"].model_dump(), "type": "function"} for tool in tools] if tools else []
+        response = await self._policy._make_llm_call(messages=messages, tools=tools)
         return response["choices"][0]["message"]
+    async def _execute_tool_call(self, tool_call_id: str, tool_name: str, tool_args_dict: dict) -> tuple[str, str]:
+        """
+        Execute a single tool call and return the tool_call_id and content.
+        This method is designed to be used with asyncio.gather() for parallel execution.
+        """
+        tool_result = await self.mcp_client.call_tool(tool_name, tool_args_dict)
+        content = self._get_content_from_tool_result(tool_result)
+        return tool_call_id, content
     def _get_content_from_tool_result(self, tool_result: CallToolResult) -> str:
         if tool_result.structuredContent:
             return json.dumps(tool_result.structuredContent)
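
Review note on the change above: `call_agent` now collects one coroutine per tool call and awaits them together with `asyncio.gather`, then pairs the results with `message["tool_calls"]` via `zip`. That pairing relies on `gather` returning results in the order the awaitables were passed in, not the order they finish. The sketch below demonstrates that property in isolation. `fake_call_tool` and `run_tool_calls` are hypothetical stand-ins, not functions from the package, and the `delay` argument exists only to force the first call to finish last.

```python
import asyncio
import json


async def fake_call_tool(name: str, args: dict) -> str:
    """Hypothetical stand-in for mcp_client.call_tool; sleeps to simulate I/O."""
    await asyncio.sleep(args.get("delay", 0))
    return json.dumps({"tool": name})


async def execute_tool_call(tool_call_id: str, name: str, args: dict) -> tuple[str, str]:
    """Run one tool call and return its id together with the result content."""
    content = await fake_call_tool(name, args)
    return tool_call_id, content


async def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    coros = [
        execute_tool_call(tc["id"], tc["function"]["name"], json.loads(tc["function"]["arguments"]))
        for tc in tool_calls
    ]
    # gather() schedules the coroutines concurrently and returns results
    # in argument order, regardless of which one completes first.
    results = await asyncio.gather(*coros)
    return [{"role": "tool", "content": content, "tool_call_id": call_id} for call_id, content in results]


tool_calls = [
    {"id": "call_1", "function": {"name": "slow", "arguments": '{"delay": 0.05}'}},
    {"id": "call_2", "function": {"name": "fast", "arguments": '{"delay": 0}'}},
]
messages = asyncio.run(run_tool_calls(tool_calls))
```

Note that with the default `return_exceptions=False`, the first exception raised by any tool call propagates out of `gather`. Passing `return_exceptions=True` would instead return exceptions as results, which may be preferable if one failing tool should not abort the whole turn.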

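Review note on `_get_tools` in the same file: the new `self.tools` attribute starts as `NOT_GIVEN` rather than `None`, so "not fetched yet" and "fetched, and the answer was `None`" stay distinguishable, and the tool list is fetched at most once per agent. The sketch below shows the same caching pattern under stated assumptions: `_UNSET` is a local sentinel used in place of `openai.NOT_GIVEN` so the example has no third-party dependency, and `ToolCache` and `fetch_no_tools` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Optional

_UNSET = object()  # local stand-in for openai.NOT_GIVEN (assumption for this sketch)


class ToolCache:
    """Fetch a tool list lazily and remember the answer, even if it is None."""

    def __init__(self, fetch: Callable[[], Awaitable[Optional[list]]]):
        self._fetch = fetch
        self._tools = _UNSET
        self.fetch_count = 0  # bookkeeping for the demonstration only

    async def get_tools(self) -> Optional[list]:
        # Identity check against the sentinel: None is a valid cached value.
        if self._tools is _UNSET:
            self.fetch_count += 1
            self._tools = await self._fetch()
        return self._tools


async def fetch_no_tools() -> Optional[list]:
    """Hypothetical fetcher that reports no tools."""
    return None


async def demo():
    cache = ToolCache(fetch_no_tools)
    first = await cache.get_tools()
    second = await cache.get_tools()
    return first, second, cache.fetch_count


first, second, fetch_count = asyncio.run(demo())
```

Had the cache used `None` as its "unset" marker, the second call would have fetched again, because the cached `None` would be indistinguishable from an empty cache.
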
{eval_protocol-0.2.1 → eval_protocol-0.2.3}/eval_protocol/pytest/default_mcp_gym_rollout_processor.py RENAMED

@@ -2,6 +2,7 @@ import asyncio
 import os
 import subprocess
 import time
+import socket
 from pathlib import Path
 from typing import List, Optional
@@ -69,11 +70,8 @@ class MCPServerManager:
         self._log_file = log_file
         self._log_file_path = log_file_path
-        # Wait for server to start
-        time.sleep(3)
-        # Check if process is still running
-        if self.process.poll() is not None:
+        # Wait for server to be ready with proper health check
+        if not self._wait_for_server_ready(timeout=15):
             try:
                 with open(self._log_file_path, "r") as f:
                     log_content = f.read()
@@ -82,13 +80,45 @@ class MCPServerManager:
                 print("=" * 50)
                 print(log_content)
                 print("=" * 50)
-                raise RuntimeError(f"Server failed to start. Check log above for details.")
+                raise RuntimeError(f"Server failed to start or become ready. Check log above for details.")
             except Exception as e:
                 stdout, stderr = self.process.communicate()
-                raise RuntimeError(f"Server failed to start. stderr: {stderr}, log error: {e}")
+                raise RuntimeError(f"Server failed to start or become ready. stderr: {stderr}, log error: {e}")
         print(f"✅ Server started successfully on port {self.port}")
+    def _wait_for_server_ready(self, timeout: int = 15) -> bool:
+        """
+        Wait for server to be ready by polling socket connection.
+        """
+        start_time = time.time()
+        health_check_failures = 0
+        while time.time() - start_time < timeout:
+            # Check if process is still running
+            if self.process.poll() is not None:
+                print(f"Server process exited early")
+                return False
+            try:
+                with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+                    s.settimeout(1)
+                    result = s.connect_ex(("localhost", self.port))
+                    if result == 0:
+                        time.sleep(0.5)
+                        return True
+            except Exception as e:
+                health_check_failures += 1
+                # Print first few failures for debugging
+                if health_check_failures <= 3:
+                    print(f"Health check failed: {e}")
+            # Wait before next check
+            time.sleep(0.1)
+        print(f"Server failed to become ready within {timeout} seconds")
+        return False
     def stop(self) -> None:
         """Stop the MCP server."""
         if self.process:
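
Review note on `_wait_for_server_ready`: the fixed `time.sleep(3)` is replaced by polling a TCP connection until it succeeds or a timeout elapses. A minimal standalone sketch of that polling technique follows. `wait_for_port` is a hypothetical helper, not part of the package, and it uses `time.monotonic()` where the source uses `time.time()`, a deliberate swap so that wall-clock adjustments cannot distort the timeout.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 15.0, interval: float = 0.1) -> bool:
    """Poll until a TCP connect to (host, port) succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(1)
            # connect_ex returns 0 on success and an errno value otherwise.
            if s.connect_ex((host, port)) == 0:
                return True
        time.sleep(interval)
    return False


# Usage: listen on an OS-assigned free port, then probe it.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    open_port = server.getsockname()[1]
    ready = wait_for_port("127.0.0.1", open_port, timeout=2.0)

# Once the listener is closed, connections to that port are refused.
not_ready = wait_for_port("127.0.0.1", open_port, timeout=0.3)
```

A successful connect only shows that something is accepting connections on the port, not that the server has finished initializing. That is presumably why the source adds a short `time.sleep(0.5)` after the first successful probe, and a protocol-level health request would be a stronger check if the server exposes one.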

{eval_protocol-0.2.1 → eval_protocol-0.2.3/eval_protocol.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-protocol
-Version: 0.2.1
+Version: 0.2.3
 Summary: The official Python SDK for Eval Protocol (EP.) EP is an open protocol that standardizes how developers author evals for large language model (LLM) applications.
 Author-email: Fireworks AI <info@fireworks.ai>
 License-Expression: Apache-2.0
@@ -96,8 +96,53 @@ Dynamic: license-file
 [![PyPI - Version](https://img.shields.io/pypi/v/eval-protocol)](https://pypi.org/project/eval-protocol/)
-EP is an open protocol that standardizes how developers author evals for large
-language model (LLM) applications.
+EP is an open specification, Python SDK, and pytest wrapper that provides a
+standardized way to write evaluations for large language model (LLM)
+applications. Start with simple single-turn evals for model selection and prompt
+engineering, then scale up to complex multi-turn reinforcement learning (RL) for
+agents using Model Context Protocol (MCP). EP ensures consistent patterns for
+writing evals, storing traces, and saving results—enabling you to build
+sophisticated agent evaluations that work across real-world scenarios, from
+markdown generation tasks to customer service agents with tool calling
+capabilities.
+## Quick Example
+Here's a simple test function that checks if a model's response contains **bold** text formatting:
+```python test_bold_format.py
+from eval_protocol.models import EvaluateResult, EvaluationRow
+from eval_protocol.pytest import default_single_turn_rollout_processor, evaluation_test
+@evaluation_test(
+    input_messages=[
+        [
+            Message(role="system", content="You are a helpful assistant. Use bold text to highlight important information."),
+            Message(role="user", content="Explain why **evaluations** matter for building AI agents. Make it dramatic!"),
+        ],
+    ],
+    model=["accounts/fireworks/models/llama-v3p1-8b-instruct"],
+    rollout_processor=default_single_turn_rollout_processor,
+    mode="pointwise",
+)
+def test_bold_format(row: EvaluationRow) -> EvaluationRow:
+    """
+    Simple evaluation that checks if the model's response contains bold text.
+    """
+    assistant_response = row.messages[-1].content
+    # Check if response contains **bold** text
+    has_bold = "**" in assistant_response
+    if has_bold:
+        result = EvaluateResult(score=1.0, reason="✅ Response contains bold text")
+    else:
+        result = EvaluateResult(score=0.0, reason="❌ No bold text found")
+    row.evaluation_result = result
+    return row
+```
 ## Documentation

{eval_protocol-0.2.1 → eval_protocol-0.2.3}/pyproject.toml RENAMED

@@ -140,6 +140,7 @@ tau2 = { git = "https://github.com/sierra-research/tau2-bench.git" }
 [dependency-groups]
 dev = [
+    "fastmcp>=2.10.6",
     "haikus==0.3.8",
     "pytest>=8.4.1",
 ]

eval_protocol-0.2.1/README.md DELETED

@@ -1,24 +0,0 @@
-# Eval Protocol (EP)
-[![PyPI - Version](https://img.shields.io/pypi/v/eval-protocol)](https://pypi.org/project/eval-protocol/)
-EP is an open protocol that standardizes how developers author evals for large
-language model (LLM) applications.
-## Documentation
-See our [documentation](https://evalprotocol.io) for more details.
-## Installation
-**This library requires Python >= 3.10.**
-Install with pip:
-```
-pip install eval-protocol
-```
-## License
-[MIT](LICENSE)