cua-agent 0.1.24__tar.gz → 0.1.26__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of cua-agent has been flagged as potentially problematic.

Files changed (76)
  1. {cua_agent-0.1.24 → cua_agent-0.1.26}/PKG-INFO +37 -23
  2. {cua_agent-0.1.24 → cua_agent-0.1.26}/README.md +36 -22
  3. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/base.py +20 -0
  4. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/callbacks.py +57 -2
  5. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/callbacks/manager.py +20 -10
  6. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/clients/oaicompat.py +11 -3
  7. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/loop.py +24 -4
  8. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/openai/loop.py +13 -4
  9. cua_agent-0.1.26/agent/ui/gradio/app.py +972 -0
  10. {cua_agent-0.1.24 → cua_agent-0.1.26}/pyproject.toml +3 -3
  11. cua_agent-0.1.24/agent/ui/gradio/app.py +0 -872
  12. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/__init__.py +0 -0
  13. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/__init__.py +0 -0
  14. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/agent.py +0 -0
  15. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/experiment.py +0 -0
  16. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/factory.py +0 -0
  17. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/messages.py +0 -0
  18. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/provider_config.py +0 -0
  19. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/telemetry.py +0 -0
  20. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/tools/__init__.py +0 -0
  21. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/tools/base.py +0 -0
  22. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/tools/bash.py +0 -0
  23. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/tools/collection.py +0 -0
  24. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/tools/computer.py +0 -0
  25. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/tools/edit.py +0 -0
  26. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/tools/manager.py +0 -0
  27. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/tools.py +0 -0
  28. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/types.py +0 -0
  29. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/core/visualization.py +0 -0
  30. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/__init__.py +0 -0
  31. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/__init__.py +0 -0
  32. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/api/client.py +0 -0
  33. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/api/logging.py +0 -0
  34. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/api_handler.py +0 -0
  35. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/callbacks/__init__.py +0 -0
  36. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/loop.py +0 -0
  37. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/prompts.py +0 -0
  38. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/response_handler.py +0 -0
  39. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/tools/__init__.py +0 -0
  40. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/tools/base.py +0 -0
  41. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/tools/bash.py +0 -0
  42. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/tools/collection.py +0 -0
  43. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/tools/computer.py +0 -0
  44. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/tools/edit.py +0 -0
  45. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/tools/manager.py +0 -0
  46. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/tools/run.py +0 -0
  47. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/types.py +0 -0
  48. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/anthropic/utils.py +0 -0
  49. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/__init__.py +0 -0
  50. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/api_handler.py +0 -0
  51. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/clients/anthropic.py +0 -0
  52. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/clients/base.py +0 -0
  53. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/clients/ollama.py +0 -0
  54. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/clients/openai.py +0 -0
  55. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/clients/utils.py +0 -0
  56. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/image_utils.py +0 -0
  57. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/parser.py +0 -0
  58. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/prompts.py +0 -0
  59. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/tools/__init__.py +0 -0
  60. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/tools/base.py +0 -0
  61. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/tools/bash.py +0 -0
  62. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/tools/computer.py +0 -0
  63. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/tools/manager.py +0 -0
  64. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/omni/utils.py +0 -0
  65. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/openai/__init__.py +0 -0
  66. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/openai/api_handler.py +0 -0
  67. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/openai/response_handler.py +0 -0
  68. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/openai/tools/__init__.py +0 -0
  69. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/openai/tools/base.py +0 -0
  70. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/openai/tools/computer.py +0 -0
  71. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/openai/tools/manager.py +0 -0
  72. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/openai/types.py +0 -0
  73. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/providers/openai/utils.py +0 -0
  74. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/telemetry.py +0 -0
  75. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/ui/__init__.py +0 -0
  76. {cua_agent-0.1.24 → cua_agent-0.1.26}/agent/ui/gradio/__init__.py +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: cua-agent
- Version: 0.1.24
+ Version: 0.1.26
  Summary: CUA (Computer Use) Agent for AI-driven computer interaction
  Author-Email: TryCua <gh@trycua.com>
  Requires-Python: <3.13,>=3.10
@@ -148,8 +148,10 @@ The agent includes a Gradio-based user interface for easy interaction. To use it
  ```bash
  # Install with Gradio support
  pip install "cua-agent[ui]"
+ ```
+
+ ### Create a simple launcher script

- # Create a simple launcher script
  ```python
  # launch_ui.py
  from agent.ui.gradio.app import create_gradio_ui
@@ -158,10 +160,6 @@ app = create_gradio_ui()
  app.launch(share=False)
  ```

- # Run the launcher
- python launch_ui.py
- ```
-
  ### Setting up API Keys

  For the Gradio UI to show available models, you need to set API keys as environment variables:
@@ -179,28 +177,21 @@ OPENAI_API_KEY=your_key ANTHROPIC_API_KEY=your_key python launch_ui.py

  Without these environment variables, the UI will show "No models available" for the corresponding providers, but you can still use local models with the OMNI loop provider.

+ ### Using Local Models
+
+ You can use local models with the OMNI loop provider by selecting "Custom model..." from the dropdown. The default provider URL is set to `http://localhost:1234/v1` which works with LM Studio.
+
+ If you're using a different local model server:
+ - vLLM: `http://localhost:8000/v1`
+ - LocalAI: `http://localhost:8080/v1`
+ - Ollama with OpenAI compat API: `http://localhost:11434/v1`
+
  The Gradio UI provides:
  - Selection of different agent loops (OpenAI, Anthropic, OMNI)
  - Model selection for each provider
  - Configuration of agent parameters
  - Chat interface for interacting with the agent

- You can also embed the Gradio UI in your own application:
-
- ```python
- # Import directly in your application
- from agent.ui.gradio.app import create_gradio_ui
-
- # Create the UI with advanced features
- demo = create_gradio_ui()
- demo.launch()
-
- # Or for a simpler interface
- from agent.ui.gradio import registry
- demo = registry(name='cua:gpt-4o')
- demo.launch()
- ```
-
  ## Agent Loops

  The `cua-agent` package provides three agent loops variations, based on different CUA models providers and techniques:
@@ -209,7 +200,7 @@ The `cua-agent` package provides three agent loops variations, based on differen
  |:-----------|:-----------------|:------------|:-------------|
  | `AgentLoop.OPENAI` | • `computer_use_preview` | Use OpenAI Operator CUA model | Not Required |
  | `AgentLoop.ANTHROPIC` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219` | Use Anthropic Computer-Use | Not Required |
- | `AgentLoop.OMNI` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219`<br>• `gpt-4.5-preview`<br>• `gpt-4o`<br>• `gpt-4`<br>• `phi4`<br>• `phi4-mini`<br>• `gemma3`<br>• `...`<br>• `Any Ollama-compatible model` | Use OmniParser for element pixel-detection (SoM) and any VLMs for UI Grounding and Reasoning | OmniParser |
+ | `AgentLoop.OMNI` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219`<br>• `gpt-4.5-preview`<br>• `gpt-4o`<br>• `gpt-4`<br>• `phi4`<br>• `phi4-mini`<br>• `gemma3`<br>• `...`<br>• `Any Ollama or OpenAI-compatible model` | Use OmniParser for element pixel-detection (SoM) and any VLMs for UI Grounding and Reasoning | OmniParser |

  ## AgentResponse
  The `AgentResponse` class represents the structured output returned after each agent turn. It contains the agent's response, reasoning, tool usage, and other metadata. The response format aligns with the new [OpenAI Agent SDK specification](https://platform.openai.com/docs/api-reference/responses) for better consistency across different agent loops.
@@ -249,3 +240,26 @@ async for result in agent.run(task):
  print("\nTool Call Output:")
  print(output)
  ```
+
+ ### Gradio UI
+
+ You can also interact with the agent using a Gradio interface.
+
+ ```python
+ # Ensure environment variables (e.g., API keys) are loaded
+ # You might need a helper function like load_dotenv_files() if using .env
+ # from utils import load_dotenv_files
+ # load_dotenv_files()
+
+ from agent.ui.gradio.app import create_gradio_ui
+
+ app = create_gradio_ui()
+ app.launch(share=False)
+ ```
+
+ **Note on Settings Persistence:**
+
+ * The Gradio UI automatically saves your configuration (Agent Loop, Model Choice, Custom Base URL, Save Trajectory state, Recent Images count) to a file named `.gradio_settings.json` in the project's root directory when you successfully run a task.
+ * This allows your preferences to persist between sessions.
+ * API keys entered into the custom provider field are **not** saved in this file for security reasons. Manage API keys using environment variables (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`) or a `.env` file.
+ * It's recommended to add `.gradio_settings.json` to your `.gitignore` file.
@@ -80,8 +80,10 @@ The agent includes a Gradio-based user interface for easy interaction. To use it
  ```bash
  # Install with Gradio support
  pip install "cua-agent[ui]"
+ ```
+
+ ### Create a simple launcher script

- # Create a simple launcher script
  ```python
  # launch_ui.py
  from agent.ui.gradio.app import create_gradio_ui
@@ -90,10 +92,6 @@ app = create_gradio_ui()
  app.launch(share=False)
  ```

- # Run the launcher
- python launch_ui.py
- ```
-
  ### Setting up API Keys

  For the Gradio UI to show available models, you need to set API keys as environment variables:
@@ -111,28 +109,21 @@ OPENAI_API_KEY=your_key ANTHROPIC_API_KEY=your_key python launch_ui.py

  Without these environment variables, the UI will show "No models available" for the corresponding providers, but you can still use local models with the OMNI loop provider.

+ ### Using Local Models
+
+ You can use local models with the OMNI loop provider by selecting "Custom model..." from the dropdown. The default provider URL is set to `http://localhost:1234/v1` which works with LM Studio.
+
+ If you're using a different local model server:
+ - vLLM: `http://localhost:8000/v1`
+ - LocalAI: `http://localhost:8080/v1`
+ - Ollama with OpenAI compat API: `http://localhost:11434/v1`
+
  The Gradio UI provides:
  - Selection of different agent loops (OpenAI, Anthropic, OMNI)
  - Model selection for each provider
  - Configuration of agent parameters
  - Chat interface for interacting with the agent

- You can also embed the Gradio UI in your own application:
-
- ```python
- # Import directly in your application
- from agent.ui.gradio.app import create_gradio_ui
-
- # Create the UI with advanced features
- demo = create_gradio_ui()
- demo.launch()
-
- # Or for a simpler interface
- from agent.ui.gradio import registry
- demo = registry(name='cua:gpt-4o')
- demo.launch()
- ```
-
  ## Agent Loops

  The `cua-agent` package provides three agent loops variations, based on different CUA models providers and techniques:
@@ -141,7 +132,7 @@ The `cua-agent` package provides three agent loops variations, based on differen
  |:-----------|:-----------------|:------------|:-------------|
  | `AgentLoop.OPENAI` | • `computer_use_preview` | Use OpenAI Operator CUA model | Not Required |
  | `AgentLoop.ANTHROPIC` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219` | Use Anthropic Computer-Use | Not Required |
- | `AgentLoop.OMNI` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219`<br>• `gpt-4.5-preview`<br>• `gpt-4o`<br>• `gpt-4`<br>• `phi4`<br>• `phi4-mini`<br>• `gemma3`<br>• `...`<br>• `Any Ollama-compatible model` | Use OmniParser for element pixel-detection (SoM) and any VLMs for UI Grounding and Reasoning | OmniParser |
+ | `AgentLoop.OMNI` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219`<br>• `gpt-4.5-preview`<br>• `gpt-4o`<br>• `gpt-4`<br>• `phi4`<br>• `phi4-mini`<br>• `gemma3`<br>• `...`<br>• `Any Ollama or OpenAI-compatible model` | Use OmniParser for element pixel-detection (SoM) and any VLMs for UI Grounding and Reasoning | OmniParser |

  ## AgentResponse
  The `AgentResponse` class represents the structured output returned after each agent turn. It contains the agent's response, reasoning, tool usage, and other metadata. The response format aligns with the new [OpenAI Agent SDK specification](https://platform.openai.com/docs/api-reference/responses) for better consistency across different agent loops.
@@ -181,3 +172,26 @@ async for result in agent.run(task):
  print("\nTool Call Output:")
  print(output)
  ```
+
+ ### Gradio UI
+
+ You can also interact with the agent using a Gradio interface.
+
+ ```python
+ # Ensure environment variables (e.g., API keys) are loaded
+ # You might need a helper function like load_dotenv_files() if using .env
+ # from utils import load_dotenv_files
+ # load_dotenv_files()
+
+ from agent.ui.gradio.app import create_gradio_ui
+
+ app = create_gradio_ui()
+ app.launch(share=False)
+ ```
+
+ **Note on Settings Persistence:**
+
+ * The Gradio UI automatically saves your configuration (Agent Loop, Model Choice, Custom Base URL, Save Trajectory state, Recent Images count) to a file named `.gradio_settings.json` in the project's root directory when you successfully run a task.
+ * This allows your preferences to persist between sessions.
+ * API keys entered into the custom provider field are **not** saved in this file for security reasons. Manage API keys using environment variables (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`) or a `.env` file.
+ * It's recommended to add `.gradio_settings.json` to your `.gitignore` file.
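Read together, the README sections above (the existing `agent.run` example, the OMNI loop table, and the new local OpenAI-compatible endpoints) suggest a setup along the following lines. This is a hedged sketch, not code shipped in the package: the `ComputerAgent`, `LLM`, and `Computer` constructor arguments are assumptions inferred from this README and may differ in 0.1.26.

```python
# Hedged sketch only: wiring the OMNI loop to a local OpenAI-compatible
# server (LM Studio's default URL). The ComputerAgent/LLM/Computer keyword
# arguments below are assumptions, not verified against the 0.1.26 API.
import asyncio

from computer import Computer
from agent import AgentLoop, ComputerAgent, LLM, LLMProvider


async def main() -> None:
    computer = Computer()  # assumed default constructor
    agent = ComputerAgent(
        computer=computer,
        loop=AgentLoop.OMNI,
        model=LLM(
            provider=LLMProvider.OAICOMPAT,
            name="gemma3",  # any model exposed by your local server
            provider_base_url="http://localhost:1234/v1",  # LM Studio default
        ),
    )
    async for result in agent.run("Open a browser and search for trycua"):
        print(result)


asyncio.run(main())
```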
@@ -5,10 +5,12 @@ import asyncio
  from abc import ABC, abstractmethod
  from typing import Any, AsyncGenerator, Dict, List, Optional

+ from agent.providers.omni.parser import ParseResult
  from computer import Computer
  from .messages import StandardMessageManager, ImageRetentionConfig
  from .types import AgentResponse
  from .experiment import ExperimentManager
+ from .callbacks import CallbackManager, CallbackHandler

  logger = logging.getLogger(__name__)
@@ -27,6 +29,7 @@ class BaseLoop(ABC):
  base_dir: Optional[str] = "trajectories",
  save_trajectory: bool = True,
  only_n_most_recent_images: Optional[int] = 2,
+ callback_handlers: Optional[List[CallbackHandler]] = None,
  **kwargs,
  ):
  """Initialize base agent loop.
@@ -75,6 +78,9 @@ class BaseLoop(ABC):

  # Initialize basic tracking
  self.turn_count = 0
+
+ # Initialize callback manager
+ self.callback_manager = CallbackManager(handlers=callback_handlers or [])

  async def initialize(self) -> None:
  """Initialize both the API client and computer interface with retries."""
@@ -187,3 +193,17 @@ class BaseLoop(ABC):
  """
  if self.experiment_manager:
  self.experiment_manager.save_screenshot(img_base64, action_type)
+
+ ###########################################
+ # EVENT HOOKS / CALLBACKS
+ ###########################################
+
+ async def handle_screenshot(self, screenshot_base64: str, action_type: str = "", parsed_screen: Optional[ParseResult] = None) -> None:
+ """Process a screenshot through callback managers
+
+ Args:
+ screenshot_base64: Base64 encoded screenshot
+ action_type: Type of action that triggered the screenshot
+ """
+ if hasattr(self, 'callback_manager'):
+ await self.callback_manager.on_screenshot(screenshot_base64, action_type, parsed_screen)
@@ -6,6 +6,8 @@ from abc import ABC, abstractmethod
  from datetime import datetime
  from typing import Any, Dict, List, Optional, Protocol

+ from agent.providers.omni.parser import ParseResult
+
  logger = logging.getLogger(__name__)

  class ContentCallback(Protocol):
@@ -20,6 +22,10 @@ class APICallback(Protocol):
  """Protocol for API callbacks."""
  def __call__(self, request: Any, response: Any, error: Optional[Exception] = None) -> None: ...

+ class ScreenshotCallback(Protocol):
+ """Protocol for screenshot callbacks."""
+ def __call__(self, screenshot_base64: str, action_type: str = "") -> Optional[str]: ...
+
  class BaseCallbackManager(ABC):
  """Base class for callback managers."""
@@ -110,7 +116,20 @@ class CallbackManager:
  """
  for handler in self.handlers:
  await handler.on_error(error, **kwargs)
-
+
+ async def on_screenshot(self, screenshot_base64: str, action_type: str = "", parsed_screen: Optional[ParseResult] = None) -> None:
+ """Called when a screenshot is taken.
+
+ Args:
+ screenshot_base64: Base64 encoded screenshot
+ action_type: Type of action that triggered the screenshot
+ parsed_screen: Optional output from parsing the screenshot
+
+ Returns:
+ Modified screenshot or original if no modifications
+ """
+ for handler in self.handlers:
+ await handler.on_screenshot(screenshot_base64, action_type, parsed_screen)

  class CallbackHandler(ABC):
  """Base class for callback handlers."""
@@ -144,4 +163,40 @@ class CallbackHandler(ABC):
  error: Exception that occurred
  **kwargs: Additional data
  """
- pass
+ pass
+
+ @abstractmethod
+ async def on_screenshot(self, screenshot_base64: str, action_type: str = "", parsed_screen: Optional[ParseResult] = None) -> None:
+ """Called when a screenshot is taken.
+
+ Args:
+ screenshot_base64: Base64 encoded screenshot
+ action_type: Type of action that triggered the screenshot
+
+ Returns:
+ Optional modified screenshot
+ """
+ pass
+
+ class DefaultCallbackHandler(CallbackHandler):
+ """Default implementation of CallbackHandler with no-op methods.
+
+ This class implements all abstract methods from CallbackHandler,
+ allowing subclasses to override only the methods they need.
+ """
+
+ async def on_action_start(self, action: str, **kwargs) -> None:
+ """Default no-op implementation."""
+ pass
+
+ async def on_action_end(self, action: str, success: bool, **kwargs) -> None:
+ """Default no-op implementation."""
+ pass
+
+ async def on_error(self, error: Exception, **kwargs) -> None:
+ """Default no-op implementation."""
+ pass
+
+ async def on_screenshot(self, screenshot_base64: str, action_type: str = "") -> None:
+ """Default no-op implementation."""
+ pass
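The hunks above (the new `callback_handlers` argument on `BaseLoop.__init__`, the `CallbackManager.on_screenshot` dispatch, and `DefaultCallbackHandler`) suggest a usage pattern along these lines. This is a hedged sketch rather than code from the package; `SomeLoop` stands in for any concrete `BaseLoop` subclass and is an assumption.

```python
# Hedged sketch of hooking into the new screenshot callback.
# DefaultCallbackHandler, the callback_handlers parameter, and the
# on_screenshot(screenshot_base64, action_type, parsed_screen) dispatch all
# appear in the hunks above; "SomeLoop" is a placeholder, not a real class.
import base64
from pathlib import Path

from agent.core.callbacks import DefaultCallbackHandler


class SaveScreenshotHandler(DefaultCallbackHandler):
    """Persist every screenshot the agent reports to a local folder."""

    def __init__(self, out_dir: str = "screenshots") -> None:
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(exist_ok=True)
        self.count = 0

    # CallbackManager.on_screenshot forwards parsed_screen as a third
    # argument, so the override accepts it even though it goes unused here.
    async def on_screenshot(self, screenshot_base64: str, action_type: str = "", parsed_screen=None) -> None:
        self.count += 1
        path = self.out_dir / f"{self.count:04d}_{action_type or 'screenshot'}.png"
        path.write_bytes(base64.b64decode(screenshot_base64))


# A concrete loop would then receive the handler through the new argument:
# loop = SomeLoop(..., callback_handlers=[SaveScreenshotHandler()])
```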
@@ -3,23 +3,33 @@ import httpx
  from anthropic.types.beta import BetaContentBlockParam
  from ..tools import ToolResult

+
  class APICallback(Protocol):
  """Protocol for API callbacks."""
- def __call__(self, request: httpx.Request | None,
- response: httpx.Response | object | None,
- error: Exception | None) -> None: ...
+
+ def __call__(
+ self,
+ request: httpx.Request | None,
+ response: httpx.Response | object | None,
+ error: Exception | None,
+ ) -> None: ...
+

  class ContentCallback(Protocol):
  """Protocol for content callbacks."""
+
  def __call__(self, content: BetaContentBlockParam) -> None: ...

+
  class ToolCallback(Protocol):
  """Protocol for tool callbacks."""
+
  def __call__(self, result: ToolResult, tool_id: str) -> None: ...

+
  class CallbackManager:
  """Manages various callbacks for the agent system."""
-
+
  def __init__(
  self,
  content_callback: ContentCallback,
@@ -27,7 +37,7 @@ class CallbackManager:
  api_callback: APICallback,
  ):
  """Initialize the callback manager.
-
+
  Args:
  content_callback: Callback for content updates
  tool_callback: Callback for tool execution results
@@ -36,20 +46,20 @@ class CallbackManager:
  self.content_callback = content_callback
  self.tool_callback = tool_callback
  self.api_callback = api_callback
-
+
  def on_content(self, content: BetaContentBlockParam) -> None:
  """Handle content updates."""
  self.content_callback(content)
-
+
  def on_tool_result(self, result: ToolResult, tool_id: str) -> None:
  """Handle tool execution results."""
  self.tool_callback(result, tool_id)
-
+
  def on_api_interaction(
  self,
  request: httpx.Request | None,
  response: httpx.Response | object | None,
- error: Exception | None
+ error: Exception | None,
  ) -> None:
  """Handle API interactions."""
- self.api_callback(request, response, error)
+ self.api_callback(request, response, error)
@@ -45,8 +45,8 @@ class OAICompatClient(BaseOmniClient):
  max_tokens: Maximum tokens to generate
  temperature: Generation temperature
  """
- super().__init__(api_key="EMPTY", model=model)
- self.api_key = "EMPTY" # Local endpoints typically don't require an API key
+ super().__init__(api_key=api_key or "EMPTY", model=model)
+ self.api_key = api_key or "EMPTY" # Local endpoints typically don't require an API key
  self.model = model
  self.provider_base_url = (
  provider_base_url or "http://localhost:8000/v1"
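With the `api_key` passthrough above, the OpenAI-compatible client can authenticate against hosted endpoints instead of always sending `"EMPTY"`. A hedged construction sketch follows, using only the keyword arguments visible in this diff (`api_key`, `model`, `provider_base_url`); any other `OAICompatClient` parameters are not shown here, and the key and URL values are placeholders.

```python
# Hedged sketch: only the keyword arguments visible in the hunk above are
# used; everything else about the constructor is an assumption.
from agent.providers.omni.clients.oaicompat import OAICompatClient

# Hosted OpenAI-compatible endpoint: the key is now forwarded instead of "EMPTY".
hosted = OAICompatClient(
    api_key="sk-...",  # placeholder key
    model="gemma3",
    provider_base_url="https://my-endpoint.example.com/v1",  # placeholder URL
)

# Local endpoint: api_key=None falls back to "EMPTY", as before.
local = OAICompatClient(
    api_key=None,
    model="gemma3",
    provider_base_url="http://localhost:1234/v1",  # e.g. LM Studio
)
```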
@@ -146,10 +146,18 @@ class OAICompatClient(BaseOmniClient):
  base_url = self.provider_base_url or "http://localhost:8000/v1"

  # Check if the base URL already includes the chat/completions endpoint
+
  endpoint_url = base_url
  if not endpoint_url.endswith("/chat/completions"):
+ # If URL is RunPod format, make it OpenAI compatible
+ if endpoint_url.startswith("https://api.runpod.ai/v2/"):
+ # Extract RunPod endpoint ID
+ parts = endpoint_url.split("/")
+ if len(parts) >= 5:
+ runpod_id = parts[4]
+ endpoint_url = f"https://api.runpod.ai/v2/{runpod_id}/openai/v1/chat/completions"
  # If the URL ends with /v1, append /chat/completions
- if endpoint_url.endswith("/v1"):
+ elif endpoint_url.endswith("/v1"):
  endpoint_url = f"{endpoint_url}/chat/completions"
  # If the URL doesn't end with /v1, make sure it has a proper structure
  elif not endpoint_url.endswith("/"):
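To make the new RunPod branch easier to follow, here is the same URL normalization re-expressed as a standalone function. Only the branches visible in the hunk are reproduced; the handling of URLs that end in neither `/chat/completions` nor `/v1` is truncated in the diff and therefore omitted, and the function name is mine, not the package's.

```python
# Standalone re-expression of the endpoint-URL normalization shown above.
def normalize_endpoint_url(base_url: str) -> str:
    endpoint_url = base_url
    if endpoint_url.endswith("/chat/completions"):
        return endpoint_url
    # RunPod-style URL: https://api.runpod.ai/v2/<endpoint-id>
    if endpoint_url.startswith("https://api.runpod.ai/v2/"):
        parts = endpoint_url.split("/")
        if len(parts) >= 5:
            runpod_id = parts[4]
            return f"https://api.runpod.ai/v2/{runpod_id}/openai/v1/chat/completions"
    # Plain OpenAI-compatible base URL ending in /v1
    elif endpoint_url.endswith("/v1"):
        return f"{endpoint_url}/chat/completions"
    return endpoint_url


# "https://api.runpod.ai/v2/abc123" -> ".../v2/abc123/openai/v1/chat/completions"
# "http://localhost:1234/v1"        -> "http://localhost:1234/v1/chat/completions"
```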
@@ -147,7 +147,7 @@ class OmniLoop(BaseLoop):
  )
  elif self.provider == LLMProvider.OAICOMPAT:
  self.client = OAICompatClient(
- api_key="EMPTY", # Local endpoints typically don't require an API key
+ api_key=self.api_key or "EMPTY", # Local endpoints typically don't require an API key
  model=self.model,
  provider_base_url=self.provider_base_url,
  )
@@ -183,7 +183,7 @@ class OmniLoop(BaseLoop):
  )
  elif self.provider == LLMProvider.OAICOMPAT:
  self.client = OAICompatClient(
- api_key="EMPTY", # Local endpoints typically don't require an API key
+ api_key=self.api_key or "EMPTY", # Local endpoints typically don't require an API key
  model=self.model,
  provider_base_url=self.provider_base_url,
  )
@@ -443,6 +443,8 @@ class OmniLoop(BaseLoop):
  except (json.JSONDecodeError, IndexError):
  try:
  # Look for JSON object pattern
+ import re # Local import to ensure availability
+
  json_pattern = r"\{[^}]+\}"
  json_match = re.search(json_pattern, raw_text)
  if json_match:
@@ -453,8 +455,20 @@ class OmniLoop(BaseLoop):
  logger.error(f"No JSON found in content")
  return True, action_screenshot_saved
  except json.JSONDecodeError as e:
- logger.error(f"Failed to parse JSON from text: {str(e)}")
- return True, action_screenshot_saved
+ # Try to sanitize the JSON string and retry
+ try:
+ # Remove or replace invalid control characters
+ import re # Local import to ensure availability
+
+ sanitized_text = re.sub(r"[\x00-\x1F\x7F]", "", raw_text)
+ # Try parsing again with sanitized text
+ parsed_content = json.loads(sanitized_text)
+ logger.info(
+ "Successfully parsed JSON after sanitizing control characters"
+ )
+ except json.JSONDecodeError:
+ logger.error(f"Failed to parse JSON from text: {str(e)}")
+ return True, action_screenshot_saved

  # Step 4: Process the parsed content if available
  if parsed_content:
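The sanitize-and-retry fallback above boils down to stripping ASCII control characters before a second `json.loads` attempt. A self-contained illustration of that idea follows; the helper name is hypothetical, while the control-character regex matches the one in the diff.

```python
# Illustration of the sanitize-and-retry idea used in the hunk above.
import json
import re
from typing import Any, Optional


def parse_json_lenient(raw_text: str) -> Optional[Any]:
    """json.loads with a retry that strips ASCII control characters first."""
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        # Drop control characters (0x00-0x1F and DEL) that some models emit
        # inside string literals, which makes the payload invalid JSON.
        sanitized = re.sub(r"[\x00-\x1F\x7F]", "", raw_text)
        try:
            return json.loads(sanitized)
        except json.JSONDecodeError:
            return None


# A raw newline inside a JSON string is rejected by strict json.loads but
# parses once the control character has been stripped out.
print(parse_json_lenient('{"action": "type", "text": "hello\nworld"}'))
```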
@@ -534,6 +548,10 @@ class OmniLoop(BaseLoop):
  img_data = parsed_screen.annotated_image_base64
  if "," in img_data:
  img_data = img_data.split(",")[1]
+
+ # Process screenshot through hooks and save if needed
+ await self.handle_screenshot(img_data, action_type="state", parsed_screen=parsed_screen)
+
  # Save with a generic "state" action type to indicate this is the current screen state
  self._save_screenshot(img_data, action_type="state")
  except Exception as e:
@@ -649,6 +667,8 @@ class OmniLoop(BaseLoop):
  response=response,
  messages=self.message_manager.messages,
  model=self.model,
+ parsed_screen=parsed_screen,
+ parser=self.parser
  )

  # Yield the response to the caller
@@ -194,8 +194,13 @@ class OpenAILoop(BaseLoop):
  # Convert to base64 if needed
  if isinstance(screenshot, bytes):
  screenshot_base64 = base64.b64encode(screenshot).decode("utf-8")
+ elif isinstance(screenshot, (bytearray, memoryview)):
+ screenshot_base64 = base64.b64encode(screenshot).decode("utf-8")
  else:
- screenshot_base64 = screenshot
+ screenshot_base64 = str(screenshot)
+
+ # Emit screenshot callbacks
+ await self.handle_screenshot(screenshot_base64, action_type="initial_state")

  # Save screenshot if requested
  if self.save_trajectory:
@@ -204,8 +209,6 @@ class OpenAILoop(BaseLoop):
  logger.warning(
  "Converting non-string screenshot_base64 to string for _save_screenshot"
  )
- if isinstance(screenshot_base64, (bytearray, memoryview)):
- screenshot_base64 = base64.b64encode(screenshot_base64).decode("utf-8")
  self._save_screenshot(screenshot_base64, action_type="state")
  logger.info("Screenshot saved to trajectory")
@@ -336,8 +339,14 @@ class OpenAILoop(BaseLoop):
  screenshot = await self.computer.interface.screenshot()
  if isinstance(screenshot, bytes):
  screenshot_base64 = base64.b64encode(screenshot).decode("utf-8")
+ elif isinstance(screenshot, (bytearray, memoryview)):
+ screenshot_base64 = base64.b64encode(bytes(screenshot)).decode("utf-8")
  else:
- screenshot_base64 = screenshot
+ screenshot_base64 = str(screenshot)
+
+ # Process screenshot through hooks
+ action_type = f"after_{action.get('type', 'action')}"
+ await self.handle_screenshot(screenshot_base64, action_type=action_type)

  # Create computer_call_output
  computer_call_output = {
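The two screenshot hunks above repeat the same normalization: whatever the interface returns is coerced into a base64 string before it reaches the new hooks and trajectory saving. A small standalone sketch of that coercion follows; the helper name is mine, while the type handling mirrors the diff.

```python
# Standalone sketch of the screenshot coercion repeated in the hunks above.
import base64


def to_base64_str(screenshot) -> str:
    """Coerce raw screenshot data (bytes-like or already-encoded str) to base64 text."""
    if isinstance(screenshot, bytes):
        return base64.b64encode(screenshot).decode("utf-8")
    if isinstance(screenshot, (bytearray, memoryview)):
        # bytes() copies into an immutable buffer before encoding
        return base64.b64encode(bytes(screenshot)).decode("utf-8")
    # Anything else (typically an already base64-encoded str) is stringified
    return str(screenshot)


assert to_base64_str(b"\x89PNG") == base64.b64encode(b"\x89PNG").decode("utf-8")
assert to_base64_str(bytearray(b"\x89PNG")) == to_base64_str(b"\x89PNG")
```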