vision-agent 0.2.91__tar.gz → 0.2.92__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {vision_agent-0.2.91 → vision_agent-0.2.92}/PKG-INFO +42 -12
- {vision_agent-0.2.91 → vision_agent-0.2.92}/README.md +41 -11
- {vision_agent-0.2.91 → vision_agent-0.2.92}/pyproject.toml +1 -1
- vision_agent-0.2.92/vision_agent/agent/__init__.py +3 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/agent/agent.py +1 -1
- vision_agent-0.2.92/vision_agent/agent/agent_utils.py +43 -0
- vision_agent-0.2.92/vision_agent/agent/vision_agent.py +230 -0
- vision_agent-0.2.91/vision_agent/agent/vision_agent.py → vision_agent-0.2.92/vision_agent/agent/vision_agent_coder.py +112 -153
- vision_agent-0.2.91/vision_agent/agent/vision_agent_prompts.py → vision_agent-0.2.92/vision_agent/agent/vision_agent_coder_prompts.py +3 -2
- vision_agent-0.2.92/vision_agent/agent/vision_agent_prompts.py +114 -0
- vision_agent-0.2.92/vision_agent/lmm/__init__.py +2 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/lmm/lmm.py +3 -5
- vision_agent-0.2.92/vision_agent/lmm/types.py +5 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/tools/__init__.py +1 -0
- vision_agent-0.2.92/vision_agent/tools/meta_tools.py +402 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/tools/tool_utils.py +47 -1
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/tools/tools.py +7 -49
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/utils/execute.py +52 -76
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/utils/image_utils.py +1 -1
- vision_agent-0.2.91/vision_agent/agent/__init__.py +0 -2
- vision_agent-0.2.91/vision_agent/lmm/__init__.py +0 -1
- {vision_agent-0.2.91 → vision_agent-0.2.92}/LICENSE +0 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/__init__.py +0 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/fonts/__init__.py +0 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/fonts/default_font_ch_en.ttf +0 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/tools/prompts.py +0 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/utils/__init__.py +0 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/utils/exceptions.py +0 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/utils/sim.py +0 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/utils/type_defs.py +0 -0
- {vision_agent-0.2.91 → vision_agent-0.2.92}/vision_agent/utils/video.py +0 -0
{vision_agent-0.2.91 → vision_agent-0.2.92}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: vision-agent
-Version: 0.2.91
+Version: 0.2.92
 Summary: Toolset for Vision Agent
 Author: Landing AI
 Author-email: dev@landing.ai
@@ -57,7 +57,7 @@ code to solve the task for them. Check out our discord for updates and roadmaps!
 
 ## Web Application
 
-Try Vision Agent live on [va.landing.ai](https://va.landing.ai/)
+Try Vision Agent live on (note this may not be running the most up-to-date version) [va.landing.ai](https://va.landing.ai/)
 
 ## Documentation
 
@@ -79,16 +79,44 @@ using Azure OpenAI please see the Azure setup section):
 export OPENAI_API_KEY="your-api-key"
 ```
 
-### Important Note on API Usage
-Please be aware that using the API in this project requires you to have API credits (minimum of five US dollars). This is different from the OpenAI subscription used in this chatbot. If you don't have credit, further information can be found [here](https://github.com/landing-ai/vision-agent?tab=readme-ov-file#how-to-get-started-with-openai-api-credits)
-
 ### Vision Agent
+There are two agents that you can use. Vision Agent is a conversational agent that has
+access to tools that allow it to write and navigate python code and file systems. It can
+converse with the user in natural language. VisionAgentCoder is an agent that can write
+code for vision tasks, such as counting people in an image. However, it cannot converse
+and can only respond with code. VisionAgent can call VisionAgentCoder to write vision
+code.
+
 #### Basic Usage
-
+To run the streamlit app locally to chat with Vision Agent, you can run the following
+command:
+
+```bash
+pip install -r examples/chat/requirements.txt
+export WORKSPACE=/path/to/your/workspace
+export ZMQ_PORT=5555
+streamlit run examples/chat/app.py
+```
+You can find more details about the streamlit app [here](examples/chat/).
 
+#### Basic Programmatic Usage
 ```python
 >>> from vision_agent.agent import VisionAgent
 >>> agent = VisionAgent()
+>>> resp = agent("Hello")
+>>> print(resp)
+[{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "{'thoughts': 'The user has greeted me. I will respond with a greeting and ask how I can assist them.', 'response': 'Hello! How can I assist you today?', 'let_user_respond': True}"}]
+>>> resp.append({"role": "user", "content": "Can you count the number of people in this image?", "media": ["people.jpg"]})
+>>> resp = agent(resp)
+```
+
+### Vision Agent Coder
+#### Basic Usage
+You can interact with the agent as you would with any LLM or LMM model:
+
+```python
+>>> from vision_agent.agent import VisionAgentCoder
+>>> agent = VisionAgentCoder()
 >>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
 ```
 
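Putting the programmatic usage above together: the `VisionAgent` call returns the whole message history, so a multi-turn exchange just appends to the returned list and calls the agent again. A minimal sketch under that assumption (the API is exactly as shown in this diff; `people.jpg` is a placeholder path):

```python
from vision_agent.agent import VisionAgent

agent = VisionAgent()

# The agent returns the full conversation as a list of role/content messages.
chat = agent("Hello")

# Append a follow-up turn with an attached image and run the agent again.
chat.append(
    {
        "role": "user",
        "content": "Can you count the number of people in this image?",
        "media": ["people.jpg"],  # placeholder path
    }
)
chat = agent(chat)

# The last message in the returned history holds the agent's latest reply.
print(chat[-1]["content"])
```

Returning the full history keeps the agent stateless between calls; all conversational context lives in the list you pass back in.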
@@ -129,7 +157,7 @@ To better understand how the model came up with it's answer, you can run it in d
 mode by passing in the verbose argument:
 
 ```python
->>> agent = VisionAgent(verbose=2)
+>>> agent = VisionAgentCoder(verbose=2)
 ```
 
 #### Detailed Usage
@@ -219,9 +247,11 @@ def custom_tool(image_path: str) -> str:
     return np.zeros((10, 10))
 ```
 
-You need to ensure you call `@va.tools.register_tool` with any imports it uses.
-
-
+You need to ensure you call `@va.tools.register_tool` with any imports it uses. Global
+variables will not be captured by `register_tool` so you need to include them in the
+function. Make sure the documentation is in the same format above with description,
+`Parameters:`, `Returns:`, and `Example\n-------`. You can find an example use case
+[here](examples/custom_tools/) as this is what the agent uses to pick and use the tool.
 
 ### Azure Setup
 If you want to use Azure OpenAI models, you need to have two OpenAI model deployments:
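For reference, a tool that satisfies the requirements in this hunk would look roughly like the sketch below. It reassembles the `custom_tool` context lines visible in the hunk; the `imports=` keyword on `register_tool` is an assumption based on the instruction to pass the tool's imports along, and the body is the README's own placeholder:

```python
import vision_agent as va
import numpy as np

# The docstring must follow the documented layout (description, then
# `Parameters:`, `Returns:`, and an `Example` section) because the agent
# reads it to decide when and how to call the tool.
@va.tools.register_tool(imports=["import numpy as np"])
def custom_tool(image_path: str) -> str:
    """My custom tool documentation.

    Parameters:
        image_path (str): The path to the image.

    Returns:
        str: The result of the tool.

    Example
    -------
    >>> custom_tool("image.jpg")
    """
    return np.zeros((10, 10))
```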
@@ -248,7 +278,7 @@ You can then run Vision Agent using the Azure OpenAI models:
 
 ```python
 import vision_agent as va
-agent = va.agent.AzureVisionAgent()
+agent = va.agent.AzureVisionAgentCoder()
 ```
 
 ******************************************************************************************************************************
@@ -257,7 +287,7 @@ agent = va.agent.AzureVisionAgent()
 
 #### How to get started with OpenAI API credits
 
-1. Visit the[OpenAI API platform](https://beta.openai.com/signup/) to sign up for an API key.
+1. Visit the [OpenAI API platform](https://beta.openai.com/signup/) to sign up for an API key.
 2. Follow the instructions to purchase and manage your API credits.
 3. Ensure your API key is correctly configured in your project settings.
 
{vision_agent-0.2.91 → vision_agent-0.2.92}/README.md

@@ -18,7 +18,7 @@ code to solve the task for them. Check out our discord for updates and roadmaps!
 
 ## Web Application
 
-Try Vision Agent live on [va.landing.ai](https://va.landing.ai/)
+Try Vision Agent live on (note this may not be running the most up-to-date version) [va.landing.ai](https://va.landing.ai/)
 
 ## Documentation
 
@@ -40,16 +40,44 @@ using Azure OpenAI please see the Azure setup section):
 export OPENAI_API_KEY="your-api-key"
 ```
 
-### Important Note on API Usage
-Please be aware that using the API in this project requires you to have API credits (minimum of five US dollars). This is different from the OpenAI subscription used in this chatbot. If you don't have credit, further information can be found [here](https://github.com/landing-ai/vision-agent?tab=readme-ov-file#how-to-get-started-with-openai-api-credits)
-
 ### Vision Agent
+There are two agents that you can use. Vision Agent is a conversational agent that has
+access to tools that allow it to write and navigate python code and file systems. It can
+converse with the user in natural language. VisionAgentCoder is an agent that can write
+code for vision tasks, such as counting people in an image. However, it cannot converse
+and can only respond with code. VisionAgent can call VisionAgentCoder to write vision
+code.
+
 #### Basic Usage
-
+To run the streamlit app locally to chat with Vision Agent, you can run the following
+command:
+
+```bash
+pip install -r examples/chat/requirements.txt
+export WORKSPACE=/path/to/your/workspace
+export ZMQ_PORT=5555
+streamlit run examples/chat/app.py
+```
+You can find more details about the streamlit app [here](examples/chat/).
 
+#### Basic Programmatic Usage
 ```python
 >>> from vision_agent.agent import VisionAgent
 >>> agent = VisionAgent()
+>>> resp = agent("Hello")
+>>> print(resp)
+[{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "{'thoughts': 'The user has greeted me. I will respond with a greeting and ask how I can assist them.', 'response': 'Hello! How can I assist you today?', 'let_user_respond': True}"}]
+>>> resp.append({"role": "user", "content": "Can you count the number of people in this image?", "media": ["people.jpg"]})
+>>> resp = agent(resp)
+```
+
+### Vision Agent Coder
+#### Basic Usage
+You can interact with the agent as you would with any LLM or LMM model:
+
+```python
+>>> from vision_agent.agent import VisionAgentCoder
+>>> agent = VisionAgentCoder()
 >>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
 ```
 
@@ -90,7 +118,7 @@ To better understand how the model came up with it's answer, you can run it in d
 mode by passing in the verbose argument:
 
 ```python
->>> agent = VisionAgent(verbose=2)
+>>> agent = VisionAgentCoder(verbose=2)
 ```
 
 #### Detailed Usage
@@ -180,9 +208,11 @@ def custom_tool(image_path: str) -> str:
     return np.zeros((10, 10))
 ```
 
-You need to ensure you call `@va.tools.register_tool` with any imports it uses.
-
-
+You need to ensure you call `@va.tools.register_tool` with any imports it uses. Global
+variables will not be captured by `register_tool` so you need to include them in the
+function. Make sure the documentation is in the same format above with description,
+`Parameters:`, `Returns:`, and `Example\n-------`. You can find an example use case
+[here](examples/custom_tools/) as this is what the agent uses to pick and use the tool.
 
 ### Azure Setup
 If you want to use Azure OpenAI models, you need to have two OpenAI model deployments:
@@ -209,7 +239,7 @@ You can then run Vision Agent using the Azure OpenAI models:
 
 ```python
 import vision_agent as va
-agent = va.agent.AzureVisionAgent()
+agent = va.agent.AzureVisionAgentCoder()
 ```
 
 ******************************************************************************************************************************
@@ -218,7 +248,7 @@ agent = va.agent.AzureVisionAgent()
 
 #### How to get started with OpenAI API credits
 
-1. Visit the[OpenAI API platform](https://beta.openai.com/signup/) to sign up for an API key.
+1. Visit the [OpenAI API platform](https://beta.openai.com/signup/) to sign up for an API key.
 2. Follow the instructions to purchase and manage your API credits.
 3. Ensure your API key is correctly configured in your project settings.
 
vision_agent-0.2.92/vision_agent/agent/agent_utils.py

@@ -0,0 +1,43 @@
+import json
+import logging
+import sys
+from typing import Any, Dict
+
+logging.basicConfig(stream=sys.stdout)
+_LOGGER = logging.getLogger(__name__)
+
+
+def extract_json(json_str: str) -> Dict[str, Any]:
+    try:
+        json_dict = json.loads(json_str)
+    except json.JSONDecodeError:
+        input_json_str = json_str
+        if "```json" in json_str:
+            json_str = json_str[json_str.find("```json") + len("```json") :]
+            json_str = json_str[: json_str.find("```")]
+        elif "```" in json_str:
+            json_str = json_str[json_str.find("```") + len("```") :]
+            # get the last ``` not one from an intermediate string
+            json_str = json_str[: json_str.find("}```")]
+        try:
+            json_dict = json.loads(json_str)
+        except json.JSONDecodeError as e:
+            error_msg = f"Could not extract JSON from the given str: {json_str}.\nFunction input:\n{input_json_str}"
+            _LOGGER.exception(error_msg)
+            raise ValueError(error_msg) from e
+    return json_dict  # type: ignore
+
+
+def extract_code(code: str) -> str:
+    if "\n```python" in code:
+        start = "\n```python"
+    elif "```python" in code:
+        start = "```python"
+    else:
+        return code
+
+    code = code[code.find(start) + len(start) :]
+    code = code[: code.find("```")]
+    if code.startswith("python\n"):
+        code = code[len("python\n") :]
+    return code
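In short, `extract_json` retries `json.loads` after stripping a fenced code block from the model output, and `extract_code` pulls the body out of a python fence. A quick illustration, assuming the helpers exactly as added above:

```python
from vision_agent.agent.agent_utils import extract_json, extract_code

# Fenced JSON is unwrapped before json.loads is retried.
raw = 'Plan:\n```json\n{"thoughts": "count people", "let_user_respond": false}\n```'
print(extract_json(raw))
# {'thoughts': 'count people', 'let_user_respond': False}

# Fenced Python is reduced to just the code between the fences.
text = "Here is the code.\n```python\nprint('hello')\n```"
print(extract_code(text))
# prints the fenced body: print('hello')
```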
vision_agent-0.2.92/vision_agent/agent/vision_agent.py

@@ -0,0 +1,230 @@
+import copy
+import logging
+import os
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Union, cast
+
+from vision_agent.agent import Agent
+from vision_agent.agent.agent_utils import extract_json
+from vision_agent.agent.vision_agent_prompts import (
+    EXAMPLES_CODE1,
+    EXAMPLES_CODE2,
+    VA_CODE,
+)
+from vision_agent.lmm import LMM, Message, OpenAILMM
+from vision_agent.tools import META_TOOL_DOCSTRING
+from vision_agent.utils import CodeInterpreterFactory
+from vision_agent.utils.execute import CodeInterpreter
+
+logging.basicConfig(level=logging.INFO)
+_LOGGER = logging.getLogger(__name__)
+WORKSPACE = Path(os.getenv("WORKSPACE", ""))
+WORKSPACE.mkdir(parents=True, exist_ok=True)
+if str(WORKSPACE) != "":
+    os.environ["PYTHONPATH"] = f"{WORKSPACE}:{os.getenv('PYTHONPATH', '')}"
+
+
+class DefaultImports:
+    code = [
+        "from typing import *",
+        "from vision_agent.utils.execute import CodeInterpreter",
+        "from vision_agent.tools.meta_tools import generate_vision_code, edit_vision_code, open_file, create_file, scroll_up, scroll_down, edit_file, get_tool_descriptions",
+    ]
+
+    @staticmethod
+    def to_code_string() -> str:
+        return "\n".join(DefaultImports.code)
+
+    @staticmethod
+    def prepend_imports(code: str) -> str:
+        """Run this method to prepend the default imports to the code.
+        NOTE: be sure to run this method after the custom tools have been registered.
+        """
+        return DefaultImports.to_code_string() + "\n\n" + code
+
+
+def run_conversation(orch: LMM, chat: List[Message]) -> Dict[str, Any]:
+    chat = copy.deepcopy(chat)
+
+    conversation = ""
+    for chat_i in chat:
+        if chat_i["role"] == "user":
+            conversation += f"USER: {chat_i['content']}\n\n"
+        elif chat_i["role"] == "observation":
+            conversation += f"OBSERVATION:\n{chat_i['content']}\n\n"
+        elif chat_i["role"] == "assistant":
+            conversation += f"AGENT: {chat_i['content']}\n\n"
+        else:
+            raise ValueError(f"role {chat_i['role']} is not supported")
+
+    prompt = VA_CODE.format(
+        documentation=META_TOOL_DOCSTRING,
+        examples=f"{EXAMPLES_CODE1}\n{EXAMPLES_CODE2}",
+        dir=WORKSPACE,
+        conversation=conversation,
+    )
+    return extract_json(orch([{"role": "user", "content": prompt}]))
+
+
+def run_code_action(code: str, code_interpreter: CodeInterpreter) -> str:
+    # Note the code interpreter needs to keep running in the same environment because
+    # the SWE tools hold state like line numbers and currently open files.
+    result = code_interpreter.exec_cell(DefaultImports.prepend_imports(code))
+
+    return_str = ""
+    if result.success:
+        for res in result.results:
+            if res.text is not None:
+                return_str += res.text.replace("\\n", "\n")
+        if result.logs.stdout:
+            return_str += "----- stdout -----\n"
+            for log in result.logs.stdout:
+                return_str += log.replace("\\n", "\n")
+    else:
+        # for log in result.logs.stderr:
+        # return_str += log.replace("\\n", "\n")
+        if result.error:
+            return_str += (
+                "\n" + result.error.value + "\n".join(result.error.traceback_raw)
+            )
+
+    return return_str
+
+
+def parse_execution(response: str) -> Optional[str]:
+    code = None
+    if "<execute_python>" in response:
+        code = response[response.find("<execute_python>") + len("<execute_python>") :]
+        code = code[: code.find("</execute_python>")]
+    return code
+
+
+class VisionAgent(Agent):
+    """Vision Agent is an agent that can chat with the user and call tools or other
+    agents to generate code for it. Vision Agent uses python code to execute actions
+    for the user. Vision Agent is inspired by OpenDevin
+    https://github.com/OpenDevin/OpenDevin and CodeAct https://arxiv.org/abs/2402.01030
+
+    Example
+    -------
+        >>> from vision_agent.agent import VisionAgent
+        >>> agent = VisionAgent()
+        >>> resp = agent("Hello")
+        >>> resp.append({"role": "user", "content": "Can you write a function that counts dogs?", "media": ["dog.jpg"]})
+        >>> resp = agent(resp)
+    """
+
+    def __init__(
+        self,
+        agent: Optional[LMM] = None,
+        verbosity: int = 0,
+        code_sandbox_runtime: Optional[str] = None,
+    ) -> None:
+        self.agent = (
+            OpenAILMM(temperature=0.0, json_mode=True) if agent is None else agent
+        )
+        self.max_iterations = 100
+        self.verbosity = verbosity
+        self.code_sandbox_runtime = code_sandbox_runtime
+        if self.verbosity >= 1:
+            _LOGGER.setLevel(logging.INFO)
+
+    def __call__(
+        self,
+        input: Union[str, List[Message]],
+        media: Optional[Union[str, Path]] = None,
+    ) -> str:
+        """Chat with VisionAgent and get the conversation response.
+
+        Parameters:
+            input (Union[str, List[Message]]): A conversation in the format of
+                [{"role": "user", "content": "describe your task here..."}, ...] or a
+                string of just the contents.
+            media (Optional[Union[str, Path]]): The media file to be used in the task.
+
+        Returns:
+            str: The conversation response.
+        """
+        if isinstance(input, str):
+            input = [{"role": "user", "content": input}]
+            if media is not None:
+                input[0]["media"] = [media]
+        results = self.chat_with_code(input)
+        return results  # type: ignore
+
+    def chat_with_code(
+        self,
+        chat: List[Message],
+    ) -> List[Message]:
+        """Chat with VisionAgent, it will use code to execute actions to accomplish
+        its tasks.
+
+        Parameters:
+            chat (List[Message]): A conversation
+                in the format of:
+                [{"role": "user", "content": "describe your task here..."}]
+                or if it contains media files, it should be in the format of:
+                [{"role": "user", "content": "describe your task here...", "media": ["image1.jpg", "image2.jpg"]}]
+
+        Returns:
+            List[Message]: The conversation response.
+        """
+
+        if not chat:
+            raise ValueError("chat cannot be empty")
+
+        with CodeInterpreterFactory.new_instance(
+            code_sandbox_runtime=self.code_sandbox_runtime
+        ) as code_interpreter:
+            orig_chat = copy.deepcopy(chat)
+            int_chat = copy.deepcopy(chat)
+            media_list = []
+            for chat_i in int_chat:
+                if "media" in chat_i:
+                    for media in chat_i["media"]:
+                        media = code_interpreter.upload_file(media)
+                        chat_i["content"] += f" Media name {media}"  # type: ignore
+                        media_list.append(media)
+
+            int_chat = cast(
+                List[Message],
+                [
+                    (
+                        {
+                            "role": c["role"],
+                            "content": c["content"],
+                            "media": c["media"],
+                        }
+                        if "media" in c
+                        else {"role": c["role"], "content": c["content"]}
+                    )
+                    for c in int_chat
+                ],
+            )
+
+            finished = False
+            iterations = 0
+            while not finished and iterations < self.max_iterations:
+                response = run_conversation(self.agent, int_chat)
+                if self.verbosity >= 1:
+                    _LOGGER.info(response)
+                int_chat.append({"role": "assistant", "content": str(response)})
+                orig_chat.append({"role": "assistant", "content": str(response)})
+
+                if response["let_user_respond"]:
+                    break
+
+                code_action = parse_execution(response["response"])
+
+                if code_action is not None:
+                    obs = run_code_action(code_action, code_interpreter)
+                    if self.verbosity >= 1:
+                        _LOGGER.info(obs)
+                    int_chat.append({"role": "observation", "content": obs})
+                    orig_chat.append({"role": "observation", "content": obs})
+
+                iterations += 1
+            return orig_chat
+
+    def log_progress(self, data: Dict[str, Any]) -> None:
+        pass
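The loop above relies on two conventions from the prompts: the orchestrator LMM replies with JSON containing `thoughts`, `response`, and `let_user_respond` (visible in the chat example earlier in this diff), and any action code is wrapped in `<execute_python>` tags inside `response`. A small sketch of that contract with hypothetical values (the `generate_vision_code` call and its arguments are illustrative only):

```python
from vision_agent.agent.vision_agent import parse_execution

# Hypothetical orchestrator reply following the schema the loop expects.
response = {
    "thoughts": "I should generate vision code to count the people.",
    "response": (
        "Let me write that code. "
        "<execute_python>generate_vision_code('count.py', 'count people', ['people.jpg'])</execute_python>"
    ),
    "let_user_respond": False,
}

# parse_execution extracts exactly the code between the tags; the loop then
# hands it to run_code_action, which executes it in the shared sandbox and
# appends the output to the chat as an "observation" turn.
code_action = parse_execution(response["response"])
print(code_action)
# generate_vision_code('count.py', 'count people', ['people.jpg'])
```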