vision-agent 0.2.202__tar.gz → 0.2.203__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (46)
  1. {vision_agent-0.2.202 → vision_agent-0.2.203}/PKG-INFO +1 -1
  2. {vision_agent-0.2.202 → vision_agent-0.2.203}/pyproject.toml +1 -1
  3. vision_agent-0.2.203/vision_agent/agent/README.md +89 -0
  4. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/agent.py +8 -3
  5. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/types.py +26 -1
  6. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent_coder_v2.py +41 -4
  7. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent_planner_v2.py +99 -20
  8. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent_v2.py +72 -21
  9. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/tools/planner_tools.py +166 -57
  10. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/tools/tool_utils.py +6 -3
  11. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/utils/execute.py +10 -3
  12. {vision_agent-0.2.202 → vision_agent-0.2.203}/LICENSE +0 -0
  13. {vision_agent-0.2.202 → vision_agent-0.2.203}/README.md +0 -0
  14. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/.sim_tools/df.csv +0 -0
  15. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/.sim_tools/embs.npy +0 -0
  16. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/__init__.py +0 -0
  17. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/__init__.py +0 -0
  18. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/agent_utils.py +0 -0
  19. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent.py +0 -0
  20. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent_coder.py +0 -0
  21. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent_coder_prompts.py +0 -0
  22. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent_coder_prompts_v2.py +0 -0
  23. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent_planner.py +0 -0
  24. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent_planner_prompts.py +0 -0
  25. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent_planner_prompts_v2.py +0 -0
  26. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent_prompts.py +0 -0
  27. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/agent/vision_agent_prompts_v2.py +0 -0
  28. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/clients/__init__.py +0 -0
  29. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/clients/http.py +0 -0
  30. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/clients/landing_public_api.py +0 -0
  31. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/fonts/__init__.py +0 -0
  32. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/fonts/default_font_ch_en.ttf +0 -0
  33. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/lmm/__init__.py +0 -0
  34. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/lmm/lmm.py +0 -0
  35. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/lmm/types.py +0 -0
  36. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/tools/__init__.py +0 -0
  37. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/tools/meta_tools.py +0 -0
  38. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/tools/prompts.py +0 -0
  39. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/tools/tools.py +0 -0
  40. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/tools/tools_types.py +0 -0
  41. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/utils/__init__.py +0 -0
  42. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/utils/exceptions.py +0 -0
  43. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/utils/image_utils.py +0 -0
  44. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/utils/sim.py +0 -0
  45. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/utils/type_defs.py +0 -0
  46. {vision_agent-0.2.202 → vision_agent-0.2.203}/vision_agent/utils/video.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: vision-agent
-Version: 0.2.202
+Version: 0.2.203
 Summary: Toolset for Vision Agent
 Author: Landing AI
 Author-email: dev@landing.ai
@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"
 
 [tool.poetry]
 name = "vision-agent"
-version = "0.2.202"
+version = "0.2.203"
 description = "Toolset for Vision Agent"
 authors = ["Landing AI <dev@landing.ai>"]
 readme = "README.md"
@@ -0,0 +1,89 @@
+## V2 Agents
+
+This document gives an overview of the V2 agents, how they communicate, and how human-in-the-loop works.
+
+- `vision_agent_v2` - This is the conversation agent. It can take exactly one action and return one response. The actions are fixed JSON actions (meaning it does not execute code but instead returns JSON, and we execute the code on its behalf). This is so we can control the action arguments as well as pass around the notebook code interpreter.
+- `vision_agent_planner_v2` - This agent is responsible for planning. It can run for N turns; each turn it can take a code action (it can execute its own Python code and has access to `planner_tools`) to test a new potential step in the plan.
+- `vision_agent_coder_v2` - This agent is responsible for the final code. It can call the planner on its own, or it can take in the final `PlanContext` returned by `vision_agent_planner_v2` and use that to write and test the final code (see the usage sketch below).
+
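For example, a minimal usage sketch of calling the coder agent directly (the API key setup and `people.png` are illustrative assumptions, not part of the package):

```python
# Minimal sketch: calling VisionAgentCoderV2 directly returns the final code
# as a string (or the interaction prompt when human-in-the-loop pauses it).
from vision_agent.agent.vision_agent_coder_v2 import VisionAgentCoderV2

coder = VisionAgentCoderV2(verbose=True)
code = coder("Count the number of people in this image", media="people.png")
print(code)
```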
+### Communication
+The agents communicate through `AgentMessage` objects; the planner and coder return `PlanContext` and `CodeContext` objects respectively.
+```
+ _______________
+|VisionAgentV2|
+---------------
+      |                   ____________________
+      --(AgentMessage)-->|VisionAgentCoderV2|
+                         --------------------
+                               |                    ______________________
+                               --(AgentMessage)-->|VisionAgentPlannerV2|
+                                                  ----------------------
+ ____________________                     |
+|VisionAgentCoderV2|<----(PlanContext)-----
+--------------------
+ _______________        |
+|VisionAgentV2|<---(CodeContext)---
+---------------
+```
+
+#### AgentMessage and Contexts
+`AgentMessage` is a basic chat message with extended roles. The roles include the typical `user` and `assistant`, but also `conversation`, `planner` and `coder`, which come from `VisionAgentV2`, `VisionAgentPlannerV2` and `VisionAgentCoderV2` respectively; all three are types of `assistant` messages. `observation` messages contain the responses from Python code the planner executes internally.
+
+The `VisionAgentPlannerV2` returns a `PlanContext`, which contains a finalized version of the plan, including instructions and code snippets used during planning. `VisionAgentCoderV2` will then take in that `PlanContext` and return a `CodeContext`, which contains the final code and any additional information.
+
+
+#### Callbacks
+If you want to receive intermediate messages, you can use the `update_callback` argument in all of the `V2` constructors. This will asynchronously send `AgentMessage` objects to the callback function you provide; a minimal callback sketch is shown below. You can see an example of how to run this in `examples/chat/app.py`.
+
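A minimal sketch of such a callback. This assumes `VisionAgentV2` is exported from `vision_agent.agent` and that the callback receives each message's `model_dump()` dict (with `role`, `content` and `media` keys), which is what the V2 agents pass to it internally:

```python
from typing import Any, Dict

from vision_agent.agent import VisionAgentV2


def update_callback(message: Dict[str, Any]) -> None:
    # Collect or forward intermediate messages; here we just print a summary.
    print(f"[{message['role']}] {message['content'][:80]}")


agent = VisionAgentV2(verbose=True, update_callback=update_callback)
```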
+### Human-in-the-loop
+Human-in-the-loop is a feature that allows the user to interact with the agents at certain points in the conversation. This is handled by using the `interaction` and `interaction_response` roles in the `AgentMessage`. You can enable this feature by passing `hil=True` to `VisionAgentV2`. Currently you can only use human-in-the-loop if you are also using the `update_callback` to collect the messages and pass them back to `VisionAgentV2`.
+
+When the planner agent wants to interact with a human, it will return an `InteractionContext`, which will propagate back up to `VisionAgentV2` and then to the user. This exits the planner so it can ask for human input. If you collect the messages from `update_callback`, you will see that the last `AgentMessage` has a role of `interaction` and its content includes a JSON string surrounded by `<interaction>` tags:
+
+```
+AgentMessage(
+    role="interaction",
+    content="<interaction>{\"prompt\": \"Should I use owl_v2_image or countgd_counting?\"}</interaction>",
+    media=None,
+)
+```
+
+The user can then add an additional `AgentMessage` with the role `interaction_response` and the response they want to give:
+
+```
+AgentMessage(
+    role="interaction_response",
+    content="{\"function_name\": \"owl_v2_image\"}",
+    media=None,
+)
+```
+
+You can see an example of how this works in `examples/chat/chat-app/src/components/ChatSection.tsx` under the `handleSubmit` function; a sketch of the same round trip in Python follows.
+
+
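A sketch of that round trip in Python. The tool choice and image name are illustrative, and the pattern assumes you collect the callback messages yourself, as described above:

```python
from vision_agent.agent import VisionAgentV2
from vision_agent.agent.types import AgentMessage
from vision_agent.utils.execute import CodeInterpreterFactory

collected = []  # update_callback fills this with each message's model_dump() dict
agent = VisionAgentV2(hil=True, update_callback=lambda m: collected.append(m))
code_interpreter = CodeInterpreterFactory.new_instance(non_exiting=True)

chat = [AgentMessage(role="user", content="Count the cars in cars.png", media=None)]
agent.chat(chat, code_interpreter=code_interpreter)

if collected and collected[-1]["role"] == "interaction":
    # Show collected[-1]["content"] to the user, gather their tool choice, then
    # resume the chat with an interaction_response appended at the end.
    chat = [AgentMessage(**m) for m in collected]
    chat.append(
        AgentMessage(
            role="interaction_response",
            content='{"function_name": "owl_v2_image"}',
            media=None,
        )
    )
    agent.chat(chat, code_interpreter=code_interpreter)
```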
+#### Human-in-the-loop Caveats
+One issue with this approach is that the planner runs code in a notebook and relies on all of its previous executions. This means you cannot close or restart the internal notebook the agents use (usually named `code_interpreter` and created by `CodeInterpreterFactory.new_instance`). This is a problem because the notebook would otherwise close once you exit the `VisionAgentV2` call.
+
+To fix this, you must construct a notebook outside of the chat and ensure it has `non_exiting=True`:
+
+```python
+agent = VisionAgentV2(
+    verbose=True,
+    update_callback=update_callback,
+    hil=True,
+)
+
+code_interpreter = CodeInterpreterFactory.new_instance(non_exiting=True)
+agent.chat(
+    [
+        AgentMessage(
+            role="user",
+            content="Hello",
+            media=None
+        )
+    ],
+    code_interpreter=code_interpreter
+)
+```
+
+An example of this can be seen in `examples/chat/app.py`. Here the `code_interpreter` is constructed outside of the chat and passed in, so the notebook does not close when the chat ends or when it returns to ask for human feedback.
@@ -2,7 +2,12 @@ from abc import ABC, abstractmethod
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Union
 
-from vision_agent.agent.types import AgentMessage, CodeContext, PlanContext
+from vision_agent.agent.types import (
+    AgentMessage,
+    CodeContext,
+    InteractionContext,
+    PlanContext,
+)
 from vision_agent.lmm.types import Message
 from vision_agent.utils.execute import CodeInterpreter
 
@@ -31,7 +36,7 @@ class AgentCoder(Agent):
         chat: List[AgentMessage],
         max_steps: Optional[int] = None,
         code_interpreter: Optional[CodeInterpreter] = None,
-    ) -> CodeContext:
+    ) -> Union[CodeContext, InteractionContext]:
         pass
 
     @abstractmethod
@@ -51,5 +56,5 @@ class AgentPlanner(Agent):
         chat: List[AgentMessage],
         max_steps: Optional[int] = None,
         code_interpreter: Optional[CodeInterpreter] = None,
-    ) -> PlanContext:
+    ) -> Union[PlanContext, InteractionContext]:
         pass
@@ -17,12 +17,12 @@ class AgentMessage(BaseModel):
     interaction: An interaction between the user and the assistant. For example if the
         assistant wants to ask the user for help on a task, it could send an
         interaction message.
+    interaction_response: The user's response to an interaction message.
     conversation: Messages coming from the conversation agent, this is a type of
         assistant messages.
     planner: Messages coming from the planner agent, this is a type of assistant
         messages.
     coder: Messages coming from the coder agent, this is a type of assistant messages.
-
     """
 
     role: Union[
@@ -30,6 +30,7 @@ class AgentMessage(BaseModel):
         Literal["assistant"],  # planner, coder and conversation are of type assistant
         Literal["observation"],
         Literal["interaction"],
+        Literal["interaction_response"],
         Literal["conversation"],
         Literal["planner"],
         Literal["coder"],
@@ -39,13 +40,37 @@ class AgentMessage(BaseModel):
 
 
 class PlanContext(BaseModel):
+    """PlanContext is a data model that represents the context of a plan.
+
+    plan: A description of the overall plan.
+    instructions: A list of step-by-step instructions.
+    code: Code snippets that were used during planning.
+    """
+
     plan: str
     instructions: List[str]
     code: str
 
 
 class CodeContext(BaseModel):
+    """CodeContext is a data model that represents final code and test cases.
+
+    code: The final code that was written.
+    test: The test cases that were written.
+    success: A boolean value indicating whether the code passed the test cases.
+    test_result: The result of running the test cases.
+    """
+
     code: str
     test: str
     success: bool
     test_result: Execution
+
+
+class InteractionContext(BaseModel):
+    """InteractionContext is a data model that represents the context of an interaction.
+
+    chat: A list of messages exchanged between the user and the assistant.
+    """
+
+    chat: List[AgentMessage]
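The new context models are plain pydantic `BaseModel`s; a short illustrative sketch of constructing them (all values are hypothetical):

```python
from vision_agent.agent.types import AgentMessage, InteractionContext, PlanContext

plan = PlanContext(
    plan="Detect people with owl_v2_image, then count the detections",
    instructions=["Load the image", "Run owl_v2_image", "Count the boxes"],
    code="dets = owl_v2_image('person', image)",
)

# When the planner pauses for human input, it returns the chat so far instead:
interaction = InteractionContext(
    chat=[
        AgentMessage(
            role="interaction",
            content='<interaction>{"prompt": "Which tool should I use?"}</interaction>',
            media=None,
        )
    ]
)
```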
@@ -18,7 +18,12 @@ from vision_agent.agent.agent_utils import (
     print_code,
     strip_function_calls,
 )
-from vision_agent.agent.types import AgentMessage, CodeContext, PlanContext
+from vision_agent.agent.types import (
+    AgentMessage,
+    CodeContext,
+    InteractionContext,
+    PlanContext,
+)
 from vision_agent.agent.vision_agent_coder_prompts_v2 import CODE, FIX_BUG, TEST
 from vision_agent.agent.vision_agent_planner_v2 import VisionAgentPlannerV2
 from vision_agent.lmm import LMM, AnthropicLMM
@@ -257,6 +262,7 @@ class VisionAgentCoderV2(AgentCoder):
         tester: Optional[LMM] = None,
         debugger: Optional[LMM] = None,
         tool_recommender: Optional[Union[str, Sim]] = None,
+        hil: bool = False,
         verbose: bool = False,
         code_sandbox_runtime: Optional[str] = None,
         update_callback: Callable[[Dict[str, Any]], None] = lambda _: None,
@@ -272,6 +278,7 @@ class VisionAgentCoderV2(AgentCoder):
                 None, a default AnthropicLMM will be used.
             debugger (Optional[LMM]): The language model to use for the debugger agent.
             tool_recommender (Optional[Union[str, Sim]]): The tool recommender to use.
+            hil (bool): Whether to use human-in-the-loop mode.
             verbose (bool): Whether to print out debug information.
             code_sandbox_runtime (Optional[str]): The code sandbox runtime to use, can
                 be one of: None, "local" or "e2b". If None, it will read from the
@@ -283,8 +290,11 @@ class VisionAgentCoderV2(AgentCoder):
         self.planner = (
             planner
             if planner is not None
-            else VisionAgentPlannerV2(verbose=verbose, update_callback=update_callback)
+            else VisionAgentPlannerV2(
+                verbose=verbose, update_callback=update_callback, hil=hil
+            )
         )
+
         self.coder = (
             coder
             if coder is not None
@@ -311,6 +321,8 @@ class VisionAgentCoderV2(AgentCoder):
         self.verbose = verbose
         self.code_sandbox_runtime = code_sandbox_runtime
         self.update_callback = update_callback
+        if hasattr(self.planner, "update_callback"):
+            self.planner.update_callback = update_callback
 
     def __call__(
         self,
@@ -331,14 +343,17 @@ class VisionAgentCoderV2(AgentCoder):
         """
 
         input_msg = convert_message_to_agentmessage(input, media)
-        return self.generate_code(input_msg).code
+        code_or_interaction = self.generate_code(input_msg)
+        if isinstance(code_or_interaction, InteractionContext):
+            return code_or_interaction.chat[-1].content
+        return code_or_interaction.code
 
     def generate_code(
         self,
         chat: List[AgentMessage],
         max_steps: Optional[int] = None,
         code_interpreter: Optional[CodeInterpreter] = None,
-    ) -> CodeContext:
+    ) -> Union[CodeContext, InteractionContext]:
         """Generate vision code from a conversation.
 
         Parameters:
@@ -353,6 +368,11 @@ class VisionAgentCoderV2(AgentCoder):
         """
 
         chat = copy.deepcopy(chat)
+        if not chat or chat[-1].role not in {"user", "interaction_response"}:
+            raise ValueError(
+                f"Last chat message must be from the user or interaction_response, got {chat[-1].role}."
+            )
+
         with (
             CodeInterpreterFactory.new_instance(self.code_sandbox_runtime)
             if code_interpreter is None
@@ -362,6 +382,10 @@ class VisionAgentCoderV2(AgentCoder):
             plan_context = self.planner.generate_plan(
                 int_chat, max_steps=max_steps, code_interpreter=code_interpreter
             )
+            # the planner needs an interaction, so return before generating code
+            if isinstance(plan_context, InteractionContext):
+                return plan_context
+
             code_context = self.generate_code_from_plan(
                 orig_chat,
                 plan_context,
@@ -391,6 +415,19 @@ class VisionAgentCoderV2(AgentCoder):
         """
 
         chat = copy.deepcopy(chat)
+        if not chat or chat[-1].role not in {"user", "interaction_response"}:
+            raise ValueError(
+                f"Last chat message must be from the user or interaction_response, got {chat[-1].role}."
+            )
+
+        # we don't need the interaction response for generating code since it's
+        # already in the plan context
+        while chat and chat[-1].role != "user":
+            chat.pop()
+
+        if not chat:
+            raise ValueError("Chat must have at least one user message.")
+
         with (
             CodeInterpreterFactory.new_instance(self.code_sandbox_runtime)
             if code_interpreter is None
@@ -1,4 +1,5 @@
 import copy
+import json
 import logging
 import time
 from concurrent.futures import ThreadPoolExecutor, as_completed
@@ -21,7 +22,7 @@ from vision_agent.agent.agent_utils import (
     print_code,
     print_table,
 )
-from vision_agent.agent.types import AgentMessage, PlanContext
+from vision_agent.agent.types import AgentMessage, InteractionContext, PlanContext
 from vision_agent.agent.vision_agent_planner_prompts_v2 import (
     CRITIQUE_PLAN,
     EXAMPLE_PLAN1,
@@ -32,6 +33,7 @@ from vision_agent.agent.vision_agent_planner_prompts_v2 import (
     PLAN,
 )
 from vision_agent.lmm import LMM, AnthropicLMM, Message
+from vision_agent.tools.planner_tools import check_function_call, get_tool_documentation
 from vision_agent.utils.execute import (
     CodeInterpreter,
     CodeInterpreterFactory,
@@ -210,7 +212,7 @@ def execute_code_action(
     obs = execution.text(include_results=False).strip()
     if verbose:
         _CONSOLE.print(
-            f"[bold cyan]Code Execution Output ({end - start:.2f} sec):[/bold cyan] [yellow]{escape(obs)}[/yellow]"
+            f"[bold cyan]Code Execution Output ({end - start:.2f}s):[/bold cyan] [yellow]{escape(obs)}[/yellow]"
         )
 
     count = 1
@@ -247,6 +249,31 @@ def find_and_replace_code(response: str, code: str) -> str:
     return response[:code_start] + code + response[code_end:]
 
 
+def create_hil_response(
+    execution: Execution,
+) -> AgentMessage:
+    content = []
+    for result in execution.results:
+        for format in result.formats():
+            if format == "json":
+                data = result.json
+                try:
+                    data_dict: Dict[str, Any] = dict(data)  # type: ignore
+                    if (
+                        "request" in data_dict
+                        and "function_name" in data_dict["request"]
+                    ):
+                        content.append(data_dict)
+                except Exception:
+                    continue
+
+    return AgentMessage(
+        role="interaction",
+        content="<interaction>" + json.dumps(content) + "</interaction>",
+        media=None,
+    )
+
+
 def maybe_run_code(
     code: Optional[str],
     response: str,
@@ -254,6 +281,7 @@ def maybe_run_code(
     media_list: List[Union[str, Path]],
     model: LMM,
     code_interpreter: CodeInterpreter,
+    hil: bool = False,
     verbose: bool = False,
 ) -> List[AgentMessage]:
     return_chat: List[AgentMessage] = []
@@ -270,6 +298,11 @@ def maybe_run_code(
             AgentMessage(role="planner", content=fixed_response, media=None)
         )
 
+        # if we are running human-in-the-loop mode, send back an interaction message
+        # make sure we return code from planner and the hil response
+        if check_function_call(code, "get_tool_for_task") and hil:
+            return return_chat + [create_hil_response(execution)]
+
         media_data = capture_media_from_exec(execution)
         int_chat_elt = AgentMessage(role="observation", content=obs, media=None)
         if media_list:
@@ -285,6 +318,10 @@ def create_finalize_plan(
     model: LMM,
     verbose: bool = False,
 ) -> Tuple[List[AgentMessage], PlanContext]:
+    # if we're in the middle of an interaction, don't finalize the plan
+    if chat[-1].role == "interaction":
+        return [], PlanContext(plan="", instructions=[], code="")
+
     prompt = FINALIZE_PLAN.format(
         planning=get_planning(chat),
         excluded_tools=str([t.__name__ for t in pt.PLANNER_TOOLS]),
@@ -318,6 +355,27 @@ def get_steps(chat: List[AgentMessage], max_steps: int) -> int:
     return max_steps
 
 
+def replace_interaction_with_obs(chat: List[AgentMessage]) -> List[AgentMessage]:
+    chat = copy.deepcopy(chat)
+    new_chat = []
+
+    for i, chat_i in enumerate(chat):
+        if chat_i.role == "interaction" and (
+            i < len(chat) - 1 and chat[i + 1].role == "interaction_response"
+        ):
+            try:
+                response = json.loads(chat[i + 1].content)
+                function_name = response["function_name"]
+                tool_doc = get_tool_documentation(function_name)
+                new_chat.append(AgentMessage(role="observation", content=tool_doc))
+            except json.JSONDecodeError:
+                raise ValueError(f"Invalid JSON in interaction response: {chat_i}")
+        else:
+            new_chat.append(chat_i)
+
+    return new_chat
+
+
 class VisionAgentPlannerV2(AgentPlanner):
     """VisionAgentPlannerV2 is a class that generates a plan to solve a vision task."""
 
@@ -328,6 +386,7 @@ class VisionAgentPlannerV2(AgentPlanner):
         max_steps: int = 10,
         use_multi_trial_planning: bool = False,
         critique_steps: int = 11,
+        hil: bool = False,
         verbose: bool = False,
         code_sandbox_runtime: Optional[str] = None,
         update_callback: Callable[[Dict[str, Any]], None] = lambda _: None,
@@ -343,6 +402,7 @@ class VisionAgentPlannerV2(AgentPlanner):
             use_multi_trial_planning (bool): Whether to use multi-trial planning.
             critique_steps (int): The number of steps between critiques. If critic steps
                 is larger than max_steps no critiques will be made.
+            hil (bool): Whether to use human-in-the-loop mode.
             verbose (bool): Whether to print out debug information.
             code_sandbox_runtime (Optional[str]): The code sandbox runtime to use, can
                 be one of: None, "local" or "e2b". If None, it will read from the
@@ -365,6 +425,11 @@ class VisionAgentPlannerV2(AgentPlanner):
         self.use_multi_trial_planning = use_multi_trial_planning
         self.critique_steps = critique_steps
 
+        self.hil = hil
+        if self.hil:
+            DefaultPlanningImports.imports.append(
+                "from vision_agent.tools.planner_tools import get_tool_for_task_human_reviewer as get_tool_for_task"
+            )
         self.verbose = verbose
         self.code_sandbox_runtime = code_sandbox_runtime
         self.update_callback = update_callback
@@ -388,15 +453,17 @@ class VisionAgentPlannerV2(AgentPlanner):
         """
 
         input_msg = convert_message_to_agentmessage(input, media)
-        plan = self.generate_plan(input_msg)
-        return plan.plan
+        plan_or_interaction = self.generate_plan(input_msg)
+        if isinstance(plan_or_interaction, InteractionContext):
+            return plan_or_interaction.chat[-1].content
+        return plan_or_interaction.plan
 
     def generate_plan(
         self,
         chat: List[AgentMessage],
         max_steps: Optional[int] = None,
         code_interpreter: Optional[CodeInterpreter] = None,
-    ) -> PlanContext:
+    ) -> Union[PlanContext, InteractionContext]:
         """Generate a plan to solve a vision task.
 
         Parameters:
@@ -409,24 +476,30 @@ class VisionAgentPlannerV2(AgentPlanner):
             needed to solve the task.
         """
 
-        if not chat:
-            raise ValueError("Chat cannot be empty")
+        if not chat or chat[-1].role not in {"user", "interaction_response"}:
+            raise ValueError(
+                f"Last chat message must be from the user or interaction_response, got {chat[-1].role}."
+            )
 
         chat = copy.deepcopy(chat)
-        code_interpreter = code_interpreter or CodeInterpreterFactory.new_instance(
-            self.code_sandbox_runtime
-        )
         max_steps = max_steps or self.max_steps
 
-        with code_interpreter:
+        with (
+            CodeInterpreterFactory.new_instance(self.code_sandbox_runtime)
+            if code_interpreter is None
+            else code_interpreter
+        ) as code_interpreter:
             critque_steps = 1
             finished = False
+            interaction = False
             int_chat, _, media_list = add_media_to_chat(chat, code_interpreter)
+            int_chat = replace_interaction_with_obs(int_chat)
 
             step = get_steps(int_chat, max_steps)
             if "<count>" not in int_chat[-1].content and step == max_steps:
                 int_chat[-1].content += f"\n<count>{step}</count>\n"
-            while step > 0 and not finished:
+
+            while step > 0 and not finished and not interaction:
                 if self.use_multi_trial_planning:
                     response = run_multi_trial_planning(
                         int_chat, media_list, self.planner
@@ -456,8 +529,10 @@ class VisionAgentPlannerV2(AgentPlanner):
                     media_list,
                     self.planner,
                     code_interpreter,
-                    self.verbose,
+                    hil=self.hil,
+                    verbose=self.verbose,
                 )
+                interaction = updated_chat[-1].role == "interaction"
 
                 if critque_steps % self.critique_steps == 0:
                     critique = run_critic(int_chat, media_list, self.critic)
@@ -478,14 +553,18 @@ class VisionAgentPlannerV2(AgentPlanner):
                 for chat_elt in updated_chat:
                     self.update_callback(chat_elt.model_dump())
 
-            updated_chat, plan_context = create_finalize_plan(
-                int_chat, self.planner, self.verbose
-            )
-            int_chat.extend(updated_chat)
-            for chat_elt in updated_chat:
-                self.update_callback(chat_elt.model_dump())
+            context: Union[PlanContext, InteractionContext]
+            if interaction:
+                context = InteractionContext(chat=int_chat)
+            else:
+                updated_chat, context = create_finalize_plan(
+                    int_chat, self.planner, self.verbose
+                )
+                int_chat.extend(updated_chat)
+                for chat_elt in updated_chat:
+                    self.update_callback(chat_elt.model_dump())
 
-        return plan_context
+        return context
 
     def log_progress(self, data: Dict[str, Any]) -> None:
         pass
@@ -1,4 +1,5 @@
 import copy
+import json
 from pathlib import Path
 from typing import Any, Callable, Dict, List, Optional, Union, cast
 
@@ -8,7 +9,12 @@ from vision_agent.agent.agent_utils import (
     convert_message_to_agentmessage,
     extract_tag,
 )
-from vision_agent.agent.types import AgentMessage, PlanContext
+from vision_agent.agent.types import (
+    AgentMessage,
+    CodeContext,
+    InteractionContext,
+    PlanContext,
+)
 from vision_agent.agent.vision_agent_coder_v2 import format_code_context
 from vision_agent.agent.vision_agent_prompts_v2 import CONVERSATION
 from vision_agent.lmm import LMM, AnthropicLMM
@@ -39,10 +45,24 @@ def run_conversation(agent: LMM, chat: List[AgentMessage]) -> str:
     return cast(str, response)
 
 
+def check_for_interaction(chat: List[AgentMessage]) -> bool:
+    return (
+        len(chat) > 2
+        and chat[-2].role == "interaction"
+        and chat[-1].role == "interaction_response"
+    )
+
+
 def extract_conversation_for_generate_code(
     chat: List[AgentMessage],
 ) -> List[AgentMessage]:
     chat = copy.deepcopy(chat)
+
+    # if we are in the middle of an interaction, return all the intermediate planning
+    # steps
+    if check_for_interaction(chat):
+        return chat
+
     extracted_chat = []
     for chat_i in chat:
         if chat_i.role == "user":
@@ -66,13 +86,20 @@ def maybe_run_action(
         # to the outside user via its update_callback, but we don't necessarily have
         # access to that update_callback here, so we re-create the message using
         # format_code_context.
-        code_context = coder.generate_code(
-            extracted_chat, code_interpreter=code_interpreter
-        )
-        return [
-            AgentMessage(role="coder", content=format_code_context(code_context)),
-            AgentMessage(role="observation", content=code_context.test_result.text()),
-        ]
+        context = coder.generate_code(extracted_chat, code_interpreter=code_interpreter)
+
+        if isinstance(context, CodeContext):
+            return [
+                AgentMessage(role="coder", content=format_code_context(context)),
+                AgentMessage(role="observation", content=context.test_result.text()),
+            ]
+        elif isinstance(context, InteractionContext):
+            return [
+                AgentMessage(
+                    role="interaction",
+                    content=json.dumps([elt.model_dump() for elt in context.chat]),
+                )
+            ]
     elif action == "edit_code":
         extracted_chat = extract_conversation_for_generate_code(chat)
         plan_context = PlanContext(
@@ -80,12 +107,12 @@ def maybe_run_action(
             instructions=[],
             code="",
         )
-        code_context = coder.generate_code_from_plan(
+        context = coder.generate_code_from_plan(
             extracted_chat, plan_context, code_interpreter=code_interpreter
         )
         return [
-            AgentMessage(role="coder", content=format_code_context(code_context)),
-            AgentMessage(role="observation", content=code_context.test_result.text()),
+            AgentMessage(role="coder", content=format_code_context(context)),
+            AgentMessage(role="observation", content=context.test_result.text()),
         ]
     elif action == "view_image":
         pass
@@ -102,6 +129,7 @@ class VisionAgentV2(Agent):
         self,
         agent: Optional[LMM] = None,
         coder: Optional[AgentCoder] = None,
+        hil: bool = False,
         verbose: bool = False,
         code_sandbox_runtime: Optional[str] = None,
         update_callback: Callable[[Dict[str, Any]], None] = lambda x: None,
@@ -113,6 +141,7 @@ class VisionAgentV2(Agent):
                 default AnthropicLMM will be used.
             coder (Optional[AgentCoder]): The coder agent to use for generating vision
                 code. If None, a default VisionAgentCoderV2 will be used.
+            hil (bool): Whether to use human-in-the-loop mode.
             verbose (bool): Whether to print out debug information.
             code_sandbox_runtime (Optional[str]): The code sandbox runtime to use, can
                 be one of: None, "local" or "e2b". If None, it will read from the
@@ -132,7 +161,9 @@ class VisionAgentV2(Agent):
         self.coder = (
             coder
             if coder is not None
-            else VisionAgentCoderV2(verbose=verbose, update_callback=update_callback)
+            else VisionAgentCoderV2(
+                verbose=verbose, update_callback=update_callback, hil=hil
+            )
         )
 
         self.verbose = verbose
@@ -169,6 +200,7 @@ class VisionAgentV2(Agent):
     def chat(
         self,
         chat: List[AgentMessage],
+        code_interpreter: Optional[CodeInterpreter] = None,
     ) -> List[AgentMessage]:
         """Conversational interface to the agent. This is the main method to use to
         interact with the agent. It takes in a list of messages and returns the agent's
@@ -177,28 +209,47 @@ class VisionAgentV2(Agent):
         Parameters:
             chat (List[AgentMessage]): The input to the agent. This should be a list of
                 AgentMessage objects.
+            code_interpreter (Optional[CodeInterpreter]): The code interpreter to use.
 
         Returns:
             List[AgentMessage]: The agent's response as a list of AgentMessage objects.
         """
 
+        chat = copy.deepcopy(chat)
+        if not chat or chat[-1].role not in {"user", "interaction_response"}:
+            raise ValueError(
+                f"Last chat message must be from the user or interaction_response, got {chat[-1].role}."
+            )
+
         return_chat = []
-        with CodeInterpreterFactory.new_instance(
-            self.code_sandbox_runtime
+        with (
+            CodeInterpreterFactory.new_instance(self.code_sandbox_runtime)
+            if code_interpreter is None
+            else code_interpreter
         ) as code_interpreter:
             int_chat, _, _ = add_media_to_chat(chat, code_interpreter)
-            response_context = run_conversation(self.agent, int_chat)
-            return_chat.append(
-                AgentMessage(role="conversation", content=response_context)
-            )
-            self.update_callback(return_chat[-1].model_dump())
 
-            action = extract_tag(response_context, "action")
+            # if we had an interaction and then received an observation from the user,
+            # go back into the same action to finish it.
+            action = None
+            if check_for_interaction(int_chat):
+                action = "generate_or_edit_vision_code"
+            else:
+                response_context = run_conversation(self.agent, int_chat)
+                return_chat.append(
+                    AgentMessage(role="conversation", content=response_context)
+                )
+                self.update_callback(return_chat[-1].model_dump())
+                action = extract_tag(response_context, "action")
 
             updated_chat = maybe_run_action(
                 self.coder, action, int_chat, code_interpreter=code_interpreter
             )
-            if updated_chat is not None:
+
+            # return an interaction early to get the user's feedback
+            if updated_chat is not None and updated_chat[-1].role == "interaction":
+                return_chat.extend(updated_chat)
+            elif updated_chat is not None and updated_chat[-1].role != "interaction":
                 # do not append updated_chat to return_chat because the observation
                 # from running the action will have already been added via the callbacks
                 obs_response_context = run_conversation(
@@ -3,7 +3,9 @@ import shutil
 import tempfile
 from typing import Any, Callable, Dict, List, Optional, Tuple, cast
 
+import libcst as cst
 import numpy as np
+from IPython.display import display
 from PIL import Image
 
 import vision_agent.tools as T
@@ -21,8 +23,13 @@ from vision_agent.agent.vision_agent_planner_prompts_v2 import (
     TEST_TOOLS_EXAMPLE1,
     TEST_TOOLS_EXAMPLE2,
 )
-from vision_agent.lmm import AnthropicLMM
-from vision_agent.utils.execute import CodeInterpreterFactory
+from vision_agent.lmm import LMM, AnthropicLMM
+from vision_agent.utils.execute import (
+    CodeInterpreter,
+    CodeInterpreterFactory,
+    Execution,
+    MimeType,
+)
 from vision_agent.utils.image_utils import convert_to_b64
 from vision_agent.utils.sim import load_cached_sim
 
@@ -33,6 +40,16 @@ _LOGGER = logging.getLogger(__name__)
 EXAMPLES = f"\n{TEST_TOOLS_EXAMPLE1}\n{TEST_TOOLS_EXAMPLE2}\n"
 
 
+def format_tool_output(tool_thoughts: str, tool_docstring: str) -> str:
+    return_str = "[get_tool_for_task output]\n"
+    if tool_thoughts.strip() != "":
+        return_str += f"{tool_thoughts}\n\n"
+    return_str += (
+        f"Tool Documentation:\n{tool_docstring}\n[end of get_tool_for_task output]\n"
+    )
+    return return_str
+
+
 def extract_tool_info(
     tool_choice_context: Dict[str, Any]
 ) -> Tuple[Optional[Callable], str, str, str]:
@@ -46,6 +63,70 @@ def extract_tool_info(
     return tool, tool_thoughts, tool_docstring, ""
 
 
+def run_tool_testing(
+    task: str,
+    image_paths: List[str],
+    lmm: LMM,
+    exclude_tools: Optional[List[str]],
+    code_interpreter: CodeInterpreter,
+) -> tuple[str, str, Execution]:
+    """Helper function to generate and run tool testing code."""
+    query = lmm.generate(CATEGORIZE_TOOL_REQUEST.format(task=task))
+    category = extract_tag(query, "category")  # type: ignore
+    if category is None:
+        category = task
+    else:
+        category = (
+            f"I need models from the {category.strip()} category of tools. {task}"
+        )
+
+    tool_docs = TOOL_RECOMMENDER.top_k(category, k=10, thresh=0.2)
+    if exclude_tools is not None and len(exclude_tools) > 0:
+        cleaned_tool_docs = []
+        for tool_doc in tool_docs:
+            if tool_doc["name"] not in exclude_tools:
+                cleaned_tool_docs.append(tool_doc)
+        tool_docs = cleaned_tool_docs
+    tool_docs_str = "\n".join([e["doc"] for e in tool_docs])
+
+    prompt = TEST_TOOLS.format(
+        tool_docs=tool_docs_str,
+        previous_attempts="",
+        user_request=task,
+        examples=EXAMPLES,
+        media=str(image_paths),
+    )
+
+    response = lmm.generate(prompt, media=image_paths)
+    code = extract_tag(response, "code")  # type: ignore
+    if code is None:
+        raise ValueError(f"Could not extract code from response: {response}")
+    tool_output = code_interpreter.exec_isolation(DefaultImports.prepend_imports(code))
+    tool_output_str = tool_output.text(include_results=False).strip()
+
+    count = 1
+    while (
+        not tool_output.success
+        or (len(tool_output.logs.stdout) == 0 and len(tool_output.logs.stderr) == 0)
+    ) and count <= 3:
+        if tool_output_str.strip() == "":
+            tool_output_str = "EMPTY"
+        prompt = TEST_TOOLS.format(
+            tool_docs=tool_docs_str,
+            previous_attempts=f"<code>\n{code}\n</code>\nTOOL OUTPUT\n{tool_output_str}",
+            user_request=task,
+            examples=EXAMPLES,
+            media=str(image_paths),
+        )
+        code = extract_code(lmm.generate(prompt, media=image_paths))  # type: ignore
+        tool_output = code_interpreter.exec_isolation(
+            DefaultImports.prepend_imports(code)
+        )
+        tool_output_str = tool_output.text(include_results=False).strip()
+        count += 1
+
+    return code, tool_docs_str, tool_output
+
+
 def get_tool_for_task(
     task: str, images: List[np.ndarray], exclude_tools: Optional[List[str]] = None
 ) -> None:
@@ -90,61 +171,11 @@ def get_tool_for_task(
             Image.fromarray(image).save(image_path)
             image_paths.append(image_path)
 
-        query = lmm.generate(CATEGORIZE_TOOL_REQUEST.format(task=task))
-        category = extract_tag(query, "category")  # type: ignore
-        if category is None:
-            category = task
-        else:
-            category = (
-                f"I need models from the {category.strip()} category of tools. {task}"
-            )
-
-        tool_docs = TOOL_RECOMMENDER.top_k(category, k=10, thresh=0.2)
-        if exclude_tools is not None and len(exclude_tools) > 0:
-            cleaned_tool_docs = []
-            for tool_doc in tool_docs:
-                if not tool_doc["name"] in exclude_tools:
-                    cleaned_tool_docs.append(tool_doc)
-            tool_docs = cleaned_tool_docs
-        tool_docs_str = "\n".join([e["doc"] for e in tool_docs])
-
-        prompt = TEST_TOOLS.format(
-            tool_docs=tool_docs_str,
-            previous_attempts="",
-            user_request=task,
-            examples=EXAMPLES,
-            media=str(image_paths),
-        )
-
-        response = lmm.generate(prompt, media=image_paths)
-        code = extract_tag(response, "code")  # type: ignore
-        if code is None:
-            raise ValueError(f"Could not extract code from response: {response}")
-        tool_output = code_interpreter.exec_isolation(
-            DefaultImports.prepend_imports(code)
+        code, tool_docs_str, tool_output = run_tool_testing(
+            task, image_paths, lmm, exclude_tools, code_interpreter
         )
         tool_output_str = tool_output.text(include_results=False).strip()
 
-        count = 1
-        while (
-            not tool_output.success
-            or (len(tool_output.logs.stdout) == 0 and len(tool_output.logs.stderr) == 0)
-        ) and count <= 3:
-            if tool_output_str.strip() == "":
-                tool_output_str = "EMPTY"
-            prompt = TEST_TOOLS.format(
-                tool_docs=tool_docs_str,
-                previous_attempts=f"<code>\n{code}\n</code>\nTOOL OUTPUT\n{tool_output_str}",
-                user_request=task,
-                examples=EXAMPLES,
-                media=str(image_paths),
-            )
-            code = extract_code(lmm.generate(prompt, media=image_paths))  # type: ignore
-            tool_output = code_interpreter.exec_isolation(
-                DefaultImports.prepend_imports(code)
-            )
-            tool_output_str = tool_output.text(include_results=False).strip()
-
 
         error_message = ""
         prompt = PICK_TOOL.format(
@@ -178,9 +209,87 @@ def get_tool_for_task(
         except Exception as e:
             _LOGGER.error(f"Error removing temp directory: {e}")
 
-    print(
-        f"[get_tool_for_task output]\n{tool_thoughts}\n\nTool Documentation:\n{tool_docstring}\n[end of get_tool_for_task output]\n"
-    )
+    print(format_tool_output(tool_thoughts, tool_docstring))
+
+
+def get_tool_documentation(tool_name: str) -> str:
+    # use same format as get_tool_for_task
+    tool_doc = T.TOOLS_DF[T.TOOLS_DF["name"] == tool_name]["doc"].values[0]
+    return format_tool_output("", tool_doc)
+
+
+def get_tool_for_task_human_reviewer(
+    task: str, images: List[np.ndarray], exclude_tools: Optional[List[str]] = None
+) -> None:
+    # NOTE: this should be the same documentation as get_tool_for_task
+    """Given a task and one or more images this function will find a tool to accomplish
+    the job. It prints the tool documentation and thoughts on why it chose the tool.
+
+    It can produce tools for the following types of tasks:
+        - Object detection and counting
+        - Classification
+        - Segmentation
+        - OCR
+        - VQA
+        - Depth and pose estimation
+        - Video object tracking
+
+    Wait until the documentation is printed to use the function so you know what the
+    input and output signatures are.
+
+    Parameters:
+        task: str: The task to accomplish.
+        images: List[np.ndarray]: The images to use for the task.
+        exclude_tools: Optional[List[str]]: A list of tool names to exclude from the
+            recommendations. This is helpful if you are calling get_tool_for_task twice
+            and do not want the same tool recommended.
+
+    Returns:
+        The tool to use for the task is printed to stdout
+
+    Examples
+    --------
+        >>> get_tool_for_task("Give me an OCR model that can find 'hot chocolate' in the image", [image])
+    """
+    lmm = AnthropicLMM()
+
+    with (
+        tempfile.TemporaryDirectory() as tmpdirname,
+        CodeInterpreterFactory.new_instance() as code_interpreter,
+    ):
+        image_paths = []
+        for i, image in enumerate(images[:3]):
+            image_path = f"{tmpdirname}/image_{i}.png"
+            Image.fromarray(image).save(image_path)
+            image_paths.append(image_path)
+
+        _, _, tool_output = run_tool_testing(
+            task, image_paths, lmm, exclude_tools, code_interpreter
+        )
+
+        # need to re-display results for the outer notebook to see them
+        for result in tool_output.results:
+            if "json" in result.formats():
+                display({MimeType.APPLICATION_JSON: result.json}, raw=True)
+
+
+def check_function_call(code: str, function_name: str) -> bool:
+    class FunctionCallVisitor(cst.CSTVisitor):
+        def __init__(self) -> None:
+            self.function_name = function_name
+            self.function_called = False
+
+        def visit_Call(self, node: cst.Call) -> None:
+            if (
+                isinstance(node.func, cst.Name)
+                and node.func.value == self.function_name
+            ):
+                self.function_called = True
+
+    tree = cst.parse_module(code)
+    visitor = FunctionCallVisitor()
+    tree.visit(visitor)
+    return visitor.function_called
 
 
 def finalize_plan(user_request: str, chain_of_thoughts: str) -> str:
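A quick usage sketch of `check_function_call`, which is what the planner uses to detect when a human-in-the-loop interaction is needed (the code string below is illustrative):

```python
from vision_agent.tools.planner_tools import check_function_call

code = "get_tool_for_task('count the people', [image])"
assert check_function_call(code, "get_tool_for_task")
assert not check_function_call(code, "owl_v2_image")
```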
@@ -213,6 +213,8 @@ def _call_post(
     files_in_b64 = None
     if files:
         files_in_b64 = [(file[0], b64encode(file[1]).decode("utf-8")) for file in files]
+
+    tool_call_trace = None
     try:
         if files is not None:
             response = session.post(url, data=payload, files=files)
@@ -250,9 +252,10 @@ def _call_post(
         tool_call_trace.response = result
         return result
     finally:
-        trace = tool_call_trace.model_dump()
-        trace["type"] = "tool_call"
-        display({MimeType.APPLICATION_JSON: trace}, raw=True)
+        if tool_call_trace is not None:
+            trace = tool_call_trace.model_dump()
+            trace["type"] = "tool_call"
+            display({MimeType.APPLICATION_JSON: trace}, raw=True)
 
 
 def filter_bboxes_by_threshold(
@@ -398,17 +398,20 @@ class CodeInterpreter(abc.ABC):
         self,
         timeout: int,
         remote_path: Optional[Union[str, Path]] = None,
+        non_exiting: bool = False,
         *args: Any,
         **kwargs: Any,
     ) -> None:
         self.timeout = timeout
         self.remote_path = Path(remote_path if remote_path is not None else WORKSPACE)
+        self.non_exiting = non_exiting
 
     def __enter__(self) -> Self:
         return self
 
     def __exit__(self, *exc_info: Any) -> None:
-        self.close()
+        if not self.non_exiting:
+            self.close()
 
     def close(self, *args: Any, **kwargs: Any) -> None:
         raise NotImplementedError()
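This flag is what the README's human-in-the-loop caveat relies on. A small sketch of the lifecycle it enables (assuming the interpreter's `exec_cell` method; only `non_exiting` is confirmed by this diff):

```python
from vision_agent.utils.execute import CodeInterpreterFactory

# With non_exiting=True, __exit__ skips close(), so kernel state survives
# the `with` block and can be reused across chat turns.
code_interpreter = CodeInterpreterFactory.new_instance(non_exiting=True)
with code_interpreter:
    code_interpreter.exec_cell("x = 1")
with code_interpreter:
    code_interpreter.exec_cell("print(x)")  # x is still defined
code_interpreter.close()  # explicit cleanup once the conversation is over
```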
@@ -571,8 +574,9 @@ class LocalCodeInterpreter(CodeInterpreter):
         self,
         timeout: int = _SESSION_TIMEOUT,
         remote_path: Optional[Union[str, Path]] = None,
+        non_exiting: bool = False,
     ) -> None:
-        super().__init__(timeout=timeout)
+        super().__init__(timeout=timeout, non_exiting=non_exiting)
         self.nb = nbformat.v4.new_notebook()
         # Set the notebook execution path to the remote path
         self.remote_path = Path(remote_path if remote_path is not None else WORKSPACE)
@@ -692,6 +696,7 @@ class CodeInterpreterFactory:
     def new_instance(
         code_sandbox_runtime: Optional[str] = None,
         remote_path: Optional[Union[str, Path]] = None,
+        non_exiting: bool = False,
     ) -> CodeInterpreter:
         if not code_sandbox_runtime:
             code_sandbox_runtime = os.getenv("CODE_SANDBOX_RUNTIME", "local")
@@ -702,7 +707,9 @@ class CodeInterpreterFactory:
             )
         elif code_sandbox_runtime == "local":
             instance = LocalCodeInterpreter(
-                timeout=_SESSION_TIMEOUT, remote_path=remote_path
+                timeout=_SESSION_TIMEOUT,
+                remote_path=remote_path,
+                non_exiting=non_exiting,
             )
         else:
             raise ValueError(