vision-agent 0.2.231__tar.gz → 0.2.232__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {vision_agent-0.2.231 → vision_agent-0.2.232}/PKG-INFO +1 -1
- {vision_agent-0.2.231 → vision_agent-0.2.232}/pyproject.toml +1 -1
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/.sim_tools/df.csv +12 -10
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/agent_utils.py +4 -2
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_coder_prompts_v2.py +1 -1
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_coder_v2.py +15 -1
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_planner_prompts_v2.py +12 -8
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_planner_v2.py +1 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_prompts_v2.py +4 -1
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_v2.py +5 -2
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/tools/planner_tools.py +33 -13
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/tools/tools.py +44 -18
- {vision_agent-0.2.231 → vision_agent-0.2.232}/LICENSE +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/README.md +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/.sim_tools/embs.npy +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/__init__.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/README.md +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/__init__.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/agent.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/types.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_coder.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_coder_prompts.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_planner.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_planner_prompts.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_prompts.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/clients/__init__.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/clients/http.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/clients/landing_public_api.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/configs/__init__.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/configs/anthropic_config.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/configs/anthropic_openai_config.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/configs/config.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/configs/openai_config.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/fonts/__init__.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/fonts/default_font_ch_en.ttf +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/lmm/__init__.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/lmm/lmm.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/lmm/types.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/tools/__init__.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/tools/meta_tools.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/tools/prompts.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/tools/tool_utils.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/tools/tools_types.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/utils/__init__.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/utils/exceptions.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/utils/execute.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/utils/image_utils.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/utils/sim.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/utils/type_defs.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/utils/video.py +0 -0
- {vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/utils/video_tracking.py +0 -0
{vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/.sim_tools/df.csv RENAMED

@@ -514,7 +514,7 @@ desc,doc,name
     >>> vit_nsfw_classification(image)
     {""label"": ""normal"", ""scores"": 0.68},
 ",vit_nsfw_classification
-'video_temporal_localization' will run qwen2vl on each chunk_length_frames value selected for the video. It can detect multiple objects independently per chunk_length_frames given a text prompt such as a referring expression but does not track objects across frames. It returns a list of floats with a value of 1.0 if the objects are found in a given chunk_length_frames of the video.,"video_temporal_localization(prompt: str, frames: List[numpy.ndarray], model: str = 'qwen2vl', chunk_length_frames:
+'video_temporal_localization' will run qwen2vl on each chunk_length_frames value selected for the video. It can detect multiple objects independently per chunk_length_frames given a text prompt such as a referring expression but does not track objects across frames. It returns a list of floats with a value of 1.0 if the objects are found in a given chunk_length_frames of the video.,"video_temporal_localization(prompt: str, frames: List[numpy.ndarray], model: str = 'qwen2vl', chunk_length_frames: int = 2) -> List[float]:
     'video_temporal_localization' will run qwen2vl on each chunk_length_frames
     value selected for the video. It can detect multiple objects independently per
     chunk_length_frames given a text prompt such as a referring expression

@@ -527,7 +527,7 @@ desc,doc,name
         frames (List[np.ndarray]): The reference frames used for the question
         model (str): The model to use for the inference. Valid values are
             'qwen2vl', 'gpt4o'.
-        chunk_length_frames (
+        chunk_length_frames (int): length of each chunk in frames

     Returns:
         List[float]: A list of floats with a value of 1.0 if the objects to be found

@@ -540,16 +540,18 @@ desc,doc,name
 ",video_temporal_localization
 "'flux_image_inpainting' performs image inpainting to fill the masked regions, given by mask, in the image, given image based on the text prompt and surrounding image context. It can be used to edit regions of an image according to the prompt given.","flux_image_inpainting(prompt: str, image: numpy.ndarray, mask: numpy.ndarray) -> numpy.ndarray:
     'flux_image_inpainting' performs image inpainting to fill the masked regions,
-    given by mask, in the image, given image based on the text prompt and surrounding
-    It can be used to edit regions of an image according to the prompt
+    given by mask, in the image, given image based on the text prompt and surrounding
+    image context. It can be used to edit regions of an image according to the prompt
+    given.

     Parameters:
         prompt (str): A detailed text description guiding what should be generated
-            in the masked area. More detailed and specific prompts typically yield
-
-
-
-
+            in the masked area. More detailed and specific prompts typically yield
+            better results.
+        image (np.ndarray): The source image to be inpainted. The image will serve as
+            the base context for the inpainting process.
+        mask (np.ndarray): A binary mask image with 0's and 1's, where 1 indicates
+            areas to be inpainted and 0 indicates areas to be preserved.

     Returns:
         np.ndarray: The generated image(s) as a numpy array in RGB format with values

@@ -658,7 +660,7 @@ desc,doc,name
     -------
     >>> save_image(image)
 ",save_image
-'save_video' is a utility function that saves a list of frames as a mp4 video file on disk.,"save_video(frames: List[numpy.ndarray], output_video_path: Optional[str] = None, fps: float =
+'save_video' is a utility function that saves a list of frames as a mp4 video file on disk.,"save_video(frames: List[numpy.ndarray], output_video_path: Optional[str] = None, fps: float = 5) -> str:
     'save_video' is a utility function that saves a list of frames as a mp4 video file on disk.

     Parameters:
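The save_video change above just fills in the previously truncated default (fps: float = 5). A minimal usage sketch, assuming save_video is importable from vision_agent.tools as documented; the frames below are blank placeholders:

```python
import numpy as np
from vision_agent.tools import save_video  # assumed import location

# Ten blank 480x640 RGB frames standing in for real video frames.
frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(10)]

# With no output_video_path, save_video writes a temporary .mp4 and returns its path;
# fps defaults to 5 per the signature shown above.
video_path = save_video(frames, fps=5)
print(video_path)
```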
{vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/agent_utils.py RENAMED

@@ -148,8 +148,10 @@ def format_plan_v2(plan: PlanContext) -> str:
     plan_str += "Instructions:\n"
     for v in plan.instructions:
         plan_str += f" - {v}\n"
-
-
+
+    if plan.code:
+        plan_str += "Code:\n"
+        plan_str += plan.code
     return plan_str

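With this change, format_plan_v2 appends the stored code under a "Code:" heading whenever the plan context carries one. A minimal sketch, assuming PlanContext is importable from vision_agent.agent.types and takes the plan/instructions/code fields used elsewhere in this diff:

```python
from vision_agent.agent.agent_utils import format_plan_v2
from vision_agent.agent.types import PlanContext  # assumed import location

ctx = PlanContext(
    plan="Count the dogs in the image.",
    instructions=["Detect dogs with an object detection tool.", "Return the count."],
    code="dets = detect_dogs(image)\nprint(len(dets))",  # illustrative snippet only
)

# The formatted plan now ends with a "Code:" section containing ctx.code.
print(format_plan_v2(ctx))
```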
{vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_coder_prompts_v2.py RENAMED

@@ -6,7 +6,7 @@ FEEDBACK = """


 CODE = """
-**Role**: You are an
+**Role**: You are an expert software programmer.

 **Task**: You are given a plan by a planning agent that solves a vision problem posed by the user. You are also given code snippets that the planning agent used to solve the task. Your job is to organize the code so that it can be easily called by the user to solve the task.

{vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_coder_v2.py RENAMED

@@ -425,6 +425,8 @@ class VisionAgentCoderV2(AgentCoder):
             chat (List[AgentMessage]): The input to the agent. This should be a list of
                 AgentMessage objects.
             plan_context (PlanContext): The plan context that was previously generated.
+                If plan_context.code is not provided, then the code will be generated
+                from the chat messages.
             code_interpreter (Optional[CodeInterpreter]): The code interpreter to use.

         Returns:

@@ -455,12 +457,24 @@ class VisionAgentCoderV2(AgentCoder):
         int_chat, _, media_list = add_media_to_chat(chat, code_interpreter)
         tool_docs = retrieve_tools(plan_context.instructions, self.tool_recommender)

+        # If code is not provided from the plan_context then generate it, else use
+        # the provided code and start with testing
+        if not plan_context.code.strip():
+            code = write_code(
+                coder=self.coder,
+                chat=int_chat,
+                tool_docs=tool_docs,
+                plan=format_plan_v2(plan_context),
+            )
+        else:
+            code = plan_context.code
+
         code_context = test_code(
             tester=self.tester,
             debugger=self.debugger,
             chat=int_chat,
             plan=format_plan_v2(plan_context),
-            code=
+            code=code,
             tool_docs=tool_docs,
             code_interpreter=code_interpreter,
             media_list=media_list,
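The net effect: a caller that already has code in its PlanContext skips write_code and goes straight to testing. A hedged sketch of that call path; the AgentMessage fields and import paths are assumptions based on the types referenced in this diff:

```python
from vision_agent.agent import VisionAgentCoderV2  # assumed import location
from vision_agent.agent.types import AgentMessage, PlanContext  # assumed import location

coder = VisionAgentCoderV2()

# plan_context.code is non-empty, so the coder reuses it and only runs test_code.
context = coder.generate_code_from_plan(
    chat=[AgentMessage(role="user", content="Count the cars in traffic.jpg", media=["traffic.jpg"])],
    plan_context=PlanContext(
        plan="Detect cars with an object detection tool and count them.",
        instructions=["Load traffic.jpg", "Detect the cars", "Print the count"],
        code="# previously generated code would go here",
    ),
)
print(context)
```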
{vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_planner_prompts_v2.py RENAMED

@@ -458,6 +458,8 @@ You are given a task: "{task}" from the user. You must extract the type of categ
 - "DocQA" - answering questions about a document or extracting information from a document.
 - "video object tracking" - tracking objects in a video.
 - "depth and pose estimation" - estimating the depth or pose of objects in an image.
+- "temporal localization" - localizing the time period an event occurs in a video.
+- "inpainting" - filling in masked parts of an image.

 Return the category or categories (comma separated) inside tags <category># your categories here</category>. If you are unsure about a task, it is better to include more categories than less.
 """

@@ -651,22 +653,24 @@ PICK_TOOL = """
 """

 FINALIZE_PLAN = """
-**
+**Task**: You are given a chain of thoughts, python executions and observations from a planning agent as it tries to construct a plan to solve a user request. Your task is to summarize the plan it found so that another programming agent to write a program to accomplish the user request.

-**
+**Documentation**: You can use these tools to help you visualize or save the output:
+{tool_desc}

 **Planning**: Here is chain of thoughts, executions and observations from the planning agent:
 {planning}

 **Instructions**:
 1. Summarize the plan that the planning agent found.
-2. Write a single function that solves the problem based on what the planner found.
-3.
+2. Write a single function that solves the problem based on what the planner found and only returns the final solution.
+3. Only use tools obtained from calling `get_tool_for_task`.
 4. Do not include {excluded_tools} tools in your instructions.
-5.
-6.
-7.
-8.
+5. Ensure the function is well documented and easy to understand.
+6. Ensure you visualize the output with `overlay_bounding_boxes` or `overlay_segmentation_masks` and save it to a file with `save_image` or `save_video`.
+7. Use the default FPS for extracting frames from videos unless otherwise specified by the user.
+8. Include the expected answer in your 'plan' so that the programming agent can properly test if it has the correct answer.
+9. Respond in the following format with JSON surrounded by <json> tags and code surrounded by <code> tags:

 <json>
 {{
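Instruction 6 above asks the finalized plan to visualize and save its output. A minimal sketch of that pattern, assuming the usual vision_agent.tools detection format (normalized [xmin, ymin, xmax, ymax] boxes with label and score):

```python
import numpy as np
from vision_agent.tools import overlay_bounding_boxes, save_image  # assumed imports

image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder image
detections = [{"score": 0.91, "label": "dog", "bbox": [0.1, 0.2, 0.4, 0.6]}]  # illustrative

# Draw the boxes on a copy of the image and write the result to disk.
viz = overlay_bounding_boxes(image, detections)
save_image(viz, "detections.png")
```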
{vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_planner_v2.py RENAMED

@@ -326,6 +326,7 @@ def create_finalize_plan(
         return [], PlanContext(plan="", instructions=[], code="")

     prompt = FINALIZE_PLAN.format(
+        tool_desc=UTIL_DOCSTRING,
         planning=get_planning(chat),
         excluded_tools=str([t.__name__ for t in pt.PLANNER_TOOLS]),
     )
{vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_prompts_v2.py RENAMED

@@ -42,6 +42,8 @@ AGENT: <response>I am VisionAgent, an agent built by LandingAI, to help users wr
 - Understanding documents
 - Pose estimation
 - Visual question answering for both images and videos
+- Action recognition in videos
+- Image inpainting

 How can I help you?</response>
 --- END EXAMPLE2 ---

@@ -54,7 +56,8 @@ Here is the current conversation so far:

 **Instructions**:
 1. Only respond with a single <response> tag and a single <action> tag.
-2.
+2. You can only take one action at a time in response to the user's message. Do not offer to fix code on the user's behalf, only if they have directly asked you to.
+3. Respond in the following format, the <action> tag is optional and can be excluded if you do not want to take any action:

 <response>Your response to the user's message</response>
 <action>The action you want to take from **Actions**</action>
{vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/agent/vision_agent_v2.py RENAMED

@@ -112,14 +112,17 @@ def maybe_run_action(
             )
         ]
     elif action == "edit_code":
+        # We don't want to pass code in plan_context.code so the coder will generate
+        # new code from plan_context.plan
         plan_context = PlanContext(
-            plan="Edit the latest code observed in the fewest steps possible according to the user's feedback."
+            plan="Edit the latest code observed in the fewest steps possible according to the user's feedback."
+            + ("<code>\n" + final_code + "\n</code>" if final_code is not None else ""),
             instructions=[
                 chat_i.content
                 for chat_i in extracted_chat
                 if chat_i.role == "user" and "<final_code>" not in chat_i.content
             ],
-            code=
+            code="",
         )
         context = coder.generate_code_from_plan(
             extracted_chat, plan_context, code_interpreter=code_interpreter
{vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/tools/planner_tools.py RENAMED

@@ -2,7 +2,7 @@ import inspect
 import logging
 import tempfile
 from concurrent.futures import ThreadPoolExecutor, as_completed
-from typing import Any, Callable, Dict, List, Optional, Tuple, cast
+from typing import Any, Callable, Dict, List, Optional, Tuple, Union, cast

 import libcst as cst
 import numpy as np

@@ -235,7 +235,9 @@ def run_tool_testing(


 def get_tool_for_task(
-    task: str,
+    task: str,
+    images: Union[Dict[str, List[np.ndarray]], List[np.ndarray]],
+    exclude_tools: Optional[List[str]] = None,
 ) -> None:
     """Given a task and one or more images this function will find a tool to accomplish
     the jobs. It prints the tool documentation and thoughts on why it chose the tool.

@@ -248,6 +250,8 @@ def get_tool_for_task(
     - VQA
     - Depth and pose estimation
     - Video object tracking
+    - Video temporal localization (action recognition)
+    - Image inpainting

     Only ask for one type of task at a time, for example a task needing to identify
     text is one OCR task while needing to identify non-text objects is an OD task. Wait

@@ -256,7 +260,8 @@ def get_tool_for_task(

     Parameters:
         task: str: The task to accomplish.
-        images: List[np.ndarray]: The images to use
+        images: Union[Dict[str, List[np.ndarray]], List[np.ndarray]]: The images to use
+            for the task. If a key is provided, it is used as the file name.
         exclude_tools: Optional[List[str]]: A list of tool names to exclude from the
             recommendations. This is helpful if you are calling get_tool_for_task twice
             and do not want the same tool recommended.

@@ -266,20 +271,29 @@ def get_tool_for_task(

     Examples
     --------
-    >>> get_tool_for_task(
+    >>> get_tool_for_task(
+    >>>     "Give me an OCR model that can find 'hot chocolate' in the image",
+    >>>     {"image": [image]})
+    >>> get_tool_for_taks(
+    >>>     "I need a tool that can paint a background for this image and maks",
+    >>>     {"image": [image], "mask": [mask]})
     """
     tool_tester = CONFIG.create_tool_tester()
     tool_chooser = CONFIG.create_tool_chooser()

+    if isinstance(images, list):
+        images = {"image": images}
+
     with (
         tempfile.TemporaryDirectory() as tmpdirname,
         CodeInterpreterFactory.new_instance() as code_interpreter,
     ):
         image_paths = []
-        for
-
-
-
+        for k in images.keys():
+            for i, image in enumerate(images[k]):
+                image_path = f"{tmpdirname}/{k}_{i}.png"
+                Image.fromarray(image).save(image_path)
+                image_paths.append(image_path)

         code, tool_docs_str, tool_output = run_tool_testing(
             task, image_paths, tool_tester, exclude_tools, code_interpreter

@@ -300,20 +314,26 @@ def get_tool_documentation(tool_name: str) -> str:


 def get_tool_for_task_human_reviewer(
-    task: str,
+    task: str,
+    images: Union[Dict[str, List[np.ndarray]], List[np.ndarray]],
+    exclude_tools: Optional[List[str]] = None,
 ) -> None:
     # NOTE: this will have the same documentation as get_tool_for_task
     tool_tester = CONFIG.create_tool_tester()

+    if isinstance(images, list):
+        images = {"image": images}
+
     with (
         tempfile.TemporaryDirectory() as tmpdirname,
         CodeInterpreterFactory.new_instance() as code_interpreter,
     ):
         image_paths = []
-        for
-
-
-
+        for k in images.keys():
+            for i, image in enumerate(images[k]):
+                image_path = f"{tmpdirname}/{k}_{i}.png"
+                Image.fromarray(image).save(image_path)
+                image_paths.append(image_path)

         tools = [
             t.__name__
{vision_agent-0.2.231 → vision_agent-0.2.232}/vision_agent/tools/tools.py RENAMED

@@ -1727,22 +1727,46 @@ def video_temporal_localization(
     }
     payload["chunk_length_frames"] = chunk_length_frames

-
-
-
+    segments = split_frames_into_segments(frames, segment_size=50, overlap=0)
+
+    def _apply_temporal_localization(
+        segment: List[np.ndarray],
+    ) -> List[float]:
+        segment_buffer_bytes = [("video", frames_to_bytes(segment))]
+        data = send_inference_request(
+            payload, "video-temporal-localization", files=segment_buffer_bytes, v2=True
+        )
+        chunked_data = [cast(float, value) for value in data]
+
+        full_data = []
+        for value in chunked_data:
+            full_data.extend([value] * chunk_length_frames)
+
+        return full_data[: len(segment)]
+
+    with ThreadPoolExecutor() as executor:
+        futures = {
+            executor.submit(_apply_temporal_localization, segment): segment_index
+            for segment_index, segment in enumerate(segments)
+        }
+
+        localization_per_segment = []
+        for future in as_completed(futures):
+            segment_index = futures[future]
+            localization_per_segment.append((segment_index, future.result()))
+
+    localization_per_segment = [
+        x[1] for x in sorted(localization_per_segment, key=lambda x: x[0])  # type: ignore
+    ]
+    localizations = cast(List[float], [e for o in localization_per_segment for e in o])
+
     _display_tool_trace(
         video_temporal_localization.__name__,
         payload,
-
+        localization_per_segment,
         files,
     )
-
-
-    full_data = []
-    for value in chunked_data:
-        full_data.extend([value] * chunk_length_frames)
-
-    return full_data[: len(frames)]
+    return localizations


 def vit_image_classification(image: np.ndarray) -> Dict[str, Any]:
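From the caller's side the signature is unchanged; frames are now scored in 50-frame segments in parallel and the per-frame scores are stitched back together in order. A minimal usage sketch with placeholder frames:

```python
import numpy as np
from vision_agent.tools import video_temporal_localization  # assumed import location

# Placeholder frames; real frames would come from a video decoding step.
frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(120)]

# One float per frame, 1.0 where the prompted event occurs within its chunk.
scores = video_temporal_localization("person waving", frames, chunk_length_frames=2)
print(len(scores), scores[:10])
```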
@@ -2028,16 +2052,18 @@ def flux_image_inpainting(
     mask: np.ndarray,
 ) -> np.ndarray:
     """'flux_image_inpainting' performs image inpainting to fill the masked regions,
-    given by mask, in the image, given image based on the text prompt and surrounding
-    It can be used to edit regions of an image according to the prompt
+    given by mask, in the image, given image based on the text prompt and surrounding
+    image context. It can be used to edit regions of an image according to the prompt
+    given.

     Parameters:
         prompt (str): A detailed text description guiding what should be generated
-            in the masked area. More detailed and specific prompts typically yield
-
-
-
-
+            in the masked area. More detailed and specific prompts typically yield
+            better results.
+        image (np.ndarray): The source image to be inpainted. The image will serve as
+            the base context for the inpainting process.
+        mask (np.ndarray): A binary mask image with 0's and 1's, where 1 indicates
+            areas to be inpainted and 0 indicates areas to be preserved.

     Returns:
         np.ndarray: The generated image(s) as a numpy array in RGB format with values
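A minimal usage sketch of the documented signature; the image and mask below are placeholders, and save_image is assumed to be importable alongside it:

```python
import numpy as np
from vision_agent.tools import flux_image_inpainting, save_image  # assumed imports

image = np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder source image
mask = np.zeros((512, 512), dtype=np.uint8)      # 1 = repaint, 0 = keep
mask[100:200, 100:200] = 1

# The masked square is regenerated from the prompt; the rest of the image is preserved.
result = flux_image_inpainting("a red wooden door", image, mask)
save_image(result, "inpainted.png")
```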
All remaining files listed above are unchanged between the two versions.