PyPI: vision-agent 0.2.224 (py3-none-any.whl) → 0.2.226 (py3-none-any.whl)


Files changed (15)

vision_agent/.sim_tools/df.csv +49 -91
vision_agent/.sim_tools/embs.npy +0 -0
vision_agent/agent/agent_utils.py +13 -0
vision_agent/agent/vision_agent_coder_prompts_v2.py +1 -1
vision_agent/agent/vision_agent_coder_v2.py +6 -1
vision_agent/agent/vision_agent_planner_prompts_v2.py +42 -33
vision_agent/agent/vision_agent_v2.py +30 -22
vision_agent/tools/planner_tools.py +4 -2
vision_agent/tools/tools.py +119 -123
vision_agent/utils/sim.py +6 -0
vision_agent/utils/video_tracking.py +305 -0
{vision_agent-0.2.224.dist-info → vision_agent-0.2.226.dist-info}/METADATA +1 -1
{vision_agent-0.2.224.dist-info → vision_agent-0.2.226.dist-info}/RECORD +15 -14
{vision_agent-0.2.224.dist-info → vision_agent-0.2.226.dist-info}/LICENSE +0 -0
{vision_agent-0.2.224.dist-info → vision_agent-0.2.226.dist-info}/WHEEL +0 -0

vision_agent/.sim_tools/df.csv CHANGED

@@ -65,25 +65,30 @@ desc,doc,name
             },
         ]
     ",owlv2_sam2_instance_segmentation
-"'owlv2_sam2_video_tracking' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, mask file names and associated probability scores.","owlv2_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10, fine_tune_id: Optional[str] = None) -> List[List[Dict[str, Any]]]:
-'owlv2_sam2_video_tracking' is a tool that can segment multiple objects given a text
-    prompt such as category names or referring expressions. The categories in the text
-    prompt are separated by commas. It returns a list of bounding boxes, label names,
-    mask file names and associated probability scores.
+"'owlv2_sam2_video_tracking' is a tool that can track and segment multiple objects in a video given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, masks and associated probability scores and is useful for tracking and counting without duplicating counts.","owlv2_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10, fine_tune_id: Optional[str] = None) -> List[List[Dict[str, Any]]]:
+'owlv2_sam2_video_tracking' is a tool that can track and segment multiple
+    objects in a video given a text prompt such as category names or referring
+    expressions. The categories in the text prompt are separated by commas. It returns
+    a list of bounding boxes, label names, masks and associated probability scores and
+    is useful for tracking and counting without duplicating counts.
     Parameters:
         prompt (str): The prompt to ground to the image.
-        image (np.ndarray): The image to ground the prompt to.
+        frames (List[np.ndarray]): The list of frames to ground the prompt to.
+        chunk_length (Optional[int]): The number of frames to re-run owlv2 to find
+            new objects.
         fine_tune_id (Optional[str]): If you have a fine-tuned model, you can pass the
             fine-tuned model ID here to use it.
     Returns:
-        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
-            bounding box, and mask of the detected objects with normalized coordinates
-            (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
-            and xmax and ymax are the coordinates of the bottom-right of the bounding box.
-            The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
-            the background.
+        List[List[Dict[str, Any]]]: A list of list of dictionaries containing the
+            label, segmentation mask and bounding boxes. The outer list represents each
+            frame and the inner list is the entities per frame. The detected objects
+            have normalized coordinates between 0 and 1 (xmin, ymin, xmax, ymax). xmin
+            and ymin are the coordinates of the top-left and xmax and ymax are the
+            coordinates of the bottom-right of the bounding box. The mask is binary 2D
+            numpy array where 1 indicates the object and 0 indicates the background.
+            The label names are prefixed with their ID represent the total count.
     Example
     -------
@@ -170,25 +175,28 @@ desc,doc,name
             },
         ]
     ",countgd_sam2_instance_segmentation
-"'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, mask file names and associated probability scores.","countgd_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10) -> List[List[Dict[str, Any]]]:
-'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text
-    prompt such as category names or referring expressions. The categories in the text
-    prompt are separated by commas. It returns a list of bounding boxes, label names,
-    mask file names and associated probability scores.
+"'countgd_sam2_video_tracking' is a tool that can track and segment multiple objects in a video given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, masks and associated probability scores and is useful for tracking and counting without duplicating counts.","countgd_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10) -> List[List[Dict[str, Any]]]:
+'countgd_sam2_video_tracking' is a tool that can track and segment multiple
+    objects in a video given a text prompt such as category names or referring
+    expressions. The categories in the text prompt are separated by commas. It returns
+    a list of bounding boxes, label names, masks and associated probability scores and
+    is useful for tracking and counting without duplicating counts.
     Parameters:
         prompt (str): The prompt to ground to the image.
-        image (np.ndarray): The image to ground the prompt to.
-        chunk_length (Optional[int]): The number of frames to re-run florence2 to find
+        frames (List[np.ndarray]): The list of frames to ground the prompt to.
+        chunk_length (Optional[int]): The number of frames to re-run countgd to find
             new objects.
     Returns:
-        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
-            bounding box, and mask of the detected objects with normalized coordinates
-            (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
-            and xmax and ymax are the coordinates of the bottom-right of the bounding box.
-            The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
-            the background.
+        List[List[Dict[str, Any]]]: A list of list of dictionaries containing the
+            label, segmentation mask and bounding boxes. The outer list represents each
+            frame and the inner list is the entities per frame. The detected objects
+            have normalized coordinates between 0 and 1 (xmin, ymin, xmax, ymax). xmin
+            and ymin are the coordinates of the top-left and xmax and ymax are the
+            coordinates of the bottom-right of the bounding box. The mask is binary 2D
+            numpy array where 1 indicates the object and 0 indicates the background.
+            The label names are prefixed with their ID represent the total count.
     Example
     -------
@@ -265,12 +273,12 @@ desc,doc,name
             },
         ]
     ",florence2_sam2_instance_segmentation
-'florence2_sam2_video_tracking' is a tool that can segment and track multiple entities in a video given a text prompt such as category names or referring expressions. You can optionally separate the categories in the text with commas. It can find new objects every 'chunk_length' frames and is useful for tracking and counting without duplicating counts and always outputs scores of 1.0.,"florence2_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10, fine_tune_id: Optional[str] = None) -> List[List[Dict[str, Any]]]:
-'florence2_sam2_video_tracking' is a tool that can segment and track multiple
-    entities in a video given a text prompt such as category names or referring
-    expressions. You can optionally separate the categories in the text with commas. It
-    can find new objects every 'chunk_length' frames and is useful for tracking and
-    counting without duplicating counts and always outputs scores of 1.0.
+"'florence2_sam2_video_tracking' is a tool that can track and segment multiple objects in a video given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, masks and associated probability scores and is useful for tracking and counting without duplicating counts.","florence2_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10, fine_tune_id: Optional[str] = None) -> List[List[Dict[str, Any]]]:
+'florence2_sam2_video_tracking' is a tool that can track and segment multiple
+    objects in a video given a text prompt such as category names or referring
+    expressions. The categories in the text prompt are separated by commas. It returns
+    a list of bounding boxes, label names, masks and associated probability scores and
+    is useful for tracking and counting without duplicating counts.
     Parameters:
         prompt (str): The prompt to ground to the video.
@@ -282,10 +290,13 @@ desc,doc,name
     Returns:
         List[List[Dict[str, Any]]]: A list of list of dictionaries containing the
-        label, segment mask and bounding boxes. The outer list represents each frame
-        and the inner list is the entities per frame. The label contains the object ID
-        followed by the label name. The objects are only identified in the first framed
-        and tracked throughout the video.
+            label, segmentation mask and bounding boxes. The outer list represents each
+            frame and the inner list is the entities per frame. The detected objects
+            have normalized coordinates between 0 and 1 (xmin, ymin, xmax, ymax). xmin
+            and ymin are the coordinates of the top-left and xmax and ymax are the
+            coordinates of the bottom-right of the bounding box. The mask is binary 2D
+            numpy array where 1 indicates the object and 0 indicates the background.
+            The label names are prefixed with their ID represent the total count.
     Example
     -------
@@ -445,43 +456,6 @@ desc,doc,name
         >>> qwen2_vl_video_vqa('Which football player made the goal?', frames)
         'Lionel Messi'
     ",qwen2_vl_video_vqa
-"'detr_segmentation' is a tool that can segment common objects in an image without any text prompt. It returns a list of detected objects as labels, their regions as masks and their scores.","detr_segmentation(image: numpy.ndarray) -> List[Dict[str, Any]]:
-'detr_segmentation' is a tool that can segment common objects in an
-    image without any text prompt. It returns a list of detected objects
-    as labels, their regions as masks and their scores.
-    Parameters:
-        image (np.ndarray): The image used to segment things and objects
-    Returns:
-        List[Dict[str, Any]]: A list of dictionaries containing the score, label
-            and mask of the detected objects. The mask is binary 2D numpy array where 1
-            indicates the object and 0 indicates the background.
-    Example
-    -------
-        >>> detr_segmentation(image)
-        [
-            {
-                'score': 0.45,
-                'label': 'window',
-                'mask': array([[0, 0, 0, ..., 0, 0, 0],
-                    [0, 0, 0, ..., 0, 0, 0],
-                    ...,
-                    [0, 0, 0, ..., 0, 0, 0],
-                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
-            },
-            {
-                'score': 0.70,
-                'label': 'bird',
-                'mask': array([[0, 0, 0, ..., 0, 0, 0],
-                    [0, 0, 0, ..., 0, 0, 0],
-                    ...,
-                    [0, 0, 0, ..., 0, 0, 0],
-                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
-            },
-        ]
-    ",detr_segmentation
 'depth_anything_v2' is a tool that runs depth_anythingv2 model to generate a depth image from a given RGB image. The returned depth image is monochrome and represents depth values as pixel intesities with pixel values ranging from 0 to 255.,"depth_anything_v2(image: numpy.ndarray) -> numpy.ndarray:
 'depth_anything_v2' is a tool that runs depth_anythingv2 model to generate a
     depth image from a given RGB image. The returned depth image is monochrome and
@@ -522,22 +496,6 @@ desc,doc,name
                 [10, 11, 15, ..., 202, 202, 205],
                 [10, 10, 10, ..., 200, 200, 200]], dtype=uint8),
     ",generate_pose_image
-'vit_image_classification' is a tool that can classify an image. It returns a list of classes and their probability scores based on image content.,"vit_image_classification(image: numpy.ndarray) -> Dict[str, Any]:
-'vit_image_classification' is a tool that can classify an image. It returns a
-    list of classes and their probability scores based on image content.
-    Parameters:
-        image (np.ndarray): The image to classify or tag
-    Returns:
-        Dict[str, Any]: A dictionary containing the labels and scores. One dictionary
-            contains a list of labels and other a list of scores.
-    Example
-    -------
-        >>> vit_image_classification(image)
-        {""labels"": [""leopard"", ""lemur, otter"", ""bird""], ""scores"": [0.68, 0.30, 0.02]},
-    ",vit_image_classification
 'vit_nsfw_classification' is a tool that can classify an image as 'nsfw' or 'normal'. It returns the predicted label and their probability scores based on image content.,"vit_nsfw_classification(image: numpy.ndarray) -> Dict[str, Any]:
 'vit_nsfw_classification' is a tool that can classify an image as 'nsfw' or 'normal'.
     It returns the predicted label and their probability scores based on image content.
@@ -566,7 +524,7 @@ desc,doc,name
         prompt (str): The question about the video
         frames (List[np.ndarray]): The reference frames used for the question
         model (str): The model to use for the inference. Valid values are
-            'qwen2vl', 'gpt4o', 'internlm-xcomposer'
+            'qwen2vl', 'gpt4o'.
         chunk_length_frames (Optional[int]): length of each chunk in frames
     Returns:
@@ -641,7 +599,7 @@ desc,doc,name
         >>> closest_distance(det1, det2, image_size)
         141.42
     ",minimum_distance
-"'extract_frames_and_timestamps' extracts frames and timestamps from a video which can be a file path, url or youtube link, returns a list of dictionaries with keys ""frame"" and ""timestamp"" where ""frame"" is a numpy array and ""timestamp"" is the relative time in seconds where the frame was captured. The frame is a numpy array.","extract_frames_and_timestamps(video_uri: Union[str, pathlib.Path], fps: float = 1) -> List[Dict[str, Union[numpy.ndarray, float]]]:
+"'extract_frames_and_timestamps' extracts frames and timestamps from a video which can be a file path, url or youtube link, returns a list of dictionaries with keys ""frame"" and ""timestamp"" where ""frame"" is a numpy array and ""timestamp"" is the relative time in seconds where the frame was captured. The frame is a numpy array.","extract_frames_and_timestamps(video_uri: Union[str, pathlib.Path], fps: float = 5) -> List[Dict[str, Union[numpy.ndarray, float]]]:
 'extract_frames_and_timestamps' extracts frames and timestamps from a video
     which can be a file path, url or youtube link, returns a list of dictionaries
     with keys ""frame"" and ""timestamp"" where ""frame"" is a numpy array and ""timestamp"" is
@@ -651,7 +609,7 @@ desc,doc,name
     Parameters:
         video_uri (Union[str, Path]): The path to the video file, url or youtube link
         fps (float, optional): The frame rate per second to extract the frames. Defaults
-            to 1.
+            to 5.
     Returns:
         List[Dict[str, Union[np.ndarray, float]]]: A list of dictionaries containing the
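
The default for `fps` in `extract_frames_and_timestamps` moves from 1 to 5, so callers that rely on the default should expect roughly five times as many frames per video. The arithmetic can be sketched as follows, assuming frames are sampled at evenly spaced timestamps starting at t=0. The helper `expected_frame_count` is hypothetical and is not part of vision-agent.

```python
import math


def expected_frame_count(duration_s: float, fps: float) -> int:
    """Approximate number of frames sampled from a clip of the given length.

    Hypothetical helper for illustration only; it assumes evenly spaced
    sampling and is not taken from the vision-agent source.
    """
    if duration_s < 0 or fps <= 0:
        raise ValueError("duration_s must be >= 0 and fps must be > 0")
    return math.floor(duration_s * fps)


# A 12-second clip under the old default (fps=1) and the new default (fps=5).
old_default = expected_frame_count(12.0, fps=1)
new_default = expected_frame_count(12.0, fps=5)
print(old_default, new_default)
```

Callers that want the previous sampling density can pass `fps=1` explicitly, since the parameter itself is unchanged.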

vision_agent/.sim_tools/embs.npy CHANGED

Binary file

vision_agent/agent/agent_utils.py CHANGED

@@ -153,6 +153,19 @@ def format_plan_v2(plan: PlanContext) -> str:
     return plan_str
+def format_conversation(chat: List[AgentMessage]) -> str:
+    chat = copy.deepcopy(chat)
+    prompt = ""
+    for chat_i in chat:
+        if chat_i.role == "user":
+            prompt += f"USER: {chat_i.content}\n\n"
+        elif chat_i.role == "observation" or chat_i.role == "coder":
+            prompt += f"OBSERVATION: {chat_i.content}\n\n"
+        elif chat_i.role == "conversation":
+            prompt += f"AGENT: {chat_i.content}\n\n"
+    return prompt
 def format_plans(plans: Dict[str, Any]) -> str:
     plan_str = ""
     for k, v in plans.items():
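
To illustrate what the relocated `format_conversation` produces, here is a self-contained sketch. The `Msg` dataclass is a stand-in for vision-agent's `AgentMessage` and models only the `role` and `content` fields used by the function; the `"other"` role is made up to show that unrecognized roles are skipped.

```python
import copy
from dataclasses import dataclass
from typing import List


@dataclass
class Msg:
    """Stand-in for AgentMessage (assumption: only role and content matter here)."""

    role: str
    content: str


def format_conversation(chat: List[Msg]) -> str:
    # Same role-to-prefix mapping as the function added in this diff.
    chat = copy.deepcopy(chat)
    prompt = ""
    for chat_i in chat:
        if chat_i.role == "user":
            prompt += f"USER: {chat_i.content}\n\n"
        elif chat_i.role in ("observation", "coder"):
            prompt += f"OBSERVATION: {chat_i.content}\n\n"
        elif chat_i.role == "conversation":
            prompt += f"AGENT: {chat_i.content}\n\n"
    return prompt


chat = [
    Msg("user", "Count the boxes"),
    Msg("other", "ignored"),  # made-up role, silently dropped
    Msg("coder", "<final_code>...</final_code>"),
    Msg("conversation", "Done"),
]
formatted = format_conversation(chat)
print(formatted)
```

Messages with the `coder` role are rendered under the same `OBSERVATION:` prefix as observations, so the conversation model sees generated code as something observed rather than something it said.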

vision_agent/agent/vision_agent_coder_prompts_v2.py CHANGED

@@ -65,7 +65,7 @@ This is the documentation for the functions you have access to. You may call any
 7. DO NOT assert the output value, run the code and assert only the output format or data structure.
 8. DO NOT use try except block to handle the error, let the error be raised if the code is incorrect.
 9. DO NOT import the testing function as it will available in the testing environment.
-10. Print the output of the function that is being tested.
+10. Print the output of the function that is being tested and ensure it is not empty.
 11. Use the output of the function that is being tested as the return value of the testing function.
 12. Run the testing function in the end and don't assign a variable to its output.
 13. Output your test code using <code> tags:

vision_agent/agent/vision_agent_coder_v2.py CHANGED

@@ -202,7 +202,12 @@ def write_and_test_code(
         tool_docs=tool_docs,
         plan=plan,
     )
-    code = strip_function_calls(code)
+    try:
+        code = strip_function_calls(code)
+    except Exception:
+        # the code may be malformatted, this will fail in the exec call and the agent
+        # will attempt to debug it
+        pass
     test = write_test(
         tester=tester,
         chat=chat,
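
The change wraps `strip_function_calls` so that malformed code is passed through unchanged and fails later in execution, where the agent can debug it. A minimal sketch of that fallback pattern follows. The function `strip_top_level_calls` is a hypothetical stand-in built on `ast` (Python 3.9+ for `ast.unparse`); the real `strip_function_calls` may behave differently.

```python
import ast


def strip_top_level_calls(code: str) -> str:
    """Hypothetical stand-in: drop bare top-level call statements such as print(x)."""
    tree = ast.parse(code)  # raises SyntaxError when the code is malformed
    tree.body = [
        node
        for node in tree.body
        if not (isinstance(node, ast.Expr) and isinstance(node.value, ast.Call))
    ]
    return ast.unparse(tree)


def safe_strip(code: str) -> str:
    """Return the stripped code, or the original code if stripping fails."""
    try:
        return strip_top_level_calls(code)
    except Exception:
        # Leave malformed code untouched; a later exec step will surface the error.
        return code


print(safe_strip("x = 1\nprint(x)"))
print(safe_strip("def f(:"))
```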

vision_agent/agent/vision_agent_planner_prompts_v2.py CHANGED

@@ -136,8 +136,9 @@ Tool Documentation:
 countgd_object_detection(prompt: str, image: numpy.ndarray, box_threshold: float = 0.23) -> List[Dict[str, Any]]:
     'countgd_object_detection' is a tool that can detect multiple instances of an
     object given a text prompt. It is particularly useful when trying to detect and
-    count a large number of objects. It returns a list of bounding boxes with
-    normalized coordinates, label names and associated confidence scores.
+    count a large number of objects. You can optionally separate object names in the
+    prompt with commas. It returns a list of bounding boxes with normalized
+    coordinates, label names and associated confidence scores.
     Parameters:
         prompt (str): The object that needs to be counted.
@@ -272,40 +273,47 @@ OBSERVATION:
 [get_tool_for_task output]
 For tracking boxes moving on a conveyor belt, we need a tool that can consistently track the same box across frames without losing it or double counting. Looking at the outputs: florence2_sam2_video_tracking successfully tracks the single box across all 5 frames, maintaining consistent tracking IDs and showing the box's movement along the conveyor.
-'florence2_sam2_video_tracking' is a tool that can segment and track multiple
-entities in a video given a text prompt such as category names or referring
-expressions. You can optionally separate the categories in the text with commas. It
-can find new objects every 'chunk_length' frames and is useful for tracking and
-counting without duplicating counts and always outputs scores of 1.0.
+Tool Documentation:
+def florence2_sam2_video_tracking(prompt: str, frames: List[np.ndarray], chunk_length: Optional[int] = 10) -> List[List[Dict[str, Any]]]:
+    'florence2_sam2_video_tracking' is a tool that can track and segment multiple
+    objects in a video given a text prompt such as category names or referring
+    expressions. The categories in the text prompt are separated by commas. It returns
+    a list of bounding boxes, label names, masks and associated probability scores and
+    is useful for tracking and counting without duplicating counts.
-Parameters:
-    prompt (str): The prompt to ground to the video.
-    frames (List[np.ndarray]): The list of frames to ground the prompt to.
-    chunk_length (Optional[int]): The number of frames to re-run florence2 to find
-        new objects.
+    Parameters:
+        prompt (str): The prompt to ground to the video.
+        frames (List[np.ndarray]): The list of frames to ground the prompt to.
+        chunk_length (Optional[int]): The number of frames to re-run florence2 to find
+            new objects.
+        fine_tune_id (Optional[str]): If you have a fine-tuned model, you can pass the
+            fine-tuned model ID here to use it.
-Returns:
-    List[List[Dict[str, Any]]]: A list of list of dictionaries containing the
-    label,segment mask and bounding boxes. The outer list represents each frame and
-    the inner list is the entities per frame. The label contains the object ID
-    followed by the label name. The objects are only identified in the first framed
-    and tracked throughout the video.
+    Returns:
+        List[List[Dict[str, Any]]]: A list of list of dictionaries containing the
+            label, segmentation mask and bounding boxes. The outer list represents each
+            frame and the inner list is the entities per frame. The detected objects
+            have normalized coordinates between 0 and 1 (xmin, ymin, xmax, ymax). xmin
+            and ymin are the coordinates of the top-left and xmax and ymax are the
+            coordinates of the bottom-right of the bounding box. The mask is binary 2D
+            numpy array where 1 indicates the object and 0 indicates the background.
+            The label names are prefixed with their ID represent the total count.
-Example
--------
-    >>> florence2_sam2_video("car, dinosaur", frames)
-    [
+    Example
+    -------
+        >>> florence2_sam2_video_tracking("car, dinosaur", frames)
         [
-            {
-                'label': '0: dinosaur',
-                'bbox': [0.1, 0.11, 0.35, 0.4],
-                'mask': array([[0, 0, 0, ..., 0, 0, 0],
-                    ...,
-                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
-            },
-        ],
-        ...
-    ]
+            [
+                {
+                    'label': '0: dinosaur',
+                    'bbox': [0.1, 0.11, 0.35, 0.4],
+                    'mask': array([[0, 0, 0, ..., 0, 0, 0],
+                        ...,
+                        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
+                },
+            ],
+            ...
+        ]
 [end of get_tool_for_task output]
 <count>8</count>
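
The updated documentation describes the tracking output as a list of frames, each holding detections whose labels are prefixed with a track ID (for example `'0: dinosaur'`). Under that assumption, counting distinct objects without double counting can be sketched as below. The helper `count_unique_tracks` and the sample data are hypothetical, and masks are omitted for brevity.

```python
from typing import Any, Dict, List, Set


def count_unique_tracks(tracked: List[List[Dict[str, Any]]]) -> int:
    """Count distinct track IDs across all frames.

    Assumes each label has the form '<id>: <name>', as in the documented example.
    """
    ids: Set[str] = set()
    for frame in tracked:
        for detection in frame:
            track_id, _, _name = detection["label"].partition(":")
            ids.add(track_id.strip())
    return len(ids)


# Hand-written sample shaped like the documented return value.
sample = [
    [{"label": "0: dinosaur", "bbox": [0.1, 0.11, 0.35, 0.4]}],
    [
        {"label": "0: dinosaur", "bbox": [0.12, 0.11, 0.37, 0.4]},
        {"label": "1: car", "bbox": [0.5, 0.5, 0.7, 0.8]},
    ],
    [{"label": "1: car", "bbox": [0.55, 0.5, 0.75, 0.8]}],
]
total = count_unique_tracks(sample)
print(total)
```

Counting IDs rather than detections is what avoids duplicates: the dinosaur appears in two frames but contributes a single ID.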
@@ -691,7 +699,8 @@ FINALIZE_PLAN = """
 4. Specifically call out the tools used and the order in which they were used. Only include tools obtained from calling `get_tool_for_task`.
 5. Do not include {excluded_tools} tools in your instructions.
 6. Add final instructions for visualizing the output with `overlay_bounding_boxes` or `overlay_segmentation_masks` and saving it to a file with `save_file` or `save_video`.
-6. Respond in the following format with JSON surrounded by <json> tags and code surrounded by <code> tags:
+7. Use the default FPS for extracting frames from videos unless otherwise specified by the user.
+8. Respond in the following format with JSON surrounded by <json> tags and code surrounded by <code> tags:
 <json>
 {{

vision_agent/agent/vision_agent_v2.py CHANGED

@@ -1,13 +1,14 @@
 import copy
 import json
 from pathlib import Path
-from typing import Any, Callable, Dict, List, Optional, Union, cast
+from typing import Any, Callable, Dict, List, Optional, Tuple, Union, cast
 from vision_agent.agent import Agent, AgentCoder, VisionAgentCoderV2
 from vision_agent.agent.agent_utils import (
     add_media_to_chat,
     convert_message_to_agentmessage,
     extract_tag,
+    format_conversation,
 )
 from vision_agent.agent.types import (
     AgentMessage,
@@ -22,19 +23,6 @@ from vision_agent.lmm.types import Message
 from vision_agent.utils.execute import CodeInterpreter, CodeInterpreterFactory
-def format_conversation(chat: List[AgentMessage]) -> str:
-    chat = copy.deepcopy(chat)
-    prompt = ""
-    for chat_i in chat:
-        if chat_i.role == "user":
-            prompt += f"USER: {chat_i.content}\n\n"
-        elif chat_i.role == "observation" or chat_i.role == "coder":
-            prompt += f"OBSERVATION: {chat_i.content}\n\n"
-        elif chat_i.role == "conversation":
-            prompt += f"AGENT: {chat_i.content}\n\n"
-    return prompt
 def run_conversation(agent: LMM, chat: List[AgentMessage]) -> str:
     # only keep last 10 messages
     conv = format_conversation(chat[-10:])
@@ -55,23 +43,39 @@ def check_for_interaction(chat: List[AgentMessage]) -> bool:
 def extract_conversation_for_generate_code(
     chat: List[AgentMessage],
-) -> List[AgentMessage]:
+) -> Tuple[List[AgentMessage], Optional[str]]:
     chat = copy.deepcopy(chat)
     # if we are in the middle of an interaction, return all the intermediate planning
     # steps
     if check_for_interaction(chat):
-        return chat
+        return chat, None
     extracted_chat = []
     for chat_i in chat:
         if chat_i.role == "user":
             extracted_chat.append(chat_i)
         elif chat_i.role == "coder":
-            if "<final_code>" in chat_i.content and "<final_test>" in chat_i.content:
+            if "<final_code>" in chat_i.content:
                 extracted_chat.append(chat_i)
-    return extracted_chat
+    # only keep the last <final_code> and <final_test>
+    final_code = None
+    extracted_chat_strip_code: List[AgentMessage] = []
+    for chat_i in reversed(extracted_chat):
+        if "<final_code>" in chat_i.content and final_code is None:
+            extracted_chat_strip_code = [chat_i] + extracted_chat_strip_code
+            final_code = extract_tag(chat_i.content, "final_code")
+            if final_code is not None:
+                test_code = extract_tag(chat_i.content, "final_test")
+                final_code += "\n" + test_code if test_code is not None else ""
+        if "<final_code>" in chat_i.content and final_code is not None:
+            continue
+        extracted_chat_strip_code = [chat_i] + extracted_chat_strip_code
+    return extracted_chat_strip_code[-5:], final_code
 def maybe_run_action(
@@ -81,7 +85,7 @@ def maybe_run_action(
     code_interpreter: Optional[CodeInterpreter] = None,
 ) -> Optional[List[AgentMessage]]:
     if action == "generate_or_edit_vision_code":
-        extracted_chat = extract_conversation_for_generate_code(chat)
+        extracted_chat, _ = extract_conversation_for_generate_code(chat)
         # there's an issue here because coder.generate_code will send it's code_context
         # to the outside user via it's update_callback, but we don't necessarily have
         # access to that update_callback here, so we re-create the message using
@@ -101,11 +105,15 @@ def maybe_run_action(
                 )
             ]
     elif action == "edit_code":
-        extracted_chat = extract_conversation_for_generate_code(chat)
+        extracted_chat, final_code = extract_conversation_for_generate_code(chat)
         plan_context = PlanContext(
             plan="Edit the latest code observed in the fewest steps possible according to the user's feedback.",
-            instructions=[],
-            code="",
+            instructions=[
+                chat_i.content
+                for chat_i in extracted_chat
+                if chat_i.role == "user" and "<final_code>" not in chat_i.content
+            ],
+            code=final_code if final_code is not None else "",
         )
         context = coder.generate_code_from_plan(
             extracted_chat, plan_context, code_interpreter=code_interpreter
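
With this change, `edit_code` receives the most recent `<final_code>` (with `<final_test>` appended when present) rather than an empty string. A simplified sketch of that extraction is shown here. It models messages as `(role, content)` tuples, uses a regex-based `extract_tag` whose behaviour is assumed rather than copied from vision-agent, and does not reproduce the message filtering or the `[-5:]` truncation in the diff.

```python
import re
from typing import List, Optional, Tuple


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Simplified stand-in for vision-agent's extract_tag (assumed behaviour)."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None


def latest_final_code(messages: List[Tuple[str, str]]) -> Optional[str]:
    """Return code (plus test, if any) from the newest message with <final_code>."""
    for role, content in reversed(messages):
        if role != "coder" or "<final_code>" not in content:
            continue
        code = extract_tag(content, "final_code")
        if code is None:
            continue  # unparseable tag: keep looking at older messages
        test = extract_tag(content, "final_test")
        return code if test is None else code + "\n" + test
    return None


messages = [
    ("user", "write code"),
    ("coder", "<final_code>v1</final_code><final_test>t1</final_test>"),
    ("user", "use a lower threshold"),
    ("coder", "<final_code>v2</final_code><final_test>t2</final_test>"),
]
result = latest_final_code(messages)
print(result)
```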

vision_agent/tools/planner_tools.py CHANGED

@@ -193,8 +193,10 @@ def get_tool_for_task(
         - Depth and pose estimation
         - Video object tracking
-    Wait until the documentation is printed to use the function so you know what the
-    input and output signatures are.
+    Only ask for one type of task at a time, for example a task needing to identify
+    text is one OCR task while needing to identify non-text objects is an OD task. Wait
+    until the documentation is printed to use the function so you know what the input
+    and output signatures are.
     Parameters:
         task: str: The task to accomplish.

vision_agent/tools/tools.py CHANGED

@@ -6,7 +6,6 @@ import tempfile
 import urllib.request
 from base64 import b64encode
 from concurrent.futures import ThreadPoolExecutor, as_completed
-from enum import Enum
 from importlib import resources
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple, Union, cast
@@ -54,6 +53,13 @@ from vision_agent.utils.video import (
     frames_to_bytes,
     video_writer,
 )
+from vision_agent.utils.video_tracking import (
+    ODModels,
+    merge_segments,
+    post_process,
+    process_segment,
+    split_frames_into_segments,
+)
 register_heif_opener()
@@ -224,12 +230,6 @@ def sam2(
     return ret["return_data"]  # type: ignore
-class ODModels(str, Enum):
-    COUNTGD = "countgd"
-    FLORENCE2 = "florence2"
-    OWLV2 = "owlv2"
 def od_sam2_video_tracking(
     od_model: ODModels,
     prompt: str,
@@ -237,105 +237,92 @@ def od_sam2_video_tracking(
     chunk_length: Optional[int] = 10,
     fine_tune_id: Optional[str] = None,
 ) -> Dict[str, Any]:
-    results: List[Optional[List[Dict[str, Any]]]] = [None] * len(frames)
+    SEGMENT_SIZE = 50
+    OVERLAP = 1  # Number of overlapping frames between segments
-    if chunk_length is None:
-        step = 1  # Process every frame
-    elif chunk_length <= 0:
-        raise ValueError("chunk_length must be a positive integer or None.")
-    else:
-        step = chunk_length  # Process frames with the specified step size
+    image_size = frames[0].shape[:2]
+    # Split frames into segments with overlap
+    segments = split_frames_into_segments(frames, SEGMENT_SIZE, OVERLAP)
+    def _apply_object_detection(  # inner method to avoid circular importing issues.
+        od_model: ODModels,
+        prompt: str,
+        segment_index: int,
+        frame_number: int,
+        fine_tune_id: str,
+        segment_frames: list,
+    ) -> tuple:
+        """
+        Applies the specified object detection model to the given image.
+        Args:
+            od_model: The object detection model to use.
+            prompt: The prompt for the object detection model.
+            segment_index: The index of the current segment.
+            frame_number: The number of the current frame.
+            fine_tune_id: Optional fine-tune ID for the model.
+            segment_frames: List of frames for the current segment.
+        Returns:
+            A tuple containing the object detection results and the name of the function used.
+        """
-    for idx in range(0, len(frames), step):
         if od_model == ODModels.COUNTGD:
-            results[idx] = countgd_object_detection(prompt=prompt, image=frames[idx])
+            segment_results = countgd_object_detection(
+                prompt=prompt, image=segment_frames[frame_number]
+            )
             function_name = "countgd_object_detection"
         elif od_model == ODModels.OWLV2:
-            results[idx] = owlv2_object_detection(
-                prompt=prompt, image=frames[idx], fine_tune_id=fine_tune_id
+            segment_results = owlv2_object_detection(
+                prompt=prompt,
+                image=segment_frames[frame_number],
+                fine_tune_id=fine_tune_id,
             )
             function_name = "owlv2_object_detection"
         elif od_model == ODModels.FLORENCE2:
-            results[idx] = florence2_object_detection(
-                prompt=prompt, image=frames[idx], fine_tune_id=fine_tune_id
+            segment_results = florence2_object_detection(
+                prompt=prompt,
+                image=segment_frames[frame_number],
+                fine_tune_id=fine_tune_id,
             )
             function_name = "florence2_object_detection"
         else:
             raise NotImplementedError(
                 f"Object detection model '{od_model}' is not implemented."
             )
-    image_size = frames[0].shape[:2]
-    def _transform_detections(
-        input_list: List[Optional[List[Dict[str, Any]]]],
-    ) -> List[Optional[Dict[str, Any]]]:
-        output_list: List[Optional[Dict[str, Any]]] = []
-        for _, frame in enumerate(input_list):
-            if frame is not None:
-                labels = [detection["label"] for detection in frame]
-                bboxes = [
-                    denormalize_bbox(detection["bbox"], image_size)
-                    for detection in frame
-                ]
-                output_list.append(
-                    {
-                        "labels": labels,
-                        "bboxes": bboxes,
-                    }
-                )
-            else:
-                output_list.append(None)
-        return output_list
+        return segment_results, function_name
+    # Process each segment and collect detections
+    detections_per_segment: List[Any] = []
+    for segment_index, segment in enumerate(segments):
+        segment_detections = process_segment(
+            segment_frames=segment,
+            od_model=od_model,
+            prompt=prompt,
+            fine_tune_id=fine_tune_id,
+            chunk_length=chunk_length,
+            image_size=image_size,
+            segment_index=segment_index,
+            object_detection_tool=_apply_object_detection,
+        )
+        detections_per_segment.append(segment_detections)
-    output = _transform_detections(results)
+    merged_detections = merge_segments(detections_per_segment)
+    post_processed = post_process(merged_detections, image_size)
     buffer_bytes = frames_to_bytes(frames)
     files = [("video", buffer_bytes)]
-    payload = {"bboxes": json.dumps(output), "chunk_length_frames": chunk_length}
-    metadata = {"function_name": function_name}
-    detections = send_task_inference_request(
-        payload,
-        "sam2",
-        files=files,
-        metadata=metadata,
-    )
-    return_data = []
-    for frame in detections:
-        return_frame_data = []
-        for detection in frame:
-            mask = rle_decode_array(detection["mask"])
-            label = str(detection["id"]) + ": " + detection["label"]
-            return_frame_data.append(
-                {"label": label, "mask": mask, "score": 1.0, "rle": detection["mask"]}
-            )
-        return_data.append(return_frame_data)
-    return_data = add_bboxes_from_masks(return_data)
-    return_data = nms(return_data, iou_threshold=0.95)
-    # We save the RLE for display purposes, re-calculting RLE can get very expensive.
-    # Deleted here because we are returning the numpy masks instead
-    display_data = []
-    for frame in return_data:
-        display_frame_data = []
-        for obj in frame:
-            display_frame_data.append(
-                {
-                    "label": obj["label"],
-                    "score": obj["score"],
-                    "bbox": denormalize_bbox(obj["bbox"], image_size),
-                    "mask": obj["rle"],
-                }
-            )
-            del obj["rle"]
-        display_data.append(display_frame_data)
-    return {"files": files, "return_data": return_data, "display_data": detections}
+    return {
+        "files": files,
+        "return_data": post_processed["return_data"],
+        "display_data": post_processed["display_data"],
+    }
 # Owl V2 Tools
@@ -528,24 +515,29 @@ def owlv2_sam2_video_tracking(
     chunk_length: Optional[int] = 10,
     fine_tune_id: Optional[str] = None,
 ) -> List[List[Dict[str, Any]]]:
-    """'owlv2_sam2_video_tracking' is a tool that can segment multiple objects given a text
-    prompt such as category names or referring expressions. The categories in the text
-    prompt are separated by commas. It returns a list of bounding boxes, label names,
-    mask file names and associated probability scores.
+    """'owlv2_sam2_video_tracking' is a tool that can track and segment multiple
+    objects in a video given a text prompt such as category names or referring
+    expressions. The categories in the text prompt are separated by commas. It returns
+    a list of bounding boxes, label names, masks and associated probability scores and
+    is useful for tracking and counting without duplicating counts.
     Parameters:
         prompt (str): The prompt to ground to the image.
-        image (np.ndarray): The image to ground the prompt to.
+        frames (List[np.ndarray]): The list of frames to ground the prompt to.
+        chunk_length (Optional[int]): The number of frames to re-run owlv2 to find
+            new objects.
         fine_tune_id (Optional[str]): If you have a fine-tuned model, you can pass the
             fine-tuned model ID here to use it.
     Returns:
-        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
-            bounding box, and mask of the detected objects with normalized coordinates
-            (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
-            and xmax and ymax are the coordinates of the bottom-right of the bounding box.
-            The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
-            the background.
+        List[List[Dict[str, Any]]]: A list of list of dictionaries containing the
+            label, segmentation mask and bounding boxes. The outer list represents each
+            frame and the inner list is the entities per frame. The detected objects
+            have normalized coordinates between 0 and 1 (xmin, ymin, xmax, ymax). xmin
+            and ymin are the coordinates of the top-left and xmax and ymax are the
+            coordinates of the bottom-right of the bounding box. The mask is binary 2D
+            numpy array where 1 indicates the object and 0 indicates the background.
+            The label names are prefixed with their ID, which represents the total count.
     Example
     -------
@@ -755,11 +747,11 @@ def florence2_sam2_video_tracking(
     chunk_length: Optional[int] = 10,
     fine_tune_id: Optional[str] = None,
 ) -> List[List[Dict[str, Any]]]:
-    """'florence2_sam2_video_tracking' is a tool that can segment and track multiple
-    entities in a video given a text prompt such as category names or referring
-    expressions. You can optionally separate the categories in the text with commas. It
-    can find new objects every 'chunk_length' frames and is useful for tracking and
-    counting without duplicating counts and always outputs scores of 1.0.
+    """'florence2_sam2_video_tracking' is a tool that can track and segment multiple
+    objects in a video given a text prompt such as category names or referring
+    expressions. The categories in the text prompt are separated by commas. It returns
+    a list of bounding boxes, label names, masks and associated probability scores and
+    is useful for tracking and counting without duplicating counts.
     Parameters:
         prompt (str): The prompt to ground to the video.
@@ -771,10 +763,13 @@ def florence2_sam2_video_tracking(
     Returns:
         List[List[Dict[str, Any]]]: A list of list of dictionaries containing the
-        label, segment mask and bounding boxes. The outer list represents each frame
-        and the inner list is the entities per frame. The label contains the object ID
-        followed by the label name. The objects are only identified in the first framed
-        and tracked throughout the video.
+            label, segmentation mask and bounding boxes. The outer list represents each
+            frame and the inner list is the entities per frame. The detected objects
+            have normalized coordinates between 0 and 1 (xmin, ymin, xmax, ymax). xmin
+            and ymin are the coordinates of the top-left and xmax and ymax are the
+            coordinates of the bottom-right of the bounding box. The mask is binary 2D
+            numpy array where 1 indicates the object and 0 indicates the background.
+            The label names are prefixed with their ID, which represents the total count.
     Example
     -------
@@ -1089,24 +1084,27 @@ def countgd_sam2_video_tracking(
     frames: List[np.ndarray],
     chunk_length: Optional[int] = 10,
 ) -> List[List[Dict[str, Any]]]:
-    """'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text
-    prompt such as category names or referring expressions. The categories in the text
-    prompt are separated by commas. It returns a list of bounding boxes, label names,
-    mask file names and associated probability scores.
+    """'countgd_sam2_video_tracking' is a tool that can track and segment multiple
+    objects in a video given a text prompt such as category names or referring
+    expressions. The categories in the text prompt are separated by commas. It returns
+    a list of bounding boxes, label names, masks and associated probability scores and
+    is useful for tracking and counting without duplicating counts.
     Parameters:
         prompt (str): The prompt to ground to the image.
-        image (np.ndarray): The image to ground the prompt to.
-        chunk_length (Optional[int]): The number of frames to re-run florence2 to find
+        frames (List[np.ndarray]): The list of frames to ground the prompt to.
+        chunk_length (Optional[int]): The number of frames to re-run countgd to find
             new objects.
     Returns:
-        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
-            bounding box, and mask of the detected objects with normalized coordinates
-            (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
-            and xmax and ymax are the coordinates of the bottom-right of the bounding box.
-            The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
-            the background.
+        List[List[Dict[str, Any]]]: A list of list of dictionaries containing the
+            label, segmentation mask and bounding boxes. The outer list represents each
+            frame and the inner list is the entities per frame. The detected objects
+            have normalized coordinates between 0 and 1 (xmin, ymin, xmax, ymax). xmin
+            and ymin are the coordinates of the top-left and xmax and ymax are the
+            coordinates of the bottom-right of the bounding box. The mask is binary 2D
+            numpy array where 1 indicates the object and 0 indicates the background.
+            The label names are prefixed with their ID, which represents the total count.
     Example
     -------
@@ -1546,7 +1544,7 @@ def video_temporal_localization(
         prompt (str): The question about the video
         frames (List[np.ndarray]): The reference frames used for the question
         model (str): The model to use for the inference. Valid values are
-            'qwen2vl', 'gpt4o', 'internlm-xcomposer'
+            'qwen2vl', 'gpt4o'.
         chunk_length_frames (Optional[int]): length of each chunk in frames
     Returns:
@@ -2115,7 +2113,7 @@ def closest_box_distance(
 def extract_frames_and_timestamps(
-    video_uri: Union[str, Path], fps: float = 1
+    video_uri: Union[str, Path], fps: float = 5
 ) -> List[Dict[str, Union[np.ndarray, float]]]:
     """'extract_frames_and_timestamps' extracts frames and timestamps from a video
     which can be a file path, url or youtube link, returns a list of dictionaries
@@ -2126,7 +2124,7 @@ def extract_frames_and_timestamps(
     Parameters:
         video_uri (Union[str, Path]): The path to the video file, url or youtube link
         fps (float, optional): The frame rate per second to extract the frames. Defaults
-            to 1.
+            to 5.
     Returns:
         List[Dict[str, Union[np.ndarray, float]]]: A list of dictionaries containing the
@@ -2649,10 +2647,8 @@ FUNCTION_TOOLS = [
     ocr,
     qwen2_vl_images_vqa,
     qwen2_vl_video_vqa,
-    detr_segmentation,
     depth_anything_v2,
     generate_pose_image,
-    vit_image_classification,
     vit_nsfw_classification,
     video_temporal_localization,
     flux_image_inpainting,
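
The reworked `od_sam2_video_tracking` in this file splits the frames into segments of `SEGMENT_SIZE = 50` with `OVERLAP = 1` and later joins the per-segment results back into one list. The following is a minimal sketch of that windowing on plain integers instead of frames. The names `split_with_overlap` and `rejoin` are hypothetical and are not part of the package; `rejoin` takes an `overlap` argument as an assumption, whereas the `_convert_to_2d` helper in the diff always drops exactly one leading entry.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_with_overlap(
    items: Sequence[T], segment_size: int = 50, overlap: int = 1
) -> List[List[T]]:
    """Window `items` so every segment after the first starts with the
    last `overlap` items of the previous segment (illustrative helper)."""
    segments: List[List[T]] = []
    start = 0
    while start < len(items):
        end = min(start + segment_size, len(items))
        first = start if start == 0 else start - overlap
        segments.append(list(items[first:end]))
        start += segment_size
    return segments


def rejoin(segments: List[List[T]], overlap: int = 1) -> List[T]:
    """Concatenate segments, dropping the overlapping prefix of each
    segment after the first (illustrative helper)."""
    merged: List[T] = []
    for index, segment in enumerate(segments):
        merged.extend(segment if index == 0 else segment[overlap:])
    return merged


frames = list(range(7))  # stand-in for seven video frames
segments = split_with_overlap(frames, segment_size=3, overlap=1)
print(segments)  # [[0, 1, 2], [2, 3, 4, 5], [5, 6]]
print(rejoin(segments) == frames)  # True
```

Note that under this scheme the first segment holds `segment_size` items while later ones hold up to `segment_size + overlap`, because the shared frame is prepended rather than carved out of the window.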

vision_agent/utils/sim.py CHANGED

@@ -133,6 +133,12 @@ class Sim:
         df: pd.DataFrame,
     ) -> bool:
         load_dir = Path(load_dir)
+        if (
+            not Path(load_dir / "df.csv").exists()
+            or not Path(load_dir / "embs.npy").exists()
+        ):
+            return False
         df_load = pd.read_csv(load_dir / "df.csv")
         if platform.system() == "Windows":
             df_load["doc"] = df_load["doc"].apply(lambda x: x.replace("\r", ""))
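
The guard added to `Sim` above returns `False` early when either cache file is missing, instead of letting `pd.read_csv` raise. A minimal, standalone sketch of the same check follows; `cache_files_present` is a hypothetical name used only for illustration, and the file contents written here are placeholders rather than real cache data.

```python
import tempfile
from pathlib import Path


def cache_files_present(load_dir) -> bool:
    """Return True only if both cache files exist (illustrative helper)."""
    load_dir = Path(load_dir)
    return (load_dir / "df.csv").exists() and (load_dir / "embs.npy").exists()


with tempfile.TemporaryDirectory() as tmp:
    missing = cache_files_present(tmp)  # neither file exists yet
    (Path(tmp) / "df.csv").write_text("desc,doc,name\n")
    partial = cache_files_present(tmp)  # only df.csv exists
    (Path(tmp) / "embs.npy").write_bytes(b"")
    complete = cache_files_present(tmp)  # both files exist

print(missing, partial, complete)  # False False True
```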

vision_agent/utils/video_tracking.py ADDED

@@ -0,0 +1,305 @@
+import json
+from enum import Enum
+from typing import Any, Callable, Dict, List, Optional, Tuple
+import numpy as np
+from vision_agent.tools.tool_utils import (
+    add_bboxes_from_masks,
+    nms,
+    send_task_inference_request,
+)
+from vision_agent.utils.image_utils import denormalize_bbox, rle_decode_array
+from vision_agent.utils.video import frames_to_bytes
+class ODModels(str, Enum):
+    COUNTGD = "countgd"
+    FLORENCE2 = "florence2"
+    OWLV2 = "owlv2"
+def split_frames_into_segments(
+    frames: List[np.ndarray], segment_size: int = 50, overlap: int = 1
+) -> List[List[np.ndarray]]:
+    """
+    Splits the list of frames into segments with a specified size and overlap.
+    Args:
+        frames (List[np.ndarray]): List of video frames.
+        segment_size (int, optional): Number of frames per segment. Defaults to 50.
+        overlap (int, optional): Number of overlapping frames between segments. Defaults to 1.
+    Returns:
+        List[List[np.ndarray]]: List of frame segments.
+    """
+    segments = []
+    start = 0
+    segment_count = 0
+    while start < len(frames):
+        end = start + segment_size
+        if end > len(frames):
+            end = len(frames)
+        if start != 0:
+            # Include the last frame of the previous segment
+            segment = frames[start - overlap : end]
+        else:
+            segment = frames[start:end]
+        segments.append(segment)
+        start += segment_size
+        segment_count += 1
+    return segments
+def process_segment(
+    segment_frames: List[np.ndarray],
+    od_model: ODModels,
+    prompt: str,
+    fine_tune_id: Optional[str],
+    chunk_length: Optional[int],
+    image_size: Tuple[int, ...],
+    segment_index: int,
+    object_detection_tool: Callable,
+) -> Any:
+    """
+    Processes a segment of frames with the specified object detection model.
+    Args:
+        segment_frames (List[np.ndarray]): Frames in the segment.
+        od_model (ODModels): Object detection model to use.
+        prompt (str): Prompt for the model.
+        fine_tune_id (Optional[str]): Fine-tune model ID.
+        chunk_length (Optional[int]): Chunk length for processing.
+        image_size (Tuple[int, int]): Size of the images.
+        segment_index (int): Index of the segment.
+        object_detection_tool (Callable): Object detection tool to use.
+    Returns:
+       Any: Detections for the segment.
+    """
+    segment_results: List[Optional[List[Dict[str, Any]]]] = [None] * len(segment_frames)
+    if chunk_length is None:
+        step = 1
+    elif chunk_length <= 0:
+        raise ValueError("chunk_length must be a positive integer or None.")
+    else:
+        step = chunk_length
+    function_name = ""
+    for idx in range(0, len(segment_frames), step):
+        frame_number = idx
+        segment_results[idx], function_name = object_detection_tool(
+            od_model, prompt, segment_index, frame_number, fine_tune_id, segment_frames
+        )
+    transformed_detections = transform_detections(
+        segment_results, image_size, segment_index
+    )
+    buffer_bytes = frames_to_bytes(segment_frames)
+    files = [("video", buffer_bytes)]
+    payload = {
+        "bboxes": json.dumps(transformed_detections),
+        "chunk_length_frames": chunk_length,
+    }
+    metadata = {"function_name": function_name}
+    segment_detections = send_task_inference_request(
+        payload,
+        "sam2",
+        files=files,
+        metadata=metadata,
+    )
+    return segment_detections
+def transform_detections(
+    input_list: List[Optional[List[Dict[str, Any]]]],
+    image_size: Tuple[int, ...],
+    segment_index: int,
+) -> List[Optional[Dict[str, Any]]]:
+    """
+    Transforms raw detections into a standardized format.
+    Args:
+        input_list (List[Optional[List[Dict[str, Any]]]]): Raw detections.
+        image_size (Tuple[int, int]): Size of the images.
+        segment_index (int): Index of the segment.
+    Returns:
+        List[Optional[Dict[str, Any]]]: Transformed detections.
+    """
+    output_list: List[Optional[Dict[str, Any]]] = []
+    for frame_idx, frame in enumerate(input_list):
+        if frame is not None:
+            labels = [detection["label"] for detection in frame]
+            bboxes = [
+                denormalize_bbox(detection["bbox"], image_size) for detection in frame
+            ]
+            output_list.append(
+                {
+                    "labels": labels,
+                    "bboxes": bboxes,
+                }
+            )
+        else:
+            output_list.append(None)
+    return output_list
+def _calculate_mask_iou(mask1: np.ndarray, mask2: np.ndarray) -> float:
+    mask1 = mask1.astype(bool)
+    mask2 = mask2.astype(bool)
+    intersection = np.sum(np.logical_and(mask1, mask2))
+    union = np.sum(np.logical_or(mask1, mask2))
+    if union == 0:
+        iou = 0.0
+    else:
+        iou = intersection / union
+    return iou
+def _match_by_iou(
+    first_param: List[Dict],
+    second_param: List[Dict],
+    iou_threshold: float = 0.8,
+) -> Tuple[List[Dict], Dict[int, int]]:
+    max_id = max((item["id"] for item in first_param), default=0)
+    matched_new_item_indices = set()
+    id_mapping = {}
+    for new_index, new_item in enumerate(second_param):
+        matched_id = None
+        for existing_item in first_param:
+            iou = _calculate_mask_iou(
+                existing_item["decoded_mask"], new_item["decoded_mask"]
+            )
+            if iou > iou_threshold:
+                matched_id = existing_item["id"]
+                matched_new_item_indices.add(new_index)
+                id_mapping[new_item["id"]] = matched_id
+                break
+        if matched_id:
+            new_item["id"] = matched_id
+        else:
+            max_id += 1
+            id_mapping[new_item["id"]] = max_id
+            new_item["id"] = max_id
+    unmatched_items = [
+        item for i, item in enumerate(second_param) if i not in matched_new_item_indices
+    ]
+    combined_list = first_param + unmatched_items
+    return combined_list, id_mapping
+def _update_ids(detections: List[Dict], id_mapping: Dict[int, int]) -> None:
+    for inner_list in detections:
+        for detection in inner_list:
+            if detection["id"] in id_mapping:
+                detection["id"] = id_mapping[detection["id"]]
+            else:
+                max_new_id = max(id_mapping.values(), default=0)
+                detection["id"] = max_new_id + 1
+                id_mapping[detection["id"]] = detection["id"]
+def _convert_to_2d(detections_per_segment: List[Any]) -> List[Any]:
+    result = []
+    for i, segment in enumerate(detections_per_segment):
+        if i == 0:
+            result.extend(segment)
+        else:
+            result.extend(segment[1:])
+    return result
+def merge_segments(detections_per_segment: List[Any]) -> List[Any]:
+    """
+    Merges detections from all segments into a unified result.
+    Args:
+        detections_per_segment (List[Any]): List of detections per segment.
+    Returns:
+        List[Any]: Merged detections.
+    """
+    for segment in detections_per_segment:
+        for detection in segment:
+            for item in detection:
+                item["decoded_mask"] = rle_decode_array(item["mask"])
+    for segment_idx in range(len(detections_per_segment) - 1):
+        combined_detection, id_mapping = _match_by_iou(
+            detections_per_segment[segment_idx][-1],
+            detections_per_segment[segment_idx + 1][0],
+        )
+        _update_ids(detections_per_segment[segment_idx + 1], id_mapping)
+    merged_result = _convert_to_2d(detections_per_segment)
+    return merged_result
+def post_process(
+    merged_detections: List[Any],
+    image_size: Tuple[int, ...],
+) -> Dict[str, Any]:
+    """
+    Performs post-processing on merged detections, including NMS and preparing display data.
+    Args:
+        merged_detections (List[Any]): Merged detections from all segments.
+        image_size (Tuple[int, int]): Size of the images.
+    Returns:
+        Dict[str, Any]: Post-processed data including return_data and display_data.
+    """
+    return_data = []
+    for frame_idx, frame in enumerate(merged_detections):
+        return_frame_data = []
+        for detection in frame:
+            label = f"{detection['id']}: {detection['label']}"
+            return_frame_data.append(
+                {
+                    "label": label,
+                    "mask": detection["decoded_mask"],
+                    "rle": detection["mask"],
+                    "score": 1.0,
+                }
+            )
+            del detection["decoded_mask"]
+        return_data.append(return_frame_data)
+    return_data = add_bboxes_from_masks(return_data)
+    return_data = nms(return_data, iou_threshold=0.95)
+    # We save the RLE for display purposes, re-calculating RLE can get very expensive.
+    # Deleted here because we are returning the numpy masks instead
+    display_data = []
+    for frame in return_data:
+        display_frame_data = []
+        for obj in frame:
+            display_frame_data.append(
+                {
+                    "label": obj["label"],
+                    "bbox": denormalize_bbox(obj["bbox"], image_size),
+                    "mask": obj["rle"],
+                    "score": obj["score"],
+                }
+            )
+            del obj["rle"]
+        display_data.append(display_frame_data)
+    return {"return_data": return_data, "display_data": display_data}
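
The merge step in this file keeps object IDs stable across segment boundaries by comparing masks on the shared frame: a new object whose mask IoU with an existing object exceeds the threshold inherits that object's ID, and anything else receives a fresh ID. Below is a simplified sketch of that idea, not the package's implementation. As assumptions for illustration, masks are represented as sets of `(row, col)` pixels rather than binary numpy arrays, the helper names `mask_iou` and `match_ids` are hypothetical, the input items are not mutated, and an unmatched object is detected with an explicit `is None` check (the diff uses a truthiness test on the matched ID instead).

```python
from typing import Any, Dict, List, Optional, Set, Tuple

Pixel = Tuple[int, int]


def mask_iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets; 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def match_ids(
    prev_frame: List[Dict[str, Any]],
    next_frame: List[Dict[str, Any]],
    iou_threshold: float = 0.8,
) -> Dict[int, int]:
    """Map each ID in next_frame to a matching ID from prev_frame,
    or to a fresh ID when no mask overlaps enough."""
    max_id = max((obj["id"] for obj in prev_frame), default=0)
    mapping: Dict[int, int] = {}
    for new_obj in next_frame:
        matched_id: Optional[int] = None
        for old_obj in prev_frame:
            if mask_iou(old_obj["mask"], new_obj["mask"]) > iou_threshold:
                matched_id = old_obj["id"]
                break
        if matched_id is None:
            max_id += 1  # no match: allocate the next unused ID
            matched_id = max_id
        mapping[new_obj["id"]] = matched_id
    return mapping


square = {(r, c) for r in range(3) for c in range(3)}
far_away = {(r, c) for r in range(10, 13) for c in range(10, 13)}

prev_frame = [{"id": 4, "mask": square}]
next_frame = [{"id": 0, "mask": set(square)}, {"id": 1, "mask": far_away}]
print(match_ids(prev_frame, next_frame))  # {0: 4, 1: 5}
```

The returned mapping plays the role of `id_mapping` in the diff: it can then be applied to every frame of the later segment so that a tracked object keeps one ID for the whole video.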

{vision_agent-0.2.224.dist-info → vision_agent-0.2.226.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: vision-agent
-Version: 0.2.224
+Version: 0.2.226
 Summary: Toolset for Vision Agent
 Author: Landing AI
 Author-email: dev@landing.ai

{vision_agent-0.2.224.dist-info → vision_agent-0.2.226.dist-info}/RECORD RENAMED

@@ -1,23 +1,23 @@
-vision_agent/.sim_tools/df.csv,sha256=1cpUFKN48Iuq6HvaG5OhbHs2RghESicb3ouKVTLhm-s,43360
-vision_agent/.sim_tools/embs.npy,sha256=Nji50P_8aV0hhyxp-Kfh_YmXAFWTHSwwBGrNJWitHPU,270464
+vision_agent/.sim_tools/df.csv,sha256=Vamicw8MiSGildK1r3-HXY4cKiq17GZxsgBsHbk7jpM,42158
+vision_agent/.sim_tools/embs.npy,sha256=YJe8EcKVNmeX_75CS2T1sbY-sUS_1HQAMT-34zc18a0,254080
 vision_agent/__init__.py,sha256=EAb4-f9iyuEYkBrX4ag1syM8Syx8118_t0R6_C34M9w,57
 vision_agent/agent/README.md,sha256=Q4w7FWw38qaWosQYAZ7NqWx8Q5XzuWrlv7nLhjUd1-8,5527
 vision_agent/agent/__init__.py,sha256=M8CffavdIh8Zh-skznLHIaQkYGCGK7vk4dq1FaVkbs4,617
 vision_agent/agent/agent.py,sha256=_1tHWAs7Jm5tqDzEcPfCRvJV3uRRveyh4n9_9pd6I1w,1565
-vision_agent/agent/agent_utils.py,sha256=NmrqjhSb6fpnrB8XGWtaywZjr9n89otusOZpcbWLf9k,13534
+vision_agent/agent/agent_utils.py,sha256=pP4u5tiami7C3ChgjgYLqJITnmkTI1_GsUj6g5czSRk,13994
 vision_agent/agent/types.py,sha256=DkFm3VMMrKlhYyfxEmZx4keppD72Ov3wmLCbM2J2o10,2437
 vision_agent/agent/vision_agent.py,sha256=I75bEU-os9Lf9OSICKfvQ_H_ftg-zOwgTwWnu41oIdo,23555
 vision_agent/agent/vision_agent_coder.py,sha256=flUxOibyGZK19BCSK5mhaD3HjCxHw6c6FtKom6N2q1E,27359
 vision_agent/agent/vision_agent_coder_prompts.py,sha256=gPLVXQMNSzYnQYpNm0wlH_5FPkOTaFDV24bqzK3jQ40,12221
-vision_agent/agent/vision_agent_coder_prompts_v2.py,sha256=9v5HwbNidSzYUEFl6ZMniWWOmyLITM_moWLtKVaTen8,4845
-vision_agent/agent/vision_agent_coder_v2.py,sha256=G3I8O89gzE2VczQGPWV149aYaOjbbfB1lmgGuwFWvo4,16118
+vision_agent/agent/vision_agent_coder_prompts_v2.py,sha256=idmSMfxebPULqqvllz3gqRzGDchEvS5dkGngvBs4PGo,4872
+vision_agent/agent/vision_agent_coder_v2.py,sha256=i1qgXp5YsWVRoA_qO429Ef-aKZBakveCl1F_2ZbSzk8,16287
 vision_agent/agent/vision_agent_planner.py,sha256=fFzjNkZBKkh8Y_oS06ATI4qz31xmIJvixb_tV1kX8KA,18590
 vision_agent/agent/vision_agent_planner_prompts.py,sha256=mn9NlZpRkW4XAvlNuMZwIs1ieHCFds5aYZJ55WXupZY,6733
-vision_agent/agent/vision_agent_planner_prompts_v2.py,sha256=lzfJFvBYW_-Ue4OevgljI8bAQxgKC4Rdv5SmP6UsAxE,34102
+vision_agent/agent/vision_agent_planner_prompts_v2.py,sha256=YgemW2PRPYd8o8XpmwSJBUOJSxMUXMNr2DZNQnS4jEI,34988
 vision_agent/agent/vision_agent_planner_v2.py,sha256=vvxfmGydBIKB8CtNSAJyPvdEXkG7nIO5-Hs2SjNc48Y,20465
 vision_agent/agent/vision_agent_prompts.py,sha256=NtGdCfzzilCRtscKALC9FK55d1h4CBpMnbhLzg0PYlc,13772
 vision_agent/agent/vision_agent_prompts_v2.py,sha256=-vCWat-ARlCOOOeIDIFhg-kcwRRwjTXYEwsvvqPeaCs,1972
-vision_agent/agent/vision_agent_v2.py,sha256=6gGVV3FlL4NLzHRpjMqMz-fEP6f_JhwwOjUKczZ3TPA,10231
+vision_agent/agent/vision_agent_v2.py,sha256=1wu_vH_onic2kLYPKW2nAF2e6Zz5vmUt5Acv4Seq3sQ,10796
 vision_agent/clients/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 vision_agent/clients/http.py,sha256=k883i6M_4nl7zwwHSI-yP5sAgQZIDPM1nrKD6YFJ3Xs,2009
 vision_agent/clients/landing_public_api.py,sha256=lU2ev6E8NICmR8DMUljuGcVFy5VNJQ4WQkWC8WnnJEc,1503
@@ -28,19 +28,20 @@ vision_agent/lmm/lmm.py,sha256=x_nIyDNDZwq4-pfjnJTmcyyJZ2_B7TjkA5jZp88YVO8,17103
 vision_agent/lmm/types.py,sha256=ZEXR_ptBL0ZwDMTDYkgxUCmSZFmBYPQd2jreNzr_8UY,221
 vision_agent/tools/__init__.py,sha256=15O7eQVn0bitmzUO5OxKdA618PoiLt6Z02gmKsSNMFM,2765
 vision_agent/tools/meta_tools.py,sha256=TPeS7QWnc_PmmU_ndiDT03dXbQ5yDSP33E7U8cSj7Ls,28660
-vision_agent/tools/planner_tools.py,sha256=CvaJ2vGM8O_CYvsoSk1avxAMqpIu3tv4C2bY0p1X-X4,13519
+vision_agent/tools/planner_tools.py,sha256=qQvPuCif-KbFi7KsXKkTCfpgEQEJJ6oq6WB3gOuG2Xg,13686
 vision_agent/tools/prompts.py,sha256=V1z4YJLXZuUl_iZ5rY0M5hHc_2tmMEUKr0WocXKGt4E,1430
 vision_agent/tools/tool_utils.py,sha256=q9cqXO2AvigUdO1krjnOy8o0goYhgS6eILl6-F5Kxyk,10211
-vision_agent/tools/tools.py,sha256=60S5ItFG9yKzVb8FU8oLFj_aouDg2-4vlieDbSgfPdQ,91306
+vision_agent/tools/tools.py,sha256=zqoo4ml9ZS99kOeOIN6Zplq7pxOwBrVZKKFUVIzsjfw,91712
 vision_agent/tools/tools_types.py,sha256=8hYf2OZhI58gvf65KGaeGkt4EQ56nwLFqIQDPHioOBc,2339
 vision_agent/utils/__init__.py,sha256=QKk4zVjMwGxQI0MQ-aZZA50N-qItxRY4EB9CwQkZ2HY,185
 vision_agent/utils/exceptions.py,sha256=booSPSuoULF7OXRr_YbC4dtKt6gM_HyiFQHBuaW86C4,2052
 vision_agent/utils/execute.py,sha256=vOEP5Ys7S2lc0_7pOJbgk7OaWi85hrCNu9_8Bo3zk6I,29356
 vision_agent/utils/image_utils.py,sha256=z_ONgcza125B10NkoGwPOzXnL470bpTWZbkB16NeeH0,12188
-vision_agent/utils/sim.py,sha256=znsInUDrsyBi3OlgAlV3rDn5UQQRfJAWXTXm7D7eJA8,9125
+vision_agent/utils/sim.py,sha256=qr-6UWAxxGwtwIAKZjZCY_pu9VwBI_TTB8bfrGsaABg,9282
 vision_agent/utils/type_defs.py,sha256=BE12s3JNQy36QvauXHjwyeffVh5enfcvd4vTzSwvEZI,1384
 vision_agent/utils/video.py,sha256=e1VwKhXzzlC5LcFMyrcQYrPnpnX4wxDpnQ-76sB4jgM,6001
-vision_agent-0.2.224.dist-info/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
-vision_agent-0.2.224.dist-info/METADATA,sha256=wT49_byW9-Oz6-1eSlP3cW_AFGbWaxtKrYsGB4nT62o,20039
-vision_agent-0.2.224.dist-info/WHEEL,sha256=7Z8_27uaHI_UZAc4Uox4PpBhQ9Y5_modZXWMxtUi4NU,88
-vision_agent-0.2.224.dist-info/RECORD,,
+vision_agent/utils/video_tracking.py,sha256=EeOiSY8gjvvneuAnv-BO7yOyMBF_-1Irk_lLLOt3bDM,9452
+vision_agent-0.2.226.dist-info/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
+vision_agent-0.2.226.dist-info/METADATA,sha256=_7jZokNbQLK6Ups2psyRKbPDjUIzU3daxCpfrHZ6gSU,20039
+vision_agent-0.2.226.dist-info/WHEEL,sha256=7Z8_27uaHI_UZAc4Uox4PpBhQ9Y5_modZXWMxtUi4NU,88
+vision_agent-0.2.226.dist-info/RECORD,,

{vision_agent-0.2.224.dist-info → vision_agent-0.2.226.dist-info}/LICENSE RENAMED

File without changes

{vision_agent-0.2.224.dist-info → vision_agent-0.2.226.dist-info}/WHEEL RENAMED

File without changes
