PyPI - vision-agent - Versions diffs - 0.2.231__py3-none-any.whl → 0.2.233__py3-none-any.whl - Mend

vision-agent 0.2.231-py3-none-any.whl → 0.2.233-py3-none-any.whl


Files changed (14)

vision_agent/.sim_tools/df.csv +12 -10
vision_agent/agent/agent_utils.py +5 -3
vision_agent/agent/vision_agent_coder_prompts_v2.py +1 -1
vision_agent/agent/vision_agent_coder_v2.py +16 -2
vision_agent/agent/vision_agent_planner_prompts_v2.py +138 -71
vision_agent/agent/vision_agent_planner_v2.py +1 -0
vision_agent/agent/vision_agent_prompts_v2.py +4 -1
vision_agent/agent/vision_agent_v2.py +11 -7
vision_agent/tools/planner_tools.py +33 -13
vision_agent/tools/tools.py +44 -18
{vision_agent-0.2.231.dist-info → vision_agent-0.2.233.dist-info}/METADATA +1 -1
{vision_agent-0.2.231.dist-info → vision_agent-0.2.233.dist-info}/RECORD +14 -14
{vision_agent-0.2.231.dist-info → vision_agent-0.2.233.dist-info}/LICENSE +0 -0
{vision_agent-0.2.231.dist-info → vision_agent-0.2.233.dist-info}/WHEEL +0 -0

vision_agent/.sim_tools/df.csv CHANGED

@@ -514,7 +514,7 @@ desc,doc,name
         >>> vit_nsfw_classification(image)
         {""label"": ""normal"", ""scores"": 0.68},
     ",vit_nsfw_classification
-'video_temporal_localization' will run qwen2vl on each chunk_length_frames value selected for the video. It can detect multiple objects independently per chunk_length_frames given a text prompt such as a referring expression but does not track objects across frames. It returns a list of floats with a value of 1.0 if the objects are found in a given chunk_length_frames of the video.,"video_temporal_localization(prompt: str, frames: List[numpy.ndarray], model: str = 'qwen2vl', chunk_length_frames: Optional[int] = 2) -> List[float]:
+'video_temporal_localization' will run qwen2vl on each chunk_length_frames value selected for the video. It can detect multiple objects independently per chunk_length_frames given a text prompt such as a referring expression but does not track objects across frames. It returns a list of floats with a value of 1.0 if the objects are found in a given chunk_length_frames of the video.,"video_temporal_localization(prompt: str, frames: List[numpy.ndarray], model: str = 'qwen2vl', chunk_length_frames: int = 2) -> List[float]:
 'video_temporal_localization' will run qwen2vl on each chunk_length_frames
     value selected for the video. It can detect multiple objects independently per
     chunk_length_frames given a text prompt such as a referring expression
@@ -527,7 +527,7 @@ desc,doc,name
         frames (List[np.ndarray]): The reference frames used for the question
         model (str): The model to use for the inference. Valid values are
             'qwen2vl', 'gpt4o'.
-        chunk_length_frames (Optional[int]): length of each chunk in frames
+        chunk_length_frames (int): length of each chunk in frames
     Returns:
         List[float]: A list of floats with a value of 1.0 if the objects to be found
@@ -540,16 +540,18 @@ desc,doc,name
     ",video_temporal_localization
 "'flux_image_inpainting' performs image inpainting to fill the masked regions, given by mask, in the image, given image based on the text prompt and surrounding image context. It can be used to edit regions of an image according to the prompt given.","flux_image_inpainting(prompt: str, image: numpy.ndarray, mask: numpy.ndarray) -> numpy.ndarray:
 'flux_image_inpainting' performs image inpainting to fill the masked regions,
-    given by mask, in the image, given image based on the text prompt and surrounding image context.
-    It can be used to edit regions of an image according to the prompt given.
+    given by mask, in the image, given image based on the text prompt and surrounding
+    image context. It can be used to edit regions of an image according to the prompt
+    given.
     Parameters:
         prompt (str): A detailed text description guiding what should be generated
-            in the masked area. More detailed and specific prompts typically yield better results.
-        image (np.ndarray): The source image to be inpainted.
-            The image will serve as the base context for the inpainting process.
-        mask (np.ndarray): A binary mask image with 0's and 1's,
-            where 1 indicates areas to be inpainted and 0 indicates areas to be preserved.
+            in the masked area. More detailed and specific prompts typically yield
+            better results.
+        image (np.ndarray): The source image to be inpainted. The image will serve as
+            the base context for the inpainting process.
+        mask (np.ndarray): A binary mask image with 0's and 1's, where 1 indicates
+            areas to be inpainted and 0 indicates areas to be preserved.
     Returns:
         np.ndarray: The generated image(s) as a numpy array in RGB format with values
@@ -658,7 +660,7 @@ desc,doc,name
     -------
         >>> save_image(image)
     ",save_image
-'save_video' is a utility function that saves a list of frames as a mp4 video file on disk.,"save_video(frames: List[numpy.ndarray], output_video_path: Optional[str] = None, fps: float = 1) -> str:
+'save_video' is a utility function that saves a list of frames as a mp4 video file on disk.,"save_video(frames: List[numpy.ndarray], output_video_path: Optional[str] = None, fps: float = 5) -> str:
 'save_video' is a utility function that saves a list of frames as a mp4 video file on disk.
     Parameters:

vision_agent/agent/agent_utils.py CHANGED

@@ -148,8 +148,10 @@ def format_plan_v2(plan: PlanContext) -> str:
     plan_str += "Instructions:\n"
     for v in plan.instructions:
         plan_str += f"    - {v}\n"
-    plan_str += "Code:\n"
-    plan_str += plan.code
+    if plan.code:
+        plan_str += "Code:\n"
+        plan_str += plan.code
     return plan_str
@@ -158,7 +160,7 @@ def format_conversation(chat: List[AgentMessage]) -> str:
     prompt = ""
     for chat_i in chat:
         if chat_i.role == "user" or chat_i.role == "coder":
-            if "<final_code>" in chat_i.role:
+            if "<final_code>" in chat_i.content:
                 prompt += f"OBSERVATION: {chat_i.content}\n\n"
             elif chat_i.role == "user":
                 prompt += f"USER: {chat_i.content}\n\n"
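
Note on the two agent_utils.py hunks above: the following is a minimal sketch, not the package's code, of why each change matters. `PlanContext` here is a hypothetical stand-in with only the three fields visible in this diff, and the first line of `format_plan_sketch` is an assumption because the start of `format_plan_v2` is not shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanContext:
    # Hypothetical stand-in: only the fields that appear in this diff.
    plan: str
    instructions: List[str] = field(default_factory=list)
    code: str = ""


def format_plan_sketch(plan: PlanContext) -> str:
    # Assumption: the real function starts by emitting the plan text.
    plan_str = plan.plan + "\n"
    plan_str += "Instructions:\n"
    for v in plan.instructions:
        plan_str += f"    - {v}\n"
    # New behaviour from the diff: skip the "Code:" header when code is empty.
    if plan.code:
        plan_str += "Code:\n"
        plan_str += plan.code
    return plan_str


# The second hunk: a role is "user" or "coder" inside that branch, so the old
# substring test against the role could never succeed.
role, content = "coder", "<final_code>print('hi')</final_code>"
old_check = "<final_code>" in role      # always False for these roles
new_check = "<final_code>" in content   # True when the message carries final code
```

With the old check, coder messages containing `<final_code>` would not have been labelled as `OBSERVATION` by that branch.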

vision_agent/agent/vision_agent_coder_prompts_v2.py CHANGED

@@ -6,7 +6,7 @@ FEEDBACK = """
 CODE = """
-**Role**: You are an expoert software programmer.
+**Role**: You are an expert software programmer.
 **Task**: You are given a plan by a planning agent that solves a vision problem posed by the user. You are also given code snippets that the planning agent used to solve the task. Your job is to organize the code so that it can be easily called by the user to solve the task.

vision_agent/agent/vision_agent_coder_v2.py CHANGED

@@ -425,6 +425,8 @@ class VisionAgentCoderV2(AgentCoder):
             chat (List[AgentMessage]): The input to the agent. This should be a list of
                 AgentMessage objects.
             plan_context (PlanContext): The plan context that was previously generated.
+                If plan_context.code is not provided, then the code will be generated
+                from the chat messages.
             code_interpreter (Optional[CodeInterpreter]): The code interpreter to use.
         Returns:
@@ -441,7 +443,7 @@ class VisionAgentCoderV2(AgentCoder):
         # we don't need the user_interaction response for generating code since it's
         # already in the plan context
-        while chat[-1].role != "user":
+        while len(chat) > 0 and chat[-1].role != "user":
             chat.pop()
         if not chat:
@@ -455,12 +457,24 @@ class VisionAgentCoderV2(AgentCoder):
             int_chat, _, media_list = add_media_to_chat(chat, code_interpreter)
             tool_docs = retrieve_tools(plan_context.instructions, self.tool_recommender)
+            # If code is not provided from the plan_context then generate it, else use
+            # the provided code and start with testing
+            if not plan_context.code.strip():
+                code = write_code(
+                    coder=self.coder,
+                    chat=int_chat,
+                    tool_docs=tool_docs,
+                    plan=format_plan_v2(plan_context),
+                )
+            else:
+                code = plan_context.code
             code_context = test_code(
                 tester=self.tester,
                 debugger=self.debugger,
                 chat=int_chat,
                 plan=format_plan_v2(plan_context),
-                code=plan_context.code,
+                code=code,
                 tool_docs=tool_docs,
                 code_interpreter=code_interpreter,
                 media_list=media_list,
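
Note on the `while len(chat) > 0 and ...` change above: a small sketch of the failure the guard avoids. Messages are modelled as hypothetical `(role, content)` tuples rather than `AgentMessage` objects.

```python
from typing import List, Tuple

# Hypothetical stand-in for AgentMessage: (role, content).
Message = Tuple[str, str]


def trim_unguarded(chat: List[Message]) -> List[Message]:
    # Old loop shape: indexes chat[-1] even after the list becomes empty.
    chat = list(chat)
    while chat[-1][0] != "user":
        chat.pop()
    return chat


def trim_guarded(chat: List[Message]) -> List[Message]:
    # New loop shape: stops cleanly when no user message is left.
    chat = list(chat)
    while len(chat) > 0 and chat[-1][0] != "user":
        chat.pop()
    return chat
```

When the chat contains no user message, the unguarded loop pops everything and then raises `IndexError`; the guarded loop returns an empty list, which the `if not chat:` branch that follows in the diff can then handle.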

vision_agent/agent/vision_agent_planner_prompts_v2.py CHANGED

@@ -50,7 +50,7 @@ From this aerial view of a busy urban street, it's difficult to clearly see or c
 [suggestion 0]
 The image is very large and the items you need to detect are small.
-Step 1: You should start by splitting the image into sections and runing the detection algorithm on each section:
+Step 1: You should start by splitting the image into overlapping sections and runing the detection algorithm on each section:
 def subdivide_image(image):
     height, width, _ = image.shape
@@ -66,41 +66,96 @@ def subdivide_image(image):
 get_tool_for_task('<your prompt here>', subdivide_image(image))
-Step 2: Once you have the detections from each subdivided image, you will need to merge them back together to remove overlapping predictions:
-def translate_ofset(bbox, offset_x, offset_y):
-    return (bbox[0] + offset_x, bbox[1] + offset_y, bbox[2] + offset_x, bbox[3] + offset_y)
-def bounding_boxes_overlap(bbox1, bbox2):
-    if bbox1[2] < bbox2[0] or bbox2[0] > bbox1[2]:
-        return False
-    if bbox1[3] < bbox2[1] or bbox2[3] > bbox1[3]:
-        return False
-    return True
-def merge_bounding_boxes(bbox1, bbox2):
-    x_min = min(bbox1[0], bbox2[0])
-    y_min = min(bbox1[1], bbox2[1])
-    x_max = max(bbox1[2], bbox2[2])
-    y_max = max(bbox1[3], bbox2[3])
-    return (x_min, y_min, x_max, y_max)
-def merge_bounding_box_list(bboxes):
-    merged_bboxes = []
-    while bboxes:
-        bbox = bboxes.pop()
-        overlap_found = False
-        for i, other_bbox in enumerate(merged_bboxes):
-            if bounding_boxes_overlap(bbox, other_bbox):
-                merged_bboxes[i] = merge_bounding_boxes(bbox, other_bbox)
-                overlap_found = True
+Step 2: Once you have the detections from each subdivided image, you will need to merge them back together to remove overlapping predictions, be sure to tranlate the offset back to the original image:
+def bounding_box_match(b1: List[float], b2: List[float], iou_threshold: float = 0.1) -> bool:
+    # Calculate intersection coordinates
+    x1 = max(b1[0], b2[0])
+    y1 = max(b1[1], b2[1])
+    x2 = min(b1[2], b2[2])
+    y2 = min(b1[3], b2[3])
+    # Calculate intersection area
+    if x2 < x1 or y2 < y1:
+        return False  # No overlap
+    intersection = (x2 - x1) * (y2 - y1)
+    # Calculate union area
+    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
+    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
+    union = area1 + area2 - intersection
+    # Calculate IoU
+    iou = intersection / union if union > 0 else 0
+    return iou >= iou_threshold
+def merge_bounding_box_list(detections):
+    merged_detections = []
+    for detection in detections:
+        matching_box = None
+        for i, other in enumerate(merged_detections):
+            if bounding_box_match(detection["bbox"], other["bbox"]):
+                matching_box = i
                 break
-        if not overlap_found:
-          p
-          merged_bboxes.append(bbox)
-    return merged_bboxes
-detection = merge_bounding_box_list(detection_from_subdivided_images)
+        if matching_box is not None:
+            # Keep the box with higher confidence score
+            if detection["score"] > merged_detections[matching_box]["score"]:
+                merged_detections[matching_box] = detection
+        else:
+            merged_detections.append(detection)
+def sub_image_to_original(elt, sub_image_position, original_size):
+    offset_x, offset_y = sub_image_position
+    return {
+        "label": elt["label"],
+        "score": elt["score"],
+        "bbox": [
+            (elt["bbox"][0] + offset_x) / original_size[1],
+            (elt["bbox"][1] + offset_y) / original_size[0],
+            (elt["bbox"][2] + offset_x) / original_size[1],
+            (elt["bbox"][3] + offset_y) / original_size[0],
+        ],
+    }
+def normalized_to_unnormalized(elt, image_size):
+    return {
+        "label": elt["label"],
+        "score": elt["score"],
+        "bbox": [
+            elt["bbox"][0] * image_size[1],
+            elt["bbox"][1] * image_size[0],
+            elt["bbox"][2] * image_size[1],
+            elt["bbox"][3] * image_size[0],
+        ],
+    }
+height, width, _ = image.shape
+mid_width = width // 2
+mid_height = height // 2
+detection_from_subdivided_images = []
+for i, sub_image in enumerate(subdivided_images):
+    detections = <your detection function here>("pedestrian", sub_image)
+    unnorm_detections = [
+        normalized_to_unnormalized(
+            detection, (sub_image.shape[0], sub_image.shape[1])
+        )
+        for detection in detections
+    ]
+    offset_x = i % 2 * (mid_width - int(mid_width * 0.1))
+    offset_y = i // 2 * (mid_height - int(mid_height * 0.1))
+    offset_detections = [
+        sub_image_to_original(
+            unnorm_detection, (offset_x, offset_y), (height, width)
+        )
+        for unnorm_detection in unnorm_detections
+    ]
+    detection_from_subdivided_images.extend(offset_detections)
+detections = merge_bounding_box_list(detection_from_subdivided_images)
 [end of suggestion 0]
 [end of suggestion]
 <count>9</count>
@@ -164,36 +219,44 @@ countgd_object_detection(prompt: str, image: numpy.ndarray, box_threshold: float
 AGENT: <thinking>I need to now merge the boxes from all region and use the countgd_object_detection tool with the prompt 'pedestrian' as suggested by get_tool_for_task.</thinking>
 <execute_python>
-def translate_ofset(bbox, offset_x, offset_y):
-    return (bbox[0] + offset_x, bbox[1] + offset_y, bbox[2] + offset_x, bbox[3] + offset_y)
-def bounding_boxes_overlap(bbox1, bbox2):
-    if bbox1[2] < bbox2[0] or bbox2[0] > bbox1[2]:
-        return False
-    if bbox1[3] < bbox2[1] or bbox2[3] > bbox1[3]:
-        return False
-    return True
-def merge_bounding_boxes(bbox1, bbox2):
-    x_min = min(bbox1[0], bbox2[0])
-    y_min = min(bbox1[1], bbox2[1])
-    x_max = max(bbox1[2], bbox2[2])
-    y_max = max(bbox1[3], bbox2[3])
-    return (x_min, y_min, x_max, y_max)
-def merge_bounding_box_list(bboxes):
-    merged_bboxes = []
-    while bboxes:
-        bbox = bboxes.pop()
-        overlap_found = False
-        for i, other_bbox in enumerate(merged_bboxes):
-            if bounding_boxes_overlap(bbox, other_bbox):
-                merged_bboxes[i] = merge_bounding_boxes(bbox, other_bbox)
-                overlap_found = True
+def bounding_box_match(b1: List[float], b2: List[float], iou_threshold: float = 0.1) -> bool:
+    # Calculate intersection coordinates
+    x1 = max(b1[0], b2[0])
+    y1 = max(b1[1], b2[1])
+    x2 = min(b1[2], b2[2])
+    y2 = min(b1[3], b2[3])
+    # Calculate intersection area
+    if x2 < x1 or y2 < y1:
+        return False  # No overlap
+    intersection = (x2 - x1) * (y2 - y1)
+    # Calculate union area
+    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
+    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
+    union = area1 + area2 - intersection
+    # Calculate IoU
+    iou = intersection / union if union > 0 else 0
+    return iou >= iou_threshold
+def merge_bounding_box_list(detections):
+    merged_detections = []
+    for detection in detections:
+        matching_box = None
+        for i, other in enumerate(merged_detections):
+            if bounding_box_match(detection["bbox"], other["bbox"]):
+                matching_box = i
                 break
-        if not overlap_found:
-            merged_bboxes.append(bbox)
-    return merged_bboxes
+        if matching_box is not None:
+            # Keep the box with higher confidence score
+            if detection["score"] > merged_detections[matching_box]["score"]:
+                merged_detections[matching_box] = detection
+        else:
+            merged_detections.append(detection)
 detections = []
 for region in subdivide_image(image):
@@ -458,6 +521,8 @@ You are given a task: "{task}" from the user. You must extract the type of categ
 - "DocQA" - answering questions about a document or extracting information from a document.
 - "video object tracking" - tracking objects in a video.
 - "depth and pose estimation" - estimating the depth or pose of objects in an image.
+- "temporal localization" - localizing the time period an event occurs in a video.
+- "inpainting" - filling in masked parts of an image.
 Return the category or categories (comma separated) inside tags <category># your categories here</category>. If you are unsure about a task, it is better to include more categories than less.
 """
@@ -651,22 +716,24 @@ PICK_TOOL = """
 """
 FINALIZE_PLAN = """
-**Role**: You are an expert AI model that can understand the user request and construct plans to accomplish it.
+**Task**: You are given a chain of thoughts, python executions and observations from a planning agent as it tries to construct a plan to solve a user request. Your task is to summarize the plan it found so that another programming agent to write a program to accomplish the user request.
-**Task**: You are given a chain of thoughts, python executions and observations from a planning agent as it tries to construct a plan to solve a user request. Your task is to summarize the plan it found so that another programming agnet to write a program to accomplish the user request.
+**Documentation**: You can use these tools to help you visualize or save the output:
+{tool_desc}
 **Planning**: Here is chain of thoughts, executions and observations from the planning agent:
 {planning}
 **Instructions**:
 1. Summarize the plan that the planning agent found.
-2. Write a single function that solves the problem based on what the planner found.
-3. Specifically call out the tools used and the order in which they were used. Only include tools obtained from calling `get_tool_for_task`.
+2. Write a single function that solves the problem based on what the planner found and only returns the final solution.
+3. Only use tools obtained from calling `get_tool_for_task`.
 4. Do not include {excluded_tools} tools in your instructions.
-5. Add final instructions for visualizing the output with `overlay_bounding_boxes` or `overlay_segmentation_masks` and saving it to a file with `save_image` or `save_video`.
-6. Use the default FPS for extracting frames from videos unless otherwise specified by the user.
-7. Include the expected answer in your 'plan' so that the programming agent can properly test if it has the correct answer.
-8. Respond in the following format with JSON surrounded by <json> tags and code surrounded by <code> tags:
+5. Ensure the function is well documented and easy to understand.
+6. Ensure you visualize the output with `overlay_bounding_boxes` or `overlay_segmentation_masks` and save it to a file with `save_image` or `save_video`.
+7. Use the default FPS for extracting frames from videos unless otherwise specified by the user.
+8. Include the expected answer in your 'plan' so that the programming agent can properly test if it has the correct answer.
+9. Respond in the following format with JSON surrounded by <json> tags and code surrounded by <code> tags:
 <json>
 {{
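
Note on the `bounding_box_match` / `merge_bounding_box_list` helpers added to the prompt in this file: as shown in this diff, the new `merge_bounding_box_list` has no `return` statement, so the later `detections = merge_bounding_box_list(...)` line in the prompt would receive `None`. The sketch below is a runnable variant with the return added; the return statement and the sample detections are additions for illustration, not part of the package.

```python
from typing import Any, Dict, List


def bounding_box_match(
    b1: List[float], b2: List[float], iou_threshold: float = 0.1
) -> bool:
    # Intersection rectangle of the two [x_min, y_min, x_max, y_max] boxes.
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    if x2 < x1 or y2 < y1:
        return False  # no overlap
    intersection = (x2 - x1) * (y2 - y1)
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = area1 + area2 - intersection
    iou = intersection / union if union > 0 else 0
    return iou >= iou_threshold


def merge_bounding_box_list(detections: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    merged: List[Dict[str, Any]] = []
    for detection in detections:
        match = None
        for i, other in enumerate(merged):
            if bounding_box_match(detection["bbox"], other["bbox"]):
                match = i
                break
        if match is None:
            merged.append(detection)
        elif detection["score"] > merged[match]["score"]:
            merged[match] = detection  # keep the higher-confidence box
    return merged  # added here; absent from the prompt text in the diff


sample = [
    {"label": "pedestrian", "score": 0.6, "bbox": [0.0, 0.0, 0.5, 0.5]},
    {"label": "pedestrian", "score": 0.9, "bbox": [0.05, 0.05, 0.5, 0.5]},
    {"label": "pedestrian", "score": 0.7, "bbox": [0.6, 0.6, 0.9, 0.9]},
]
merged_sample = merge_bounding_box_list(sample)
```

Unlike the removed helpers, which grew a union box around every overlapping pair, this version keeps one of the duplicate boxes unchanged, so overlapping tiles should not inflate box sizes.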

vision_agent/agent/vision_agent_planner_v2.py CHANGED

@@ -326,6 +326,7 @@ def create_finalize_plan(
         return [], PlanContext(plan="", instructions=[], code="")
     prompt = FINALIZE_PLAN.format(
+        tool_desc=UTIL_DOCSTRING,
         planning=get_planning(chat),
         excluded_tools=str([t.__name__ for t in pt.PLANNER_TOOLS]),
     )

vision_agent/agent/vision_agent_prompts_v2.py CHANGED

@@ -42,6 +42,8 @@ AGENT: <response>I am VisionAgent, an agent built by LandingAI, to help users wr
 - Understanding documents
 - Pose estimation
 - Visual question answering for both images and videos
+- Action recognition in videos
+- Image inpainting
 How can I help you?</response>
 --- END EXAMPLE2 ---
@@ -54,7 +56,8 @@ Here is the current conversation so far:
 **Instructions**:
 1. Only respond with a single <response> tag and a single <action> tag.
-2. Respond in the following format, the <action> tag is optional and can be excluded if you do not want to take any action:
+2. You can only take one action at a time in response to the user's message. Do not offer to fix code on the user's behalf, only if they have directly asked you to.
+3. Respond in the following format, the <action> tag is optional and can be excluded if you do not want to take any action:
 <response>Your response to the user's message</response>
 <action>The action you want to take from **Actions**</action>

vision_agent/agent/vision_agent_v2.py CHANGED

@@ -27,7 +27,7 @@ CONFIG = Config()
 def extract_conversation(
-    chat: List[AgentMessage],
+    chat: List[AgentMessage], include_conv: bool = False
 ) -> Tuple[List[AgentMessage], Optional[str]]:
     chat = copy.deepcopy(chat)
@@ -43,6 +43,8 @@ def extract_conversation(
         elif chat_i.role == "coder":
             if "<final_code>" in chat_i.content:
                 extracted_chat.append(chat_i)
+        elif include_conv and chat_i.role == "conversation":
+            extracted_chat.append(chat_i)
     # only keep the last <final_code> and <final_test>
     final_code = None
@@ -64,10 +66,9 @@ def extract_conversation(
 def run_conversation(agent: LMM, chat: List[AgentMessage]) -> str:
-    extracted_chat, _ = extract_conversation(chat)
-    extracted_chat = extracted_chat[-10:]
+    extracted_chat, _ = extract_conversation(chat, include_conv=True)
-    conv = format_conversation(chat)
+    conv = format_conversation(extracted_chat)
     prompt = CONVERSATION.format(
         conversation=conv,
     )
@@ -112,14 +113,17 @@ def maybe_run_action(
                 )
             ]
     elif action == "edit_code":
+        # We don't want to pass code in plan_context.code so the coder will generate
+        # new code from plan_context.plan
         plan_context = PlanContext(
-            plan="Edit the latest code observed in the fewest steps possible according to the user's feedback.",
+            plan="Edit the latest code observed in the fewest steps possible according to the user's feedback."
+            + ("<code>\n" + final_code + "\n</code>" if final_code is not None else ""),
             instructions=[
                 chat_i.content
                 for chat_i in extracted_chat
                 if chat_i.role == "user" and "<final_code>" not in chat_i.content
             ],
-            code=final_code if final_code is not None else "",
+            code="",
         )
         context = coder.generate_code_from_plan(
             extracted_chat, plan_context, code_interpreter=code_interpreter
@@ -260,7 +264,7 @@ class VisionAgentV2(Agent):
                 # do not append updated_chat to return_chat becuase the observation
                 # from running the action will have already been added via the callbacks
                 obs_response_context = run_conversation(
-                    self.agent, return_chat + updated_chat
+                    self.agent, int_chat + return_chat + updated_chat
                 )
                 return_chat.append(
                     AgentMessage(role="conversation", content=obs_response_context)

vision_agent/tools/planner_tools.py CHANGED

@@ -2,7 +2,7 @@ import inspect
 import logging
 import tempfile
 from concurrent.futures import ThreadPoolExecutor, as_completed
-from typing import Any, Callable, Dict, List, Optional, Tuple, cast
+from typing import Any, Callable, Dict, List, Optional, Tuple, Union, cast
 import libcst as cst
 import numpy as np
@@ -235,7 +235,9 @@ def run_tool_testing(
 def get_tool_for_task(
-    task: str, images: List[np.ndarray], exclude_tools: Optional[List[str]] = None
+    task: str,
+    images: Union[Dict[str, List[np.ndarray]], List[np.ndarray]],
+    exclude_tools: Optional[List[str]] = None,
 ) -> None:
     """Given a task and one or more images this function will find a tool to accomplish
     the jobs. It prints the tool documentation and thoughts on why it chose the tool.
@@ -248,6 +250,8 @@ def get_tool_for_task(
         - VQA
         - Depth and pose estimation
         - Video object tracking
+        - Video temporal localization (action recognition)
+        - Image inpainting
     Only ask for one type of task at a time, for example a task needing to identify
     text is one OCR task while needing to identify non-text objects is an OD task. Wait
@@ -256,7 +260,8 @@ def get_tool_for_task(
     Parameters:
         task: str: The task to accomplish.
-        images: List[np.ndarray]: The images to use for the task.
+        images: Union[Dict[str, List[np.ndarray]], List[np.ndarray]]: The images to use
+            for the task. If a key is provided, it is used as the file name.
         exclude_tools: Optional[List[str]]: A list of tool names to exclude from the
             recommendations. This is helpful if you are calling get_tool_for_task twice
             and do not want the same tool recommended.
@@ -266,20 +271,29 @@ def get_tool_for_task(
     Examples
     --------
-        >>> get_tool_for_task("Give me an OCR model that can find 'hot chocolate' in the image", [image])
+        >>> get_tool_for_task(
+        >>>     "Give me an OCR model that can find 'hot chocolate' in the image",
+        >>>     {"image": [image]})
+        >>> get_tool_for_taks(
+        >>>     "I need a tool that can paint a background for this image and maks",
+        >>>     {"image": [image], "mask": [mask]})
     """
     tool_tester = CONFIG.create_tool_tester()
     tool_chooser = CONFIG.create_tool_chooser()
+    if isinstance(images, list):
+        images = {"image": images}
     with (
         tempfile.TemporaryDirectory() as tmpdirname,
         CodeInterpreterFactory.new_instance() as code_interpreter,
     ):
         image_paths = []
-        for i, image in enumerate(images[:3]):
-            image_path = f"{tmpdirname}/image_{i}.png"
-            Image.fromarray(image).save(image_path)
-            image_paths.append(image_path)
+        for k in images.keys():
+            for i, image in enumerate(images[k]):
+                image_path = f"{tmpdirname}/{k}_{i}.png"
+                Image.fromarray(image).save(image_path)
+                image_paths.append(image_path)
         code, tool_docs_str, tool_output = run_tool_testing(
             task, image_paths, tool_tester, exclude_tools, code_interpreter
@@ -300,20 +314,26 @@ def get_tool_documentation(tool_name: str) -> str:
 def get_tool_for_task_human_reviewer(
-    task: str, images: List[np.ndarray], exclude_tools: Optional[List[str]] = None
+    task: str,
+    images: Union[Dict[str, List[np.ndarray]], List[np.ndarray]],
+    exclude_tools: Optional[List[str]] = None,
 ) -> None:
     # NOTE: this will have the same documentation as get_tool_for_task
     tool_tester = CONFIG.create_tool_tester()
+    if isinstance(images, list):
+        images = {"image": images}
     with (
         tempfile.TemporaryDirectory() as tmpdirname,
         CodeInterpreterFactory.new_instance() as code_interpreter,
     ):
         image_paths = []
-        for i, image in enumerate(images[:3]):
-            image_path = f"{tmpdirname}/image_{i}.png"
-            Image.fromarray(image).save(image_path)
-            image_paths.append(image_path)
+        for k in images.keys():
+            for i, image in enumerate(images[k]):
+                image_path = f"{tmpdirname}/{k}_{i}.png"
+                Image.fromarray(image).save(image_path)
+                image_paths.append(image_path)
         tools = [
             t.__name__
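
Note on the new `images` argument of `get_tool_for_task` and `get_tool_for_task_human_reviewer`: the sketch below reproduces only the normalization and file-naming logic from the hunks above, without writing any files. `planned_image_paths` is a hypothetical helper name, and the image values are opaque placeholders rather than numpy arrays.

```python
from typing import Any, Dict, List, Union


def planned_image_paths(
    images: Union[Dict[str, List[Any]], List[Any]], tmpdirname: str
) -> List[str]:
    # A bare list is wrapped under the default key "image", as in the diff.
    if isinstance(images, list):
        images = {"image": images}
    paths: List[str] = []
    for k in images.keys():
        for i, _ in enumerate(images[k]):
            # The dictionary key becomes the file-name prefix.
            paths.append(f"{tmpdirname}/{k}_{i}.png")
    return paths


list_paths = planned_image_paths(["img_a", "img_b"], "/tmp/x")
dict_paths = planned_image_paths({"image": ["img_a"], "mask": ["mask_a"]}, "/tmp/x")
```

Two behaviours follow from the hunks as shown: existing callers that pass a list still get `image_0.png`, `image_1.png`, and so on, and the earlier `images[:3]` cap is gone, so every supplied image is saved.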

vision_agent/tools/tools.py CHANGED

@@ -1727,22 +1727,46 @@ def video_temporal_localization(
     }
     payload["chunk_length_frames"] = chunk_length_frames
-    data = send_inference_request(
-        payload, "video-temporal-localization", files=files, v2=True
-    )
+    segments = split_frames_into_segments(frames, segment_size=50, overlap=0)
+    def _apply_temporal_localization(
+        segment: List[np.ndarray],
+    ) -> List[float]:
+        segment_buffer_bytes = [("video", frames_to_bytes(segment))]
+        data = send_inference_request(
+            payload, "video-temporal-localization", files=segment_buffer_bytes, v2=True
+        )
+        chunked_data = [cast(float, value) for value in data]
+        full_data = []
+        for value in chunked_data:
+            full_data.extend([value] * chunk_length_frames)
+        return full_data[: len(segment)]
+    with ThreadPoolExecutor() as executor:
+        futures = {
+            executor.submit(_apply_temporal_localization, segment): segment_index
+            for segment_index, segment in enumerate(segments)
+        }
+        localization_per_segment = []
+        for future in as_completed(futures):
+            segment_index = futures[future]
+            localization_per_segment.append((segment_index, future.result()))
+    localization_per_segment = [
+        x[1] for x in sorted(localization_per_segment, key=lambda x: x[0])  # type: ignore
+    ]
+    localizations = cast(List[float], [e for o in localization_per_segment for e in o])
     _display_tool_trace(
         video_temporal_localization.__name__,
         payload,
-        data,
+        localization_per_segment,
         files,
     )
-    chunked_data = [cast(float, value) for value in data]
-    full_data = []
-    for value in chunked_data:
-        full_data.extend([value] * chunk_length_frames)
-    return full_data[: len(frames)]
+    return localizations
 def vit_image_classification(image: np.ndarray) -> Dict[str, Any]:
@@ -2028,16 +2052,18 @@ def flux_image_inpainting(
     mask: np.ndarray,
 ) -> np.ndarray:
     """'flux_image_inpainting' performs image inpainting to fill the masked regions,
-    given by mask, in the image, given image based on the text prompt and surrounding image context.
-    It can be used to edit regions of an image according to the prompt given.
+    given by mask, in the image, given image based on the text prompt and surrounding
+    image context. It can be used to edit regions of an image according to the prompt
+    given.
     Parameters:
         prompt (str): A detailed text description guiding what should be generated
-            in the masked area. More detailed and specific prompts typically yield better results.
-        image (np.ndarray): The source image to be inpainted.
-            The image will serve as the base context for the inpainting process.
-        mask (np.ndarray): A binary mask image with 0's and 1's,
-            where 1 indicates areas to be inpainted and 0 indicates areas to be preserved.
+            in the masked area. More detailed and specific prompts typically yield
+            better results.
+        image (np.ndarray): The source image to be inpainted. The image will serve as
+            the base context for the inpainting process.
+        mask (np.ndarray): A binary mask image with 0's and 1's, where 1 indicates
+            areas to be inpainted and 0 indicates areas to be preserved.
     Returns:
         np.ndarray: The generated image(s) as a numpy array in RGB format with values
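
The `video_temporal_localization` hunk above collects each future's result together with its segment index, sorts by that index, and then flattens the per-segment lists. A minimal sketch of that pattern follows, assuming the futures finish in arbitrary order; `fake_localize` and `localize_in_order` are hypothetical stand-ins for illustration, not functions from `vision_agent`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Tuple


def fake_localize(segment: List[int]) -> List[float]:
    # Hypothetical stand-in for the per-segment model call:
    # returns one score per frame in the segment.
    return [1.0 if frame % 2 == 0 else 0.0 for frame in segment]


def localize_in_order(segments: List[List[int]]) -> List[float]:
    indexed_results: List[Tuple[int, List[float]]] = []
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Remember which segment each future belongs to.
        future_to_index = {
            executor.submit(fake_localize, segment): index
            for index, segment in enumerate(segments)
        }
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(future_to_index):
            indexed_results.append((future_to_index[future], future.result()))

    # Restore the original segment order, then flatten to one score per frame.
    ordered = [scores for _, scores in sorted(indexed_results, key=lambda x: x[0])]
    return [score for scores in ordered for score in scores]


if __name__ == "__main__":
    print(localize_in_order([[0, 1], [2, 3], [4]]))
```

Sorting on the stored index keeps the output independent of which segment finishes first, and that independence is what the sort in the hunk is there to provide.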

{vision_agent-0.2.231.dist-info → vision_agent-0.2.233.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: vision-agent
-Version: 0.2.231
+Version: 0.2.233
 Summary: Toolset for Vision Agent
 Author: Landing AI
 Author-email: dev@landing.ai

{vision_agent-0.2.231.dist-info → vision_agent-0.2.233.dist-info}/RECORD RENAMED

@@ -1,23 +1,23 @@
-vision_agent/.sim_tools/df.csv,sha256=XdcgkjC7CjF_CoJnXmFkYOPUBwHemiwsauh62b1eh1M,42472
+vision_agent/.sim_tools/df.csv,sha256=oVUuyoVTCnayorbGUAvWed8l1YA_-rF9rSF78fMtvuU,42468
 vision_agent/.sim_tools/embs.npy,sha256=YJe8EcKVNmeX_75CS2T1sbY-sUS_1HQAMT-34zc18a0,254080
 vision_agent/__init__.py,sha256=EAb4-f9iyuEYkBrX4ag1syM8Syx8118_t0R6_C34M9w,57
 vision_agent/agent/README.md,sha256=Q4w7FWw38qaWosQYAZ7NqWx8Q5XzuWrlv7nLhjUd1-8,5527
 vision_agent/agent/__init__.py,sha256=M8CffavdIh8Zh-skznLHIaQkYGCGK7vk4dq1FaVkbs4,617
 vision_agent/agent/agent.py,sha256=_1tHWAs7Jm5tqDzEcPfCRvJV3uRRveyh4n9_9pd6I1w,1565
-vision_agent/agent/agent_utils.py,sha256=IXxN9XruaeNTreUrdztb3kWJhimpsdH6hjv6xT4jg1Q,14062
+vision_agent/agent/agent_utils.py,sha256=4RgG8SUEGuMFHkIt0jCFkRQF6G1PZp3Ub4LuVYKF7Ic,14092
 vision_agent/agent/types.py,sha256=dIdxATH_PP76pD5Wfo0oofWt6iPQh0vpf48QbEQSzhs,2472
 vision_agent/agent/vision_agent.py,sha256=fH9NOLk7twL1fPr9vLSqkaYhah-gfDWfTOVF2FfMyzI,23461
 vision_agent/agent/vision_agent_coder.py,sha256=flUxOibyGZK19BCSK5mhaD3HjCxHw6c6FtKom6N2q1E,27359
 vision_agent/agent/vision_agent_coder_prompts.py,sha256=_kkPLezUVnBXieNPlxMQab_6J6P7F-aa6ItF5NhZZsM,12281
-vision_agent/agent/vision_agent_coder_prompts_v2.py,sha256=idmSMfxebPULqqvllz3gqRzGDchEvS5dkGngvBs4PGo,4872
-vision_agent/agent/vision_agent_coder_v2.py,sha256=ZR2PQoMqNM6yK3vn_0rrCJf_EplRKye7t7bVjyl51ls,16476
+vision_agent/agent/vision_agent_coder_prompts_v2.py,sha256=NUMWq-Lxq5JmmyWs3C5O_1Hm-zCbf9I_yPK5UtWGspE,4871
+vision_agent/agent/vision_agent_coder_v2.py,sha256=yQYcO0s4BI9pWaAQQAVtkwWa3UF5w0iLKvwpeJ6iegM,17077
 vision_agent/agent/vision_agent_planner.py,sha256=fFzjNkZBKkh8Y_oS06ATI4qz31xmIJvixb_tV1kX8KA,18590
 vision_agent/agent/vision_agent_planner_prompts.py,sha256=rYRdJthc-sQN57VgCBKrF09Sd73BSxcBdjNe6C4WNZ8,6837
-vision_agent/agent/vision_agent_planner_prompts_v2.py,sha256=5xTx93lNpoyT4eAD9jicwDyDAkuW7eQqicr17zCjrQw,33337
-vision_agent/agent/vision_agent_planner_v2.py,sha256=7hBQdg9y4oCLDiQ54Kh12_uIMywedKKNPWiKPRA01cQ,20568
+vision_agent/agent/vision_agent_planner_prompts_v2.py,sha256=U88z1Y7CifFs7t53aUrl8qjWtBYs0f_F5vyg_0VYJko,35528
+vision_agent/agent/vision_agent_planner_v2.py,sha256=NUyi57zxCmOO004_cJcCCDa4UgcKSWB1WCGuyOhhXQE,20602
 vision_agent/agent/vision_agent_prompts.py,sha256=KaJwYPUP7_GvQsCPPs6Fdawmi3AQWmWajBUuzj7gTG4,13812
-vision_agent/agent/vision_agent_prompts_v2.py,sha256=AW_bW1boGiCLyLFd3h4GQenfDACttQagDHwpBkSW4Xo,2518
-vision_agent/agent/vision_agent_v2.py,sha256=335VT0hk0jkB14y4W3cJo5ueEu1wY_jjN-R_m2xaQ30,10752
+vision_agent/agent/vision_agent_prompts_v2.py,sha256=Wyxa15NOe75PefAfw3_RRwvgjg8YVqCrU7WvvWoYJpk,2733
+vision_agent/agent/vision_agent_v2.py,sha256=86_pPdkkMBk08TTFZ7zu9QG37Iz9uI8Nmt79wwm_EIA,11053
 vision_agent/clients/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 vision_agent/clients/http.py,sha256=k883i6M_4nl7zwwHSI-yP5sAgQZIDPM1nrKD6YFJ3Xs,2009
 vision_agent/clients/landing_public_api.py,sha256=lU2ev6E8NICmR8DMUljuGcVFy5VNJQ4WQkWC8WnnJEc,1503
@@ -33,10 +33,10 @@ vision_agent/lmm/lmm.py,sha256=arwfYPWme_RxCxSpEQ0ZkpHO22GFPCwVeoSvXqLPOAk,19288
 vision_agent/lmm/types.py,sha256=ZEXR_ptBL0ZwDMTDYkgxUCmSZFmBYPQd2jreNzr_8UY,221
 vision_agent/tools/__init__.py,sha256=zopUrANPx7p0NGy6BxmEaYhDrj8DX8w7BLfgmCbz-mU,2897
 vision_agent/tools/meta_tools.py,sha256=TPeS7QWnc_PmmU_ndiDT03dXbQ5yDSP33E7U8cSj7Ls,28660
-vision_agent/tools/planner_tools.py,sha256=Mk3N-I-Qs4ezeyv8EL9BxdxmJG5oWiH5bFkvgwJKB0s,14660
+vision_agent/tools/planner_tools.py,sha256=8pJZCGGOGIqGiV2or52BjyRP6eDlporuQ2hXCIHfLTQ,15382
 vision_agent/tools/prompts.py,sha256=V1z4YJLXZuUl_iZ5rY0M5hHc_2tmMEUKr0WocXKGt4E,1430
 vision_agent/tools/tool_utils.py,sha256=xJRWF96Ge9RvhhVHrOtifjUYoc4HIJ2y7c2VOQ2Lp8s,10152
-vision_agent/tools/tools.py,sha256=3B3xWFVA3qfAO6ySSQ2yUPUAiTrgJomL48hLO_VP6RQ,106015
+vision_agent/tools/tools.py,sha256=Eb2paiXjik0HyGeZzXctTpJCLG0V3NnNL9awtaB8HN4,107011
 vision_agent/tools/tools_types.py,sha256=8hYf2OZhI58gvf65KGaeGkt4EQ56nwLFqIQDPHioOBc,2339
 vision_agent/utils/__init__.py,sha256=QKk4zVjMwGxQI0MQ-aZZA50N-qItxRY4EB9CwQkZ2HY,185
 vision_agent/utils/exceptions.py,sha256=booSPSuoULF7OXRr_YbC4dtKt6gM_HyiFQHBuaW86C4,2052
@@ -46,7 +46,7 @@ vision_agent/utils/sim.py,sha256=DYya76dYVtifFyXilMLxBzGgyfyeqhEwU4RJ4894lCI,979
 vision_agent/utils/type_defs.py,sha256=BE12s3JNQy36QvauXHjwyeffVh5enfcvd4vTzSwvEZI,1384
 vision_agent/utils/video.py,sha256=e1VwKhXzzlC5LcFMyrcQYrPnpnX4wxDpnQ-76sB4jgM,6001
 vision_agent/utils/video_tracking.py,sha256=wK5dOutqV2t2aeaxedstCBa7xy-NNQE0-QZqKu1QUds,9498
-vision_agent-0.2.231.dist-info/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
-vision_agent-0.2.231.dist-info/METADATA,sha256=N8t9F4hZ4bgyZeDhrVepMZzO5dtRmzRB8VI6fq1fFAA,5760
-vision_agent-0.2.231.dist-info/WHEEL,sha256=7Z8_27uaHI_UZAc4Uox4PpBhQ9Y5_modZXWMxtUi4NU,88
-vision_agent-0.2.231.dist-info/RECORD,,
+vision_agent-0.2.233.dist-info/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
+vision_agent-0.2.233.dist-info/METADATA,sha256=EoNuerRth0lHRC7TK2Xh7w6V__YtUJraKk9yN8AMx2U,5760
+vision_agent-0.2.233.dist-info/WHEEL,sha256=7Z8_27uaHI_UZAc4Uox4PpBhQ9Y5_modZXWMxtUi4NU,88
+vision_agent-0.2.233.dist-info/RECORD,,
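
Each RECORD line above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed, as the wheel format specifies. The sketch below shows how one entry could be parsed and recomputed; `parse_record_line` and `record_hash` are hypothetical helper names, not part of any packaging library.

```python
import base64
import hashlib
from typing import Optional, Tuple


def record_hash(data: bytes) -> str:
    # SHA-256 digest, URL-safe base64, padding stripped (wheel RECORD convention).
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def parse_record_line(line: str) -> Tuple[str, str, Optional[int]]:
    # Split from the right so a comma inside the path would not break parsing.
    path, hash_spec, size = line.rsplit(",", 2)
    # The RECORD file's own entry leaves hash and size empty.
    return path, hash_spec, int(size) if size else None


if __name__ == "__main__":
    line = "vision_agent/clients/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
    path, hash_spec, size = parse_record_line(line)
    # A zero-byte file must hash to the digest of the empty byte string.
    print(path, size, hash_spec == record_hash(b""))
```

The entry for `vision_agent/clients/__init__.py` is a convenient check because its recorded size is 0, so its digest must equal the hash of empty input.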

{vision_agent-0.2.231.dist-info → vision_agent-0.2.233.dist-info}/LICENSE RENAMED

File without changes

{vision_agent-0.2.231.dist-info → vision_agent-0.2.233.dist-info}/WHEEL RENAMED

File without changes
