vision-agent 0.2.215__tar.gz → 0.2.216__tar.gz

Files changed (46)
  1. {vision_agent-0.2.215 → vision_agent-0.2.216}/PKG-INFO +1 -1
  2. {vision_agent-0.2.215 → vision_agent-0.2.216}/pyproject.toml +1 -1
  3. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/.sim_tools/df.csv +101 -0
  4. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/.sim_tools/embs.npy +0 -0
  5. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/tools/__init__.py +1 -1
  6. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/tools/planner_tools.py +9 -1
  7. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/tools/tools.py +260 -213
  8. {vision_agent-0.2.215 → vision_agent-0.2.216}/LICENSE +0 -0
  9. {vision_agent-0.2.215 → vision_agent-0.2.216}/README.md +0 -0
  10. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/__init__.py +0 -0
  11. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/README.md +0 -0
  12. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/__init__.py +0 -0
  13. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/agent.py +0 -0
  14. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/agent_utils.py +0 -0
  15. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/types.py +0 -0
  16. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent.py +0 -0
  17. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent_coder.py +0 -0
  18. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent_coder_prompts.py +0 -0
  19. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent_coder_prompts_v2.py +0 -0
  20. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent_coder_v2.py +0 -0
  21. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent_planner.py +0 -0
  22. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent_planner_prompts.py +0 -0
  23. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent_planner_prompts_v2.py +0 -0
  24. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent_planner_v2.py +0 -0
  25. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent_prompts.py +0 -0
  26. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent_prompts_v2.py +0 -0
  27. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/agent/vision_agent_v2.py +0 -0
  28. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/clients/__init__.py +0 -0
  29. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/clients/http.py +0 -0
  30. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/clients/landing_public_api.py +0 -0
  31. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/fonts/__init__.py +0 -0
  32. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/fonts/default_font_ch_en.ttf +0 -0
  33. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/lmm/__init__.py +0 -0
  34. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/lmm/lmm.py +0 -0
  35. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/lmm/types.py +0 -0
  36. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/tools/meta_tools.py +0 -0
  37. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/tools/prompts.py +0 -0
  38. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/tools/tool_utils.py +0 -0
  39. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/tools/tools_types.py +0 -0
  40. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/utils/__init__.py +0 -0
  41. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/utils/exceptions.py +0 -0
  42. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/utils/execute.py +0 -0
  43. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/utils/image_utils.py +0 -0
  44. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/utils/sim.py +0 -0
  45. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/utils/type_defs.py +0 -0
  46. {vision_agent-0.2.215 → vision_agent-0.2.216}/vision_agent/utils/video.py +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: vision-agent
- Version: 0.2.215
+ Version: 0.2.216
  Summary: Toolset for Vision Agent
  Author: Landing AI
  Author-email: dev@landing.ai
@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"

  [tool.poetry]
  name = "vision-agent"
- version = "0.2.215"
+ version = "0.2.216"
  description = "Toolset for Vision Agent"
  authors = ["Landing AI <dev@landing.ai>"]
  readme = "README.md"
@@ -444,6 +444,35 @@ desc,doc,name
      >>> qwen2_vl_video_vqa('Which football player made the goal?', frames)
      'Lionel Messi'
  ",qwen2_vl_video_vqa
+ "'document_extraction' is a tool that can extract structured information out of documents with different layouts. It returns the extracted data in a structured hierarchical format containing text, tables, pictures, charts, and other information.","document_extraction(image: numpy.ndarray) -> Dict[str, Any]:
+     'document_extraction' is a tool that can extract structured information out of
+     documents with different layouts. It returns the extracted data in a structured
+     hierarchical format containing text, tables, pictures, charts, and other
+     information.
+ 
+     Parameters:
+         image (np.ndarray): The document image to analyze
+ 
+     Returns:
+         Dict[str, Any]: A dictionary containing the extracted information.
+ 
+     Example
+     -------
+     >>> document_analysis(image)
+     {'pages':
+       [{'bbox': [0, 0, 1700, 2200],
+         'chunks': [{'bbox': [1371, 75, 1503, 112],
+                     'label': 'page_header',
+                     'order': 75
+                     'caption': 'Annual Report 2024',
+                     'summary': 'This annual report summarizes ...' },
+                    {'bbox': [201, 1119, 1497, 1647],
+                     'label': table',
+                     'order': 1119,
+                     'caption': [{'Column 1': 'Value 1', 'Column 2': 'Value 2'},
+                     'summary': 'This table illustrates a trend of ...'},
+         ],
+ ",document_extraction
  'video_temporal_localization' will run qwen2vl on each chunk_length_frames value selected for the video. It can detect multiple objects independently per chunk_length_frames given a text prompt such as a referring expression but does not track objects across frames. It returns a list of floats with a value of 1.0 if the objects are found in a given chunk_length_frames of the video.,"video_temporal_localization(prompt: str, frames: List[numpy.ndarray], model: str = 'qwen2vl', chunk_length_frames: Optional[int] = 2) -> List[float]:
      'video_temporal_localization' will run qwen2vl on each chunk_length_frames
      value selected for the video. It can detect multiple objects independently per
@@ -513,6 +542,78 @@ desc,doc,name
      >>> siglip_classification(image, ['dog', 'cat', 'bird'])
      {""labels"": [""dog"", ""cat"", ""bird""], ""scores"": [0.68, 0.30, 0.02]},
  ",siglip_classification
+ "'owlv2_sam2_video_tracking' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, mask file names and associated probability scores.","owlv2_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10, fine_tune_id: Optional[str] = None) -> List[List[Dict[str, Any]]]:
+     'owlv2_sam2_video_tracking' is a tool that can segment multiple objects given a text
+     prompt such as category names or referring expressions. The categories in the text
+     prompt are separated by commas. It returns a list of bounding boxes, label names,
+     mask file names and associated probability scores.
+ 
+     Parameters:
+         prompt (str): The prompt to ground to the image.
+         image (np.ndarray): The image to ground the prompt to.
+ 
+     Returns:
+         List[Dict[str, Any]]: A list of dictionaries containing the score, label,
+             bounding box, and mask of the detected objects with normalized coordinates
+             (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
+             and xmax and ymax are the coordinates of the bottom-right of the bounding box.
+             The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
+             the background.
+ 
+     Example
+     -------
+     >>> countgd_sam2_video_tracking(""car, dinosaur"", frames)
+     [
+         [
+             {
+                 'label': '0: dinosaur',
+                 'bbox': [0.1, 0.11, 0.35, 0.4],
+                 'mask': array([[0, 0, 0, ..., 0, 0, 0],
+                     [0, 0, 0, ..., 0, 0, 0],
+                     ...,
+                     [0, 0, 0, ..., 0, 0, 0],
+                     [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
+             },
+         ],
+         ...
+     ]
+ ",owlv2_sam2_video_tracking
+ "'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, mask file names and associated probability scores.","countgd_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10) -> List[List[Dict[str, Any]]]:
+     'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text
+     prompt such as category names or referring expressions. The categories in the text
+     prompt are separated by commas. It returns a list of bounding boxes, label names,
+     mask file names and associated probability scores.
+ 
+     Parameters:
+         prompt (str): The prompt to ground to the image.
+         image (np.ndarray): The image to ground the prompt to.
+ 
+     Returns:
+         List[Dict[str, Any]]: A list of dictionaries containing the score, label,
+             bounding box, and mask of the detected objects with normalized coordinates
+             (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
+             and xmax and ymax are the coordinates of the bottom-right of the bounding box.
+             The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
+             the background.
+ 
+     Example
+     -------
+     >>> countgd_sam2_video_tracking(""car, dinosaur"", frames)
+     [
+         [
+             {
+                 'label': '0: dinosaur',
+                 'bbox': [0.1, 0.11, 0.35, 0.4],
+                 'mask': array([[0, 0, 0, ..., 0, 0, 0],
+                     [0, 0, 0, ..., 0, 0, 0],
+                     ...,
+                     [0, 0, 0, ..., 0, 0, 0],
+                     [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
+             },
+         ],
+         ...
+     ]
+ ",countgd_sam2_video_tracking
  "'extract_frames_and_timestamps' extracts frames and timestamps from a video which can be a file path, url or youtube link, returns a list of dictionaries with keys ""frame"" and ""timestamp"" where ""frame"" is a numpy array and ""timestamp"" is the relative time in seconds where the frame was captured. The frame is a numpy array.","extract_frames_and_timestamps(video_uri: Union[str, pathlib.Path], fps: float = 1) -> List[Dict[str, Union[numpy.ndarray, float]]]:
      'extract_frames_and_timestamps' extracts frames and timestamps from a video
      which can be a file path, url or youtube link, returns a list of dictionaries
@@ -32,7 +32,7 @@ from .tools import (
      countgd_sam2_video_tracking,
      depth_anything_v2,
      detr_segmentation,
-     document_analysis,
+     document_extraction,
      extract_frames_and_timestamps,
      florence2_ocr,
      florence2_phrase_grounding,
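
Note that document_analysis is no longer exported from vision_agent.tools, so downstream code has to switch to the new name. A minimal sketch of the updated import (this assumes no backwards-compatible alias is kept, since none appears in this diff):

    # 0.2.215
    from vision_agent.tools import document_analysis

    # 0.2.216 - the tool was renamed, so imports need updating
    from vision_agent.tools import document_extraction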
@@ -143,7 +143,14 @@ def run_tool_testing(
      code = extract_tag(response, "code")  # type: ignore
      if code is None:
          raise ValueError(f"Could not extract code from response: {response}")
-     code = process_code(code)
+ 
+     # If there's a syntax error with the code, process_code can crash. Executing the
+     # code and then sending the error to the LLM should correct it.
+     try:
+         code = process_code(code)
+     except Exception as e:
+         _LOGGER.error(f"Error processing code: {e}")
+ 
      tool_output = code_interpreter.exec_isolation(DefaultImports.prepend_imports(code))
      tool_output_str = tool_output.text(include_results=False).strip()

@@ -167,6 +174,7 @@ def run_tool_testing(
              DefaultImports.prepend_imports(code)
          )
          tool_output_str = tool_output.text(include_results=False).strip()
+         count += 1

      return code, tool_docs_str, tool_output

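Taken together, the two planner_tools.py changes make tool testing more forgiving: process_code becomes best-effort, and the loop that re-runs failing code finally advances its counter. The surrounding retry loop is not shown in this diff, so the following is a self-contained toy of the execute-then-repair pattern, with stand-in helpers, not the real planner_tools API:

    # Toy sketch of run_tool_testing's execute-and-retry flow (all names are illustrative stand-ins).
    import ast
    import logging
    from typing import Optional

    _LOGGER = logging.getLogger(__name__)

    def process_code(code: str) -> str:
        ast.parse(code)  # raises on syntax errors, like the real helper can
        return code

    def execute(code: str) -> Optional[str]:
        try:
            exec(compile(code, "<test>", "exec"), {})
            return None  # ran cleanly
        except Exception as e:
            return str(e)  # error text that would be fed back to the LLM

    def repair(code: str, error: str) -> str:
        return code.replace("1/0", "1")  # stand-in for an LLM-generated fix

    def run_tool_test(code: str, max_retries: int = 3) -> str:
        try:
            code = process_code(code)  # best-effort, as in the 0.2.216 change
        except Exception as e:
            _LOGGER.error(f"Error processing code: {e}")

        error = execute(code)
        count = 0
        while error is not None and count < max_retries:
            code = repair(code, error)
            error = execute(code)
            count += 1  # the increment added in this release keeps the retries bounded
        return code

    print(run_tool_test("x = 1/0"))  # -> "x = 1"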
@@ -119,6 +119,120 @@ def _display_tool_trace(
      display({MimeType.APPLICATION_JSON: tool_call_trace.model_dump()}, raw=True)


+ class ODModels(str, Enum):
+     COUNTGD = "countgd"
+     FLORENCE2 = "florence2"
+     OWLV2 = "owlv2"
+ 
+ 
+ def od_sam2_video_tracking(
+     od_model: ODModels,
+     prompt: str,
+     frames: List[np.ndarray],
+     chunk_length: Optional[int] = 10,
+     fine_tune_id: Optional[str] = None,
+ ) -> Dict[str, Any]:
+     results: List[Optional[List[Dict[str, Any]]]] = [None] * len(frames)
+ 
+     if chunk_length is None:
+         step = 1  # Process every frame
+     elif chunk_length <= 0:
+         raise ValueError("chunk_length must be a positive integer or None.")
+     else:
+         step = chunk_length  # Process frames with the specified step size
+ 
+     for idx in range(0, len(frames), step):
+         if od_model == ODModels.COUNTGD:
+             results[idx] = countgd_object_detection(prompt=prompt, image=frames[idx])
+             function_name = "countgd_object_detection"
+         elif od_model == ODModels.OWLV2:
+             results[idx] = owl_v2_image(
+                 prompt=prompt, image=frames[idx], fine_tune_id=fine_tune_id
+             )
+             function_name = "owl_v2_image"
+         elif od_model == ODModels.FLORENCE2:
+             results[idx] = florence2_sam2_image(
+                 prompt=prompt, image=frames[idx], fine_tune_id=fine_tune_id
+             )
+             function_name = "florence2_sam2_image"
+         else:
+             raise NotImplementedError(
+                 f"Object detection model '{od_model}' is not implemented."
+             )
+ 
+     image_size = frames[0].shape[:2]
+ 
+     def _transform_detections(
+         input_list: List[Optional[List[Dict[str, Any]]]]
+     ) -> List[Optional[Dict[str, Any]]]:
+         output_list: List[Optional[Dict[str, Any]]] = []
+ 
+         for _, frame in enumerate(input_list):
+             if frame is not None:
+                 labels = [detection["label"] for detection in frame]
+                 bboxes = [
+                     denormalize_bbox(detection["bbox"], image_size)
+                     for detection in frame
+                 ]
+ 
+                 output_list.append(
+                     {
+                         "labels": labels,
+                         "bboxes": bboxes,
+                     }
+                 )
+             else:
+                 output_list.append(None)
+ 
+         return output_list
+ 
+     output = _transform_detections(results)
+ 
+     buffer_bytes = frames_to_bytes(frames)
+     files = [("video", buffer_bytes)]
+     payload = {"bboxes": json.dumps(output), "chunk_length": chunk_length}
+     metadata = {"function_name": function_name}
+ 
+     detections = send_task_inference_request(
+         payload,
+         "sam2",
+         files=files,
+         metadata=metadata,
+     )
+ 
+     return_data = []
+     for frame in detections:
+         return_frame_data = []
+         for detection in frame:
+             mask = rle_decode_array(detection["mask"])
+             label = str(detection["id"]) + ": " + detection["label"]
+             return_frame_data.append(
+                 {"label": label, "mask": mask, "score": 1.0, "rle": detection["mask"]}
+             )
+         return_data.append(return_frame_data)
+     return_data = add_bboxes_from_masks(return_data)
+     return_data = nms(return_data, iou_threshold=0.95)
+ 
+     # We save the RLE for display purposes, re-calculting RLE can get very expensive.
+     # Deleted here because we are returning the numpy masks instead
+     display_data = []
+     for frame in return_data:
+         display_frame_data = []
+         for obj in frame:
+             display_frame_data.append(
+                 {
+                     "label": obj["label"],
+                     "score": obj["score"],
+                     "bbox": denormalize_bbox(obj["bbox"], image_size),
+                     "mask": obj["rle"],
+                 }
+             )
+             del obj["rle"]
+         display_data.append(display_frame_data)
+ 
+     return {"files": files, "return_data": return_data, "display_data": detections}
+ 
+ 
  def owl_v2_image(
      prompt: str,
      image: np.ndarray,
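
The new od_sam2_video_tracking helper only runs the chosen detector on every chunk_length-th frame and lets the SAM2 endpoint propagate masks across the frames in between. A small, self-contained illustration of which frame indices receive a fresh detection under the default chunk_length=10 (the frame count is illustrative):

    # Frames that get a fresh object detection before SAM2 fills in the gaps.
    num_frames = 25
    chunk_length = 10  # default in od_sam2_video_tracking
    step = 1 if chunk_length is None else chunk_length
    detection_frames = list(range(0, num_frames, step))
    print(detection_frames)  # [0, 10, 20] -> the detector runs 3 times instead of 25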
@@ -302,6 +416,64 @@ def owl_v2_video(
      return bboxes_formatted


+ def owlv2_sam2_video_tracking(
+     prompt: str,
+     frames: List[np.ndarray],
+     chunk_length: Optional[int] = 10,
+     fine_tune_id: Optional[str] = None,
+ ) -> List[List[Dict[str, Any]]]:
+     """'owlv2_sam2_video_tracking' is a tool that can segment multiple objects given a text
+     prompt such as category names or referring expressions. The categories in the text
+     prompt are separated by commas. It returns a list of bounding boxes, label names,
+     mask file names and associated probability scores.
+ 
+     Parameters:
+         prompt (str): The prompt to ground to the image.
+         image (np.ndarray): The image to ground the prompt to.
+ 
+     Returns:
+         List[Dict[str, Any]]: A list of dictionaries containing the score, label,
+             bounding box, and mask of the detected objects with normalized coordinates
+             (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
+             and xmax and ymax are the coordinates of the bottom-right of the bounding box.
+             The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
+             the background.
+ 
+     Example
+     -------
+     >>> countgd_sam2_video_tracking("car, dinosaur", frames)
+     [
+         [
+             {
+                 'label': '0: dinosaur',
+                 'bbox': [0.1, 0.11, 0.35, 0.4],
+                 'mask': array([[0, 0, 0, ..., 0, 0, 0],
+                     [0, 0, 0, ..., 0, 0, 0],
+                     ...,
+                     [0, 0, 0, ..., 0, 0, 0],
+                     [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
+             },
+         ],
+         ...
+     ]
+     """
+ 
+     ret = od_sam2_video_tracking(
+         ODModels.OWLV2,
+         prompt=prompt,
+         frames=frames,
+         chunk_length=chunk_length,
+         fine_tune_id=fine_tune_id,
+     )
+     _display_tool_trace(
+         owlv2_sam2_video_tracking.__name__,
+         {},
+         ret["display_data"],
+         ret["files"],
+     )
+     return ret["return_data"]  # type: ignore
+ 
+ 
  def florence2_sam2_image(
      prompt: str, image: np.ndarray, fine_tune_id: Optional[str] = None
  ) -> List[Dict[str, Any]]:
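
A minimal usage sketch for the new wrapper, pairing it with the package's extract_frames_and_timestamps helper; the video path, fps, and prompt are illustrative, and it assumes both functions are exported from vision_agent.tools like the other tools:

    from vision_agent.tools import extract_frames_and_timestamps, owlv2_sam2_video_tracking

    # Sample a clip at 1 frame per second (path is illustrative).
    frames = [f["frame"] for f in extract_frames_and_timestamps("traffic.mp4", fps=1)]

    # OWLv2 detects on every 10th frame; SAM2 tracks the masks in between.
    tracks = owlv2_sam2_video_tracking("car, truck", frames, chunk_length=10)

    for frame_idx, detections in enumerate(tracks):
        for det in detections:
            print(frame_idx, det["label"], det["bbox"])  # det["mask"] is a binary numpy array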
@@ -834,6 +1006,59 @@ def countgd_sam2_object_detection(
      return seg_ret["return_data"]  # type: ignore


+ def countgd_sam2_video_tracking(
+     prompt: str,
+     frames: List[np.ndarray],
+     chunk_length: Optional[int] = 10,
+ ) -> List[List[Dict[str, Any]]]:
+     """'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text
+     prompt such as category names or referring expressions. The categories in the text
+     prompt are separated by commas. It returns a list of bounding boxes, label names,
+     mask file names and associated probability scores.
+ 
+     Parameters:
+         prompt (str): The prompt to ground to the image.
+         image (np.ndarray): The image to ground the prompt to.
+ 
+     Returns:
+         List[Dict[str, Any]]: A list of dictionaries containing the score, label,
+             bounding box, and mask of the detected objects with normalized coordinates
+             (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
+             and xmax and ymax are the coordinates of the bottom-right of the bounding box.
+             The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
+             the background.
+ 
+     Example
+     -------
+     >>> countgd_sam2_video_tracking("car, dinosaur", frames)
+     [
+         [
+             {
+                 'label': '0: dinosaur',
+                 'bbox': [0.1, 0.11, 0.35, 0.4],
+                 'mask': array([[0, 0, 0, ..., 0, 0, 0],
+                     [0, 0, 0, ..., 0, 0, 0],
+                     ...,
+                     [0, 0, 0, ..., 0, 0, 0],
+                     [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
+             },
+         ],
+         ...
+     ]
+     """
+ 
+     ret = od_sam2_video_tracking(
+         ODModels.COUNTGD, prompt=prompt, frames=frames, chunk_length=chunk_length
+     )
+     _display_tool_trace(
+         countgd_sam2_video_tracking.__name__,
+         {},
+         ret["display_data"],
+         ret["files"],
+     )
+     return ret["return_data"]  # type: ignore
+ 
+ 
  def countgd_example_based_counting(
      visual_prompts: List[List[float]],
      image: np.ndarray,
@@ -1879,11 +2104,11 @@ def closest_box_distance(
      return cast(float, np.sqrt(horizontal_distance**2 + vertical_distance**2))


- def document_analysis(image: np.ndarray) -> Dict[str, Any]:
-     """'document_analysis' is an understanding tool that can handle various
-     types of document image layouts. It returns a structured output containing the text,
-     tables, pictures, charts and information caption, summary, labels, bounding boxes, etc
-     avoiding information loss.
+ def document_extraction(image: np.ndarray) -> Dict[str, Any]:
+     """'document_extraction' is a tool that can extract structured information out of
+     documents with different layouts. It returns the extracted data in a structured
+     hierarchical format containing text, tables, pictures, charts, and other
+     information.

      Parameters:
          image (np.ndarray): The document image to analyze
@@ -1894,20 +2119,18 @@ def document_analysis(image: np.ndarray) -> Dict[str, Any]:
      Example
      -------
      >>> document_analysis(image)
-     {'pages': [{'bbox': [left_0, top_0, right_0, bottom_0],
-                 'chunks': [{'bbox': [left_1, top_1, right_1, bottom_1],
-                             'caption': 'TITLE',
+     {'pages':
+       [{'bbox': [0, 0, 1.0, 1.0],
+         'chunks': [{'bbox': [0.8, 0.1, 1.0, 0.2],
                      'label': 'page_header',
-                     'summary': 'The image contains a single word ...' },
-                    {'bbox': [left_2, top_2, right_2, bottom_2],
-                     'caption': {'data': [{'value': 200, 'year': '2024' ...},
-                                 'title': 'Total CapEx Spending',
-                                 'type': 'bar chart',
-                                 'unit': 'Billion USD',
-                                 'xAxis': 'Year',
-                                 'yAxis': 'Total CapEx Spending'},
-                     'label': 'picture',
-                     'summary': 'This bar chart illustrates the trend of ...'},
+                     'order': 75
+                     'caption': 'Annual Report 2024',
+                     'summary': 'This annual report summarizes ...' },
+                    {'bbox': [0.2, 0.9, 0.9, 1.0],
+                     'label': table',
+                     'order': 1119,
+                     'caption': [{'Column 1': 'Value 1', 'Column 2': 'Value 2'},
+                     'summary': 'This table illustrates a trend of ...'},
          ],
      """

@@ -1919,7 +2142,7 @@ def document_analysis(image: np.ndarray) -> Dict[str, Any]:
          "model": "document-analysis",
      }

-     response: dict[str, Any] = send_inference_request(
+     data: Dict[str, Any] = send_inference_request(
          payload=payload,
          endpoint_name="document-analysis",
          files=files,
@@ -1927,14 +2150,28 @@ def document_analysis(image: np.ndarray) -> Dict[str, Any]:
          metadata_payload={"function_name": "document_analysis"},
      )

+     # don't display normalized bboxes
      _display_tool_trace(
-         document_analysis.__name__,
+         document_extraction.__name__,
          payload,
-         response,
+         data,
          files,
      )

-     return response
+     def normalize(data: Any) -> Dict[str, Any]:
+         if isinstance(data, Dict):
+             if "bbox" in data:
+                 data["bbox"] = normalize_bbox(data["bbox"], image.shape[:2])
+             for key in data:
+                 data[key] = normalize(data[key])
+         elif isinstance(data, List):
+             for i in range(len(data)):
+                 data[i] = normalize(data[i])
+         return data  # type: ignore
+ 
+     data = normalize(data)
+ 
+     return data


  # Utility and visualization functions
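
Because the response is now normalized before being returned, every bbox in the result can be treated as fractions of the page size. A minimal usage sketch (the image path and Pillow-based loading are illustrative; only the fields documented in the docstring above are assumed):

    import numpy as np
    from PIL import Image
    from vision_agent.tools import document_extraction

    image = np.array(Image.open("report_page.png"))
    doc = document_extraction(image)

    height, width = image.shape[:2]
    for page in doc["pages"]:
        for chunk in page["chunks"]:
            xmin, ymin, xmax, ymax = chunk["bbox"]  # normalized to [0, 1] by the tool
            print(chunk["label"], int(xmin * width), int(ymin * height), int(xmax * width), int(ymax * height))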
@@ -2453,197 +2690,6 @@ def _plot_counting(
      return image


- class ODModels(str, Enum):
-     COUNTGD = "countgd"
-     FLORENCE2 = "florence2"
-     OWLV2 = "owlv2"
- 
- 
- def od_sam2_video_tracking(
-     od_model: ODModels,
-     prompt: str,
-     frames: List[np.ndarray],
-     chunk_length: Optional[int] = 10,
-     fine_tune_id: Optional[str] = None,
- ) -> List[List[Dict[str, Any]]]:
- 
-     results: List[Optional[List[Dict[str, Any]]]] = [None] * len(frames)
- 
-     if chunk_length is None:
-         step = 1  # Process every frame
-     elif chunk_length <= 0:
-         raise ValueError("chunk_length must be a positive integer or None.")
-     else:
-         step = chunk_length  # Process frames with the specified step size
- 
-     for idx in range(0, len(frames), step):
-         if od_model == ODModels.COUNTGD:
-             results[idx] = countgd_object_detection(prompt=prompt, image=frames[idx])
-             function_name = "countgd_object_detection"
-         elif od_model == ODModels.OWLV2:
-             results[idx] = owl_v2_image(
-                 prompt=prompt, image=frames[idx], fine_tune_id=fine_tune_id
-             )
-             function_name = "owl_v2_image"
-         elif od_model == ODModels.FLORENCE2:
-             results[idx] = florence2_sam2_image(
-                 prompt=prompt, image=frames[idx], fine_tune_id=fine_tune_id
-             )
-             function_name = "florence2_sam2_image"
-         else:
-             raise NotImplementedError(
-                 f"Object detection model '{od_model}' is not implemented."
-             )
- 
-     image_size = frames[0].shape[:2]
- 
-     def _transform_detections(
-         input_list: List[Optional[List[Dict[str, Any]]]]
-     ) -> List[Optional[Dict[str, Any]]]:
-         output_list: List[Optional[Dict[str, Any]]] = []
- 
-         for idx, frame in enumerate(input_list):
-             if frame is not None:
-                 labels = [detection["label"] for detection in frame]
-                 bboxes = [
-                     denormalize_bbox(detection["bbox"], image_size)
-                     for detection in frame
-                 ]
- 
-                 output_list.append(
-                     {
-                         "labels": labels,
-                         "bboxes": bboxes,
-                     }
-                 )
-             else:
-                 output_list.append(None)
- 
-         return output_list
- 
-     output = _transform_detections(results)
- 
-     buffer_bytes = frames_to_bytes(frames)
-     files = [("video", buffer_bytes)]
-     payload = {"bboxes": json.dumps(output), "chunk_length": chunk_length}
-     metadata = {"function_name": function_name}
- 
-     detections = send_task_inference_request(
-         payload,
-         "sam2",
-         files=files,
-         metadata=metadata,
-     )
- 
-     return_data = []
-     for frame in detections:
-         return_frame_data = []
-         for detection in frame:
-             mask = rle_decode_array(detection["mask"])
-             label = str(detection["id"]) + ": " + detection["label"]
-             return_frame_data.append({"label": label, "mask": mask, "score": 1.0})
-         return_data.append(return_frame_data)
-     return_data = add_bboxes_from_masks(return_data)
-     return nms(return_data, iou_threshold=0.95)
- 
- 
- def countgd_sam2_video_tracking(
-     prompt: str,
-     frames: List[np.ndarray],
-     chunk_length: Optional[int] = 10,
- ) -> List[List[Dict[str, Any]]]:
-     """'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text
-     prompt such as category names or referring expressions. The categories in the text
-     prompt are separated by commas. It returns a list of bounding boxes, label names,
-     mask file names and associated probability scores.
- 
-     Parameters:
-         prompt (str): The prompt to ground to the image.
-         image (np.ndarray): The image to ground the prompt to.
- 
-     Returns:
-         List[Dict[str, Any]]: A list of dictionaries containing the score, label,
-             bounding box, and mask of the detected objects with normalized coordinates
-             (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
-             and xmax and ymax are the coordinates of the bottom-right of the bounding box.
-             The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
-             the background.
- 
-     Example
-     -------
-     >>> countgd_sam2_video_tracking("car, dinosaur", frames)
-     [
-         [
-             {
-                 'label': '0: dinosaur',
-                 'bbox': [0.1, 0.11, 0.35, 0.4],
-                 'mask': array([[0, 0, 0, ..., 0, 0, 0],
-                     [0, 0, 0, ..., 0, 0, 0],
-                     ...,
-                     [0, 0, 0, ..., 0, 0, 0],
-                     [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
-             },
-         ],
-         ...
-     ]
-     """
- 
-     return od_sam2_video_tracking(
-         ODModels.COUNTGD, prompt=prompt, frames=frames, chunk_length=chunk_length
-     )
- 
- 
- def owlv2_sam2_video_tracking(
-     prompt: str,
-     frames: List[np.ndarray],
-     chunk_length: Optional[int] = 10,
-     fine_tune_id: Optional[str] = None,
- ) -> List[List[Dict[str, Any]]]:
-     """'owlv2_sam2_video_tracking' is a tool that can segment multiple objects given a text
-     prompt such as category names or referring expressions. The categories in the text
-     prompt are separated by commas. It returns a list of bounding boxes, label names,
-     mask file names and associated probability scores.
- 
-     Parameters:
-         prompt (str): The prompt to ground to the image.
-         image (np.ndarray): The image to ground the prompt to.
- 
-     Returns:
-         List[Dict[str, Any]]: A list of dictionaries containing the score, label,
-             bounding box, and mask of the detected objects with normalized coordinates
-             (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
-             and xmax and ymax are the coordinates of the bottom-right of the bounding box.
-             The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
-             the background.
- 
-     Example
-     -------
-     >>> countgd_sam2_video_tracking("car, dinosaur", frames)
-     [
-         [
-             {
-                 'label': '0: dinosaur',
-                 'bbox': [0.1, 0.11, 0.35, 0.4],
-                 'mask': array([[0, 0, 0, ..., 0, 0, 0],
-                     [0, 0, 0, ..., 0, 0, 0],
-                     ...,
-                     [0, 0, 0, ..., 0, 0, 0],
-                     [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
-             },
-         ],
-         ...
-     ]
-     """
- 
-     return od_sam2_video_tracking(
-         ODModels.OWLV2,
-         prompt=prompt,
-         frames=frames,
-         chunk_length=chunk_length,
-         fine_tune_id=fine_tune_id,
-     )
- 
- 
  FUNCTION_TOOLS = [
      owl_v2_image,
      owl_v2_video,
@@ -2663,6 +2709,7 @@ FUNCTION_TOOLS = [
      minimum_distance,
      qwen2_vl_images_vqa,
      qwen2_vl_video_vqa,
+     document_extraction,
      video_temporal_localization,
      flux_image_inpainting,
      siglip_classification,