vision-agent 0.2.221__py3-none-any.whl → 0.2.222__py3-none-any.whl
- vision_agent/.sim_tools/df.csv +253 -244
- vision_agent/.sim_tools/embs.npy +0 -0
- vision_agent/agent/vision_agent_planner_prompts_v2.py +28 -23
- vision_agent/tools/__init__.py +6 -10
- vision_agent/tools/tools.py +639 -787
- vision_agent/utils/sim.py +24 -1
- {vision_agent-0.2.221.dist-info → vision_agent-0.2.222.dist-info}/METADATA +1 -1
- {vision_agent-0.2.221.dist-info → vision_agent-0.2.222.dist-info}/RECORD +10 -10
- {vision_agent-0.2.221.dist-info → vision_agent-0.2.222.dist-info}/LICENSE +0 -0
- {vision_agent-0.2.221.dist-info → vision_agent-0.2.222.dist-info}/WHEEL +0 -0
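The bulk of this release is the rewritten tool index in vision_agent/.sim_tools/df.csv: detection and segmentation entries are renamed and their docstrings filled out (owlv2_object_detection, owlv2_sam2_instance_segmentation, countgd_sam2_instance_segmentation, florence2_object_detection, and so on), and document_extraction, document_qa, and ocr entries are added. As orientation, here is a minimal sketch of calling two of the renamed tools with the signatures quoted verbatim in the diff below; it assumes the tools are re-exported from vision_agent.tools (whose __init__.py also changes in this release) and uses a hypothetical input image.

    import numpy as np
    from PIL import Image

    # Assumption: the renamed tools are importable from vision_agent.tools,
    # per the tools/__init__.py change listed above.
    from vision_agent.tools import (
        owlv2_object_detection,
        owlv2_sam2_instance_segmentation,
    )

    image = np.array(Image.open("street.jpg"))  # hypothetical input image

    # Detection: returns dicts with 'score', 'label', and a normalized
    # 'bbox' [xmin, ymin, xmax, ymax], per the df.csv docstring.
    detections = owlv2_object_detection("car, person", image, box_threshold=0.1)

    # Instance segmentation: same fields plus a binary 2D uint8 'mask'.
    instances = owlv2_sam2_instance_segmentation("car, person", image)

    for det in detections:
        print(det["label"], det["score"], det["bbox"])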
vision_agent/.sim_tools/df.csv
CHANGED
@@ -1,9 +1,9 @@
 desc,doc,name
-"'
-'
-    prompt such as category names or referring expressions on images. The
-    text prompt are separated by commas. It returns a list of bounding
-    normalized coordinates, label names and associated probability scores.
+"'owlv2_object_detection' is a tool that can detect and count multiple objects given a text prompt such as category names or referring expressions on images. The categories in text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names and associated probability scores.","owlv2_object_detection(prompt: str, image: numpy.ndarray, box_threshold: float = 0.1, fine_tune_id: Optional[str] = None) -> List[Dict[str, Any]]:
+    'owlv2_object_detection' is a tool that can detect and count multiple objects
+    given a text prompt such as category names or referring expressions on images. The
+    categories in text prompt are separated by commas. It returns a list of bounding
+    boxes with normalized coordinates, label names and associated probability scores.

     Parameters:
         prompt (str): The prompt to ground to the image.
@@ -22,96 +22,87 @@ desc,doc,name

     Example
     -------
-    >>>
+    >>> owlv2_object_detection(""car, dinosaur"", image)
     [
         {'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]},
         {'score': 0.98, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5},
     ]
-",
-"'
-'
-
-
-
-
+",owlv2_object_detection
+"'owlv2_sam2_instance_segmentation' is a tool that can detect and count multiple instances of objects given a text prompt such as category names or referring expressions on images. The categories in text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names, masks and associated probability scores.","owlv2_sam2_instance_segmentation(prompt: str, image: numpy.ndarray, box_threshold: float = 0.1) -> List[Dict[str, Any]]:
+    'owlv2_sam2_instance_segmentation' is a tool that can detect and count multiple
+    instances of objects given a text prompt such as category names or referring
+    expressions on images. The categories in text prompt are separated by commas. It
+    returns a list of bounding boxes with normalized coordinates, label names, masks
+    and associated probability scores.

     Parameters:
-        prompt (str): The
-
-        box_threshold (float, optional): The threshold for
-            to 0.
-        fine_tune_id (Optional[str]): If you have a fine-tuned model, you can pass the
-            fine-tuned model ID here to use it.
+        prompt (str): The object that needs to be counted.
+        image (np.ndarray): The image that contains multiple instances of the object.
+        box_threshold (float, optional): The threshold for detection. Defaults
+            to 0.10.

     Returns:
-        List[
-
-
-
-
+        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
+        bounding box, and mask of the detected objects with normalized coordinates
+        (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
+        and xmax and ymax are the coordinates of the bottom-right of the bounding box.
+        The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
+        the background.

     Example
     -------
-    >>>
+    >>> owlv2_sam2_instance_segmentation(""flower"", image)
     [
-
-
-
-
-
+        {
+            'score': 0.49,
+            'label': 'flower',
+            'bbox': [0.1, 0.11, 0.35, 0.4],
+            'mask': array([[0, 0, 0, ..., 0, 0, 0],
+                [0, 0, 0, ..., 0, 0, 0],
+                ...,
+                [0, 0, 0, ..., 0, 0, 0],
+                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
+        },
     ]
-",
-"'
-'
-
-
+",owlv2_sam2_instance_segmentation
+"'owlv2_sam2_video_tracking' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, mask file names and associated probability scores.","owlv2_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10, fine_tune_id: Optional[str] = None) -> List[List[Dict[str, Any]]]:
+    'owlv2_sam2_video_tracking' is a tool that can segment multiple objects given a text
+    prompt such as category names or referring expressions. The categories in the text
+    prompt are separated by commas. It returns a list of bounding boxes, label names,
+    mask file names and associated probability scores.

     Parameters:
-
+        prompt (str): The prompt to ground to the image.
+        image (np.ndarray): The image to ground the prompt to.
+        fine_tune_id (Optional[str]): If you have a fine-tuned model, you can pass the
+            fine-tuned model ID here to use it.

     Returns:
-        List[Dict[str, Any]]: A list of dictionaries containing the
-
+        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
+        bounding box, and mask of the detected objects with normalized coordinates
+        (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
+        and xmax and ymax are the coordinates of the bottom-right of the bounding box.
+        The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
+        the background.

     Example
     -------
-    >>>
+    >>> owlv2_sam2_video_tracking(""car, dinosaur"", frames)
     [
-
+        [
+            {
+                'label': '0: dinosaur',
+                'bbox': [0.1, 0.11, 0.35, 0.4],
+                'mask': array([[0, 0, 0, ..., 0, 0, 0],
+                    [0, 0, 0, ..., 0, 0, 0],
+                    ...,
+                    [0, 0, 0, ..., 0, 0, 0],
+                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
+            },
+        ],
+        ...
     ]
-",
-'vit_image_classification' is a tool that can classify an image. It returns a list of classes and their probability scores based on image content.,"vit_image_classification(image: numpy.ndarray) -> Dict[str, Any]:
-    'vit_image_classification' is a tool that can classify an image. It returns a
-    list of classes and their probability scores based on image content.
-
-    Parameters:
-        image (np.ndarray): The image to classify or tag
-
-    Returns:
-        Dict[str, Any]: A dictionary containing the labels and scores. One dictionary
-        contains a list of labels and other a list of scores.
-
-    Example
-    -------
-    >>> vit_image_classification(image)
-    {""labels"": [""leopard"", ""lemur, otter"", ""bird""], ""scores"": [0.68, 0.30, 0.02]},
-",vit_image_classification
-'vit_nsfw_classification' is a tool that can classify an image as 'nsfw' or 'normal'. It returns the predicted label and their probability scores based on image content.,"vit_nsfw_classification(image: numpy.ndarray) -> Dict[str, Any]:
-    'vit_nsfw_classification' is a tool that can classify an image as 'nsfw' or 'normal'.
-    It returns the predicted label and their probability scores based on image content.
-
-    Parameters:
-        image (np.ndarray): The image to classify or tag
-
-    Returns:
-        Dict[str, Any]: A dictionary containing the labels and scores. One dictionary
-        contains a list of labels and other a list of scores.
-
-    Example
-    -------
-    >>> vit_nsfw_classification(image)
-    {""label"": ""normal"", ""scores"": 0.68},
-",vit_nsfw_classification
+",owlv2_sam2_video_tracking
 "'countgd_object_detection' is a tool that can detect multiple instances of an object given a text prompt. It is particularly useful when trying to detect and count a large number of objects. You can optionally separate object names in the prompt with commas. It returns a list of bounding boxes with normalized coordinates, label names and associated confidence scores.","countgd_object_detection(prompt: str, image: numpy.ndarray, box_threshold: float = 0.23) -> List[Dict[str, Any]]:
     'countgd_object_detection' is a tool that can detect multiple instances of an
     object given a text prompt. It is particularly useful when trying to detect and
@@ -142,12 +133,12 @@ desc,doc,name
         {'score': 0.98, 'label': 'flower', 'bbox': [0.44, 0.24, 0.49, 0.58},
     ]
 ",countgd_object_detection
-"'
-'
-    an object given a text prompt. It is particularly useful when trying
-    count a large number of objects. You can optionally separate object
-    prompt with commas. It returns a list of bounding boxes with
-    label names, masks associated confidence scores.
+"'countgd_sam2_instance_segmentation' is a tool that can detect multiple instances of an object given a text prompt. It is particularly useful when trying to detect and count a large number of objects. You can optionally separate object names in the prompt with commas. It returns a list of bounding boxes with normalized coordinates, label names, masks associated confidence scores.","countgd_sam2_instance_segmentation(prompt: str, image: numpy.ndarray, box_threshold: float = 0.23) -> List[Dict[str, Any]]:
+    'countgd_sam2_instance_segmentation' is a tool that can detect multiple
+    instances of an object given a text prompt. It is particularly useful when trying
+    to detect and count a large number of objects. You can optionally separate object
+    names in the prompt with commas. It returns a list of bounding boxes with
+    normalized coordinates, label names, masks associated confidence scores.

     Parameters:
         prompt (str): The object that needs to be counted.
@@ -165,7 +156,7 @@ desc,doc,name

     Example
     -------
-    >>>
+    >>> countgd_sam2_instance_segmentation(""flower"", image)
     [
         {
             'score': 0.49,
@@ -178,7 +169,45 @@ desc,doc,name
                 [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
         },
     ]
-",
+",countgd_sam2_instance_segmentation
+"'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, mask file names and associated probability scores.","countgd_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10) -> List[List[Dict[str, Any]]]:
+    'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text
+    prompt such as category names or referring expressions. The categories in the text
+    prompt are separated by commas. It returns a list of bounding boxes, label names,
+    mask file names and associated probability scores.
+
+    Parameters:
+        prompt (str): The prompt to ground to the image.
+        image (np.ndarray): The image to ground the prompt to.
+        chunk_length (Optional[int]): The number of frames to re-run florence2 to find
+            new objects.
+
+    Returns:
+        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
+        bounding box, and mask of the detected objects with normalized coordinates
+        (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
+        and xmax and ymax are the coordinates of the bottom-right of the bounding box.
+        The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
+        the background.
+
+    Example
+    -------
+    >>> countgd_sam2_video_tracking(""car, dinosaur"", frames)
+    [
+        [
+            {
+                'label': '0: dinosaur',
+                'bbox': [0.1, 0.11, 0.35, 0.4],
+                'mask': array([[0, 0, 0, ..., 0, 0, 0],
+                    [0, 0, 0, ..., 0, 0, 0],
+                    ...,
+                    [0, 0, 0, ..., 0, 0, 0],
+                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
+            },
+        ],
+        ...
+    ]
+",countgd_sam2_video_tracking
 "'florence2_ocr' is a tool that can detect text and text regions in an image. Each text region contains one line of text. It returns a list of detected text, the text region as a bounding box with normalized coordinates, and confidence scores. The results are sorted from top-left to bottom right.","florence2_ocr(image: numpy.ndarray) -> List[Dict[str, Any]]:
     'florence2_ocr' is a tool that can detect text and text regions in an image.
     Each text region contains one line of text. It returns a list of detected text,
@@ -199,11 +228,12 @@ desc,doc,name
         {'label': 'hello world', 'bbox': [0.1, 0.11, 0.35, 0.4], 'score': 0.99},
     ]
 ",florence2_ocr
-"'
-'
-    prompt such as category names or referring expressions. The
-    prompt are separated by commas. It returns a list of
-    mask file names and associated probability scores of
+"'florence2_sam2_instance_segmentation' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, mask file names and associated probability scores of 1.0.","florence2_sam2_instance_segmentation(prompt: str, image: numpy.ndarray, fine_tune_id: Optional[str] = None) -> List[Dict[str, Any]]:
+    'florence2_sam2_instance_segmentation' is a tool that can segment multiple
+    objects given a text prompt such as category names or referring expressions. The
+    categories in the text prompt are separated by commas. It returns a list of
+    bounding boxes, label names, mask file names and associated probability scores of
+    1.0.

     Parameters:
         prompt (str): The prompt to ground to the image.
@@ -221,7 +251,7 @@ desc,doc,name

     Example
     -------
-    >>>
+    >>> florence2_sam2_instance_segmentation(""car, dinosaur"", image)
     [
         {
             'score': 1.0,
@@ -234,7 +264,7 @@ desc,doc,name
                 [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
         },
     ]
-",
+",florence2_sam2_instance_segmentation
 'florence2_sam2_video_tracking' is a tool that can segment and track multiple entities in a video given a text prompt such as category names or referring expressions. You can optionally separate the categories in the text with commas. It can find new objects every 'chunk_length' frames and is useful for tracking and counting without duplicating counts and always outputs scores of 1.0.,"florence2_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10, fine_tune_id: Optional[str] = None) -> List[List[Dict[str, Any]]]:
     'florence2_sam2_video_tracking' is a tool that can segment and track multiple
     entities in a video given a text prompt such as category names or referring
@@ -259,7 +289,7 @@ desc,doc,name

     Example
     -------
-    >>>
+    >>> florence2_sam2_video_tracking(""car, dinosaur"", frames)
     [
         [
             {
@@ -275,8 +305,8 @@ desc,doc,name
         ...
     ]
 ",florence2_sam2_video_tracking
-"'
-'
+"'florence2_object_detection' is a tool that can detect multiple objects given a text prompt which can be object names or caption. You can optionally separate the object names in the text with commas. It returns a list of bounding boxes with normalized coordinates, label names and associated confidence scores of 1.0.","florence2_object_detection(prompt: str, image: numpy.ndarray, fine_tune_id: Optional[str] = None) -> List[Dict[str, Any]]:
+    'florence2_object_detection' is a tool that can detect multiple
     objects given a text prompt which can be object names or caption. You
     can optionally separate the object names in the text with commas. It returns a list
     of bounding boxes with normalized coordinates, label names and associated
@@ -297,12 +327,12 @@ desc,doc,name

     Example
     -------
-    >>>
+    >>> florence2_object_detection('person looking at a coyote', image)
     [
         {'score': 1.0, 'label': 'person', 'bbox': [0.1, 0.11, 0.35, 0.4]},
         {'score': 1.0, 'label': 'coyote', 'bbox': [0.34, 0.21, 0.85, 0.5},
     ]
-",
+",florence2_object_detection
 'claude35_text_extraction' is a tool that can extract text from an image. It returns the extracted text as a string and can be used as an alternative to OCR if you do not need to know the exact bounding box of the text.,"claude35_text_extraction(image: numpy.ndarray) -> str:
     'claude35_text_extraction' is a tool that can extract text from an image. It
     returns the extracted text as a string and can be used as an alternative to OCR if
@@ -314,6 +344,107 @@ desc,doc,name
     Returns:
         str: The extracted text from the image.
 ",claude35_text_extraction
+"'document_extraction' is a tool that can extract structured information out of documents with different layouts. It returns the extracted data in a structured hierarchical format containing text, tables, pictures, charts, and other information.","document_extraction(image: numpy.ndarray) -> Dict[str, Any]:
+    'document_extraction' is a tool that can extract structured information out of
+    documents with different layouts. It returns the extracted data in a structured
+    hierarchical format containing text, tables, pictures, charts, and other
+    information.
+
+    Parameters:
+        image (np.ndarray): The document image to analyze
+
+    Returns:
+        Dict[str, Any]: A dictionary containing the extracted information.
+
+    Example
+    -------
+    >>> document_analysis(image)
+    {'pages':
+        [{'bbox': [0, 0, 1.0, 1.0],
+            'chunks': [{'bbox': [0.8, 0.1, 1.0, 0.2],
+                'label': 'page_header',
+                'order': 75
+                'caption': 'Annual Report 2024',
+                'summary': 'This annual report summarizes ...' },
+            {'bbox': [0.2, 0.9, 0.9, 1.0],
+                'label': 'table',
+                'order': 1119,
+                'caption': [{'Column 1': 'Value 1', 'Column 2': 'Value 2'},
+                'summary': 'This table illustrates a trend of ...'},
+        ],
+",document_extraction
+"'document_qa' is a tool that can answer any questions about arbitrary documents, presentations, or tables. It's very useful for document QA tasks, you can ask it a specific question or ask it to return a JSON object answering multiple questions about the document.","document_qa(prompt: str, image: numpy.ndarray) -> str:
+    'document_qa' is a tool that can answer any questions about arbitrary documents,
+    presentations, or tables. It's very useful for document QA tasks, you can ask it a
+    specific question or ask it to return a JSON object answering multiple questions
+    about the document.
+
+    Parameters:
+        prompt (str): The question to be answered about the document image.
+        image (np.ndarray): The document image to analyze.
+
+    Returns:
+        str: The answer to the question based on the document's context.
+
+    Example
+    -------
+    >>> document_qa(image, question)
+    'The answer to the question ...'
+",document_qa
+"'ocr' extracts text from an image. It returns a list of detected text, bounding boxes with normalized coordinates, and confidence scores. The results are sorted from top-left to bottom right.","ocr(image: numpy.ndarray) -> List[Dict[str, Any]]:
+    'ocr' extracts text from an image. It returns a list of detected text, bounding
+    boxes with normalized coordinates, and confidence scores. The results are sorted
+    from top-left to bottom right.
+
+    Parameters:
+        image (np.ndarray): The image to extract text from.
+
+    Returns:
+        List[Dict[str, Any]]: A list of dictionaries containing the detected text, bbox
+        with normalized coordinates, and confidence score.
+
+    Example
+    -------
+    >>> ocr(image)
+    [
+        {'label': 'hello world', 'bbox': [0.1, 0.11, 0.35, 0.4], 'score': 0.99},
+    ]
+",ocr
+'qwen2_vl_images_vqa' is a tool that can answer any questions about arbitrary images including regular images or images of documents or presentations. It can be very useful for document QA or OCR text extraction. It returns text as an answer to the question.,"qwen2_vl_images_vqa(prompt: str, images: List[numpy.ndarray]) -> str:
+    'qwen2_vl_images_vqa' is a tool that can answer any questions about arbitrary
+    images including regular images or images of documents or presentations. It can be
+    very useful for document QA or OCR text extraction. It returns text as an answer to
+    the question.
+
+    Parameters:
+        prompt (str): The question about the document image
+        images (List[np.ndarray]): The reference images used for the question
+
+    Returns:
+        str: A string which is the answer to the given prompt.
+
+    Example
+    -------
+    >>> qwen2_vl_images_vqa('Give a summary of the document', images)
+    'The document talks about the history of the United States of America and its...'
+",qwen2_vl_images_vqa
+'qwen2_vl_video_vqa' is a tool that can answer any questions about arbitrary videos including regular videos or videos of documents or presentations. It returns text as an answer to the question.,"qwen2_vl_video_vqa(prompt: str, frames: List[numpy.ndarray]) -> str:
+    'qwen2_vl_video_vqa' is a tool that can answer any questions about arbitrary videos
+    including regular videos or videos of documents or presentations. It returns text
+    as an answer to the question.
+
+    Parameters:
+        prompt (str): The question about the video
+        frames (List[np.ndarray]): The reference frames used for the question
+
+    Returns:
+        str: A string which is the answer to the given prompt.
+
+    Example
+    -------
+    >>> qwen2_vl_video_vqa('Which football player made the goal?', frames)
+    'Lionel Messi'
+",qwen2_vl_video_vqa
 "'detr_segmentation' is a tool that can segment common objects in an image without any text prompt. It returns a list of detected objects as labels, their regions as masks and their scores.","detr_segmentation(image: numpy.ndarray) -> List[Dict[str, Any]]:
     'detr_segmentation' is a tool that can segment common objects in an
     image without any text prompt. It returns a list of detected objects
@@ -391,106 +522,38 @@ desc,doc,name
         [10, 11, 15, ..., 202, 202, 205],
         [10, 10, 10, ..., 200, 200, 200]], dtype=uint8),
 ",generate_pose_image
-
-'
-
-    between the objects, not the distance between the centers of the objects.
-
-    Parameters:
-        det1 (Dict[str, Any]): The first detection of boxes or masks.
-        det2 (Dict[str, Any]): The second detection of boxes or masks.
-        image_size (Tuple[int, int]): The size of the image given as (height, width).
-
-    Returns:
-        float: The closest distance between the two detections.
-
-    Example
-    -------
-    >>> closest_distance(det1, det2, image_size)
-    141.42
-",minimum_distance
-'qwen2_vl_images_vqa' is a tool that can answer any questions about arbitrary images including regular images or images of documents or presentations. It can be very useful for document QA or OCR text extraction. It returns text as an answer to the question.,"qwen2_vl_images_vqa(prompt: str, images: List[numpy.ndarray]) -> str:
-    'qwen2_vl_images_vqa' is a tool that can answer any questions about arbitrary
-    images including regular images or images of documents or presentations. It can be
-    very useful for document QA or OCR text extraction. It returns text as an answer to
-    the question.
-
-    Parameters:
-        prompt (str): The question about the document image
-        images (List[np.ndarray]): The reference images used for the question
-
-    Returns:
-        str: A string which is the answer to the given prompt.
-
-    Example
-    -------
-    >>> qwen2_vl_images_vqa('Give a summary of the document', images)
-    'The document talks about the history of the United States of America and its...'
-",qwen2_vl_images_vqa
-'qwen2_vl_video_vqa' is a tool that can answer any questions about arbitrary videos including regular videos or videos of documents or presentations. It returns text as an answer to the question.,"qwen2_vl_video_vqa(prompt: str, frames: List[numpy.ndarray]) -> str:
-    'qwen2_vl_video_vqa' is a tool that can answer any questions about arbitrary videos
-    including regular videos or videos of documents or presentations. It returns text
-    as an answer to the question.
-
-    Parameters:
-        prompt (str): The question about the video
-        frames (List[np.ndarray]): The reference frames used for the question
-
-    Returns:
-        str: A string which is the answer to the given prompt.
-
-    Example
-    -------
-    >>> qwen2_vl_video_vqa('Which football player made the goal?', frames)
-    'Lionel Messi'
-",qwen2_vl_video_vqa
-"'document_extraction' is a tool that can extract structured information out of documents with different layouts. It returns the extracted data in a structured hierarchical format containing text, tables, pictures, charts, and other information.","document_extraction(image: numpy.ndarray) -> Dict[str, Any]:
-    'document_extraction' is a tool that can extract structured information out of
-    documents with different layouts. It returns the extracted data in a structured
-    hierarchical format containing text, tables, pictures, charts, and other
-    information.
+'vit_image_classification' is a tool that can classify an image. It returns a list of classes and their probability scores based on image content.,"vit_image_classification(image: numpy.ndarray) -> Dict[str, Any]:
+    'vit_image_classification' is a tool that can classify an image. It returns a
+    list of classes and their probability scores based on image content.

     Parameters:
-        image (np.ndarray): The
+        image (np.ndarray): The image to classify or tag

     Returns:
-        Dict[str, Any]: A dictionary containing the
+        Dict[str, Any]: A dictionary containing the labels and scores. One dictionary
+        contains a list of labels and other a list of scores.

     Example
     -------
-    >>>
-    {
-
-
-
-
-                'caption': 'Annual Report 2024',
-                'summary': 'This annual report summarizes ...' },
-            {'bbox': [0.2, 0.9, 0.9, 1.0],
-                'label': table',
-                'order': 1119,
-                'caption': [{'Column 1': 'Value 1', 'Column 2': 'Value 2'},
-                'summary': 'This table illustrates a trend of ...'},
-        ],
-",document_extraction
-"'document_qa' is a tool that can answer any questions about arbitrary documents, presentations, or tables. It's very useful for document QA tasks, you can ask it a specific question or ask it to return a JSON object answering multiple questions about the document.","document_qa(prompt: str, image: numpy.ndarray) -> str:
-    'document_qa' is a tool that can answer any questions about arbitrary documents,
-    presentations, or tables. It's very useful for document QA tasks, you can ask it a
-    specific question or ask it to return a JSON object answering multiple questions
-    about the document.
+    >>> vit_image_classification(image)
+    {""labels"": [""leopard"", ""lemur, otter"", ""bird""], ""scores"": [0.68, 0.30, 0.02]},
+",vit_image_classification
+'vit_nsfw_classification' is a tool that can classify an image as 'nsfw' or 'normal'. It returns the predicted label and their probability scores based on image content.,"vit_nsfw_classification(image: numpy.ndarray) -> Dict[str, Any]:
+    'vit_nsfw_classification' is a tool that can classify an image as 'nsfw' or 'normal'.
+    It returns the predicted label and their probability scores based on image content.

     Parameters:
-
-        image (np.ndarray): The document image to analyze.
+        image (np.ndarray): The image to classify or tag

     Returns:
-        str:
+        Dict[str, Any]: A dictionary containing the labels and scores. One dictionary
+        contains a list of labels and other a list of scores.

     Example
     -------
-    >>>
-
-",
+    >>> vit_nsfw_classification(image)
+    {""label"": ""normal"", ""scores"": 0.68},
+",vit_nsfw_classification
 'video_temporal_localization' will run qwen2vl on each chunk_length_frames value selected for the video. It can detect multiple objects independently per chunk_length_frames given a text prompt such as a referring expression but does not track objects across frames. It returns a list of floats with a value of 1.0 if the objects are found in a given chunk_length_frames of the video.,"video_temporal_localization(prompt: str, frames: List[numpy.ndarray], model: str = 'qwen2vl', chunk_length_frames: Optional[int] = 2) -> List[float]:
     'video_temporal_localization' will run qwen2vl on each chunk_length_frames
     value selected for the video. It can detect multiple objects independently per
@@ -560,78 +623,24 @@ desc,doc,name
     >>> siglip_classification(image, ['dog', 'cat', 'bird'])
     {""labels"": [""dog"", ""cat"", ""bird""], ""scores"": [0.68, 0.30, 0.02]},
 ",siglip_classification
-"'
-'
-
-
-    mask file names and associated probability scores.
-
-    Parameters:
-        prompt (str): The prompt to ground to the image.
-        image (np.ndarray): The image to ground the prompt to.
-
-    Returns:
-        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
-        bounding box, and mask of the detected objects with normalized coordinates
-        (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
-        and xmax and ymax are the coordinates of the bottom-right of the bounding box.
-        The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
-        the background.
-
-    Example
-    -------
-    >>> countgd_sam2_video_tracking(""car, dinosaur"", frames)
-    [
-        [
-            {
-                'label': '0: dinosaur',
-                'bbox': [0.1, 0.11, 0.35, 0.4],
-                'mask': array([[0, 0, 0, ..., 0, 0, 0],
-                    [0, 0, 0, ..., 0, 0, 0],
-                    ...,
-                    [0, 0, 0, ..., 0, 0, 0],
-                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
-            },
-        ],
-        ...
-    ]
-",owlv2_sam2_video_tracking
-"'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, mask file names and associated probability scores.","countgd_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], chunk_length: Optional[int] = 10) -> List[List[Dict[str, Any]]]:
-    'countgd_sam2_video_tracking' is a tool that can segment multiple objects given a text
-    prompt such as category names or referring expressions. The categories in the text
-    prompt are separated by commas. It returns a list of bounding boxes, label names,
-    mask file names and associated probability scores.
+"'minimum_distance' calculates the minimum distance between two detections which can include bounding boxes and or masks. This will return the closest distance between the objects, not the distance between the centers of the objects.","minimum_distance(det1: Dict[str, Any], det2: Dict[str, Any], image_size: Tuple[int, int]) -> float:
+    'minimum_distance' calculates the minimum distance between two detections which
+    can include bounding boxes and or masks. This will return the closest distance
+    between the objects, not the distance between the centers of the objects.

     Parameters:
-
-
+        det1 (Dict[str, Any]): The first detection of boxes or masks.
+        det2 (Dict[str, Any]): The second detection of boxes or masks.
+        image_size (Tuple[int, int]): The size of the image given as (height, width).

     Returns:
-
-        bounding box, and mask of the detected objects with normalized coordinates
-        (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
-        and xmax and ymax are the coordinates of the bottom-right of the bounding box.
-        The mask is binary 2D numpy array where 1 indicates the object and 0 indicates
-        the background.
+        float: The closest distance between the two detections.

     Example
     -------
-    >>>
-
-
-            {
-                'label': '0: dinosaur',
-                'bbox': [0.1, 0.11, 0.35, 0.4],
-                'mask': array([[0, 0, 0, ..., 0, 0, 0],
-                    [0, 0, 0, ..., 0, 0, 0],
-                    ...,
-                    [0, 0, 0, ..., 0, 0, 0],
-                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
-            },
-        ],
-        ...
-    ]
-",countgd_sam2_video_tracking
+    >>> closest_distance(det1, det2, image_size)
+    141.42
+",minimum_distance
 "'extract_frames_and_timestamps' extracts frames and timestamps from a video which can be a file path, url or youtube link, returns a list of dictionaries with keys ""frame"" and ""timestamp"" where ""frame"" is a numpy array and ""timestamp"" is the relative time in seconds where the frame was captured. The frame is a numpy array.","extract_frames_and_timestamps(video_uri: Union[str, pathlib.Path], fps: float = 1) -> List[Dict[str, Union[numpy.ndarray, float]]]:
     'extract_frames_and_timestamps' extracts frames and timestamps from a video
     which can be a file path, url or youtube link, returns a list of dictionaries