vision-agent 0.2.2__tar.gz → 0.2.4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (25)
  1. {vision_agent-0.2.2 → vision_agent-0.2.4}/PKG-INFO +34 -5
  2. {vision_agent-0.2.2 → vision_agent-0.2.4}/README.md +33 -4
  3. {vision_agent-0.2.2 → vision_agent-0.2.4}/pyproject.toml +1 -1
  4. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/agent/vision_agent.py +28 -12
  5. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/llm/llm.py +5 -0
  6. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/lmm/lmm.py +13 -4
  7. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/tools/__init__.py +4 -0
  8. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/tools/tools.py +233 -20
  9. {vision_agent-0.2.2 → vision_agent-0.2.4}/LICENSE +0 -0
  10. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/__init__.py +0 -0
  11. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/agent/__init__.py +0 -0
  12. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/agent/agent.py +0 -0
  13. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/agent/easytool.py +0 -0
  14. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/agent/easytool_prompts.py +0 -0
  15. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/agent/reflexion.py +0 -0
  16. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/agent/reflexion_prompts.py +0 -0
  17. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/agent/vision_agent_prompts.py +0 -0
  18. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/fonts/__init__.py +0 -0
  19. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/fonts/default_font_ch_en.ttf +0 -0
  20. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/image_utils.py +0 -0
  21. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/llm/__init__.py +0 -0
  22. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/lmm/__init__.py +0 -0
  23. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/tools/prompts.py +0 -0
  24. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/tools/video.py +0 -0
  25. {vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/type_defs.py +0 -0
{vision_agent-0.2.2 → vision_agent-0.2.4}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: vision-agent
- Version: 0.2.2
+ Version: 0.2.4
 Summary: Toolset for Vision Agent
 Author: Landing AI
 Author-email: dev@landing.ai

@@ -58,7 +58,7 @@ pip install vision-agent
 ```

 Ensure you have an OpenAI API key and set it as an environment variable (if you are
- using Azure OpenAI please see the additional setup section):
+ using Azure OpenAI please see the Azure setup section):

 ```bash
 export OPENAI_API_KEY="your-api-key"

@@ -123,26 +123,55 @@ you. For example:
 }]
 ```

+ #### Custom Tools
+ You can also add your own custom tools for your vision agent to use:
+
+ ```python
+ >>> from vision_agent.tools import Tool, register_tool
+ >>> @register_tool
+ >>> class NumItems(Tool):
+ >>> name = "num_items_"
+ >>> description = "Returns the number of items in a list."
+ >>> usage = {
+ >>> "required_parameters": [{"name": "prompt", "type": "list"}],
+ >>> "examples": [
+ >>> {
+ >>> "scenario": "How many items are in this list? ['a', 'b', 'c']",
+ >>> "parameters": {"prompt": "['a', 'b', 'c']"},
+ >>> }
+ >>> ],
+ >>> }
+ >>> def __call__(self, prompt: list[str]) -> int:
+ >>> return len(prompt)
+ ```
+ This will register it with the list of tools Vision Agent has access to. It will be able
+ to pick it based on the tool description and use it based on the usage provided.
+
+ #### Tool List
 | Tool | Description |
 | --- | --- |
 | CLIP | CLIP is a tool that can classify or tag any image given a set of input classes or tags. |
+ | ImageCaption| ImageCaption is a tool that can generate a caption for an image. |
 | GroundingDINO | GroundingDINO is a tool that can detect arbitrary objects with inputs such as category names or referring expressions. |
 | GroundingSAM | GroundingSAM is a tool that can detect and segment arbitrary objects with inputs such as category names or referring expressions. |
- | Counter | Counter detects and counts the number of objects in an image given an input such as a category name or referring expression. |
+ | DINOv | DINOv is a tool that can detect arbitrary objects with using a referring mask. |
+ | ExtractFrames | ExtractFrames extracts frames with motion from a video. |
 | Crop | Crop crops an image given a bounding box and returns a file name of the cropped image. |
 | BboxArea | BboxArea returns the area of the bounding box in pixels normalized to 2 decimal places. |
 | SegArea | SegArea returns the area of the segmentation mask in pixels normalized to 2 decimal places. |
 | BboxIoU | BboxIoU returns the intersection over union of two bounding boxes normalized to 2 decimal places. |
 | SegIoU | SegIoU returns the intersection over union of two segmentation masks normalized to 2 decimal places. |
- | ExtractFrames | ExtractFrames extracts frames with motion from a video. |
+ | BoxDistance | BoxDistance returns the minimum distance between two bounding boxes normalized to 2 decimal places. |
+ | BboxContains | BboxContains returns the intersection of two boxes over the target box area. It is good for check if one box is contained within another box. |
 | ExtractFrames | ExtractFrames extracts frames with motion from a video. |
 | ZeroShotCounting | ZeroShotCounting returns the total number of objects belonging to a single class in a given image |
 | VisualPromptCounting | VisualPromptCounting returns the total number of objects belonging to a single class given an image and visual prompt |
+ | OCR | OCR returns the text detected in an image along with the location. |


 It also has a basic set of calculate tools such as add, subtract, multiply and divide.

- ### Additional Setup
+ ### Azure Setup
 If you want to use Azure OpenAI models, you can set the environment variable:

 ```bash
{vision_agent-0.2.2 → vision_agent-0.2.4}/README.md

@@ -31,7 +31,7 @@ pip install vision-agent
 ```

 Ensure you have an OpenAI API key and set it as an environment variable (if you are
- using Azure OpenAI please see the additional setup section):
+ using Azure OpenAI please see the Azure setup section):

 ```bash
 export OPENAI_API_KEY="your-api-key"

@@ -96,26 +96,55 @@ you. For example:
 }]
 ```

+ #### Custom Tools
+ You can also add your own custom tools for your vision agent to use:
+
+ ```python
+ >>> from vision_agent.tools import Tool, register_tool
+ >>> @register_tool
+ >>> class NumItems(Tool):
+ >>> name = "num_items_"
+ >>> description = "Returns the number of items in a list."
+ >>> usage = {
+ >>> "required_parameters": [{"name": "prompt", "type": "list"}],
+ >>> "examples": [
+ >>> {
+ >>> "scenario": "How many items are in this list? ['a', 'b', 'c']",
+ >>> "parameters": {"prompt": "['a', 'b', 'c']"},
+ >>> }
+ >>> ],
+ >>> }
+ >>> def __call__(self, prompt: list[str]) -> int:
+ >>> return len(prompt)
+ ```
+ This will register it with the list of tools Vision Agent has access to. It will be able
+ to pick it based on the tool description and use it based on the usage provided.
+
+ #### Tool List
 | Tool | Description |
 | --- | --- |
 | CLIP | CLIP is a tool that can classify or tag any image given a set of input classes or tags. |
+ | ImageCaption| ImageCaption is a tool that can generate a caption for an image. |
 | GroundingDINO | GroundingDINO is a tool that can detect arbitrary objects with inputs such as category names or referring expressions. |
 | GroundingSAM | GroundingSAM is a tool that can detect and segment arbitrary objects with inputs such as category names or referring expressions. |
- | Counter | Counter detects and counts the number of objects in an image given an input such as a category name or referring expression. |
+ | DINOv | DINOv is a tool that can detect arbitrary objects with using a referring mask. |
+ | ExtractFrames | ExtractFrames extracts frames with motion from a video. |
 | Crop | Crop crops an image given a bounding box and returns a file name of the cropped image. |
 | BboxArea | BboxArea returns the area of the bounding box in pixels normalized to 2 decimal places. |
 | SegArea | SegArea returns the area of the segmentation mask in pixels normalized to 2 decimal places. |
 | BboxIoU | BboxIoU returns the intersection over union of two bounding boxes normalized to 2 decimal places. |
 | SegIoU | SegIoU returns the intersection over union of two segmentation masks normalized to 2 decimal places. |
- | ExtractFrames | ExtractFrames extracts frames with motion from a video. |
+ | BoxDistance | BoxDistance returns the minimum distance between two bounding boxes normalized to 2 decimal places. |
+ | BboxContains | BboxContains returns the intersection of two boxes over the target box area. It is good for check if one box is contained within another box. |
 | ExtractFrames | ExtractFrames extracts frames with motion from a video. |
 | ZeroShotCounting | ZeroShotCounting returns the total number of objects belonging to a single class in a given image |
 | VisualPromptCounting | VisualPromptCounting returns the total number of objects belonging to a single class given an image and visual prompt |
+ | OCR | OCR returns the text detected in an image along with the location. |


 It also has a basic set of calculate tools such as add, subtract, multiply and divide.

- ### Additional Setup
+ ### Azure Setup
 If you want to use Azure OpenAI models, you can set the environment variable:

 ```bash
{vision_agent-0.2.2 → vision_agent-0.2.4}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"

 [tool.poetry]
 name = "vision-agent"
- version = "0.2.2"
+ version = "0.2.4"
 description = "Toolset for Vision Agent"
 authors = ["Landing AI <dev@landing.ai>"]
 readme = "README.md"
{vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/agent/vision_agent.py

@@ -377,6 +377,7 @@ def visualize_result(all_tool_results: List[Dict]) -> Sequence[Union[str, Path]]
 "dinov_",
 "zero_shot_counting_",
 "visual_prompt_counting_",
+ "ocr_",
 ]:
 continue

@@ -428,7 +429,7 @@ class VisionAgent(Agent):
 ):
 """VisionAgent constructor.

- Parameters
+ Parameters:
 task_model: the model to use for task decomposition.
 answer_model: the model to use for reasoning and concluding the answer.
 reflect_model: the model to use for self reflection.

@@ -504,24 +505,39 @@
 reference_data: Optional[Dict[str, str]] = None,
 visualize_output: Optional[bool] = False,
 ) -> Tuple[str, List[Dict]]:
+ """Chat with the vision agent and return the final answer and all tool results.
+
+ Parameters:
+ chat: a conversation in the format of
+ [{"role": "user", "content": "describe your task here..."}].
+ image: the input image referenced in the chat parameter.
+ reference_data: a dictionary containing the reference image and mask. in the
+ format of {"image": "image.jpg", "mask": "mask.jpg}
+ visualize_output: whether to visualize the output.
+
+ Returns:
+ A tuple where the first item is the final answer and the second item is a
+ list of all the tool results. The last item in the tool results also
+ contains the visualized output.
+ """
 question = chat[0]["content"]
 if image:
 question += f" Image name: {image}"
 if reference_data:
- if not (
- "image" in reference_data
- and ("mask" in reference_data or "bbox" in reference_data)
- ):
- raise ValueError(
- f"Reference data must contain 'image' and a visual prompt which can be 'mask' or 'bbox'. but got {reference_data}"
- )
- visual_prompt_data = (
- f"Reference mask: {reference_data['mask']}"
+ question += (
+ f" Reference image: {reference_data['image']}"
+ if "image" in reference_data
+ else ""
+ )
+ question += (
+ f" Reference mask: {reference_data['mask']}"
 if "mask" in reference_data
- else f"Reference bbox: {reference_data['bbox']}"
+ else ""
 )
 question += (
- f" Reference image: {reference_data['image']}, {visual_prompt_data}"
+ f" Reference bbox: {reference_data['bbox']}"
+ if "bbox" in reference_data
+ else ""
 )

 reflections = ""
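
Note on the reference_data change above: instead of rejecting incomplete reference data, the chat method now appends whichever of the reference image, mask, or bbox is present to the question. A minimal sketch of the new behavior (illustrative only; the standalone helper and the file names are hypothetical, not part of the package):

```python
# Illustrative sketch of the reference_data handling shown in the hunk above.
def append_reference_data(question: str, reference_data: dict) -> str:
    if "image" in reference_data:
        question += f" Reference image: {reference_data['image']}"
    if "mask" in reference_data:
        question += f" Reference mask: {reference_data['mask']}"
    if "bbox" in reference_data:
        question += f" Reference bbox: {reference_data['bbox']}"
    return question

# append_reference_data("Count the flowers.", {"image": "garden.jpg", "mask": "garden_mask.jpg"})
# -> "Count the flowers. Reference image: garden.jpg Reference mask: garden_mask.jpg"
```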

{vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/llm/llm.py

@@ -131,6 +131,11 @@ class OpenAILLM(LLM):
 def generate_zero_shot_counter(self, question: str) -> Callable:
 return lambda x: ZeroShotCounting()(**{"image": x})

+ def generate_image_qa_tool(self, question: str) -> Callable:
+ from vision_agent.tools import ImageQuestionAnswering
+
+ return lambda x: ImageQuestionAnswering()(**{"prompt": question, "image": x})
+

 class AzureOpenAILLM(OpenAILLM):
 def __init__(

{vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/lmm/lmm.py

@@ -11,11 +11,7 @@ from openai import AzureOpenAI, OpenAI

 from vision_agent.tools import (
 CHOOSE_PARAMS,
- CLIP,
 SYSTEM_PROMPT,
- GroundingDINO,
- GroundingSAM,
- ZeroShotCounting,
 )

 _LOGGER = logging.getLogger(__name__)

@@ -205,6 +201,8 @@ class OpenAILMM(LMM):
 return cast(str, response.choices[0].message.content)

 def generate_classifier(self, question: str) -> Callable:
+ from vision_agent.tools import CLIP
+
 api_doc = CLIP.description + "\n" + str(CLIP.usage)
 prompt = CHOOSE_PARAMS.format(api_doc=api_doc, question=question)
 response = self.client.chat.completions.create(

@@ -228,6 +226,8 @@ class OpenAILMM(LMM):
 return lambda x: CLIP()(**{"prompt": params["prompt"], "image": x})

 def generate_detector(self, question: str) -> Callable:
+ from vision_agent.tools import GroundingDINO
+
 api_doc = GroundingDINO.description + "\n" + str(GroundingDINO.usage)
 prompt = CHOOSE_PARAMS.format(api_doc=api_doc, question=question)
 response = self.client.chat.completions.create(

@@ -251,6 +251,8 @@ class OpenAILMM(LMM):
 return lambda x: GroundingDINO()(**{"prompt": params["prompt"], "image": x})

 def generate_segmentor(self, question: str) -> Callable:
+ from vision_agent.tools import GroundingSAM
+
 api_doc = GroundingSAM.description + "\n" + str(GroundingSAM.usage)
 prompt = CHOOSE_PARAMS.format(api_doc=api_doc, question=question)
 response = self.client.chat.completions.create(

@@ -274,8 +276,15 @@ class OpenAILMM(LMM):
 return lambda x: GroundingSAM()(**{"prompt": params["prompt"], "image": x})

 def generate_zero_shot_counter(self, question: str) -> Callable:
+ from vision_agent.tools import ZeroShotCounting
+
 return lambda x: ZeroShotCounting()(**{"image": x})

+ def generate_image_qa_tool(self, question: str) -> Callable:
+ from vision_agent.tools import ImageQuestionAnswering
+
+ return lambda x: ImageQuestionAnswering()(**{"prompt": question, "image": x})
+

 class AzureOpenAILMM(OpenAILMM):
 def __init__(
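
The imports moved inside the generate_* methods above are presumably there to break a circular dependency: vision_agent/tools/tools.py now imports OpenAILMM (see the tools.py hunks below), so lmm.py can no longer import the tool classes at module load time. A hedged sketch of how the new generate_image_qa_tool factory might be used (the question and file name are placeholders):

```python
from vision_agent.lmm import OpenAILMM

lmm = OpenAILMM()
# Bind the question once; the returned callable runs ImageQuestionAnswering on any image.
count_pallets = lmm.generate_image_qa_tool("How many pallets are visible?")
answer = count_pallets("warehouse.jpg")  # -> ImageQuestionAnswering()(prompt=..., image="warehouse.jpg")
```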

{vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/tools/__init__.py

@@ -1,6 +1,7 @@
 from .prompts import CHOOSE_PARAMS, SYSTEM_PROMPT
 from .tools import ( # Counter,
 CLIP,
+ OCR,
 TOOLS,
 BboxArea,
 BboxIoU,

@@ -13,7 +14,10 @@ from .tools import ( # Counter,
 ImageCaption,
 ZeroShotCounting,
 VisualPromptCounting,
+ VisualQuestionAnswering,
+ ImageQuestionAnswering,
 SegArea,
 SegIoU,
 Tool,
+ register_tool,
 )

{vision_agent-0.2.2 → vision_agent-0.2.4}/vision_agent/tools/tools.py

@@ -1,8 +1,9 @@
+ import io
 import logging
 import tempfile
 from abc import ABC
 from pathlib import Path
- from typing import Any, Dict, List, Tuple, Union, cast
+ from typing import Any, Dict, List, Tuple, Type, Union, cast

 import numpy as np
 import requests

@@ -11,13 +12,14 @@ from PIL.Image import Image as ImageType

 from vision_agent.image_utils import (
 convert_to_b64,
+ denormalize_bbox,
 get_image_size,
- rle_decode,
 normalize_bbox,
- denormalize_bbox,
+ rle_decode,
 )
 from vision_agent.tools.video import extract_frames_from_video
 from vision_agent.type_defs import LandingaiAPIKey
+ from vision_agent.lmm import OpenAILMM

 _LOGGER = logging.getLogger(__name__)
 _LND_API_KEY = LandingaiAPIKey().api_key

@@ -29,6 +31,9 @@ class Tool(ABC):
 description: str
 usage: Dict

+ def __call__(self, *args: Any, **kwargs: Any) -> Any:
+ raise NotImplementedError
+

 class NoOp(Tool):
 name = "noop_"

@@ -108,8 +113,7 @@ class CLIP(Tool):


 class ImageCaption(Tool):
- r"""ImageCaption is a tool that can caption an image based on its contents
- or tags.
+ r"""ImageCaption is a tool that can caption an image based on its contents or tags.

 Example
 -------

@@ -120,26 +124,20 @@ class ImageCaption(Tool):
 """

 name = "image_caption_"
- description = "'image_caption_' is a tool that can caption an image based on its contents or tags. It returns a text describing the image"
+ description = "'image_caption_' is a tool that can caption an image based on its contents or tags. It returns a text describing the image."
 usage = {
 "required_parameters": [
 {"name": "image", "type": "str"},
 ],
 "examples": [
 {
- "scenario": "Can you describe this image ? Image name: cat.jpg",
+ "scenario": "Can you describe this image? Image name: cat.jpg",
 "parameters": {"image": "cat.jpg"},
 },
 {
- "scenario": "Can you caption this image with their main contents ? Image name: cat_dog.jpg",
+ "scenario": "Can you caption this image with their main contents? Image name: cat_dog.jpg",
 "parameters": {"image": "cat_dog.jpg"},
 },
- {
- "scenario": "Can you build me a image captioning tool ? Image name: shirts.jpg",
- "parameters": {
- "image": "shirts.jpg",
- },
- },
 ],
 }

@@ -487,15 +485,15 @@ class ZeroShotCounting(Tool):
 ],
 "examples": [
 {
- "scenario": "Can you count the lids in the image ? Image name: lids.jpg",
+ "scenario": "Can you count the lids in the image? Image name: lids.jpg",
 "parameters": {"image": "lids.jpg"},
 },
 {
- "scenario": "Can you count the total number of objects in this image ? Image name: tray.jpg",
+ "scenario": "Can you count the total number of objects in this image? Image name: tray.jpg",
 "parameters": {"image": "tray.jpg"},
 },
 {
- "scenario": "Can you build me an object counting tool ? Image name: shirts.jpg",
+ "scenario": "Can you build me an object counting tool? Image name: shirts.jpg",
 "parameters": {
 "image": "shirts.jpg",
 },

@@ -505,7 +503,7 @@ class ZeroShotCounting(Tool):

 # TODO: Add support for input multiple images, which aligns with the output type.
 def __call__(self, image: Union[str, ImageType]) -> Dict:
- """Invoke the Image captioning model.
+ """Invoke the Zero shot counting model.

 Parameters:
 image: the input image.

@@ -569,7 +567,7 @@ class VisualPromptCounting(Tool):

 # TODO: Add support for input multiple images, which aligns with the output type.
 def __call__(self, image: Union[str, ImageType], prompt: str) -> Dict:
- """Invoke the Image captioning model.
+ """Invoke the few shot counting model.

 Parameters:
 image: the input image.

@@ -590,6 +588,144 @@
 return _send_inference_request(data, "tools")


+ class VisualQuestionAnswering(Tool):
+ r"""VisualQuestionAnswering is a tool that can explain contents of an image and answer questions about the same
+
+ Example
+ -------
+ >>> import vision_agent as va
+ >>> vqa_tool = va.tools.VisualQuestionAnswering()
+ >>> vqa_tool(image="image1.jpg", prompt="describe this image in detail")
+ {'text': "The image contains a cat sitting on a table with a bowl of milk."}
+ """
+
+ name = "visual_question_answering_"
+ description = "'visual_question_answering_' is a tool that can describe the contents of the image and it can also answer basic questions about the image."
+
+ usage = {
+ "required_parameters": [
+ {"name": "image", "type": "str"},
+ {"name": "prompt", "type": "str"},
+ ],
+ "examples": [
+ {
+ "scenario": "Describe this image in detail. Image name: cat.jpg",
+ "parameters": {
+ "image": "cats.jpg",
+ "prompt": "Describe this image in detail",
+ },
+ },
+ {
+ "scenario": "Can you help me with this street sign in this image ? What does it say ? Image name: sign.jpg",
+ "parameters": {
+ "image": "sign.jpg",
+ "prompt": "Can you help me with this street sign ? What does it say ?",
+ },
+ },
+ {
+ "scenario": "Describe the weather in the image for me ? Image name: weather.jpg",
+ "parameters": {
+ "image": "weather.jpg",
+ "prompt": "Describe the weather in the image for me ",
+ },
+ },
+ {
+ "scenario": "Which 2 are the least frequent bins in this histogram ? Image name: chart.jpg",
+ "parameters": {
+ "image": "chart.jpg",
+ "prompt": "Which 2 are the least frequent bins in this histogram",
+ },
+ },
+ ],
+ }
+
+ def __call__(self, image: str, prompt: str) -> Dict:
+ """Invoke the visual question answering model.
+
+ Parameters:
+ image: the input image.
+
+ Returns:
+ A dictionary containing the key 'text' and the answer to the prompt. E.g. {'text': 'This image contains a cat sitting on a table with a bowl of milk.'}
+ """
+
+ gpt = OpenAILMM()
+ return {"text": gpt(input=prompt, images=[image])}
+
+
+ class ImageQuestionAnswering(Tool):
+ r"""ImageQuestionAnswering is a tool that can explain contents of an image and answer questions about the same
+ It is same as VisualQuestionAnswering but this tool is not used by agents. It is used when user requests a tool for VQA using generate_image_qa_tool function.
+ It is also useful if the user wants the data to be not exposed to OpenAI endpoints
+
+ Example
+ -------
+ >>> import vision_agent as va
+ >>> vqa_tool = va.tools.ImageQuestionAnswering()
+ >>> vqa_tool(image="image1.jpg", prompt="describe this image in detail")
+ {'text': "The image contains a cat sitting on a table with a bowl of milk."}
+ """
+
+ name = "image_question_answering_"
+ description = "'image_question_answering_' is a tool that can describe the contents of the image and it can also answer basic questions about the image."
+
+ usage = {
+ "required_parameters": [
+ {"name": "image", "type": "str"},
+ {"name": "prompt", "type": "str"},
+ ],
+ "examples": [
+ {
+ "scenario": "Describe this image in detail. Image name: cat.jpg",
+ "parameters": {
+ "image": "cats.jpg",
+ "prompt": "Describe this image in detail",
+ },
+ },
+ {
+ "scenario": "Can you help me with this street sign in this image ? What does it say ? Image name: sign.jpg",
+ "parameters": {
+ "image": "sign.jpg",
+ "prompt": "Can you help me with this street sign ? What does it say ?",
+ },
+ },
+ {
+ "scenario": "Describe the weather in the image for me ? Image name: weather.jpg",
+ "parameters": {
+ "image": "weather.jpg",
+ "prompt": "Describe the weather in the image for me ",
+ },
+ },
+ {
+ "scenario": "Can you generate an image question answering tool ? Image name: chart.jpg, prompt: Which 2 are the least frequent bins in this histogram",
+ "parameters": {
+ "image": "chart.jpg",
+ "prompt": "Which 2 are the least frequent bins in this histogram",
+ },
+ },
+ ],
+ }
+
+ def __call__(self, image: Union[str, ImageType], prompt: str) -> Dict:
+ """Invoke the visual question answering model.
+
+ Parameters:
+ image: the input image.
+
+ Returns:
+ A dictionary containing the key 'text' and the answer to the prompt. E.g. {'text': 'This image contains a cat sitting on a table with a bowl of milk.'}
+ """
+
+ image_b64 = convert_to_b64(image)
+ data = {
+ "image": image_b64,
+ "prompt": prompt,
+ "tool": "image_question_answering",
+ }
+
+ return _send_inference_request(data, "tools")
+
+
 class Crop(Tool):
 r"""Crop crops an image given a bounding box and returns a file name of the cropped image."""

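The two new classes overlap on purpose: VisualQuestionAnswering answers via OpenAILMM (an OpenAI endpoint), while ImageQuestionAnswering sends the request to the hosted tools endpoint, which keeps the image data away from OpenAI. A usage sketch based on the docstrings above (the file name and output are illustrative):

```python
import vision_agent as va

# Agent-facing tool: answers by calling OpenAILMM under the hood.
vqa = va.tools.VisualQuestionAnswering()
print(vqa(image="image1.jpg", prompt="describe this image in detail"))
# {'text': 'The image contains a cat sitting on a table with a bowl of milk.'}

# Same interface, but routed through the hosted "tools" inference endpoint instead of OpenAI.
iqa = va.tools.ImageQuestionAnswering()
print(iqa(image="image1.jpg", prompt="describe this image in detail"))
```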

@@ -865,6 +1001,57 @@ class ExtractFrames(Tool):
 return result


+ class OCR(Tool):
+ name = "ocr_"
+ description = "'ocr_' extracts text from an image."
+ usage = {
+ "required_parameters": [
+ {"name": "image", "type": "str"},
+ ],
+ "examples": [
+ {
+ "scenario": "Can you extract the text from this image? Image name: image.png",
+ "parameters": {"image": "image.png"},
+ },
+ ],
+ }
+ _API_KEY = "land_sk_WVYwP00xA3iXely2vuar6YUDZ3MJT9yLX6oW5noUkwICzYLiDV"
+ _URL = "https://app.landing.ai/ocr/v1/detect-text"
+
+ def __call__(self, image: str) -> dict:
+ pil_image = Image.open(image).convert("RGB")
+ image_size = pil_image.size[::-1]
+ image_buffer = io.BytesIO()
+ pil_image.save(image_buffer, format="PNG")
+ buffer_bytes = image_buffer.getvalue()
+ image_buffer.close()
+
+ res = requests.post(
+ self._URL,
+ files={"images": buffer_bytes},
+ data={"language": "en"},
+ headers={"contentType": "multipart/form-data", "apikey": self._API_KEY},
+ )
+ if res.status_code != 200:
+ _LOGGER.error(f"Request failed: {res.text}")
+ raise ValueError(f"Request failed: {res.text}")
+
+ data = res.json()
+ output: Dict[str, List] = {"labels": [], "bboxes": [], "scores": []}
+ for det in data[0]:
+ output["labels"].append(det["text"])
+ box = [
+ det["location"][0]["x"],
+ det["location"][0]["y"],
+ det["location"][2]["x"],
+ det["location"][2]["y"],
+ ]
+ box = normalize_bbox(box, image_size)
+ output["bboxes"].append(box)
+ output["scores"].append(round(det["score"], 2))
+ return output
+
+
 class Calculator(Tool):
 r"""Calculator is a tool that can perform basic arithmetic operations."""

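For reference, a hedged sketch of calling the new OCR tool and the shape of its result, inferred from the implementation above ("receipt.png" and the detected values are placeholders):

```python
from vision_agent.tools import OCR

result = OCR()(image="receipt.png")
# Per the __call__ implementation above, the output groups detections into three
# parallel lists, with bounding boxes normalized to the image size:
# {
#     "labels": ["TOTAL", "$12.50"],
#     "bboxes": [[0.12, 0.80, 0.25, 0.84], [0.70, 0.80, 0.85, 0.84]],  # x1, y1, x2, y2
#     "scores": [0.99, 0.97],
# }
```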

@@ -896,11 +1083,11 @@ TOOLS = {
 [
 NoOp,
 CLIP,
- ImageCaption,
 GroundingDINO,
 AgentGroundingSAM,
 ZeroShotCounting,
 VisualPromptCounting,
+ VisualQuestionAnswering,
 AgentDINOv,
 ExtractFrames,
 Crop,

@@ -910,6 +1097,7 @@ TOOLS = {
 SegIoU,
 BboxContains,
 BoxDistance,
+ OCR,
 Calculator,
 ]
 )

@@ -917,6 +1105,31 @@
 }


+ def register_tool(tool: Type[Tool]) -> Type[Tool]:
+ r"""Add a tool to the list of available tools.
+
+ Parameters:
+ tool: The tool to add.
+ """
+
+ if (
+ not hasattr(tool, "name")
+ or not hasattr(tool, "description")
+ or not hasattr(tool, "usage")
+ ):
+ raise ValueError(
+ "The tool must have 'name', 'description' and 'usage' attributes."
+ )
+
+ TOOLS[len(TOOLS)] = {
+ "name": tool.name,
+ "description": tool.description,
+ "usage": tool.usage,
+ "class": tool,
+ }
+ return tool
+
+
 def _send_inference_request(
 payload: Dict[str, Any], endpoint_name: str
 ) -> Dict[str, Any]:
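
The register_tool decorator above is what backs the new "Custom Tools" section in the README. A hedged sketch of its effect on the TOOLS registry, reusing the NumItems example from the README diff:

```python
from vision_agent.tools import TOOLS, Tool, register_tool

@register_tool
class NumItems(Tool):
    name = "num_items_"
    description = "Returns the number of items in a list."
    usage = {
        "required_parameters": [{"name": "prompt", "type": "list"}],
        "examples": [
            {
                "scenario": "How many items are in this list? ['a', 'b', 'c']",
                "parameters": {"prompt": "['a', 'b', 'c']"},
            }
        ],
    }

    def __call__(self, prompt: list) -> int:
        return len(prompt)

# register_tool appended the class under the next integer key with its metadata:
print(TOOLS[len(TOOLS) - 1])
# {'name': 'num_items_', 'description': 'Returns the number of items in a list.',
#  'usage': {...}, 'class': <class 'NumItems'>}
```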