
vision-agent 1.0.4 (tar.gz) → 1.0.7 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (55)

{vision_agent-1.0.4 → vision_agent-1.0.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: vision-agent
-Version: 1.0.4
+Version: 1.0.7
 Summary: Toolset for Vision Agent
 Author: Landing AI
 Author-email: dev@landing.ai
@@ -65,10 +65,10 @@ pip install vision-agent
 ```bash
 export ANTHROPIC_API_KEY="your-api-key"
-export OPENAI_API_KEY="your-api-key"
+export GEMINI_API_KEY="your-api-key"
 ```
-> **_NOTE:_** We found using both Anthropic Claude-3.5 and OpenAI o1 to be provide the best performance for VisionAgent. If you want to use a different LLM provider or only one, see 'Using Other LLM Providers' below.
+> **_NOTE:_** We found using both Anthropic Claude-3.7 and Gemini-2.0-Flash-Exp to be provide the best performance for VisionAgent. If you want to use a different LLM provider or only one, see 'Using Other LLM Providers' below.
 You will also need to set your VisionAgent API key to be able to authenticate when using the hosted vision tools that we provide through our APIs. Currently, the APIs are free to use so you will only need to get it from [here](https://va.landing.ai/account/api-key).
@@ -147,5 +147,33 @@ directory. For example to change to Anthropic simply just run:
 cp vision_agent/configs/anthropic_config.py vision_agent/configs/config.py
 ```
+You can also modify the existing `config.py` file yourself to use a different LLM
+provider, for example if you wanted to change the planner from Anthropic inside
+`config.py` to OpenAI you would replace this code:
+```python
+    planner: Type[LMM] = Field(default=AnthropicLMM)
+    planner_kwargs: dict = Field(
+        default_factory=lambda: {
+            "model_name": "claude-3-7-sonnet-20250219",
+            "temperature": 0.0,
+            "image_size": 768,
+        }
+    )
+```
+with this code:
+```python
+    planner: Type[LMM] = Field(default=OpenAILMM)
+    planner_kwargs: dict = Field(
+        default_factory=lambda: {
+            "model_name": "gpt-4o-2024-11-20",
+            "temperature": 0.0,
+            "image_size": 768,
+            "image_detail": "low",
+        }
+    )
+```
 > **_NOTE:_** VisionAgent moves fast and we are constantly updating and changing the library. If you have any questions or need help, please reach out to us on our discord channel.

{vision_agent-1.0.4 → vision_agent-1.0.7}/README.md RENAMED

@@ -23,10 +23,10 @@ pip install vision-agent
 ```bash
 export ANTHROPIC_API_KEY="your-api-key"
-export OPENAI_API_KEY="your-api-key"
+export GEMINI_API_KEY="your-api-key"
 ```
-> **_NOTE:_** We found using both Anthropic Claude-3.5 and OpenAI o1 to be provide the best performance for VisionAgent. If you want to use a different LLM provider or only one, see 'Using Other LLM Providers' below.
+> **_NOTE:_** We found using both Anthropic Claude-3.7 and Gemini-2.0-Flash-Exp to be provide the best performance for VisionAgent. If you want to use a different LLM provider or only one, see 'Using Other LLM Providers' below.
 You will also need to set your VisionAgent API key to be able to authenticate when using the hosted vision tools that we provide through our APIs. Currently, the APIs are free to use so you will only need to get it from [here](https://va.landing.ai/account/api-key).
@@ -105,4 +105,32 @@ directory. For example to change to Anthropic simply just run:
 cp vision_agent/configs/anthropic_config.py vision_agent/configs/config.py
 ```
+You can also modify the existing `config.py` file yourself to use a different LLM
+provider, for example if you wanted to change the planner from Anthropic inside
+`config.py` to OpenAI you would replace this code:
+```python
+    planner: Type[LMM] = Field(default=AnthropicLMM)
+    planner_kwargs: dict = Field(
+        default_factory=lambda: {
+            "model_name": "claude-3-7-sonnet-20250219",
+            "temperature": 0.0,
+            "image_size": 768,
+        }
+    )
+```
+with this code:
+```python
+    planner: Type[LMM] = Field(default=OpenAILMM)
+    planner_kwargs: dict = Field(
+        default_factory=lambda: {
+            "model_name": "gpt-4o-2024-11-20",
+            "temperature": 0.0,
+            "image_size": 768,
+            "image_detail": "low",
+        }
+    )
+```
 > **_NOTE:_** VisionAgent moves fast and we are constantly updating and changing the library. If you have any questions or need help, please reach out to us on our discord channel.

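The README hunk above swaps the planner class and its keyword arguments together. The sketch below illustrates why the two move as a pair. It is a stand-in only: it uses `dataclasses` instead of the pydantic `Field` shown in the diff, the `LMM`, `AnthropicLMM` and `OpenAILMM` classes here are empty stubs rather than the real `vision_agent` classes, and `create_planner` is a hypothetical helper that assumes the config is consumed as `planner(**planner_kwargs)`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Type


class LMM:
    """Stub base class; the real LMM classes live in vision_agent."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


class AnthropicLMM(LMM):
    pass


class OpenAILMM(LMM):
    pass


@dataclass
class Config:
    # Hypothetical stand-in for the pydantic model in config.py.
    planner: Type[LMM] = OpenAILMM
    # default_factory builds a fresh dict per Config instance, so one
    # instance's kwargs can be edited without affecting another's.
    planner_kwargs: Dict[str, Any] = field(
        default_factory=lambda: {
            "model_name": "gpt-4o-2024-11-20",
            "temperature": 0.0,
            "image_size": 768,
            "image_detail": "low",
        }
    )

    def create_planner(self) -> LMM:
        # Assumed consumption pattern: instantiate the class with its kwargs.
        return self.planner(**self.planner_kwargs)


planner = Config().create_planner()
print(type(planner).__name__, planner.kwargs["model_name"])
```

Because the keyword arguments are passed straight to the class, a provider-specific key such as `image_detail` presumably has to be added or dropped whenever the class changes, which matches the two-part replacement shown in the README.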
{vision_agent-1.0.4 → vision_agent-1.0.7}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"
 [tool.poetry]
 name = "vision-agent"
-version = "1.0.4"
+version = "1.0.7"
 description = "Toolset for Vision Agent"
 authors = ["Landing AI <dev@landing.ai>"]
 readme = "README.md"

{vision_agent-1.0.4 → vision_agent-1.0.7}/vision_agent/.sim_tools/df.csv RENAMED

@@ -1,15 +1,15 @@
 desc,doc,name
-"'owlv2_object_detection' is a tool that can detect and count multiple objects given a text prompt such as category names or referring expressions on images. The categories in text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names and associated probability scores.","owlv2_object_detection(prompt: str, image: numpy.ndarray, box_threshold: float = 0.1) -> List[Dict[str, Any]]:
-'owlv2_object_detection' is a tool that can detect and count multiple objects
-    given a text prompt such as category names or referring expressions on images. The
-    categories in text prompt are separated by commas. It returns a list of bounding
-    boxes with normalized coordinates, label names and associated probability scores.
+"'glee_object_detection' is a tool that can detect multiple objects given a text prompt such as object names or referring expressions on images. It's particularly good at detecting specific objects given detailed descriptive prompts. It returns a list of bounding boxes with normalized coordinates, label names and associated probability scores.","glee_object_detection(prompt: str, image: numpy.ndarray, box_threshold: float = 0.23) -> List[Dict[str, Any]]:
+'glee_object_detection' is a tool that can detect multiple objects given a
+    text prompt such as object names or referring expressions on images. It's
+    particularly good at detecting specific objects given detailed descriptive prompts.
+    It returns a list of bounding boxes with normalized coordinates, label names and
+    associated probability scores.
     Parameters:
-        prompt (str): The prompt to ground to the image.
+        prompt (str): The prompt to ground to the image, only supports a single prompt
+            with no commas or periods.
         image (np.ndarray): The image to ground the prompt to.
-        box_threshold (float, optional): The threshold for the box detection. Defaults
-            to 0.10.
     Returns:
         List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
@@ -20,24 +20,23 @@ desc,doc,name
     Example
     -------
-        >>> owlv2_object_detection(""car, dinosaur"", image)
+        >>> glee_object_detection(""person holding a box"", image)
         [
-            {'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]},
-            {'score': 0.98, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5},
+            {'score': 0.99, 'label': 'person holding a box', 'bbox': [0.1, 0.11, 0.35, 0.4]},
+            {'score': 0.98, 'label': 'person holding a box', 'bbox': [0.2, 0.21, 0.45, 0.5},
         ]
-    ",owlv2_object_detection
-"'owlv2_sam2_instance_segmentation' is a tool that can detect and count multiple instances of objects given a text prompt such as category names or referring expressions on images. The categories in text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names, masks and associated probability scores.","owlv2_sam2_instance_segmentation(prompt: str, image: numpy.ndarray, box_threshold: float = 0.1) -> List[Dict[str, Any]]:
-'owlv2_sam2_instance_segmentation' is a tool that can detect and count multiple
-    instances of objects given a text prompt such as category names or referring
-    expressions on images. The categories in text prompt are separated by commas. It
-    returns a list of bounding boxes with normalized coordinates, label names, masks
-    and associated probability scores.
+    ",glee_object_detection
+"'glee_sam2_instance_segmentation' is a tool that can detect multiple instances given a text prompt such as object names or referring expressions on images. It's particularly good at detecting specific objects given detailed descriptive prompts. It returns a list of bounding boxes with normalized coordinates, label names, masks and associated probability scores.","glee_sam2_instance_segmentation(prompt: str, image: numpy.ndarray, box_threshold: float = 0.23) -> List[Dict[str, Any]]:
+'glee_sam2_instance_segmentation' is a tool that can detect multiple
+    instances given a text prompt such as object names or referring expressions on
+    images. It's particularly good at detecting specific objects given detailed
+    descriptive prompts. It returns a list of bounding boxes with normalized
+    coordinates, label names, masks and associated probability scores.
     Parameters:
-        prompt (str): The object that needs to be counted.
+        prompt (str): The object that needs to be counted, only supports a single
+            prompt with no commas or periods.
         image (np.ndarray): The image that contains multiple instances of the object.
-        box_threshold (float, optional): The threshold for detection. Defaults
-            to 0.10.
     Returns:
         List[Dict[str, Any]]: A list of dictionaries containing the score, label,
@@ -49,11 +48,11 @@ desc,doc,name
     Example
     -------
-        >>> owlv2_sam2_instance_segmentation(""flower"", image)
+        >>> glee_sam2_instance_segmentation(""a large blue flower"", image)
         [
             {
                 'score': 0.49,
-                'label': 'flower',
+                'label': 'a large blue flower',
                 'bbox': [0.1, 0.11, 0.35, 0.4],
                 'mask': array([[0, 0, 0, ..., 0, 0, 0],
                     [0, 0, 0, ..., 0, 0, 0],
@@ -62,21 +61,21 @@ desc,doc,name
                     [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
             },
         ]
-    ",owlv2_sam2_instance_segmentation
-"'owlv2_sam2_video_tracking' is a tool that can track and segment multiple objects in a video given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes, label names, masks and associated probability scores and is useful for tracking and counting without duplicating counts.","owlv2_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], box_threshold: float = 0.1, chunk_length: Optional[int] = 25) -> List[List[Dict[str, Any]]]:
-'owlv2_sam2_video_tracking' is a tool that can track and segment multiple
-    objects in a video given a text prompt such as category names or referring
-    expressions. The categories in the text prompt are separated by commas. It returns
-    a list of bounding boxes, label names, masks and associated probability scores and
-    is useful for tracking and counting without duplicating counts.
+    ",glee_sam2_instance_segmentation
+"'glee_sam2_video_tracking' is a tool that can track and segment multiple objects in a video given a text prompt such as object names or referring expressions. It's particularly good at detecting specific objects given detailed descriptive prompts and returns a list of bounding boxes, label names, masks and associated probability scores and is useful for tracking and counting without duplicating counts.","glee_sam2_video_tracking(prompt: str, frames: List[numpy.ndarray], box_threshold: float = 0.23, chunk_length: Optional[int] = 25) -> List[List[Dict[str, Any]]]:
+'glee_sam2_video_tracking' is a tool that can track and segment multiple
+    objects in a video given a text prompt such as object names or referring
+    expressions. It's particularly good at detecting specific objects given detailed
+    descriptive prompts and returns a list of bounding boxes, label names, masks and
+    associated probability scores and is useful for tracking and counting without
+    duplicating counts.
     Parameters:
-        prompt (str): The prompt to ground to the image.
+        prompt (str): The prompt to ground to the image, only supports a single prompt
+            with  no commas or periods.
         frames (List[np.ndarray]): The list of frames to ground the prompt to.
-        box_threshold (float, optional): The threshold for the box detection. Defaults
-            to 0.10.
-        chunk_length (Optional[int]): The number of frames to re-run owlv2 to find
-            new objects.
+        chunk_length (Optional[int]): The number of frames to re-run agentic object detection to
+            to find new objects.
     Returns:
         List[List[Dict[str, Any]]]: A list of list of dictionaries containing the
@@ -90,11 +89,11 @@ desc,doc,name
     Example
     -------
-        >>> owlv2_sam2_video_tracking(""car, dinosaur"", frames)
+        >>> glee_sam2_video_tracking(""a runner with yellow shoes"", frames)
         [
             [
                 {
-                    'label': '0: dinosaur',
+                    'label': '0: a runner with yellow shoes',
                     'bbox': [0.1, 0.11, 0.35, 0.4],
                     'mask': array([[0, 0, 0, ..., 0, 0, 0],
                         [0, 0, 0, ..., 0, 0, 0],
@@ -105,7 +104,7 @@ desc,doc,name
             ],
             ...
         ]
-    ",owlv2_sam2_video_tracking
+    ",glee_sam2_video_tracking
 "'countgd_object_detection' is a tool that can detect multiple instances of an object given a text prompt. It is particularly useful when trying to detect and count a large number of objects. You can optionally separate object names in the prompt with commas. It returns a list of bounding boxes with normalized coordinates, label names and associated confidence scores.","countgd_object_detection(prompt: str, image: numpy.ndarray, box_threshold: float = 0.23) -> List[Dict[str, Any]]:
 'countgd_object_detection' is a tool that can detect multiple instances of an
     object given a text prompt. It is particularly useful when trying to detect and
@@ -417,8 +416,8 @@ desc,doc,name
             {'label': 'hello world', 'bbox': [0.1, 0.11, 0.35, 0.4], 'score': 0.99},
         ]
     ",ocr
-'qwen2_vl_images_vqa' is a tool that can answer any questions about arbitrary images including regular images or images of documents or presentations. It can be very useful for document QA or OCR text extraction. It returns text as an answer to the question.,"qwen2_vl_images_vqa(prompt: str, images: List[numpy.ndarray]) -> str:
-'qwen2_vl_images_vqa' is a tool that can answer any questions about arbitrary
+'qwen25_vl_images_vqa' is a tool that can answer any questions about arbitrary images including regular images or images of documents or presentations. It can be very useful for document QA or OCR text extraction. It returns text as an answer to the question.,"qwen25_vl_images_vqa(prompt: str, images: List[numpy.ndarray]) -> str:
+'qwen25_vl_images_vqa' is a tool that can answer any questions about arbitrary
     images including regular images or images of documents or presentations. It can be
     very useful for document QA or OCR text extraction. It returns text as an answer to
     the question.
@@ -432,11 +431,11 @@ desc,doc,name
     Example
     -------
-        >>> qwen2_vl_images_vqa('Give a summary of the document', images)
+        >>> qwen25_vl_images_vqa('Give a summary of the document', images)
         'The document talks about the history of the United States of America and its...'
-    ",qwen2_vl_images_vqa
-'qwen2_vl_video_vqa' is a tool that can answer any questions about arbitrary videos including regular videos or videos of documents or presentations. It returns text as an answer to the question.,"qwen2_vl_video_vqa(prompt: str, frames: List[numpy.ndarray]) -> str:
-'qwen2_vl_video_vqa' is a tool that can answer any questions about arbitrary videos
+    ",qwen25_vl_images_vqa
+'qwen25_vl_video_vqa' is a tool that can answer any questions about arbitrary videos including regular videos or videos of documents or presentations. It returns text as an answer to the question.,"qwen25_vl_video_vqa(prompt: str, frames: List[numpy.ndarray]) -> str:
+'qwen25_vl_video_vqa' is a tool that can answer any questions about arbitrary videos
     including regular videos or videos of documents or presentations. It returns text
     as an answer to the question.
@@ -449,10 +448,10 @@ desc,doc,name
     Example
     -------
-        >>> qwen2_vl_video_vqa('Which football player made the goal?', frames)
+        >>> qwen25_vl_video_vqa('Which football player made the goal?', frames)
         'Lionel Messi'
-    ",qwen2_vl_video_vqa
-'activity_recognition' is a tool that can recognize activities in a video given a text prompt. It can be used to identify where specific activities or actions happen in a video and returns a list of 0s and 1s to indicate the activity.,"activity_recognition(prompt: str, frames: List[numpy.ndarray], model: str = 'qwen2vl', chunk_length_frames: int = 10) -> List[float]:
+    ",qwen25_vl_video_vqa
+'activity_recognition' is a tool that can recognize activities in a video given a text prompt. It can be used to identify where specific activities or actions happen in a video and returns a list of 0s and 1s to indicate the activity.,"activity_recognition(prompt: str, frames: List[numpy.ndarray], model: str = 'qwen25vl', chunk_length_frames: int = 10) -> List[float]:
 'activity_recognition' is a tool that can recognize activities in a video given a
     text prompt. It can be used to identify where specific activities or actions
     happen in a video and returns a list of 0s and 1s to indicate the activity.
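The `glee_*` entries in this index document the prompt as "a single prompt with no commas or periods", whereas the `owlv2_*` entries they replace accepted comma-separated categories such as `"car, dinosaur"`. A caller moving from one to the other might pre-check or split prompts along these lines. Both helpers are hypothetical and not part of `vision_agent`. Splitting on periods is a blunt rule that would also break a prompt containing a decimal number.

```python
import re
from typing import List


def is_single_prompt(prompt: str) -> bool:
    """Return True for a non-empty prompt with no commas or periods."""
    text = prompt.strip()
    return bool(text) and not re.search(r"[,.]", text)


def split_prompt(prompt: str) -> List[str]:
    """Split a comma- or period-separated prompt into single prompts."""
    parts = (part.strip() for part in re.split(r"[,.]", prompt))
    return [part for part in parts if part]


print(is_single_prompt("person holding a box"))
print(split_prompt("car, dinosaur"))
```

Each element returned by `split_prompt` could then be passed to the detector in a separate call. Whether the hosted tool rejects or silently mishandles a prompt with commas is not shown in this diff.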

{vision_agent-1.0.4 → vision_agent-1.0.7}/vision_agent/.sim_tools/embs.npy RENAMED

Binary file

vision_agent-1.0.7/vision_agent/agent/__init__.py ADDED

@@ -0,0 +1,4 @@
+from .agent import Agent, AgentCoder, AgentPlanner
+from .vision_agent_coder_v2 import VisionAgentCoderV2
+from .vision_agent_planner_v2 import VisionAgentPlannerV2
+from .vision_agent_v2 import VisionAgentV2

{vision_agent-1.0.4 → vision_agent-1.0.7}/vision_agent/agent/vision_agent_planner_prompts_v2.py RENAMED

@@ -9,21 +9,21 @@ PLAN = """
 **Example Planning**: Here are some examples of how you can search for a plan, in the examples the user output is denoted by USER, your output is denoted by AGENT and the observations after your code execution are denoted by OBSERVATION:
 {examples}
-**Current Planning**: This is the plan you are currently working on
+**Current Planning**: Below are your previous thoughts, code and observations from the planning process:
 --- START CURRENT PLANNING ---
 {planning}
 --- END CURRENT PLANNING ---
 **Instructions**:
-1. Read over the user request and context provided and output <thinking> tags to indicate your thought process. You can <count> number of turns to complete the user's request.
+1. Read over the user request and context provided and output <thinking> tags to indicate your thought process. You have <count> number of turns to complete the user's request.
 2. You can execute python code in the ipython notebook using <execute_python> tags. Only output one <execute_python> tag at a time.
-3. Only output <finalize_plan> when you are done planning and want to end the planning process. DO NOT output <finalize_plan> with <execute_python> tags, only after OBSERVATION's.
+3. Output <finalize_plan> when you have written your final code, you do not need to wait until all turns are used to do this. DO NOT output <finalize_plan> with <execute_python> tags, only after OBSERVATION.
 4. Only load/save files from {media_list} unless you specifically saved the file previously.
-5. Ensure you always call `suggestion` and `claude35_vqa` initially and `get_tool_for_task` to get the right tool for the subtask.
+5. Ensure you always call `suggestion` and `vqa` initially and `get_tool_for_task` to get the right tool for the subtask.
 6. Calling `plt.imshow` or `save_image` will display the image to you so you can check your results. If you see an image after <execute_python> it's generated from your code.
 7. Be sure to print results returned for tools so you can see the output.
 8. DO NOT hard code the answer into your code, it should be dynamic and work for any similar request.
-9. DO NOT over index on claude35_vqa, if tool output is close to claude35_vqa's output you do not need to improve the tool output, tools are often better at things like counting and detecting small objects.
+9. DO NOT over index on the vqa model. If tool output is close to vqa's output do not improve the tool output. If you have >10 detected objects, or the objects are very small, trust the tool output over the vqa model.
 10. You can only respond in the following format with a single <thinking>, <execute_python> or <finalize_plan> tag:
 <thinking>Your thought process...</thinking>
@@ -39,13 +39,13 @@ USER: Count the number of pedestrians in the image.
 AGENT: <thinking>I need to gather more information, I will ask for a description of the image and a solution to the problem to work towards as well as a suggestion on how best to solve it.</thinking>
 <execute_python>
 image = load_image('drone.jpg')
-claude35_vqa('Can you describe this image? How many pedestrians do you count in the image?', [image])
+vqa('Can you describe this image? How many pedestrians do you count in the image?', [image])
 suggestion('How can I count the number of pedestrians in the image?', [image])
 </execute_python>
-OBSERVATION: [claude35_vqa output]
+OBSERVATION: [vqa output]
 From this aerial view of a busy urban street, it's difficult to clearly see or count individual pedestrians. The image shows a bird's eye view of a city intersection with multiple lanes of traffic, parked cars, sidewalks, and some green spaces. While there may be people in cars or on the sidewalks, they are not distinctly visible from this height and perspective. The focus of the image is more on the urban infrastructure, traffic patterns, and overall city layout rather than on individuals.
-[end of claude35_vqa output]
+[end of vqa output]
 [suggestions]
 [suggestion 0]
@@ -161,7 +161,7 @@ detections = merge_bounding_box_list(detection_from_subdivided_images)
 [end of suggestion]
 <count>9</count>
-AGENT: <thinking>Following the suggested plan, I will subdivide the image into smaller parts and find the right tool on the smaller parts. I will also re-check with claude35_vqa on the subdivided images.</thinking>
+AGENT: <thinking>Following the suggested plan, I will subdivide the image into smaller parts and find the right tool on the smaller parts. I will also re-check with vqa on the subdivided images.</thinking>
 <execute_python>
 def subdivide_image(image):
     height, width, _ = image.shape
@@ -176,14 +176,14 @@ def subdivide_image(image):
     return [top_left, top_right, bottom_left, bottom_right]
 images = subdivide_image(image)
-claude35_vqa('How many pedestrians do you see in these images?', images)
+vqa('How many pedestrians do you see in these images?', images)
 get_tool_for_task('Count the number of pedestrians in the images', images)
 </execute_python>
 OBSERVATION:
-[claude35_vqa_output]
+[vqa_output]
 In these aerial images, I can see approximately 5-6 pedestrians walking in various locations, including near the retail signage are and along the sidewalks.
-[end of claude35_vqa_output]
+[end of vqa_output]
 [get_tool_for_task output]
 After examining the image, I can see it's an aerial view of a busy urban intersection with multiple lanes of traffic. There are numerous cars visible, each likely containing at least one person (the driver). While it's difficult to count exact individuals, I can estimate based on the number of vehicles. The countgd_object_detection tool is the best choice for this task as it can detect and count multiple instances of an object given a text prompt. I will use this tool with the prompt 'pedistrian' to count the number of pedestrians in the image.
@@ -269,7 +269,7 @@ print(len(detections))
 OBSERVATION: 24
 <count>7</count>
-AGENT: <thinking>countgd_object_detection detected 24 pedestrians in the image by subdividing the image into smaller regions, this differs from claude35_vqa's answer of 5-6 pedestrians. I will use countgd_object_detection's answer since it is specifically designed to count objects.</thinking>
+AGENT: <thinking>countgd_object_detection detected 24 pedestrians in the image by subdividing the image into smaller regions, this differs from vqa model's answer of 5-6 pedestrians. I will use countgd_object_detection's answer since it is specifically designed to count objects.</thinking>
 <finalize_plan>This plan successfully detected 24 pedestrians in the image by subdividing the image into smaller regions and using the countgd_object_detection tool on each region. The bounding boxes were then merged to count the unique pedestrians.</finalize_plan>
 --- END EXAMPLE1 ---
 """
@@ -283,13 +283,13 @@ AGENT: <thinking>I need to gather more information, I will ask for a description
 <execute_python>
 frames = extract_frames_and_timestamps('video.mp4', 1)
 frames  = [f["frame"] for f in frames]
-claude35_vqa('Can you describe this video? How many boxes are there and do they have tape on them?', frames[:5])
+vqa('Can you describe this video? How many boxes are there and do they have tape on them?', frames[:5])
 suggestion('How can I identify boxes with tape on them in the video?', frames[:5])
 </execute_python>
-OBSERVATION: [claude35_vqa output]
+OBSERVATION: [vqa output]
 In this video, there appear to be 2 cardboard boxes moving along a roller conveyor system. The boxes are brown/kraft colored and appear to have tape sealing them. You can see clear packing tape on the top and sides of the boxes.
-[end of claude35_vqa output]
+[end of vqa output]
 [suggestions]
 [suggestion 0]
@@ -400,20 +400,18 @@ get_tool_for_task('Identify if there is tape on the boxes', crops[:3])
 OBSERVATION:
 [get_tool_for_task output]
-owlv2_object_detection performed best with the prompt 'tape', as it specifically detected multiple instances of tape with localized bounding boxes, which matches what's visible in the images.
+glee_object_detection performed best with the prompt 'tape', as it specifically detected multiple instances of tape with localized bounding boxes, which matches what's visible in the images.
-'owlv2_object_detection' is a tool that can detect and count multiple objects given a
-text prompt such as category names or referring expressions on images. The categories
-in text prompt are separated by commas. It returns a list of bounding boxes with
-normalized coordinates, label names and associated probability scores.
+'glee_object_detection' is a tool that can detect multiple objects given a
+text prompt such as object names or referring expressions on images. It's
+particularly good at detecting specific objects given detailed descriptive prompts.
+It returns a list of bounding boxes with normalized coordinates, label names and
+associated probability scores.
 Parameters:
-    prompt (str): The prompt to ground to the image.
+    prompt (str): The prompt to ground to the image, only supports a single prompt
+        with no commas or periods.
     image (np.ndarray): The image to ground the prompt to.
-    box_threshold (float, optional): The threshold for the box detection. Defaults
-        to 0.10.
-    fine_tune_id (Optional[str]): If you have a fine-tuned model, you can pass the
-        fine-tuned model ID here to use it.
 Returns:
     List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
@@ -424,10 +422,10 @@ Returns:
 Example
 -------
-    >>> owlv2_object_detection("car, dinosaur", image)
+    >>> glee_object_detection("person holding a box", image)
     [
-        {'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]},
-        {'score': 0.98, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5},
+        {'score': 0.99, 'label': 'person holding a box', 'bbox': [0.1, 0.11, 0.35, 0.4]},
+        {'score': 0.98, 'label': 'person holding a box', 'bbox': [0.2, 0.21, 0.45, 0.5},
     ]
 [end of get_tool_for_task output]
 <count>7</count>
@@ -445,7 +443,7 @@ for frame, frame_predictions in zip(frames, track_predictions):
             int(obj["bbox"][0] * width) : int(obj["bbox"][2] * width),
             :,
         ]
-        detections = owlv2_object_detection("tape", crop)
+        detections = glee_object_detection("tape", crop)
         obj_to_info[obj["label"]].extend(detections)
@@ -512,7 +510,7 @@ PICK_PLAN = """
 CATEGORIZE_TOOL_REQUEST = """
 You are given a task: "{task}" from the user. You must extract the type of category this task belongs to, it can be one or more of the following:
 - "VQA" - answering questions about an image or video, can be used for most tasks, should generally be included.
-- "object detection and counting" - detecting objects or counting objects from a text prompt in an image.
+- "object detection" - detecting objects or counting objects from a text prompt in an image.
 - "instance segmentation" - segmenting objects in an image given a text prompt.
 - "classification" - classifying objects in an image given a text prompt.
 - "segmentation" - segmenting objects in an image or video given a text prompt.
@@ -569,27 +567,27 @@ Count the number of pedestrians across all the images.
 <code>
 from concurrent.futures import ThreadPoolExecutor, as_completed
-from vision_agent.tools import load_image, owlv2_object_detection, florence2_object_detection, countgd_object_detection
+from vision_agent.tools import load_image, agentic_object_detection, glee_object_detection, countgd_object_detection
 # process functions in a try catch so that if it fails it doesn't cause `as_completed` to hang
-def process_owlv2(image_paths):
+def process_agentic(image_paths):
     try:
         results = []
         for image_path in image_paths:
             image = load_image(image_path)
-            results.extend(owlv2_object_detection("person", image))
+            results.extend(agentic_object_detection("person", image))
     except Exception as e:
-        results = f"Encountered error when executing process_owlv2: {str(e)}"
+        results = f"Encountered error when executing process_agentic: {str(e)}"
     return results
-def process_florence2(image_paths):
+def process_glee_object_detection(image_paths):
     try:
         results = []
         for image_path in image_paths:
             image = load_image(image_path)
-            results.extend(florence2_object_detection("person", image))
+            results.extend(glee_object_detection("person", image))
     except Exception as e:
-        results = f"Encountered error when executing process_florence2: {str(e)}"
+        results = f"Encountered error when executing process_glee: {str(e)}"
     return results
 def process_countgd(image_paths):
@@ -606,8 +604,8 @@ image_paths = ["image1.jpg", "image2.jpg", "image3.jpg", "image4.jpg"]
 with ThreadPoolExecutor() as executor:
     futures = {{
-        executor.submit(process_owlv2, image_paths): "owlv2_object_detection",
-        executor.submit(process_florence2, image_paths): "florence2_phrase_grounding",
+        executor.submit(process_agentic, image_paths): "agentic_object_detection",
+        executor.submit(process_glee, image_paths): "glee_object_detection",
         executor.submit(process_countgd, image_paths): "countgd_object_detection",
     }}
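The hunk above wraps each detector in its own try/except, so a failing tool returns an error string instead of raising and `as_completed` still yields every future. A minimal sketch of that pattern follows. It assumes hypothetical stub detectors (`fake_detector_a`, `fake_detector_b`) and a hypothetical `make_worker` helper in place of the `vision_agent.tools` functions, which need hosted API access. The doubled braces in the diff are `str.format` escapes inside the prompt string; runnable code uses single braces.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_detector_a(prompt, image):
    # Hypothetical stand-in for a detection tool: returns one box per image.
    return [{"label": prompt, "score": 0.9, "bbox": [0.1, 0.1, 0.5, 0.5]}]


def fake_detector_b(prompt, image):
    # Hypothetical stand-in that always fails, to exercise the error path.
    raise RuntimeError("model unavailable")


def make_worker(name, detector):
    # Wrap a detector so exceptions come back as a message instead of propagating.
    def worker(image_paths):
        try:
            results = []
            for image_path in image_paths:
                results.extend(detector("person", image_path))
        except Exception as e:
            results = f"Encountered error when executing {name}: {e}"
        return results

    return worker


def run_all(image_paths):
    detectors = {"fake_detector_a": fake_detector_a, "fake_detector_b": fake_detector_b}
    with ThreadPoolExecutor() as executor:
        # Map each future back to the tool name it belongs to.
        futures = {
            executor.submit(make_worker(name, fn), image_paths): name
            for name, fn in detectors.items()
        }
        final_results = {}
        for future in as_completed(futures):
            final_results[futures[future]] = future.result()
    return final_results


results = run_all(["image1.jpg", "image2.jpg"])
```

Keying the `futures` dict by future and storing the tool name as the value is what lets the caller report results under the exact tool name, which is why the submitted function and the name string need to stay in sync.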
@@ -632,7 +630,7 @@ Count the number of people in the video.
 <code>
 import numpy as np
 from concurrent.futures import ThreadPoolExecutor, as_completed
-from vision_agent.tools import extract_frames_and_timestamps, owlv2_sam2_video_tracking, florence2_sam2_video_tracking
+from vision_agent.tools import extract_frames_and_timestamps, countgd_sam2_video_tracking, glee_sam2_video_tracking
 # sample at 1 FPS and use the first 10 frames to reduce processing time
 frames = extract_frames_and_timestamps("video.mp4", 1)
@@ -649,18 +647,18 @@ def remove_arrays(o):
     else:
         return o
-def process_owlv2_sam2_video_tracking(frames):
+def process_countgd_sam2_video_tracking(frames):
     try:
         # run with chunk_length=1 to ensure best results
-        results = owlv2_sam2_video_tracking("person", frames, chunk_length=1)
+        results = countgd_sam2_video_tracking("person", frames, chunk_length=1)
     except Exception as e:
-        results = f"Encountered error when executing process_owlv2_sam2_video_tracking: {str(e)}"
+        results = f"Encountered error when executing process_countgd_sam2_video_tracking: {str(e)}"
     return results
-def process_florence2_sam2_video_tracking(frames):
+def process_glee_sam2_video_tracking(frames):
     try:
         # run with chunk_length=1 to ensure best results
-        results = florence2_sam2_video_tracking("person", frames, chunk_length=1)
+        results = glee_sam2_video_tracking("person", frames, chunk_length=1)
     except Exception as e:
         results = f"Encountered error when executing process_glee_sam2_video_tracking: {str(e)}"
     return results
@@ -668,8 +666,8 @@ def process_florence2_sam2_video_tracking(frames):
 with ThreadPoolExecutor() as executor:
     futures = {{
-        executor.submit(process_owlv2_sam2_video_tracking, frames): "owlv2_sam2_video_tracking",
-        executor.submit(process_florence2_sam2_video_tracking, frames): "florence2_sam2_video_tracking",
+        executor.submit(process_countgd_sam2_video_tracking, frames): "countgd_sam2_video_tracking",
+        executor.submit(process_glee_sam2_video_tracking, frames): "glee_sam2_video_tracking",
     }}
     final_results = {{}}
     for future in as_completed(futures):
@@ -701,7 +699,7 @@ PICK_TOOL = """
 **Instructions**:
 1. Re-read the user request, plans, tool outputs and examine the image.
 2. Given the user request, try to solve the problem yourself.
-3. Pick which tool output best matches your solution and the user request, DO NOT focus on other factors.
+3. Pick which tool output best matches your solution and the user request, DO NOT focus on other factors like confidence score.
 4. DO NOT modify confidence thresholds unless the tool output is completely wrong.
 5. Remember for videos that in order to count objects in a video some sort of tracking is needed, or else you will overcount the objects.
 6. Return the following JSON format inside <json> tags using the exact tool name for best_tool:
@@ -717,7 +715,7 @@ PICK_TOOL = """
 FINALIZE_PLAN = """
 **Task**: You are given a chain of thoughts, python executions and observations from a planning agent as it tries to construct a plan to solve a user request. Your task is to summarize the plan it found so that another programming agent can write a program to accomplish the user request.
-**Documentation**: You can use these tools to help you visualize or save the output (they are imported `from vision_agent.tools import *`):
+**Documentation**: You can use these tools to help complete the code (they are imported `from vision_agent.tools import *`):
 {tool_desc}
 **Planning**: Here is chain of thoughts, executions and observations from the planning agent:
@@ -726,13 +724,14 @@ FINALIZE_PLAN = """
 **Instructions**:
 1. Summarize the plan that the planning agent found.
 2. Write a single function that solves the problem based on what the planner found and only returns the final solution.
-3. Only use tools obtained from calling `get_tool_for_task`.
-4. Do not include {excluded_tools} tools in your instructions.
-5. Ensure the function is well documented and easy to understand.
-6. Ensure you visualize the output with `overlay_bounding_boxes` or `overlay_segmentation_masks`, if bounding boxes or segmentation masks are produced, and save it to a file with `save_image` or `save_video`.
-7. Use the default FPS for extracting frames from videos unless otherwise specified by the user.
-8. Include the expected answer in your 'plan' so that the programming agent can properly test if it has the correct answer.
-9. Respond in the following format with JSON surrounded by <json> tags and code surrounded by <code> tags:
+3. Ensure the function takes in a single argument for the image or video path, all other arguments must be keyword arguments with default values.
+4. Only use tools obtained from calling `get_tool_for_task` and the ones provided in the documentation.
+5. Do not include {excluded_tools} tools in your instructions.
+6. Ensure the function is well documented and easy to understand, DO NOT escape quotes in the function documentation.
+7. Ensure you visualize the output with `overlay_bounding_boxes` or `overlay_segmentation_masks`, if bounding boxes or segmentation masks are produced, and save it to a file with `save_image` or `save_video`.
+8. Use the default FPS for extracting frames from videos unless otherwise specified by the user.
+9. Include the expected answer in your 'plan' so that the programming agent can properly test if it has the correct answer.
+10. Respond in the following format with JSON surrounded by <json> tags and code surrounded by <code> tags:
 <json>
 {{
@@ -744,7 +743,7 @@ FINALIZE_PLAN = """
 </json>
 <code>
-# Code snippets here
+# Your function code here
 </code>
 """
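Instruction 3, added in 1.0.7, asks the generated function to take a single argument for the image or video path, with every other argument a keyword argument that has a default. One possible reading of that rule is sketched below. The function `count_people`, its `detector` hook, and the checker `has_single_required_arg` are hypothetical names used only for illustration; they are not part of vision-agent.

```python
import inspect
from typing import Any, Callable, Dict, List, Optional


def count_people(
    image_path: str,
    *,
    prompt: str = "person",
    threshold: float = 0.5,
    detector: Optional[Callable[[str, str], List[Dict[str, Any]]]] = None,
) -> int:
    """Count detections of `prompt` in the image at `image_path`.

    `detector` is a hypothetical hook so this sketch runs without hosted tools;
    it receives (prompt, image_path) and returns dicts with a "score" key.
    """
    detections = detector(prompt, image_path) if detector else []
    return sum(1 for d in detections if d["score"] >= threshold)


def has_single_required_arg(fn: Callable[..., Any]) -> bool:
    # True when exactly one parameter has no default value.
    # (Does not special-case *args/**kwargs; enough for this sketch.)
    params = inspect.signature(fn).parameters.values()
    required = [p for p in params if p.default is inspect.Parameter.empty]
    return len(required) == 1
```

The bare `*` makes the optional arguments keyword-only, so a caller can always invoke the function as `count_people(path)` and a test harness needs to know nothing beyond the media path.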

{vision_agent-1.0.4 → vision_agent-1.0.7}/vision_agent/agent/vision_agent_planner_v2.py RENAMED Viewed

@@ -73,7 +73,7 @@ class DefaultPlanningImports:
         "import cv2",
         "from typing import *",
         "from vision_agent.tools import *",
-        "from vision_agent.tools.planner_tools import claude35_vqa, suggestion, get_tool_for_task",
+        "from vision_agent.tools.planner_tools import vqa, suggestion, get_tool_for_task",
         "from pillow_heif import register_heif_opener",
         "register_heif_opener()",
         "import matplotlib.pyplot as plt",
@@ -228,6 +228,7 @@ def execute_code_action(
     count = 1
     while not execution.success and count <= 3:
+        start = time.time()
         prompt = FIX_BUG.format(chat_history=get_planning(chat), code=code, error=obs)
         response = cast(str, model.chat([{"role": "user", "content": prompt}]))
         new_code = extract_tag(response, "code", extract_markdown="python")
@@ -243,7 +244,7 @@ def execute_code_action(
         if verbose:
             print_code(f"Fixing Bug Round {count}:", code)
             _CONSOLE.print(
-                f"[bold cyan]Code Execution Output:[/bold cyan] [yellow]{escape(obs)}[/yellow]"
+                f"[bold cyan]Code Execution Output ({end - start:.2f}s):[/bold cyan] [yellow]{escape(obs)}[/yellow]"
             )
         count += 1
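The two hunks above add a `start` timestamp at the top of each bug-fix round and print `end - start` with the execution output; the assignment of `end` falls outside the lines shown, so its exact position is not visible here. A minimal sketch of a timed retry loop of this shape follows, assuming hypothetical `run_code` and `fix_code` callables in place of the code interpreter and the LLM call.

```python
import time
from typing import Callable, List, Tuple


def retry_with_timing(
    code: str,
    run_code: Callable[[str], Tuple[bool, str]],
    fix_code: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, bool, List[float]]:
    # run_code returns (success, observation); fix_code returns repaired code.
    success, obs = run_code(code)
    durations: List[float] = []
    count = 1
    while not success and count <= max_rounds:
        start = time.time()
        code = fix_code(code, obs)
        success, obs = run_code(code)
        end = time.time()  # assumed placement: after the re-execution finishes
        durations.append(end - start)
        print(f"Fixing Bug Round {count} ({end - start:.2f}s): {obs}")
        count += 1
    return code, success, durations


def _run(code: str) -> Tuple[bool, str]:
    # Hypothetical executor: succeeds once the code carries a "fixed" marker.
    ok = "fixed" in code
    return ok, ("ok" if ok else "NameError: x")


def _fix(code: str, obs: str) -> str:
    # Hypothetical fixer: appends a marker instead of calling a model.
    return code + "  # fixed"


final_code, ok, durations = retry_with_timing("print(x)", _run, _fix)
```

With this placement each reported duration covers both the fix request and the re-execution, so a slow round cannot be attributed to one or the other without separate timers.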
