vision-agent 1.1.16__tar.gz → 1.1.18__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (50) hide show
  1. {vision_agent-1.1.16 → vision_agent-1.1.18}/PKG-INFO +4 -4
  2. {vision_agent-1.1.16 → vision_agent-1.1.18}/pyproject.toml +4 -4
  3. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/.sim_tools/df.csv +12 -12
  4. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/.sim_tools/embs.npy +0 -0
  5. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/__init__.py +1 -0
  6. vision_agent-1.1.18/vision_agent/agent/vision_agent_prompts_v3.py +372 -0
  7. vision_agent-1.1.18/vision_agent/agent/vision_agent_v3.py +278 -0
  8. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/lmm/lmm.py +219 -57
  9. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/tools/__init__.py +3 -3
  10. vision_agent-1.1.18/vision_agent/tools/planner_v3_tools.py +206 -0
  11. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/tools/tools.py +55 -64
  12. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/agent.py +24 -8
  13. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/tools.py +1 -1
  14. {vision_agent-1.1.16 → vision_agent-1.1.18}/.gitignore +0 -0
  15. {vision_agent-1.1.16 → vision_agent-1.1.18}/LICENSE +0 -0
  16. {vision_agent-1.1.16 → vision_agent-1.1.18}/README.md +0 -0
  17. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/__init__.py +0 -0
  18. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/README.md +0 -0
  19. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/agent.py +0 -0
  20. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_coder_prompts_v2.py +0 -0
  21. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_coder_v2.py +0 -0
  22. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_planner_prompts_v2.py +0 -0
  23. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_planner_v2.py +0 -0
  24. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_prompts_v2.py +0 -0
  25. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_v2.py +0 -0
  26. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/clients/__init__.py +0 -0
  27. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/clients/http.py +0 -0
  28. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/configs/__init__.py +0 -0
  29. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/configs/anthropic_config.py +0 -0
  30. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/configs/config.py +0 -0
  31. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/configs/openai_config.py +0 -0
  32. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/fonts/__init__.py +0 -0
  33. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/fonts/default_font_ch_en.ttf +0 -0
  34. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/lmm/__init__.py +0 -0
  35. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/models/__init__.py +0 -0
  36. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/models/agent_types.py +0 -0
  37. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/models/lmm_types.py +0 -0
  38. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/models/tools_types.py +0 -0
  39. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/sim/__init__.py +0 -0
  40. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/sim/sim.py +0 -0
  41. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/tools/meta_tools.py +0 -0
  42. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/tools/planner_tools.py +0 -0
  43. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/tools/prompts.py +0 -0
  44. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/__init__.py +0 -0
  45. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/exceptions.py +0 -0
  46. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/execute.py +0 -0
  47. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/image_utils.py +0 -0
  48. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/tools_doc.py +0 -0
  49. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/video.py +0 -0
  50. {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/video_tracking.py +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: vision-agent
3
- Version: 1.1.16
3
+ Version: 1.1.18
4
4
  Summary: Toolset for Vision Agent
5
5
  Project-URL: Homepage, https://landing.ai
6
6
  Project-URL: repository, https://github.com/landing-ai/vision-agent
@@ -8,7 +8,7 @@ Project-URL: documentation, https://github.com/landing-ai/vision-agent
8
8
  Author-email: Landing AI <dev@landing.ai>
9
9
  License-File: LICENSE
10
10
  Requires-Python: <4.0,>=3.9
11
- Requires-Dist: anthropic<0.32,>=0.31.0
11
+ Requires-Dist: anthropic>=0.54.0
12
12
  Requires-Dist: av<12,>=11.0.0
13
13
  Requires-Dist: dotenv<0.10,>=0.9.9
14
14
  Requires-Dist: flake8<8,>=7.0.0
@@ -20,7 +20,7 @@ Requires-Dist: matplotlib<4,>=3.9.2
20
20
  Requires-Dist: nbclient<0.11,>=0.10.0
21
21
  Requires-Dist: nbformat<6,>=5.10.4
22
22
  Requires-Dist: numpy<2.0.0,>=1.21.0
23
- Requires-Dist: openai==1.55.3
23
+ Requires-Dist: openai>=1.86.0
24
24
  Requires-Dist: opencv-python==4.*
25
25
  Requires-Dist: opentelemetry-api<2,>=1.29.0
26
26
  Requires-Dist: pandas==2.*
@@ -36,7 +36,7 @@ Requires-Dist: tabulate<0.10,>=0.9.0
36
36
  Requires-Dist: tenacity<9,>=8.3.0
37
37
  Requires-Dist: tqdm<5.0.0,>=4.64.0
38
38
  Requires-Dist: typing-extensions==4.*
39
- Requires-Dist: yt-dlp>=2025.3.31
39
+ Requires-Dist: yt-dlp>=2025.6.9
40
40
  Description-Content-Type: text/markdown
41
41
 
42
42
  <div align="center">
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
4
4
 
5
5
  [project]
6
6
  name = "vision-agent"
7
- version = "1.1.16"
7
+ version = "1.1.18"
8
8
  description = "Toolset for Vision Agent"
9
9
  authors = [{ name = "Landing AI", email = "dev@landing.ai" }]
10
10
  requires-python = ">=3.9,<4.0"
@@ -15,7 +15,7 @@ dependencies = [
15
15
  "requests==2.*",
16
16
  "tqdm>=4.64.0,<5.0.0",
17
17
  "pandas==2.*",
18
- "openai==1.55.3",
18
+ "openai>=1.86.0",
19
19
  "httpx==0.27.2",
20
20
  "flake8>=7.0.0,<8",
21
21
  "typing_extensions==4.*",
@@ -28,7 +28,7 @@ dependencies = [
28
28
  "ipykernel>=6.29.4,<7",
29
29
  "tenacity>=8.3.0,<9",
30
30
  "pillow-heif>=0.16.0,<0.17",
31
- "anthropic>=0.31.0,<0.32",
31
+ "anthropic>=0.54.0",
32
32
  "pydantic>=2.0.0,<3",
33
33
  "av>=11.0.0,<12",
34
34
  "libcst>=1.5.0,<2",
@@ -38,7 +38,7 @@ dependencies = [
38
38
  "dotenv>=0.9.9,<0.10",
39
39
  "pymupdf>=1.23.0,<2",
40
40
  "google-genai>=1.0.0,<2",
41
- "yt-dlp>=2025.3.31",
41
+ "yt-dlp>=2025.6.9",
42
42
  ]
43
43
 
44
44
  [project.urls]
@@ -388,8 +388,8 @@ desc,doc,name
388
388
  -------
389
389
  >>> document_qa(image, question)
390
390
  'The answer to the question ...'",document_qa
391
- "'ocr' extracts text from an image. It returns a list of detected text, bounding boxes with normalized coordinates, and confidence scores. The results are sorted from top-left to bottom right.","ocr(image: numpy.ndarray) -> List[Dict[str, Any]]:
392
- 'ocr' extracts text from an image. It returns a list of detected text, bounding
391
+ "'paddle_ocr' extracts text from an image. It returns a list of detected text, bounding boxes with normalized coordinates, and confidence scores. The results are sorted from top-left to bottom right.","paddle_ocr(image: numpy.ndarray) -> List[Dict[str, Any]]:
392
+ 'paddle_ocr' extracts text from an image. It returns a list of detected text, bounding
393
393
  boxes with normalized coordinates, and confidence scores. The results are sorted
394
394
  from top-left to bottom right.
395
395
 
@@ -402,10 +402,10 @@ desc,doc,name
402
402
 
403
403
  Example
404
404
  -------
405
- >>> ocr(image)
405
+ >>> paddle_ocr(image)
406
406
  [
407
407
  {'label': 'hello world', 'bbox': [0.1, 0.11, 0.35, 0.4], 'score': 0.99},
408
- ]",ocr
408
+ ]",paddle_ocr
409
409
  "'gemini_image_generation' performs either image inpainting given an image and text prompt, or image generation given a prompt. It can be used to edit parts of an image or the entire image according to the prompt given.","gemini_image_generation(prompt: str, image: Optional[numpy.ndarray] = None) -> numpy.ndarray:
410
410
  'gemini_image_generation' performs either image inpainting given an image and text prompt, or image generation given a prompt.
411
411
  It can be used to edit parts of an image or the entire image according to the prompt given.
@@ -484,26 +484,26 @@ desc,doc,name
484
484
  {'start_time': 2, 'end_time': 4, 'location': 'Outdoor area', 'description': 'A person approaches a white bicycle parked in a row. The person then swings their leg over the bike and gets on it.', 'label': 0},
485
485
  {'start_time': 10, 'end_time': 13, 'location': 'Outdoor area', 'description': 'A person gets off a white bicycle parked in a row. The person swings their leg over the bike and dismounts.', 'label': 1},
486
486
  ]",agentic_activity_recognition
487
- 'depth_anything_v2' is a tool that runs depth anything v2 model to generate a depth image from a given RGB image. The returned depth image is monochrome and represents depth values as pixel intensities with pixel values ranging from 0 to 255.,"depth_anything_v2(image: numpy.ndarray) -> numpy.ndarray:
488
- 'depth_anything_v2' is a tool that runs depth anything v2 model to generate a
489
- depth image from a given RGB image. The returned depth image is monochrome and
490
- represents depth values as pixel intensities with pixel values ranging from 0 to 255.
487
+ "'depth_pro' is a tool that runs the Apple DepthPro model to generate a depth map from a given RGB image. The returned depth map has the same dimensions as the input image, with each pixel indicating the distance from the camera in meters.","depth_pro(image: numpy.ndarray) -> numpy.ndarray:
488
+ 'depth_pro' is a tool that runs the Apple DepthPro model to generate a
489
+ depth map from a given RGB image. The returned depth map has the same dimensions
490
+ as the input image, with each pixel indicating the distance from the camera in meters.
491
491
 
492
492
  Parameters:
493
493
  image (np.ndarray): The image to used to generate depth image
494
494
 
495
495
  Returns:
496
- np.ndarray: A grayscale depth image with pixel values ranging from 0 to 255
497
- where high values represent closer objects and low values further.
496
+ np.ndarray: A depth map with float32 pixel values that represent
497
+ the distance from the camera in meters.
498
498
 
499
499
  Example
500
500
  -------
501
- >>> depth_anything_v2(image)
501
+ >>> depth_pro(image)
502
502
  array([[0, 0, 0, ..., 0, 0, 0],
503
503
  [0, 20, 24, ..., 0, 100, 103],
504
504
  ...,
505
505
  [10, 11, 15, ..., 202, 202, 205],
506
- [10, 10, 10, ..., 200, 200, 200]], dtype=uint8),",depth_anything_v2
506
+ [10, 10, 10, ..., 200, 200, 200]], dtype=np.float32),",depth_pro
507
507
  'generate_pose_image' is a tool that generates a open pose bone/stick image from a given RGB image. The returned bone image is RGB with the pose amd keypoints colored and background as black.,"generate_pose_image(image: numpy.ndarray) -> numpy.ndarray:
508
508
  'generate_pose_image' is a tool that generates a open pose bone/stick image from
509
509
  a given RGB image. The returned bone image is RGB with the pose amd keypoints colored
@@ -2,3 +2,4 @@ from .agent import Agent, AgentCoder, AgentPlanner
2
2
  from .vision_agent_coder_v2 import VisionAgentCoderV2
3
3
  from .vision_agent_planner_v2 import VisionAgentPlannerV2
4
4
  from .vision_agent_v2 import VisionAgentV2
5
+ from .vision_agent_v3 import VisionAgentV3
@@ -0,0 +1,372 @@
1
+ TOOLS = """
2
+ load_image(image_path: str) -> np.ndarray:
3
+ A function that loads an image from a file path and returns it as a numpy array.
4
+
5
+ instance_segmentation(prompt: str, image: np.ndarray, threshold: float = 0.23, nms_threshold: float = 0.5) -> list[dict[str, str | float | list[float] | np.ndarray]]:
6
+ A function that takes a prompt and an image and returns a list of dictionaries.
7
+ Each dictionary represents an object detected in the image and contains it's label,
8
+ confidence score, bounding box coordinates and mask. The prompt can be a noun
9
+ phrase such as 'dog' or 'person', the model generally has higher recall and lower
10
+ precision. Do not use plural prompts only single for detecting multiple instances.
11
+ An example of the return value:
12
+ [{
13
+ "label": "dog",
14
+ "score": 0.95,
15
+ "bbox": [0.1, 0.2, 0.3, 0.4], # normalized coordinates
16
+ "mask": np.ndarray # binary mask
17
+ }, ...]
18
+
19
+ ocr(image: np.ndarray) -> list[dict[str, str | float | list[float]]:
20
+ A function that takes an image and returns the text detected in the image with
21
+ bounding boxes. For example:
22
+ [{
23
+ "score": # float confidence,
24
+ "bbox": [x1, y1, x2, y2] # normalized coordinates
25
+ "label": "this is some text inside the bounding box",
26
+ }, ...]
27
+
28
+ depth_estimation(image: np.ndarray) -> np.ndarray:
29
+ A function that takes an image and returns the depth map of the image as a numpy
30
+ array. The value represents an estimate in meters of the distance of the object
31
+ from the camera.
32
+
33
+ visualize_bounding_boxes(image: np.ndarray, bounding_boxes: list[dict[str, str | float | list[float] | np.ndarray]]) -> np.ndarray:
34
+ A function that takes an image and a list of bounding boxes and returns an image
35
+ that displays the boxes on the image.
36
+
37
+ visualize_segmentation_masks(image: np.ndarray, segmentation_masks: list[dict[str, str | float | list[float] | np.ndarray]]) -> np.ndarray:
38
+ A function that takes an image and a list of segmentation masks and returns an image
39
+ that displays the masks on the image.
40
+
41
+ get_crops(image: np.ndarray, bounding_box: list[dict[str, str | float | list[float] | np.ndarray]]) -> list[np.ndarray]:
42
+ A function that takes an image and a list of bounding boxes and returns the cropped
43
+ bounding boxes.
44
+
45
+ rotate_90(image: np.ndarray, k: int) -> np.ndarray:
46
+ Rotates the image 90 degrees k times. The function takes an image and an integer.
47
+
48
+ display_image(image: Union[np.ndarray, PILImageType, matplotlib.figure.Figure, str]) -> None:
49
+ A function that takes an image and displays it. The image can be a numpy array, a
50
+ PIL image, a matplotlib figure or a string (path to the image).
51
+
52
+ iou(pred1: list[float] | np.ndarray, pred2: list[float] | np.ndarray) -> float:
53
+ A function that takes either two bounding boxes or two masks and returns the
54
+ intersection over union. Remember to unnormalize the bounding boxes before
55
+ calculating the iou.
56
+ """
57
+
58
+ EXAMPLES = """
59
+ <code>
60
+ # This is a plan for trying to identify a missing object that existing detectors cannot find by taking advantage of a grid pattern positioning of the non-missing elements:
61
+
62
+ widths = [detection["bbox"][2] - detection["bbox"][0] for detection in detections]
63
+ heights = [detection["bbox"][3] - detection["bbox"][1] for detection in detections]
64
+
65
+ med_width = np.median(widths)
66
+ med_height = np.median(heights)
67
+
68
+ sorted_detections = sorted(detections, key=lambda x: x["bbox"][1])
69
+ rows = []
70
+ current_row = []
71
+ prev_y = sorted_detections[0]["bbox"][1]
72
+
73
+ for detection in sorted_detections:
74
+ if abs(detection["bbox"][1] - prev_y) > med_height / 2:
75
+ rows.append(current_row)
76
+ current_row = []
77
+ current_row.append(detection)
78
+ prev_y = detection["bbox"][1]
79
+
80
+ if current_row:
81
+ rows.append(current_row)
82
+ sorted_rows = [sorted(row, key=lambda x: x["bbox"][0]) for row in rows]
83
+ max_cols = max(len(row) for row in sorted_rows)
84
+ max_rows = len(sorted_rows)
85
+
86
+ column_positions = []
87
+ for col in range(max(len(row) for row in sorted_rows)):
88
+ column = [row[col] for row in sorted_rows if col < len(row)]
89
+ med_left = np.median([d["bbox"][0] for d in column])
90
+ med_right = np.median([d["bbox"][2] for d in column])
91
+ column_positions.append((med_left, med_right))
92
+
93
+ row_positions = []
94
+ for row in sorted_rows:
95
+ med_top = np.median([d["bbox"][1] for d in row])
96
+ med_bottom = np.median([d["bbox"][3] for d in row])
97
+ row_positions.append((med_top, med_bottom))
98
+
99
+
100
+ def find_element(left, right, top, bottom, elements):
101
+ center_x = (left + right) / 2
102
+ center_y = (top + bottom) / 2
103
+ for element in elements:
104
+ x_min, y_min, x_max, y_max = element["bbox"]
105
+ elt_center_x = (x_min + x_max) / 2
106
+ elt_center_y = (y_min + y_max) / 2
107
+ if (abs(center_x - elt_center_x) < med_width / 2) and (
108
+ abs(center_y - elt_center_y) < med_height / 2
109
+ ):
110
+ return element
111
+ return
112
+
113
+ missing_elements = []
114
+ for row in range(max_rows):
115
+ for col in range(max_cols):
116
+ left, right = column_positions[col]
117
+ top, bottom = row_positions[row]
118
+ match = find_element(left, right, top, bottom, sorted_rows[row])
119
+ if match is None:
120
+ missing_elements.append((left, top, right, bottom))
121
+ </code>
122
+
123
+ <code>
124
+ # This a plan to find objects that are only identifiable when compared to other objects such "find the smaller object"
125
+
126
+ from sklearn.cluster import KMeans
127
+ import numpy as np
128
+
129
+ detections = instance_segmentation("object", image)
130
+
131
+ def get_area(detection):
132
+ return (detection["bbox"][2] - detection["bbox"][0]) * (detection["bbox"][3] - detection["bbox"][1])
133
+
134
+ areas = [get_area(detection) for detection in detections]
135
+ X = np.array(areas)[:, None]
136
+
137
+ kmeans = KMeans(n_clusters=<number of clusters>).fit(X)
138
+ smallest_cluster = np.argmin(kmeans.cluster_centers_)
139
+ largest_cluster = np.argmax(kmeans.cluster_centers_)
140
+
141
+ clusters = kmeans.predict(X)
142
+ smallest_detections = [detection for detection, cluster in zip(detections, clusters) if cluster == smallest_cluster]
143
+ largest_detections = [detection for detection, cluster in zip(detections, clusters) if cluster == largest_cluster]
144
+ </code>
145
+
146
+ <code>
147
+ # This is a plan to help find object that are identified spatially relative to other 'anchor' objects, for example find the boxes to the right of the chair
148
+
149
+ # First find a model that can detect the location of the anchor objects
150
+ anchor_dets = instance_segmentation("anchor object", image)
151
+ # Then find a model that can detect the location of the relative objects
152
+ relative_dets = instance_segmentation("relative object", image)
153
+
154
+ # This will give you relative objects 'above' the anchor objects since it's the
155
+ # distance between the lower left corner of the relative object and the upper left
156
+ # corner of the anchor object. The remaining functions can be used to get the other
157
+ # relative positions.
158
+ def above_distance(box1, box2):
159
+ return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
160
+ box1["bbox"][3] - box2["bbox"][1]
161
+ ) ** 2
162
+
163
+ def below_distance(box1, box2):
164
+ return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
165
+ box1["bbox"][1] - box2["bbox"][3]
166
+ ) ** 2
167
+
168
+ def right_distance(box1, box2):
169
+ return (box1["bbox"][0] - box2["bbox"][2]) ** 2 + (
170
+ box1["bbox"][1] - box2["bbox"][1]
171
+ ) ** 2
172
+
173
+ def left_distance(box1, box2):
174
+ return (box1["bbox"][2] - box2["bbox"][0]) ** 2 + (
175
+ box1["bbox"][1] - box2["bbox"][1]
176
+ ) ** 2
177
+
178
+ closest_boxes = []
179
+ for anchor_det in anchor_dets:
180
+ # You can use any of the above functions to get the relative position
181
+ distances = [
182
+ (relative_det, above_distance(relative_det, anchor_det))
183
+ for relative_det in relative_dets
184
+ ]
185
+ # You must grab the nearest object for each of the anchors. This line will give
186
+ # you the box directly above the anchor box (or below, left, right depending on
187
+ # the function used)
188
+ closest_box = min(distances, key=lambda x: x[1])[0]
189
+ closest_boxes.append(closest_box)
190
+ </code>
191
+
192
+ <code>
193
+ # This is a plan to help you find objects according to their depth, for example find the person nearest to the camera
194
+
195
+ # First find a model to estimate the depth of the image
196
+ depth = depth_estimation(image)
197
+ # Then find a model to segment the objects in the image
198
+ masks = instance_segmentation("object", image)
199
+
200
+ for elt in masks:
201
+ # Multiply the depth by the mask and keep track of the mean depth for the masked
202
+ # object
203
+ depth_mask = depth * elt["mask"]
204
+ elt["mean_depth"] = depth_mask.mean()
205
+
206
+ # Sort the masks by mean depth in reverse, objects that are closer will have higher
207
+ # mean depth values and further objects will have lower mean depth values.
208
+ masks = sorted(masks, key=lambda x: x["mean_depth"], reverse=True)
209
+ closest_mask = masks[0]
210
+ </code>
211
+
212
+ <code>
213
+ # This plan helps you assign objects to other objects, for example "count the number of people sitting at a table"
214
+
215
+ pred = instance_segmentation("object 1, object 2", image)
216
+ objects_1 = [p for p in pred if p["label"] == "object 1"]
217
+ objects_2 = [p for p in pred if p["label"] == "object 2"]
218
+
219
+ def box_iou(bbox1: np.ndarray, bbox2: np.ndarray) -> float:
220
+ # Get coordinates of intersection rectangle
221
+ x1 = max(bbox1[0], bbox2[0])
222
+ y1 = max(bbox1[1], bbox2[1])
223
+ x2 = min(bbox1[2], bbox2[2])
224
+ y2 = min(bbox1[3], bbox2[3])
225
+
226
+ # Calculate area of intersection
227
+ intersection = max(0, x2 - x1) * max(0, y2 - y1)
228
+ if intersection == 0:
229
+ return 0.0
230
+
231
+ # Calculate area of both boxes
232
+ box1_area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
233
+ box2_area = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
234
+
235
+ # Calculate IoU
236
+ union = box1_area + box2_area - intersection
237
+ return intersection / union if union > 0 else 0.0
238
+
239
+ # initialize assignment counts
240
+ objects_2_counts = {i: 0 for i in range(len(objects_2))}
241
+ # you can set a minimum iou threshold for assignment
242
+ iou_threshold = 0.05
243
+
244
+ # You can expand the object 2 box by a certain percentage if needed to help with the
245
+ # assignment.
246
+ for object_2 in objects_2:
247
+ box = object_2["bbox"]
248
+ # If your camera is at an angle you need to expand the top of the box like so:
249
+ box = [0.9 * box[0], 0.9 * box[1], 1.1 * box[2], box[3]]
250
+ # If the camera is top down you should expand all sides of the box like this:
251
+ box = [0.9 * box[0], 0.9 * box[1], 1.1 * box[2], 1.1 * box[3]]
252
+
253
+ object_2["bbox"] = box
254
+
255
+ for object_1 in objects_1:
256
+ best_iou = 0
257
+ best_object_2_idx = None
258
+
259
+ for j, object_2 in enumerate(objects_2):
260
+ iou = box_iou(object_1["bbox"], object_2["bbox"])
261
+ if iou > best_iou and iou > iou_threshold:
262
+ best_iou = iou
263
+ best_object_2_idx = j
264
+
265
+ if best_object_2_idx is not None:
266
+ objects_2_counts[best_object_2_idx] += 1
267
+ </code>
268
+
269
+ <code>
270
+ # This plan helps you adjust the threshold to get the best results for a specific object detection model
271
+
272
+ dets = instance_segmentation("object", image)
273
+ proposed_threshold = 0.3
274
+ for d in dets:
275
+ d["dist"] = abs(d["score"] - proposed_threshold)
276
+
277
+ nearby_dets = sorted(dets, key=lambda x: x["dist"])[:10]
278
+ unnormalized_coords = []
279
+ for d in nearby_dets:
280
+ unnormalized_coords.append([
281
+ d["bbox"][0] * image.shape[1],
282
+ d["bbox"][1] * image.shape[0],
283
+ d["bbox"][2] * image.shape[1],
284
+ d["bbox"][3] * image.shape[0],
285
+ ])
286
+
287
+ crops = crop_image(image, unnormalized_coords)
288
+
289
+ for i, crop in enumerate(crops):
290
+ print(f"object {i}")
291
+ display(crop)
292
+ </code>
293
+
294
+ <code>
295
+ # this plan helps you analyze bounding boxes and filter out possible false positives
296
+
297
+ image = load_image(image_path)
298
+ dets = instance_segmentation("object", image)
299
+ crops = get_crops(image, dets)
300
+
301
+ # you can only display the first 5 crops
302
+ for i, crop in enumerate(crops[:5]):
303
+ print(f"object {i}, label: {dets[i]['label']}, score: {dets[i]['score']}")
304
+ display_image(crop)
305
+
306
+ # in your final <answer> you can choose only the crops that you want to keep
307
+ </code>
308
+
309
+ <code>
310
+ # If you want to zoom into an area of the image, you can use the following plan
311
+
312
+ image = load_image(image_path)
313
+ center_crop = get_crops(image, [{"bbox": [0.25, 0.25, 0.75, 0.75]}])[0]
314
+
315
+ # or you could split the image into 4 quadrants and display them
316
+ quadrants = get_crops(image, [{"bbox": [0, 0, 0.5, 0.5]}, {"bbox": [0.5, 0, 1, 0.5]}, {"bbox": [0, 0.5, 0.5, 1]}, {"bbox": [0.5, 0.5, 1, 1]}])
317
+ </code>
318
+ """
319
+
320
+
321
+ def get_init_prompt(
322
+ model: str,
323
+ turns: int,
324
+ question: str,
325
+ category: str,
326
+ image_path: str,
327
+ tool_docs: str = TOOLS,
328
+ examples: str = EXAMPLES,
329
+ ) -> str:
330
+ if category == "localization":
331
+ cat_text = "Return bounding box coordinates for this question."
332
+ elif category == "text":
333
+ cat_text = "Return only text for this question."
334
+ elif category == "counting":
335
+ cat_text = "Return only the number of objects for this question."
336
+ else:
337
+ cat_text = ""
338
+
339
+ initial_prompt = f"""You are given a question and an image from the user. Your task is using the vision tools and multi-modal reasoning to answer the user's question in a generalized way, so the user could give you a different image and run it through your code to get the correct answer to the same question. You have {turns} turns to solve the problem, be sure to utilize the tools to their fullest extent, also reflect on the results of the tools to best solve the problem. The tools are not perfect, so you may need to look into the results, refer back to the image and refine your answer.
340
+
341
+ Here is the documentation for the tools you have access to, you can implement new utilities if needed, but DO NOT redefine these functions or make placeholder functions for them or import them, you will have access to them:
342
+ <docs>
343
+ {tool_docs}
344
+ </docs>
345
+
346
+ Here's some example pseudo code plans that you can use to help solve your problem:
347
+ <examples>
348
+ {examples}
349
+ </examples>
350
+
351
+ <instructions>
352
+ - You should approximate the answer using a symbolic approach as much as possible (e.g. using combinations of the tools and python code to get the answer) since it's more robust and generalizable. However, if you need to use your own reasoning and observation to get the answer, you can do so.
353
+ - ALWAYS PUT THE CODE YOU WANT TO EXECUTE IN <code> TAGS (IMPORTANT!)
354
+ - You can progressively refine your answer using your own reflection on both the tasks, code outputs and tools to get the best results.
355
+ - During your iteration output <code> tags to execute code which will be run in a jupyter notebook environment.
356
+ - Do not re-define the tools as an another function, you will have access to them (e.g load_image, etc)
357
+ - For object detection queries, include only the bounding box coordinates within <answer> and omit any image data.
358
+ - If one of the tool results has low confidence or you believe it's wrong, you should focus on that result manually by yourself, observe and verify it, do not just accept it.
359
+ - Within <answer>, provide only the final answer without any explanation. For instance, for a bounding box query, return in a list of bounding box, for example <answer>[[0.234375, 0.68359375, 0.33203125, 0.7356770833333334]]</answer> or <answer>3</answer> or <answer>blue</answer>. For other tasks, return a single word or number; if the task pertains to pricing, prefix the number with a dollar sign, e.g., <answer>$3.4</answer>.
360
+ - If you cannot find an answer to the question, for example you are asked to find an object and it does not exist in the image, you should return <answer>None</answer>.
361
+ - In your first turn, you should look into the requested image, observe intelligently, tell me what you see, think about the user question, and plan your next steps or analysis.
362
+ - Each turn you only visualize or display any image or crop maximum 5 images, use `display_image` or any tool that can display an image in an intelligent way. You can first analyze the result, then target few of them to display for further analysis or understanding, do not waste your turns displaying all the images or waste one turn to display many irrelevant images.
363
+ - Output in the following format:
364
+ {"<thinking>Your very verbose visual observation, Your thoughts on the next step, reflecting on the last code execution result, planning on the next steps</thinking>" if model != "claude" else ""}
365
+ <code>The python code you want to execute (image manipulation, tool calls, etc)</code>
366
+ <answer>The final answer you want to output, but compulsory to output if you are on your final turn</answer>
367
+ </instructions>
368
+
369
+ Here is the user question: {question}
370
+ Here is the image path: {image_path}
371
+ {cat_text}"""
372
+ return initial_prompt