vision-agent 1.1.16__tar.gz → 1.1.18__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {vision_agent-1.1.16 → vision_agent-1.1.18}/PKG-INFO +4 -4
- {vision_agent-1.1.16 → vision_agent-1.1.18}/pyproject.toml +4 -4
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/.sim_tools/df.csv +12 -12
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/.sim_tools/embs.npy +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/__init__.py +1 -0
- vision_agent-1.1.18/vision_agent/agent/vision_agent_prompts_v3.py +372 -0
- vision_agent-1.1.18/vision_agent/agent/vision_agent_v3.py +278 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/lmm/lmm.py +219 -57
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/tools/__init__.py +3 -3
- vision_agent-1.1.18/vision_agent/tools/planner_v3_tools.py +206 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/tools/tools.py +55 -64
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/agent.py +24 -8
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/tools.py +1 -1
- {vision_agent-1.1.16 → vision_agent-1.1.18}/.gitignore +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/LICENSE +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/README.md +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/__init__.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/README.md +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/agent.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_coder_prompts_v2.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_coder_v2.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_planner_prompts_v2.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_planner_v2.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_prompts_v2.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/agent/vision_agent_v2.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/clients/__init__.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/clients/http.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/configs/__init__.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/configs/anthropic_config.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/configs/config.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/configs/openai_config.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/fonts/__init__.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/fonts/default_font_ch_en.ttf +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/lmm/__init__.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/models/__init__.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/models/agent_types.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/models/lmm_types.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/models/tools_types.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/sim/__init__.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/sim/sim.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/tools/meta_tools.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/tools/planner_tools.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/tools/prompts.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/__init__.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/exceptions.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/execute.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/image_utils.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/tools_doc.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/video.py +0 -0
- {vision_agent-1.1.16 → vision_agent-1.1.18}/vision_agent/utils/video_tracking.py +0 -0
@@ -1,6 +1,6 @@
|
|
1
1
|
Metadata-Version: 2.4
|
2
2
|
Name: vision-agent
|
3
|
-
Version: 1.1.
|
3
|
+
Version: 1.1.18
|
4
4
|
Summary: Toolset for Vision Agent
|
5
5
|
Project-URL: Homepage, https://landing.ai
|
6
6
|
Project-URL: repository, https://github.com/landing-ai/vision-agent
|
@@ -8,7 +8,7 @@ Project-URL: documentation, https://github.com/landing-ai/vision-agent
|
|
8
8
|
Author-email: Landing AI <dev@landing.ai>
|
9
9
|
License-File: LICENSE
|
10
10
|
Requires-Python: <4.0,>=3.9
|
11
|
-
Requires-Dist: anthropic
|
11
|
+
Requires-Dist: anthropic>=0.54.0
|
12
12
|
Requires-Dist: av<12,>=11.0.0
|
13
13
|
Requires-Dist: dotenv<0.10,>=0.9.9
|
14
14
|
Requires-Dist: flake8<8,>=7.0.0
|
@@ -20,7 +20,7 @@ Requires-Dist: matplotlib<4,>=3.9.2
|
|
20
20
|
Requires-Dist: nbclient<0.11,>=0.10.0
|
21
21
|
Requires-Dist: nbformat<6,>=5.10.4
|
22
22
|
Requires-Dist: numpy<2.0.0,>=1.21.0
|
23
|
-
Requires-Dist: openai
|
23
|
+
Requires-Dist: openai>=1.86.0
|
24
24
|
Requires-Dist: opencv-python==4.*
|
25
25
|
Requires-Dist: opentelemetry-api<2,>=1.29.0
|
26
26
|
Requires-Dist: pandas==2.*
|
@@ -36,7 +36,7 @@ Requires-Dist: tabulate<0.10,>=0.9.0
|
|
36
36
|
Requires-Dist: tenacity<9,>=8.3.0
|
37
37
|
Requires-Dist: tqdm<5.0.0,>=4.64.0
|
38
38
|
Requires-Dist: typing-extensions==4.*
|
39
|
-
Requires-Dist: yt-dlp>=2025.
|
39
|
+
Requires-Dist: yt-dlp>=2025.6.9
|
40
40
|
Description-Content-Type: text/markdown
|
41
41
|
|
42
42
|
<div align="center">
|
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
|
|
4
4
|
|
5
5
|
[project]
|
6
6
|
name = "vision-agent"
|
7
|
-
version = "1.1.
|
7
|
+
version = "1.1.18"
|
8
8
|
description = "Toolset for Vision Agent"
|
9
9
|
authors = [{ name = "Landing AI", email = "dev@landing.ai" }]
|
10
10
|
requires-python = ">=3.9,<4.0"
|
@@ -15,7 +15,7 @@ dependencies = [
|
|
15
15
|
"requests==2.*",
|
16
16
|
"tqdm>=4.64.0,<5.0.0",
|
17
17
|
"pandas==2.*",
|
18
|
-
"openai
|
18
|
+
"openai>=1.86.0",
|
19
19
|
"httpx==0.27.2",
|
20
20
|
"flake8>=7.0.0,<8",
|
21
21
|
"typing_extensions==4.*",
|
@@ -28,7 +28,7 @@ dependencies = [
|
|
28
28
|
"ipykernel>=6.29.4,<7",
|
29
29
|
"tenacity>=8.3.0,<9",
|
30
30
|
"pillow-heif>=0.16.0,<0.17",
|
31
|
-
"anthropic>=0.
|
31
|
+
"anthropic>=0.54.0",
|
32
32
|
"pydantic>=2.0.0,<3",
|
33
33
|
"av>=11.0.0,<12",
|
34
34
|
"libcst>=1.5.0,<2",
|
@@ -38,7 +38,7 @@ dependencies = [
|
|
38
38
|
"dotenv>=0.9.9,<0.10",
|
39
39
|
"pymupdf>=1.23.0,<2",
|
40
40
|
"google-genai>=1.0.0,<2",
|
41
|
-
"yt-dlp>=2025.
|
41
|
+
"yt-dlp>=2025.6.9",
|
42
42
|
]
|
43
43
|
|
44
44
|
[project.urls]
|
@@ -388,8 +388,8 @@ desc,doc,name
|
|
388
388
|
-------
|
389
389
|
>>> document_qa(image, question)
|
390
390
|
'The answer to the question ...'",document_qa
|
391
|
-
"'
|
392
|
-
'
|
391
|
+
"'paddle_ocr' extracts text from an image. It returns a list of detected text, bounding boxes with normalized coordinates, and confidence scores. The results are sorted from top-left to bottom right.","paddle_ocr(image: numpy.ndarray) -> List[Dict[str, Any]]:
|
392
|
+
'paddle_ocr' extracts text from an image. It returns a list of detected text, bounding
|
393
393
|
boxes with normalized coordinates, and confidence scores. The results are sorted
|
394
394
|
from top-left to bottom right.
|
395
395
|
|
@@ -402,10 +402,10 @@ desc,doc,name
|
|
402
402
|
|
403
403
|
Example
|
404
404
|
-------
|
405
|
-
>>>
|
405
|
+
>>> paddle_ocr(image)
|
406
406
|
[
|
407
407
|
{'label': 'hello world', 'bbox': [0.1, 0.11, 0.35, 0.4], 'score': 0.99},
|
408
|
-
]",
|
408
|
+
]",paddle_ocr
|
409
409
|
"'gemini_image_generation' performs either image inpainting given an image and text prompt, or image generation given a prompt. It can be used to edit parts of an image or the entire image according to the prompt given.","gemini_image_generation(prompt: str, image: Optional[numpy.ndarray] = None) -> numpy.ndarray:
|
410
410
|
'gemini_image_generation' performs either image inpainting given an image and text prompt, or image generation given a prompt.
|
411
411
|
It can be used to edit parts of an image or the entire image according to the prompt given.
|
@@ -484,26 +484,26 @@ desc,doc,name
|
|
484
484
|
{'start_time': 2, 'end_time': 4, 'location': 'Outdoor area', 'description': 'A person approaches a white bicycle parked in a row. The person then swings their leg over the bike and gets on it.', 'label': 0},
|
485
485
|
{'start_time': 10, 'end_time': 13, 'location': 'Outdoor area', 'description': 'A person gets off a white bicycle parked in a row. The person swings their leg over the bike and dismounts.', 'label': 1},
|
486
486
|
]",agentic_activity_recognition
|
487
|
-
'
|
488
|
-
'
|
489
|
-
depth
|
490
|
-
|
487
|
+
"'depth_pro' is a tool that runs the Apple DepthPro model to generate a depth map from a given RGB image. The returned depth map has the same dimensions as the input image, with each pixel indicating the distance from the camera in meters.","depth_pro(image: numpy.ndarray) -> numpy.ndarray:
|
488
|
+
'depth_pro' is a tool that runs the Apple DepthPro model to generate a
|
489
|
+
depth map from a given RGB image. The returned depth map has the same dimensions
|
490
|
+
as the input image, with each pixel indicating the distance from the camera in meters.
|
491
491
|
|
492
492
|
Parameters:
|
493
493
|
image (np.ndarray): The image to used to generate depth image
|
494
494
|
|
495
495
|
Returns:
|
496
|
-
np.ndarray: A
|
497
|
-
|
496
|
+
np.ndarray: A depth map with float32 pixel values that represent
|
497
|
+
the distance from the camera in meters.
|
498
498
|
|
499
499
|
Example
|
500
500
|
-------
|
501
|
-
>>>
|
501
|
+
>>> depth_pro(image)
|
502
502
|
array([[0, 0, 0, ..., 0, 0, 0],
|
503
503
|
[0, 20, 24, ..., 0, 100, 103],
|
504
504
|
...,
|
505
505
|
[10, 11, 15, ..., 202, 202, 205],
|
506
|
-
[10, 10, 10, ..., 200, 200, 200]], dtype=
|
506
|
+
[10, 10, 10, ..., 200, 200, 200]], dtype=np.float32),",depth_pro
|
507
507
|
'generate_pose_image' is a tool that generates a open pose bone/stick image from a given RGB image. The returned bone image is RGB with the pose amd keypoints colored and background as black.,"generate_pose_image(image: numpy.ndarray) -> numpy.ndarray:
|
508
508
|
'generate_pose_image' is a tool that generates a open pose bone/stick image from
|
509
509
|
a given RGB image. The returned bone image is RGB with the pose amd keypoints colored
|
Binary file
|
@@ -0,0 +1,372 @@
|
|
1
|
+
TOOLS = """
|
2
|
+
load_image(image_path: str) -> np.ndarray:
|
3
|
+
A function that loads an image from a file path and returns it as a numpy array.
|
4
|
+
|
5
|
+
instance_segmentation(prompt: str, image: np.ndarray, threshold: float = 0.23, nms_threshold: float = 0.5) -> list[dict[str, str | float | list[float] | np.ndarray]]:
|
6
|
+
A function that takes a prompt and an image and returns a list of dictionaries.
|
7
|
+
Each dictionary represents an object detected in the image and contains it's label,
|
8
|
+
confidence score, bounding box coordinates and mask. The prompt can be a noun
|
9
|
+
phrase such as 'dog' or 'person', the model generally has higher recall and lower
|
10
|
+
precision. Do not use plural prompts only single for detecting multiple instances.
|
11
|
+
An example of the return value:
|
12
|
+
[{
|
13
|
+
"label": "dog",
|
14
|
+
"score": 0.95,
|
15
|
+
"bbox": [0.1, 0.2, 0.3, 0.4], # normalized coordinates
|
16
|
+
"mask": np.ndarray # binary mask
|
17
|
+
}, ...]
|
18
|
+
|
19
|
+
ocr(image: np.ndarray) -> list[dict[str, str | float | list[float]]:
|
20
|
+
A function that takes an image and returns the text detected in the image with
|
21
|
+
bounding boxes. For example:
|
22
|
+
[{
|
23
|
+
"score": # float confidence,
|
24
|
+
"bbox": [x1, y1, x2, y2] # normalized coordinates
|
25
|
+
"label": "this is some text inside the bounding box",
|
26
|
+
}, ...]
|
27
|
+
|
28
|
+
depth_estimation(image: np.ndarray) -> np.ndarray:
|
29
|
+
A function that takes an image and returns the depth map of the image as a numpy
|
30
|
+
array. The value represents an estimate in meters of the distance of the object
|
31
|
+
from the camera.
|
32
|
+
|
33
|
+
visualize_bounding_boxes(image: np.ndarray, bounding_boxes: list[dict[str, str | float | list[float] | np.ndarray]]) -> np.ndarray:
|
34
|
+
A function that takes an image and a list of bounding boxes and returns an image
|
35
|
+
that displays the boxes on the image.
|
36
|
+
|
37
|
+
visualize_segmentation_masks(image: np.ndarray, segmentation_masks: list[dict[str, str | float | list[float] | np.ndarray]]) -> np.ndarray:
|
38
|
+
A function that takes an image and a list of segmentation masks and returns an image
|
39
|
+
that displays the masks on the image.
|
40
|
+
|
41
|
+
get_crops(image: np.ndarray, bounding_box: list[dict[str, str | float | list[float] | np.ndarray]]) -> list[np.ndarray]:
|
42
|
+
A function that takes an image and a list of bounding boxes and returns the cropped
|
43
|
+
bounding boxes.
|
44
|
+
|
45
|
+
rotate_90(image: np.ndarray, k: int) -> np.ndarray:
|
46
|
+
Rotates the image 90 degrees k times. The function takes an image and an integer.
|
47
|
+
|
48
|
+
display_image(image: Union[np.ndarray, PILImageType, matplotlib.figure.Figure, str]) -> None:
|
49
|
+
A function that takes an image and displays it. The image can be a numpy array, a
|
50
|
+
PIL image, a matplotlib figure or a string (path to the image).
|
51
|
+
|
52
|
+
iou(pred1: list[float] | np.ndarray, pred2: list[float] | np.ndarray) -> float:
|
53
|
+
A function that takes either two bounding boxes or two masks and returns the
|
54
|
+
intersection over union. Remember to unnormalize the bounding boxes before
|
55
|
+
calculating the iou.
|
56
|
+
"""
|
57
|
+
|
58
|
+
EXAMPLES = """
|
59
|
+
<code>
|
60
|
+
# This is a plan for trying to identify a missing object that existing detectors cannot find by taking advantage of a grid pattern positioning of the non-missing elements:
|
61
|
+
|
62
|
+
widths = [detection["bbox"][2] - detection["bbox"][0] for detection in detections]
|
63
|
+
heights = [detection["bbox"][3] - detection["bbox"][1] for detection in detections]
|
64
|
+
|
65
|
+
med_width = np.median(widths)
|
66
|
+
med_height = np.median(heights)
|
67
|
+
|
68
|
+
sorted_detections = sorted(detections, key=lambda x: x["bbox"][1])
|
69
|
+
rows = []
|
70
|
+
current_row = []
|
71
|
+
prev_y = sorted_detections[0]["bbox"][1]
|
72
|
+
|
73
|
+
for detection in sorted_detections:
|
74
|
+
if abs(detection["bbox"][1] - prev_y) > med_height / 2:
|
75
|
+
rows.append(current_row)
|
76
|
+
current_row = []
|
77
|
+
current_row.append(detection)
|
78
|
+
prev_y = detection["bbox"][1]
|
79
|
+
|
80
|
+
if current_row:
|
81
|
+
rows.append(current_row)
|
82
|
+
sorted_rows = [sorted(row, key=lambda x: x["bbox"][0]) for row in rows]
|
83
|
+
max_cols = max(len(row) for row in sorted_rows)
|
84
|
+
max_rows = len(sorted_rows)
|
85
|
+
|
86
|
+
column_positions = []
|
87
|
+
for col in range(max(len(row) for row in sorted_rows)):
|
88
|
+
column = [row[col] for row in sorted_rows if col < len(row)]
|
89
|
+
med_left = np.median([d["bbox"][0] for d in column])
|
90
|
+
med_right = np.median([d["bbox"][2] for d in column])
|
91
|
+
column_positions.append((med_left, med_right))
|
92
|
+
|
93
|
+
row_positions = []
|
94
|
+
for row in sorted_rows:
|
95
|
+
med_top = np.median([d["bbox"][1] for d in row])
|
96
|
+
med_bottom = np.median([d["bbox"][3] for d in row])
|
97
|
+
row_positions.append((med_top, med_bottom))
|
98
|
+
|
99
|
+
|
100
|
+
def find_element(left, right, top, bottom, elements):
|
101
|
+
center_x = (left + right) / 2
|
102
|
+
center_y = (top + bottom) / 2
|
103
|
+
for element in elements:
|
104
|
+
x_min, y_min, x_max, y_max = element["bbox"]
|
105
|
+
elt_center_x = (x_min + x_max) / 2
|
106
|
+
elt_center_y = (y_min + y_max) / 2
|
107
|
+
if (abs(center_x - elt_center_x) < med_width / 2) and (
|
108
|
+
abs(center_y - elt_center_y) < med_height / 2
|
109
|
+
):
|
110
|
+
return element
|
111
|
+
return
|
112
|
+
|
113
|
+
missing_elements = []
|
114
|
+
for row in range(max_rows):
|
115
|
+
for col in range(max_cols):
|
116
|
+
left, right = column_positions[col]
|
117
|
+
top, bottom = row_positions[row]
|
118
|
+
match = find_element(left, right, top, bottom, sorted_rows[row])
|
119
|
+
if match is None:
|
120
|
+
missing_elements.append((left, top, right, bottom))
|
121
|
+
</code>
|
122
|
+
|
123
|
+
<code>
|
124
|
+
# This a plan to find objects that are only identifiable when compared to other objects such "find the smaller object"
|
125
|
+
|
126
|
+
from sklearn.cluster import KMeans
|
127
|
+
import numpy as np
|
128
|
+
|
129
|
+
detections = instance_segmentation("object", image)
|
130
|
+
|
131
|
+
def get_area(detection):
|
132
|
+
return (detection["bbox"][2] - detection["bbox"][0]) * (detection["bbox"][3] - detection["bbox"][1])
|
133
|
+
|
134
|
+
areas = [get_area(detection) for detection in detections]
|
135
|
+
X = np.array(areas)[:, None]
|
136
|
+
|
137
|
+
kmeans = KMeans(n_clusters=<number of clusters>).fit(X)
|
138
|
+
smallest_cluster = np.argmin(kmeans.cluster_centers_)
|
139
|
+
largest_cluster = np.argmax(kmeans.cluster_centers_)
|
140
|
+
|
141
|
+
clusters = kmeans.predict(X)
|
142
|
+
smallest_detections = [detection for detection, cluster in zip(detections, clusters) if cluster == smallest_cluster]
|
143
|
+
largest_detections = [detection for detection, cluster in zip(detections, clusters) if cluster == largest_cluster]
|
144
|
+
</code>
|
145
|
+
|
146
|
+
<code>
|
147
|
+
# This is a plan to help find object that are identified spatially relative to other 'anchor' objects, for example find the boxes to the right of the chair
|
148
|
+
|
149
|
+
# First find a model that can detect the location of the anchor objects
|
150
|
+
anchor_dets = instance_segmentation("anchor object", image)
|
151
|
+
# Then find a model that can detect the location of the relative objects
|
152
|
+
relative_dets = instance_segmentation("relative object", image)
|
153
|
+
|
154
|
+
# This will give you relative objects 'above' the anchor objects since it's the
|
155
|
+
# distance between the lower left corner of the relative object and the upper left
|
156
|
+
# corner of the anchor object. The remaining functions can be used to get the other
|
157
|
+
# relative positions.
|
158
|
+
def above_distance(box1, box2):
|
159
|
+
return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
|
160
|
+
box1["bbox"][3] - box2["bbox"][1]
|
161
|
+
) ** 2
|
162
|
+
|
163
|
+
def below_distance(box1, box2):
|
164
|
+
return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
|
165
|
+
box1["bbox"][1] - box2["bbox"][3]
|
166
|
+
) ** 2
|
167
|
+
|
168
|
+
def right_distance(box1, box2):
|
169
|
+
return (box1["bbox"][0] - box2["bbox"][2]) ** 2 + (
|
170
|
+
box1["bbox"][1] - box2["bbox"][1]
|
171
|
+
) ** 2
|
172
|
+
|
173
|
+
def left_distance(box1, box2):
|
174
|
+
return (box1["bbox"][2] - box2["bbox"][0]) ** 2 + (
|
175
|
+
box1["bbox"][1] - box2["bbox"][1]
|
176
|
+
) ** 2
|
177
|
+
|
178
|
+
closest_boxes = []
|
179
|
+
for anchor_det in anchor_dets:
|
180
|
+
# You can use any of the above functions to get the relative position
|
181
|
+
distances = [
|
182
|
+
(relative_det, above_distance(relative_det, anchor_det))
|
183
|
+
for relative_det in relative_dets
|
184
|
+
]
|
185
|
+
# You must grab the nearest object for each of the anchors. This line will give
|
186
|
+
# you the box directly above the anchor box (or below, left, right depending on
|
187
|
+
# the function used)
|
188
|
+
closest_box = min(distances, key=lambda x: x[1])[0]
|
189
|
+
closest_boxes.append(closest_box)
|
190
|
+
</code>
|
191
|
+
|
192
|
+
<code>
|
193
|
+
# This is a plan to help you find objects according to their depth, for example find the person nearest to the camera
|
194
|
+
|
195
|
+
# First find a model to estimate the depth of the image
|
196
|
+
depth = depth_estimation(image)
|
197
|
+
# Then find a model to segment the objects in the image
|
198
|
+
masks = instance_segmentation("object", image)
|
199
|
+
|
200
|
+
for elt in masks:
|
201
|
+
# Multiply the depth by the mask and keep track of the mean depth for the masked
|
202
|
+
# object
|
203
|
+
depth_mask = depth * elt["mask"]
|
204
|
+
elt["mean_depth"] = depth_mask.mean()
|
205
|
+
|
206
|
+
# Sort the masks by mean depth in reverse, objects that are closer will have higher
|
207
|
+
# mean depth values and further objects will have lower mean depth values.
|
208
|
+
masks = sorted(masks, key=lambda x: x["mean_depth"], reverse=True)
|
209
|
+
closest_mask = masks[0]
|
210
|
+
</code>
|
211
|
+
|
212
|
+
<code>
|
213
|
+
# This plan helps you assign objects to other objects, for example "count the number of people sitting at a table"
|
214
|
+
|
215
|
+
pred = instance_segmentation("object 1, object 2", image)
|
216
|
+
objects_1 = [p for p in pred if p["label"] == "object 1"]
|
217
|
+
objects_2 = [p for p in pred if p["label"] == "object 2"]
|
218
|
+
|
219
|
+
def box_iou(bbox1: np.ndarray, bbox2: np.ndarray) -> float:
|
220
|
+
# Get coordinates of intersection rectangle
|
221
|
+
x1 = max(bbox1[0], bbox2[0])
|
222
|
+
y1 = max(bbox1[1], bbox2[1])
|
223
|
+
x2 = min(bbox1[2], bbox2[2])
|
224
|
+
y2 = min(bbox1[3], bbox2[3])
|
225
|
+
|
226
|
+
# Calculate area of intersection
|
227
|
+
intersection = max(0, x2 - x1) * max(0, y2 - y1)
|
228
|
+
if intersection == 0:
|
229
|
+
return 0.0
|
230
|
+
|
231
|
+
# Calculate area of both boxes
|
232
|
+
box1_area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
|
233
|
+
box2_area = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
|
234
|
+
|
235
|
+
# Calculate IoU
|
236
|
+
union = box1_area + box2_area - intersection
|
237
|
+
return intersection / union if union > 0 else 0.0
|
238
|
+
|
239
|
+
# initialize assignment counts
|
240
|
+
objects_2_counts = {i: 0 for i in range(len(objects_2))}
|
241
|
+
# you can set a minimum iou threshold for assignment
|
242
|
+
iou_threshold = 0.05
|
243
|
+
|
244
|
+
# You can expand the object 2 box by a certain percentage if needed to help with the
|
245
|
+
# assignment.
|
246
|
+
for object_2 in objects_2:
|
247
|
+
box = object_2["bbox"]
|
248
|
+
# If your camera is at an angle you need to expand the top of the box like so:
|
249
|
+
box = [0.9 * box[0], 0.9 * box[1], 1.1 * box[2], box[3]]
|
250
|
+
# If the camera is top down you should expand all sides of the box like this:
|
251
|
+
box = [0.9 * box[0], 0.9 * box[1], 1.1 * box[2], 1.1 * box[3]]
|
252
|
+
|
253
|
+
object_2["bbox"] = box
|
254
|
+
|
255
|
+
for object_1 in objects_1:
|
256
|
+
best_iou = 0
|
257
|
+
best_object_2_idx = None
|
258
|
+
|
259
|
+
for j, object_2 in enumerate(objects_2):
|
260
|
+
iou = box_iou(object_1["bbox"], object_2["bbox"])
|
261
|
+
if iou > best_iou and iou > iou_threshold:
|
262
|
+
best_iou = iou
|
263
|
+
best_object_2_idx = j
|
264
|
+
|
265
|
+
if best_object_2_idx is not None:
|
266
|
+
objects_2_counts[best_object_2_idx] += 1
|
267
|
+
</code>
|
268
|
+
|
269
|
+
<code>
|
270
|
+
# This plan helps you adjust the threshold to get the best results for a specific object detection model
|
271
|
+
|
272
|
+
dets = instance_segmentation("object", image)
|
273
|
+
proposed_threshold = 0.3
|
274
|
+
for d in dets:
|
275
|
+
d["dist"] = abs(d["score"] - proposed_threshold)
|
276
|
+
|
277
|
+
nearby_dets = sorted(dets, key=lambda x: x["dist"])[:10]
|
278
|
+
unnormalized_coords = []
|
279
|
+
for d in nearby_dets:
|
280
|
+
unnormalized_coords.append([
|
281
|
+
d["bbox"][0] * image.shape[1],
|
282
|
+
d["bbox"][1] * image.shape[0],
|
283
|
+
d["bbox"][2] * image.shape[1],
|
284
|
+
d["bbox"][3] * image.shape[0],
|
285
|
+
])
|
286
|
+
|
287
|
+
crops = crop_image(image, unnormalized_coords)
|
288
|
+
|
289
|
+
for i, crop in enumerate(crops):
|
290
|
+
print(f"object {i}")
|
291
|
+
display(crop)
|
292
|
+
</code>
|
293
|
+
|
294
|
+
<code>
|
295
|
+
# this plan helps you analyze bounding boxes and filter out possible false positives
|
296
|
+
|
297
|
+
image = load_image(image_path)
|
298
|
+
dets = instance_segmentation("object", image)
|
299
|
+
crops = get_crops(image, dets)
|
300
|
+
|
301
|
+
# you can only display the first 5 crops
|
302
|
+
for i, crop in enumerate(crops[:5]):
|
303
|
+
print(f"object {i}, label: {dets[i]['label']}, score: {dets[i]['score']}")
|
304
|
+
display_image(crop)
|
305
|
+
|
306
|
+
# in your final <answer> you can choose only the crops that you want to keep
|
307
|
+
</code>
|
308
|
+
|
309
|
+
<code>
|
310
|
+
# If you want to zoom into an area of the image, you can use the following plan
|
311
|
+
|
312
|
+
image = load_image(image_path)
|
313
|
+
center_crop = get_crops(image, [{"bbox": [0.25, 0.25, 0.75, 0.75]}])[0]
|
314
|
+
|
315
|
+
# or you could split the image into 4 quadrants and display them
|
316
|
+
quadrants = get_crops(image, [{"bbox": [0, 0, 0.5, 0.5]}, {"bbox": [0.5, 0, 1, 0.5]}, {"bbox": [0, 0.5, 0.5, 1]}, {"bbox": [0.5, 0.5, 1, 1]}])
|
317
|
+
</code>
|
318
|
+
"""
|
319
|
+
|
320
|
+
|
321
|
+
def get_init_prompt(
|
322
|
+
model: str,
|
323
|
+
turns: int,
|
324
|
+
question: str,
|
325
|
+
category: str,
|
326
|
+
image_path: str,
|
327
|
+
tool_docs: str = TOOLS,
|
328
|
+
examples: str = EXAMPLES,
|
329
|
+
) -> str:
|
330
|
+
if category == "localization":
|
331
|
+
cat_text = "Return bounding box coordinates for this question."
|
332
|
+
elif category == "text":
|
333
|
+
cat_text = "Return only text for this question."
|
334
|
+
elif category == "counting":
|
335
|
+
cat_text = "Return only the number of objects for this question."
|
336
|
+
else:
|
337
|
+
cat_text = ""
|
338
|
+
|
339
|
+
initial_prompt = f"""You are given a question and an image from the user. Your task is using the vision tools and multi-modal reasoning to answer the user's question in a generalized way, so the user could give you a different image and run it through your code to get the correct answer to the same question. You have {turns} turns to solve the problem, be sure to utilize the tools to their fullest extent, also reflect on the results of the tools to best solve the problem. The tools are not perfect, so you may need to look into the results, refer back to the image and refine your answer.
|
340
|
+
|
341
|
+
Here is the documentation for the tools you have access to, you can implement new utilities if needed, but DO NOT redefine these functions or make placeholder functions for them or import them, you will have access to them:
|
342
|
+
<docs>
|
343
|
+
{tool_docs}
|
344
|
+
</docs>
|
345
|
+
|
346
|
+
Here's some example pseudo code plans that you can use to help solve your problem:
|
347
|
+
<examples>
|
348
|
+
{examples}
|
349
|
+
</examples>
|
350
|
+
|
351
|
+
<instructions>
|
352
|
+
- You should approximate the answer using a symbolic approach as much as possible (e.g. using combinations of the tools and python code to get the answer) since it's more robust and generalizable. However, if you need to use your own reasoning and observation to get the answer, you can do so.
|
353
|
+
- ALWAYS PUT THE CODE YOU WANT TO EXECUTE IN <code> TAGS (IMPORTANT!)
|
354
|
+
- You can progressively refine your answer using your own reflection on both the tasks, code outputs and tools to get the best results.
|
355
|
+
- During your iteration output <code> tags to execute code which will be run in a jupyter notebook environment.
|
356
|
+
- Do not re-define the tools as an another function, you will have access to them (e.g load_image, etc)
|
357
|
+
- For object detection queries, include only the bounding box coordinates within <answer> and omit any image data.
|
358
|
+
- If one of the tool results has low confidence or you believe it's wrong, you should focus on that result manually by yourself, observe and verify it, do not just accept it.
|
359
|
+
- Within <answer>, provide only the final answer without any explanation. For instance, for a bounding box query, return in a list of bounding box, for example <answer>[[0.234375, 0.68359375, 0.33203125, 0.7356770833333334]]</answer> or <answer>3</answer> or <answer>blue</answer>. For other tasks, return a single word or number; if the task pertains to pricing, prefix the number with a dollar sign, e.g., <answer>$3.4</answer>.
|
360
|
+
- If you cannot find an answer to the question, for example you are asked to find an object and it does not exist in the image, you should return <answer>None</answer>.
|
361
|
+
- In your first turn, you should look into the requested image, observe intelligently, tell me what you see, think about the user question, and plan your next steps or analysis.
|
362
|
+
- Each turn you only visualize or display any image or crop maximum 5 images, use `display_image` or any tool that can display an image in an intelligent way. You can first analyze the result, then target few of them to display for further analysis or understanding, do not waste your turns displaying all the images or waste one turn to display many irrelevant images.
|
363
|
+
- Output in the following format:
|
364
|
+
{"<thinking>Your very verbose visual observation, Your thoughts on the next step, reflecting on the last code execution result, planning on the next steps</thinking>" if model != "claude" else ""}
|
365
|
+
<code>The python code you want to execute (image manipulation, tool calls, etc)</code>
|
366
|
+
<answer>The final answer you want to output, but compulsory to output if you are on your final turn</answer>
|
367
|
+
</instructions>
|
368
|
+
|
369
|
+
Here is the user question: {question}
|
370
|
+
Here is the image path: {image_path}
|
371
|
+
{cat_text}"""
|
372
|
+
return initial_prompt
|