vision-agent 0.2.56__py3-none-any.whl → 0.2.58__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -11,6 +11,7 @@ from .tools import (
  closest_box_distance,
  closest_mask_distance,
  extract_frames,
+ get_tool_documentation,
  grounding_dino,
  grounding_sam,
  image_caption,
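For context, `get_tool_documentation` is now re-exported alongside the other tools. A minimal sketch of how it might be called, assuming (not verified against 0.2.58) that it accepts a list of tool functions and returns their combined docstrings:

```python
# Hedged sketch: the exact signature of get_tool_documentation is an assumption.
from vision_agent.tools import get_tool_documentation, grounding_dino, grounding_sam

# Build a single documentation string for a subset of tools,
# e.g. to paste into a code-generation prompt.
docs = get_tool_documentation([grounding_dino, grounding_sam])
print(docs)
```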
@@ -37,7 +38,7 @@ def register_tool(imports: Optional[List] = None) -> Callable:
  def decorator(tool: Callable) -> Callable:
  import inspect

- from .tools import get_tool_descriptions, get_tool_documentation, get_tools_df
+ from .tools import get_tool_descriptions, get_tools_df

  global TOOLS, TOOLS_DF, TOOL_DESCRIPTIONS, TOOL_DOCSTRING

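For reference, a hedged sketch of registering a custom tool with the decorator shown above; the decorator's exact behavior in 0.2.58 is inferred from its signature here and may differ:

```python
# Hedged sketch: assumes register_tool records the wrapped function and its
# docstring alongside the built-in tools; the `imports` handling is an assumption.
import vision_agent as va

@va.tools.register_tool(imports=["import numpy as np"])
def box_area(box: list) -> float:
    """'box_area' returns the area in pixels of an [xmin, ymin, xmax, ymax] box."""
    return float((box[2] - box[0]) * (box[3] - box[1]))
```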
@@ -316,14 +316,14 @@ def visual_prompt_counting(
  return resp_data


- def image_question_answering(image: np.ndarray, prompt: str) -> str:
+ def image_question_answering(prompt: str, image: np.ndarray) -> str:
  """'image_question_answering_' is a tool that can answer questions about the visual
  contents of an image given a question and an image. It returns an answer to the
  question

  Parameters:
- image (np.ndarray): The reference image used for the question
  prompt (str): The question about the image
+ image (np.ndarray): The reference image used for the question

  Returns:
  str: A string which is the answer to the given prompt. E.g. {'text': 'This
@@ -331,7 +331,7 @@ def image_question_answering(image: np.ndarray, prompt: str) -> str:

  Example
  -------
- >>> image_question_answering(image, 'What is the cat doing ?')
+ >>> image_question_answering('What is the cat doing ?', image)
  'drinking milk'
  """

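The only change here is the argument order, so existing callers need their arguments swapped. A hedged sketch of an updated call site; `load_image` is assumed to return an `np.ndarray`, as in the README excerpt later in this diff:

```python
# Hedged sketch: caller updated for the new (prompt, image) order in 0.2.58.
import vision_agent as va

image = va.tools.load_image("cat.jpg")  # illustrative path
answer = va.tools.image_question_answering("What is the cat doing?", image)
print(answer)  # e.g. 'drinking milk'
```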
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: vision-agent
- Version: 0.2.56
+ Version: 0.2.58
  Summary: Toolset for Vision Agent
  Author: Landing AI
  Author-email: dev@landing.ai
@@ -77,6 +77,7 @@ export OPENAI_API_KEY="your-api-key"
  ```

  ### Vision Agent
+ #### Basic Usage
  You can interact with the agent as you would with any LLM or LMM model:

  ```python
@@ -125,10 +126,12 @@ mode by passing in the verbose argument:
  >>> agent = VisionAgent(verbose=2)
  ```

- You can also have it return more information by calling `chat_with_workflow`:
+ #### Detailed Usage
+ You can also have it return more information by calling `chat_with_workflow`. The format
+ of the input is a list of dictionaries with the keys `role`, `content`, and `media`:

  ```python
- >>> results = agent.chat_with_workflow([{"role": "user", "content": "What percentage of the area of the jar is filled with coffee beans?"}], media="jar.jpg")
+ >>> results = agent.chat_with_workflow([{"role": "user", "content": "What percentage of the area of the jar is filled with coffee beans?", "media": ["jar.jpg"]}])
  >>> print(results)
  {
  "code": "from vision_agent.tools import ..."
@@ -139,19 +142,45 @@ You can also have it return more information by calling `chat_with_workflow`:
  }
  ```

- With this you can examine more detailed information such as the etesting code, testing
+ With this you can examine more detailed information such as the testing code, testing
  results, plan or working memory it used to complete the task.

+ #### Multi-turn conversations
+ You can have multi-turn conversations with vision-agent as well, giving it feedback on
+ the code and having it update. You just need to add the code as a response from the
+ assistant:
+
+ ```python
+ agent = va.agent.VisionAgent(verbosity=2)
+ conv = [
+ {
+ "role": "user",
+ "content": "Are these workers wearing safety gear? Output only a True or False value.",
+ "media": ["workers.png"],
+ }
+ ]
+ result = agent.chat_with_workflow(conv)
+ code = result["code"]
+ conv.append({"role": "assistant", "content": code})
+ conv.append(
+ {
+ "role": "user",
+ "content": "Can you also return the number of workers wearing safety gear?",
+ }
+ )
+ result = agent.chat_with_workflow(conv)
+ ```
+
  ### Tools
  There are a variety of tools for the model or the user to use. Some are executed locally
- while others are hosted for you. You can also ask an LLM directly to build a tool for
+ while others are hosted for you. You can also ask an LMM directly to build a tool for
  you. For example:

  ```python
  >>> import vision_agent as va
- >>> llm = va.llm.OpenAILLM()
+ >>> llm = va.llm.OpenAILMM()
  >>> detector = llm.generate_detector("Can you build a jar detector for me?")
- >>> detector("jar.jpg")
+ >>> detector(va.tools.load_image("jar.jpg"))
  [{"labels": ["jar",],
  "scores": [0.99],
  "bboxes": [
@@ -0,0 +1,23 @@
+ vision_agent/__init__.py,sha256=EAb4-f9iyuEYkBrX4ag1syM8Syx8118_t0R6_C34M9w,57
+ vision_agent/agent/__init__.py,sha256=IUwfbPMcT8X_rnXMLmI8gJ4ltsHy_XSs9eLiKURJxeY,81
+ vision_agent/agent/agent.py,sha256=ZK-5lOtd9-eD9aWcXssJpnOyvZuO7_5hAmnb-6sWVe8,569
+ vision_agent/agent/vision_agent.py,sha256=QK3G8YdT5vwW5OIh1PN-2gJECFkVNkey-MdK4YdXo-c,24394
+ vision_agent/agent/vision_agent_prompts.py,sha256=bMXdZYf6kbikHn__tCGrYE1QvXC88EmpMpM_97V6szA,8472
+ vision_agent/fonts/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+ vision_agent/fonts/default_font_ch_en.ttf,sha256=1YM0Z3XqLDjSNbF7ihQFSAIUdjF9m1rtHiNC_6QosTE,1594400
+ vision_agent/lmm/__init__.py,sha256=3ro5lCIoS3DgEghOy0SPFrEhYvFnWZpVC5S5kSnIx6A,57
+ vision_agent/lmm/lmm.py,sha256=XqixNLuLNYu4-xXA8IOEdlcfgktds1ly6Ov7PiFLdsY,8706
+ vision_agent/tools/__init__.py,sha256=inKVLRUATQA9oi83l0NluC8Gm-LJU2-AjA6rL1j12Q8,1532
+ vision_agent/tools/prompts.py,sha256=V1z4YJLXZuUl_iZ5rY0M5hHc_2tmMEUKr0WocXKGt4E,1430
+ vision_agent/tools/tool_utils.py,sha256=wzRacbUpqk9hhfX_Y08rL8qP0XCN2w-8IZoYLi3Upn4,869
+ vision_agent/tools/tools.py,sha256=o9ojTfhu8KCSXfW4UPUNOhmki6A-l3jtVi0rPEnELjc,26944
+ vision_agent/utils/__init__.py,sha256=CW84HnhqI6XQVuxf2KifkLnSuO7EOhmuL09-gAymAak,219
+ vision_agent/utils/execute.py,sha256=GqoAodxtwTPBr1nujPTsWiZO2rBGvWVXTe8lgxY4d_g,20603
+ vision_agent/utils/image_utils.py,sha256=_cdiS5YrLzqkq_ZgFUO897m5M4_SCIThwUy4lOklfB8,7700
+ vision_agent/utils/sim.py,sha256=rGRGnjsy91IOn8qzt7k04PIRj5jyiaQyYAQl7ossPt8,4195
+ vision_agent/utils/type_defs.py,sha256=BlI8ywWHAplC7kYWLvt4AOdnKpEW3qWEFm-GEOSkrFQ,1792
+ vision_agent/utils/video.py,sha256=rNmU9KEIkZB5-EztZNlUiKYN0mm_55A_2VGUM0QpqLA,8779
+ vision_agent-0.2.58.dist-info/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
+ vision_agent-0.2.58.dist-info/METADATA,sha256=aC5l1nZmxjqdpZXyF4qy1gRGgdhwYazJl1kK8aqCU6o,7632
+ vision_agent-0.2.58.dist-info/WHEEL,sha256=7Z8_27uaHI_UZAc4Uox4PpBhQ9Y5_modZXWMxtUi4NU,88
+ vision_agent-0.2.58.dist-info/RECORD,,
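Each RECORD row is `path,sha256=<digest>,size`, where the digest is the urlsafe-base64 SHA-256 of the file with padding stripped, per the wheel spec. A minimal sketch for spot-checking an entry against an unpacked wheel; the unpack directory is illustrative:

```python
# Minimal sketch: recompute a RECORD-style digest for one file of an unpacked wheel.
import base64
import hashlib
from pathlib import Path

def record_digest(path: Path) -> str:
    digest = hashlib.sha256(path.read_bytes()).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()

root = Path("vision_agent-0.2.58")  # hypothetical directory from unzipping the wheel
entry = "vision_agent/utils/sim.py"
expected = "rGRGnjsy91IOn8qzt7k04PIRj5jyiaQyYAQl7ossPt8"
print(record_digest(root / entry) == expected)
```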
@@ -1,216 +0,0 @@
- import json
- import logging
- import os
- import sys
- from pathlib import Path
- from typing import Any, Dict, List, Optional, Union
-
- from rich.console import Console
- from rich.syntax import Syntax
-
- from vision_agent.agent import Agent
- from vision_agent.agent.agent_coder_prompts import (
- DEBUG,
- FIX_BUG,
- PROGRAM,
- TEST,
- VISUAL_TEST,
- )
- from vision_agent.llm import LLM, OpenAILLM
- from vision_agent.lmm import LMM, OpenAILMM
- from vision_agent.tools import TOOL_DOCSTRING, UTILITIES_DOCSTRING
- from vision_agent.utils import CodeInterpreterFactory
-
- IMPORT_HELPER = """
- import math
- import re
- import sys
- import copy
- import datetime
- import itertools
- import collections
- import heapq
- import statistics
- import functools
- import hashlib
- import numpy
- import numpy as np
- import string
- from typing import *
- from collections import *
- from vision_agent.tools import *
- """
- logging.basicConfig(stream=sys.stdout)
- _LOGGER = logging.getLogger(__name__)
- _EXECUTE = CodeInterpreterFactory.get_default_instance()
- _CONSOLE = Console()
-
-
- def write_tests(question: str, code: str, model: LLM) -> str:
- prompt = TEST.format(
- question=question,
- code=code,
- )
- completion = model(prompt)
- return preprocess_data(completion)
-
-
- def preprocess_data(code: str) -> str:
- if "```python" in code:
- code = code[code.find("```python") + len("```python") :]
- code = code[: code.find("```")]
- return code
-
-
- def parse_file_name(s: str) -> str:
- # We only output png files
- return "".join([p for p in s.split(" ") if p.endswith(".png")])
-
-
- def write_program(
- question: str, feedback: str, model: LLM, media: Optional[Union[str, Path]] = None
- ) -> str:
- prompt = PROGRAM.format(
- docstring=TOOL_DOCSTRING, question=question, feedback=feedback
- )
- if isinstance(model, OpenAILMM):
- completion = model(prompt, images=[media] if media else None)
- else:
- completion = model(prompt)
-
- return preprocess_data(completion)
-
-
- def write_debug(question: str, code: str, feedback: str, model: LLM) -> str:
- prompt = DEBUG.format(
- docstring=UTILITIES_DOCSTRING,
- code=code,
- question=question,
- feedback=feedback,
- )
- completion = model(prompt)
- return preprocess_data(completion)
-
-
- def execute_tests(code: str, tests: str) -> Dict[str, Union[str, bool]]:
- full_code = f"{IMPORT_HELPER}\n{code}\n{tests}"
- result = _EXECUTE.exec_isolation(full_code)
- return {"code": code, "result": result.text(), "passed": result.success}
-
-
- def run_visual_tests(
- question: str, code: str, viz_file: str, feedback: str, model: LMM
- ) -> Dict[str, Union[str, bool]]:
- prompt = VISUAL_TEST.format(
- docstring=TOOL_DOCSTRING,
- code=code,
- question=question,
- feedback=feedback,
- )
- completion = model(prompt, images=[viz_file])
- # type is from the prompt
- return json.loads(completion) # type: ignore
-
-
- def fix_bugs(code: str, tests: str, result: str, feedback: str, model: LLM) -> str:
- prompt = FIX_BUG.format(code=code, tests=tests, result=result, feedback=feedback)
- completion = model(prompt)
- return preprocess_data(completion)
-
-
- class AgentCoder(Agent):
- """AgentCoder is based off of the AgentCoder paper https://arxiv.org/abs/2312.13010
- and it's open source code https://github.com/huangd1999/AgentCoder with some key
- differences. AgentCoder comprises of 3 components: a coder agent, a tester agent,
- and an executor. The tester agents writes code to test the code written by the coder
- agent, but in our case because we are solving a vision task it's difficult to write
- testing code. We instead have the tester agent write code to visualize the output
- of the code written by the coder agent. If the code fails, we pass it back to the
- coder agent to fix the bug, if it succeeds we pass it to a visual tester agent, which
- is an LMM model like GPT4V, to visually inspect the output and make sure it looks
- good."""
-
- def __init__(
- self,
- coder_agent: Optional[LLM] = None,
- tester_agent: Optional[LLM] = None,
- visual_tester_agent: Optional[LMM] = None,
- verbose: bool = False,
- ) -> None:
- self.coder_agent = (
- OpenAILLM(temperature=0.1) if coder_agent is None else coder_agent
- )
- self.tester_agent = (
- OpenAILLM(temperature=0.1) if tester_agent is None else tester_agent
- )
- self.visual_tester_agent = (
- OpenAILMM(temperature=0.1, json_mode=True)
- if visual_tester_agent is None
- else visual_tester_agent
- )
- self.max_turns = 3
- self.verbose = verbose
- if self.verbose:
- _LOGGER.setLevel(logging.INFO)
-
- def __call__(
- self,
- input: Union[List[Dict[str, str]], str],
- media: Optional[Union[str, Path]] = None,
- ) -> str:
- if isinstance(input, str):
- input = [{"role": "user", "content": input}]
- return self.chat(input, media)
-
- def chat(
- self,
- input: List[Dict[str, str]],
- media: Optional[Union[str, Path]] = None,
- ) -> str:
- question = input[0]["content"]
- if media:
- question += f" Input file path: {os.path.abspath(media)}"
-
- code = ""
- feedback = ""
- for _ in range(self.max_turns):
- code = write_program(question, feedback, self.coder_agent, media=media)
- if self.verbose:
- _CONSOLE.print(
- Syntax(code, "python", theme="gruvbox-dark", line_numbers=True)
- )
- debug = write_debug(question, code, feedback, self.tester_agent)
- if self.verbose:
- _CONSOLE.print(
- Syntax(debug, "python", theme="gruvbox-dark", line_numbers=True)
- )
- results = execute_tests(code, debug)
- _LOGGER.info(
- f"execution results: passed: {results['passed']}\n{results['result']}"
- )
-
- if not results["passed"]:
- code = fix_bugs(
- code, debug, results["result"].strip(), feedback, self.coder_agent # type: ignore
- )
- if self.verbose:
- _CONSOLE.print(
- Syntax(code, "python", theme="gruvbox-dark", line_numbers=True)
- )
- else:
- # TODO: Sometimes it prints nothing, so we need to handle that case
- # TODO: The visual agent reflection does not work very well, needs more testing
- # viz_test_results = run_visual_tests(
- # question, code, parse_file_name(results["result"].strip()), feedback, self.visual_tester_agent
- # )
- # _LOGGER.info(f"visual test results:\n{viz_test_results}")
- # if viz_test_results["finished"]:
- # return f"{IMPORT_HELPER}\n{code}"
- # feedback += f"\n{viz_test_results['feedback']}"
-
- return f"{IMPORT_HELPER}\n{code}"
-
- return f"{IMPORT_HELPER}\n{code}"
-
- def log_progress(self, data: Dict[str, Any]) -> None:
- _LOGGER.info(data)
@@ -1,135 +0,0 @@
- PROGRAM = """
- **Role**: You are a software programmer.
-
- **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. Ensure that your code is efficient, readable, and well-commented. Return the requested information from the function you create.
-
- **Documentation**:
- This is the documentation for the functions you have access to. You may call any of these functions to help you complete the task, you do not need to worry about defining them or importing them and can assume they are available to you.
- {docstring}
-
- **Input Code Snippet**:
- ```python
- def execute(image_path: str):
- # Your code here
- ```
-
- **User Instructions**:
- {question}
-
- **Previous Feedback**:
- {feedback}
-
- **Instructions**:
- 1. **Understand and Clarify**: Make sure you understand the task.
- 2. **Algorithm/Method Selection**: Decide on the most efficient way.
- 3. **Pseudocode Creation**: Write down the steps you will follow in pseudocode.
- 4. **Code Generation**: Translate your pseudocode into executable Python code.
- """
-
- DEBUG = """
- **Role**: You are a software programmer.
-
- **Task**: Your task is to run the `execute` function and either print the output or print a file name containing visualized output for another agent to examine. The other agent will then use your output, either the printed return value of the function or the visualized output as a file, to determine if `execute` is functioning correctly.
-
- **Documentation**
- This is the documentation for the functions you have access to. You may call any of these functions to help you complete the task, you do not need to worry about defining them or importing them and can assume they are available to you.
- {docstring}
-
- **Input Code Snippet**:
- ```python
- ### Please decided how would you want to generate test cases. Based on incomplete code or completed version.
- {code}
- ```
-
- **User Instructions**:
- {question}
-
- **Previous Feedback**:
- {feedback}
-
- **Instructions**:
- 1. **Understand and Clarify**: Make sure you understand the task.
- 2. **Code Execution**: Run the `execute` function with the given input from the user instructions.
- 3. **Output Generation**: Print the output or save it as a file for visualization utilizing the functions you have access to.
- """
-
- VISUAL_TEST = """
- **Role**: You are a machine vision expert.
-
- **Task**: Your task is to visually inspect the output of the `execute` function and determine if the visualization of the function output looks correct given the user's instructions. If not, you can provide suggestions to improve the `execute` function to imporve it.
-
- **Documentation**:
- This is the documentation for the functions you have access to. You may call any of these functions to help you complete the task, you do not need to worry about defining them or importing them and can assume they are available to you.
- {docstring}
-
-
- **Input Code Snippet**:
- This is the code that
- ```python
- {code}
- ```
-
- **User Instructions**:
- {question}
-
- **Previous Feedback**:
- {feedback}
-
- **Instructions**:
- 1. **Visual Inspection**: Examine the visual output of the `execute` function.
- 2. **Evaluation**: Determine if the visualization is correct based on the user's instructions.
- 3. **Feedback**: Provide feedback on the visualization and suggest improvements if necessary.
- 4. **Clear Concrete Instructions**: Provide clear concrete instructions to improve the results. You can only make coding suggestions based on the either the input code snippet or the documented code provided. For example, do not say the threshold needs to be adjust, instead provide an exact value for adjusting the threshold.
-
- Provide output in JSON format {{"finished": boolean, "feedback": "your feedback"}} where "finished" is True if the output is correct and False if not and "feedback" is your feedback.
- """
-
- FIX_BUG = """
- Please re-complete the code to fix the error message. Here is the previous version:
- ```python
- {code}
- ```
-
- When we run this code:
- ```python
- {tests}
- ```
-
- It raises this error:
- ```python
- {result}
- ```
-
- This is previous feedback provided on the code:
- {feedback}
-
- Please fix the bug by follow the error information and only return python code. You do not need return the test cases. The re-completion code should in triple backticks format(i.e., in ```python ```).
- """
-
- TEST = """
- **Role**: As a tester, your task is to create comprehensive test cases for the incomplete `execute` function. These test cases should encompass Basic, Edge, and Large Scale scenarios to ensure the code's robustness, reliability, and scalability.
-
- **User Instructions**:
- {question}
-
- **Input Code Snippet**:
- ```python
- ### Please decided how would you want to generate test cases. Based on incomplete code or completed version.
- {code}
- ```
-
- **1. Basic Test Cases**:
- - **Objective**: To verify the fundamental functionality of the `has_close_elements` function under normal conditions.
-
- **2. Edge Test Cases**:
- - **Objective**: To evaluate the function's behavior under extreme or unusual conditions.
-
- **3. Large Scale Test Cases**:
- - **Objective**: To assess the function’s performance and scalability with large data samples.
-
- **Instructions**:
- - Implement a comprehensive set of test cases following the guidelines above.
- - Ensure each test case is well-documented with comments explaining the scenario it covers.
- - Pay special attention to edge cases as they often reveal hidden bugs.
- - For large-scale tests, focus on the function's efficiency and performance under heavy loads.
- """