vision-agent 0.2.218__tar.gz → 0.2.220__tar.gz
- {vision_agent-0.2.218 → vision_agent-0.2.220}/PKG-INFO +52 -16
- {vision_agent-0.2.218 → vision_agent-0.2.220}/README.md +51 -15
- {vision_agent-0.2.218 → vision_agent-0.2.220}/pyproject.toml +1 -1
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/.sim_tools/df.csv +21 -3
- vision_agent-0.2.220/vision_agent/.sim_tools/embs.npy +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_coder.py +4 -7
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_coder_v2.py +3 -3
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_planner.py +1 -1
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_planner_prompts.py +4 -4
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_planner_prompts_v2.py +4 -3
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/tools/__init__.py +1 -1
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/tools/planner_tools.py +4 -5
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/tools/tools.py +28 -17
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/utils/__init__.py +0 -1
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/utils/execute.py +2 -2
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/utils/image_utils.py +1 -1
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/utils/sim.py +51 -11
- vision_agent-0.2.218/vision_agent/.sim_tools/embs.npy +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/LICENSE +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/__init__.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/README.md +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/__init__.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/agent.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/agent_utils.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/types.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_coder_prompts.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_coder_prompts_v2.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_planner_v2.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_prompts.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_prompts_v2.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_v2.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/clients/__init__.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/clients/http.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/clients/landing_public_api.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/fonts/__init__.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/fonts/default_font_ch_en.ttf +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/lmm/__init__.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/lmm/lmm.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/lmm/types.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/tools/meta_tools.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/tools/prompts.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/tools/tool_utils.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/tools/tools_types.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/utils/exceptions.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/utils/type_defs.py +0 -0
- {vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/utils/video.py +0 -0
{vision_agent-0.2.218 → vision_agent-0.2.220}/PKG-INFO
RENAMED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: vision-agent
-Version: 0.2.218
+Version: 0.2.220
 Summary: Toolset for Vision Agent
 Author: Landing AI
 Author-email: dev@landing.ai
@@ -81,22 +81,26 @@ You can also run VisionAgent in a local Jupyter Notebook. Here are some example
 Check out the [notebooks](https://github.com/landing-ai/vision-agent/blob/main/examples/notebooks) folder for more examples.
 
 
-###
+### Get Started
 To get started with the python library, you can install it using pip:
 
+#### Installation and Setup
 ```bash
 pip install vision-agent
 ```
 
-Ensure you have both an Anthropic key and an OpenAI API key and set in your environment
-variables (if you are using Azure OpenAI please see the Azure setup section):
-
 ```bash
-export ANTHROPIC_API_KEY="your-api-key"
-export OPENAI_API_KEY="your-api-key" # needed for ToolRecommender
+export ANTHROPIC_API_KEY="your-api-key"
 ```
 
-
+---
+**NOTE**
+You must have the Anthropic API key set in your environment variables to use
+VisionAgent. If you don't have an Anthropic key you can use another provider like
+OpenAI or Ollama.
+---
+
+#### Chatting with VisionAgent
 To get started you can just import the `VisionAgent` and start chatting with it:
 ```python
 >>> from vision_agent.agent import VisionAgent
@@ -112,6 +116,40 @@ The chat messages are similar to `OpenAI`'s format with `role` and `content` key
 in addition to those you can add `media` which is a list of media files that can either
 be images or video files.
 
+#### Getting Code from VisionAgent
+You can also use `VisionAgentCoder` to generate code for you:
+
+```python
+>>> from vision_agent.agent import VisionAgentCoder
+>>> agent = VisionAgentCoder(verbosity=2)
+>>> code = agent("Count the number of people in this image", media="people.jpg")
+```
+
+#### Don't have Anthropic/OpenAI API keys?
+You can use `OllamaVisionAgentCoder` which uses Ollama as the backend. To get started
+pull the models:
+
+```bash
+ollama pull llama3.2-vision
+ollama pull mxbai-embed-large
+```
+
+Then you can use it just like you would use `VisionAgentCoder`:
+
+```python
+>>> from vision_agent.agent import OllamaVisionAgentCoder
+>>> agent = OllamaVisionAgentCoder(verbosity=2)
+>>> code = agent("Count the number of people in this image", media="people.jpg")
+```
+
+---
+**NOTE**
+Smaller open source models like Llama 3.1 8B will not work well with VisionAgent. You
+will encounter many coding errors because it generates incorrect code or JSON decoding
+errors because it generates incorrect JSON. We recommend using larger models or
+Anthropic/OpenAI models.
+---
+
 ## Documentation
 
 [VisionAgent Library Docs](https://landing-ai.github.io/vision-agent/)
@@ -120,8 +158,7 @@ be images or video files.
 ### Chatting and Message Formats
 `VisionAgent` is an agent that can chat with you and call other tools or agents to
 write vision code for you. You can interact with it like you would ChatGPT or any other
-chatbot. The agent uses Clause-3.5 for it's LMM
-for tools.
+chatbot. The agent uses Clause-3.5 for it's LMM.
 
 The message format is:
 ```json
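For context on the message format these README hunks reference, here is a minimal hedged sketch of one chat turn using only the keys the README describes (`role`, `content`, and the optional `media` list). The file name is a placeholder, and passing the message list directly to the agent call is an assumption; the README's own examples call the agent with a plain string prompt.

```python
from vision_agent.agent import VisionAgent

# One chat turn in the format described above: OpenAI-style `role`/`content`
# plus the optional `media` list of image or video file paths.
conversation = [
    {
        "role": "user",
        "content": "Can you count the number of people in this image?",
        "media": ["people.jpg"],  # hypothetical local file
    }
]

agent = VisionAgent()
# Passing a message list directly is an assumption based on the described format.
response = agent(conversation)
print(response)
```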
@@ -445,15 +482,14 @@ Usage is the same as `VisionAgentCoder`:
 `OllamaVisionAgentCoder` uses Ollama. To get started you must download a few models:
 
 ```bash
-ollama pull llama3.
+ollama pull llama3.2-vision
 ollama pull mxbai-embed-large
 ```
 
-`llama3.
-
-
-
-tools. You can use it just like you would use `VisionAgentCoder`:
+`llama3.2-vision` is used for the `OllamaLMM` for `OllamaVisionAgentCoder`. Becuase
+`llama3.2-vision` is a smaller model you **WILL see performance degredation** compared to
+using Anthropic or OpenAI models. `mxbai-embed-large` is the embedding model used to
+look up tools. You can use it just like you would use `VisionAgentCoder`:
 
 ```python
 >>> import vision_agent as va
{vision_agent-0.2.218 → vision_agent-0.2.220}/README.md
RENAMED
@@ -36,22 +36,26 @@ You can also run VisionAgent in a local Jupyter Notebook. Here are some example
 Check out the [notebooks](https://github.com/landing-ai/vision-agent/blob/main/examples/notebooks) folder for more examples.
 
 
-###
+### Get Started
 To get started with the python library, you can install it using pip:
 
+#### Installation and Setup
 ```bash
 pip install vision-agent
 ```
 
-Ensure you have both an Anthropic key and an OpenAI API key and set in your environment
-variables (if you are using Azure OpenAI please see the Azure setup section):
-
 ```bash
-export ANTHROPIC_API_KEY="your-api-key"
-export OPENAI_API_KEY="your-api-key" # needed for ToolRecommender
+export ANTHROPIC_API_KEY="your-api-key"
 ```
 
-
+---
+**NOTE**
+You must have the Anthropic API key set in your environment variables to use
+VisionAgent. If you don't have an Anthropic key you can use another provider like
+OpenAI or Ollama.
+---
+
+#### Chatting with VisionAgent
 To get started you can just import the `VisionAgent` and start chatting with it:
 ```python
 >>> from vision_agent.agent import VisionAgent
@@ -67,6 +71,40 @@ The chat messages are similar to `OpenAI`'s format with `role` and `content` key
 in addition to those you can add `media` which is a list of media files that can either
 be images or video files.
 
+#### Getting Code from VisionAgent
+You can also use `VisionAgentCoder` to generate code for you:
+
+```python
+>>> from vision_agent.agent import VisionAgentCoder
+>>> agent = VisionAgentCoder(verbosity=2)
+>>> code = agent("Count the number of people in this image", media="people.jpg")
+```
+
+#### Don't have Anthropic/OpenAI API keys?
+You can use `OllamaVisionAgentCoder` which uses Ollama as the backend. To get started
+pull the models:
+
+```bash
+ollama pull llama3.2-vision
+ollama pull mxbai-embed-large
+```
+
+Then you can use it just like you would use `VisionAgentCoder`:
+
+```python
+>>> from vision_agent.agent import OllamaVisionAgentCoder
+>>> agent = OllamaVisionAgentCoder(verbosity=2)
+>>> code = agent("Count the number of people in this image", media="people.jpg")
+```
+
+---
+**NOTE**
+Smaller open source models like Llama 3.1 8B will not work well with VisionAgent. You
+will encounter many coding errors because it generates incorrect code or JSON decoding
+errors because it generates incorrect JSON. We recommend using larger models or
+Anthropic/OpenAI models.
+---
+
 ## Documentation
 
 [VisionAgent Library Docs](https://landing-ai.github.io/vision-agent/)
@@ -75,8 +113,7 @@ be images or video files.
 ### Chatting and Message Formats
 `VisionAgent` is an agent that can chat with you and call other tools or agents to
 write vision code for you. You can interact with it like you would ChatGPT or any other
-chatbot. The agent uses Clause-3.5 for it's LMM
-for tools.
+chatbot. The agent uses Clause-3.5 for it's LMM.
 
 The message format is:
 ```json
@@ -400,15 +437,14 @@ Usage is the same as `VisionAgentCoder`:
 `OllamaVisionAgentCoder` uses Ollama. To get started you must download a few models:
 
 ```bash
-ollama pull llama3.
+ollama pull llama3.2-vision
 ollama pull mxbai-embed-large
 ```
 
-`llama3.
-
-
-
-tools. You can use it just like you would use `VisionAgentCoder`:
+`llama3.2-vision` is used for the `OllamaLMM` for `OllamaVisionAgentCoder`. Becuase
+`llama3.2-vision` is a smaller model you **WILL see performance degredation** compared to
+using Anthropic or OpenAI models. `mxbai-embed-large` is the embedding model used to
+look up tools. You can use it just like you would use `VisionAgentCoder`:
 
 ```python
 >>> import vision_agent as va
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/.sim_tools/df.csv
RENAMED
@@ -460,19 +460,37 @@ desc,doc,name
 -------
 >>> document_analysis(image)
 {'pages':
-[{'bbox': [0, 0,
-'chunks': [{'bbox': [
+[{'bbox': [0, 0, 1.0, 1.0],
+'chunks': [{'bbox': [0.8, 0.1, 1.0, 0.2],
 'label': 'page_header',
 'order': 75
 'caption': 'Annual Report 2024',
 'summary': 'This annual report summarizes ...' },
-{'bbox': [
+{'bbox': [0.2, 0.9, 0.9, 1.0],
 'label': table',
 'order': 1119,
 'caption': [{'Column 1': 'Value 1', 'Column 2': 'Value 2'},
 'summary': 'This table illustrates a trend of ...'},
 ],
 ",document_extraction
+"'document_qa' is a tool that can answer any questions about arbitrary documents, presentations, or tables. It's very useful for document QA tasks, you can ask it a specific question or ask it to return a JSON object answering multiple questions about the document.","document_qa(prompt: str, image: numpy.ndarray) -> str:
+    'document_qa' is a tool that can answer any questions about arbitrary documents,
+    presentations, or tables. It's very useful for document QA tasks, you can ask it a
+    specific question or ask it to return a JSON object answering multiple questions
+    about the document.
+
+    Parameters:
+        prompt (str): The question to be answered about the document image.
+        image (np.ndarray): The document image to analyze.
+
+    Returns:
+        str: The answer to the question based on the document's context.
+
+    Example
+    -------
+    >>> document_qa(image, question)
+    'The answer to the question ...'
+    ",document_qa
 'video_temporal_localization' will run qwen2vl on each chunk_length_frames value selected for the video. It can detect multiple objects independently per chunk_length_frames given a text prompt such as a referring expression but does not track objects across frames. It returns a list of floats with a value of 1.0 if the objects are found in a given chunk_length_frames of the video.,"video_temporal_localization(prompt: str, frames: List[numpy.ndarray], model: str = 'qwen2vl', chunk_length_frames: Optional[int] = 2) -> List[float]:
     'video_temporal_localization' will run qwen2vl on each chunk_length_frames
     value selected for the video. It can detect multiple objects independently per
vision_agent-0.2.220/vision_agent/.sim_tools/embs.npy
Binary file
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_coder.py
RENAMED
@@ -644,12 +644,9 @@ class OllamaVisionAgentCoder(VisionAgentCoder):
     """VisionAgentCoder that uses Ollama models for planning, coding, testing.
 
     Pre-requisites:
-    1. Run ollama pull llama3.
+    1. Run ollama pull llama3.2-vision for the LMM
     2. Run ollama pull mxbai-embed-large for the embedding similarity model
 
-    Technically you should use a VLM such as llava but llava is not able to handle the
-    context length and crashes.
-
     Example
     -------
     >>> image vision_agent as va
@@ -674,17 +671,17 @@ class OllamaVisionAgentCoder(VisionAgentCoder):
                 else planner
             ),
             coder=(
-                OllamaLMM(model_name="llama3.
+                OllamaLMM(model_name="llama3.2-vision", temperature=0.0)
                 if coder is None
                 else coder
             ),
             tester=(
-                OllamaLMM(model_name="llama3.
+                OllamaLMM(model_name="llama3.2-vision", temperature=0.0)
                 if tester is None
                 else tester
             ),
             debugger=(
-                OllamaLMM(model_name="llama3.
+                OllamaLMM(model_name="llama3.2-vision", temperature=0.0)
                 if debugger is None
                 else debugger
             ),
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_coder_v2.py
RENAMED
@@ -5,7 +5,7 @@ from typing import Any, Callable, Dict, List, Optional, Sequence, Union, cast
 from rich.console import Console
 from rich.markup import escape
 
-import vision_agent.tools as T
+import vision_agent.tools.tools as T
 from vision_agent.agent import AgentCoder, AgentPlanner
 from vision_agent.agent.agent_utils import (
     DefaultImports,
@@ -34,7 +34,7 @@ from vision_agent.utils.execute import (
     CodeInterpreterFactory,
     Execution,
 )
-from vision_agent.utils.sim import Sim
+from vision_agent.utils.sim import Sim, get_tool_recommender
 
 _CONSOLE = Console()
 
@@ -316,7 +316,7 @@ class VisionAgentCoderV2(AgentCoder):
         elif isinstance(tool_recommender, Sim):
             self.tool_recommender = tool_recommender
         else:
-            self.tool_recommender =
+            self.tool_recommender = get_tool_recommender()
 
         self.verbose = verbose
         self.code_sandbox_runtime = code_sandbox_runtime
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_planner.py
RENAMED
@@ -532,7 +532,7 @@ class OllamaVisionAgentPlanner(VisionAgentPlanner):
     ) -> None:
         super().__init__(
             planner=(
-                OllamaLMM(model_name="llama3.
+                OllamaLMM(model_name="llama3.2-vision", temperature=0.0)
                 if planner is None
                 else planner
             ),
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_planner_prompts.py
RENAMED
@@ -62,10 +62,10 @@ plan2:
 - Count the number of detected objects labeled as 'person'.
 plan3:
 - Load the image from the provided file path 'image.jpg'.
-- Use the '
+- Use the 'countgd_object_detection' tool to count the dominant foreground object, which in this case is people.
 
 ```python
-from vision_agent.tools import load_image, owl_v2_image, florence2_sam2_image,
+from vision_agent.tools import load_image, owl_v2_image, florence2_sam2_image, countgd_object_detection
 image = load_image("image.jpg")
 owl_v2_out = owl_v2_image("person", image)
 
@@ -73,9 +73,9 @@ f2s2_out = florence2_sam2_image("person", image)
 # strip out the masks from the output becuase they don't provide useful information when printed
 f2s2_out = [{{k: v for k, v in o.items() if k != "mask"}} for o in f2s2_out]
 
-cgd_out =
+cgd_out = countgd_object_detection("person", image)
 
-final_out = {{"owl_v2_image": owl_v2_out, "florence2_sam2_image": f2s2, "
+final_out = {{"owl_v2_image": owl_v2_out, "florence2_sam2_image": f2s2, "countgd_object_detection": cgd_out}}
 print(final_out)
 --- END EXAMPLE1 ---
 
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/agent/vision_agent_planner_prompts_v2.py
RENAMED
@@ -440,16 +440,17 @@ PICK_PLAN = """
 """
 
 CATEGORIZE_TOOL_REQUEST = """
-You are given a task: {task} from the user.
+You are given a task: "{task}" from the user. You must extract the type of category this task belongs to, it can be one or more of the following:
 - "object detection and counting" - detecting objects or counting objects from a text prompt in an image or video.
 - "classification" - classifying objects in an image given a text prompt.
 - "segmentation" - segmenting objects in an image or video given a text prompt.
 - "OCR" - extracting text from an image.
 - "VQA" - answering questions about an image or video, can also be used for text extraction.
+- "DocQA" - answering questions about a document or extracting information from a document.
 - "video object tracking" - tracking objects in a video.
 - "depth and pose estimation" - estimating the depth or pose of objects in an image.
 
-Return the category or categories (comma separated) inside tags <category># your categories here</category>.
+Return the category or categories (comma separated) inside tags <category># your categories here</category>. If you are unsure about a task, it is better to include more categories than less.
 """
 
 TEST_TOOLS = """
@@ -473,7 +474,7 @@ TEST_TOOLS = """
 {examples}
 
 **Instructions**:
-1. List all the tools under **Tools** and the user request. Write a program to load the media and call
+1. List all the tools under **Tools** and the user request. Write a program to load the media and call the most relevant tools in parallel and print it's output along with other relevant information.
 2. Create a dictionary where the keys are the tool name and the values are the tool outputs. Remove numpy arrays from the printed dictionary.
 3. Your test case MUST run only on the given images which are {media}
 4. Print this final dictionary.
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/tools/__init__.py
RENAMED
@@ -43,7 +43,6 @@ from .tools import (
     flux_image_inpainting,
     generate_pose_image,
     get_tool_documentation,
-    get_tool_recommender,
     gpt4o_image_vqa,
     gpt4o_video_vqa,
     load_image,
@@ -63,6 +62,7 @@ from .tools import (
     save_json,
     save_video,
     siglip_classification,
+    stella_embeddings,
     template_match,
     video_temporal_localization,
     vit_image_classification,
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/tools/planner_tools.py
RENAMED
@@ -32,6 +32,7 @@ from vision_agent.utils.execute import (
     MimeType,
 )
 from vision_agent.utils.image_utils import convert_to_b64
+from vision_agent.utils.sim import get_tool_recommender
 
 TOOL_FUNCTIONS = {tool.__name__: tool for tool in T.TOOLS}
 
@@ -116,13 +117,11 @@ def run_tool_testing(
     query = lmm.generate(CATEGORIZE_TOOL_REQUEST.format(task=task))
     category = extract_tag(query, "category")  # type: ignore
     if category is None:
-
+        query = task
     else:
-
-        f"I need models from the {category.strip()} category of tools. {task}"
-        )
+        query = f"{category.strip()}. {task}"
 
-    tool_docs =
+    tool_docs = get_tool_recommender().top_k(query, k=5, thresh=0.3)
     if exclude_tools is not None and len(exclude_tools) > 0:
         cleaned_tool_docs = []
         for tool_doc in tool_docs:
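To make the retrieval change above easier to follow, here is a hedged sketch of the categorize-then-retrieve flow: an LMM labels the task with a category, the category prefixes the task, and the combined query goes to the tool recommender's `top_k`. The import paths for `AnthropicLMM` and `CATEGORIZE_TOOL_REQUEST`, the regex stand-in for the library's `extract_tag` helper, and the sample task are assumptions; only the shapes of the calls come from the hunk above.

```python
# Hedged sketch (not the library's exact code) of the new query-building flow.
import re

from vision_agent.agent.vision_agent_planner_prompts_v2 import CATEGORIZE_TOOL_REQUEST
from vision_agent.lmm import AnthropicLMM  # assumed import path
from vision_agent.utils.sim import get_tool_recommender

task = "Count the number of people in people.jpg"  # hypothetical user task

lmm = AnthropicLMM()
raw = lmm.generate(CATEGORIZE_TOOL_REQUEST.format(task=task))

# Stand-in for the library's extract_tag(raw, "category") helper.
match = re.search(r"<category>(.*?)</category>", raw, re.DOTALL)
category = match.group(1) if match else None

# Fall back to the raw task when no category was extracted, otherwise prefix it.
query = task if category is None else f"{category.strip()}. {task}"

# k=5 and thresh=0.3 mirror the values used in the hunk above.
tool_docs = get_tool_recommender().top_k(query, k=5, thresh=0.3)
print(query)
print(tool_docs)
```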
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/tools/tools.py
RENAMED
@@ -7,7 +7,6 @@ import urllib.request
 from base64 import b64encode
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from enum import Enum
-from functools import lru_cache
 from importlib import resources
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple, Union, cast
@@ -49,7 +48,6 @@ from vision_agent.utils.image_utils import (
     rle_decode,
     rle_decode_array,
 )
-from vision_agent.utils.sim import Sim, load_cached_sim
 from vision_agent.utils.video import (
     extract_frames_from_video,
     frames_to_bytes,
@@ -85,11 +83,6 @@ _OCR_URL = "https://app.landing.ai/ocr/v1/detect-text"
 _LOGGER = logging.getLogger(__name__)
 
 
-@lru_cache(maxsize=1)
-def get_tool_recommender() -> Sim:
-    return load_cached_sim(TOOLS_DF)
-
-
 def _display_tool_trace(
     function_name: str,
     request: Dict[str, Any],
@@ -2178,13 +2171,14 @@ def document_qa(
     prompt: str,
     image: np.ndarray,
 ) -> str:
-    """'document_qa' is a tool that can answer any questions about arbitrary
-
-
+    """'document_qa' is a tool that can answer any questions about arbitrary documents,
+    presentations, or tables. It's very useful for document QA tasks, you can ask it a
+    specific question or ask it to return a JSON object answering multiple questions
+    about the document.
 
     Parameters:
-        prompt (str): The question to be answered about the document image
-        image (np.ndarray): The document image to analyze
+        prompt (str): The question to be answered about the document image.
+        image (np.ndarray): The document image to analyze.
 
     Returns:
         str: The answer to the question based on the document's context.
@@ -2203,7 +2197,7 @@ def document_qa(
         "model": "document-analysis",
     }
 
-    data:
+    data: Dict[str, Any] = send_inference_request(
         payload=payload,
         endpoint_name="document-analysis",
         files=files,
@@ -2225,10 +2219,10 @@ def document_qa(
     data = normalize(data)
 
     prompt = f"""
-
-
-
-
+    Document Context:
+    {data}\n
+    Question: {prompt}\n
+    Answer the question directly using only the information from the document, do not answer with any additional text besides the answer. If the answer is not definitively contained in the document, say "I cannot find the answer in the provided document."
     """
 
     lmm = AnthropicLMM()
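For reference, a minimal hedged sketch of calling the reworked `document_qa` tool end to end. The sample file name and question are placeholders, and the call assumes access to the landing.ai inference endpoint plus an Anthropic key for the final answer step; only the `document_qa(prompt, image)` signature comes from the diff.

```python
# Minimal usage sketch for document_qa; "annual_report.png" is a placeholder path.
from vision_agent.tools.tools import document_qa, load_image

image = load_image("annual_report.png")  # hypothetical scanned document
answer = document_qa("What was the total revenue in 2024?", image)
print(answer)
```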
@@ -2245,6 +2239,22 @@ def document_qa(
     return llm_output
 
 
+def stella_embeddings(prompts: List[str]) -> List[np.ndarray]:
+    payload = {
+        "input": prompts,
+        "model": "stella1.5b",
+    }
+
+    data: Dict[str, Any] = send_inference_request(
+        payload=payload,
+        endpoint_name="embeddings",
+        v2=True,
+        metadata_payload={"function_name": "get_embeddings"},
+        is_form=True,
+    )
+    return [d["embedding"] for d in data]  # type: ignore
+
+
 # Utility and visualization functions
 
 
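Since `stella_embeddings` returns one embedding per prompt, a natural sanity check is to compare two prompts with cosine similarity, the same metric `Sim.top_k` uses. This is only an illustrative sketch: the sample prompts are placeholders, the call requires network access to the inference endpoint behind `send_inference_request`, and the vector dimensionality is not specified in this diff.

```python
# Illustrative sketch: compare two prompts with the new stella_embeddings tool.
import numpy as np
from scipy.spatial.distance import cosine

from vision_agent.tools import stella_embeddings

prompts = [
    "detect and count people in an image",
    "segment objects in a video given a text prompt",
]
embs = stella_embeddings(prompts)  # one embedding vector per prompt

similarity = 1 - cosine(np.asarray(embs[0]), np.asarray(embs[1]))
print(f"cosine similarity: {similarity:.3f}")
```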
@@ -2781,6 +2791,7 @@ FUNCTION_TOOLS = [
     qwen2_vl_images_vqa,
     qwen2_vl_video_vqa,
     document_extraction,
+    document_qa,
     video_temporal_localization,
     flux_image_inpainting,
     siglip_classification,
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/utils/execute.py
RENAMED
@@ -28,10 +28,10 @@ from nbclient import __version__ as nbclient_version
 from nbclient.exceptions import CellTimeoutError, DeadKernelError
 from nbclient.util import run_sync
 from nbformat.v4 import new_code_cell
+from opentelemetry.context import get_current
+from opentelemetry.trace import SpanKind, Status, StatusCode, get_tracer
 from pydantic import BaseModel, field_serializer
 from typing_extensions import Self
-from opentelemetry.trace import get_tracer, Status, StatusCode, SpanKind
-from opentelemetry.context import get_current
 
 from vision_agent.utils.exceptions import (
     RemoteSandboxCreationError,
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/utils/image_utils.py
RENAMED
@@ -11,7 +11,7 @@ import numpy as np
 from PIL import Image, ImageDraw, ImageFont
 from PIL.Image import Image as ImageType
 
-from vision_agent.utils import extract_frames_from_video
+from vision_agent.utils.video import extract_frames_from_video
 
 COLORS = [
     (158, 218, 229),
{vision_agent-0.2.218 → vision_agent-0.2.220}/vision_agent/utils/sim.py
RENAMED
@@ -12,6 +12,13 @@ import requests
 from openai import AzureOpenAI, OpenAI
 from scipy.spatial.distance import cosine  # type: ignore
 
+from vision_agent.tools.tools import TOOLS_DF, stella_embeddings
+
+
+@lru_cache(maxsize=1)
+def get_tool_recommender() -> "Sim":
+    return load_cached_sim(TOOLS_DF)
+
 
 @lru_cache(maxsize=512)
 def get_embedding(
@@ -27,13 +34,13 @@ def load_cached_sim(
     cached_dir_full_path = str(resources.files("vision_agent") / cached_dir)
     if os.path.exists(cached_dir_full_path):
         if tools_df is not None:
-            if
+            if StellaSim.check_load(cached_dir_full_path, tools_df):
                 # don't pass sim_key to loaded Sim object or else it will re-calculate embeddings
-                return
+                return StellaSim.load(cached_dir_full_path)
     if os.path.exists(cached_dir_full_path):
         shutil.rmtree(cached_dir_full_path)
 
-    sim =
+    sim = StellaSim(tools_df, sim_key=sim_key)
     sim.save(cached_dir_full_path)
     return sim
 
@@ -58,6 +65,11 @@ class Sim:
         """
         self.df = df
         self.client = OpenAI(api_key=api_key)
+        self.emb_call = (
+            lambda x: self.client.embeddings.create(input=x, model=model)
+            .data[0]
+            .embedding
+        )
         self.model = model
         if "embs" not in df.columns and sim_key is None:
             raise ValueError("key is required if no column 'embs' is present.")
@@ -65,11 +77,7 @@ class Sim:
         if sim_key is not None:
             self.df["embs"] = self.df[sim_key].apply(
                 lambda x: get_embedding(
-
-                    input=text, model=self.model
-                )
-                .data[0]
-                .embedding,
+                    self.emb_call,
                     x,
                 )
             )
@@ -126,9 +134,7 @@ class Sim:
         """
 
         embedding = get_embedding(
-
-            .data[0]
-            .embedding,
+            self.emb_call,
             query,
         )
         self.df["sim"] = self.df.embs.apply(lambda x: 1 - cosine(x, embedding))
@@ -215,6 +221,40 @@ class OllamaSim(Sim):
         )
 
 
+class StellaSim(Sim):
+    def __init__(
+        self,
+        df: pd.DataFrame,
+        sim_key: Optional[str] = None,
+    ) -> None:
+        self.df = df
+
+        def emb_call(text: List[str]) -> List[float]:
+            return stella_embeddings(text)[0]  # type: ignore
+
+        self.emb_call = emb_call
+
+        if "embs" not in df.columns and sim_key is None:
+            raise ValueError("key is required if no column 'embs' is present.")
+
+        if sim_key is not None:
+            self.df["embs"] = self.df[sim_key].apply(
+                lambda x: get_embedding(emb_call, x)
+            )
+
+    @staticmethod
+    def load(
+        load_dir: Union[str, Path],
+        api_key: Optional[str] = None,
+        model: str = "stella1.5b",
+    ) -> "StellaSim":
+        load_dir = Path(load_dir)
+        df = pd.read_csv(load_dir / "df.csv")
+        embs = np.load(load_dir / "embs.npy")
+        df["embs"] = list(embs)
+        return StellaSim(df)
+
+
 def merge_sim(sim1: Sim, sim2: Sim) -> Sim:
     return Sim(pd.concat([sim1.df, sim2.df], ignore_index=True))
 
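Taken together, the sim.py changes mean the cached tool index is now a `StellaSim` and is obtained through the memoized `get_tool_recommender()`. A hedged usage sketch follows; the query text is arbitrary, `k=5`/`thresh=0.3` mirror the planner_tools hunk, and the exact shape of the rows `top_k` yields (presumably the desc/doc/name columns of df.csv) is not shown in this diff.

```python
# Hedged sketch: retrieve candidate tools from the cached Stella-backed index.
from vision_agent.utils.sim import get_tool_recommender

recommender = get_tool_recommender()  # lru_cache'd StellaSim built over TOOLS_DF
matches = recommender.top_k(
    "answer questions about a scanned PDF document", k=5, thresh=0.3
)

# df.csv carries desc/doc/name columns, so each match should describe one tool.
for match in matches:
    print(match)
```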
vision_agent-0.2.218/vision_agent/.sim_tools/embs.npy
Binary file