vision-agent 1.1.15__py3-none-any.whl → 1.1.17__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- vision_agent/.sim_tools/df.csv +12 -12
- vision_agent/.sim_tools/embs.npy +0 -0
- vision_agent/tools/__init__.py +2 -2
- vision_agent/tools/tools.py +55 -64
- {vision_agent-1.1.15.dist-info → vision_agent-1.1.17.dist-info}/METADATA +3 -8
- {vision_agent-1.1.15.dist-info → vision_agent-1.1.17.dist-info}/RECORD +8 -8
- {vision_agent-1.1.15.dist-info → vision_agent-1.1.17.dist-info}/WHEEL +0 -0
- {vision_agent-1.1.15.dist-info → vision_agent-1.1.17.dist-info}/licenses/LICENSE +0 -0
vision_agent/.sim_tools/df.csv
CHANGED
@@ -388,8 +388,8 @@ desc,doc,name
 -------
 >>> document_qa(image, question)
 'The answer to the question ...'",document_qa
-"'
-'
+"'paddle_ocr' extracts text from an image. It returns a list of detected text, bounding boxes with normalized coordinates, and confidence scores. The results are sorted from top-left to bottom right.","paddle_ocr(image: numpy.ndarray) -> List[Dict[str, Any]]:
+'paddle_ocr' extracts text from an image. It returns a list of detected text, bounding
 boxes with normalized coordinates, and confidence scores. The results are sorted
 from top-left to bottom right.

@@ -402,10 +402,10 @@ desc,doc,name

 Example
 -------
->>>
+>>> paddle_ocr(image)
 [
 {'label': 'hello world', 'bbox': [0.1, 0.11, 0.35, 0.4], 'score': 0.99},
-]",
+]",paddle_ocr
 "'gemini_image_generation' performs either image inpainting given an image and text prompt, or image generation given a prompt. It can be used to edit parts of an image or the entire image according to the prompt given.","gemini_image_generation(prompt: str, image: Optional[numpy.ndarray] = None) -> numpy.ndarray:
 'gemini_image_generation' performs either image inpainting given an image and text prompt, or image generation given a prompt.
 It can be used to edit parts of an image or the entire image according to the prompt given.

@@ -484,26 +484,26 @@ desc,doc,name
 {'start_time': 2, 'end_time': 4, 'location': 'Outdoor area', 'description': 'A person approaches a white bicycle parked in a row. The person then swings their leg over the bike and gets on it.', 'label': 0},
 {'start_time': 10, 'end_time': 13, 'location': 'Outdoor area', 'description': 'A person gets off a white bicycle parked in a row. The person swings their leg over the bike and dismounts.', 'label': 1},
 ]",agentic_activity_recognition
-'
-'
-depth
-
+"'depth_pro' is a tool that runs the Apple DepthPro model to generate a depth map from a given RGB image. The returned depth map has the same dimensions as the input image, with each pixel indicating the distance from the camera in meters.","depth_pro(image: numpy.ndarray) -> numpy.ndarray:
+'depth_pro' is a tool that runs the Apple DepthPro model to generate a
+depth map from a given RGB image. The returned depth map has the same dimensions
+as the input image, with each pixel indicating the distance from the camera in meters.

 Parameters:
 image (np.ndarray): The image to used to generate depth image

 Returns:
-np.ndarray: A
-
+np.ndarray: A depth map with float32 pixel values that represent
+the distance from the camera in meters.

 Example
 -------
->>>
+>>> depth_pro(image)
 array([[0, 0, 0, ..., 0, 0, 0],
 [0, 20, 24, ..., 0, 100, 103],
 ...,
 [10, 11, 15, ..., 202, 202, 205],
-[10, 10, 10, ..., 200, 200, 200]], dtype=
+[10, 10, 10, ..., 200, 200, 200]], dtype=np.float32),",depth_pro
 'generate_pose_image' is a tool that generates a open pose bone/stick image from a given RGB image. The returned bone image is RGB with the pose amd keypoints colored and background as black.,"generate_pose_image(image: numpy.ndarray) -> numpy.ndarray:
 'generate_pose_image' is a tool that generates a open pose bone/stick image from
 a given RGB image. The returned bone image is RGB with the pose amd keypoints colored
vision_agent/.sim_tools/embs.npy
CHANGED
Binary file
vision_agent/tools/__init__.py
CHANGED
@@ -21,7 +21,7 @@ from .tools import (
 countgd_sam2_visual_instance_segmentation,
 countgd_visual_object_detection,
 custom_object_detection,
-
+depth_pro,
 detr_segmentation,
 document_extraction,
 document_qa,
@@ -42,7 +42,7 @@ from .tools import (
 glee_sam2_video_tracking,
 load_image,
 minimum_distance,
-
+paddle_ocr,
 od_sam2_video_tracking,
 overlay_bounding_boxes,
 overlay_heat_map,
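With these export changes, `paddle_ocr` and `depth_pro` become importable directly from `vision_agent.tools`. A minimal usage sketch under that assumption (the file path and the API-key environment variable are illustrative, not taken from this diff):

```python
# Minimal sketch of the tools newly exported in 1.1.17.
# Assumes a VisionAgent API key is configured (e.g. via an environment variable)
# and that "photo.jpg" is a placeholder local file.
from vision_agent.tools import load_image, paddle_ocr, depth_pro

image = load_image("photo.jpg")
text_detections = paddle_ocr(image)   # [{'label': ..., 'bbox': [...], 'score': ...}, ...]
depth_map = depth_pro(image)          # float32 depth map in meters, same H x W as the input
```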
vision_agent/tools/tools.py
CHANGED
@@ -4,7 +4,7 @@ import logging
 import os
 import tempfile
 import urllib.request
-from base64 import b64encode
+from base64 import b64encode, b64decode
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from importlib import resources
 from pathlib import Path
@@ -15,7 +15,6 @@ import time
 import cv2
 import numpy as np
 import pandas as pd
-import requests
 from IPython.display import display
 from PIL import Image, ImageDraw, ImageFont
 from pillow_heif import register_heif_opener # type: ignore
@@ -2034,8 +2033,8 @@ def qwen2_vl_video_vqa(prompt: str, frames: List[np.ndarray]) -> str:
 return cast(str, data)


-def
-"""'
+def paddle_ocr(image: np.ndarray) -> List[Dict[str, Any]]:
+"""'paddle_ocr' extracts text from an image. It returns a list of detected text, bounding
 boxes with normalized coordinates, and confidence scores. The results are sorted
 from top-left to bottom right.

@@ -2048,51 +2047,33 @@ def ocr(image: np.ndarray) -> List[Dict[str, Any]]:

 Example
 -------
->>>
+>>> paddle_ocr(image)
 [
 {'label': 'hello world', 'bbox': [0.1, 0.11, 0.35, 0.4], 'score': 0.99},
 ]
 """

-
-image_size = pil_image.size[::-1]
+image_size = image.shape[:2]
 if image_size[0] < 1 or image_size[1] < 1:
 return []
-
-
-
-
-
-
-
-
-
-headers={"contentType": "multipart/form-data", "apikey": _API_KEY},
-)
-
-if res.status_code != 200:
-raise ValueError(f"OCR request failed with status code {res.status_code}")
-
-data = res.json()
-output = []
-for det in data[0]:
-label = det["text"]
-box = [
-det["location"][0]["x"],
-det["location"][0]["y"],
-det["location"][2]["x"],
-det["location"][2]["y"],
-]
-box = normalize_bbox(box, image_size)
-output.append({"label": label, "bbox": box, "score": round(det["score"], 2)})
+buffer_bytes = numpy_to_bytes(image)
+files = [("image", buffer_bytes)]
+
+res = send_inference_request(
+payload={"function_name": "paddle-ocr"},
+endpoint_name="paddle-ocr",
+files=files,
+v2=True,
+)

 _display_tool_trace(
-
+paddle_ocr.__name__,
 {},
-
-
+res,
+files,
 )
-
+
+return sorted(res, key=lambda x: (x["bbox"][1], x["bbox"][0]))


 def claude35_text_extraction(image: np.ndarray) -> str:
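Based on the docstring and return shape above, a caller-side sketch of consuming `paddle_ocr` results might look like the following (the input path is a placeholder; the tool itself sends a `paddle-ocr` inference request and returns detections sorted top-left to bottom-right):

```python
import numpy as np
from vision_agent.tools import load_image, paddle_ocr

image: np.ndarray = load_image("receipt.jpg")  # placeholder input image
for det in paddle_ocr(image):
    # Each detection: text label, normalized [x_min, y_min, x_max, y_max] box, confidence score.
    print(f"{det['score']:.2f}  {det['bbox']}  {det['label']}")
```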
@@ -2370,7 +2351,12 @@ def agentic_activity_recognition(
 buffer_bytes = frames_to_bytes(frames, fps=fps)
 files = [("video", buffer_bytes)]

-payload = {
+payload = {
+"prompt": prompt,
+"specificity": specificity,
+"with_audio": with_audio,
+"function_name": "agentic_activity_recognition",
+}

 response = send_inference_request(
 payload=payload, endpoint_name="activity-recognition", files=files, v2=True
@@ -2529,48 +2515,53 @@ def detr_segmentation(image: np.ndarray) -> List[Dict[str, Any]]:
 return return_data


-def
-
-
-
+def depth_pro(
+image: np.ndarray,
+) -> np.ndarray:
+"""'depth_pro' is a tool that runs the Apple DepthPro model to generate a
+depth map from a given RGB image. The returned depth map has the same dimensions
+as the input image, with each pixel indicating the distance from the camera in meters.

 Parameters:
 image (np.ndarray): The image to used to generate depth image

 Returns:
-np.ndarray: A
-
+np.ndarray: A depth map with float32 pixel values that represent
+the distance from the camera in meters.

 Example
 -------
->>>
+>>> depth_pro(image)
 array([[0, 0, 0, ..., 0, 0, 0],
 [0, 20, 24, ..., 0, 100, 103],
 ...,
 [10, 11, 15, ..., 202, 202, 205],
-[10, 10, 10, ..., 200, 200, 200]], dtype=
+[10, 10, 10, ..., 200, 200, 200]], dtype=np.float32),
 """
-if image.shape[0] < 1 or image.shape[1] < 1:
-raise ValueError(f"Image is empty, image shape: {image.shape}")

-
-
-
-
-
+image_size = image.shape[:2]
+if image_size[0] < 1 or image_size[1] < 1:
+return np.empty(0)
+buffer_bytes = numpy_to_bytes(image)
+files = [("image", buffer_bytes)]

-
-
-
-
+detections = send_inference_request(
+payload={"function_name": "depth-pro"},
+endpoint_name="depth-pro",
+files=files,
+v2=True,
 )
-
+
+depth_bytes = b64decode(detections["depth"])
+depth_map_np = np.frombuffer(depth_bytes, dtype=np.float32).reshape(image_size)
+
 _display_tool_trace(
-
+depth_pro.__name__,
 {},
-
-
+response=detections,
+files=files,
 )
+
 return depth_map_np

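Since `depth_pro` now returns the decoded float32 depth map directly (base64-decoded and reshaped to the input's height and width), downstream code can treat it as an ordinary NumPy array. A small illustrative sketch; the 3-meter threshold and the file path are arbitrary examples, not part of the package:

```python
import numpy as np
from vision_agent.tools import load_image, depth_pro

image = load_image("scene.jpg")   # placeholder input image
depth = depth_pro(image)          # np.float32 array of per-pixel distances in meters

# Illustrative post-processing: flag pixels closer than 3 meters.
near_mask = depth < 3.0
print(f"median depth: {np.median(depth):.2f} m, near pixels: {near_mask.mean():.1%}")
```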
@@ -3564,12 +3555,12 @@ FUNCTION_TOOLS = [
 claude35_text_extraction,
 agentic_document_extraction,
 document_qa,
-
+paddle_ocr,
 gemini_image_generation,
 qwen25_vl_images_vqa,
 qwen25_vl_video_vqa,
 agentic_activity_recognition,
-
+depth_pro,
 generate_pose_image,
 vit_nsfw_classification,
 siglip_classification,
{vision_agent-1.1.15.dist-info → vision_agent-1.1.17.dist-info}/METADATA
CHANGED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: vision-agent
-Version: 1.1.15
+Version: 1.1.17
 Summary: Toolset for Vision Agent
 Project-URL: Homepage, https://landing.ai
 Project-URL: repository, https://github.com/landing-ai/vision-agent
@@ -56,10 +56,8 @@ _Prompt with an image/video → Get runnable vision code → Build Visual AI App
 </div>

 <p align="center">
-<a href="https://va.landing.ai/agent" target="_blank"><strong>Web App</strong></a> ·
 <a href="https://discord.com/invite/RVcW3j9RgR" target="_blank"><strong>Discord</strong></a> ·
 <a href="https://landing.ai/blog/visionagent-an-agentic-approach-for-complex-visual-reasoning" target="_blank"><strong>Architecture</strong></a> ·
-<a href="https://support.landing.ai/docs/visionagent" target="_blank"><strong>Docs</strong></a> ·
 <a href="https://www.youtube.com/playlist?list=PLrKGAzovU85fvo22OnVtPl90mxBygIf79" target="_blank"><strong>YouTube</strong></a>
 </p>

@@ -67,12 +65,11 @@ _Prompt with an image/video → Get runnable vision code → Build Visual AI App

 **VisionAgent** is the Visual AI pilot from LandingAI. Give it a prompt and an image, and it automatically picks the right vision models and outputs ready‑to‑run code—letting you build vision‑enabled apps in minutes.

-Prefer full control? Install the library and run VisionAgent locally. Just want to dive in quickly? Use the [VisionAgent web app](https://va.landing.ai/).

 ## Steps to Set Up the Library

 ### Get Your VisionAgent API Key
-The most important step is to [
+The most important step is to [create an account](https://va.landing.ai/home) and obtain your [API key](https://va.landing.ai/settings/api-key).

 ### Other Prerequisites
 - Python version 3.9 or higher
@@ -82,9 +79,8 @@ The most important step is to [signup](https://va.landing.ai/agent) and obtain y
 ### Why do I need Anthropic and Google API Keys?
 VisionAgent uses models from Anthropic and Google to respond to prompts and generate code.

-When you run the web-based version of VisionAgent, the app uses the LandingAI API keys to access these models.

-When you run VisionAgent
+When you run VisionAgent, the app will need to use your API keys to access the Anthropic and Google models. This ensures that any projects you run with VisionAgent aren’t limited by the rate limits in place with the LandingAI accounts, and it also prevents many users from overloading the LandingAI rate limits.

 Anthropic and Google each have their own rate limits and paid tiers. Refer to their documentation and pricing to learn more.

@@ -271,5 +267,4 @@ with this code:
 ## Resources
 - [Discord](https://discord.com/invite/RVcW3j9RgR): Check out our community of VisionAgent users to share use cases and learn about updates.
 - [VisionAgent Library Docs](https://landing-ai.github.io/vision-agent/): Learn how to use this library.
-- [VisionAgent Web App Docs](https://support.landing.ai/docs/agentic-ai): Learn how to use the web-based version of VisionAgent.
 - [Video Tutorials](https://www.youtube.com/playlist?list=PLrKGAzovU85fvo22OnVtPl90mxBygIf79): Watch the latest video tutorials to see how VisionAgent is used in a variety of use cases.
{vision_agent-1.1.15.dist-info → vision_agent-1.1.17.dist-info}/RECORD
CHANGED
@@ -1,6 +1,6 @@
 vision_agent/__init__.py,sha256=EAb4-f9iyuEYkBrX4ag1syM8Syx8118_t0R6_C34M9w,57
-vision_agent/.sim_tools/df.csv,sha256=
-vision_agent/.sim_tools/embs.npy,sha256=
+vision_agent/.sim_tools/df.csv,sha256=gheT5OXu68o0AfjV1623GzbD-T2csZ7GnkBbCMaVl8c,41188
+vision_agent/.sim_tools/embs.npy,sha256=OLj2rt4aBFze2HIf9bQ3yn0-_3RVPecrHWxm2CWvgn0,245888
 vision_agent/agent/README.md,sha256=3XSPG_VO7-6y6P8COvcgSSonWj5uvfgvfmOkBpfKK8Q,5527
 vision_agent/agent/__init__.py,sha256=_-nGLHhRTLViXxBSb9D4OwLTqk9HXKPEkTBkvK8c7OU,206
 vision_agent/agent/agent.py,sha256=o1Zuhl6h2R7uVwvUur0Aj38kak8U08plfeFWPst_ErM,1576
@@ -26,11 +26,11 @@ vision_agent/models/lmm_types.py,sha256=v04h-NjbczHOIN8UWa1vvO5-1BDuZ4JQhD2mge1c
 vision_agent/models/tools_types.py,sha256=8hYf2OZhI58gvf65KGaeGkt4EQ56nwLFqIQDPHioOBc,2339
 vision_agent/sim/__init__.py,sha256=Aouz6HEPPTYcLxR5_0fTYCL1OvPKAH1RMWAF90QXAlA,135
 vision_agent/sim/sim.py,sha256=WQY_x9A4VT647qGDBScJ3R8_Iv0aoYLHTgwcQSCXwv4,10059
-vision_agent/tools/__init__.py,sha256=
+vision_agent/tools/__init__.py,sha256=WfynKGn0Zl2GPkyFhzA2YhGGC0Dtb1oei4Hk_GdSY1c,2476
 vision_agent/tools/meta_tools.py,sha256=9iJilpGYEiXW0nYPTYAWHa7l23wGN8IM5KbE7mWDOT0,6798
 vision_agent/tools/planner_tools.py,sha256=iQWtTgXdomn0IWrbmvXXM-y8Q_RSEOxyP04HIRLrgWI,19576
 vision_agent/tools/prompts.py,sha256=V1z4YJLXZuUl_iZ5rY0M5hHc_2tmMEUKr0WocXKGt4E,1430
-vision_agent/tools/tools.py,sha256=
+vision_agent/tools/tools.py,sha256=lndSG8xrIWcs6Rpe1-Jq44niUDXQnWlYfGP2B1YjpI0,124216
 vision_agent/utils/__init__.py,sha256=mANUs_84VL-3gpZbXryvV2mWU623eWnRlJCSUHtMjuw,122
 vision_agent/utils/agent.py,sha256=2ifTP5QElItnr4YHOJR6L5P1PUzV0GhChTTqVxuVyQg,15153
 vision_agent/utils/exceptions.py,sha256=zis8smCbdEylBVZBTVfEUfAh7Rb7cWV3MSPambu6FsQ,1837
@@ -40,7 +40,7 @@ vision_agent/utils/tools.py,sha256=Days0dETPRQLSDamMKPnXFsc5g5IKX9QJcPPNmSHNdM,8
 vision_agent/utils/tools_doc.py,sha256=PKcXXbJktiuPi9q6Q1zXzFx24Dh229SNgWBDtZ2fQSQ,2730
 vision_agent/utils/video.py,sha256=rjsQ1sKKisaQ6AVjJz0zd_G4g-ovRweS_rs4JEhenoI,5340
 vision_agent/utils/video_tracking.py,sha256=DZLFpNCuzuPJQzbQoVNcp-m4dKxgiKdCNM5QTh_zURE,12245
-vision_agent-1.1.
-vision_agent-1.1.
-vision_agent-1.1.
-vision_agent-1.1.
+vision_agent-1.1.17.dist-info/METADATA,sha256=LDH3i8vb2g6aqoEuRSPHdigP1bmhBjxZTQ37-cD9RlA,12078
+vision_agent-1.1.17.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+vision_agent-1.1.17.dist-info/licenses/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
+vision_agent-1.1.17.dist-info/RECORD,,
{vision_agent-1.1.15.dist-info → vision_agent-1.1.17.dist-info}/WHEEL
File without changes
{vision_agent-1.1.15.dist-info → vision_agent-1.1.17.dist-info}/licenses/LICENSE
File without changes