vision-agent 1.1.16__py3-none-any.whl → 1.1.17__py3-none-any.whl

This diff shows the changes between two publicly released versions of the package as they appear in their respective public registries. It is provided for informational purposes only.
--- a/vision_agent/.sim_tools/df.csv
+++ b/vision_agent/.sim_tools/df.csv
@@ -388,8 +388,8 @@ desc,doc,name
     -------
     >>> document_qa(image, question)
     'The answer to the question ...'",document_qa
-"'ocr' extracts text from an image. It returns a list of detected text, bounding boxes with normalized coordinates, and confidence scores. The results are sorted from top-left to bottom right.","ocr(image: numpy.ndarray) -> List[Dict[str, Any]]:
-'ocr' extracts text from an image. It returns a list of detected text, bounding
+"'paddle_ocr' extracts text from an image. It returns a list of detected text, bounding boxes with normalized coordinates, and confidence scores. The results are sorted from top-left to bottom right.","paddle_ocr(image: numpy.ndarray) -> List[Dict[str, Any]]:
+'paddle_ocr' extracts text from an image. It returns a list of detected text, bounding
 boxes with normalized coordinates, and confidence scores. The results are sorted
 from top-left to bottom right.
 
@@ -402,10 +402,10 @@ desc,doc,name
 
 Example
 -------
->>> ocr(image)
+>>> paddle_ocr(image)
 [
     {'label': 'hello world', 'bbox': [0.1, 0.11, 0.35, 0.4], 'score': 0.99},
-]",ocr
+]",paddle_ocr
 "'gemini_image_generation' performs either image inpainting given an image and text prompt, or image generation given a prompt. It can be used to edit parts of an image or the entire image according to the prompt given.","gemini_image_generation(prompt: str, image: Optional[numpy.ndarray] = None) -> numpy.ndarray:
 'gemini_image_generation' performs either image inpainting given an image and text prompt, or image generation given a prompt.
 It can be used to edit parts of an image or the entire image according to the prompt given.
@@ -484,26 +484,26 @@ desc,doc,name
     {'start_time': 2, 'end_time': 4, 'location': 'Outdoor area', 'description': 'A person approaches a white bicycle parked in a row. The person then swings their leg over the bike and gets on it.', 'label': 0},
     {'start_time': 10, 'end_time': 13, 'location': 'Outdoor area', 'description': 'A person gets off a white bicycle parked in a row. The person swings their leg over the bike and dismounts.', 'label': 1},
 ]",agentic_activity_recognition
-'depth_anything_v2' is a tool that runs depth anything v2 model to generate a depth image from a given RGB image. The returned depth image is monochrome and represents depth values as pixel intensities with pixel values ranging from 0 to 255.,"depth_anything_v2(image: numpy.ndarray) -> numpy.ndarray:
-'depth_anything_v2' is a tool that runs depth anything v2 model to generate a
-depth image from a given RGB image. The returned depth image is monochrome and
-represents depth values as pixel intensities with pixel values ranging from 0 to 255.
+"'depth_pro' is a tool that runs the Apple DepthPro model to generate a depth map from a given RGB image. The returned depth map has the same dimensions as the input image, with each pixel indicating the distance from the camera in meters.","depth_pro(image: numpy.ndarray) -> numpy.ndarray:
+'depth_pro' is a tool that runs the Apple DepthPro model to generate a
+depth map from a given RGB image. The returned depth map has the same dimensions
+as the input image, with each pixel indicating the distance from the camera in meters.
 
 Parameters:
     image (np.ndarray): The image to used to generate depth image
 
 Returns:
-    np.ndarray: A grayscale depth image with pixel values ranging from 0 to 255
-    where high values represent closer objects and low values further.
+    np.ndarray: A depth map with float32 pixel values that represent
+    the distance from the camera in meters.
 
 Example
 -------
->>> depth_anything_v2(image)
+>>> depth_pro(image)
 array([[0, 0, 0, ..., 0, 0, 0],
     [0, 20, 24, ..., 0, 100, 103],
     ...,
     [10, 11, 15, ..., 202, 202, 205],
-    [10, 10, 10, ..., 200, 200, 200]], dtype=uint8),",depth_anything_v2
+    [10, 10, 10, ..., 200, 200, 200]], dtype=np.float32),",depth_pro
 'generate_pose_image' is a tool that generates a open pose bone/stick image from a given RGB image. The returned bone image is RGB with the pose amd keypoints colored and background as black.,"generate_pose_image(image: numpy.ndarray) -> numpy.ndarray:
 'generate_pose_image' is a tool that generates a open pose bone/stick image from
 a given RGB image. The returned bone image is RGB with the pose amd keypoints colored
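
These hunks update vision_agent/.sim_tools/df.csv, the tool-description table (columns desc, doc, name) that backs tool similarity search; the binary embs.npy below is presumably the matching embedding file. A minimal sketch for checking the renamed rows after upgrading, assuming the table is readable in place from the installed wheel (the inspection code is illustrative, not part of the package):

    import pandas as pd
    from importlib import resources

    # Open the tool table shipped inside the installed package.
    with resources.files("vision_agent").joinpath(".sim_tools/df.csv").open() as f:
        df = pd.read_csv(f)

    # After the upgrade the renamed tools appear under their new names.
    print(df.loc[df["name"].isin(["paddle_ocr", "depth_pro"]), ["name", "desc"]])
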
Binary file vision_agent/.sim_tools/embs.npy differs
--- a/vision_agent/tools/__init__.py
+++ b/vision_agent/tools/__init__.py
@@ -21,7 +21,7 @@ from .tools import (
     countgd_sam2_visual_instance_segmentation,
     countgd_visual_object_detection,
     custom_object_detection,
-    depth_anything_v2,
+    depth_pro,
     detr_segmentation,
     document_extraction,
     document_qa,
@@ -42,7 +42,7 @@ from .tools import (
     glee_sam2_video_tracking,
     load_image,
     minimum_distance,
-    ocr,
+    paddle_ocr,
     od_sam2_video_tracking,
     overlay_bounding_boxes,
     overlay_heat_map,
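
For downstream code, these renames are the user-visible change in 1.1.17: `ocr` becomes `paddle_ocr` and `depth_anything_v2` becomes `depth_pro`, with no compatibility aliases left in the import list. A minimal migration sketch using only names from the import list above (the file name is illustrative):

    from vision_agent.tools import load_image, paddle_ocr, depth_pro

    image = load_image("document.jpg")  # hypothetical input file

    text_regions = paddle_ocr(image)   # 1.1.16: ocr(image)
    depth_map = depth_pro(image)       # 1.1.16: depth_anything_v2(image)
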
--- a/vision_agent/tools/tools.py
+++ b/vision_agent/tools/tools.py
@@ -4,7 +4,7 @@ import logging
 import os
 import tempfile
 import urllib.request
-from base64 import b64encode
+from base64 import b64encode, b64decode
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from importlib import resources
 from pathlib import Path
@@ -15,7 +15,6 @@ import time
 import cv2
 import numpy as np
 import pandas as pd
-import requests
 from IPython.display import display
 from PIL import Image, ImageDraw, ImageFont
 from pillow_heif import register_heif_opener  # type: ignore
@@ -2034,8 +2033,8 @@ def qwen2_vl_video_vqa(prompt: str, frames: List[np.ndarray]) -> str:
     return cast(str, data)
 
 
-def ocr(image: np.ndarray) -> List[Dict[str, Any]]:
-    """'ocr' extracts text from an image. It returns a list of detected text, bounding
+def paddle_ocr(image: np.ndarray) -> List[Dict[str, Any]]:
+    """'paddle_ocr' extracts text from an image. It returns a list of detected text, bounding
     boxes with normalized coordinates, and confidence scores. The results are sorted
     from top-left to bottom right.
 
@@ -2048,51 +2047,33 @@ def ocr(image: np.ndarray) -> List[Dict[str, Any]]:
 
     Example
     -------
-    >>> ocr(image)
+    >>> paddle_ocr(image)
     [
         {'label': 'hello world', 'bbox': [0.1, 0.11, 0.35, 0.4], 'score': 0.99},
     ]
     """
 
-    pil_image = Image.fromarray(image).convert("RGB")
-    image_size = pil_image.size[::-1]
+    image_size = image.shape[:2]
     if image_size[0] < 1 or image_size[1] < 1:
         return []
-    image_buffer = io.BytesIO()
-    pil_image.save(image_buffer, format="PNG")
-    buffer_bytes = image_buffer.getvalue()
-    image_buffer.close()
-
-    res = requests.post(
-        _OCR_URL,
-        files={"images": buffer_bytes},
-        data={"language": "en"},
-        headers={"contentType": "multipart/form-data", "apikey": _API_KEY},
-    )
-
-    if res.status_code != 200:
-        raise ValueError(f"OCR request failed with status code {res.status_code}")
-
-    data = res.json()
-    output = []
-    for det in data[0]:
-        label = det["text"]
-        box = [
-            det["location"][0]["x"],
-            det["location"][0]["y"],
-            det["location"][2]["x"],
-            det["location"][2]["y"],
-        ]
-        box = normalize_bbox(box, image_size)
-        output.append({"label": label, "bbox": box, "score": round(det["score"], 2)})
+    buffer_bytes = numpy_to_bytes(image)
+    files = [("image", buffer_bytes)]
+
+    res = send_inference_request(
+        payload={"function_name": "paddle-ocr"},
+        endpoint_name="paddle-ocr",
+        files=files,
+        v2=True,
+    )
 
     _display_tool_trace(
-        ocr.__name__,
+        paddle_ocr.__name__,
         {},
-        data,
-        cast(List[Tuple[str, bytes]], [("image", buffer_bytes)]),
+        res,
+        files,
     )
-    return sorted(output, key=lambda x: (x["bbox"][1], x["bbox"][0]))
+
+    return sorted(res, key=lambda x: (x["bbox"][1], x["bbox"][0]))
 
 
 def claude35_text_extraction(image: np.ndarray) -> str:
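
The rewritten function now routes through `send_inference_request` against a `paddle-ocr` endpoint and returns the response sorted top-to-bottom, then left-to-right, but the documented result shape is unchanged. A sketch of consuming it (the 0.9 threshold is an arbitrary choice for illustration):

    results = paddle_ocr(image)
    # Each entry: {'label': <text>, 'bbox': [x1, y1, x2, y2] normalized, 'score': <float>},
    # already sorted top-left to bottom-right.
    lines = [r["label"] for r in results if r["score"] >= 0.9]
    print("\n".join(lines))
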
@@ -2370,7 +2351,12 @@ def agentic_activity_recognition(
     buffer_bytes = frames_to_bytes(frames, fps=fps)
     files = [("video", buffer_bytes)]
 
-    payload = {"prompt": prompt, "specificity": specificity, "with_audio": with_audio}
+    payload = {
+        "prompt": prompt,
+        "specificity": specificity,
+        "with_audio": with_audio,
+        "function_name": "agentic_activity_recognition",
+    }
 
     response = send_inference_request(
         payload=payload, endpoint_name="activity-recognition", files=files, v2=True
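
Callers of `agentic_activity_recognition` are unaffected; the new `function_name` field only travels in the request payload, matching the pattern used by the other rewritten tools. A usage sketch with the result shape taken from the df.csv example above (the prompt is illustrative, and the `frames` keyword is an assumption inferred from the function body shown here):

    events = agentic_activity_recognition(
        prompt="When does a person get on or off a bicycle?",
        frames=frames,  # List[np.ndarray] decoded from a video
    )
    for event in events:
        print(event["start_time"], event["end_time"], event["description"])
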
@@ -2529,48 +2515,53 @@ def detr_segmentation(image: np.ndarray) -> List[Dict[str, Any]]:
     return return_data
 
 
-def depth_anything_v2(image: np.ndarray) -> np.ndarray:
-    """'depth_anything_v2' is a tool that runs depth anything v2 model to generate a
-    depth image from a given RGB image. The returned depth image is monochrome and
-    represents depth values as pixel intensities with pixel values ranging from 0 to 255.
+def depth_pro(
+    image: np.ndarray,
+) -> np.ndarray:
+    """'depth_pro' is a tool that runs the Apple DepthPro model to generate a
+    depth map from a given RGB image. The returned depth map has the same dimensions
+    as the input image, with each pixel indicating the distance from the camera in meters.
 
     Parameters:
         image (np.ndarray): The image to used to generate depth image
 
     Returns:
-        np.ndarray: A grayscale depth image with pixel values ranging from 0 to 255
-        where high values represent closer objects and low values further.
+        np.ndarray: A depth map with float32 pixel values that represent
+        the distance from the camera in meters.
 
     Example
     -------
-    >>> depth_anything_v2(image)
+    >>> depth_pro(image)
     array([[0, 0, 0, ..., 0, 0, 0],
         [0, 20, 24, ..., 0, 100, 103],
         ...,
         [10, 11, 15, ..., 202, 202, 205],
-        [10, 10, 10, ..., 200, 200, 200]], dtype=uint8),
+        [10, 10, 10, ..., 200, 200, 200]], dtype=np.float32),
     """
-    if image.shape[0] < 1 or image.shape[1] < 1:
-        raise ValueError(f"Image is empty, image shape: {image.shape}")
 
-    image_b64 = convert_to_b64(image)
-    data = {
-        "image": image_b64,
-        "function_name": "depth_anything_v2",
-    }
+    image_size = image.shape[:2]
+    if image_size[0] < 1 or image_size[1] < 1:
+        return np.empty(0)
+    buffer_bytes = numpy_to_bytes(image)
+    files = [("image", buffer_bytes)]
 
-    depth_map = send_inference_request(data, "depth-anything-v2", v2=True)
-    depth_map_np = np.array(depth_map["map"])
-    depth_map_np = (depth_map_np - depth_map_np.min()) / (
-        depth_map_np.max() - depth_map_np.min()
+    detections = send_inference_request(
+        payload={"function_name": "depth-pro"},
+        endpoint_name="depth-pro",
+        files=files,
+        v2=True,
     )
-    depth_map_np = (255 * depth_map_np).astype(np.uint8)
+
+    depth_bytes = b64decode(detections["depth"])
+    depth_map_np = np.frombuffer(depth_bytes, dtype=np.float32).reshape(image_size)
+
     _display_tool_trace(
-        depth_anything_v2.__name__,
+        depth_pro.__name__,
         {},
-        depth_map,
-        image_b64,
+        response=detections,
+        files=files,
     )
+
     return depth_map_np
 
 
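Note the semantic change here, not just the rename: depth_anything_v2 returned a normalized uint8 map where high values meant closer objects, while depth_pro returns metric float32 depth where larger values mean farther. Code that displayed the old map can recover a comparable visualization from the new output; a sketch, assuming a non-degenerate depth range:

    import numpy as np

    depth_m = depth_pro(image)  # float32 meters, same HxW as the input image
    span = float(depth_m.max() - depth_m.min())
    if span > 0:
        # Invert so near pixels are bright, matching the old uint8 convention.
        vis = (255 * (1.0 - (depth_m - depth_m.min()) / span)).astype(np.uint8)
    else:
        vis = np.zeros(depth_m.shape, dtype=np.uint8)
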
@@ -3564,12 +3555,12 @@ FUNCTION_TOOLS = [
     claude35_text_extraction,
     agentic_document_extraction,
     document_qa,
-    ocr,
+    paddle_ocr,
     gemini_image_generation,
     qwen25_vl_images_vqa,
     qwen25_vl_video_vqa,
     agentic_activity_recognition,
-    depth_anything_v2,
+    depth_pro,
     generate_pose_image,
     vit_nsfw_classification,
     siglip_classification,
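
FUNCTION_TOOLS is the registry the agent selects tools from, so the renames propagate to tool selection automatically. A quick sketch to confirm an installed version exposes the new names (the import path follows this file, vision_agent/tools/tools.py, and assumes the list entries are plain functions, as the diff suggests):

    from vision_agent.tools.tools import FUNCTION_TOOLS

    names = sorted(t.__name__ for t in FUNCTION_TOOLS)
    assert "paddle_ocr" in names and "depth_pro" in names
    assert "ocr" not in names and "depth_anything_v2" not in names
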
--- vision_agent-1.1.16.dist-info/METADATA
+++ vision_agent-1.1.17.dist-info/METADATA
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: vision-agent
-Version: 1.1.16
+Version: 1.1.17
 Summary: Toolset for Vision Agent
 Project-URL: Homepage, https://landing.ai
 Project-URL: repository, https://github.com/landing-ai/vision-agent
--- vision_agent-1.1.16.dist-info/RECORD
+++ vision_agent-1.1.17.dist-info/RECORD
@@ -1,6 +1,6 @@
 vision_agent/__init__.py,sha256=EAb4-f9iyuEYkBrX4ag1syM8Syx8118_t0R6_C34M9w,57
-vision_agent/.sim_tools/df.csv,sha256=i732_U1KQf55UNhT-9srtZXF91XvDnfWBDdc8EqDmpw,41215
-vision_agent/.sim_tools/embs.npy,sha256=XCu3LnLS10IS3npfPMqX2VHIbDPq9iY_NPDBwq5AEj0,245888
+vision_agent/.sim_tools/df.csv,sha256=gheT5OXu68o0AfjV1623GzbD-T2csZ7GnkBbCMaVl8c,41188
+vision_agent/.sim_tools/embs.npy,sha256=OLj2rt4aBFze2HIf9bQ3yn0-_3RVPecrHWxm2CWvgn0,245888
 vision_agent/agent/README.md,sha256=3XSPG_VO7-6y6P8COvcgSSonWj5uvfgvfmOkBpfKK8Q,5527
 vision_agent/agent/__init__.py,sha256=_-nGLHhRTLViXxBSb9D4OwLTqk9HXKPEkTBkvK8c7OU,206
 vision_agent/agent/agent.py,sha256=o1Zuhl6h2R7uVwvUur0Aj38kak8U08plfeFWPst_ErM,1576
@@ -26,11 +26,11 @@ vision_agent/models/lmm_types.py,sha256=v04h-NjbczHOIN8UWa1vvO5-1BDuZ4JQhD2mge1c
 vision_agent/models/tools_types.py,sha256=8hYf2OZhI58gvf65KGaeGkt4EQ56nwLFqIQDPHioOBc,2339
 vision_agent/sim/__init__.py,sha256=Aouz6HEPPTYcLxR5_0fTYCL1OvPKAH1RMWAF90QXAlA,135
 vision_agent/sim/sim.py,sha256=WQY_x9A4VT647qGDBScJ3R8_Iv0aoYLHTgwcQSCXwv4,10059
-vision_agent/tools/__init__.py,sha256=zf8HzjcMSgxKhtrxbqYe9hmvsfuweeDMrOc8eVA8Ya8,2477
+vision_agent/tools/__init__.py,sha256=WfynKGn0Zl2GPkyFhzA2YhGGC0Dtb1oei4Hk_GdSY1c,2476
 vision_agent/tools/meta_tools.py,sha256=9iJilpGYEiXW0nYPTYAWHa7l23wGN8IM5KbE7mWDOT0,6798
 vision_agent/tools/planner_tools.py,sha256=iQWtTgXdomn0IWrbmvXXM-y8Q_RSEOxyP04HIRLrgWI,19576
 vision_agent/tools/prompts.py,sha256=V1z4YJLXZuUl_iZ5rY0M5hHc_2tmMEUKr0WocXKGt4E,1430
-vision_agent/tools/tools.py,sha256=i9GGGu8tvo2M6O5fF4UUBTpn_Ul2KEN9mG3ZlJ95qao,124929
+vision_agent/tools/tools.py,sha256=lndSG8xrIWcs6Rpe1-Jq44niUDXQnWlYfGP2B1YjpI0,124216
 vision_agent/utils/__init__.py,sha256=mANUs_84VL-3gpZbXryvV2mWU623eWnRlJCSUHtMjuw,122
 vision_agent/utils/agent.py,sha256=2ifTP5QElItnr4YHOJR6L5P1PUzV0GhChTTqVxuVyQg,15153
 vision_agent/utils/exceptions.py,sha256=zis8smCbdEylBVZBTVfEUfAh7Rb7cWV3MSPambu6FsQ,1837
@@ -40,7 +40,7 @@ vision_agent/utils/tools.py,sha256=Days0dETPRQLSDamMKPnXFsc5g5IKX9QJcPPNmSHNdM,8
 vision_agent/utils/tools_doc.py,sha256=PKcXXbJktiuPi9q6Q1zXzFx24Dh229SNgWBDtZ2fQSQ,2730
 vision_agent/utils/video.py,sha256=rjsQ1sKKisaQ6AVjJz0zd_G4g-ovRweS_rs4JEhenoI,5340
 vision_agent/utils/video_tracking.py,sha256=DZLFpNCuzuPJQzbQoVNcp-m4dKxgiKdCNM5QTh_zURE,12245
-vision_agent-1.1.16.dist-info/METADATA,sha256=JMmL6rIdT1-WO6XTrjNHucAp4S_UlkjDW1dxznQJ994,12078
-vision_agent-1.1.16.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
-vision_agent-1.1.16.dist-info/licenses/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
-vision_agent-1.1.16.dist-info/RECORD,,
+vision_agent-1.1.17.dist-info/METADATA,sha256=LDH3i8vb2g6aqoEuRSPHdigP1bmhBjxZTQ37-cD9RlA,12078
+vision_agent-1.1.17.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+vision_agent-1.1.17.dist-info/licenses/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
+vision_agent-1.1.17.dist-info/RECORD,,