minitap-mobile-use 2.2.0__py3-none-any.whl → 2.3.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


@@ -1,4 +1,5 @@
 from minitap.mobile_use.agents.executor.utils import is_last_tool_message_take_screenshot
+from minitap.mobile_use.context import MobileUseContext
 from minitap.mobile_use.controllers.mobile_command_controller import get_screen_data
 from minitap.mobile_use.controllers.platform_specific_commands_controller import (
     get_device_date,
@@ -7,7 +8,6 @@ from minitap.mobile_use.controllers.platform_specific_commands_controller import (
 from minitap.mobile_use.graph.state import State
 from minitap.mobile_use.utils.decorators import wrap_with_callbacks
 from minitap.mobile_use.utils.logger import get_logger
-from minitap.mobile_use.context import MobileUseContext

 logger = get_logger(__name__)

@@ -26,7 +26,9 @@ class ContextorNode:
         focused_app_info = get_focused_app_info(self.ctx)
         device_date = get_device_date(self.ctx)

-        should_add_screenshot_context = is_last_tool_message_take_screenshot(list(state.messages))
+        should_add_screenshot_context = is_last_tool_message_take_screenshot(
+            list(state.executor_messages)
+        )

         return state.sanitize_update(
             ctx=self.ctx,
@@ -4,19 +4,40 @@ Your job is to **analyze the current {{ platform }} mobile device state** and pr

 You must act like a human brain, responsible for giving instructions to your hands (the **Executor** agent). Therefore, you must act with the same imprecision and uncertainty as a human when performing swipe actions: humans don't know where exactly they are swiping (always prefer percentages of width and height instead of absolute coordinates), they just know they are swiping up or down, left or right, and with how much force (usually amplified compared to what's truly needed - go overboard on sliders, for instance).

-### Context You Receive:
+### Core Principle: Break Unproductive Cycles

-You are provided with:
+Your highest priority is to recognize when you are not making progress. You are in an unproductive cycle if a **sequence of actions brings you back to a previous state without achieving the subgoal.**

-- 📱 **Device state**:
+If you detect a cycle, you are **FORBIDDEN** from repeating it. You must pivot your strategy.

-  - Latest **UI hierarchy**
-  - (Optional) Latest **screenshot (base64)**. You can query one if you need it by calling the take_screenshot tool. Often, the UI hierarchy is enough to understand what is happening on the screen.
-  - Current **focused app info**
-  - **Screen size** and **device date**
+1. **Announce the Pivot:** In your `agent_thought`, you must briefly state which workflow is failing and what your new approach is.

-- 🧭 **Task context**:
+2. **Find a Simpler Path:** Abandon the current workflow. Ask yourself: **"How would a human do this if this feature didn't exist?"** This usually means relying on fundamental actions like scrolling, swiping, or navigating through menus manually.
+
+3. **Retreat as a Last Resort:** If no simpler path exists, declare the subgoal a failure to trigger a replan.
+
+### How to Perceive the Screen: A Two-Sense Approach
+
+To understand the device state, you have two senses, each with its purpose:
+
+1. **UI Hierarchy (Your sense of "Touch"):**
+   * **What it is:** A structured list of all elements on the screen.
+   * **Use it for:** Finding elements by `resource-id`, checking for specific text, and understanding the layout structure.
+   * **Limitation:** It does NOT tell you what the screen *looks* like. It can be incomplete, and it contains no information about images, colors, or whether an element is visually obscured.
+
+2. **`glimpse_screen` (Your sense of "Sight"):**
+   * **What it is:** A tool that provides a real, up-to-date image of the screen.
+   * **Use it for:** Confirming what is actually visible. This is your source of TRUTH for all visual information (icons, images, element positions, colors).
+   * **Golden Rule:** When the UI hierarchy is ambiguous, seems incomplete, or when you need to verify a visual detail before acting, **`glimpse_screen` is always the most effective and reliable action.** Never guess what the screen looks like; use your sight to be sure.
+
+### Context You Receive:
+
+- 📱 **Device state**:
+  - Latest **UI hierarchy** and (if available) a **screenshot**.
+  - **CRITICAL NOTE ON SIGHT:** The visual information from `glimpse_screen` is **ephemeral**. It is available for **THIS decision turn ONLY**. You MUST extract all necessary information from it IMMEDIATELY, as it will be cleared before the next step.

+- 🧭 **Task context**:
   - The user's **initial goal**
   - The **subgoal plan** with their statuses
   - The **current subgoal** (the one in `PENDING` in the plan)
@@ -27,25 +48,51 @@ You are provided with:

 Focus on the **current PENDING subgoal and the next subgoals not yet started**.

-1. **Analyze the UI** and environment to understand what action is required.
+**CRITICAL: Before making any decision, you MUST thoroughly analyze the agent thoughts history to:**
+- **Detect patterns of failure or repeated attempts** that suggest the current approach isn't working
+- **Identify contradictions** between what was planned and what actually happened
+- **Spot errors in previous reasoning** that need to be corrected
+- **Learn from successful strategies** used in similar situations
+- **Avoid repeating failed approaches** by recognizing when to change strategy
+
+1. **Analyze the agent thoughts first** - Review all previous agent thoughts to understand:
+   - What strategies have been tried and their outcomes
+   - Any errors or misconceptions in previous reasoning
+   - Patterns that indicate success or failure
+   - Whether the current approach should be continued or modified
+
+2. **Then analyze the UI** and environment to understand what action is required, but always in the context of what the agent thoughts reveal about the situation.

-2.1. If some of the subgoals must be **completed** based on your observations, add them to `complete_subgoals_by_ids`. To justify your conclusion, you will fill in the `agent_thought` field based on:
+3. If some of the subgoals must be **completed** based on your observations, add them to `complete_subgoals_by_ids`. To justify your conclusion, you will fill in the `agent_thought` field based on:

   - The current UI state
-  - Past agent thoughts
-  - Recent tool effects
+  - **Critical analysis of past agent thoughts and their accuracy**
+  - Recent tool effects and whether they matched expectations from agent thoughts
+  - **Any corrections needed to previous reasoning or strategy**
+
+
+### The Rule of Element Interaction
+
+**You MUST follow it for every element interaction.**

-2.2. Otherwise, output a **stringified structured set of instructions** that an **Executor agent** can perform on a real mobile device:
+When you target a UI element (for a `tap`, `input_text`, `clear_text`, etc.), you **MUST** provide a comprehensive target object containing every piece of information you can find about it.
+
+* **1. `resource_id`**: Include this if it is present in the UI hierarchy.
+* **2. `coordinates`**: Include the full bounds (`x`, `y`, `width`, `height`) if they are available.
+* **3. `text`**: Include the *current text* content of the element (e.g., "Sign In", "Search...", "First Name").
+
+**This is NOT optional.** Providing all three locators whenever they are available is the foundation of the system's reliability. It lets later steps fall back: if the ID fails, the coordinates are tried, and so on. Failing to provide this complete context will lead to action failures.
+
+### Outputting Your Decisions
+
+If you decide to act, output a **valid JSON stringified structured set of instructions** for the Executor.

 - These must be **concrete low-level actions**.
 - The executor has the following available tools: {{ executor_tools_list }}.
 - Your goal is to achieve subgoals **fast** - so you must pack as many actions as possible into your instructions to complete all achievable subgoals (based on your observations) in one go.
 - To open URLs/links directly, use the `open_link` tool - it will automatically handle opening in the appropriate browser. It also handles deep links.
 - When you need to open an app, use the `find_packages` low-level action to try and get its name. Then, simply use the `launch_app` low-level action to launch it.
-- If you refer to a UI element or coordinates, specify it clearly (e.g., `resource-id: com.whatsapp:id/search`, `text: "Alice"`, `x: 100, y: 200`).
-- **The structure is up to you**, but it must be valid **JSON stringified output**. You will accompany this output with a **natural-language summary** of your reasoning and approach in your agent thought.
-- **Never use a sequence of `tap` + `input_text` to type into a field. Always use a single `input_text` action** with the correct `resource_id` (this already ensures the element is focused and the cursor is moved to the end).
-- When you want to launch/stop an app, prefer using its package name.
+- **Always use a single `input_text` action** to type in a field. This tool handles focusing the element and placing the cursor correctly. If the tool feedback indicates verification is needed or shows None/empty content, perform verification before proceeding.
 - **Only reference UI element IDs or visible texts that are explicitly present in the provided UI hierarchy or screenshot. Do not invent, infer, or guess any IDs or texts that are not directly observed**.
 - **For text clearing**: When you need to completely clear text from an input field, always call the `clear_text` tool with the correct resource_id. This tool automatically focuses the element, and ensures the field is emptied. If you notice this tool fails to clear the text, try to long press the input, select all, and call `erase_one_char`.

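The three-locator rule in the hunk above can be illustrated with a small sketch. The `build_target` helper and its omit-when-absent behavior are assumptions for illustration; only the field names and the stringified-JSON envelope come from the prompt's own examples:

```python
import json


def build_target(resource_id=None, coordinates=None, text=None):
    """Assemble a comprehensive target object, keeping every locator we found.

    Hypothetical helper: the real schema is whatever the Cortex emits, but the
    rule is the same - include every locator that is actually observed.
    """
    target = {}
    if resource_id is not None:
        target["resource_id"] = resource_id
    if coordinates is not None:
        target["coordinates"] = coordinates  # full bounds: x, y, width, height
    if text is not None:
        target["text"] = text
    return target


# One decision, serialized to the stringified JSON the Executor expects.
decision = json.dumps({
    "action": "tap",
    "target": build_target(
        resource_id="com.whatsapp:id/menuitem_search",
        coordinates={"x": 880, "y": 150, "width": 120, "height": 120},
        text="Search",
    ),
})
```

Because the decision is a plain JSON string, any consumer can parse it back and try the locators in fallback order.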
@@ -57,10 +104,12 @@ Focus on the **current PENDING subgoal and the next subgoals not yet started**.
 - **Structured Decisions** _(optional)_:
   A **valid stringified JSON** describing what should be executed **right now** to advance through the subgoals as much as possible.

-- **Agent Thought** _(1-2 sentences)_:
-  If there is any information you need to remember for later steps, you must include it here, because only the agent thoughts will be used to produce the final structured output.
+- **Agent Thought** _(2-4 sentences)_:
+  **MANDATORY: Start by analyzing previous agent thoughts** - Did previous reasoning contain errors? Are we repeating failed approaches? What worked before in similar situations?
+
+  Then explain your current decision based on this analysis. If there is any information you need to remember for later steps, you must include it here, because only the agent thoughts will be used to produce the final structured output.

-  This also helps other agents understand your decision and learn from future failures.
+  This also helps other agents understand your decision and learn from future failures. **Explicitly mention if you're correcting a previous error or changing strategy based on agent thoughts analysis.**
   You must also use this field to mention checkpoints when you perform actions without definite ending: for instance "Swiping up to reveal more recipes - last seen recipe was <ID or NAME>, stop when no more".

 **Important:** `complete_subgoals_by_ids` and the structured decisions are mutually exclusive: if you provide both, the structured decisions will be ignored. Therefore, you must always prioritize completing subgoals over providing structured decisions.
@@ -76,12 +125,12 @@ Focus on the **current PENDING subgoal and the next subgoals not yet started**.
 #### Structured Decisions:

 ```text
-"{\"action\": \"tap\", \"target\": {\"resource_id\": \"com.whatsapp:id/menuitem_search\", \"text\": \"Search\"}}"
+"{\"action\": \"tap\", \"target\": {\"text_input_resource_id\": \"com.whatsapp:id/menuitem_search\", \"text_input_coordinates\": {\"x\": 880, \"y\": 150, \"width\": 120, \"height\": 120}, \"text_input_text\": \"Search\"}}"
 ```

 #### Agent Thought:

-> I will tap the search icon at the top of the WhatsApp interface to begin searching for Alice.
+> Analysis: No previous attempts, this is a fresh approach. I will tap the search icon to begin searching. I am providing its resource_id, coordinates, and text content to ensure the Executor can find it reliably, following the element rule.

 ### Input

@@ -94,9 +143,6 @@ Focus on the **current PENDING subgoal and the next subgoals not yet started**.
 **Current Subgoal (what needs to be done right now):**
 {{ current_subgoal }}

-**Agent thoughts (previous reasoning, observations about the environment):**
-{{ agents_thoughts }}
-
 **Executor agent feedback on latest UI decisions:**

 {{ executor_feedback }}
@@ -44,7 +44,6 @@ class CortexNode:
             initial_goal=state.initial_goal,
             subgoal_plan=state.subgoal_plan,
             current_subgoal=get_current_subgoal(state.subgoal_plan),
-            agents_thoughts=state.agents_thoughts,
             executor_feedback=executor_feedback,
             executor_tools_list=format_tools_list(ctx=self.ctx, wrappers=EXECUTOR_WRAPPERS_TOOLS),
         )
@@ -28,7 +28,6 @@ and your previous actions, you must:
 {
   "action": "tap",
   "target": {
-    "text": "Alice",
     "resource_id": "com.whatsapp:id/conversation_item"
   }
 }
@@ -39,7 +38,6 @@ and your previous actions, you must:
 Call the `tap_on_element` tool with:

 - `resource_id = "com.whatsapp:id/conversation_item"`
-- `text = "Alice"`
 - `agent_thought = "I'm tapping on the chat item labeled 'Alice' to open the conversation."`

 ---
@@ -55,10 +53,14 @@ Call the `tap_on_element` tool with:

 When using the `input_text` tool:

-- **Always provide the `resource_id` of the element** you want to type into.
+- **Provide all available information** from the following optional parameters to identify the text input element:
+  - `text_input_resource_id`: The resource ID of the text input element (when available)
+  - `text_input_coordinates`: The bounds (ElementBounds) of the text input element (when available)
+  - `text_input_text`: The current text content of the text input element (when available)
+
 - The tool will automatically:

-  1. **Focus the element first**
+  1. **Focus the element** using the provided identification parameters
   2. **Move the cursor to the end** of the existing text
   3. **Then type the new text**

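The optional `text_input_*` parameters support the fallback chain described earlier (ID first, then bounds, then text). The sketch below is hypothetical: `resolve_locator` is not part of the package, and the `ElementBounds` fields are assumed from the bounds described in the prompt; it only illustrates a plausible resolution order:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ElementBounds:
    # Assumed field layout, mirroring the bounds described in the prompt.
    x: int
    y: int
    width: int
    height: int


def resolve_locator(
    text_input_resource_id: Optional[str] = None,
    text_input_coordinates: Optional[ElementBounds] = None,
    text_input_text: Optional[str] = None,
) -> str:
    """Return which locator strategy would be tried first (illustrative only)."""
    if text_input_resource_id:
        return f"id:{text_input_resource_id}"
    if text_input_coordinates:
        # Fall back to tapping the center of the element's bounds.
        center_x = text_input_coordinates.x + text_input_coordinates.width // 2
        center_y = text_input_coordinates.y + text_input_coordinates.height // 2
        return f"tap:{center_x},{center_y}"
    if text_input_text:
        return f"text:{text_input_text}"
    return "none"
```

Passing all three parameters costs nothing when the first strategy succeeds, but makes the later fallbacks possible when it does not.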
@@ -1,4 +1,5 @@
 from langchain_core.messages import BaseMessage
+
 from minitap.mobile_use.utils.conversations import is_tool_message


@@ -7,5 +8,5 @@ def is_last_tool_message_take_screenshot(messages: list[BaseMessage]) -> bool:
         return False
     for msg in messages[::-1]:
         if is_tool_message(msg):
-            return msg.name == "take_screenshot"
+            return msg.name == "glimpse_screen"
     return False
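The renamed check above can be exercised standalone. This sketch substitutes a minimal `Msg` dataclass for langchain's `BaseMessage` and inlines the `is_tool_message` role check, so those names are stand-ins, not the package's API:

```python
from dataclasses import dataclass


@dataclass
class Msg:
    role: str
    name: str = ""


def is_last_tool_message_glimpse(messages: list) -> bool:
    """Mirror of the hunk above: only the most recent tool message matters."""
    if not messages:
        return False
    # Walk backwards so later non-tool messages (AI turns, etc.) are skipped.
    for msg in messages[::-1]:
        if msg.role == "tool":
            return msg.name == "glimpse_screen"
    return False
```

The early return on the first tool message found is what makes the screenshot context "ephemeral": one intervening tool call and the flag flips back to False.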
@@ -1,9 +1,21 @@
-from minitap.mobile_use.agents.outputter.outputter import outputter
-from minitap.mobile_use.config import LLM, OutputConfig
-from minitap.mobile_use.context import MobileUseContext
-from minitap.mobile_use.utils.logger import get_logger
+import sys
+from unittest.mock import AsyncMock, Mock, patch
+
+import pytest
 from pydantic import BaseModel

+sys.modules["langgraph.prebuilt.chat_agent_executor"] = Mock()
+sys.modules["minitap.mobile_use.graph.state"] = Mock()
+sys.modules["langchain_google_vertexai"] = Mock()
+sys.modules["langchain_google_genai"] = Mock()
+sys.modules["langchain_openai"] = Mock()
+sys.modules["langchain_cerebras"] = Mock()
+
+from minitap.mobile_use.agents.outputter.outputter import outputter  # noqa: E402
+from minitap.mobile_use.config import LLM, OutputConfig  # noqa: E402
+from minitap.mobile_use.context import MobileUseContext  # noqa: E402
+from minitap.mobile_use.utils.logger import get_logger  # noqa: E402
+
 logger = get_logger(__name__)


@@ -40,68 +52,118 @@ mocked_state = DummyState(
     ],
 )

-mocked_ctx = MobileUseContext(
-    llm_config={
+
+@pytest.fixture
+def mock_context():
+    """Create a properly mocked context with all required fields."""
+    ctx = Mock(spec=MobileUseContext)
+    ctx.llm_config = {
         "executor": LLM(provider="openai", model="gpt-5-nano"),
         "cortex": LLM(provider="openai", model="gpt-5-nano"),
         "planner": LLM(provider="openai", model="gpt-5-nano"),
         "orchestrator": LLM(provider="openai", model="gpt-5-nano"),
-    },
-)  # type: ignore
+    }
+    ctx.device = Mock()
+    ctx.hw_bridge_client = Mock()
+    ctx.screen_api_client = Mock()
+    return ctx
+
+
+@pytest.fixture
+def mock_state():
+    """Create a mock state with test data."""
+    return DummyState(
+        messages=[],
+        initial_goal="Find a green product on my website",
+        agents_thoughts=[
+            "Going on http://superwebsite.fr",
+            "Searching for products",
+            "Filtering by color",
+            "Color 'green' found for a 20 dollars product",
+        ],
+    )
+
+
+@patch("minitap.mobile_use.agents.outputter.outputter.get_llm")
+@pytest.mark.asyncio
+async def test_outputter_with_pydantic_model(mock_get_llm, mock_context, mock_state):
+    """Test outputter with Pydantic model output."""
+    # Mock the structured LLM response
+    mock_structured_llm = AsyncMock()
+    mock_structured_llm.ainvoke.return_value = MockPydanticSchema(
+        color="green", price=20, currency_symbol="$", website_url="http://superwebsite.fr"
+    )

+    # Mock the base LLM
+    mock_llm = Mock()
+    mock_llm.with_structured_output.return_value = mock_structured_llm
+    mock_get_llm.return_value = mock_llm

-async def test_outputter_with_pydantic_model():
-    logger.info("Starting test_outputter_with_pydantic_model")
     config = OutputConfig(
         structured_output=MockPydanticSchema,
         output_description=None,
     )

-    result = await outputter(ctx=mocked_ctx, output_config=config, graph_output=mocked_state)  # type: ignore
-
-    assert isinstance(result, MockPydanticSchema)
-    assert result.color.lower() == "green"
-    logger.success(str(result))
+    result = await outputter(ctx=mock_context, output_config=config, graph_output=mock_state)

+    assert isinstance(result, dict)
+    assert result.get("color") == "green"
+
+
+@patch("minitap.mobile_use.agents.outputter.outputter.get_llm")
+@pytest.mark.asyncio
+async def test_outputter_with_dict(mock_get_llm, mock_context, mock_state):
+    """Test outputter with dictionary output."""
+    # Mock the structured LLM response for dict
+    mock_structured_llm = AsyncMock()
+    expected_dict = {
+        "color": "green",
+        "price": 20,
+        "currency_symbol": "$",
+        "website_url": "http://superwebsite.fr",
+    }
+    mock_structured_llm.ainvoke.return_value = expected_dict
+
+    # Mock the base LLM
+    mock_llm = Mock()
+    mock_llm.with_structured_output.return_value = mock_structured_llm
+    mock_get_llm.return_value = mock_llm

-async def test_outputter_with_dict():
-    logger.info("Starting test_outputter_with_dict")
     config = OutputConfig(
         structured_output=mock_dict,
         output_description=None,
     )

-    result = await outputter(ctx=mocked_ctx, output_config=config, graph_output=mocked_state)  # type: ignore
+    result = await outputter(ctx=mock_context, output_config=config, graph_output=mock_state)

     assert isinstance(result, dict)
-    assert result.get("color", None) == "green"
-    assert result.get("price", None) == 20
-    assert result.get("currency_symbol", None) == "$"
-    assert result.get("website_url", None) == "http://superwebsite.fr"
-    logger.success(str(result))
-
+    assert result.get("color") == "green"
+    assert result.get("price") == 20
+    assert result.get("currency_symbol") == "$"
+    assert result.get("website_url") == "http://superwebsite.fr"
+
+
+@patch("minitap.mobile_use.agents.outputter.outputter.get_llm")
+@pytest.mark.asyncio
+async def test_outputter_with_natural_language_output(mock_get_llm, mock_context, mock_state):
+    """Test outputter with natural language description output."""
+    # Mock the LLM response for natural language output (no structured output)
+    mock_llm = AsyncMock()
+    expected_json = '{"color": "green", "price": 20, "currency_symbol": "$", "website_url": "http://superwebsite.fr"}'
+    mock_llm.ainvoke.return_value = Mock(content=expected_json)
+    mock_get_llm.return_value = mock_llm

-async def test_outputter_with_natural_language_output():
-    logger.info("Starting test_outputter_with_natural_language_output")
     config = OutputConfig(
         structured_output=None,
-        output_description="A JSON object with a color, \
-            a price, a currency_symbol and a website_url key",
+        output_description=(
+            "A JSON object with a color, a price, a currency_symbol and a website_url key"
+        ),
     )

-    result = await outputter(ctx=mocked_ctx, output_config=config, graph_output=mocked_state)  # type: ignore
-    logger.info(str(result))
+    result = await outputter(ctx=mock_context, output_config=config, graph_output=mock_state)

     assert isinstance(result, dict)
-    assert result.get("color", None) == "green"
-    assert result.get("price", None) == 20
-    assert result.get("currency_symbol", None) == "$"
-    assert result.get("website_url", None) == "http://superwebsite.fr"
-    logger.success(str(result))
-
-
-if __name__ == "__main__":
-    import asyncio
-
-    asyncio.run(test_outputter_with_pydantic_model())
-    asyncio.run(test_outputter_with_natural_language_output())
+    assert result.get("color") == "green"
+    assert result.get("price") == 20
+    assert result.get("currency_symbol") == "$"
+    assert result.get("website_url") == "http://superwebsite.fr"
@@ -5,6 +5,7 @@ import time
 import uuid
 from datetime import datetime
 from pathlib import Path
+from shutil import which
 from types import NoneType
 from typing import TypeVar, overload

@@ -41,6 +42,7 @@ from minitap.mobile_use.sdk.types.exceptions import (
     AgentProfileNotFoundError,
     AgentTaskRequestError,
     DeviceNotFoundError,
+    ExecutableNotFoundError,
     ServerStartupError,
 )
 from minitap.mobile_use.sdk.types.task import AgentProfile, Task, TaskRequest, TaskStatus
@@ -81,6 +83,12 @@ class Agent:
         self._tasks = []
         self._tmp_traces_dir = Path(tempfile.gettempdir()) / "mobile-use-traces"
         self._initialized = False
+        self._is_default_hw_bridge = (
+            self._config.servers.hw_bridge_base_url == DEFAULT_HW_BRIDGE_BASE_URL
+        )
+        self._is_default_screen_api = (
+            self._config.servers.screen_api_base_url == DEFAULT_SCREEN_API_BASE_URL
+        )

     def init(
         self,
@@ -88,6 +96,11 @@ class Agent:
         retry_count: int = 5,
         retry_wait_seconds: int = 5,
     ):
+        if not which("adb") and not which("xcrun"):
+            raise ExecutableNotFoundError("cli_tools")
+        if self._is_default_hw_bridge and not which("maestro"):
+            raise ExecutableNotFoundError("maestro")
+
         if self._initialized:
             logger.warning("Agent is already initialized. Skipping...")
             return True
@@ -433,17 +446,11 @@ class Agent:
         self._hw_bridge_client = DeviceHardwareClient(
             base_url=self._config.servers.hw_bridge_base_url.to_url(),
         )
-        self._is_default_hw_bridge = (
-            self._config.servers.hw_bridge_base_url == DEFAULT_HW_BRIDGE_BASE_URL
-        )
         self._screen_api_client = ScreenApiClient(
             base_url=self._config.servers.screen_api_base_url.to_url(),
             retry_count=retry_count,
             retry_wait_seconds=retry_wait_seconds,
         )
-        self._is_default_screen_api = (
-            self._config.servers.screen_api_base_url == DEFAULT_SCREEN_API_BASE_URL
-        )

     def _run_servers(self, device_id: str, platform: DevicePlatform) -> bool:
         if self._is_default_hw_bridge:
@@ -496,7 +503,10 @@ class Agent:

     def _check_device_screen_api_health(self) -> bool:
         try:
+            # Required to know if the Screen API is up
             self._screen_api_client.get_with_retry("/health", timeout=5)
+            # Required to know if the Screen API actually receives screenshots from the HW Bridge API
             self._screen_api_client.get_with_retry("/screen-info", timeout=5)
             return True
         except Exception as e:
             logger.error(f"Device Screen API health check failed: {e}")
@@ -4,6 +4,8 @@ Exceptions for the Mobile-use SDK.
 This module defines the exception hierarchy used throughout the Mobile-use SDK.
 """

+from typing import Literal
+

 class MobileUseError(Exception):
     """Base exception class for all Mobile-use SDK exceptions."""
@@ -72,3 +74,31 @@ class AgentProfileNotFoundError(AgentTaskRequestError):

     def __init__(self, profile_name: str):
         super().__init__(f"Agent profile {profile_name} not found")
+
+
+EXECUTABLES = Literal["adb", "maestro", "xcrun", "cli_tools"]
+
+
+class ExecutableNotFoundError(MobileUseError):
+    """Exception raised when a required executable is not found."""
+
+    def __init__(self, executable_name: EXECUTABLES):
+        install_instructions: dict[EXECUTABLES, str] = {
+            "adb": "https://developer.android.com/tools/adb",
+            "maestro": "https://docs.maestro.dev/getting-started/installing-maestro",
+            "xcrun": "Install with: xcode-select --install",
+        }
+        if executable_name == "cli_tools":
+            message = (
+                "ADB or Xcode Command Line Tools not found in PATH. "
+                "At least one of them is required to run mobile-use "
+                "depending on the device platform you wish to run (Android: adb, iOS: xcrun). "
+                "Refer to the following links for installation instructions:"
+                f"\n- ADB: {install_instructions['adb']}"
+                f"\n- Xcode Command Line Tools: {install_instructions['xcrun']}"
+            )
+        else:
+            message = f"Required executable '{executable_name}' not found in PATH."
+            if executable_name in install_instructions:
+                message += f"\nTo install it, please visit: {install_instructions[executable_name]}"
+        super().__init__(message)
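The `which()` preflight added to `Agent.init` can be sketched as a standalone helper. Everything here is hypothetical (`preflight`, the injectable `which_fn` hook, returning a list of missing tools instead of raising `ExecutableNotFoundError`); it only shows the gating logic under those assumptions:

```python
from shutil import which as _which


def preflight(require_maestro: bool = True, which_fn=_which) -> list:
    """Report missing CLI requirements (illustrative stand-in for the real check).

    which_fn is injectable so the logic can be tested without touching PATH.
    """
    missing = []
    # Either Android (adb) or iOS (xcrun) tooling must be present.
    if not which_fn("adb") and not which_fn("xcrun"):
        missing.append("cli_tools")
    # Maestro is only needed when the default hardware bridge is used.
    if require_maestro and not which_fn("maestro"):
        missing.append("maestro")
    return missing
```

Failing fast here, before any server is started, gives a clearer error than a later connection failure would.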
@@ -6,6 +6,7 @@ import time
 from enum import Enum

 import requests
+
 from minitap.mobile_use.context import DevicePlatform
 from minitap.mobile_use.servers.utils import is_port_in_use

@@ -175,7 +176,7 @@ class DeviceHardwareBridge:
     ]

     def start(self):
-        if is_port_in_use(DEVICE_HARDWARE_BRIDGE_PORT):
+        if is_port_in_use(port=DEVICE_HARDWARE_BRIDGE_PORT):
             print("Maestro port already in use - assuming Maestro is running.")
             self.status = BridgeStatus.RUNNING
             return True
@@ -1,11 +1,8 @@
-import psutil
+import contextlib
+import socket


-def is_port_in_use(port: int):
-    for conn in psutil.net_connections():
-        if conn.status == psutil.CONN_LISTEN and conn.laddr:
-            if hasattr(conn.laddr, "port") and conn.laddr.port == port:
-                return True
-            elif isinstance(conn.laddr, tuple) and len(conn.laddr) >= 2 and conn.laddr[1] == port:
-                return True
-    return False
+def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
+    with contextlib.closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
+        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
+        return s.connect_ex((host, port)) == 0
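The rewritten `is_port_in_use` probes by attempting a TCP connect rather than enumerating connections with psutil. A quick self-contained check, duplicating the function from the hunk above so the snippet runs standalone:

```python
import contextlib
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    # Same implementation as the hunk above: connect_ex returns 0 on success.
    with contextlib.closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return s.connect_ex((host, port)) == 0


with contextlib.closing(socket.socket()) as server:
    server.bind(("127.0.0.1", 0))  # let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    in_use_while_listening = is_port_in_use(port)

# Once the listener is closed, the connect attempt is refused.
in_use_after_close = is_port_in_use(port)
```

Unlike the psutil scan, this approach needs no elevated privileges and confirms the port is actually accepting connections, not merely allocated.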
@@ -6,6 +6,7 @@ from minitap.mobile_use.tools.mobile.clear_text import clear_text_wrapper
 from minitap.mobile_use.tools.mobile.copy_text_from import copy_text_from_wrapper
 from minitap.mobile_use.tools.mobile.erase_one_char import erase_one_char_wrapper
 from minitap.mobile_use.tools.mobile.find_packages import find_packages_wrapper
+from minitap.mobile_use.tools.mobile.glimpse_screen import glimpse_screen_wrapper
 from minitap.mobile_use.tools.mobile.input_text import input_text_wrapper
 from minitap.mobile_use.tools.mobile.launch_app import launch_app_wrapper
 from minitap.mobile_use.tools.mobile.long_press_on import long_press_on_wrapper
@@ -14,7 +15,6 @@ from minitap.mobile_use.tools.mobile.paste_text import paste_text_wrapper
 from minitap.mobile_use.tools.mobile.press_key import press_key_wrapper
 from minitap.mobile_use.tools.mobile.stop_app import stop_app_wrapper
 from minitap.mobile_use.tools.mobile.swipe import swipe_wrapper
-from minitap.mobile_use.tools.mobile.take_screenshot import take_screenshot_wrapper
 from minitap.mobile_use.tools.mobile.tap import tap_wrapper
 from minitap.mobile_use.tools.mobile.wait_for_animation_to_end import (
     wait_for_animation_to_end_wrapper,
@@ -27,7 +27,7 @@ EXECUTOR_WRAPPERS_TOOLS = [
     tap_wrapper,
     long_press_on_wrapper,
     swipe_wrapper,
-    take_screenshot_wrapper,
+    glimpse_screen_wrapper,
     copy_text_from_wrapper,
     input_text_wrapper,
     erase_one_char_wrapper,