auto-simctl 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,183 @@
Metadata-Version: 2.4
Name: auto-simctl
Version: 0.1.0
Summary: Intelligent mobile simulator control — AI-driven device testing for vibe coding
Author: auto-simctl
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: fb-idb>=1.1.0
Requires-Dist: pure-python-adb>=0.2.2.dev0
Requires-Dist: mlx-openai-server>=1.6.0
Requires-Dist: mlx-vlm>=0.4.0
Requires-Dist: openai>=1.0.0
Requires-Dist: typer>=0.9.0
Requires-Dist: rich>=13.0.0
Requires-Dist: fastmcp>=2.0.0
Requires-Dist: huggingface_hub>=0.20.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"

# auto-simctl

**Intelligent Mobile Simulator Control** — the missing piece for vibe coding on mobile. An AI agent that controls real/simulated Android and iOS devices: screenshot → UI understanding → reasoning → action → report.

## What it does

- **Unified device bridge (MDB)**: One API over `adb` (Android) and `idb` (iOS Simulator). All coordinates are clamped to screen bounds before execution.
- **Accessibility-first UI understanding**: `idb ui describe-all` provides a precise logical-point `(cx, cy)` for every element. Qwen picks *which* element; the element table supplies *where*.
- **Qwen-as-director**: Qwen3.5-9B reasons from the accessibility tree + screenshot, decides actions, and bridges language semantics (e.g. a Chinese task against English UI labels) without hardcoded translation tables.
- **Fast-paths (no LLM)**: The orchestrator short-circuits Qwen for deterministic cases — gesture keywords (往右滑 "swipe right", 返回 "back", …), open-keyboard `input_text`, a visible app icon, or a foreground app that already matches the target.
- **`act` — stateful one-shot mode**: Does NOT reset to HOME; executes exactly one atomic action and returns. A tap / swipe / scroll / pan is always considered done after one execution. Designed for the MCP vibe-coding loop.
- **`run` — full autonomous mode**: Pre-flight HOME reset, then a multi-step ReAct loop until the goal is complete or max steps are reached.
- **`screen` — instant screen snapshot**: Returns the foreground app, visible elements with tap coordinates, scroll state, and keyboard state. With `-s`, saves a screenshot.
- **Keyboard-open fast-path**: When `act("input_text X")` is called and the keyboard is already visible, the text is typed immediately — no Qwen call needed.
- **Compound input**: `act("輸入https://… on address textfield")` (輸入 = "type") taps the field, waits for the keyboard, then types — all in one call.
- **Pre-flight home reset** (`run` only): Before each task, presses HOME, then corrects Today View / side launcher pages by swiping left to page 0.
- **UI-UG fallback**: UI-UG-7B-2601, via a background HTTP server, handles custom-drawn views where accessibility labels are unavailable (games, canvas, WebView).
- **ReAct loop with navigation stack**: Screenshot → accessibility elements → fast-paths → Qwen → action → execute → update nav stack → repeat until done or max steps.
- **Navigation & scroll awareness**: Maintains a `NavFrame` stack (depth, screen label, scroll offset) so the agent always knows which page it's on and how far it has scrolled.
- **Dialog & keyboard handling**: Auto-detects and dismisses system permission dialogs; detects the on-screen keyboard and switches to `input_text()` automatically.
- **MCP server**: `mcp_server/server.py` exposes `list_devices`, `get_screen_state`, `act`, and `run_task` as FastMCP tools — plug directly into Cursor or Claude Desktop.

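The clamping described for the device bridge can be sketched as follows. This is a minimal illustration; `ScreenBounds` and `clamp_point` are hypothetical names, not the package's actual `DeviceBridge` API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScreenBounds:
    """Logical-point screen size, e.g. 402x874 for iPhone 16 Pro."""
    width: int
    height: int

def clamp_point(x: int, y: int, bounds: ScreenBounds) -> tuple[int, int]:
    """Clamp a tap/swipe coordinate into the visible screen area."""
    cx = min(max(x, 0), bounds.width - 1)
    cy = min(max(y, 0), bounds.height - 1)
    return cx, cy

print(clamp_point(450, -10, ScreenBounds(width=402, height=874)))  # → (401, 0)
```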
## Quick start

```bash
# One-time setup: install adb, idb, Python deps, download models
./setup.sh

# Start both model servers (Qwen on :8080, UI-UG on :8081)
python3 cli.py server start

# ── Full autonomous task (HOME reset, multi-step) ─────────────────────────────
python3 cli.py run "Open Settings"
python3 cli.py run "找看看有沒有資料夾" --verbose         # "see whether there is a folder"

# ── One-shot act (no HOME reset, continues from current screen) ───────────────
python3 cli.py act "往右滑"                                # "swipe right" (gesture fast-path)
python3 cli.py act "點 Watch app"                          # "tap the Watch app"
python3 cli.py act "tap address bar"                       # tap a field (keyboard opens)
python3 cli.py act "input_text https://google.com"         # type (keyboard must be open)
python3 cli.py act "press enter"                           # submit — input_text does NOT auto-press Enter
python3 cli.py act "輸入https://google.com on address bar"  # "type …": tap + type in one call (still needs press enter after)
python3 cli.py act "返回"                                  # "back"

# ── Screen snapshot (for the vibe-coding brain) ───────────────────────────────
python3 cli.py screen               # rich text summary of current screen
python3 cli.py screen --json        # machine-readable JSON
python3 cli.py screen -s shot.png   # save screenshot file

# List connected devices
python3 cli.py devices

# Stop servers
python3 cli.py server stop
```

## Requirements

- macOS (Apple Silicon) — MLX-based models
- Python 3.10+
- Xcode + `idb-companion` (Homebrew) for iOS
- `android-platform-tools` (Homebrew) or Android Studio for Android
- Models in `~/.cache/huggingface/hub/`:
  - `qwen3.5-9b-mlx-4bit` (reasoning, ~8s/step with thinking)
  - `neovateai/UI-UG-7B-2601` (UI grounding fallback)

## How the agent thinks

```
run mode — pre-flight (once per task):
  1. Press HOME if not on SpringBoard
  2. Detect Today View (>12 elements) → swipe left to page 0

act mode — no pre-flight (continues from current screen)

For each step (max 20):
  1. Take screenshot
  2. List all accessibility elements (visible + off-screen)
  3. Detect keyboard open (single-letter Button elements present)
  4. Get scroll boundary info (content above/below/left/right)
  5. Detect & auto-dismiss system dialogs
  6. Fast-paths (no Qwen):
     a. Keyboard open + input task (input_text / 輸入 / type)
        → input_text(X) directly, skip Qwen
     b. Gesture keyword (往右滑, 往左滑, 返回, swipe right, back, …)
        → deterministic swipe/press, skip Qwen [act only: also skips step loop]
     c. Elements show Tab Bar + No Recents + app label → done
     d. MDB foreground = target app + elements show in-app → done
     e. App icon Button visible in elements → direct tap(cx, cy)
  7. Qwen phase-1: analyze screenshot + elements → decide action
     └─ If action = "ground": accessibility elements passed to Qwen phase-2
        (UI-UG called only if no accessibility labels at all)
  8. Snap out-of-bounds tap to nearest accessible element
  9. Execute action via MDB (coords clamped to screen bounds)
  10. act mode one-shot rules:
      - tap / swipe / scroll / pan → return done immediately (no verify loop)
      - input task + step 1 was tap → sleep 0.8s for keyboard, type, return done
  11. Update navigation stack (NavFrame + ScrollState)
  12. Dead-end check: same action 3× → force HOME [run mode only]
```
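The gesture fast-path (step 6b) can be sketched as a keyword table plus a lookup. This is illustrative only: the swipe coordinates follow the 402×874pt conventions stated in the prompt rules, but the actual keyword set and orchestrator code may differ.

```python
# Deterministic keyword → action mapping, resolved without any Qwen call.
# Coordinates are in 402x874 logical points (iPhone 16 Pro).
GESTURES: dict[str, dict] = {
    "往右滑":      {"action_type": "swipe", "x": 50,  "y": 437, "x2": 350, "y2": 437},  # swipe right
    "swipe right": {"action_type": "swipe", "x": 50,  "y": 437, "x2": 350, "y2": 437},
    "往左滑":      {"action_type": "swipe", "x": 350, "y": 437, "x2": 50,  "y2": 437},  # swipe left
    "swipe left":  {"action_type": "swipe", "x": 350, "y": 437, "x2": 50,  "y2": 437},
    "返回":        {"action_type": "press_key", "key": "BACK"},  # back
    "back":        {"action_type": "press_key", "key": "BACK"},
}

def gesture_fast_path(task: str) -> "dict | None":
    """Return a deterministic action for a pure gesture task, else None."""
    return GESTURES.get(task.strip().lower()) or GESTURES.get(task.strip())
```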

**Key design principles**:
- Accessibility elements carry `(cx, cy)` in logical points — Qwen decides *which* element, the element table provides *where*. Qwen never needs to estimate coordinates from the screenshot for standard iOS apps.
- `act` is strictly one-shot: tap / swipe / scroll / pan complete after one execution. There is no post-action verification — the MCP caller observes the result via `get_screen_state`.
- Keyboard detection drives the `input_text` fast-path: if a keyboard is visible when `input_text X` is issued, the text is typed immediately without any Qwen call.
- `input_text` does **not** press Enter/Return automatically. If the action requires submission (URL navigation, search, form submit), follow up with a separate `act("press enter")` / `act("按 enter")`.
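Step 8 of the loop (snapping an out-of-bounds tap to the nearest accessible element) could look roughly like this; `snap_to_nearest` is a hypothetical helper, and only the `(cx, cy)` fields mirror the accessibility table described above.

```python
import math

def snap_to_nearest(x: int, y: int, elements: list[dict],
                    width: int = 402, height: int = 874) -> tuple[int, int]:
    """If (x, y) falls outside the screen, snap to the nearest element center."""
    if 0 <= x < width and 0 <= y < height:
        return x, y  # already on screen: leave untouched
    if not elements:
        # Nothing to snap to: fall back to plain clamping.
        return min(max(x, 0), width - 1), min(max(y, 0), height - 1)
    nearest = min(elements, key=lambda e: math.hypot(e["cx"] - x, e["cy"] - y))
    return nearest["cx"], nearest["cy"]

els = [{"label": "Settings", "cx": 201, "cy": 300}, {"label": "Back", "cx": 30, "cy": 60}]
print(snap_to_nearest(500, 300, els))  # → (201, 300)
```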

## Project layout

```
auto-simctl/
├── cli.py                    # Entry point: run / act / screen / devices / server
├── ui_server.py              # UI-UG-7B HTTP server (port 8081)
├── logger.py                 # Structured logging helpers

├── mdb/                      # Mobile Device Bridge
│   ├── bridge.py             # DeviceBridge unified API + coord clamping
│   ├── screen.py             # ScreenSpec: pixel ↔ logical-pt ↔ norm1000
│   ├── models.py             # DeviceInfo, Action, Screenshot dataclasses
│   └── backends/
│       ├── idb_backend.py    # iOS: screenshot, tap, swipe, input_text,
│       │                     #      list_elements, get_foreground_app,
│       │                     #      get_scroll_info, detect_system_dialog
│       └── adb_backend.py    # Android: same interface via adb

├── agents/
│   ├── qwen_agent.py         # Qwen3.5-9B via mlx-openai-server; adaptive thinking
│   ├── ui_agent.py           # UI-UG-7B-2601 client (HTTP → port 8081)
│   └── prompts.py            # SYSTEM_PROMPT + build_user_message

├── orchestrator/
│   ├── loop.py               # Pre-flight, fast-paths, ReAct loop, nav stack,
│   │                         #   act one-shot rules, keyboard/input fast-path
│   └── result.py             # TaskResult, StepLog, NavFrame, ScrollState

├── mcp_server/
│   └── server.py             # FastMCP server: list_devices, get_screen_state,
│                             #   act (one-shot), run_task (multi-step)

├── PLAN.md                   # Full architecture and design decisions
├── setup.sh                  # Auto-installer
└── .cursor/skills/
    └── auto-simctl-navigation/SKILL.md   # Navigation patterns & failure modes
```
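The conversions `mdb/screen.py` is described as handling can be sketched like this. The 402×874pt size matches the prompt rules; the 3x scale factor and the method names are assumptions, not the package's actual `ScreenSpec` API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScreenSpec:
    """Screen geometry (assumed values for iPhone 16 Pro: 402x874 pt at 3x)."""
    width_pt: int = 402
    height_pt: int = 874
    scale: int = 3  # logical points → physical pixels (assumed Retina scale)

    def pt_to_px(self, x: int, y: int) -> tuple[int, int]:
        """Logical points → physical pixels."""
        return x * self.scale, y * self.scale

    def norm1000_to_pt(self, nx: int, ny: int) -> tuple[int, int]:
        """Model output in a 0-1000 normalized space → logical points."""
        return round(nx * self.width_pt / 1000), round(ny * self.height_pt / 1000)

spec = ScreenSpec()
print(spec.pt_to_px(201, 437))        # → (603, 1311)
print(spec.norm1000_to_pt(500, 500))  # → (201, 437)
```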

See [PLAN.md](PLAN.md) for the full architecture, coordinate systems, and design decisions.

## Third-party Models

This project downloads and uses the following models at runtime (not bundled):

| Model | License | Source |
|---|---|---|
| `qwen3.5-9b-mlx-4bit` | Apache 2.0 | [Qwen / Alibaba Cloud](https://huggingface.co/Qwen) |
| `neovateai/UI-UG-7B-2601` | Apache 2.0 | [neovateai/UI-UG-7B-2601](https://huggingface.co/neovateai/UI-UG-7B-2601) |

Models are downloaded separately (via `setup.sh`) and are not redistributed with this project.

## License

MIT — auto-simctl source code only.
The downloaded models are governed by their respective Apache 2.0 licenses.
@@ -0,0 +1,4 @@
from .qwen_agent import QwenAgent
from .ui_agent import UIAgent

__all__ = ["QwenAgent", "UIAgent"]
@@ -0,0 +1,254 @@
"""
Prompts for the Qwen reasoning agent (vision + text).

Design: Qwen MUST see the screenshot (vision) as the universal solution.
Accessibility elements and precise MDB identifier taps are optional context or fast-paths.
"""

SYSTEM_PROMPT = """\
You are a mobile UI automation agent. The SCREEN IMAGE is the source of truth — use it to understand the current state and complete the task.

Output ONLY one JSON action per step.

ACTIONS:
{"action_type":"tap","x":<int>,"y":<int>,"reasoning":"<why>"}
{"action_type":"swipe","x":<int>,"y":<int>,"x2":<int>,"y2":<int>,"duration_ms":400,"reasoning":"<why>"}
{"action_type":"input_text","text":"<string>","reasoning":"<why>"}
{"action_type":"press_key","key":"HOME|BACK|ENTER","reasoning":"<why>"}
{"action_type":"launch_app","app_id":"<bundle_id>","reasoning":"<why>"}
{"action_type":"ground","ground_query":"<what to find>","reasoning":"<why>"}
{"action_type":"done","result":"<what was done>","reasoning":"<why>"}
{"action_type":"error","result":"<reason>","reasoning":"<why>"}

RULES:
1. ELEMENTS = IN-APP (trust over image): If the elements table contains "Tab Bar" AND ("No Recents" or "X | Application"), you are INSIDE the app, NOT on the home screen. Home screen has many Buttons (Fitness, Watch, Contacts, Files as icons). So: 4 elements with Tab Bar + No Recents = in-app. For task 打開 X app when elements show in-app → output done. Do NOT say "home screen" when elements show Tab Bar + No Recents.
2. FOREGROUND + IN-APP → done: If "Current foreground app: ... (X)" and task is 打開 X app, and elements show in-app (Tab Bar, No Recents, X | Application), → output done. (If elements show many icon Buttons = home grid, do not output done.)
3. IMAGE: Home = grid of app icons. In-app = Tab Bar, "No Recents", in-app content. When elements already show in-app, trust elements.
4. DONE — generic: If the screen/elements clearly show the task result, output done. For "連上 URL" / "navigate to URL": done only when the page has loaded (URL bar shows the target domain, or page content is visible) — NOT when the URL is merely typed into the address bar.
5. Elements (tap): ALWAYS use the (cx,cy) shown in the elements table. NEVER estimate coordinates from the image — the table coordinates are exact. "X | Application" + Tab Bar/No Recents = inside X.
6. Bridge languages: 設定→Settings, 檔案→Files, 相片→Photos.
7. KEYBOARD OPEN: Use input_text() directly. Do NOT tap letter keys. After input_text() for a URL or search query, ALWAYS follow with press_key(ENTER) in the next step to submit — typing alone does NOT complete a navigation task.
8. SCROLL (vertical): swipe(201,700,201,200) = scroll down; swipe(201,200,201,700) = scroll up. PAGE (horizontal): swipe(50,437,350,437) = swipe right (go to previous page); swipe(350,437,50,437) = swipe left (go to next page). All coordinates in 402×874pt space.
9. DIALOGS: Handle system dialogs first. Dismiss unless task needs that permission.
10. DEAD-END: Same action repeated → go BACK or try new path.
11. Screen: iPhone 16 Pro 402×874. Top-left origin. Status bar y<55.
12. If you cannot find the target in the image, use ground("query"); otherwise decide from the screenshot.

Output ONLY the JSON. No explanation."""


def build_user_message(
    task: str,
    screenshot_data_url: str,
    ui_elements: list[dict],
    history: list[dict],
    step: int,
    max_steps: int,
    grounding_result: "list[dict] | None" = None,
    nav_stack: "list | None" = None,
    dialog_info: "dict | None" = None,
    ground_query: "str | None" = None,
    scroll_info: "dict | None" = None,
    keyboard_open: bool = False,
    screenshot_url: "str | None" = None,
    foreground_app: "dict | None" = None,
) -> list[dict]:
    """
    Build the user message content for the Vision API: image (screenshot) + text context.

    Prefer screenshot_url (binary fetch) over screenshot_data_url (base64) to avoid
    large request bodies and extra encoding. Qwen sees the screenshot first (universal).
    foreground_app from MDB (idb list-apps) helps done detection: e.g. task 打開 files app
    + foreground_app bundle_id is com.apple.DocumentsApp → we're in Files → done.
    """
    parts: list[dict] = []

    # Vision-first: send screenshot so Qwen can see the screen.
    # Prefer URL (server fetches binary) over data URL (base64 in body).
    image_url = screenshot_url if screenshot_url else (
        screenshot_data_url if (screenshot_data_url and screenshot_data_url.startswith("data:")) else None
    )
    if image_url:
        parts.append({
            "type": "image_url",
            "image_url": {"url": image_url},
        })

    lines: list[str] = [
        f"Step {step}/{max_steps} | Task: {task}",
    ]

    # ── Foreground app (from MDB: idb list-apps process_state=Running) ─────────
    if foreground_app:
        bid = foreground_app.get("bundle_id", "")
        name = foreground_app.get("name", "")
        lines.append(f"\nCurrent foreground app: {bid} ({name})")
        lines.append("  Use this to decide done: e.g. task 打開 files app + foreground is Files app → done.")

    # ── Active dialog (highest priority context) ───────────────────────────────
    if dialog_info:
        lines.append("\n⚠️ ACTIVE SYSTEM DIALOG — handle this FIRST:")
        lines.append(f"  Type: {dialog_info['type']}")
        lines.append(f"  Message: {dialog_info['message']}")
        lines.append("  Buttons:")
        for btn in dialog_info["buttons"]:
            lines.append(f"    • \"{btn['label']}\"  tap({btn['cx']}, {btn['cy']})")
        dl = dialog_info["dismiss_label"]
        dismiss_btn = next((b for b in dialog_info["buttons"] if b["label"] == dl), None)
        if dismiss_btn:
            lines.append(f"  Suggested dismiss: tap({dismiss_btn['cx']}, {dismiss_btn['cy']}) "
                         f"→ \"{dl}\"")
        lines.append("  → Does this task require that permission? If not, dismiss it.")

    # ── Keyboard status ────────────────────────────────────────────────────────
    if keyboard_open:
        lines.append(
            "\n⌨️ KEYBOARD IS OPEN — a text field is focused and ready for input.\n"
            "  Use input_text(\"your text\") to type. "
            "Do NOT tap individual letter keys."
        )

    # ── Navigation breadcrumbs + scroll position ──────────────────────────────
    if nav_stack:
        lines.append(f"\nNavigation (depth {len(nav_stack) - 1}):")
        for frame in nav_stack:
            prefix = "  →" if frame.depth < len(nav_stack) - 1 else "  ★"
            action_str = f" (via {frame.action_taken})" if frame.action_taken else ""
            scroll_str = ""
            if frame.scroll.scroll_y != 0 or frame.scroll.scroll_x != 0:
                scroll_str = f" [scroll: {frame.scroll.summary()}]"
            lines.append(f"{prefix} [{frame.depth}] {frame.screen_label}{action_str}{scroll_str}")
    else:
        lines.append("\nNavigation: home screen (depth 0)")

    # ── Scroll boundaries ──────────────────────────────────────────────────────
    if scroll_info:
        si = scroll_info
        scroll_parts = []
        if si.get("has_content_above"):
            scroll_parts.append("content ABOVE viewport (scroll up to see)")
        if si.get("has_content_below"):
            scroll_parts.append("content BELOW viewport (scroll down to see)")
        if si.get("has_content_left"):
            scroll_parts.append("content to the LEFT")
        if si.get("has_content_right"):
            scroll_parts.append("content to the RIGHT")
        if scroll_parts:
            lines.append("Scroll boundaries: " + " | ".join(scroll_parts))
        if si.get("content_height_pt"):
            lines.append(f"  Total content height ≈ {si['content_height_pt']}pt "
                         f"(screen = 874pt)")

    # ── Recent action history ──────────────────────────────────────────────────
    if history:
        lines.append("\nLast actions:")
        for h in history[-4:]:
            line = f"  step {h.get('step','?')}: {h.get('action','?')}"
            if h.get("screen_after"):
                line += f" → screen: [{h['screen_after']}]"
            if h.get("error"):
                line += f" ⚠ ERROR: {h['error']}"
            lines.append(line)

    # ── Phase-2: picking from provided elements ────────────────────────────────
    if grounding_result is not None:
        is_phase2_acc = (
            grounding_result and
            isinstance(grounding_result[0], dict) and
            "cx" in grounding_result[0] and
            "bbox" not in grounding_result[0]
        )
        if ground_query:
            lines.append(f"\nGROUND QUERY: \"{ground_query}\"")

        if is_phase2_acc:
            # Accessibility elements — already in logical points
            lines.append(f"\nAccessibility elements on screen ({len(grounding_result)} total):")
            lines.append("  label                          | type           | tap(x,y)")
            lines.append("  " + "-" * 55)
            for el in grounding_result[:30]:
                label = el.get("label", "")[:30]
                etype = el.get("type", "")[:14]
                cx = el.get("cx", 0)
                cy = el.get("cy", 0)
                lines.append(f"  {label:<30} | {etype:<14} | tap({cx},{cy})")
            lines.append(
                "\nSemantically match the ground_query to the label above "
                "(task may be in Chinese, labels in English — you bridge them). "
                "Output a DIRECT tap/press_key/done action. Do NOT output ground."
            )
        else:
            # Visual grounding (UI-UG) result
            lines.append(f"\nVisual grounding result ({len(grounding_result)} element(s)):")
            for el in grounding_result[:10]:
                cx_list = el.get("center", [0, 0])
                cx, cy = (cx_list[0], cx_list[1]) if len(cx_list) >= 2 else (0, 0)
                lines.append(
                    f"  • [{el.get('type','')}] \"{el.get('label','')}\" "
                    f"center=({cx},{cy}) bbox={el.get('bbox',[])}"
                )
            if grounding_result:
                lines.append(
                    "\nOutput a DIRECT action using these coordinates. "
                    "Do NOT output ground."
                )
            else:
                lines.append(
                    "\nNothing found visually. Try BACK, a different ground query, or error."
                )

    # ── Phase-1: accessibility elements as context ─────────────────────────────
    elif ui_elements:
        # Filter out keyboard keys (single-letter buttons) — they clutter context
        _is_key = lambda e: (e.get("type") == "Button" and
                             len(e.get("label", "").strip()) == 1)
        visible = [el for el in ui_elements
                   if el.get("visible", True) and not _is_key(el)]
        offscreen = [el for el in ui_elements
                     if not el.get("visible", True) and not _is_key(el)]

        lines.append(f"\nVisible elements ({len(visible)}):")
        lines.append("  label                          | type           | tap(x,y)")
        lines.append("  " + "-" * 55)
        for el in visible[:25]:
            label = el.get("label", "")[:30]
            etype = el.get("type", "")[:14]
            cx = el.get("cx", 0)
            cy = el.get("cy", 0)
            lines.append(f"  {label:<30} | {etype:<14} | tap({cx},{cy})")

        if offscreen:
            lines.append(f"\nOff-screen elements ({len(offscreen)}, need scrolling to reach):")
            for el in offscreen[:15]:
                label = el.get("label", "")[:30]
                etype = el.get("type", "")[:14]
                cy = el.get("cy", 0)
                direction = "↓ below" if cy > 874 else "↑ above"
                lines.append(f"  {label:<30} | {etype:<14} | {direction} viewport")

        # Hint when elements clearly indicate in-app (Tab Bar + No Recents / Application)
        labels_lower = " ".join(el.get("label", "") for el in visible).lower()
        has_tab_bar = "tab bar" in labels_lower
        has_in_app = "no recents" in labels_lower or "application" in labels_lower
        task_lower = task.lower()
        open_app_task = "打開" in task or "open" in task_lower
        if has_tab_bar and has_in_app and open_app_task:
            lines.append(
                "\n→ IN-APP: Elements show Tab Bar and in-app UI (No Recents / Application). "
                "You are INSIDE the app, not on home screen. If task is to open this app → output done."
            )

        lines.append(
            "\nMatch task semantically to labels (Chinese→English). "
            "Tap visible elements directly. "
            "For off-screen elements: scroll toward them first, then tap."
        )
    else:
        lines.append(
            "\nNo accessibility elements available. "
            "Use ground() to locate elements visually, or swipe to reveal content."
        )

    lines.append("\nOutput ONE JSON action object.")
    parts.append({"type": "text", "text": "\n".join(lines)})
    return parts
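The SYSTEM_PROMPT above demands a single bare JSON action object. A caller might parse and validate the model's reply along these lines; `parse_action` is a hypothetical helper, not part of this package:

```python
import json
import re

# Action types defined by the SYSTEM_PROMPT's ACTIONS list.
VALID_ACTIONS = {"tap", "swipe", "input_text", "press_key",
                 "launch_app", "ground", "done", "error"}

def parse_action(raw: str) -> dict:
    """Extract and validate the single JSON action object from model output.

    Models occasionally wrap the JSON in a ``` fence despite instructions,
    so search for the outermost {...} instead of calling json.loads directly.
    """
    m = re.search(r"\{.*\}", raw, re.DOTALL)
    if not m:
        raise ValueError(f"no JSON object in model output: {raw!r}")
    action = json.loads(m.group(0))
    if action.get("action_type") not in VALID_ACTIONS:
        raise ValueError(f"unknown action_type: {action.get('action_type')!r}")
    return action
```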