auto-simctl 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,183 @@
Metadata-Version: 2.4
Name: auto-simctl
Version: 0.1.0
Summary: Intelligent mobile simulator control — AI-driven device testing for vibe coding
Author: auto-simctl
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: fb-idb>=1.1.0
Requires-Dist: pure-python-adb>=0.2.2.dev0
Requires-Dist: mlx-openai-server>=1.6.0
Requires-Dist: mlx-vlm>=0.4.0
Requires-Dist: openai>=1.0.0
Requires-Dist: typer>=0.9.0
Requires-Dist: rich>=13.0.0
Requires-Dist: fastmcp>=2.0.0
Requires-Dist: huggingface_hub>=0.20.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"

# auto-simctl

**Intelligent Mobile Simulator Control** — the missing piece for vibe coding on mobile. An AI agent that controls real/simulated Android and iOS devices: screenshot → UI understanding → reasoning → action → report.

## What it does

- **Unified device bridge (MDB)**: One API over `adb` (Android) and `idb` (iOS Simulator). All coordinates are clamped to screen bounds before execution.
- **Accessibility-first UI understanding**: `idb ui describe-all` provides a precise logical-point `(cx, cy)` for every element. Qwen picks *which* element; the element table supplies *where*.
- **Qwen-as-director**: Qwen3.5-9B reasons from the accessibility tree + screenshot, decides actions, and bridges language semantics (e.g. a Chinese task against English UI labels) without hardcoded translation tables.
- **Fast-paths (no LLM)**: The orchestrator short-circuits Qwen for deterministic cases — gesture keywords (往右滑 "swipe right", 返回 "back", …), open-keyboard `input_text`, a visible app icon, or a foreground app that already matches the target.
- **`act` — stateful one-shot mode**: Does NOT reset to HOME; executes exactly one atomic action and returns. A tap / swipe / scroll / pan is always considered done after one execution. Designed for the MCP vibe-coding loop.
- **`run` — full autonomous mode**: Pre-flight HOME reset, then a multi-step ReAct loop until the goal is complete or max steps are reached.
- **`screen` — instant screen snapshot**: Returns the foreground app, visible elements with tap coordinates, scroll state, and keyboard state. With `-s`, saves a screenshot.
- **Keyboard-open fast-path**: When `act("input_text X")` is called and the keyboard is already visible, the text is typed immediately — no Qwen call needed.
- **Compound input**: `act("輸入https://… on address textfield")` (輸入 = "type") taps the field, waits for the keyboard, then types — all in one call.
- **Pre-flight home reset** (`run` only): Before each task, presses HOME, then corrects Today View / side launcher pages by swiping left to page 0.
- **UI-UG fallback**: UI-UG-7B-2601, via a background HTTP server, handles custom-drawn views where accessibility labels are unavailable (games, canvas, WebView).
- **ReAct loop with navigation stack**: Screenshot → accessibility elements → fast-paths → Qwen → action → execute → update nav stack → repeat until done or max steps.
- **Navigation & scroll awareness**: Maintains a `NavFrame` stack (depth, screen label, scroll offset) so the agent always knows which page it's on and how far it has scrolled.
- **Dialog & keyboard handling**: Auto-detects and dismisses system permission dialogs; detects the on-screen keyboard and switches to `input_text()` automatically.
- **MCP server**: `mcp_server/server.py` exposes `list_devices`, `get_screen_state`, `act`, and `run_task` as FastMCP tools — plug directly into Cursor or Claude Desktop.

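The clamping described for the device bridge can be sketched as follows. This is a minimal illustration; `ScreenBounds` and `clamp_point` are hypothetical names, not the package's actual `DeviceBridge` API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScreenBounds:
    """Logical-point screen size, e.g. 402x874 for iPhone 16 Pro."""
    width: int
    height: int

def clamp_point(x: int, y: int, bounds: ScreenBounds) -> tuple[int, int]:
    """Clamp a tap/swipe coordinate into the visible screen area."""
    cx = min(max(x, 0), bounds.width - 1)
    cy = min(max(y, 0), bounds.height - 1)
    return cx, cy

print(clamp_point(450, -10, ScreenBounds(width=402, height=874)))  # → (401, 0)
```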
## Quick start

```bash
# One-time setup: install adb, idb, Python deps, download models
./setup.sh

# Start both model servers (Qwen on :8080, UI-UG on :8081)
python3 cli.py server start

# ── Full autonomous task (HOME reset, multi-step) ─────────────────────────────
python3 cli.py run "Open Settings"
python3 cli.py run "找看看有沒有資料夾" --verbose         # "see whether there is a folder"

# ── One-shot act (no HOME reset, continues from current screen) ───────────────
python3 cli.py act "往右滑"                                # "swipe right" (gesture fast-path)
python3 cli.py act "點 Watch app"                          # "tap the Watch app"
python3 cli.py act "tap address bar"                       # tap a field (keyboard opens)
python3 cli.py act "input_text https://google.com"         # type (keyboard must be open)
python3 cli.py act "press enter"                           # submit — input_text does NOT auto-press Enter
python3 cli.py act "輸入https://google.com on address bar"  # "type …": tap + type in one call (still needs press enter after)
python3 cli.py act "返回"                                  # "back"

# ── Screen snapshot (for the vibe-coding brain) ───────────────────────────────
python3 cli.py screen               # rich text summary of current screen
python3 cli.py screen --json        # machine-readable JSON
python3 cli.py screen -s shot.png   # save screenshot file

# List connected devices
python3 cli.py devices

# Stop servers
python3 cli.py server stop
```

## Requirements

- macOS (Apple Silicon) — MLX-based models
- Python 3.10+
- Xcode + `idb-companion` (Homebrew) for iOS
- `android-platform-tools` (Homebrew) or Android Studio for Android
- Models in `~/.cache/huggingface/hub/`:
  - `qwen3.5-9b-mlx-4bit` (reasoning, ~8s/step with thinking)
  - `neovateai/UI-UG-7B-2601` (UI grounding fallback)

## How the agent thinks

```
run mode — pre-flight (once per task):
  1. Press HOME if not on SpringBoard
  2. Detect Today View (>12 elements) → swipe left to page 0

act mode — no pre-flight (continues from current screen)

For each step (max 20):
  1. Take screenshot
  2. List all accessibility elements (visible + off-screen)
  3. Detect keyboard open (single-letter Button elements present)
  4. Get scroll boundary info (content above/below/left/right)
  5. Detect & auto-dismiss system dialogs
  6. Fast-paths (no Qwen):
     a. Keyboard open + input task (input_text / 輸入 / type)
        → input_text(X) directly, skip Qwen
     b. Gesture keyword (往右滑, 往左滑, 返回, swipe right, back, …)
        → deterministic swipe/press, skip Qwen [act only: also skips step loop]
     c. Elements show Tab Bar + No Recents + app label → done
     d. MDB foreground = target app + elements show in-app → done
     e. App icon Button visible in elements → direct tap(cx, cy)
  7. Qwen phase-1: analyze screenshot + elements → decide action
     └─ If action = "ground": accessibility elements passed to Qwen phase-2
        (UI-UG called only if no accessibility labels at all)
  8. Snap out-of-bounds tap to nearest accessible element
  9. Execute action via MDB (coords clamped to screen bounds)
  10. act mode one-shot rules:
      - tap / swipe / scroll / pan → return done immediately (no verify loop)
      - input task + step 1 was tap → sleep 0.8s for keyboard, type, return done
  11. Update navigation stack (NavFrame + ScrollState)
  12. Dead-end check: same action 3× → force HOME [run mode only]
```
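The gesture fast-path (step 6b) can be sketched as a keyword table plus a lookup. This is illustrative only: the swipe coordinates follow the 402×874pt conventions stated in the prompt rules, but the actual keyword set and orchestrator code may differ.

```python
# Deterministic keyword → action mapping, resolved without any Qwen call.
# Coordinates are in 402x874 logical points (iPhone 16 Pro).
GESTURES: dict[str, dict] = {
    "往右滑":      {"action_type": "swipe", "x": 50,  "y": 437, "x2": 350, "y2": 437},  # swipe right
    "swipe right": {"action_type": "swipe", "x": 50,  "y": 437, "x2": 350, "y2": 437},
    "往左滑":      {"action_type": "swipe", "x": 350, "y": 437, "x2": 50,  "y2": 437},  # swipe left
    "swipe left":  {"action_type": "swipe", "x": 350, "y": 437, "x2": 50,  "y2": 437},
    "返回":        {"action_type": "press_key", "key": "BACK"},  # back
    "back":        {"action_type": "press_key", "key": "BACK"},
}

def gesture_fast_path(task: str) -> "dict | None":
    """Return a deterministic action for a pure gesture task, else None."""
    return GESTURES.get(task.strip().lower()) or GESTURES.get(task.strip())
```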

**Key design principles**:
- Accessibility elements carry `(cx, cy)` in logical points — Qwen decides *which* element, the element table provides *where*. Qwen never needs to estimate coordinates from the screenshot for standard iOS apps.
- `act` is strictly one-shot: tap / swipe / scroll / pan complete after one execution. There is no post-action verification — the MCP caller observes the result via `get_screen_state`.
- Keyboard detection drives the `input_text` fast-path: if a keyboard is visible when `input_text X` is issued, the text is typed immediately without any Qwen call.
- `input_text` does **not** press Enter/Return automatically. If the action requires submission (URL navigation, search, form submit), follow up with a separate `act("press enter")` / `act("按 enter")`.
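Step 8 of the loop (snapping an out-of-bounds tap to the nearest accessible element) could look roughly like this; `snap_to_nearest` is a hypothetical helper, and only the `(cx, cy)` fields mirror the accessibility table described above.

```python
import math

def snap_to_nearest(x: int, y: int, elements: list[dict],
                    width: int = 402, height: int = 874) -> tuple[int, int]:
    """If (x, y) falls outside the screen, snap to the nearest element center."""
    if 0 <= x < width and 0 <= y < height:
        return x, y  # already on screen: leave untouched
    if not elements:
        # Nothing to snap to: fall back to plain clamping.
        return min(max(x, 0), width - 1), min(max(y, 0), height - 1)
    nearest = min(elements, key=lambda e: math.hypot(e["cx"] - x, e["cy"] - y))
    return nearest["cx"], nearest["cy"]

els = [{"label": "Settings", "cx": 201, "cy": 300}, {"label": "Back", "cx": 30, "cy": 60}]
print(snap_to_nearest(500, 300, els))  # → (201, 300)
```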

## Project layout

```
auto-simctl/
├── cli.py                    # Entry point: run / act / screen / devices / server
├── ui_server.py              # UI-UG-7B HTTP server (port 8081)
├── logger.py                 # Structured logging helpers

├── mdb/                      # Mobile Device Bridge
│   ├── bridge.py             # DeviceBridge unified API + coord clamping
│   ├── screen.py             # ScreenSpec: pixel ↔ logical-pt ↔ norm1000
│   ├── models.py             # DeviceInfo, Action, Screenshot dataclasses
│   └── backends/
│       ├── idb_backend.py    # iOS: screenshot, tap, swipe, input_text,
│       │                     #      list_elements, get_foreground_app,
│       │                     #      get_scroll_info, detect_system_dialog
│       └── adb_backend.py    # Android: same interface via adb

├── agents/
│   ├── qwen_agent.py         # Qwen3.5-9B via mlx-openai-server; adaptive thinking
│   ├── ui_agent.py           # UI-UG-7B-2601 client (HTTP → port 8081)
│   └── prompts.py            # SYSTEM_PROMPT + build_user_message

├── orchestrator/
│   ├── loop.py               # Pre-flight, fast-paths, ReAct loop, nav stack,
│   │                         #   act one-shot rules, keyboard/input fast-path
│   └── result.py             # TaskResult, StepLog, NavFrame, ScrollState

├── mcp_server/
│   └── server.py             # FastMCP server: list_devices, get_screen_state,
│                             #   act (one-shot), run_task (multi-step)

├── PLAN.md                   # Full architecture and design decisions
├── setup.sh                  # Auto-installer
└── .cursor/skills/
    └── auto-simctl-navigation/SKILL.md   # Navigation patterns & failure modes
```
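The conversions `mdb/screen.py` is described as handling can be sketched like this. The 402×874pt size matches the prompt rules; the 3x scale factor and the method names are assumptions, not the package's actual `ScreenSpec` API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScreenSpec:
    """Screen geometry (assumed values for iPhone 16 Pro: 402x874 pt at 3x)."""
    width_pt: int = 402
    height_pt: int = 874
    scale: int = 3  # logical points → physical pixels (assumed Retina scale)

    def pt_to_px(self, x: int, y: int) -> tuple[int, int]:
        """Logical points → physical pixels."""
        return x * self.scale, y * self.scale

    def norm1000_to_pt(self, nx: int, ny: int) -> tuple[int, int]:
        """Model output in a 0-1000 normalized space → logical points."""
        return round(nx * self.width_pt / 1000), round(ny * self.height_pt / 1000)

spec = ScreenSpec()
print(spec.pt_to_px(201, 437))        # → (603, 1311)
print(spec.norm1000_to_pt(500, 500))  # → (201, 437)
```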

See [PLAN.md](PLAN.md) for the full architecture, coordinate systems, and design decisions.

## Third-party Models

This project downloads and uses the following models at runtime (not bundled):

| Model | License | Source |
|---|---|---|
| `qwen3.5-9b-mlx-4bit` | Apache 2.0 | [Qwen / Alibaba Cloud](https://huggingface.co/Qwen) |
| `neovateai/UI-UG-7B-2601` | Apache 2.0 | [neovateai/UI-UG-7B-2601](https://huggingface.co/neovateai/UI-UG-7B-2601) |

Models are downloaded separately (via `setup.sh`) and are not redistributed with this project.

## License

MIT — auto-simctl source code only.
The downloaded models are governed by their respective Apache 2.0 licenses.
@@ -0,0 +1,4 @@
from .qwen_agent import QwenAgent
from .ui_agent import UIAgent

__all__ = ["QwenAgent", "UIAgent"]
@@ -0,0 +1,254 @@
"""
Prompts for the Qwen reasoning agent (vision + text).

Design: Qwen MUST see the screenshot (vision) as the universal solution.
Accessibility elements and precise MDB identifier taps are optional context or fast-paths.
"""

SYSTEM_PROMPT = """\
You are a mobile UI automation agent. The SCREEN IMAGE is the source of truth — use it to understand the current state and complete the task.

Output ONLY one JSON action per step.

ACTIONS:
{"action_type":"tap","x":<int>,"y":<int>,"reasoning":"<why>"}
{"action_type":"swipe","x":<int>,"y":<int>,"x2":<int>,"y2":<int>,"duration_ms":400,"reasoning":"<why>"}
{"action_type":"input_text","text":"<string>","reasoning":"<why>"}
{"action_type":"press_key","key":"HOME|BACK|ENTER","reasoning":"<why>"}
{"action_type":"launch_app","app_id":"<bundle_id>","reasoning":"<why>"}
{"action_type":"ground","ground_query":"<what to find>","reasoning":"<why>"}
{"action_type":"done","result":"<what was done>","reasoning":"<why>"}
{"action_type":"error","result":"<reason>","reasoning":"<why>"}

RULES:
1. ELEMENTS = IN-APP (trust over image): If the elements table contains "Tab Bar" AND ("No Recents" or "X | Application"), you are INSIDE the app, NOT on the home screen. Home screen has many Buttons (Fitness, Watch, Contacts, Files as icons). So: 4 elements with Tab Bar + No Recents = in-app. For task 打開 X app when elements show in-app → output done. Do NOT say "home screen" when elements show Tab Bar + No Recents.
2. FOREGROUND + IN-APP → done: If "Current foreground app: ... (X)" and task is 打開 X app, and elements show in-app (Tab Bar, No Recents, X | Application), → output done. (If elements show many icon Buttons = home grid, do not output done.)
3. IMAGE: Home = grid of app icons. In-app = Tab Bar, "No Recents", in-app content. When elements already show in-app, trust elements.
4. DONE — generic: If the screen/elements clearly show the task result, output done. For "連上 URL" / "navigate to URL": done only when the page has loaded (URL bar shows the target domain, or page content is visible) — NOT when the URL is merely typed into the address bar.
5. Elements (tap): ALWAYS use the (cx,cy) shown in the elements table. NEVER estimate coordinates from the image — the table coordinates are exact. "X | Application" + Tab Bar/No Recents = inside X.
6. Bridge languages: 設定→Settings, 檔案→Files, 相片→Photos.
7. KEYBOARD OPEN: Use input_text() directly. Do NOT tap letter keys. After input_text() for a URL or search query, ALWAYS follow with press_key(ENTER) in the next step to submit — typing alone does NOT complete a navigation task.
8. SCROLL (vertical): swipe(201,700,201,200) = scroll down; swipe(201,200,201,700) = scroll up. PAGE (horizontal): swipe(50,437,350,437) = swipe right (go to previous page); swipe(350,437,50,437) = swipe left (go to next page). All coordinates in 402×874pt space.
9. DIALOGS: Handle system dialogs first. Dismiss unless task needs that permission.
10. DEAD-END: Same action repeated → go BACK or try new path.
11. Screen: iPhone 16 Pro 402×874. Top-left origin. Status bar y<55.
12. If you cannot find the target in the image, use ground("query"); otherwise decide from the screenshot.

Output ONLY the JSON. No explanation."""


def build_user_message(
    task: str,
    screenshot_data_url: str,
    ui_elements: list[dict],
    history: list[dict],
    step: int,
    max_steps: int,
    grounding_result: "list[dict] | None" = None,
    nav_stack: "list | None" = None,
    dialog_info: "dict | None" = None,
    ground_query: "str | None" = None,
    scroll_info: "dict | None" = None,
    keyboard_open: bool = False,
    screenshot_url: "str | None" = None,
    foreground_app: "dict | None" = None,
) -> list[dict]:
    """
    Build the user message content for the Vision API: image (screenshot) + text context.

    Prefer screenshot_url (binary fetch) over screenshot_data_url (base64) to avoid
    large request bodies and extra encoding. Qwen sees the screenshot first (universal).
    foreground_app from MDB (idb list-apps) helps done detection: e.g. task 打開 files app
    + foreground_app bundle_id is com.apple.DocumentsApp → we're in Files → done.
    """
    parts: list[dict] = []

    # Vision-first: send screenshot so Qwen can see the screen.
    # Prefer URL (server fetches binary) over data URL (base64 in body).
    image_url = screenshot_url if screenshot_url else (
        screenshot_data_url if (screenshot_data_url and screenshot_data_url.startswith("data:")) else None
    )
    if image_url:
        parts.append({
            "type": "image_url",
            "image_url": {"url": image_url},
        })

    lines: list[str] = [
        f"Step {step}/{max_steps} | Task: {task}",
    ]

    # ── Foreground app (from MDB: idb list-apps process_state=Running) ─────────
    if foreground_app:
        bid = foreground_app.get("bundle_id", "")
        name = foreground_app.get("name", "")
        lines.append(f"\nCurrent foreground app: {bid} ({name})")
        lines.append("  Use this to decide done: e.g. task 打開 files app + foreground is Files app → done.")

    # ── Active dialog (highest priority context) ───────────────────────────────
    if dialog_info:
        lines.append("\n⚠️ ACTIVE SYSTEM DIALOG — handle this FIRST:")
        lines.append(f"  Type: {dialog_info['type']}")
        lines.append(f"  Message: {dialog_info['message']}")
        lines.append("  Buttons:")
        for btn in dialog_info["buttons"]:
            lines.append(f"    • \"{btn['label']}\"  tap({btn['cx']}, {btn['cy']})")
        dl = dialog_info["dismiss_label"]
        dismiss_btn = next((b for b in dialog_info["buttons"] if b["label"] == dl), None)
        if dismiss_btn:
            lines.append(f"  Suggested dismiss: tap({dismiss_btn['cx']}, {dismiss_btn['cy']}) "
                         f"→ \"{dl}\"")
        lines.append("  → Does this task require that permission? If not, dismiss it.")

    # ── Keyboard status ────────────────────────────────────────────────────────
    if keyboard_open:
        lines.append(
            "\n⌨️ KEYBOARD IS OPEN — a text field is focused and ready for input.\n"
            "  Use input_text(\"your text\") to type. "
            "Do NOT tap individual letter keys."
        )

    # ── Navigation breadcrumbs + scroll position ──────────────────────────────
    if nav_stack:
        lines.append(f"\nNavigation (depth {len(nav_stack) - 1}):")
        for frame in nav_stack:
            prefix = "  →" if frame.depth < len(nav_stack) - 1 else "  ★"
            action_str = f" (via {frame.action_taken})" if frame.action_taken else ""
            scroll_str = ""
            if frame.scroll.scroll_y != 0 or frame.scroll.scroll_x != 0:
                scroll_str = f" [scroll: {frame.scroll.summary()}]"
            lines.append(f"{prefix} [{frame.depth}] {frame.screen_label}{action_str}{scroll_str}")
    else:
        lines.append("\nNavigation: home screen (depth 0)")

    # ── Scroll boundaries ──────────────────────────────────────────────────────
    if scroll_info:
        si = scroll_info
        scroll_parts = []
        if si.get("has_content_above"):
            scroll_parts.append("content ABOVE viewport (scroll up to see)")
        if si.get("has_content_below"):
            scroll_parts.append("content BELOW viewport (scroll down to see)")
        if si.get("has_content_left"):
            scroll_parts.append("content to the LEFT")
        if si.get("has_content_right"):
            scroll_parts.append("content to the RIGHT")
        if scroll_parts:
            lines.append("Scroll boundaries: " + " | ".join(scroll_parts))
        if si.get("content_height_pt"):
            lines.append(f"  Total content height ≈ {si['content_height_pt']}pt "
                         f"(screen = 874pt)")

    # ── Recent action history ──────────────────────────────────────────────────
    if history:
        lines.append("\nLast actions:")
        for h in history[-4:]:
            line = f"  step {h.get('step','?')}: {h.get('action','?')}"
            if h.get("screen_after"):
                line += f" → screen: [{h['screen_after']}]"
            if h.get("error"):
                line += f" ⚠ ERROR: {h['error']}"
            lines.append(line)

    # ── Phase-2: picking from provided elements ────────────────────────────────
    if grounding_result is not None:
        is_phase2_acc = (
            grounding_result and
            isinstance(grounding_result[0], dict) and
            "cx" in grounding_result[0] and
            "bbox" not in grounding_result[0]
        )
        if ground_query:
            lines.append(f"\nGROUND QUERY: \"{ground_query}\"")

        if is_phase2_acc:
            # Accessibility elements — already in logical points
            lines.append(f"\nAccessibility elements on screen ({len(grounding_result)} total):")
            lines.append("  label                          | type           | tap(x,y)")
            lines.append("  " + "-" * 55)
            for el in grounding_result[:30]:
                label = el.get("label", "")[:30]
                etype = el.get("type", "")[:14]
                cx = el.get("cx", 0)
                cy = el.get("cy", 0)
                lines.append(f"  {label:<30} | {etype:<14} | tap({cx},{cy})")
            lines.append(
                "\nSemantically match the ground_query to the label above "
                "(task may be in Chinese, labels in English — you bridge them). "
                "Output a DIRECT tap/press_key/done action. Do NOT output ground."
            )
        else:
            # Visual grounding (UI-UG) result
            lines.append(f"\nVisual grounding result ({len(grounding_result)} element(s)):")
            for el in grounding_result[:10]:
                cx_list = el.get("center", [0, 0])
                cx, cy = (cx_list[0], cx_list[1]) if len(cx_list) >= 2 else (0, 0)
                lines.append(
                    f"  • [{el.get('type','')}] \"{el.get('label','')}\" "
                    f"center=({cx},{cy}) bbox={el.get('bbox',[])}"
                )
            if grounding_result:
                lines.append(
                    "\nOutput a DIRECT action using these coordinates. "
                    "Do NOT output ground."
                )
            else:
                lines.append(
                    "\nNothing found visually. Try BACK, a different ground query, or error."
                )

    # ── Phase-1: accessibility elements as context ─────────────────────────────
    elif ui_elements:
        # Filter out keyboard keys (single-letter buttons) — they clutter context
        _is_key = lambda e: (e.get("type") == "Button" and
                             len(e.get("label", "").strip()) == 1)
        visible = [el for el in ui_elements
                   if el.get("visible", True) and not _is_key(el)]
        offscreen = [el for el in ui_elements
                     if not el.get("visible", True) and not _is_key(el)]

        lines.append(f"\nVisible elements ({len(visible)}):")
        lines.append("  label                          | type           | tap(x,y)")
        lines.append("  " + "-" * 55)
        for el in visible[:25]:
            label = el.get("label", "")[:30]
            etype = el.get("type", "")[:14]
            cx = el.get("cx", 0)
            cy = el.get("cy", 0)
            lines.append(f"  {label:<30} | {etype:<14} | tap({cx},{cy})")

        if offscreen:
            lines.append(f"\nOff-screen elements ({len(offscreen)}, need scrolling to reach):")
            for el in offscreen[:15]:
                label = el.get("label", "")[:30]
                etype = el.get("type", "")[:14]
                cy = el.get("cy", 0)
                direction = "↓ below" if cy > 874 else "↑ above"
                lines.append(f"  {label:<30} | {etype:<14} | {direction} viewport")

        # Hint when elements clearly indicate in-app (Tab Bar + No Recents / Application)
        labels_lower = " ".join(el.get("label", "") for el in visible).lower()
        has_tab_bar = "tab bar" in labels_lower
        has_in_app = "no recents" in labels_lower or "application" in labels_lower
        task_lower = task.lower()
        open_app_task = "打開" in task or "open" in task_lower
        if has_tab_bar and has_in_app and open_app_task:
            lines.append(
                "\n→ IN-APP: Elements show Tab Bar and in-app UI (No Recents / Application). "
                "You are INSIDE the app, not on home screen. If task is to open this app → output done."
            )

        lines.append(
            "\nMatch task semantically to labels (Chinese→English). "
            "Tap visible elements directly. "
            "For off-screen elements: scroll toward them first, then tap."
        )
    else:
        lines.append(
            "\nNo accessibility elements available. "
            "Use ground() to locate elements visually, or swipe to reveal content."
        )

    lines.append("\nOutput ONE JSON action object.")
    parts.append({"type": "text", "text": "\n".join(lines)})
    return parts
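The SYSTEM_PROMPT above demands a single bare JSON action object. A caller might parse and validate the model's reply along these lines; `parse_action` is a hypothetical helper, not part of this package:

```python
import json
import re

# Action types defined by the SYSTEM_PROMPT's ACTIONS list.
VALID_ACTIONS = {"tap", "swipe", "input_text", "press_key",
                 "launch_app", "ground", "done", "error"}

def parse_action(raw: str) -> dict:
    """Extract and validate the single JSON action object from model output.

    Models occasionally wrap the JSON in a ``` fence despite instructions,
    so search for the outermost {...} instead of calling json.loads directly.
    """
    m = re.search(r"\{.*\}", raw, re.DOTALL)
    if not m:
        raise ValueError(f"no JSON object in model output: {raw!r}")
    action = json.loads(m.group(0))
    if action.get("action_type") not in VALID_ACTIONS:
        raise ValueError(f"unknown action_type: {action.get('action_type')!r}")
    return action
```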