ta-studio-mcp 1.1.0 → 1.2.0

package/README.md CHANGED
@@ -9,156 +9,69 @@ AI agents often struggle with project-specific context, unique navigation patter
9
9
  ## 📋 Prerequisites
10
10
 
11
11
  - **Node.js**: `v18.0.0` or higher.
12
- - **MCP Client**: An IDE or tool that supports the [Model Context Protocol](https://modelcontextprotocol.io) (e.g., Claude Desktop, Cursor, Windsurf, or VS Code with an MCP extension).
12
+ - **MCP Client**: An IDE or tool that supports the [Model Context Protocol](https://modelcontextprotocol.io) (e.g., Claude Desktop, Cursor, Windsurf, VS Code).
13
13
 
14
14
  ---
15
15
 
16
16
  ## ⚡ Quick Start
17
17
 
18
- You don't even need to install it locally to test it. Run the following command to see the server in action:
19
-
20
18
  ```bash
21
19
  npx ta-studio-mcp
22
20
  ```
23
- *(This will start the server in stdio mode, waiting for a client to connect.)*
24
21
 
25
22
  ---
26
23
 
27
- ## 🧠 Key Knowledge & Expert Implementation
24
+ ## 🧪 Methodologies & Deep Technical Lore
28
25
 
29
- This section contains the FULL documentation of the team's internal methodologies, patterns, and bug fixes.
26
+ This section documents the state-of-the-art implementations used by the TA Studio team.
30
27
 
31
- ### 1. OAVR Pattern: Observe-Act-Verify-Reason
28
+ ### 1. Model Tiering (Jan 2026 Standard)
29
+ We avoid "one size fits all" model selection. Models are tiered by "Thinking Budget":
30
+ - **Thinking Tier (GPT-5.2)**: Used for high-level orchestration (Coordinator) and complex visual reasoning. reasoning effort: `high`.
31
+ - **Core Tier (GPT-5-mini)**: Used for specialized specialists (Classifier, Verifier, Diagnosis). *Never use nano for classification.*
32
+ - **Utility Tier (GPT-5-nano)**: Used for MCP tool formatting, data distillation, and search cleanup.
32
33
 
33
- The device testing agent uses OAVR for autonomous navigation:
34
+ ### 2. Failure Taxonomy (OAVR "Reason")
35
+ The `Failure Diagnosis Specialist` uses a structured taxonomy to recover from errors:
36
+ - **PLANNING_ERROR**: Incorrect action choice. *Recovery*: Backtrack/Retry.
37
+ - **PERCEPTION_ERROR**: UI misrepresented in model. *Recovery*: Wait/Re-scan.
38
+ - **ENVIRONMENT_ERROR**: App crash/Dialogs. *Recovery*: Handle OS dialog/Restart.
39
+ - **EXECUTION_ERROR**: Click/Swipe failed to register. *Recovery*: Apply 5px jitter/Retry.
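Reviewer sketch, not part of the package: the four taxonomy entries above map naturally onto an enum plus a lookup table. The names `FailureType`, `RECOVERY`, `recovery_for`, and `jitter_candidates` are hypothetical, and the recovery labels are plain strings chosen for illustration. Only the category names and the 5px jitter figure come from the README.

```python
from enum import Enum


class FailureType(Enum):
    """Failure categories named in the README taxonomy."""
    PLANNING_ERROR = "planning_error"
    PERCEPTION_ERROR = "perception_error"
    ENVIRONMENT_ERROR = "environment_error"
    EXECUTION_ERROR = "execution_error"


# Hypothetical labels for the recovery actions listed next to each category.
RECOVERY = {
    FailureType.PLANNING_ERROR: "backtrack_or_retry",
    FailureType.PERCEPTION_ERROR: "wait_and_rescan",
    FailureType.ENVIRONMENT_ERROR: "handle_dialog_or_restart",
    FailureType.EXECUTION_ERROR: "jitter_and_retry",
}


def recovery_for(failure: FailureType) -> str:
    """Look up the recovery label for a diagnosed failure."""
    return RECOVERY[failure]


def jitter_candidates(x: int, y: int, offset: int = 5) -> list[tuple[int, int]]:
    """Return the original point plus four points shifted by `offset` pixels."""
    return [
        (x, y),
        (x + offset, y),
        (x - offset, y),
        (x, y + offset),
        (x, y - offset),
    ]
```

How the real `Failure Diagnosis Specialist` chooses a jitter direction is not shown in this diff; the four-neighbour pattern here is one possible choice.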
34
40
 
35
- 1. **Observe** → Screen Classifier Agent analyzes current screen state.
36
- 2. **Act** → Execute action (click, swipe, type) via Mobile MCP.
37
- 3. **Verify** → Action Verifier confirms the action succeeded.
38
- 4. **Reason** → Failure Diagnosis suggests recovery if verification failed.
41
+ ### 3. Mobile MCP ADB Fallback
42
+ Mobile MCP v0.0.36 fails if *any* device is offline. Our client implements a comprehensive ADB bridge:
43
+ - **Fast Screenshot**: `exec-out screencap -p` (Base64 direct stream).
44
+ - **Fast UI Dump**: `uiautomator dump /dev/tty` (No temp file I/O).
45
+ - **Control**: Direct `input tap`, `input swipe`, and `am start -n` activity mapping.
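Reviewer sketch, not part of the package: the ADB commands listed above can be assembled as argument lists and run with `subprocess`. The helper names are hypothetical, and pinning every call to a serial with `adb -s` is an assumption about how an offline sibling device might be sidestepped. The diff does not show how `mobile_mcp_client.py` actually does it.

```python
import base64
import subprocess


def adb_argv(serial: str, *args: str) -> list[str]:
    """Build an adb command line pinned to one device serial."""
    return ["adb", "-s", serial, *args]


def screenshot_cmd(serial: str) -> list[str]:
    # exec-out streams the PNG bytes over stdout, with no temp file on the device.
    return adb_argv(serial, "exec-out", "screencap", "-p")


def ui_dump_cmd(serial: str) -> list[str]:
    # Dumping to /dev/tty prints the UI hierarchy XML to stdout.
    return adb_argv(serial, "shell", "uiautomator", "dump", "/dev/tty")


def tap_cmd(serial: str, x: int, y: int) -> list[str]:
    return adb_argv(serial, "shell", "input", "tap", str(x), str(y))


def swipe_cmd(serial: str, x1: int, y1: int, x2: int, y2: int,
              duration_ms: int = 300) -> list[str]:
    return adb_argv(serial, "shell", "input", "swipe",
                    str(x1), str(y1), str(x2), str(y2), str(duration_ms))


def start_activity_cmd(serial: str, component: str) -> list[str]:
    # component has the form "package/activity".
    return adb_argv(serial, "shell", "am", "start", "-n", component)


def run(argv: list[str], timeout: float = 15.0) -> bytes:
    """Run an adb command and return raw stdout; raises on a non-zero exit."""
    return subprocess.run(argv, check=True, capture_output=True,
                          timeout=timeout).stdout


def screenshot_base64(serial: str) -> str:
    """Capture a PNG screenshot and return it Base64-encoded."""
    return base64.b64encode(run(screenshot_cmd(serial))).decode("ascii")
```

Building the argument lists separately from `run` keeps the command construction testable without a connected device.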
39
46
 
40
- **Implementation Details:**
41
- - **Handoff Logic**: Sub-agents are triggered via `@tool` decorators in the `DeviceTestingAgent` class.
42
- - **State Management**: The `session_id` is passed through all tools to ensure logs are grouped.
43
- - **Fallback**: If `Screen Classifier` fails to identify an element, the agent automatically falls back to `vision_click` (pixel-based).
47
+ ### 4. Golden Bug Metrics
48
+ We measure agent reliability via a two-stage deterministic pipeline:
49
+ - **Planning Judge**: Static analysis + LLM verification before device boot.
50
+ - **Execution Judge**: Reproduction and AI verification of goal state.
51
+ - **Output**: Precision, Recall, and F1 scores aggregated in `data/agent_runs/golden`.
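Reviewer sketch, not part of the package: the precision, recall, and F1 scores named above follow the standard definitions over true positives, false positives, and false negatives. The function name and the zero-division convention (return `0.0`) are assumptions. The diff does not show how `golden_bug_service.py` aggregates its counts.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion counts.

    Returns 0.0 for any score whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, 8 reproduced bugs, 2 false alarms, and 2 missed bugs give a precision and recall of 0.8 each, and therefore an F1 of 0.8.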
44
52
 
45
53
  ---
46
54
 
47
- ### 2. Set-of-Mark (SoM) Screenshot Annotation
48
-
49
- Based on OmniParser's SoM approach: color-coded, type-aware bounding boxes.
50
-
51
- **9-Color Element Type Palette:**
52
- | Type | Color | Tag | Example Classes |
53
- |-----------|-------------|--------|-----------------------------------|
54
- | button | Dodger blue | BTN | Button, ImageButton, FAB |
55
- | input | Orange | INPUT | EditText, SearchView |
56
- | toggle | Purple | TOGGLE | Switch, CheckBox, RadioButton |
57
- | nav | Deep pink | NAV | BottomNavigationView, Toolbar |
58
- | image | Dark cyan | IMG | ImageView |
59
- | text | Gray | TXT | TextView |
60
- | list | Forest green| LIST | RecyclerView, ListView |
61
- | container | Dark gray | BOX | FrameLayout, LinearLayout |
62
- | unknown | Green | ELEM | Unclassified elements |
63
-
64
- **Implementation Details:**
65
- - **PIL Threading**: Drawing 50+ bounding boxes with antialiasing is CPU intensive. We use `asyncio.to_thread(_draw_bounding_boxes_threaded)` to keep the event loop non-blocking.
66
- - **Class Priority**: Substring matching uses a prioritized list. `radiobutton` is checked before `button` to prevent incorrect classification.
67
- - **TOON Optimization**: The `list_elements_on_screen` output is converted to **TOON** (Token Optimized Object Notation) which strips redundant metadata (package name, resource-id, etc.) to save 40% in prompt tokens.
68
- - **Font Scaling**: `font_size = int(img_width / 54)`. On a 1080×2400 screen (scaled to 486×1080), this yields a readable ~9px font.
69
-
70
- ---
55
+ ## 🐞 Critical Bug Fixes (Implementation Level)
71
56
 
72
- ### 3. Coordinate Scaling (Screenshot vs Device Resolution)
73
-
74
- **The Problem:**
75
- Mobile MCP `take_screenshot` returns JPEG images scaled to ~45% of native resolution, but `list_elements_on_screen` returns coordinates in native device resolution.
76
-
77
- | Layer | Resolution | Source |
78
- |--------------------|-------------|----------------------------------|
79
- | Device screen | 1080×2400 | Native resolution |
80
- | Screenshot image | 486×1080 | Mobile MCP compresses to JPEG |
81
- | Element coordinates| 1080×2400 | list_elements_on_screen (native) |
82
-
83
- **Implementation Details:**
84
- 1. **Parse Resolution**: Get screen size via `get_screen_size()` and parse with:
85
- ```python
86
- re.search(r'(\d+)\s*x\s*(\d+)', screen_info)
87
- ```
88
- 2. **Scaling Logic**:
89
- ```python
90
- scale_x = img.width / screen_width # e.g., 486 / 1080 = 0.45
91
- scale_y = img.height / screen_height # e.g., 1080 / 2400 = 0.45
92
- ```
93
- 3. **Coordinate Transformation**:
94
- `target_x = raw_x * scale_x`
95
- `target_y = raw_y * scale_y`
96
-
97
- **Warning**: If scaling is omitted, PIL drawing will fail with `IndexError` or draw elements entirely off-canvas as native coordinates exceed the 1080px image height.
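Reviewer sketch, not part of the package: the three removed steps (parse, scale, transform) combine into two small functions. The regex and the scale formulas are taken from the removed README text. The function names and the exact text returned by `get_screen_size()` are assumptions.

```python
import re


def parse_screen_size(screen_info: str) -> tuple[int, int]:
    """Extract (width, height) from text such as 'Screen size: 1080x2400'."""
    match = re.search(r'(\d+)\s*x\s*(\d+)', screen_info)
    if match is None:
        raise ValueError(f"no WxH pattern in {screen_info!r}")
    return int(match.group(1)), int(match.group(2))


def scale_point(raw_x: int, raw_y: int,
                img_size: tuple[int, int],
                screen_size: tuple[int, int]) -> tuple[int, int]:
    """Map native device coordinates onto the downscaled screenshot."""
    scale_x = img_size[0] / screen_size[0]   # e.g. 486 / 1080 = 0.45
    scale_y = img_size[1] / screen_size[1]   # e.g. 1080 / 2400 = 0.45
    return round(raw_x * scale_x), round(raw_y * scale_y)
```

With a 486×1080 screenshot of a 1080×2400 device, the native screen centre (540, 1200) lands at (243, 540), which is inside the image bounds.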
98
-
99
- ---
100
-
101
- ### 4. Flicker Detection Pipeline (4-Layer)
102
-
103
- Detects screen flickers too fast for periodic screenshots (16-200ms).
104
-
105
- 1. **Layer 1 (Trigger)**: High-speed recording using `adb shell screenrecord --time-limit 10 /sdcard/flicker.mp4`.
106
- 2. **Layer 2 (Extraction)**: `ffmpeg` scene filtering to extract only candidate frames:
107
- ```bash
108
- ffmpeg -i in.mp4 -vf "select='gt(scene,0.003)'" -vsync vfr out_%03d.jpg
109
- ```
110
- 3. **Layer 3 (Analysis)**: Structural Similarity Index (SSIM) calculated between consecutive frame pairs. Drops > 0.15 are flagged for human/AI review.
111
- 4. **Layer 4 (LLM)**: GPT-5.2 Vision analyzes the visual delta to distinguish between "UI Glitch" and "Expected Animation".
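Reviewer sketch, not part of the package: Layer 3 can be illustrated on a list of precomputed SSIM scores, one per consecutive frame pair. The README does not say what a "drop" is measured against. This sketch assumes it means a fall of more than 0.15 relative to the previous pair's score, and the function name is hypothetical.

```python
def flag_ssim_drops(ssim_scores: list[float], max_drop: float = 0.15) -> list[int]:
    """Return indices whose SSIM fell by more than `max_drop` from the previous score."""
    flagged = []
    for i in range(1, len(ssim_scores)):
        drop = ssim_scores[i - 1] - ssim_scores[i]
        if drop > max_drop:
            flagged.append(i)
    return flagged
```

Under the other plausible reading, where a drop is measured from perfect similarity, the condition would instead be `1.0 - ssim_scores[i] > max_drop`.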
112
-
113
- ---
114
-
115
- ### 5. Agent Configuration Patterns
116
-
117
- **Coordinator Agent (GPT-5.2):**
118
- - `parallel_tool_calls=True` (orchestration tasks can be parallel)
119
- - `reasoning=Reasoning(effort="high")`
120
- - Delegates to specialized assistants.
121
-
122
- **Device Testing Agent (GPT-5-mini):**
123
- - `parallel_tool_calls=False` ← **CRITICAL**.
124
- - `reasoning=Reasoning(effort="medium")`.
125
-
126
- **Why `parallel_tool_calls=False`?**
127
- If set to True, the agent often sends multiple tool calls in one turn (e.g., `list_devices` and `take_screenshot`). This fails because the emulator connection is a single active session. Sequential execution ensures the session state remains stable.
128
-
129
- ---
130
-
131
- ## 🐞 Verified Bug Fixes (Known Issues)
132
-
133
- | Severity | Issue | Root Cause & Fix |
134
- |----------|-------|------------------|
135
- | **CRITICAL** | Bbox Misalignment | **RC**: 45% JPEG scaling vs native coords. **Fix**: Apply `img_width / native_width` scale factors. |
136
- | **CRITICAL** | Missing asyncio | **RC**: `asyncio.to_thread` used without import. **Fix**: Added `import asyncio`. |
137
- | **CRITICAL** | Async def in to_thread | **RC**: `async def` passed to `to_thread` returns coroutine, not result. **Fix**: Remove `async` keyword. |
138
- | **HIGH** | Arg Mismatch | **RC**: Positional args passed to keyword-only (`*`) params. **Fix**: Use `func(x=x, y=y)`. |
139
- | **HIGH** | Path Not Reset | **RC**: State pointed to `.annotated.png` even if drawing failed. **Fix**: Reset in `except` block. |
140
- | **CRITICAL** | Race Condition | **RC**: Agent calling `list` + `start` simultaneously. **Fix**: `parallel_tool_calls=False`. |
141
- | **MEDIUM** | Figma Rate Limit | **RC**: Figma API 429 errors. **Fix**: Direct CV overlay via Playwright + brightness detection. |
142
- | **MEDIUM** | OpenAI Rate Limit | **RC**: Verbose XML hierarchies. **Fix**: Integrated **TOON** format (65% reduction). |
143
- | **HIGH** | Chef JSON Crash | **RC**: Multiple `response.json()` calls + null fields. **Fix**: Guard `JSON.parse` + coalescing operators. |
57
+ | Severity | Issue | Root Cause & Expert Fix |
58
+ |----------|-------|-------------------------|
59
+ | **CRITICAL** | Bbox Misalignment | **RC**: 45% Scaling Delta. **Fix**: Apply `img_width / native_width` factor. |
60
+ | **CRITICAL** | Async to_thread | **RC**: CORO vs CALL. **Fix**: Remove `async` from functions passed to `asyncio.to_thread`. |
61
+ | **CRITICAL** | Race Condition | **RC**: Parallel sessions. **Fix**: `parallel_tool_calls=False` for sequential testing. |
62
+ | **HIGH** | Simulation Leak | **RC**: Memory persistence. **Fix**: 24h/100-run auto-purge with `asyncio.Lock` safety. |
63
+ | **HIGH** | Figma Rate Limit | **RC**: API 429 status. **Fix**: Direct CV overlay via Playwright + brightness detection. |
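Reviewer sketch, not part of the package: the `Async to_thread` row can be reproduced with the standard library alone. `asyncio.to_thread` calls the function in a worker thread, so an `async def` returns a coroutine object there and its body never runs. The function names are hypothetical.

```python
import asyncio


def draw_boxes(count: int) -> str:
    """Plain function: safe to hand to asyncio.to_thread."""
    return f"drew {count} boxes"


async def draw_boxes_async(count: int) -> str:
    """Coroutine function: calling it in a thread only creates a coroutine."""
    return f"drew {count} boxes"


async def main() -> tuple[str, bool]:
    ok = await asyncio.to_thread(draw_boxes, 3)
    bad = await asyncio.to_thread(draw_boxes_async, 3)
    got_coroutine = asyncio.iscoroutine(bad)
    bad.close()  # discard the unawaited coroutine to avoid a RuntimeWarning
    return ok, got_coroutine
```

Running `asyncio.run(main())` returns the string result for the plain function and `True` for the coroutine check, which is the bug the fix removes.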
144
64
 
145
65
  ---
146
66
 
147
67
  ## 🔄 Core Workflows
148
68
 
149
- ### Bug Fix Workflow
150
- 1. **DIAGNOSE**: `tail -f /tmp/backend.log`
151
- 2. **LOCATE**: Find affected file.
152
- 3. **FIX**: Targeted minimal change.
153
- 4. **VERIFY**: `pytest --tb=short`
154
- 5. **COMMIT**: Once green.
155
-
156
- ### Closed-Loop Verification (The Ralph Loop)
157
- 1. **CODE**: Implement change.
158
- 2. **LINT**: Run `mypy` or `eslint`.
159
- 3. **UNIT TEST**: Run specific failing test.
160
- 4. **CHECK ASYNC**: Verify no `async def` in `to_thread`.
161
- 5. **VERIFY HUD**: Watch the emulator stream while agent runs.
69
+ ### The Ralph Loop (Closed-Loop Verification)
70
+ 1. **CODE** → Implement.
71
+ 2. **LINT** → `mypy` / `eslint`.
72
+ 3. **UNIT TEST** → Specific module verification.
73
+ 4. **CHECK ASYNC** → Confirm `to_thread` safety.
74
+ 5. **VERIFY HUD** → Watch the emulator stream while agent runs.
162
75
 
163
76
  ---
164
77
 
@@ -166,11 +79,13 @@ If set to True, the agent often sends multiple tool calls in one turn (e.g., `li
166
79
 
167
80
  ### Claude Desktop
168
81
  ```bash
169
- claude mcp add ta-studio -- npx -y ta-studio-mcp
82
+ claude mcp add ta-studio -- npx -y ta-studio-mcp@latest
170
83
  ```
171
84
 
172
85
  ### Cursor / Windsurf
173
- Add `npx -y ta-studio-mcp` as a command-type MCP server in your IDE settings.
86
+ Add `npx -y ta-studio-mcp@latest` as a command-type MCP server.
87
+
88
+ ---
174
89
 
175
90
  ## 📜 License
176
91
  MIT © 2026 TA Studios.
package/dist/index.js CHANGED
@@ -14,7 +14,7 @@ import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js'
14
14
  import { registerAllTools } from './tools/register-all.js';
15
15
  const server = new McpServer({
16
16
  name: 'ta-studio-mcp',
17
- version: '1.1.0',
17
+ version: '1.2.0',
18
18
  }, {
19
19
  capabilities: {
20
20
  logging: {},
@@ -1 +1 @@
1
- {"version":3,"file":"codebase-map.d.ts","sourceRoot":"","sources":["../../src/knowledge/codebase-map.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,eAAO,MAAM,iBAAiB,EAAE,MAAM,CAAC,MAAM,EAAE,MAAM,CAiHpD,CAAC;AAEF,eAAO,MAAM,qBAAqB,UAAiC,CAAC"}
1
+ {"version":3,"file":"codebase-map.d.ts","sourceRoot":"","sources":["../../src/knowledge/codebase-map.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,eAAO,MAAM,iBAAiB,EAAE,MAAM,CAAC,MAAM,EAAE,MAAM,CAyHpD,CAAC;AAEF,eAAO,MAAM,qBAAqB,UAAiC,CAAC"}
@@ -11,6 +11,14 @@ my-fullstack-app/
11
11
  ├── backend/app/ # FastAPI backend (Python 3.11+)
12
12
  ├── frontend/test-studio/ # React + TypeScript + Vite frontend
13
13
  ├── integrations/chef/ # Chef AI agent integration (Remix)
14
+ │ ├── device_testing/ # Mobile test execution logic
15
+ │ │ ├── subagents/ # OAVR specialist agents (Classifier, Verifier, Diagnosis)
16
+ │ │ ├── tools/ # MCP tool implementations (Navigation, Agentic Vision)
17
+ │ │ ├── mobile_mcp_client.py # Mobile MCP client with ADB fallback
18
+ │ │ ├── golden_bug_service.py # Evaluates agent reliability metrics
19
+ │ │ └── autonomous_exploration_service.py # Goal-agnostic curiosity
20
+ │ ├── api/ # API endpoints
21
+ │ └── observability/ # Tracing and metrics
14
22
  ├── packages/ # Local npm packages (ta-studio-mcp)
15
23
  ├── scripts/ # Utility scripts
16
24
  ├── tests/ # E2E and manual tests
@@ -1 +1 @@
1
- {"version":3,"file":"codebase-map.js","sourceRoot":"","sources":["../../src/knowledge/codebase-map.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,MAAM,CAAC,MAAM,iBAAiB,GAA2B;IACvD,QAAQ,EAAE;;;;;;;;;;;;qDAYyC;IAEnD,OAAO,EAAE;;;;;;;;;;;;;;;;;;;;;;;iDAuBsC;IAE/C,QAAQ,EAAE;;;;;;;;;;;;;;;0DAe8C;IAExD,MAAM,EAAE;;;;;;;;;;;;;;;;;oEAiB0D;IAElE,OAAO,EAAE;;;;;;;;;;yDAU8C;IAEvD,YAAY,EAAE;;;;;;;;;;;;;;8DAc8C;IAE5D,MAAM,EAAE;;;;;;;;mEAQyD;CAClE,CAAC;AAEF,MAAM,CAAC,MAAM,qBAAqB,GAAG,MAAM,CAAC,IAAI,CAAC,iBAAiB,CAAC,CAAC"}
1
+ {"version":3,"file":"codebase-map.js","sourceRoot":"","sources":["../../src/knowledge/codebase-map.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,MAAM,CAAC,MAAM,iBAAiB,GAA2B;IACvD,QAAQ,EAAE;;;;;;;;;;;;;;;;;;;;qDAoByC;IAEnD,OAAO,EAAE;;;;;;;;;;;;;;;;;;;;;;;iDAuBsC;IAE/C,QAAQ,EAAE;;;;;;;;;;;;;;;0DAe8C;IAExD,MAAM,EAAE;;;;;;;;;;;;;;;;;oEAiB0D;IAElE,OAAO,EAAE;;;;;;;;;;yDAU8C;IAEvD,YAAY,EAAE;;;;;;;;;;;;;;8DAc8C;IAE5D,MAAM,EAAE;;;;;;;;mEAQyD;CAClE,CAAC;AAEF,MAAM,CAAC,MAAM,qBAAqB,GAAG,MAAM,CAAC,IAAI,CAAC,iBAAiB,CAAC,CAAC"}
@@ -1,6 +1,6 @@
1
1
  /**
2
2
  * Code conventions and style guidelines.
3
3
  */
4
- export declare const CONVENTIONS = "# TA Studio Code Conventions\n\n## Python (Backend)\n- Imports: absolute imports from app. prefix\n- Type hints: Required for ALL function signatures\n- Docstrings: Google style for public functions\n- Async: Use async/await for I/O operations\n- Logging: Use logging.getLogger(__name__)\n\nExample:\n```python\nfrom app.agents.device_testing.mobile_mcp_client import MobileMCPClient\n\nasync def take_screenshot(device_id: str) -> str:\n \"\"\"Take screenshot from device.\n \n Args:\n device_id: Android device identifier\n \n Returns:\n Base64-encoded screenshot data\n \"\"\"\n pass\n```\n\n## TypeScript (Frontend)\n- Components: Functional components with hooks, one per file\n- Types: Explicit types, avoid any\n- Imports: Use @/ path alias for src imports\n- State: React hooks for local state, TanStack Query for server state\n\nExample:\n```typescript\ninterface DeviceProps {\n deviceId: string;\n onStreamStart: (id: string) => void;\n}\nconst DeviceCard: React.FC<DeviceProps> = ({ deviceId, onStreamStart }) => { ... };\n```\n\n## Agent Code Patterns\n- Agent-as-tool pattern: coordinator delegates to specialized agents\n- Colocation: agent code + tools + models together\n- Factory pattern: agent creation via factory functions\n- DRY: no duplicate code across modules\n\n## Convex / Template Literals\n- Use \\n escape sequences (not multi-line templates) in Convex actions\n- Why: Easier to diff-review, auto-formatters don't mess indentation\n\n## Mobile MCP Data Shapes\n- Screenshots: { type: \"image\", data: \"base64...\", mimeType: \"image/jpeg\" }\n- ALWAYS keep structured, NEVER JSON.stringify for model consumption\n- Vision-ready: convert to data-URL: data:{mime};base64,{b64}\n\n## Critical Rules\n1. NEVER share node_modules between Chef (React 18) and TA frontend (React 19)\n2. NEVER commit without running verification\n3. ALWAYS auto-select first device (prefer emulator-5554)\n4. 
ALWAYS scale coordinates before drawing bounding boxes\n5. ALWAYS wrap JSON.parse in try-catch for external payloads\n6. ALWAYS use keyword arguments for functions with *, syntax\n7. NEVER pass async functions to asyncio.to_thread()\n";
4
+ export declare const CONVENTIONS = "# TA Studio Code Conventions\n\n## Python (Backend)\n- Imports: absolute imports from app. prefix\n- Type hints: Required for ALL function signatures\n- Docstrings: Google style for public functions\n- Async: Use async/await for I/O operations\n- Logging: Use logging.getLogger(__name__)\n- Subagents: Use specialized agents for Perception (Screen Classifier), Action (Verifier), and Diagnosis.\n- Concurrency: Use asyncio.Semaphore and asyncio.Lock for multi-device simulation safety.\n- Model Tiering: GPT-5.2 (Thinking), GPT-5-mini (Core), GPT-5-nano (Utilities).\n- Fallback: ALWAYS implement ADB fallback for Mobile MCP operations.\n\nExample:\n```python\nfrom app.agents.device_testing.mobile_mcp_client import MobileMCPClient\n\nasync def take_screenshot(device_id: str) -> str:\n \"\"\"Take screenshot from device.\n \n Args:\n device_id: Android device identifier\n \n Returns:\n Base64-encoded screenshot data\n \"\"\"\n pass\n```\n\n## TypeScript (Frontend)\n- Components: Functional components with hooks, one per file\n- Types: Explicit types, avoid any\n- Imports: Use @/ path alias for src imports\n- State: React hooks for local state, TanStack Query for server state\n\nExample:\n```typescript\ninterface DeviceProps {\n deviceId: string;\n onStreamStart: (id: string) => void;\n}\nconst DeviceCard: React.FC<DeviceProps> = ({ deviceId, onStreamStart }) => { ... 
};\n```\n\n## Agent Code Patterns\n- Agent-as-tool pattern: coordinator delegates to specialized agents\n- Colocation: agent code + tools + models together\n- Factory pattern: agent creation via factory functions\n- DRY: no duplicate code across modules\n\n## Convex / Template Literals\n- Use \\n escape sequences (not multi-line templates) in Convex actions\n- Why: Easier to diff-review, auto-formatters don't mess indentation\n\n## Mobile MCP Data Shapes\n- Screenshots: { type: \"image\", data: \"base64...\", mimeType: \"image/jpeg\" }\n- ALWAYS keep structured, NEVER JSON.stringify for model consumption\n- Vision-ready: convert to data-URL: data:{mime};base64,{b64}\n\n## Critical Rules\n1. NEVER share node_modules between Chef (React 18) and TA frontend (React 19)\n2. NEVER commit without running verification\n3. ALWAYS auto-select first device (prefer emulator-5554)\n4. ALWAYS scale coordinates before drawing bounding boxes\n5. ALWAYS wrap JSON.parse in try-catch for external payloads\n6. ALWAYS use keyword arguments for functions with *, syntax\n7. NEVER pass async functions to asyncio.to_thread()\n";
5
5
  export declare const AGENT_CONFIG_REFERENCE = "# Agent Configuration Reference\n\n## Coordinator Agent\n- Model: gpt-5.2\n- parallel_tool_calls: True\n- reasoning: Reasoning(effort=\"high\")\n- Handoffs: Search Assistant, Test Generation, Device Testing\n- Instructions: General orchestration, task routing\n\n## Device Testing Agent\n- Model: gpt-5-mini (vision-capable)\n- parallel_tool_calls: False (CRITICAL \u2014 navigation is sequential)\n- reasoning: Reasoning(effort=\"medium\")\n- Tools: take_screenshot, list_elements, click, swipe, type, vision_click\n- Instructions: OAVR pattern, auto-select device, never ask user\n\n## Search Assistant\n- Model: gpt-5-mini\n- Purpose: Bug/scenario search in knowledge base\n\n## Test Generation Specialist\n- Model: gpt-5-mini\n- Purpose: Generate test code from bug descriptions\n\n## Streaming\n- SSE (Server-Sent Events) for AI chat responses\n- WebSocket for emulator frame streaming\n- OpenAI Agents SDK Runner.run_streamed() for agent execution\n";
6
6
  //# sourceMappingURL=conventions.d.ts.map
@@ -1 +1 @@
1
- {"version":3,"file":"conventions.d.ts","sourceRoot":"","sources":["../../src/knowledge/conventions.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,eAAO,MAAM,WAAW,opEA+DvB,CAAC;AAEF,eAAO,MAAM,sBAAsB,g8BA4BlC,CAAC"}
1
+ {"version":3,"file":"conventions.d.ts","sourceRoot":"","sources":["../../src/knowledge/conventions.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,eAAO,MAAM,WAAW,i/EAmEvB,CAAC;AAEF,eAAO,MAAM,sBAAsB,g8BA4BlC,CAAC"}
@@ -9,6 +9,10 @@ export const CONVENTIONS = `# TA Studio Code Conventions
9
9
  - Docstrings: Google style for public functions
10
10
  - Async: Use async/await for I/O operations
11
11
  - Logging: Use logging.getLogger(__name__)
12
+ - Subagents: Use specialized agents for Perception (Screen Classifier), Action (Verifier), and Diagnosis.
13
+ - Concurrency: Use asyncio.Semaphore and asyncio.Lock for multi-device simulation safety.
14
+ - Model Tiering: GPT-5.2 (Thinking), GPT-5-mini (Core), GPT-5-nano (Utilities).
15
+ - Fallback: ALWAYS implement ADB fallback for Mobile MCP operations.
12
16
 
13
17
  Example:
14
18
  \`\`\`python
@@ -1 +1 @@
1
- {"version":3,"file":"conventions.js","sourceRoot":"","sources":["../../src/knowledge/conventions.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,MAAM,CAAC,MAAM,WAAW,GAAG;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;CA+D1B,CAAC;AAEF,MAAM,CAAC,MAAM,sBAAsB,GAAG;;;;;;;;;;;;;;;;;;;;;;;;;;;;CA4BrC,CAAC"}
1
+ {"version":3,"file":"conventions.js","sourceRoot":"","sources":["../../src/knowledge/conventions.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,MAAM,CAAC,MAAM,WAAW,GAAG;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;CAmE1B,CAAC;AAEF,MAAM,CAAC,MAAM,sBAAsB,GAAG;;;;;;;;;;;;;;;;;;;;;;;;;;;;CA4BrC,CAAC"}
@@ -1 +1 @@
1
- {"version":3,"file":"methodology.d.ts","sourceRoot":"","sources":["../../src/knowledge/methodology.ts"],"names":[],"mappings":"AAAA;;;GAGG;AAEH,eAAO,MAAM,kBAAkB,EAAE,MAAM,CAAC,MAAM,EAAE,MAAM,CAqFrD,CAAC;AA2FF,eAAO,MAAM,sBAAsB,UAAkC,CAAC"}
1
+ {"version":3,"file":"methodology.d.ts","sourceRoot":"","sources":["../../src/knowledge/methodology.ts"],"names":[],"mappings":"AAAA;;;GAGG;AAEH,eAAO,MAAM,kBAAkB,EAAE,MAAM,CAAC,MAAM,EAAE,MAAM,CAsKrD,CAAC;AAEF,eAAO,MAAM,sBAAsB,UAAkC,CAAC"}
@@ -5,7 +5,7 @@
5
5
  export const METHODOLOGY_TOPICS = {
6
6
  overview: `# TA Studio Methodologies: Overview
7
7
 
8
- Available topics: oavr, som_annotation, coordinate_scaling, agent_config, flicker_detection, figma_flow, golden_bugs, mobile_mcp, vision_click, self_correction, closed_loop, device_auto_select, parallel_tool_calls
8
+ Available topics: oavr, som_annotation, coordinate_scaling, agent_config, flicker_detection, golden_bugs, mobile_mcp, vision_click, failure_diagnosis, self_correction, model_tiering, simulation_lifecycle
9
9
 
10
10
  ## Architecture
11
11
  - **Backend**: FastAPI (Python 3.11+) at backend/
@@ -24,9 +24,9 @@ The device testing agent uses OAVR for autonomous navigation:
24
24
  4. **Reason** → Failure Diagnosis suggests recovery if verification failed
25
25
 
26
26
  ## Implementation Details
27
- - **Handoff Logic**: Sub-agents are triggered via \`@tool\` decorators in the \`DeviceTestingAgent\` class.
28
- - **State Management**: The \`session_id\` is passed through all tools to ensure logs are grouped.
29
- - **Fallback**: If \`Screen Classifier\` fails to identify an element, the agent automatically falls back to \`vision_click\` (pixel-based).
27
+ - **Handoff Logic**: Sub-agents are triggered via @tool decorators in the DeviceTestingAgent class.
28
+ - **State Management**: The session_id is passed through all tools to ensure logs are grouped.
29
+ - **Fallback**: If Screen Classifier fails to identify an element, the agent automatically falls back to vision_click.
30
30
 
31
31
  Key files:
32
32
  - agents/device_testing/subagents/screen_classifier_agent.py
@@ -50,15 +50,14 @@ Based on OmniParser's SoM approach: color-coded, type-aware bounding boxes.
50
50
  | unknown | Green | ELEM | Unclassified elements |
51
51
 
52
52
  ## Implementation Details
53
- - **PIL Threading**: Drawing 50+ bounding boxes with antialiasing is CPU intensive. We use \`asyncio.to_thread(_draw_bounding_boxes_threaded)\` to keep the event loop non-blocking.
54
- - **Class Priority**: Substring matching uses a prioritized list. \`radiobutton\` is checked before \`button\` to prevent incorrect classification.
55
- - **TOON Optimization**: The \`list_elements_on_screen\` output is converted to **TOON** (Token Optimized Object Notation) which strips redundant metadata to save 40% in prompt tokens.
56
- - **Font Scaling**: \`font_size = int(img_width / 54)\`. On a 1080×2400 screen (scaled to 486×1080), this yields a readable ~9px font.
53
+ - **PIL Threading**: Drawing 50+ bounding boxes with antialiasing is CPU intensive. We use asyncio.to_thread(_draw_bounding_boxes_threaded) to keep the event loop non-blocking.
54
+ - **Class Priority**: Substring matching uses a prioritized list. radiobutton is checked before button to prevent incorrect classification.
55
+ - **TOON Optimization**: The list_elements_on_screen output is converted to TOON (Token Optimized Object Notation) which strips redundant metadata to save 40% in prompt tokens.
56
+ - **Font Scaling**: font_size = int(img_width / 54).
57
57
 
58
58
  Key file: agents/device_testing/tools/autonomous_navigation_tools.py`,
59
59
  coordinate_scaling: `# Coordinate Scaling: Screenshot vs Device Resolution
60
60
 
61
- ## The Problem
62
61
  Mobile MCP take_screenshot returns JPEG images scaled to ~45% of native resolution, but list_elements_on_screen returns coordinates in native device resolution.
63
62
 
64
63
  | Layer | Resolution | Source |
@@ -68,105 +67,97 @@ Mobile MCP take_screenshot returns JPEG images scaled to ~45% of native resoluti
68
67
  | Element coordinates| 1080×2400 | list_elements_on_screen (native) |
69
68
 
70
69
  ## Implementation Details
71
- 1. **Parse Resolution**: Get screen size via \`get_screen_size()\` and parse with:
72
- \`\`\`python
73
- re.search(r'(\d+)\s*x\s*(\d+)', screen_info)
74
- \`\`\`
75
- 2. **Scaling Logic**:
76
- \`\`\`python
77
- scale_x = img.width / screen_width # e.g., 486 / 1080 = 0.45
78
- scale_y = img.height / screen_height # e.g., 1080 / 2400 = 0.45
79
- \`\`\`
80
- 3. **Coordinate Transformation**:
81
- \`target_x = raw_x * scale_x\`
82
- \`target_y = raw_y * scale_y\`
83
-
84
- **Warning**: If scaling is omitted, PIL drawing will fail with \`IndexError\` or draw elements entirely off-canvas as native coordinates exceed the 1080px image height.
70
+ 1. **Parse Resolution**: Get screen size via get_screen_size() and parse with regex.
71
+ 2. **Scaling Logic**: scale_x = img.width / screen_width.
72
+ 3. **Coordinate Transformation**: target_x = raw_x * scale_x.
85
73
 
86
74
  Key file: autonomous_navigation_tools.py lines 397-595`,
87
- };
88
- // Additional topics added below to stay within file-size limits
89
- METHODOLOGY_TOPICS.agent_config = `# Agent Configuration Patterns
75
+ flicker_detection: `# Flicker Detection Pipeline: 4-Layer Architecture
90
76
 
91
- ## Coordinator Agent (GPT-5.2)
92
- - parallel_tool_calls=True (orchestration tasks can be parallel)
93
- - reasoning=Reasoning(effort="high")
94
- - Dynamic handoffs via is_enabled callbacks
95
- - Delegates to: Search Assistant, Test Generation Specialist, Device Testing Specialist
77
+ Detects screen flickers too fast for periodic screenshots (16-200ms).
96
78
 
97
- ## Device Testing Agent (GPT-5-mini)
98
- - parallel_tool_calls=False ← CRITICAL
99
- - reasoning=Reasoning(effort="medium")
100
- - Auto-selects first device (prefers emulator-5554)
79
+ 1. **Layer 1 (Trigger)**: adb shell screenrecord --time-limit 10.
80
+ 2. **Layer 2 (Extraction)**: ffmpeg scene filtering (select='gt(scene,0.003)').
81
+ 3. **Layer 3 (Analysis)**: SSIM calculated between consecutive pairs. Drops > 0.15 are flagged.
82
+ 4. **Layer 4 (LLM)**: GPT-5.2 Vision verification.
101
83
 
102
- ## Why parallel_tool_calls=False
103
- If set to True, GPT-5-mini often sends multiple tool calls in one turn (e.g., \`list_devices\` and \`take_screenshot\`). This fails because the emulator connection is a single active session. Sequential execution ensures:
104
- 1. Session is established.
105
- 2. Screen is observed.
106
- 3. Action is taken.
84
+ Key file: agents/device_testing/flicker_detection_service.py`,
85
+ golden_bugs: `# Golden Bug Evaluation Pipeline
107
86
 
108
- Key file: agents/device_testing/device_testing_agent.py`;
109
- METHODOLOGY_TOPICS.vision_click = `# Vision-Augmented Navigation (Sight-Based Interaction)
87
+ A two-stage deterministic evaluation system for measuring agent reliability.
110
88
 
111
- When accessibility metadata is missing (e.g., Unity games, custom canvases), the agent uses pixel-based vision.
89
+ ## 1. Pre-Device Planning (LLM Judge)
90
+ - **Static Checks**: Verifies device_id, app_package, and steps are present.
91
+ - **LLM Judge**: GPT-5-mini reviews the plan for logical consistency.
92
+ - **Fail-Fast**: If planning fails, execution is skipped.
112
93
 
113
- ## Implementation Details
114
- 1. **Capture**: \`take_screenshot()\` provides the visual state.
115
- 2. **Reasoning**: GPT-5.2 Vision identifies the target and returns normalized coordinates ([0-1000] scale).
116
- 3. **Denormalization**: Coordinates are projected back to native resolution:
117
- \`\`\`python
118
- native_x = (norm_x / 1000) * screen_width
119
- native_y = (norm_y / 1000) * screen_height
120
- \`\`\`
121
- 4. **Execution**: \`click_on_screen(x=native_x, y=native_y)\` via Mobile MCP.
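Reviewer sketch, not part of the package: the removed denormalization step maps model coordinates on a 0 to 1000 scale back to native pixels. The formula matches the removed snippet. The function name, the `scale` parameter, and the rounding to integers are assumptions.

```python
def denormalize(norm_x: float, norm_y: float,
                screen_width: int, screen_height: int,
                scale: int = 1000) -> tuple[int, int]:
    """Project [0, scale] normalized coordinates onto native screen pixels."""
    native_x = (norm_x / scale) * screen_width
    native_y = (norm_y / scale) * screen_height
    return round(native_x), round(native_y)
```

On a 1080×2400 device, a normalized target of (500, 250) becomes the native point (540, 600), which could then be passed as keyword arguments to a click call.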
122
-
123
- Key file: agents/device_testing/tools/agentic_vision_tools.py`;
124
- METHODOLOGY_TOPICS.mobile_mcp = `# Mobile MCP Client Architecture
94
+ ## 2. On-Device Execution
95
+ - **Reproduction**: Agent attempts task up to 3 times.
96
+ - **Verification**: AI analyzes screen state to confirm goal achievement.
97
+ - **Classification**: TPs (Bug reproduced), FNs (Bug missed), TNs (Correct), FPs (False Alarm).
125
98
 
126
- ## Implementation Details
127
- - **Transport**: stdio-based JSON-RPC via \`subprocess.Popen\`.
128
- - **Lifecycle**: The client automatically restarts the MCP subprocess if it crashes or hangs for >30s.
129
- - **ADB Bridge**: We use \`adb shell dumpsys window | grep mCurrentFocus\` as a fallback verification when MCP reports success but UI hasn't updated.
99
+ Key file: agents/device_testing/golden_bug_service.py`,
100
+ failure_diagnosis: `# Failure Taxonomy & Diagnosis (OAVR "Reason")
130
101
 
131
- ## Key Logic
132
- - \`list_elements_on_screen\`: Returns native coordinates.
133
- - \`take_screenshot\`: Returns scaled JPEG (45% resolution).
102
+ When Action Verifier fails, the Failure Diagnosis Specialist classifies the error.
134
103
 
135
- Key file: agents/device_testing/mobile_mcp_client.py`;
136
- METHODOLOGY_TOPICS.flicker_detection = `# Flicker Detection Pipeline: 4-Layer Architecture
104
+ ## Failure Taxonomy
105
+ 1. **PLANNING_ERROR**: Wrong action for current state.
106
+ 2. **PERCEPTION_ERROR**: Misinterpreted UI (e.g., empty element list).
107
+ 3. **ENVIRONMENT_ERROR**: App crash, OS dialog, or network timeout.
108
+ 4. **EXECUTION_ERROR**: Action failed despite element presence.
137
109
 
138
- Detects screen flickers too fast for periodic screenshots (16-200ms).
110
+ ## Recovery Strategies
111
+ - **Backtrack**: Press BACK and re-classify.
112
+ - **Wait**: Wait 2s for UI sync and re-scan.
113
+ - **Restart**: Press HOME and re-launch app.
114
+ - **Adjust**: Apply 5px jitter to coordinates and retry.
115
+
116
+ Key file: agents/device_testing/subagents/failure_diagnosis_agent.py`,
117
+ model_tiering: `# 2026 Model Tiering Standard
118
+
119
+ Model selection is strictly tiered by "Thinking Budget":
120
+
121
+ 1. **Thinking Tier (GPT-5.2)**: Orchestration (Coordinator), Complex Reasoning, Test Generation.
122
+ 2. **Core Tier (GPT-5-mini)**: Routing, Classification (OAVR), Planning.
123
+ 3. **Utility Tier (GPT-5-nano)**: MCP tool calls, data distillation (JSON cleaning), search enhancement.
124
+
125
+ Key file: backend/app/agents/model_fallback.py`,
126
+ mobile_mcp_fallback: `# Mobile MCP v0.0.36 ADB Fallback
127
+
128
+ Mobile MCP has a critical bug where it fails device detection if *any* device is offline.
139
129
 
140
- ## Layered Logic
141
- - **Layer 1 (Trigger)**: \`adb shell screenrecord --time-limit 10 /sdcard/flicker.mp4\`.
142
- - **Layer 2 (Extraction)**: \`ffmpeg\` scene filtering to extract only candidate frames:
143
- \`\`\`bash
144
- ffmpeg -i in.mp4 -vf "select='gt(scene,0.003)'" -vsync vfr out_%03d.jpg
145
- \`\`\`
146
- - **Layer 3 (Analysis)**: Structural Similarity Index (SSIM) calculated between consecutive pairs. Drops > 0.15 are flagged.
147
- - **Layer 4 (LLM)**: GPT-5.2 analyzes the visual delta to distinguish between "UI Glitch" and "Expected Animation".
148
-
149
- Key file: agents/device_testing/flicker_detection_service.py (1117 lines)`;
150
- METHODOLOGY_TOPICS.closed_loop = `# Closed-Loop Verification (Ralph Loop)
151
-
152
- Every change follows: THINK → ACT → VERIFY → OBSERVE → ADAPT → COMMIT.
153
-
154
- ## Implementation
155
- The agent doesn't just run code; it monitors the terminal output and **self-corrects**:
156
- 1. If \`pytest\` fails, it reads the traceback.
157
- 2. It checks for missing imports (classic bug in SoM tools).
158
- 3. It checks for async/sync mismatches in \`asyncio.to_thread\`.
159
- 4. It re-runs the specific failing test file before attempting a whole-project build.
160
-
161
- NEVER commit without running full verification!`;
162
- METHODOLOGY_TOPICS.device_auto_select = `# Device Auto-Selection
163
-
164
- ## Implementation
165
- 1. Query: \`adb devices\`.
166
- 2. Filter: Keep only entries with \`device\` status.
167
- 3. Rank: Priority 1: \`emulator-5554\`, Priority 2: first \`emulator-*\`, Priority 3: first item.
168
- 4. Persistence: Store choice in navigation session state to prevent mid-task jumping.
169
-
170
- Key file: agents/device_testing/device_testing_agent.py`;
130
+ ## Workaround Logic
131
+ Comprehensive ADB bridge fallback for:
132
+ - **Launching**: am start -n with known activity mapping.
133
+ - **UI Dump**: uiautomator dump /dev/tty (direct to stdout for speed).
134
+ - **Screenshots**: exec-out screencap -p (fast PNG capture).
135
+ - **Interaction**: input tap, input swipe, and input text.
136
+
137
+ Key file: agents/device_testing/mobile_mcp_client.py`,
138
+ simulation_lifecycle: `# Simulation Lifecycle & Safety
139
+
140
+ Managing parallel device executions at scale.
141
+
142
+ ## Safety Controls
143
+ - **Concurrency**: asyncio.Semaphore(max_concurrent) limits active emulators.
144
+ - **Thread Safety**: Per-simulation asyncio.Lock ensures serial result indexing.
145
+ - **Retention**: Max 24h age or 100 total simulations before auto-purge.
146
+
147
+ Key file: agents/coordinator/coordinator_service.py`,
148
+ agent_config: `# Agent Configuration Patterns
149
+
150
+ ## Coordinator Agent (GPT-5.2)
151
+ - parallel_tool_calls=True (orchestration tasks can be parallel)
152
+ - reasoning=Reasoning(effort="high")
153
+
154
+ ## Device Testing Agent (GPT-5-mini)
155
+ - parallel_tool_calls=False ← CRITICAL
156
+ - reasoning=Reasoning(effort="medium")
157
+
158
+ Sequential execution ensures session stability.
159
+
160
+ Key file: agents/device_testing/device_testing_agent.py`,
161
+ };
171
162
  export const METHODOLOGY_TOPIC_LIST = Object.keys(METHODOLOGY_TOPICS);
172
163
  //# sourceMappingURL=methodology.js.map
@@ -1 +1 @@
1
- {"version":3,"file":"methodology.js","sourceRoot":"","sources":["../../src/knowledge/methodology.ts"],"names":[],"mappings":"AAAA;;;GAGG;AAEH,MAAM,CAAC,MAAM,kBAAkB,GAA2B;IACxD,QAAQ,EAAE;;;;;;;;;;gEAUoD;IAE9D,IAAI,EAAE;;;;;;;;;;;;;;;;;6DAiBqD;IAE3D,cAAc,EAAE;;;;;;;;;;;;;;;;;;;;;;;qEAuBmD;IAEnE,kBAAkB,EAAE;;;;;;;;;;;;;;;;;;;;;;;;;;;uDA2BiC;CACtD,CAAC;AAEF,gEAAgE;AAChE,kBAAkB,CAAC,YAAY,GAAG;;;;;;;;;;;;;;;;;;;wDAmBsB,CAAC;AAEzD,kBAAkB,CAAC,YAAY,GAAG;;;;;;;;;;;;;;8DAc4B,CAAC;AAE/D,kBAAkB,CAAC,UAAU,GAAG;;;;;;;;;;;qDAWqB,CAAC;AAEtD,kBAAkB,CAAC,iBAAiB,GAAG;;;;;;;;;;;;;0EAamC,CAAC;AAE3E,kBAAkB,CAAC,WAAW,GAAG;;;;;;;;;;;gDAWe,CAAC;AAEjD,kBAAkB,CAAC,kBAAkB,GAAG;;;;;;;;wDAQgB,CAAC;AAEzD,MAAM,CAAC,MAAM,sBAAsB,GAAG,MAAM,CAAC,IAAI,CAAC,kBAAkB,CAAC,CAAC"}
1
+ {"version":3,"file":"methodology.js","sourceRoot":"","sources":["../../src/knowledge/methodology.ts"],"names":[],"mappings":"AAAA;;;GAGG;AAEH,MAAM,CAAC,MAAM,kBAAkB,GAA2B;IACvD,QAAQ,EAAE;;;;;;;;;;gEAUmD;IAE7D,IAAI,EAAE;;;;;;;;;;;;;;;;;6DAiBoD;IAE1D,cAAc,EAAE;;;;;;;;;;;;;;;;;;;;;;;qEAuBkD;IAElE,kBAAkB,EAAE;;;;;;;;;;;;;;;uDAegC;IAEpD,iBAAiB,EAAE;;;;;;;;;6DASuC;IAE1D,WAAW,EAAE;;;;;;;;;;;;;;sDAcsC;IAEnD,iBAAiB,EAAE;;;;;;;;;;;;;;;;qEAgB+C;IAElE,aAAa,EAAE;;;;;;;;+CAQ6B;IAE5C,mBAAmB,EAAE;;;;;;;;;;;qDAW6B;IAElD,oBAAoB,EAAE;;;;;;;;;oDAS2B;IAEjD,YAAY,EAAE;;;;;;;;;;;;wDAYuC;CACvD,CAAC;AAEF,MAAM,CAAC,MAAM,sBAAsB,GAAG,MAAM,CAAC,IAAI,CAAC,kBAAkB,CAAC,CAAC"}
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "ta-studio-mcp",
3
- "version": "1.1.0",
3
+ "version": "1.2.0",
4
4
  "description": "TA Studio MCP — Domain knowledge, patterns, bug fixes, and workflows for AI agents working on the TA Studio mobile test automation platform.",
5
5
  "type": "module",
6
6
  "bin": {