ta-studio-mcp 1.2.0 → 1.2.2

package/README.md CHANGED
@@ -6,13 +6,6 @@ AI agents often struggle with project-specific context, unique navigation patter
6
6
 
7
7
  ---
8
8
 
9
- ## 📋 Prerequisites
10
-
11
- - **Node.js**: `v18.0.0` or higher.
12
- - **MCP Client**: An IDE or tool that supports the [Model Context Protocol](https://modelcontextprotocol.io) (e.g., Claude Desktop, Cursor, Windsurf, VS Code).
13
-
14
- ---
15
-
16
9
  ## ⚡ Quick Start
17
10
 
18
11
  ```bash
@@ -21,70 +14,66 @@ npx ta-studio-mcp
21
14
 
22
15
  ---
23
16
 
24
- ## 🧪 Methodologies & Deep Technical Lore
17
+ ## 🧠 Expert Knowledge & Deep Technical Lore
25
18
 
26
19
  This section documents the state-of-the-art implementations used by the TA Studio team.
27
20
 
28
- ### 1. Model Tiering (Jan 2026 Standard)
29
- We avoid "one size fits all" model selection. Models are tiered by "Thinking Budget":
30
- - **Thinking Tier (GPT-5.2)**: Used for high-level orchestration (Coordinator) and complex visual reasoning. reasoning effort: `high`.
31
- - **Core Tier (GPT-5-mini)**: Used for specialized specialists (Classifier, Verifier, Diagnosis). *Never use nano for classification.*
32
- - **Utility Tier (GPT-5-nano)**: Used for MCP tool formatting, data distillation, and search cleanup.
33
-
34
- ### 2. Failure Taxonomy (OAVR "Reason")
35
- The `Failure Diagnosis Specialist` uses a structured taxonomy to recover from errors:
36
- - **PLANNING_ERROR**: Incorrect action choice. *Recovery*: Backtrack/Retry.
37
- - **PERCEPTION_ERROR**: UI misrepresented in model. *Recovery*: Wait/Re-scan.
38
- - **ENVIRONMENT_ERROR**: App crash/Dialogs. *Recovery*: Handle OS dialog/Restart.
39
- - **EXECUTION_ERROR**: Click/Swipe failed to register. *Recovery*: Apply 5px jitter/Retry.
40
-
41
- ### 3. Mobile MCP ADB Fallback
42
- Mobile MCP v0.0.36 fails if *any* device is offline. Our client implements a comprehensive ADB bridge:
43
- - **Fast Screenshot**: `exec-out screencap -p` (Base64 direct stream).
44
- - **Fast UI Dump**: `uiautomator dump /dev/tty` (No temp file I/O).
45
- - **Control**: Direct `input tap`, `input swipe`, and `am start -n` activity mapping.
46
-
47
- ### 4. Golden Bug Metrics
48
- We measure agent reliability via a two-stage deterministic pipeline:
49
- - **Planning Judge**: Static analysis + LLM verification before device boot.
50
- - **Execution Judge**: Reproduction and AI verification of goal state.
51
- - **Output**: Precision, Recall, and F1 scores aggregated in `data/agent_runs/golden`.
21
+ ### 1. Set-of-Mark (SoM) Screenshot Annotation
22
+ Based on OmniParser's SoM approach, we use color-coded, type-aware bounding boxes to provide visual anchors.
23
+ - **Type-Aware Palette**: 9 distinct colors (e.g., **Dodger Blue** for buttons, **Orange** for inputs).
24
+ - **PIL Threading**: `asyncio.to_thread(_draw_bounding_boxes_threaded)` for non-blocking UI drawing.
25
+ - **TOON Optimization**: **Token Optimized Object Notation** reduces prompt tokens by 40% by stripping redundant XML.
26
+
27
+ ### 2. Deep Subagent Handoff Protocol
28
+ Our "Deep Agent Pattern" orchestrates specialized specialists via a strict chain of custody:
29
+ 1. **Perceptor** (`Screen Classifier`): Returns structured state and **TOON** elements.
30
+ 2. **Planner** (`Device Agent`): Proposes action.
31
+ 3. **Guardrail** (`Action Verifier`): Applies **Boolean Verification** (Safe/Relevant/Executable).
32
+ 4. **Doctor** (`Failure Diagnosis`): Categorizes failures and suggests recovery (Jitter/Wait/Backtrack).
33
+
34
+ ### 3. Boolean Verification vs. Numerical Scoring
35
+ We reject "confidence scores" (e.g. 0.85). Every action must pass three binary checks:
36
+ - **is_safe**: No data loss or unauthorized access?
37
+ - **is_relevant**: Moves toward task goal?
38
+ - **is_executable**: Target is reachable?
39
+ **Logic**: Action executes ONLY if ALL checks are YES.
40
+
41
+ ### 4. Real-Time HUD & Parallel Execution
42
+ - **Observation**: `<200ms` lag via `on_step` async callbacks that emit SSE events.
43
+ - **Concurrency**: `asyncio.Semaphore` and per-simulation `asyncio.Lock` manage multiple parallel device streams without resource collision.
44
+
45
+ ### 5. Model Tiering (2026 Standard)
46
+ - **Thinking Tier (GPT-5.2)**: Orchestration & complex visual reasoning. reasoning effort: `high`.
47
+ - **Core Tier (GPT-5-mini)**: Specialist subagents. *Never use nano for classification.*
48
+ - **Utility Tier (GPT-5-nano)**: MCP formatting and distillation.
52
49
 
53
50
  ---
54
51
 
55
- ## 🐞 Critical Bug Fixes (Implementation Level)
52
+ ## 🐞 Critical Bug Fixes (The "Expert" List)
56
53
 
57
54
  | Severity | Issue | Root Cause & Expert Fix |
58
55
  |----------|-------|-------------------------|
59
- | **CRITICAL** | Bbox Misalignment | **RC**: 45% Scaling Delta. **Fix**: Apply `img_width / native_width` factor. |
60
- | **CRITICAL** | Async to_thread | **RC**: CORO vs CALL. **Fix**: Remove `async` from functions passed to `asyncio.to_thread`. |
61
- | **CRITICAL** | Race Condition | **RC**: Parallel sessions. **Fix**: `parallel_tool_calls=False` for sequential testing. |
62
- | **HIGH** | Simulation Leak | **RC**: Memory persistence. **Fix**: 24h/100-run auto-purge with `asyncio.Lock` safety. |
63
- | **HIGH** | Figma Rate Limit | **RC**: API 429 status. **Fix**: Direct CV overlay via Playwright + brightness detection. |
56
+ | **CRITICAL** | Bbox Misalignment | **RC**: 45% Scaling Delta. **Fix**: Apply `img_width / native_width`. |
57
+ | **CRITICAL** | Async to_thread | **RC**: Core async vs Coroutine. **Fix**: Remove `async` from `to_thread` targets. |
58
+ | **CRITICAL** | Race Condition | **RC**: Parallel sessions. **Fix**: `parallel_tool_calls=False`. |
59
+ | **HIGH** | Simulation Leak | **RC**: Memory persistence. **Fix**: 24h/100-run auto-purge + `asyncio.Lock`. |
60
+ | **HIGH** | Mobile MCP Bug | **RC**: Offline device fail. **Fix**: Full ADB bridge fallback (screencap/uiautomator). |
64
61
 
65
62
  ---
66
63
 
67
64
  ## 🔄 Core Workflows
68
65
 
69
66
  ### The Ralph Loop (Closed-Loop Verification)
70
- 1. **CODE** → Implement.
71
- 2. **LINT** → `mypy` / `eslint`.
72
- 3. **UNIT TEST** → Specific module verification.
73
- 4. **CHECK ASYNC** → Confirm `to_thread` safety.
74
- 5. **VERIFY HUD** → Watch the emulator stream while agent runs.
67
+ 1. **CODE** → **LINT** → **UNIT TEST** → **CHECK ASYNC** → **VERIFY HUD**.
75
68
 
76
69
  ---
77
70
 
78
71
  ## 📦 Installation & Setup
79
72
 
80
- ### Claude Desktop
81
73
  ```bash
82
74
  claude mcp add ta-studio -- npx -y ta-studio-mcp@latest
83
75
  ```
84
76
 
85
- ### Cursor / Windsurf
86
- Add `npx -y ta-studio-mcp@latest` as a command-type MCP server.
87
-
88
77
  ---
89
78
 
90
79
  ## 📜 License
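
The "Async to_thread" row in the bug-fix table above says to remove `async` from functions passed to `asyncio.to_thread`. A minimal Python sketch of why that matters follows; `draw_boxes_sync` and `draw_boxes_async` are hypothetical stand-ins for the real drawing helper, which this diff does not show.

```python
import asyncio
import inspect


def draw_boxes_sync(n: int) -> int:
    """Plain function: safe to hand to asyncio.to_thread."""
    return n * 2  # stand-in for CPU-bound drawing work


async def draw_boxes_async(n: int) -> int:
    """Coroutine function: calling it only creates a coroutine object."""
    return n * 2


async def main() -> tuple[int, bool]:
    # Correct: the worker thread calls the function and returns its result.
    good = await asyncio.to_thread(draw_boxes_sync, 21)

    # Buggy: the worker thread calls the coroutine function, so what comes
    # back is an un-awaited coroutine object, not the computed value.
    bad = await asyncio.to_thread(draw_boxes_async, 21)
    is_coroutine = inspect.iscoroutine(bad)
    bad.close()  # suppress the "never awaited" RuntimeWarning
    return good, is_coroutine


result = asyncio.run(main())
```

In the buggy call nothing raises, which is presumably why the table rates it CRITICAL: the drawing work silently never runs.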
package/dist/index.js CHANGED
@@ -14,7 +14,7 @@ import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js'
14
14
  import { registerAllTools } from './tools/register-all.js';
15
15
  const server = new McpServer({
16
16
  name: 'ta-studio-mcp',
17
- version: '1.2.0',
17
+ version: '1.2.2',
18
18
  }, {
19
19
  capabilities: {
20
20
  logging: {},
@@ -1,6 +1,5 @@
1
1
  /**
2
2
  * TA Studio methodology knowledge base.
3
- * Each topic explains a pattern, technique, or architectural decision.
4
3
  */
5
4
  export declare const METHODOLOGY_TOPICS: Record<string, string>;
6
5
  export declare const METHODOLOGY_TOPIC_LIST: string[];
@@ -1 +1 @@
1
- {"version":3,"file":"methodology.d.ts","sourceRoot":"","sources":["../../src/knowledge/methodology.ts"],"names":[],"mappings":"AAAA;;;GAGG;AAEH,eAAO,MAAM,kBAAkB,EAAE,MAAM,CAAC,MAAM,EAAE,MAAM,CAsKrD,CAAC;AAEF,eAAO,MAAM,sBAAsB,UAAkC,CAAC"}
1
+ {"version":3,"file":"methodology.d.ts","sourceRoot":"","sources":["../../src/knowledge/methodology.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,eAAO,MAAM,kBAAkB,EAAE,MAAM,CAAC,MAAM,EAAE,MAAM,CA2LrD,CAAC;AAEF,eAAO,MAAM,sBAAsB,UAAkC,CAAC"}
@@ -1,11 +1,10 @@
1
1
  /**
2
2
  * TA Studio methodology knowledge base.
3
- * Each topic explains a pattern, technique, or architectural decision.
4
3
  */
5
4
  export const METHODOLOGY_TOPICS = {
6
5
  overview: `# TA Studio Methodologies — Overview
7
6
 
8
- Available topics: oavr, som_annotation, coordinate_scaling, agent_config, flicker_detection, golden_bugs, mobile_mcp, vision_click, failure_diagnosis, self_correction, model_tiering, simulation_lifecycle
7
+ Available topics: oavr, som_annotation, coordinate_scaling, agent_config, flicker_detection, golden_bugs, mobile_mcp, vision_click, failure_diagnosis, self_correction, model_tiering, simulation_lifecycle, subagent_handoff, boolean_verification, hud_streaming
9
8
 
10
9
  ## Architecture
11
10
  - **Backend**: FastAPI (Python 3.11+) at backend/
@@ -23,15 +22,7 @@ The device testing agent uses OAVR for autonomous navigation:
23
22
  3. **Verify** → Action Verifier confirms the action succeeded
24
23
  4. **Reason** → Failure Diagnosis suggests recovery if verification failed
25
24
 
26
- ## Implementation Details
27
- - **Handoff Logic**: Sub-agents are triggered via @tool decorators in the DeviceTestingAgent class.
28
- - **State Management**: The session_id is passed through all tools to ensure logs are grouped.
29
- - **Fallback**: If Screen Classifier fails to identify an element, the agent automatically falls back to vision_click.
30
-
31
- Key files:
32
- - agents/device_testing/subagents/screen_classifier_agent.py
33
- - agents/device_testing/subagents/action_verifier_agent.py
34
- - agents/device_testing/subagents/failure_diagnosis_agent.py`,
25
+ Key file: agents/device_testing/subagents/README.md`,
35
26
  som_annotation: `# Set-of-Mark (SoM) Screenshot Annotation
36
27
 
37
28
  Based on OmniParser's SoM approach — color-coded, type-aware bounding boxes.
@@ -49,16 +40,10 @@ Based on OmniParser's SoM approach — color-coded, type-aware bounding boxes.
49
40
  | container | Dark gray | BOX | FrameLayout, LinearLayout |
50
41
  | unknown | Green | ELEM | Unclassified elements |
51
42
 
52
- ## Implementation Details
53
- - **PIL Threading**: Drawing 50+ bounding boxes with antialiasing is CPU intensive. We use asyncio.to_thread(_draw_bounding_boxes_threaded) to keep the event loop non-blocking.
54
- - **Class Priority**: Substring matching uses a prioritized list. radiobutton is checked before button to prevent incorrect classification.
55
- - **TOON Optimization**: The list_elements_on_screen output is converted to TOON (Token Optimized Object Notation) which strips redundant metadata to save 40% in prompt tokens.
56
- - **Font Scaling**: font_size = int(img_width / 54).
57
-
58
43
  Key file: agents/device_testing/tools/autonomous_navigation_tools.py`,
59
44
  coordinate_scaling: `# Coordinate Scaling — Screenshot vs Device Resolution
60
45
 
61
- Mobile MCP take_screenshot returns JPEG images scaled to ~45% of native resolution, but list_elements_on_screen returns coordinates in native device resolution.
46
+ Mobile MCP take_screenshot returns JPEG images scaled to ~45% of native resolution.
62
47
 
63
48
  | Layer | Resolution | Source |
64
49
  |--------------------|-------------|----------------------------------|
@@ -69,7 +54,6 @@ Mobile MCP take_screenshot returns JPEG images scaled to ~45% of native resoluti
69
54
  ## Implementation Details
70
55
  1. **Parse Resolution**: Get screen size via get_screen_size() and parse with regex.
71
56
  2. **Scaling Logic**: scale_x = img.width / screen_width.
72
- 3. **Coordinate Transformation**: target_x = raw_x * scale_x.
73
57
 
74
58
  Key file: autonomous_navigation_tools.py lines 397-595`,
75
59
  flicker_detection: `# Flicker Detection Pipeline — 4-Layer Architecture
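
The coordinate-scaling hunk above keeps `scale_x = img.width / screen_width` and drops the line `target_x = raw_x * scale_x`. The two steps together can be sketched in Python as below. The `Size` type and `scale_point` function are hypothetical names, and the vertical factor is assumed to mirror the horizontal one, which the source does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    width: int
    height: int


def scale_point(raw_x: int, raw_y: int, image: Size, screen: Size) -> tuple[int, int]:
    """Map a native-resolution coordinate onto the downscaled screenshot."""
    scale_x = image.width / screen.width     # from the diff: img.width / screen_width
    scale_y = image.height / screen.height   # assumption: same rule vertically
    return round(raw_x * scale_x), round(raw_y * scale_y)


# A 1080x2400 device whose screenshot arrives at 45% of native resolution.
point = scale_point(540, 1200, image=Size(486, 1080), screen=Size(1080, 2400))
```

Here the centre of the native screen, (540, 1200), lands on the centre of the 486x1080 screenshot, (243, 540).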
@@ -78,7 +62,7 @@ Detects screen flickers too fast for periodic screenshots (16-200ms).
78
62
 
79
63
  1. **Layer 1 (Trigger)**: adb shell screenrecord --time-limit 10.
80
64
  2. **Layer 2 (Extraction)**: ffmpeg scene filtering (select='gt(scene,0.003)').
81
- 3. **Layer 3 (Analysis)**: SSIM calculated between consecutive pairs. Drops > 0.15 are flagged.
65
+ 3. **Layer 3 (Analysis)**: SSIM calculated between consecutive pairs.
82
66
  4. **Layer 4 (LLM)**: GPT-5.2 Vision verification.
83
67
 
84
68
  Key file: agents/device_testing/flicker_detection_service.py`,
@@ -89,12 +73,10 @@ A two-stage deterministic evaluation system for measuring agent reliability.
89
73
  ## 1. Pre-Device Planning (LLM Judge)
90
74
  - **Static Checks**: Verifies device_id, app_package, and steps are present.
91
75
  - **LLM Judge**: GPT-5-mini reviews the plan for logical consistency.
92
- - **Fail-Fast**: If planning fails, execution is skipped.
93
76
 
94
77
  ## 2. On-Device Execution
95
78
  - **Reproduction**: Agent attempts task up to 3 times.
96
79
  - **Verification**: AI analyzes screen state to confirm goal achievement.
97
- - **Classification**: TPs (Bug reproduced), FNs (Bug missed), TNs (Correct), FPs (False Alarm).
98
80
 
99
81
  Key file: agents/device_testing/golden_bug_service.py`,
100
82
  failure_diagnosis: `# Failure Taxonomy & Diagnosis (OAVR "Reason")
@@ -107,12 +89,6 @@ When Action Verifier fails, the Failure Diagnosis Specialist classifies the erro
107
89
  3. **ENVIRONMENT_ERROR**: App crash, OS dialog, or network timeout.
108
90
  4. **EXECUTION_ERROR**: Action failed despite element presence.
109
91
 
110
- ## Recovery Strategies
111
- - **Backtrack**: Press BACK and re-classify.
112
- - **Wait**: Wait 2s for UI sync and re-scan.
113
- - **Restart**: Press HOME and re-launch app.
114
- - **Adjust**: Apply 5px jitter to coordinates and retry.
115
-
116
92
  Key file: agents/device_testing/subagents/failure_diagnosis_agent.py`,
117
93
  model_tiering: `# 2026 Model Tiering Standard
118
94
 
@@ -120,7 +96,7 @@ Model selection is strictly tiered by "Thinking Budget":
120
96
 
121
97
  1. **Thinking Tier (GPT-5.2)**: Orchestration (Coordinator), Complex Reasoning, Test Generation.
122
98
  2. **Core Tier (GPT-5-mini)**: Routing, Classification (OAVR), Planning.
123
- 3. **Utility Tier (GPT-5-nano)**: MCP tool calls, data distillation (JSON cleaning), search enhancement.
99
+ 3. **Utility Tier (GPT-5-nano)**: MCP tool calls, data distillation (JSON cleaning).
124
100
 
125
101
  Key file: backend/app/agents/model_fallback.py`,
126
102
  mobile_mcp_fallback: `# Mobile MCP v0.0.36 ADB Fallback
@@ -132,7 +108,6 @@ Comprehensive ADB bridge fallback for:
132
108
  - **Launching**: am start -n with known activity mapping.
133
109
  - **UI Dump**: uiautomator dump /dev/tty (direct to stdout for speed).
134
110
  - **Screenshots**: exec-out screencap -p (fast PNG capture).
135
- - **Interaction**: input tap, input swipe, and input text.
136
111
 
137
112
  Key file: agents/device_testing/mobile_mcp_client.py`,
138
113
  simulation_lifecycle: `# Simulation Lifecycle & Safety
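
The README hunk earlier in this diff describes parallel device streams guarded by `asyncio.Semaphore` and a per-simulation `asyncio.Lock`. A rough Python sketch of that pattern follows. The limit of two devices, the `run_all` function, and the step bookkeeping are all assumptions for illustration, not the package's actual code.

```python
import asyncio
from collections import defaultdict

MAX_PARALLEL_DEVICES = 2  # hypothetical limit, not taken from the source


async def run_all(sim_ids: list[str], steps_per_sim: int = 2):
    device_slots = asyncio.Semaphore(MAX_PARALLEL_DEVICES)
    sim_locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
    state = {"active": 0, "peak": 0}
    results: dict[str, list[int]] = defaultdict(list)

    async def run_step(sim_id: str, step: int) -> None:
        # The per-simulation lock serialises steps of one simulation;
        # the semaphore caps how many simulations touch devices at once.
        async with sim_locks[sim_id]:
            async with device_slots:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
                await asyncio.sleep(0.01)  # stand-in for device I/O
                results[sim_id].append(step)
                state["active"] -= 1

    await asyncio.gather(
        *(run_step(s, i) for s in sim_ids for i in range(steps_per_sim))
    )
    return state["peak"], dict(results)


peak, results = asyncio.run(run_all(["a", "b", "c"]))
```

Taking the simulation lock before the semaphore means a step that is waiting on its own simulation does not hold a device slot in the meantime.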
@@ -145,6 +120,49 @@ Managing parallel device executions at scale.
145
120
  - **Retention**: Max 24h age or 100 total simulations before auto-purge.
146
121
 
147
122
  Key file: agents/coordinator/coordinator_service.py`,
123
+ subagent_handoff: `# Subagent Handoff Protocol
124
+
125
+ How the Coordinator orchestrates specialist subagents without context loss.
126
+
127
+ ## The Chain of Custody
128
+ 1. **Perceptor (Screen Classifier)**: Analyzes UI and returns a structured screen_state.
129
+ 2. **Planner (Device Agent)**: Proposes an action based on the state.
130
+ 3. **Guardrail (Action Verifier)**: Receives the action, screen_state, and task_goal. Returns boolean approval.
131
+ 4. **Actor (Mobile MCP)**: Executes the approved action.
132
+ 5. **Doctor (Failure Diagnosis)**: Only triggered if execution fails.
133
+
134
+ ## Memory Strategy
135
+ We avoid passing giant raw XML. Instead, the Classifier distills UI into **TOON** elements, which are then carried through the Verifier/Diagnosis steps to save tokens.
136
+
137
+ Key file: agents/device_testing/device_testing_agent.py`,
138
+ boolean_verification: `# Boolean Verification vs. Numerical Scoring
139
+
140
+ Based on the V-Droid approach (arxiv.org/html/2503.15937v4).
141
+
142
+ ## The Three Checks
143
+ Every action must pass three binary checks:
144
+ 1. **is_safe**: Does this action cause data loss or unauthorized access?
145
+ 2. **is_relevant**: Does this move the needle on the task goal?
146
+ 3. **is_executable**: Can the target realistically be clicked/typed on?
147
+
148
+ Logic: approved = (is_safe AND is_relevant AND is_executable).
149
+ If any check is NO, the agent must propose an alternative_action.
150
+
151
+ Key file: agents/device_testing/subagents/action_verifier_agent.py`,
152
+ hud_streaming: `# Real-Time HUD Observation Pipeline
153
+
154
+ How we achieve <200ms lag between agent thought and UI rendering.
155
+
156
+ ## The on_step Callback
157
+ The UnifiedBugReproductionService accepts an on_step async callback.
158
+ 1. **Capture**: Screenshot saved.
159
+ 2. **Signal**: Service emits a tool_call event via FastAPI SSE/WebSocket.
160
+ 3. **Render**: Frontend React components update instantly.
161
+
162
+ ## Parallel HUDs
163
+ Each device runs in its own asyncio.Task, allowing Frontend to display multiple live streams simultaneously, each with its own independent thinking drawer.
164
+
165
+ Key files: agents/coordinator/coordinator_service.py, api/device_simulation.py`,
148
166
  agent_config: `# Agent Configuration Patterns
149
167
 
150
168
  ## Coordinator Agent (GPT-5.2)
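
The `boolean_verification` topic added in the hunk above reduces to `approved = (is_safe AND is_relevant AND is_executable)`. A minimal Python sketch of that gate follows; the `Verdict` class and its `failed_checks` helper are hypothetical, and each flag is taken to mean that the check passed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    is_safe: bool        # True = no data loss or unauthorized access
    is_relevant: bool    # True = moves toward the task goal
    is_executable: bool  # True = the target can be reached

    @property
    def approved(self) -> bool:
        # No confidence score: every check must be YES.
        return self.is_safe and self.is_relevant and self.is_executable

    def failed_checks(self) -> list[str]:
        """Names of the checks that came back NO."""
        checks = {
            "is_safe": self.is_safe,
            "is_relevant": self.is_relevant,
            "is_executable": self.is_executable,
        }
        return [name for name, passed in checks.items() if not passed]


blocked = Verdict(is_safe=True, is_relevant=True, is_executable=False)
allowed = Verdict(is_safe=True, is_relevant=True, is_executable=True)
```

A caller would execute the action only when `approved` is true and could otherwise pass `failed_checks()` back to the planner when requesting an `alternative_action`.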
@@ -153,7 +171,6 @@ Key file: agents/coordinator/coordinator_service.py`,
153
171
 
154
172
  ## Device Testing Agent (GPT-5-mini)
155
173
  - parallel_tool_calls=False ← CRITICAL
156
- - reasoning=Reasoning(effort="medium")
157
174
 
158
175
  Sequential execution ensures session stability.
159
176
 
@@ -1 +1 @@
1
- {"version":3,"file":"methodology.js","sourceRoot":"","sources":["../../src/knowledge/methodology.ts"],"names":[],"mappings":"AAAA;;;GAGG;AAEH,MAAM,CAAC,MAAM,kBAAkB,GAA2B;IACvD,QAAQ,EAAE;;;;;;;;;;gEAUmD;IAE7D,IAAI,EAAE;;;;;;;;;;;;;;;;;6DAiBoD;IAE1D,cAAc,EAAE;;;;;;;;;;;;;;;;;;;;;;;qEAuBkD;IAElE,kBAAkB,EAAE;;;;;;;;;;;;;;;uDAegC;IAEpD,iBAAiB,EAAE;;;;;;;;;6DASuC;IAE1D,WAAW,EAAE;;;;;;;;;;;;;;sDAcsC;IAEnD,iBAAiB,EAAE;;;;;;;;;;;;;;;;qEAgB+C;IAElE,aAAa,EAAE;;;;;;;;+CAQ6B;IAE5C,mBAAmB,EAAE;;;;;;;;;;;qDAW6B;IAElD,oBAAoB,EAAE;;;;;;;;;oDAS2B;IAEjD,YAAY,EAAE;;;;;;;;;;;;wDAYuC;CACvD,CAAC;AAEF,MAAM,CAAC,MAAM,sBAAsB,GAAG,MAAM,CAAC,IAAI,CAAC,kBAAkB,CAAC,CAAC"}
1
+ {"version":3,"file":"methodology.js","sourceRoot":"","sources":["../../src/knowledge/methodology.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,MAAM,CAAC,MAAM,kBAAkB,GAA2B;IACvD,QAAQ,EAAE;;;;;;;;;;gEAUmD;IAE7D,IAAI,EAAE;;;;;;;;;oDAS2C;IAEjD,cAAc,EAAE;;;;;;;;;;;;;;;;;qEAiBkD;IAElE,kBAAkB,EAAE;;;;;;;;;;;;;;uDAcgC;IAEpD,iBAAiB,EAAE;;;;;;;;;6DASuC;IAE1D,WAAW,EAAE;;;;;;;;;;;;sDAYsC;IAEnD,iBAAiB,EAAE;;;;;;;;;;qEAU+C;IAElE,aAAa,EAAE;;;;;;;;+CAQ6B;IAE5C,mBAAmB,EAAE;;;;;;;;;;qDAU6B;IAElD,oBAAoB,EAAE;;;;;;;;;oDAS2B;IAEjD,gBAAgB,EAAE;;;;;;;;;;;;;;wDAcmC;IAErD,oBAAoB,EAAE;;;;;;;;;;;;;mEAa0C;IAEhE,aAAa,EAAE;;;;;;;;;;;;;+EAa6D;IAE5E,YAAY,EAAE;;;;;;;;;;;wDAWuC;CACvD,CAAC;AAEF,MAAM,CAAC,MAAM,sBAAsB,GAAG,MAAM,CAAC,IAAI,CAAC,kBAAkB,CAAC,CAAC"}
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "ta-studio-mcp",
3
- "version": "1.2.0",
3
+ "version": "1.2.2",
4
4
  "description": "TA Studio MCP — Domain knowledge, patterns, bug fixes, and workflows for AI agents working on the TA Studio mobile test automation platform.",
5
5
  "type": "module",
6
6
  "bin": {