morphnet 0.1.0__tar.gz
- morphnet-0.1.0/LICENSE +21 -0
- morphnet-0.1.0/PKG-INFO +304 -0
- morphnet-0.1.0/README.md +272 -0
- morphnet-0.1.0/morphnet/__init__.py +3 -0
- morphnet-0.1.0/morphnet/computer_use.py +803 -0
- morphnet-0.1.0/morphnet/mcp_manager.py +1913 -0
- morphnet-0.1.0/morphnet/morphnet_orchestrator.py +1029 -0
- morphnet-0.1.0/morphnet/reflector.py +1404 -0
- morphnet-0.1.0/morphnet/representation.py +1811 -0
- morphnet-0.1.0/morphnet/session_manager.py +2332 -0
- morphnet-0.1.0/morphnet/sites/bookmyshow_com/profile.json +26 -0
- morphnet-0.1.0/morphnet/sites/bookmyshow_com/tools.json +2944 -0
- morphnet-0.1.0/morphnet/sites/irctc_co_in/profile.json +10 -0
- morphnet-0.1.0/morphnet/sites/lego_com/profile.json +26 -0
- morphnet-0.1.0/morphnet/sites/lego_com/tools.json +2966 -0
- morphnet-0.1.0/morphnet/sites/swiggy_com/profile.json +26 -0
- morphnet-0.1.0/morphnet/sites/swiggy_com/tools.json +1704 -0
- morphnet-0.1.0/morphnet/trace.py +386 -0
- morphnet-0.1.0/morphnet.egg-info/PKG-INFO +304 -0
- morphnet-0.1.0/morphnet.egg-info/SOURCES.txt +23 -0
- morphnet-0.1.0/morphnet.egg-info/dependency_links.txt +1 -0
- morphnet-0.1.0/morphnet.egg-info/requires.txt +12 -0
- morphnet-0.1.0/morphnet.egg-info/top_level.txt +1 -0
- morphnet-0.1.0/pyproject.toml +48 -0
- morphnet-0.1.0/setup.cfg +4 -0
morphnet-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Rohan Saswade
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
morphnet-0.1.0/PKG-INFO
ADDED
|
@@ -0,0 +1,304 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: morphnet
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Transforms browser automation into reusable API tools. Uses computer use as discovery infrastructure to learn deterministic MCP tool calls from live site traffic.
|
|
5
|
+
Author-email: Rohan Saswade <suswader@gmail.com>
|
|
6
|
+
License: MIT
|
|
7
|
+
Project-URL: Homepage, https://github.com/suswader/morphnet
|
|
8
|
+
Project-URL: Repository, https://github.com/suswader/morphnet
|
|
9
|
+
Project-URL: Issues, https://github.com/suswader/morphnet/issues
|
|
10
|
+
Keywords: browser-automation,mcp,computer-use,web-agent,tool-learning
|
|
11
|
+
Classifier: Development Status :: 3 - Alpha
|
|
12
|
+
Classifier: Intended Audience :: Developers
|
|
13
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
14
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
15
|
+
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
|
|
16
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
17
|
+
Requires-Python: >=3.12
|
|
18
|
+
Description-Content-Type: text/markdown
|
|
19
|
+
License-File: LICENSE
|
|
20
|
+
Requires-Dist: google-genai
|
|
21
|
+
Requires-Dist: playwright
|
|
22
|
+
Requires-Dist: python-dotenv
|
|
23
|
+
Requires-Dist: httpx
|
|
24
|
+
Requires-Dist: pydantic
|
|
25
|
+
Requires-Dist: pyyaml
|
|
26
|
+
Requires-Dist: pillow>=12.1.1
|
|
27
|
+
Requires-Dist: curl-cffi>=0.14.0
|
|
28
|
+
Provides-Extra: dev
|
|
29
|
+
Requires-Dist: ruff; extra == "dev"
|
|
30
|
+
Requires-Dist: pytest; extra == "dev"
|
|
31
|
+
Dynamic: license-file
|
|
32
|
+
|
|
33
|
+
# MorphNet
|
|
34
|
+
|
|
35
|
+
MorphNet transforms volatile, expensive computer use (CU) into stable, fast, affordable MCP tool calls. Use CU as a **discovery mechanism** — observe successful browser interactions, capture HTTP traffic, identify deterministic request patterns, and crystallize these into reusable MCP tools. Over time, the system shifts from unreliable browser automation to deterministic API-level execution. **CU is discovery infrastructure, not the end state.**
|
|
36
|
+
|
|
37
|
+
## Architecture
|
|
38
|
+
|
|
39
|
+
```
|
|
40
|
+
User Query + URL
|
|
41
|
+
│
|
|
42
|
+
▼
|
|
43
|
+
┌──────────────────┐
|
|
44
|
+
│ session_manager │ Persistent Chrome via CDP · Raw data server
|
|
45
|
+
│ │ Shared Gemini inference utility
|
|
46
|
+
└────────┬─────────┘
|
|
47
|
+
│
|
|
48
|
+
▼
|
|
49
|
+
┌──────────────────┐
|
|
50
|
+
│ morphnet_ │ Branch/prune planning tree (AgentOccam)
|
|
51
|
+
│ orchestrator │ Routes subtasks to CU or MCP
|
|
52
|
+
└───┬──────────┬───┘
|
|
53
|
+
│ │
|
|
54
|
+
▼ ▼
|
|
55
|
+
┌────────┐ ┌────────────┐
|
|
56
|
+
│computer│ │mcp_manager │ All protocols: REST, GraphQL, JSON-RPC, form, multipart
|
|
57
|
+
│_use │ │ │ Lifecycle: verified → trusted → degraded → discarded
|
|
58
|
+
│ 10 acts│ │ │
|
|
59
|
+
└───┬────┘ └──────┬─────┘
|
|
60
|
+
│ │
|
|
61
|
+
▼ ▼
|
|
62
|
+
┌──────────────────────┐
|
|
63
|
+
│ reflector │ Three-stage pipeline: deterministic → AXTree diff → LLM
|
|
64
|
+
│ │ Separate paths for CU actions vs MCP calls vs subtasks
|
|
65
|
+
└───────────────────────┘
|
|
66
|
+
│
|
|
67
|
+
▼
|
|
68
|
+
┌──────────────────────┐
|
|
69
|
+
│ trace.py │ Every decision: reasoning, evidence, confidence → JSONL
|
|
70
|
+
└───────────────────────┘
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
## Directory Structure
|
|
74
|
+
|
|
75
|
+
```
|
|
76
|
+
morphnet/
|
|
77
|
+
├── session_manager.py # Browser session + raw data + Gemini utility
|
|
78
|
+
├── morphnet_orchestrator.py # Planning, routing, website profiling
|
|
79
|
+
├── computer_use.py # CU agent (10 actions per subtask)
|
|
80
|
+
├── representation.py # Page representation pipeline (CLEAN→COLLECT→STRUCTURE→FORMAT)
|
|
81
|
+
├── mcp_manager.py # MCP creation, execution, lifecycle
|
|
82
|
+
├── reflector.py # Three-stage verification pipeline
|
|
83
|
+
├── trace.py # Decision trace recorder (deterministic)
|
|
84
|
+
├── run_webarena_evals.py # Eval harness (deterministic, no LLM calls)
|
|
85
|
+
├── prompts/ # All LLM prompts as .txt files
|
|
86
|
+
└── sites/ # Per-website persistent state
|
|
87
|
+
├── noise_domains.txt
|
|
88
|
+
└── {site_name}/
|
|
89
|
+
├── profile.json # Website insights, auth patterns
|
|
90
|
+
├── credentials.json # Login credentials
|
|
91
|
+
└── tools.json # MCP tools + lifecycle status
|
|
92
|
+
|
|
93
|
+
results/ # Trace output (auto-created)
|
|
94
|
+
├── YYYY-MM-DD_HHMMSS/ # Single run
|
|
95
|
+
│ └── trace.jsonl
|
|
96
|
+
└── eval_{benchmark}_{datetime}/ # Eval batch
|
|
97
|
+
├── task_{id}/
|
|
98
|
+
│ └── trace.jsonl
|
|
99
|
+
└── eval_summary.json
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
---
|
|
103
|
+
|
|
104
|
+
## Core Design Decision: Each Module Owns Its Representation
|
|
105
|
+
|
|
106
|
+
**session_manager.py serves raw data.** It extracts DOM, AXTree, screenshots, cookies, tokens, and traffic — hands them unprocessed to consumer modules. Basic structural cleaning only (strip `<script>`, `<style>`, `<noscript>`; filter noise domains).
|
|
107
|
+
|
|
108
|
+
**Each consumer module distills this raw data into the representation its LLM needs.** The raw data sources at each module's disposal are: AXTree (semantic roles, states, accessible names), DOM (structure, hidden fields, data attributes, form layout, parameter sources), screenshots (visual layout, annotated by CU with Set-of-Marks (SoM) bounding boxes when element grounding for VLMs is needed), cookies/storage (session state, auth tokens), meta tokens (CSRF, form keys with source annotations), and captured network traffic (request/response pairs with protocol classification). No single module uses all of these; each selects and distills only the subset relevant to its task, following the task-adaptive distillation principle of AgentOccam (ICLR 2025) and Agent-E.
|
|
109
|
+
|
|
110
|
+
---
|
|
111
|
+
|
|
112
|
+
## Module Specifications
|
|
113
|
+
|
|
114
|
+
### session_manager.py — Raw Data Server
|
|
115
|
+
|
|
116
|
+
Owns the browser. Every other module operates through it.
|
|
117
|
+
|
|
118
|
+
**Serves (on-demand):** Each consumer calls only what it needs — no bundled extraction. Available: `get_raw_accessibility_tree()`, `get_dom_tree()` (fast regex-cleaned `page.content()`), `take_screenshot()`, `get_interactive_elements()` (with hierarchical filtering at 200+ elements), `get_cookies()`, `get_storage()`, meta tokens with source annotations, captured network traffic with protocol classification.
|
|
119
|
+
|
|
120
|
+
**Does not do:** LLM-oriented formatting, SoM annotation, DOM distillation, task interpretation, MCP logic.
|
|
121
|
+
|
|
122
|
+
**Shared Gemini utility:** `call_gemini()` at module level handles API mechanics. Each consumer provides its own model, schema, prompt, and config. Defaults: `max_output_tokens=8192`, `ThinkingConfig(thinking_budget=4096)`. Retries once with doubled thinking budget on truncated JSON.
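The retry rule above can be sketched as follows. This is a minimal illustration, not the package's `call_gemini()`: the `generate` callable and `fake_generate` are hypothetical stand-ins for the real API call, and only the "retry once with a doubled thinking budget on truncated JSON" behavior is taken from the text.

```python
import json
from typing import Callable


def call_with_retry(generate: Callable[[int], str], thinking_budget: int = 4096) -> dict:
    """Parse JSON from `generate`; on unparseable (truncated) JSON, retry once."""
    try:
        return json.loads(generate(thinking_budget))
    except json.JSONDecodeError:
        # Single retry with a doubled thinking budget; a second failure propagates.
        return json.loads(generate(thinking_budget * 2))


# Hypothetical stand-in for the model: output is cut short below a budget of 8192.
def fake_generate(budget: int) -> str:
    full = '{"verdict": "success", "confidence": 0.9}'
    return full if budget >= 8192 else full[:20]


result = call_with_retry(fake_generate)
```

Keeping the retry inside the shared utility means each consumer only supplies its model, schema, and prompt, and never has to handle truncation itself.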
|
|
123
|
+
|
|
124
|
+
**Action execution:** Receives structured action dicts from CU agent, resolves element IDs to Playwright selectors, executes, returns structured results. Never decides what action to take.
|
|
125
|
+
|
|
126
|
+
**Chrome via CDP** for real browser fingerprint. **curl_cffi** for TLS-matched MCP HTTP replay.
|
|
127
|
+
|
|
128
|
+
---
|
|
129
|
+
|
|
130
|
+
### morphnet_orchestrator.py — Task Planner
|
|
131
|
+
|
|
132
|
+
Receives a natural language task + start URL. Decomposes into subtasks. Routes each to CU or MCP.
|
|
133
|
+
|
|
134
|
+
**Representation:** Text-only AXTree distillation (strip element-level details, keep headings/landmarks/text/structure) + lightweight DOM summary (page landmarks, form structures, metadata). No screenshots — no actionable planning information beyond what text provides.
|
|
135
|
+
|
|
136
|
+
**Planning model:** AgentOccam's branch/prune tree. Each node is a sub-plan. The orchestrator can `branch` (try new approach), `prune` (abandon failed approach), or `continue`. When a branch completes or is pruned, its observations are condensed into a **structured summary** — not a one-liner but a pointed digest capturing: what was attempted, key actions taken, outcome, reasoning for the outcome, and any insights gained. Only the current active branch retains full context. This manages context growth while preserving enough history for informed planning.
|
|
137
|
+
|
|
138
|
+
**MCP lifecycle management:** Tracks tool status. Routes to trusted/verified MCPs when available. Falls back to CU when MCPs fail. Does not interpret MCP HTTP responses — reads the reflector's structured verdict. If reflector said success but the page state contradicts it on the next planning step, the orchestrator notices naturally (it loads fresh page state for planning anyway) and degrades the MCP.
|
|
139
|
+
|
|
140
|
+
**Model:** `gemini-3.1-pro-preview`, thinking enabled, ~0.4 temperature, 8192 max tokens.
|
|
141
|
+
|
|
142
|
+
---
|
|
143
|
+
|
|
144
|
+
### representation.py — Page Representation Pipeline
|
|
145
|
+
|
|
146
|
+
Owns ALL AXTree-to-text transformations. Both CU agent and orchestrator import from it.
|
|
147
|
+
|
|
148
|
+
**Pipeline:** CLEAN (whitespace normalization, CSS-name filtering, text compression) → COLLECT (element matching, functional role inference) → STRUCTURE (depth-keyed context tracking, text dedup, footer exclusion) → FORMAT (section-based output with inline elements).
|
|
149
|
+
|
|
150
|
+
**Context tracking:** A `_ContextStack` records the most recent significant text at each AXTree depth during the walk. When a generic button like "ADD" is encountered, the stack provides the nearest product name — regardless of whether it's a heading, StaticText, or paragraph. This solves the "which ADD button?" disambiguation problem on food delivery menus, e-commerce product lists, etc.
|
|
151
|
+
|
|
152
|
+
**Four views:** `build_cu_representation()` (section-based, inline elements with context, footer excluded), `build_orchestrator_representation()` (text-only, full page, no element IDs), `build_reflector_representation()` (content-focused, card-aware, chrome-compressed — for subtask outcome verification), and `build_mcp_parameter_context()` (recipe-based extraction from browser state for MCP parameter generation).
|
|
153
|
+
|
|
154
|
+
---
|
|
155
|
+
|
|
156
|
+
### computer_use.py — Browser Action Agent
|
|
157
|
+
|
|
158
|
+
Receives a subtask description. Has a budget of 10 actions to complete it.
|
|
159
|
+
|
|
160
|
+
**Representation:** Uses `representation.py` for AXTree-to-text transformation. Interactive elements appear inline with their context text. Generic buttons get nearby-text disambiguation. Footer excluded. Pruning rules: merge redundant StaticText, convert tables/lists to Markdown, strip rendering artifacts, collapse repetitive siblings, exclude CSS-class names.
|
|
161
|
+
|
|
162
|
+
**Viewport-aware:** Loads visible + one viewport below. Scroll remains a valid action for revealing more content. Unlike AgentOccam's "load full page" approach, this handles real-world infinite-scroll sites.
|
|
163
|
+
|
|
164
|
+
**Screenshots:** SoM-annotated screenshot only on first action and after failed actions. AXTree with element IDs is the primary representation.
|
|
165
|
+
|
|
166
|
+
**Action space:** `click`, `type`, `select`, `scroll`, `press_key`, `navigate`, `hover`, `go_back`, `wait`, `note`, `stop`. The `note` action records observations without browser interaction (critical for multi-step retrieval). The `stop` action signals subtask completion.
|
|
167
|
+
|
|
168
|
+
**History:** Flat within subtask (no branching for 10 actions). Last 2-3 actions: full detail. Earlier: one-line summaries. Current state dominates context.
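One way to render such a flat history is sketched below. The function name, the dict keys, and the exact line formats are assumptions for illustration; only the "recent actions in full, earlier ones as one-liners" rule comes from the text.

```python
def render_history(actions: list[dict], full_detail: int = 3) -> list[str]:
    """Keep the last `full_detail` actions verbose; compress earlier ones to one line."""
    cutoff = max(len(actions) - full_detail, 0)
    lines = []
    for index, action in enumerate(actions, start=1):
        if index <= cutoff:
            lines.append(f"{index}. {action['type']} -> {action['verdict']}")
        else:
            lines.append(
                f"{index}. {action['type']} target={action['target']!r} "
                f"verdict={action['verdict']} detail={action['detail']}"
            )
    return lines


history = [
    {"type": "click", "target": "Search", "verdict": "success", "detail": "URL changed"},
    {"type": "type", "target": "Query box", "verdict": "success", "detail": "value matched"},
    {"type": "press_key", "target": "Enter", "verdict": "success", "detail": "results loaded"},
    {"type": "click", "target": "First result", "verdict": "failure", "detail": "no change"},
    {"type": "stop", "target": "", "verdict": "success", "detail": "subtask complete"},
]
lines = render_history(history, full_detail=2)
```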
|
|
169
|
+
|
|
170
|
+
**Extraction pattern (n+1):** Initial extraction once before the action loop. After each action, the after-state becomes the next iteration's before-state. For n actions, this requires n+1 extractions instead of the naive 2n.
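The n+1 pattern can be sketched as a loop that carries the after-state forward. The `extract` and `execute` callables here are hypothetical stand-ins for the session manager's extraction and action execution.

```python
from typing import Any, Callable


def run_actions(
    actions: list[str],
    extract: Callable[[], Any],
    execute: Callable[[str], None],
) -> list[tuple[str, Any, Any]]:
    before = extract()  # the single up-front extraction
    transitions = []
    for action in actions:
        execute(action)
        after = extract()  # one extraction per action
        transitions.append((action, before, after))
        before = after  # reuse the after-state instead of re-extracting
    return transitions


# Toy page whose state is a counter; each action advances it by one.
calls = {"extract": 0}
page = {"version": 0}


def extract() -> int:
    calls["extract"] += 1
    return page["version"]


def execute(action: str) -> None:
    page["version"] += 1


transitions = run_actions(["click", "type", "stop"], extract, execute)
```

For the three actions above the loop performs four extractions, where extracting separately before and after every action would perform six.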
|
|
171
|
+
|
|
172
|
+
**On success:** Signals mcp_manager to analyze captured traffic for MCP discovery.
|
|
173
|
+
|
|
174
|
+
**Model:** `gemini-3-flash-preview`, thinking enabled, 8192 max tokens.
|
|
175
|
+
|
|
176
|
+
---
|
|
177
|
+
|
|
178
|
+
### mcp_manager.py — API Tool Manager
|
|
179
|
+
|
|
180
|
+
Creates, validates, executes, and lifecycle-manages MCP tools. Built after the three core modules.
|
|
181
|
+
|
|
182
|
+
**Representation:** Raw DOM focused on parameter sources (hidden fields, data attributes, form structure), meta tokens with source annotations, cookies, storage dumps, and captured traffic. Does not receive AXTree or screenshots.
|
|
183
|
+
|
|
184
|
+
**Protocol support:** REST, GraphQL (operationName-based identity, mutation detection), JSON-RPC (method-based identity), URL-encoded form, multipart form.
|
|
185
|
+
|
|
186
|
+
**Evolving parameter schema:** Each MCP tool maintains an inferred schema that grows with every observation. Per parameter, the schema tracks: data type, required vs optional (presence frequency across observations), example values, value ranges for numerics, format hints (UUID, ISO date, JWT, etc.), and — critically — **source hints** noting where this parameter value was found in the browser state (which DOM element, which cookie, which prior API response field). These source hints mean that when the MCP is used in an entirely new scenario, the parameter generator knows exactly where to look first. Early observations produce a draft schema; after 10+ observations it stabilizes with confident required/optional classification. No enum detection — enums catastrophically constrain user-intent fields, session tokens, and chained outputs. Example values (`x-examples`) guide the LLM without constraining it.
|
|
187
|
+
|
|
188
|
+
**Extraction recipe:** Each tool has a per-parameter `extraction_recipe` — a list of `ExtractionStep` dicts that tell representation.py HOW to extract each parameter at execution time. Steps are typed (cookie, dom_field, dom_list, storage, meta_tag, url_component, prior_api_response, task_description) and classified (user_intent, ephemeral, chained, page_context, static). Built automatically from traced parameter sources at discovery time. The recipe executor in representation.py (`build_mcp_parameter_context`) runs each step deterministically against the browser state to produce structured context for the parameter generation LLM.
|
|
189
|
+
|
|
190
|
+
**Response chaining:** MCP response bodies are cached by `endpoint_identity`. Tool B's extraction recipe can reference Tool A's response via `prior_api_response` steps — works regardless of whether Tool A ran via MCP or CU (checks cache first, then browser captured traffic). This enables multi-step workflows like "search for location → use place_id to set delivery address."
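The lookup order (cache first, then captured browser traffic) can be sketched as below. The function name, the dotted `field_path` syntax, and the shape of the traffic entries are assumptions for illustration.

```python
from typing import Any


def resolve_prior_response(
    endpoint_identity: str,
    field_path: str,
    cache: dict[str, dict],
    captured_traffic: list[dict],
) -> Any:
    """Find a field in Tool A's response, whether A ran via MCP (cache) or CU (traffic)."""
    body = cache.get(endpoint_identity)
    if body is None:
        # Fall back to the most recent matching entry in captured traffic.
        for entry in reversed(captured_traffic):
            if entry["endpoint_identity"] == endpoint_identity:
                body = entry["response_body"]
                break
    if body is None:
        return None
    value: Any = body
    for key in field_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


traffic = [
    {"endpoint_identity": "GET /places/search", "response_body": {"results": {"place_id": "p-123"}}},
]
from_traffic = resolve_prior_response("GET /places/search", "results.place_id", {}, traffic)
from_cache = resolve_prior_response(
    "GET /places/search",
    "results.place_id",
    {"GET /places/search": {"results": {"place_id": "p-999"}}},
    traffic,
)
```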
|
|
191
|
+
|
|
192
|
+
**Response template:** Each tool learns a structural response template from successful responses. Tracks `always_present_paths` and `always_non_null_paths` (intersection across observations). The reflector uses this for deterministic structural checks — if a path that was always present is suddenly missing, or always-non-null data becomes empty, it's flagged as a failure without needing an LLM.
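A sketch of learning the template by intersection and then checking a new response against it. The path syntax, the definition of "empty" (`None`, `""`, `[]`, `{}`), and the class and method names are assumptions; `always_present_paths` and `always_non_null_paths` are the names used above. List items are not descended into in this simplified version.

```python
def collect_paths(value, prefix: str, present: set[str], non_null: set[str]) -> None:
    """Record every dotted path in a JSON-like dict, and which paths hold real data."""
    if prefix:
        present.add(prefix)
        if value not in (None, "", [], {}):
            non_null.add(prefix)
    if isinstance(value, dict):
        for key, child in value.items():
            collect_paths(child, f"{prefix}.{key}" if prefix else key, present, non_null)


class ResponseTemplate:
    def __init__(self) -> None:
        self.always_present_paths: set[str] | None = None
        self.always_non_null_paths: set[str] | None = None

    def observe(self, body: dict) -> None:
        present, non_null = set(), set()
        collect_paths(body, "", present, non_null)
        if self.always_present_paths is None:
            self.always_present_paths, self.always_non_null_paths = present, non_null
        else:
            # Intersection: a path survives only if every observation had it.
            self.always_present_paths &= present
            self.always_non_null_paths &= non_null

    def check(self, body: dict) -> list[str]:
        if self.always_present_paths is None:
            return []  # nothing learned yet
        present, non_null = set(), set()
        collect_paths(body, "", present, non_null)
        problems = [f"missing: {p}" for p in sorted(self.always_present_paths - present)]
        emptied = (self.always_non_null_paths & present) - non_null
        problems += [f"empty: {p}" for p in sorted(emptied)]
        return problems


template = ResponseTemplate()
template.observe({"data": {"place_id": "abc", "name": "Pune"}, "promo": "X"})
template.observe({"data": {"place_id": "def", "name": "Mumbai"}})
null_problems = template.check({"data": {"place_id": None, "name": "Delhi"}})
missing_problems = template.check({"data": {"name": "Delhi"}})
```

The optional `promo` field drops out of the template after the second observation, so its absence later is not flagged.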
|
|
193
|
+
|
|
194
|
+
**A/B learning:** When an MCP tool fails and CU fallback succeeds, `learn_from_cu_fallback` compares the failed parameters against the correct CU traffic. For each differing parameter: traces the correct value in the browser state registry, rebuilds the extraction step, and replaces the old recipe step. Also merges the correct request/response into the schema and template.
|
|
195
|
+
|
|
196
|
+
**Validation at discovery:** Immediate replay via curl_cffi + independent param generation test + reflector confirms state change. Tool only marked "verified" if all three pass.
|
|
197
|
+
|
|
198
|
+
**MCP Lifecycle:**
|
|
199
|
+
|
|
200
|
+
| State | Entry Condition | Orchestrator Behavior |
|
|
201
|
+
|---|---|---|
|
|
202
|
+
| **Verified** | Passes validation at discovery | Available for routing |
|
|
203
|
+
| **Trusted** | 3 consecutive successes from verified | Preferred over CU |
|
|
204
|
+
| **Degraded** | Trusted tool fails once | Available with warning; 2 more consecutive failures → discarded |
|
|
205
|
+
| **Discarded** | 3 consecutive failures from any state | Removed from routing. Failure reason logged for future reference |
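The table can be read as a small state machine, sketched below. The class and counter names are illustrative. The table does not say what happens when a degraded tool succeeds, so this sketch assumes it stays degraded with its failure streak reset.

```python
from dataclasses import dataclass


@dataclass
class ToolStatus:
    state: str = "verified"  # entry state after passing validation at discovery
    consecutive_successes: int = 0
    consecutive_failures: int = 0

    def record(self, success: bool) -> str:
        if self.state == "discarded":
            return self.state  # terminal: removed from routing
        if success:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if self.state == "verified" and self.consecutive_successes >= 3:
                self.state = "trusted"
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            if self.consecutive_failures >= 3:
                self.state = "discarded"  # 3 consecutive failures from any state
            elif self.state == "trusted":
                self.state = "degraded"   # a trusted tool fails once
        return self.state


tool = ToolStatus()
states = [tool.record(outcome) for outcome in (True, True, True, False, False, False)]
```

A trusted tool's first failure counts toward the three-failure limit, which is why two more consecutive failures after degradation lead to discard.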
|
|
206
|
+
|
|
207
|
+
---
|
|
208
|
+
|
|
209
|
+
### reflector.py — Three-Stage Verification
|
|
210
|
+
|
|
211
|
+
Determines whether actions and subtasks succeeded. Most actions verified without LLM calls.
|
|
212
|
+
|
|
213
|
+
**Stage 1 — Deterministic Signals (every action, zero LLM cost):**
|
|
214
|
+
Element value before/after, URL change, HTTP status codes from captured traffic, ARIA alert/status/dialog nodes in AXTree (W3C standard — framework-agnostic), `aria-invalid` field changes, element count diff.
|
|
215
|
+
|
|
216
|
+
Most actions resolve here: type (value match), select (value match), scroll (new elements), navigate (URL change), click-with-navigation (URL change + no alerts).
|
|
217
|
+
|
|
218
|
+
**Stage 2 — AXTree Diff (ambiguous cases only):**
|
|
219
|
+
Flatten before/after AXTrees, compare node signatures, report additions/removals/changes. Prioritize ARIA signal nodes and structural changes. Inherently excludes cosmetic noise (CSS, animations, decorations aren't in AXTree).
|
|
220
|
+
|
|
221
|
+
Key detection: submit action + no meaningful changes + no ARIA alerts + no HTTP errors = silent failure (flagged as suspicious, never auto-classified as success).
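A minimal sketch of the Stage 2 diff and the silent-failure rule, assuming AXTree nodes are dicts with `role`, `name`, `value`, and `children` keys. The function names and the `(role, name, value)` signature are illustrative choices, not the package's actual node signature.

```python
from collections import Counter

SIGNAL_ROLES = {"alert", "status", "dialog"}  # ARIA roles treated as outcome signals


def flatten(tree: dict) -> Counter:
    """Multiset of (role, name, value) signatures for every node in the tree."""
    signatures: Counter = Counter()
    stack = [tree]
    while stack:
        node = stack.pop()
        signatures[(node.get("role"), node.get("name"), node.get("value"))] += 1
        stack.extend(node.get("children", []))
    return signatures


def assess(is_submit: bool, before: dict, after: dict, http_statuses: list[int]) -> dict:
    old, new = flatten(before), flatten(after)
    added = list((new - old).elements())    # a changed node shows up as removed + added
    removed = list((old - new).elements())
    alerts = [sig for sig in added if sig[0] in SIGNAL_ROLES]
    http_error = any(code >= 400 for code in http_statuses)
    # Submit with no tree changes and no HTTP error: suspicious, never auto-success.
    suspicious = is_submit and not added and not removed and not http_error
    return {"added": added, "removed": removed, "alerts": alerts, "suspicious": suspicious}


form = {"role": "form", "name": "Address", "children": [
    {"role": "textbox", "name": "City", "value": "Pune"},
    {"role": "button", "name": "Save"},
]}
saved = {"role": "form", "name": "Address", "children": [
    {"role": "textbox", "name": "City", "value": "Pune"},
    {"role": "button", "name": "Save"},
    {"role": "status", "name": "Address saved"},
]}
silent = assess(True, form, form, [200])
confirmed = assess(True, form, saved, [200])
```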
|
|
222
|
+
|
|
223
|
+
**Stage 3 — LLM Evaluation (only when Stages 1-2 can't resolve, ~2-3 per subtask):**
|
|
224
|
+
Receives: the action attempted, deterministic signals as facts, compact AXTree diff, ARIA alert/status text. Must cite specific evidence for its verdict — cannot claim success without pointing to concrete signals. Binary verdict (not rubric-based — research shows 87% human agreement vs ambiguous rubric scores).
|
|
225
|
+
|
|
226
|
+
**MCP verification — deterministic-only:**
|
|
227
|
+
- **Reflector (immediate):** Deterministic HTTP status check → response structure check against learned template (always_present_paths, always_non_null_paths) → page state AXTree diff for mutations. No LLM calls. Returns structured verdict to orchestrator.
|
|
228
|
+
- **Orchestrator (natural):** Loads fresh page state on next planning step. If reflector said success but page contradicts, orchestrator notices and degrades the MCP. Semantic verification is the orchestrator's job, not the reflector's.
|
|
229
|
+
|
|
230
|
+
**Subtask reflection (deep, after entire subtask):**
|
|
231
|
+
Full journey evaluation: condensed action log with per-action verdicts, current page AXTree, focused DOM excerpt around expected change region, notes from CU agent. Specifically checks for "claimed but not executed" (agent said stop/success but never performed the key submit/click action). Uses Gemini Pro Preview with high thinking budget.
|
|
232
|
+
|
|
233
|
+
---
|
|
234
|
+
|
|
235
|
+
### trace.py — Decision Trace
|
|
236
|
+
|
|
237
|
+
Already built. Deterministic recorder. Zero LLM calls.
|
|
238
|
+
|
|
239
|
+
Every Gemini call wraps in `trace.span()`. Every schema includes `reasoning`, `confidence`, `evidence_sources` — these flow directly from model output to trace entries. Every browser action, traffic capture, and reflection assessment is logged.
|
|
240
|
+
|
|
241
|
+
Output: `./results/{datetime}/trace.jsonl`. Eval harness controls path for benchmark runs.
|
|
242
|
+
|
|
243
|
+
---
|
|
244
|
+
|
|
245
|
+
### run_webarena_evals.py — Eval Harness
|
|
246
|
+
|
|
247
|
+
Deterministic scoring. Zero LLM calls. Wraps MorphNet for WebArena Verified benchmarks.
|
|
248
|
+
|
|
249
|
+
---
|
|
250
|
+
|
|
251
|
+
## Model Assignments
|
|
252
|
+
|
|
253
|
+
| Role | Model | Thinking | Max Tokens |
|
|
254
|
+
|---|---|---|---|
|
|
255
|
+
| Orchestrator planning | `gemini-3.1-pro-preview` | Enabled, budget 4096 | 8192 |
|
|
256
|
+
| CU action generation | `gemini-3-flash-preview` | Enabled, budget 4096 | 8192 |
|
|
257
|
+
| Per-action reflection (Stage 3) | `gemini-3-flash-preview` | Enabled, budget 4096 | 8192 |
|
|
258
|
+
| Per-subtask reflection | `gemini-3.1-pro-preview` | Enabled, budget 4096 | 8192 |
|
|
259
|
+
| MCP parameter generation | `gemini-3-flash-preview` | Enabled, budget 4096 | 8192 |
|
|
260
|
+
| MCP response-vs-intent check | `gemini-3-flash-preview` | Enabled, budget 4096 | 8192 |
|
|
261
|
+
|
|
262
|
+
**Flash Lite is not used anywhere.** Every call sits on a critical path.
|
|
263
|
+
|
|
264
|
+
---
|
|
265
|
+
|
|
266
|
+
## Development Principles
|
|
267
|
+
|
|
268
|
+
1. **Gemini structured output schemas are typed function contracts.** Maximally descriptive field names, types, enums, descriptions. Every schema includes `reasoning`, `confidence`, `evidence_sources`. These flow directly to trace entries.
|
|
269
|
+
|
|
270
|
+
2. **Never string match on unstructured natural language.** Parsing structured material (HTML, JSON) is fine. Never regex/substring on model outputs.
|
|
271
|
+
|
|
272
|
+
3. **Centralized representation pipeline in `representation.py`.** session_manager serves raw data. `representation.py` owns all AXTree-to-text transformations — CU gets section-based inline elements with context tracking, orchestrator gets text-only distillation. Additional views: Set-of-Marks annotation (CU on failure), task-adaptive DOM distillation (MCP), adaptive evidence selection (reflector).
|
|
273
|
+
|
|
274
|
+
4. **Justify every field in every data structure.** What consumes it? What breaks if removed?
|
|
275
|
+
|
|
276
|
+
5. **No unnecessary files.** Consolidate. Keep whatever a module uses close by.
|
|
277
|
+
|
|
278
|
+
6. **Comments explain why, not what.** Related logic stays together.
|
|
279
|
+
|
|
280
|
+
7. **Prompts live in ./prompts/ as .txt files.** Not hardcoded.
|
|
281
|
+
|
|
282
|
+
8. **Every decision is traced.** Gemini calls wrap in `trace.span()`. Schema fields → JSONL.
|
|
283
|
+
|
|
284
|
+
9. **On-demand extraction, not bundled.** session_manager never bundles all extractions into one call. Each consumer calls exactly what it needs. This prevents 26+ second bottlenecks on complex pages.
|
|
285
|
+
|
|
286
|
+
10. **Auto site profiling.** `site_name` is derived from the URL hostname automatically. Site directories and configs are created on first access, not manually.
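Principle 10 can be sketched as below. The exact derivation rule is an assumption inferred from the bundled directory names (`swiggy_com`, `irctc_co_in`): lowercase the hostname, drop a leading `www.`, and replace dots with underscores. The function names are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def derive_site_name(url: str) -> str:
    host = (urlparse(url).hostname or "").lower()
    if not host:
        raise ValueError(f"URL has no hostname: {url!r}")
    if host.startswith("www."):
        host = host[len("www."):]
    return host.replace(".", "_")


def ensure_site_dir(sites_root: Path, url: str) -> Path:
    """Create the per-site directory on first access; a no-op if it already exists."""
    site_dir = sites_root / derive_site_name(url)
    site_dir.mkdir(parents=True, exist_ok=True)
    return site_dir


with tempfile.TemporaryDirectory() as tmp:
    created = ensure_site_dir(Path(tmp), "https://www.swiggy.com/restaurants")
    created_name, created_exists = created.name, created.is_dir()
```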
|
|
287
|
+
|
|
288
|
+
---
|
|
289
|
+
|
|
290
|
+
## Architectural Rules
|
|
291
|
+
|
|
292
|
+
1. **session_manager owns the browser.** No other module creates contexts, pages, or HTTP clients.
|
|
293
|
+
2. **session_manager serves raw data.** Each consumer builds its own view.
|
|
294
|
+
3. **Chrome via CDP + curl_cffi.** Real fingerprint for browsing and API replay.
|
|
295
|
+
4. **Orchestrator is benchmark-agnostic.** Eval logic in run_webarena_evals.py only.
|
|
296
|
+
5. **CU is stateless per subtask.** Orchestrator manages memory via planning tree.
|
|
297
|
+
6. **MCP lifecycle: verified → trusted → degraded → discarded.** Orchestrator checks status before routing.
|
|
298
|
+
7. **Reflector uses three stages.** Deterministic first, AXTree diff second, LLM third. Most actions need no LLM.
|
|
299
|
+
8. **All LLM outputs use structured schemas.** No free-form parsing.
|
|
300
|
+
9. **Website state in ./sites/.** Tools, profiles, credentials per-website.
|
|
301
|
+
10. **AgentOccam principles throughout.** Align to LLM pretraining. Simplify action/observation spaces. Branch/prune for context.
|
|
302
|
+
11. **Python 3.12.** Modern features throughout.
|
|
303
|
+
12. **Hierarchical element filtering.** When pages have >200 interactive elements, structural/navigational elements are preserved and the rest are sampled with section summaries for collapsed groups.
|
|
304
|
+
13. **Every decision traced via schema fields.** Gemini schemas include reasoning + evidence_sources + confidence. `./results/` stores all trace output, organized by datetime.
|
morphnet-0.1.0/README.md
ADDED
|
@@ -0,0 +1,272 @@
|
|
|
1
|
+
# MorphNet
|
|
2
|
+
|
|
3
|
+
MorphNet transforms volatile, expensive computer use (CU) into stable, fast, affordable MCP tool calls. Use CU as a **discovery mechanism** — observe successful browser interactions, capture HTTP traffic, identify deterministic request patterns, and crystallize these into reusable MCP tools. Over time, the system shifts from unreliable browser automation to deterministic API-level execution. **CU is discovery infrastructure, not the end state.**
|
|
4
|
+
|
|
5
|
+
## Architecture
|
|
6
|
+
|
|
7
|
+
```
|
|
8
|
+
User Query + URL
|
|
9
|
+
│
|
|
10
|
+
▼
|
|
11
|
+
┌──────────────────┐
|
|
12
|
+
│ session_manager │ Persistent Chrome via CDP · Raw data server
|
|
13
|
+
│ │ Shared Gemini inference utility
|
|
14
|
+
└────────┬─────────┘
|
|
15
|
+
│
|
|
16
|
+
▼
|
|
17
|
+
┌──────────────────┐
|
|
18
|
+
│ morphnet_ │ Branch/prune planning tree (AgentOccam)
|
|
19
|
+
│ orchestrator │ Routes subtasks to CU or MCP
|
|
20
|
+
└───┬──────────┬───┘
|
|
21
|
+
│ │
|
|
22
|
+
▼ ▼
|
|
23
|
+
┌────────┐ ┌────────────┐
|
|
24
|
+
│computer│ │mcp_manager │ All protocols: REST, GraphQL, JSON-RPC, form, multipart
|
|
25
|
+
│_use │ │ │ Lifecycle: verified → trusted → degraded → discarded
|
|
26
|
+
│ 10 acts│ │ │
|
|
27
|
+
└───┬────┘ └──────┬─────┘
|
|
28
|
+
│ │
|
|
29
|
+
▼ ▼
|
|
30
|
+
┌──────────────────────┐
|
|
31
|
+
│ reflector │ Three-stage pipeline: deterministic → AXTree diff → LLM
|
|
32
|
+
│ │ Separate paths for CU actions vs MCP calls vs subtasks
|
|
33
|
+
└───────────────────────┘
|
|
34
|
+
│
|
|
35
|
+
▼
|
|
36
|
+
┌──────────────────────┐
|
|
37
|
+
│ trace.py │ Every decision: reasoning, evidence, confidence → JSONL
|
|
38
|
+
└───────────────────────┘
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
## Directory Structure
|
|
42
|
+
|
|
43
|
+
```
|
|
44
|
+
morphnet/
|
|
45
|
+
├── session_manager.py # Browser session + raw data + Gemini utility
|
|
46
|
+
├── morphnet_orchestrator.py # Planning, routing, website profiling
|
|
47
|
+
├── computer_use.py # CU agent (10 actions per subtask)
|
|
48
|
+
├── representation.py # Page representation pipeline (CLEAN→COLLECT→STRUCTURE→FORMAT)
|
|
49
|
+
├── mcp_manager.py # MCP creation, execution, lifecycle
|
|
50
|
+
├── reflector.py # Three-stage verification pipeline
|
|
51
|
+
├── trace.py # Decision trace recorder (deterministic)
|
|
52
|
+
├── run_webarena_evals.py # Eval harness (deterministic, no LLM calls)
|
|
53
|
+
├── prompts/ # All LLM prompts as .txt files
|
|
54
|
+
└── sites/ # Per-website persistent state
|
|
55
|
+
├── noise_domains.txt
|
|
56
|
+
└── {site_name}/
|
|
57
|
+
├── profile.json # Website insights, auth patterns
|
|
58
|
+
├── credentials.json # Login credentials
|
|
59
|
+
└── tools.json # MCP tools + lifecycle status
|
|
60
|
+
|
|
61
|
+
results/ # Trace output (auto-created)
|
|
62
|
+
├── YYYY-MM-DD_HHMMSS/ # Single run
|
|
63
|
+
│ └── trace.jsonl
|
|
64
|
+
└── eval_{benchmark}_{datetime}/ # Eval batch
|
|
65
|
+
├── task_{id}/
|
|
66
|
+
│ └── trace.jsonl
|
|
67
|
+
└── eval_summary.json
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
---
|
|
71
|
+
|
|
72
|
+
## Core Design Decision: Each Module Owns Its Representation
|
|
73
|
+
|
|
74
|
+
**session_manager.py serves raw data.** It extracts DOM, AXTree, screenshots, cookies, tokens, and traffic — hands them unprocessed to consumer modules. Basic structural cleaning only (strip `<script>`, `<style>`, `<noscript>`; filter noise domains).
|
|
75
|
+
|
|
76
|
+
**Each consumer module distills this raw data into the representation its LLM needs.** The raw data sources at each module's disposal are: AXTree (semantic roles, states, accessible names), DOM (structure, hidden fields, data attributes, form layout, parameter sources), screenshots (visual layout, annotated by CU with Set-of-Marks (SoM) bounding boxes when element grounding for VLMs is needed), cookies/storage (session state, auth tokens), meta tokens (CSRF, form keys with source annotations), and captured network traffic (request/response pairs with protocol classification). No single module uses all of these; each selects and distills only the subset relevant to its task, following the task-adaptive distillation principle of AgentOccam (ICLR 2025) and Agent-E.
|
|
77
|
+
|
|
78
|
+
---
|
|
79
|
+
|
|
80
|
+
## Module Specifications
|
|
81
|
+
|
|
82
|
+
### session_manager.py — Raw Data Server
|
|
83
|
+
|
|
84
|
+
Owns the browser. Every other module operates through it.
|
|
85
|
+
|
|
86
|
+
**Serves (on-demand):** Each consumer calls only what it needs — no bundled extraction. Available: `get_raw_accessibility_tree()`, `get_dom_tree()` (fast regex-cleaned `page.content()`), `take_screenshot()`, `get_interactive_elements()` (with hierarchical filtering at 200+ elements), `get_cookies()`, `get_storage()`, meta tokens with source annotations, captured network traffic with protocol classification.
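
The "fast regex-cleaned" DOM mentioned above could look roughly like the following. This is a minimal sketch rather than the package's implementation: `clean_dom_html` is a hypothetical name, and regex stripping is only a coarse cleanup, not a real HTML parse.

```python
import re

# Hypothetical helper (not from the package): matches whole <script>, <style>,
# and <noscript> blocks, including their contents.
_NOISE_BLOCKS = re.compile(
    r"<(script|style|noscript)\b[^>]*>.*?</\1\s*>",
    flags=re.IGNORECASE | re.DOTALL,
)


def clean_dom_html(html: str) -> str:
    """Remove script/style/noscript blocks from raw page HTML."""
    return _NOISE_BLOCKS.sub("", html)


sample = "<div><script>var a = 1;</script><p>Hello</p><style>p{}</style></div>"
cleaned = clean_dom_html(sample)
```

In a Playwright-driven setup the input string would plausibly come from `page.content()`, as the text above suggests.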
|
|
87
|
+
|
|
88
|
+
**Does not do:** LLM-oriented formatting, SoM annotation, DOM distillation, task interpretation, MCP logic.
|
|
89
|
+
|
|
90
|
+
**Shared Gemini utility:** `call_gemini()` at module level handles API mechanics. Each consumer provides its own model, schema, prompt, and config. Defaults: `max_output_tokens=8192`, `ThinkingConfig(thinking_budget=4096)`. Retries once with doubled thinking budget on truncated JSON.
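
The retry-on-truncation behaviour described above can be sketched independently of any Gemini client. Everything here is an assumption for illustration: `call_with_truncation_retry` and `generate` are hypothetical names, and a JSON parse failure is used as a stand-in for "truncated JSON".

```python
import json
from typing import Callable


def call_with_truncation_retry(
    generate: Callable[[int], str],
    thinking_budget: int = 4096,
) -> dict:
    """Call `generate(budget)`; on unparseable JSON, retry once with 2x budget."""
    try:
        return json.loads(generate(thinking_budget))
    except json.JSONDecodeError:
        # Single retry only: a second parse failure propagates to the caller.
        return json.loads(generate(thinking_budget * 2))


# Fake model call (assumption): output is cut off unless the budget is large.
budgets_seen: list[int] = []


def fake_generate(budget: int) -> str:
    budgets_seen.append(budget)
    return '{"ok": true}' if budget >= 8192 else '{"ok": tr'


result = call_with_truncation_retry(fake_generate)
```

Passing the model call in as a callable keeps the retry policy testable without network access.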
|
|
91
|
+
|
|
92
|
+
**Action execution:** Receives structured action dicts from CU agent, resolves element IDs to Playwright selectors, executes, returns structured results. Never decides what action to take.
|
|
93
|
+
|
|
94
|
+
**Chrome via CDP** for real browser fingerprint. **curl_cffi** for TLS-matched MCP HTTP replay.
|
|
95
|
+
|
|
96
|
+
---
|
|
97
|
+
|
|
98
|
+
### morphnet_orchestrator.py — Task Planner
|
|
99
|
+
|
|
100
|
+
Receives a natural language task + start URL. Decomposes into subtasks. Routes each to CU or MCP.
|
|
101
|
+
|
|
102
|
+
**Representation:** Text-only AXTree distillation (strip element-level details, keep headings/landmarks/text/structure) + lightweight DOM summary (page landmarks, form structures, metadata). No screenshots: they add no actionable planning information beyond what the text provides.
|
|
103
|
+
|
|
104
|
+
**Planning model:** AgentOccam's branch/prune tree. Each node is a sub-plan. The orchestrator can `branch` (try new approach), `prune` (abandon failed approach), or `continue`. When a branch completes or is pruned, its observations are condensed into a **structured summary** — not a one-liner but a pointed digest capturing: what was attempted, key actions taken, outcome, reasoning for the outcome, and any insights gained. Only the current active branch retains full context. This manages context growth while preserving enough history for informed planning.
|
|
105
|
+
|
|
106
|
+
**MCP lifecycle management:** Tracks tool status. Routes to trusted/verified MCPs when available. Falls back to CU when MCPs fail. Does not interpret MCP HTTP responses — reads the reflector's structured verdict. If reflector said success but the page state contradicts it on the next planning step, the orchestrator notices naturally (it loads fresh page state for planning anyway) and degrades the MCP.
|
|
107
|
+
|
|
108
|
+
**Model:** `gemini-3.1-pro-preview`, thinking enabled, ~0.4 temperature, 8192 max tokens.
|
|
109
|
+
|
|
110
|
+
---
|
|
111
|
+
|
|
112
|
+
### representation.py — Page Representation Pipeline
|
|
113
|
+
|
|
114
|
+
Owns ALL AXTree-to-text transformations. Both CU agent and orchestrator import from it.
|
|
115
|
+
|
|
116
|
+
**Pipeline:** CLEAN (whitespace normalization, CSS-name filtering, text compression) → COLLECT (element matching, functional role inference) → STRUCTURE (depth-keyed context tracking, text dedup, footer exclusion) → FORMAT (section-based output with inline elements).
|
|
117
|
+
|
|
118
|
+
**Context tracking:** A `_ContextStack` records the most recent significant text at each AXTree depth during the walk. When a generic button like "ADD" is encountered, the stack provides the nearest product name — regardless of whether it's a heading, StaticText, or paragraph. This solves the "which ADD button?" disambiguation problem on food delivery menus, e-commerce product lists, etc.
|
|
119
|
+
|
|
120
|
+
**Four views:** `build_cu_representation()` (section-based, inline elements with context, footer excluded), `build_orchestrator_representation()` (text-only, full page, no element IDs), `build_reflector_representation()` (content-focused, card-aware, chrome-compressed — for subtask outcome verification), and `build_mcp_parameter_context()` (recipe-based extraction from browser state for MCP parameter generation).
|
|
121
|
+
|
|
122
|
+
---
|
|
123
|
+
|
|
124
|
+
### computer_use.py — Browser Action Agent
|
|
125
|
+
|
|
126
|
+
Receives a subtask description. Has a budget of 10 actions to complete it.
|
|
127
|
+
|
|
128
|
+
**Representation:** Uses `representation.py` for AXTree-to-text transformation. Interactive elements appear inline with their context text. Generic buttons get nearby-text disambiguation. Footer excluded. Pruning rules: merge redundant StaticText, convert tables/lists to Markdown, strip rendering artifacts, collapse repetitive siblings, exclude CSS-class names.
|
|
129
|
+
|
|
130
|
+
**Viewport-aware:** Loads visible + one viewport below. Scroll remains a valid action for revealing more content. Unlike AgentOccam's "load full page" approach, this handles real-world infinite-scroll sites.
|
|
131
|
+
|
|
132
|
+
**Screenshots:** SoM-annotated screenshot only on first action and after failed actions. AXTree with element IDs is the primary representation.
|
|
133
|
+
|
|
134
|
+
**Action space:** `click`, `type`, `select`, `scroll`, `press_key`, `navigate`, `hover`, `go_back`, `wait`, `note`, `stop`. The `note` action records observations without browser interaction (critical for multi-step retrieval). The `stop` action signals subtask completion.
|
|
135
|
+
|
|
136
|
+
**History:** Flat within subtask (no branching for 10 actions). Last 2-3 actions: full detail. Earlier: one-line summaries. Current state dominates context.
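
The flat history policy above (recent actions in full, earlier ones as one-liners) might be rendered like this. The record fields (`action`, `target`, `verdict`, `detail`) and the function name are hypothetical.

```python
def render_history(actions: list[dict], full_detail_last: int = 3) -> list[str]:
    """Render older actions as one-line summaries and recent ones in full."""
    cutoff = max(len(actions) - full_detail_last, 0)
    lines = []
    for i, a in enumerate(actions):
        if i < cutoff:
            lines.append(f"{i + 1}. {a['action']}: {a['verdict']}")
        else:
            lines.append(
                f"{i + 1}. {a['action']} on {a['target']}: "
                f"{a['verdict']} ({a['detail']})"
            )
    return lines


history = [
    {"action": "click", "target": "e12", "verdict": "success", "detail": "URL changed"},
    {"action": "type", "target": "e4", "verdict": "success", "detail": "value matched"},
    {"action": "click", "target": "e9", "verdict": "ambiguous", "detail": "no diff"},
    {"action": "scroll", "target": "page", "verdict": "success", "detail": "12 new elements"},
]
history_lines = render_history(history, full_detail_last=2)
```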
|
|
137
|
+
|
|
138
|
+
**Extraction pattern (n+1):** Initial extraction once before the action loop. After each action, the after-state becomes the next iteration's before-state. For n actions, this requires n+1 extractions instead of the naive 2n.
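
The n+1 pattern is easiest to see as a loop in which each after-state is reused as the next before-state. In this sketch `extract` and `execute` are hypothetical callables standing in for page extraction and action execution.

```python
from typing import Callable


def run_actions(
    actions: list[str],
    extract: Callable[[], str],
    execute: Callable[[str], None],
) -> list[tuple[str, str, str]]:
    """Return (action, before_state, after_state) using n+1 extractions."""
    steps = []
    before = extract()  # the single up-front extraction
    for action in actions:
        execute(action)
        after = extract()  # one extraction per action
        steps.append((action, before, after))
        before = after  # reuse instead of extracting again
    return steps


# Fake page whose state advances with every executed action.
states = ["s0"]
extract_log: list[str] = []


def fake_extract() -> str:
    extract_log.append(states[-1])
    return states[-1]


def fake_execute(action: str) -> None:
    states.append(f"s{len(states)}")


steps = run_actions(["click", "type", "stop"], fake_extract, fake_execute)
```

Three actions trigger four extractions here, compared with six if every action extracted its own before and after state.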
|
|
139
|
+
|
|
140
|
+
**On success:** Signals mcp_manager to analyze captured traffic for MCP discovery.
|
|
141
|
+
|
|
142
|
+
**Model:** `gemini-3-flash-preview`, thinking enabled, 8192 max tokens.
|
|
143
|
+
|
|
144
|
+
---
|
|
145
|
+
|
|
146
|
+
### mcp_manager.py — API Tool Manager
|
|
147
|
+
|
|
148
|
+
Creates, validates, executes, and lifecycle-manages MCP tools. Built after the three core modules.
|
|
149
|
+
|
|
150
|
+
**Representation:** Raw DOM focused on parameter sources (hidden fields, data attributes, form structure), meta tokens with source annotations, cookies, storage dumps, and captured traffic. Does not receive AXTree or screenshots.
|
|
151
|
+
|
|
152
|
+
**Protocol support:** REST, GraphQL (operationName-based identity, mutation detection), JSON-RPC (method-based identity), URL-encoded form, multipart form.
|
|
153
|
+
|
|
154
|
+
**Evolving parameter schema:** Each MCP tool maintains an inferred schema that grows with every observation. Per parameter, the schema tracks: data type, required vs optional (presence frequency across observations), example values, value ranges for numerics, format hints (UUID, ISO date, JWT, etc.), and — critically — **source hints** noting where this parameter value was found in the browser state (which DOM element, which cookie, which prior API response field). These source hints mean that when the MCP is used in an entirely new scenario, the parameter generator knows exactly where to look first. Early observations produce a draft schema; after 10+ observations it stabilizes with confident required/optional classification. No enum detection — enums catastrophically constrain user-intent fields, session tokens, and chained outputs. Example values (`x-examples`) guide the LLM without constraining it.
|
|
155
|
+
|
|
156
|
+
**Extraction recipe:** Each tool has a per-parameter `extraction_recipe` — a list of `ExtractionStep` dicts that tell representation.py HOW to extract each parameter at execution time. Steps are typed (cookie, dom_field, dom_list, storage, meta_tag, url_component, prior_api_response, task_description) and classified (user_intent, ephemeral, chained, page_context, static). Built automatically from traced parameter sources at discovery time. The recipe executor in representation.py (`build_mcp_parameter_context`) runs each step deterministically against the browser state to produce structured context for the parameter generation LLM.
|
|
157
|
+
|
|
158
|
+
**Response chaining:** MCP response bodies are cached by `endpoint_identity`. Tool B's extraction recipe can reference Tool A's response via `prior_api_response` steps — works regardless of whether Tool A ran via MCP or CU (checks cache first, then browser captured traffic). This enables multi-step workflows like "search for location → use place_id to set delivery address."
|
|
159
|
+
|
|
160
|
+
**Response template:** Each tool learns a structural response template from successful responses. Tracks `always_present_paths` and `always_non_null_paths` (intersection across observations). The reflector uses this for deterministic structural checks — if a path that was always present is suddenly missing, or always-non-null data becomes empty, it's flagged as a failure without needing an LLM.
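
The intersection-based template can be sketched as follows. The class and helper names are hypothetical, dotted paths are an assumed notation, and for brevity this version does not descend into lists.

```python
def _is_empty(value) -> bool:
    return value is None or value == "" or value == [] or value == {}


def collect_paths(value, prefix: str = "") -> tuple[set[str], set[str]]:
    """Return (present_paths, non_null_paths) for nested dicts."""
    present: set[str] = set()
    non_null: set[str] = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            present.add(path)
            if not _is_empty(child):
                non_null.add(path)
            child_present, child_non_null = collect_paths(child, path)
            present |= child_present
            non_null |= child_non_null
    return present, non_null


class ResponseTemplate:
    def __init__(self) -> None:
        self.always_present_paths: set[str] | None = None
        self.always_non_null_paths: set[str] | None = None

    def observe(self, body: dict) -> None:
        """Intersect this successful response into the learned template."""
        present, non_null = collect_paths(body)
        if self.always_present_paths is None:
            self.always_present_paths = present
            self.always_non_null_paths = non_null
        else:
            self.always_present_paths &= present
            self.always_non_null_paths &= non_null

    def check(self, body: dict) -> list[str]:
        """Return structural problems; an empty list means the check passed."""
        present, non_null = collect_paths(body)
        expected = self.always_present_paths or set()
        expected_non_null = self.always_non_null_paths or set()
        problems = [f"missing: {p}" for p in sorted(expected - present)]
        emptied = (expected_non_null & present) - non_null
        problems += [f"empty: {p}" for p in sorted(emptied)]
        return problems


template = ResponseTemplate()
template.observe({"data": {"id": 1, "items": [1]}, "promo": None})
template.observe({"data": {"id": 2, "items": [2, 3]}})
emptied_report = template.check({"data": {"id": 3, "items": []}})
missing_report = template.check({"data": {"id": 3}})
```

The optional `promo` field drops out of the template after the second observation, so its absence is never flagged.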
|
|
161
|
+
|
|
162
|
+
**A/B learning:** When an MCP tool fails and CU fallback succeeds, `learn_from_cu_fallback` compares the failed parameters against the correct CU traffic. For each differing parameter: traces the correct value in the browser state registry, rebuilds the extraction step, and replaces the old recipe step. Also merges the correct request/response into the schema and template.
|
|
163
|
+
|
|
164
|
+
**Validation at discovery:** Immediate replay via curl_cffi + independent param generation test + reflector confirms state change. A tool is marked "verified" only if all three pass.
|
|
165
|
+
|
|
166
|
+
**MCP Lifecycle:**
|
|
167
|
+
|
|
168
|
+
| State | Entry Condition | Orchestrator Behavior |
|
|
169
|
+
|---|---|---|
|
|
170
|
+
| **Verified** | Passes validation at discovery | Available for routing |
|
|
171
|
+
| **Trusted** | 3 consecutive successes from verified | Preferred over CU |
|
|
172
|
+
| **Degraded** | Trusted tool fails once | Available with warning; 2 more consecutive failures → discarded |
|
|
173
|
+
| **Discarded** | 3 consecutive failures from any state | Removed from routing. Failure reason logged for future reference |
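
The transitions in the table above can be expressed as a small state machine. This is a sketch, not the package's code: the class name is hypothetical, and because the table does not say how a degraded tool recovers, a success here only resets the failure streak and leaves the state unchanged.

```python
from dataclasses import dataclass


@dataclass
class ToolLifecycle:
    state: str = "verified"  # entry state after passing validation at discovery
    consecutive_successes: int = 0
    consecutive_failures: int = 0

    def record(self, success: bool) -> str:
        """Apply one execution outcome and return the resulting state."""
        if self.state == "discarded":
            return self.state
        if success:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if self.state == "verified" and self.consecutive_successes >= 3:
                self.state = "trusted"
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            if self.consecutive_failures >= 3:
                self.state = "discarded"  # 3 consecutive failures from any state
            elif self.state == "trusted":
                self.state = "degraded"  # a trusted tool fails once
        return self.state


tool = ToolLifecycle()
promotion = [tool.record(True) for _ in range(3)]
demotion = [tool.record(False) for _ in range(3)]
```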
|
|
174
|
+
|
|
175
|
+
---
|
|
176
|
+
|
|
177
|
+
### reflector.py — Three-Stage Verification
|
|
178
|
+
|
|
179
|
+
Determines whether actions and subtasks succeeded. Most actions verified without LLM calls.
|
|
180
|
+
|
|
181
|
+
**Stage 1 — Deterministic Signals (every action, zero LLM cost):**
|
|
182
|
+
Element value before/after, URL change, HTTP status codes from captured traffic, ARIA alert/status/dialog nodes in AXTree (W3C standard — framework-agnostic), `aria-invalid` field changes, element count diff.
|
|
183
|
+
|
|
184
|
+
Most actions resolve here: type (value match), select (value match), scroll (new elements), navigate (URL change), click-with-navigation (URL change + no alerts).
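
A Stage 1 check along these lines needs nothing beyond before/after snapshots. The snapshot fields, the verdict strings, and the exact rules below are assumptions chosen to mirror the signals listed above, not the reflector's actual logic.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    url: str
    element_values: dict[str, str]
    element_count: int
    error_alerts: list[str] = field(default_factory=list)  # ARIA alert text
    http_statuses: list[int] = field(default_factory=list)


def stage1_verdict(action: dict, before: Snapshot, after: Snapshot) -> str:
    """Return 'success', 'failure', or 'ambiguous' without any LLM call."""
    if after.error_alerts or any(s >= 400 for s in after.http_statuses):
        return "failure"
    kind = action["type"]
    if kind in ("type", "select"):
        matched = after.element_values.get(action["element_id"]) == action["value"]
        return "success" if matched else "failure"
    if kind == "scroll":
        grew = after.element_count > before.element_count
        return "success" if grew else "ambiguous"
    if kind in ("navigate", "click"):
        return "success" if after.url != before.url else "ambiguous"
    return "ambiguous"  # hand over to the later stages


before = Snapshot("https://example.com/cart", {"e7": ""}, 40)
typed = Snapshot("https://example.com/cart", {"e7": "2"}, 40)
failed = Snapshot("https://example.com/cart", {"e7": ""}, 40, http_statuses=[500])

type_verdict = stage1_verdict({"type": "type", "element_id": "e7", "value": "2"}, before, typed)
click_verdict = stage1_verdict({"type": "click", "element_id": "e9"}, before, typed)
error_verdict = stage1_verdict({"type": "click", "element_id": "e9"}, before, failed)
```

A click that changes neither URL nor alerts stays `ambiguous`, which is the case the later stages are meant to handle.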
|
|
185
|
+
|
|
186
|
+
**Stage 2 — AXTree Diff (ambiguous cases only):**
|
|
187
|
+
Flatten before/after AXTrees, compare node signatures, report additions/removals/changes. Prioritize ARIA signal nodes and structural changes. Inherently excludes cosmetic noise (CSS, animations, decorations aren't in AXTree).
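
The flatten-and-compare step might be sketched as below. The node shape (`role`, `name`, `value`, `children`) and the positional path format are assumptions; a real AXTree snapshot may carry different keys.

```python
def flatten(node: dict, path: str = "") -> dict[str, tuple[str, str, str]]:
    """Map each node's positional path to a (role, name, value) signature."""
    here = f"{path}/{node['role']}"
    flat = {here: (node["role"], node.get("name", ""), node.get("value", ""))}
    for index, child in enumerate(node.get("children", [])):
        flat.update(flatten(child, f"{here}[{index}]"))
    return flat


def diff_trees(before: dict, after: dict) -> dict[str, list[str]]:
    """Report added, removed, and changed node paths between two trees."""
    b, a = flatten(before), flatten(after)
    return {
        "added": sorted(a.keys() - b.keys()),
        "removed": sorted(b.keys() - a.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


before_tree = {
    "role": "WebArea",
    "children": [{"role": "textbox", "name": "Email", "value": ""}],
}
after_tree = {
    "role": "WebArea",
    "children": [
        {"role": "textbox", "name": "Email", "value": "a@b.c"},
        {"role": "alert", "name": "Invalid email"},
    ],
}
tree_diff = diff_trees(before_tree, after_tree)
```

Positional paths are simple but brittle when siblings are inserted mid-list; a production diff would likely need a more stable node identity.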
|
|
188
|
+
|
|
189
|
+
Key detection: submit action + no meaningful changes + no ARIA alerts + no HTTP errors = silent failure (flagged as suspicious, never auto-classified as success).
|
|
190
|
+
|
|
191
|
+
**Stage 3 — LLM Evaluation (only when Stages 1-2 can't resolve, ~2-3 per subtask):**
|
|
192
|
+
Receives: the action attempted, deterministic signals as facts, compact AXTree diff, ARIA alert/status text. Must cite specific evidence for its verdict — cannot claim success without pointing to concrete signals. Binary verdict (not rubric-based — research shows 87% human agreement vs ambiguous rubric scores).
|
|
193
|
+
|
|
194
|
+
**MCP verification — deterministic-only:**
|
|
195
|
+
- **Reflector (immediate):** Deterministic HTTP status check → response structure check against learned template (always_present_paths, always_non_null_paths) → page state AXTree diff for mutations. No LLM calls. Returns structured verdict to orchestrator.
|
|
196
|
+
- **Orchestrator (natural):** Loads fresh page state on next planning step. If reflector said success but page contradicts, orchestrator notices and degrades the MCP. Semantic verification is the orchestrator's job, not the reflector's.
|
|
197
|
+
|
|
198
|
+
**Subtask reflection (deep, after entire subtask):**
|
|
199
|
+
Full journey evaluation: condensed action log with per-action verdicts, current page AXTree, focused DOM excerpt around expected change region, notes from CU agent. Specifically checks for "claimed but not executed" (agent said stop/success but never performed the key submit/click action). Uses Gemini Pro Preview with high thinking budget.
|
|
200
|
+
|
|
201
|
+
---
|
|
202
|
+
|
|
203
|
+
### trace.py — Decision Trace
|
|
204
|
+
|
|
205
|
+
Already built. Deterministic recorder. Zero LLM calls.
|
|
206
|
+
|
|
207
|
+
Every Gemini call wraps in `trace.span()`. Every schema includes `reasoning`, `confidence`, `evidence_sources` — these flow directly from model output to trace entries. Every browser action, traffic capture, and reflection assessment is logged.
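
A span-style recorder that appends one JSON line per call could look like this. It is a sketch only: the real `trace.span()` signature is not shown in this document, so the `Trace` class, its sink argument, and the entry fields other than `reasoning`, `confidence`, and `evidence_sources` are assumptions.

```python
import contextlib
import io
import json
import time


class Trace:
    """Deterministic recorder: writes one JSON object per line to a sink."""

    def __init__(self, sink) -> None:
        self._sink = sink

    @contextlib.contextmanager
    def span(self, name: str):
        entry = {"span": name, "started_at": time.time()}
        try:
            yield entry  # the caller copies schema fields into the entry
        finally:
            entry["duration_s"] = time.time() - entry["started_at"]
            self._sink.write(json.dumps(entry) + "\n")


sink = io.StringIO()  # stands in for an open trace.jsonl file
trace = Trace(sink)
with trace.span("cu_action") as entry:
    entry.update(
        reasoning="ADD button sits next to the requested item",
        confidence=0.8,
        evidence_sources=["axtree"],
    )
record = json.loads(sink.getvalue())
```

Writing in the `finally` clause means a span is still logged when the wrapped call raises.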
|
|
208
|
+
|
|
209
|
+
Output: `./results/{datetime}/trace.jsonl`. Eval harness controls path for benchmark runs.
|
|
210
|
+
|
|
211
|
+
---
|
|
212
|
+
|
|
213
|
+
### run_webarena_evals.py — Eval Harness
|
|
214
|
+
|
|
215
|
+
Deterministic scoring. Zero LLM calls. Wraps MorphNet for WebArena Verified benchmarks.
|
|
216
|
+
|
|
217
|
+
---
|
|
218
|
+
|
|
219
|
+
## Model Assignments
|
|
220
|
+
|
|
221
|
+
| Role | Model | Thinking | Max Tokens |
|
|
222
|
+
|---|---|---|---|
|
|
223
|
+
| Orchestrator planning | `gemini-3.1-pro-preview` | Enabled, budget 4096 | 8192 |
|
|
224
|
+
| CU action generation | `gemini-3-flash-preview` | Enabled, budget 4096 | 8192 |
|
|
225
|
+
| Per-action reflection (Stage 3) | `gemini-3-flash-preview` | Enabled, budget 4096 | 8192 |
|
|
226
|
+
| Per-subtask reflection | `gemini-3.1-pro-preview` | Enabled, budget 4096 | 8192 |
|
|
227
|
+
| MCP parameter generation | `gemini-3-flash-preview` | Enabled, budget 4096 | 8192 |
|
|
228
|
+
| MCP response-vs-intent check | `gemini-3-flash-preview` | Enabled, budget 4096 | 8192 |
|
|
229
|
+
|
|
230
|
+
**Flash Lite is not used anywhere.** Every call sits on a critical path.
|
|
231
|
+
|
|
232
|
+
---
|
|
233
|
+
|
|
234
|
+
## Development Principles
|
|
235
|
+
|
|
236
|
+
1. **Gemini structured output schemas are typed function contracts.** Maximally descriptive field names, types, enums, descriptions. Every schema includes `reasoning`, `confidence`, `evidence_sources`. These flow directly to trace entries.
|
|
237
|
+
|
|
238
|
+
2. **Never string match on unstructured natural language.** Parsing structured material (HTML, JSON) is fine. Never regex/substring on model outputs.
|
|
239
|
+
|
|
240
|
+
3. **Centralized representation pipeline in `representation.py`.** session_manager serves raw data. `representation.py` owns all AXTree-to-text transformations — CU gets section-based inline elements with context tracking, orchestrator gets text-only distillation. Additional views: Set-of-Marks annotation (CU on failure), task-adaptive DOM distillation (MCP), adaptive evidence selection (reflector).
|
|
241
|
+
|
|
242
|
+
4. **Justify every field in every data structure.** What consumes it? What breaks if removed?
|
|
243
|
+
|
|
244
|
+
5. **No unnecessary files.** Keep the codebase consolidated: whatever a module uses stays close by.
|
|
245
|
+
|
|
246
|
+
6. **Comments explain why, not what.** Related logic stays together.
|
|
247
|
+
|
|
248
|
+
7. **Prompts live in ./prompts/ as .txt files.** Not hardcoded.
|
|
249
|
+
|
|
250
|
+
8. **Every decision is traced.** Gemini calls wrap in `trace.span()`. Schema fields → JSONL.
|
|
251
|
+
|
|
252
|
+
9. **On-demand extraction, not bundled.** session_manager never bundles all extractions into one call. Each consumer calls exactly what it needs. This prevents 26+ second bottlenecks on complex pages.
|
|
253
|
+
|
|
254
|
+
10. **Auto site profiling.** `site_name` is derived from the URL hostname automatically. Site directories and configs are created on first access, not manually.
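
Deriving `site_name` from the hostname could be as simple as the sketch below, which reproduces directory names such as `swiggy_com` and `irctc_co_in`. The function name is hypothetical, and stripping a leading `www.` is an assumption rather than documented behaviour.

```python
from urllib.parse import urlparse


def derive_site_name(url: str) -> str:
    """Turn a URL's hostname into a directory-safe site name."""
    host = (urlparse(url).hostname or "").lower()
    host = host.removeprefix("www.")  # assumption: "www." is not part of the name
    return host.replace(".", "_")


swiggy = derive_site_name("https://www.swiggy.com/restaurants")
irctc = derive_site_name("https://www.irctc.co.in/nget/train-search")
```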
|
|
255
|
+
|
|
256
|
+
---
|
|
257
|
+
|
|
258
|
+
## Architectural Rules
|
|
259
|
+
|
|
260
|
+
1. **session_manager owns the browser.** No other module creates contexts, pages, or HTTP clients.
|
|
261
|
+
2. **session_manager serves raw data.** Each consumer builds its own view.
|
|
262
|
+
3. **Chrome via CDP + curl_cffi.** Real fingerprint for browsing and API replay.
|
|
263
|
+
4. **Orchestrator is benchmark-agnostic.** Eval logic in run_webarena_evals.py only.
|
|
264
|
+
5. **CU is stateless per subtask.** Orchestrator manages memory via planning tree.
|
|
265
|
+
6. **MCP lifecycle: verified → trusted → degraded → discarded.** Orchestrator checks status before routing.
|
|
266
|
+
7. **Reflector uses three stages.** Deterministic first, AXTree diff second, LLM third. Most actions need no LLM.
|
|
267
|
+
8. **All LLM outputs use structured schemas.** No free-form parsing.
|
|
268
|
+
9. **Website state in ./sites/.** Tools, profiles, credentials per-website.
|
|
269
|
+
10. **AgentOccam principles throughout.** Align to LLM pretraining. Simplify action/observation spaces. Branch/prune for context.
|
|
270
|
+
11. **Python 3.12.** Modern features throughout.
|
|
271
|
+
12. **Hierarchical element filtering.** When pages have >200 interactive elements, structural/navigational elements are preserved and the rest are sampled with section summaries for collapsed groups.
|
|
272
|
+
13. **Every decision traced via schema fields.** Gemini schemas include reasoning + evidence_sources + confidence. `./results/` stores all trace output, organized by datetime.
|