athena-browser-mcp 2.0.5 → 2.1.0

Files changed (2)
  1. package/README.md +80 -231
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -1,293 +1,142 @@
1
1
  # Athena Browser MCP
2
2
 
3
- [![CI](https://github.com/lespaceman/athena-browser-mcp/actions/workflows/ci.yml/badge.svg)](https://github.com/lespaceman/athena-browser-mcp/actions/workflows/ci.yml)
4
- [![npm version](https://badge.fury.io/js/athena-browser-mcp.svg)](https://www.npmjs.com/package/athena-browser-mcp)
5
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
3
+ An MCP server for browser automation that exposes semantic, token-efficient page representations optimized for LLM agents.
6
4
 
7
- MCP server for AI browser automation - 18 tools with semantic element targeting.
5
+ ---
8
6
 
9
- ## Why Athena?
7
+ ## Motivation
10
8
 
11
- LLM agents face two hard constraints: limited context windows and expensive tokens. Yet browser automation requires understanding complex, ever-changing page state. Traditional tools dump raw accessibility trees or screenshots, wasting precious context and creating needle-in-haystack problems where agents struggle to locate relevant elements.
9
+ LLM-based agents operate under strict context window and token constraints.
10
+ However, most browser automation tools expose entire DOMs or full accessibility trees to the model.
12
11
 
13
- Athena solves this with **semantic page snapshots** - compact, structured representations designed for LLM consumption:
12
+ This leads to:
14
13
 
15
- - **Token-efficient** - Hierarchical layers and regions eliminate noise, fitting more page understanding into less context
16
- - **High recall** - Structured XML with semantic element IDs lets agents find elements without scanning entire DOM trees
17
- - **Intuitive querying** - `find_elements` with semantic filters (kind, label, region) so agents ask for what they need
18
- - **Stable references** - Semantic `eid`s survive DOM mutations, eliminating stale element errors
14
+ - Rapid token exhaustion
15
+ - Higher inference costs
16
+ - Reduced reliability as relevant signal is buried in noise
19
17
 
20
- The result: fewer tokens, faster task completion, and higher-quality outputs with fewer errors.
18
+ In practice, agents spend more effort _finding_ the right information than reasoning about it.
21
19
 
22
- ## Benchmark
20
+ Athena exists to change the unit of information exposed to the model.
23
21
 
24
- Comparison between Athena Browser MCP and Playwright MCP on real-world browser automation tasks. Tests run in Claude Code with Claude Opus 4.5.
22
+ ---
25
23
 
26
- | # | Task | Agent | Result | Tokens Used | Time Taken |
27
- | --- | ------------------------------------------------------------------------------------- | -------------- | ---------- | ----------- | ---------- |
28
- | 1 | Login → Create wishlist "Summer Escapes" → Add beach property (Airbnb) | **Athena** | ✅ Success | 92,870 | 2m 08s |
29
- | | | **Playwright** | ✅ Success | 137,063 | 5m 23s |
30
- | 2 | Bangkok Experiences → Food tour → Extract itinerary & pricing (Airbnb) | **Athena** | ✅ Success | 87,194 | 3m 27s |
31
- | | | **Playwright** | ✅ Success | 94,942 | 3m 38s |
32
- | 3 | Miami → Beachfront stays under $300 → Top 3 names + prices (Airbnb) | **Athena** | ✅ Success | 124,597 | 5m 38s |
33
- | | | **Playwright** | ✅ Success | 122,077 | 4m 51s |
34
- | 4 | Paris → "Play" section → Top 5 titles + descriptions (Airbnb) | **Athena** | ❌ Failed | 146,575 | 4m 15s |
35
- | | | **Playwright** | ❌ Failed | 189,495 | 7m 37s |
36
- | 5 | Navigate Apple → find iPhone → configure iPhone 17 → add 256GB Black → confirm in bag | **Athena** | ✅ Success | 65,629 | 3m 30s |
37
- | | | **Playwright** | ✅ Success | 102,754 | 6m 59s |
24
+ ## Core Idea: Semantic Page Snapshots
38
25
 
39
- **Total Results:**
26
+ Instead of exposing raw DOM structures or full accessibility trees, Athena produces **semantic page snapshots**.
40
27
 
41
- - **Tokens**: Athena used **125,341 fewer tokens** (~19.4% more efficient)
42
- - **Time**: Athena completed tasks **9m 30s faster** (~33.4% faster)
28
+ These snapshots are:
43
29
 
44
- _Benchmark on a larger dataset coming soon._
30
+ - Compact and structured
31
+ - Focused on user-visible intent
32
+ - Designed for LLM recall and reasoning, not DOM completeness
33
+ - Stable across layout shifts and DOM churn
45
34
 
46
- ## Architecture
35
+ The goal is not to mirror the browser, but to present the page in a form that aligns with how language models reason about interfaces.
47
36
 
48
- ```
49
- ┌─────────────────────────────────────────────────────────────────┐
50
- │ AI Agent │
51
- │ ┌────────────────────────────────────────────────────────────┐ │
52
- │ │ System Prompt: XML state (layers, actionables, atoms) │ │
53
- │ └────────────────────────────────────────────────────────────┘ │
54
- └───────────────────────────┬─────────────────────────────────────┘
55
- │ MCP Protocol (stdio)
56
- ┌───────────────────────────▼─────────────────────────────────────┐
57
- │ SESSION: launch_browser, connect_browser, close_page, │
58
- │ close_session │
59
- │ NAVIGATION: navigate, go_back, go_forward, reload │
60
- │ OBSERVATION: capture_snapshot, find_elements, get_node_details │
61
- │ INTERACTION: click, type, press, select, hover, │
62
- │ scroll_element_into_view, scroll_page │
63
- └───────────────────────────┬─────────────────────────────────────┘
64
- │ Playwright + CDP
65
- ┌───────────────────────────▼─────────────────────────────────────┐
66
- │ Chromium Browser │
67
- └─────────────────────────────────────────────────────────────────┘
68
- ```
37
+ ---
69
38
 
70
- ## Tools
39
+ ## How It Works
71
40
 
72
- ### Session
41
+ At a high level:
73
42
 
74
- | Tool | Purpose | Input |
75
- | ----------------- | ------------------------- | ------------------- |
76
- | `launch_browser` | Launch new browser | `{ headless? }` |
77
- | `connect_browser` | Connect to existing (CDP) | `{ endpoint_url? }` |
78
- | `close_page` | Close specific page | `{ page_id }` |
79
- | `close_session` | Close entire browser | `{}` |
43
+ 1. The browser is controlled via Playwright and CDP
44
+ 2. The page is reduced into semantic regions and actionable elements
45
+ 3. A structured snapshot is generated and sent to the LLM
46
+ 4. Actions are resolved against stable semantic identifiers rather than fragile selectors
80
47
 
81
- ### Navigation
48
+ This separation keeps:
82
49
 
83
- | Tool | Purpose | Input |
84
- | ------------ | --------------- | ------------------- |
85
- | `navigate` | Go to URL | `{ url, page_id? }` |
86
- | `go_back` | Browser back | `{ page_id? }` |
87
- | `go_forward` | Browser forward | `{ page_id? }` |
88
- | `reload` | Refresh page | `{ page_id? }` |
50
+ - Browser lifecycle management isolated
51
+ - Snapshots deterministic and low-entropy
52
+ - Agent reasoning predictable and efficient
89
53
 
90
- ### Observation
54
+ ---
91
55
 
92
- | Tool | Purpose | Input |
93
- | ------------------ | ------------------- | ----------------------------------------------------------------- |
94
- | `capture_snapshot` | Capture page state | `{ page_id? }` |
95
- | `find_elements` | Find by criteria | `{ kind?, label?, region?, limit?, include_readable?, page_id? }` |
96
- | `get_node_details` | Get element details | `{ eid, page_id? }` |
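
The removed `find_elements` row lists optional `kind`, `label`, `region`, and `limit` filters. A minimal sketch of how such filtering could behave is shown below. The `SnapshotElement` and `FindQuery` types, the case-insensitive substring match on `label`, and the `region` values are assumptions made for illustration (the usage examples elsewhere in the removed README only suggest that `label: "email"` matches an element labelled "Email"); this is not the package's implementation.

```typescript
// Hypothetical in-memory shape for elements in a snapshot (not the package's real types).
interface SnapshotElement {
  eid: string;
  kind: string;
  label: string;
  region: string;
}

// Mirrors the optional filters named in the tool table; include_readable and page_id are omitted.
interface FindQuery {
  kind?: string;
  label?: string;
  region?: string;
  limit?: number;
}

// Assumption: kind and region match exactly, label matches as a case-insensitive substring.
function findElements(elements: SnapshotElement[], query: FindQuery): SnapshotElement[] {
  const needle = query.label === undefined ? undefined : query.label.toLowerCase();
  const matches = elements.filter((el) => {
    if (query.kind !== undefined && el.kind !== query.kind) return false;
    if (query.region !== undefined && el.region !== query.region) return false;
    if (needle !== undefined && el.label.toLowerCase().indexOf(needle) === -1) return false;
    return true;
  });
  return query.limit === undefined ? matches : matches.slice(0, query.limit);
}

// Sample data reusing the labels from the removed "Response Format" example; regions are made up.
const elements: SnapshotElement[] = [
  { eid: "a1b2c3", kind: "button", label: "Sign In", region: "header" },
  { eid: "d4e5f6", kind: "link", label: "Forgot password?", region: "main" },
  { eid: "g7h8i9", kind: "input", label: "Email", region: "main" },
];

const emailInputs = findElements(elements, { kind: "input", label: "email" });
console.log(emailInputs.map((el) => el.eid));
```

Filtering on the server side like this means the agent receives only the handful of matching elements rather than the whole snapshot.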
56
+ ## Benchmarks
97
57
 
98
- ### Interaction
58
+ Early benchmarks against Playwright MCP show:
99
59
 
100
- | Tool | Purpose | Input |
101
- | -------------------------- | ------------------ | ---------------------------------- |
102
- | `click` | Click element | `{ eid, page_id? }` |
103
- | `type` | Type text | `{ eid, text, clear?, page_id? }` |
104
- | `press` | Press keyboard key | `{ key, modifiers?, page_id? }` |
105
- | `select` | Select option | `{ eid, value, page_id? }` |
106
- | `hover` | Hover element | `{ eid, page_id? }` |
107
- | `scroll_element_into_view` | Scroll to element | `{ eid, page_id? }` |
108
- | `scroll_page` | Scroll viewport | `{ direction, amount?, page_id? }` |
60
+ - **~19% fewer tokens consumed**
61
+ - **~33% faster task completion**
62
+ - Same or better success rates on common navigation tasks
109
63
 
110
- ## Element IDs (eid)
64
+ Benchmarks were run using Claude Code on representative real-world tasks.
65
+ Results are task-dependent and should be treated as directional rather than absolute.
111
66
 
112
- Elements are identified by stable semantic IDs (`eid`) instead of transient DOM node IDs:
67
+ ---
113
68
 
114
- ```xml
115
- <match eid="a1b2c3d4e5f6" kind="button" label="Sign In" region="header" />
116
- ```
69
+ ## What Athena Is (and Is Not)
117
70
 
118
- EIDs are computed from:
119
-
120
- - Role/kind (button, link, input)
121
- - Accessible name (label text)
122
- - Landmark path (region + group hierarchy)
123
- - Position hint (screen zone, quadrant)
124
-
125
- This means the same logical element keeps its `eid` across page updates.
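
To illustrate why an identifier derived from these four inputs can stay stable, here is a minimal sketch. The `ElementFingerprint` type, the normalization rules, and the choice of a 32-bit FNV-1a hash are all assumptions for illustration; the diff does not show how the package actually computes an `eid`.

```typescript
// Hypothetical container for the four inputs listed above.
interface ElementFingerprint {
  kind: string;           // role/kind, e.g. "button"
  label: string;          // accessible name
  landmarkPath: string[]; // region + group hierarchy
  positionHint: string;   // coarse screen zone, e.g. "top-right"
}

// 32-bit FNV-1a over UTF-16 code units; an assumed stand-in for whatever hash the package uses.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  // Convert to unsigned and left-pad to 8 hex digits.
  return ("00000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Assumption: the id depends only on normalized semantic properties, never on DOM node ids.
function computeEid(fp: ElementFingerprint): string {
  const canonical = [
    fp.kind.toLowerCase(),
    fp.label.trim().replace(/\s+/g, " ").toLowerCase(),
    fp.landmarkPath.join("/"),
    fp.positionHint,
  ].join("|");
  return fnv1a(canonical);
}

const beforeUpdate = computeEid({
  kind: "button",
  label: "Sign In",
  landmarkPath: ["header"],
  positionHint: "top-right",
});

// Same logical element after a re-render: whitespace in the label changed, semantics did not.
const afterUpdate = computeEid({
  kind: "button",
  label: "  Sign   In ",
  landmarkPath: ["header"],
  positionHint: "top-right",
});

console.log(beforeUpdate === afterUpdate); // true
```

Because nothing in the input depends on a transient DOM node ID, re-rendering the same logical element yields the same identifier in this sketch. A production scheme would also need a way to disambiguate two elements with identical fingerprints.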
126
-
127
- ## Response Format
128
-
129
- Tools return XML state responses with page understanding:
130
-
131
- ```xml
132
- <state page_id="abc123" url="https://example.com" title="Example">
133
- <layer type="main" active="true">
134
- <actionables count="12">
135
- <el eid="a1b2c3" kind="button" label="Sign In" />
136
- <el eid="d4e5f6" kind="link" label="Forgot password?" />
137
- <el eid="g7h8i9" kind="input" label="Email" type="email" />
138
- </actionables>
139
- </layer>
140
- <atoms>
141
- <viewport w="1280" h="720" />
142
- <scroll x="0" y="0" />
143
- </atoms>
144
- </state>
145
- ```
71
+ ### Athena is:
146
72
 
147
- ### Layer Types
73
+ - A semantic interface between browsers and LLM agents
74
+ - An MCP server focused on reliability and efficiency
75
+ - Designed for agent workflows, not test automation
148
76
 
149
- | Layer | Description |
150
- | --------- | -------------------------- |
151
- | `main` | Primary page content |
152
- | `modal` | Dialog overlays |
153
- | `drawer` | Slide-in panels |
154
- | `popover` | Dropdowns, tooltips, menus |
77
+ ### Athena is not:
155
78
 
156
- ## Usage Examples
79
+ - A general-purpose browser
80
+ - A visual testing or screenshot framework
81
+ - A replacement for Playwright
157
82
 
158
- ### Login Flow
83
+ Playwright remains the execution layer; Athena focuses on representation and reasoning.
159
84
 
160
- ```
161
- 1. launch_browser { }
162
- → XML state with initial page
85
+ ---
163
86
 
164
- 2. navigate { url: "https://example.com/login" }
165
- → State shows login form elements
87
+ ## Usage
166
88
 
167
- 3. find_elements { kind: "input", label: "email" }
168
- → <match eid="abc123" kind="input" label="Email" />
89
+ Athena implements the **Model Context Protocol (MCP)** and works with:
169
90
 
170
- 4. click { eid: "abc123" }
171
- → Element focused
91
+ - Claude Code
92
+ - Claude Desktop
93
+ - Cursor
94
+ - VS Code
95
+ - Any MCP-compatible client
172
96
 
173
- 5. type { eid: "abc123", text: "user@example.com" }
174
- → Value filled
97
+ Example workflows include:
175
98
 
176
- 6. press { key: "Tab" }
177
- → Focus moved to password field
99
+ - Navigating complex web apps
100
+ - Handling login and consent flows
101
+ - Performing multi-step UI interactions with lower token usage
178
102
 
179
- 7. type { eid: "def456", text: "password123" }
180
- → Password filled
103
+ See the `examples/` directory for concrete agent workflows.
181
104
 
182
- 8. press { key: "Enter" }
183
- → Form submitted, navigation to dashboard
184
- ```
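
The removed login flow uses a shorthand of tool name plus arguments. Over MCP, each step travels as a JSON-RPC 2.0 `tools/call` request. The sketch below builds those requests for a few of the steps; the `toolCall` helper and the `ToolCallRequest` type are hypothetical names introduced here, and `abc123` is the placeholder `eid` from the example rather than a real identifier.

```typescript
// JSON-RPC 2.0 envelope that MCP uses for tool invocation.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Hypothetical helper: wraps a tool name and its arguments in a tools/call request.
function toolCall(name: string, args: Record<string, unknown> = {}): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name: name, arguments: args },
  };
}

// A subset of the login-flow steps, expressed as requests.
const loginFlow: ToolCallRequest[] = [
  toolCall("launch_browser"),
  toolCall("navigate", { url: "https://example.com/login" }),
  toolCall("find_elements", { kind: "input", label: "email" }),
  toolCall("type", { eid: "abc123", text: "user@example.com" }),
  toolCall("press", { key: "Enter" }),
];

loginFlow.forEach((request) => console.log(JSON.stringify(request)));
```

In practice an MCP client or SDK constructs this envelope, so an agent only supplies the tool name and arguments.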
185
-
186
- ### Cookie Consent (Multi-Frame)
187
-
188
- ```
189
- 1. navigate { url: "https://news-site.com" }
190
- → Modal layer detected (cookie consent iframe)
191
-
192
- 2. find_elements { label: "Accept", kind: "button" }
193
- → <match eid="xyz789" kind="button" label="Accept All" />
194
-
195
- 3. click { eid: "xyz789" }
196
- → Modal closed, main layer active
197
- ```
105
+ ---
198
106
 
199
107
  ## Installation
200
108
 
201
109
  ```bash
110
+ git clone https://github.com/lespaceman/athena-browser-mcp
111
+ cd athena-browser-mcp
202
112
  npm install
203
113
  npm run build
204
114
  ```
205
115
 
206
- ## Configuration
207
-
208
- ### Claude Desktop
116
+ Configure the MCP server in your client according to its MCP integration instructions.
209
117
 
210
- Add to your Claude Desktop config:
118
+ ---
211
119
 
212
- **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
213
- **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
214
- **Linux**: `~/.config/Claude/claude_desktop_config.json`
120
+ ## Architecture Overview
215
121
 
216
- ```json
217
- {
218
- "mcpServers": {
219
- "browser": {
220
- "command": "npx",
221
- "args": ["athena-browser-mcp@latest"]
222
- }
223
- }
224
- }
225
- ```
122
+ Athena separates concerns into three layers:
226
123
 
227
- ### Claude Code
228
-
229
- ```bash
230
- claude mcp add athena-browser-mcp npx athena-browser-mcp@latest
231
- ```
124
+ - **Browser lifecycle** — page creation, navigation, teardown
125
+ - **Semantic snapshot generation** — regions, elements, identifiers
126
+ - **Action resolution** — mapping agent intent to browser actions
232
127
 
233
- ### VS Code
128
+ This separation allows each layer to evolve independently while keeping agent-visible behavior stable.
234
129
 
235
- ```bash
236
- code --add-mcp '{"name":"athena-browser-mcp","command":"npx","args":["athena-browser-mcp@latest"]}'
237
- ```
130
+ ---
238
131
 
239
- ### Cursor
132
+ ## Status
240
133
 
241
- Go to **Cursor Settings → MCP → Add new MCP Server**. Use command type with:
134
+ Athena is under active development.
135
+ APIs and snapshot formats may evolve as real-world agent usage informs the design.
242
136
 
243
- ```
244
- npx athena-browser-mcp@latest
245
- ```
137
+ Feedback from practitioners building agent systems is especially welcome.
246
138
 
247
- ### Codex
248
-
249
- ```bash
250
- codex mcp add athena-browser-mcp npx athena-browser-mcp@latest
251
- ```
252
-
253
- ### Gemini CLI
254
-
255
- ```bash
256
- gemini mcp add -s user athena-browser-mcp -- npx athena-browser-mcp@latest
257
- ```
258
-
259
- ### Connect to Existing Browser
260
-
261
- To connect to an existing Chromium browser with CDP enabled:
262
-
263
- ```bash
264
- # Start Chrome with remote debugging
265
- google-chrome --remote-debugging-port=9222
266
-
267
- # Or use environment variables
268
- export CEF_BRIDGE_HOST=127.0.0.1
269
- export CEF_BRIDGE_PORT=9222
270
- ```
271
-
272
- Then use `connect_browser` instead of `launch_browser`.
273
-
274
- ### Environment Variables
275
-
276
- | Variable | Description | Default |
277
- | ----------------- | -------------------- | ----------- |
278
- | `CEF_BRIDGE_HOST` | CDP host for connect | `127.0.0.1` |
279
- | `CEF_BRIDGE_PORT` | CDP port for connect | `9223` |
280
-
281
- ## Development
282
-
283
- ```bash
284
- npm run build # Compile TypeScript
285
- npm run type-check # TypeScript type checking
286
- npm run lint # ESLint
287
- npm run format # Prettier format
288
- npm run check # Run all checks
289
- npm test # Run tests
290
- ```
139
+ ---
291
140
 
292
141
  ## License
293
142
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "athena-browser-mcp",
3
- "version": "2.0.5",
3
+ "version": "2.1.0",
4
4
  "description": "MCP server for controlling Athena browser",
5
5
  "type": "module",
6
6
  "main": "dist/src/index.js",