@kilospark/webact 2.5.0

package/SKILL.md ADDED
@@ -0,0 +1,209 @@
+ ---
+ name: webact
+ description: Use when the user asks to interact with a website, browse the web, check a site, send a message, read content from a web page, or accomplish any goal that requires controlling a browser
+ ---
+
+ # WebAct Browser Control
+
+ Control Chrome directly via the Chrome DevTools Protocol. No Playwright, no MCP - raw CDP through a CLI helper.
+
+ ## How to Run Commands
+
+ All commands use `webact.js` from this skill's base directory. The base directory is provided when the skill loads - use it as the path prefix.
+
+ ### Session Setup (once)
+
+ ```bash
+ node <base-dir>/webact.js launch
+ ```
+
+ This launches Chrome (or connects to an existing instance) and creates a session. All subsequent commands auto-discover the session - no session ID needed.
+
+ ### Running Commands
+
+ Use direct CLI commands. Each is a single bash call:
+
+ ```bash
+ node <base-dir>/webact.js navigate https://example.com
+ node <base-dir>/webact.js click button.submit
+ node <base-dir>/webact.js keyboard "hello world"
+ node <base-dir>/webact.js press Enter
+ node <base-dir>/webact.js dom
+ ```
+
+ **Auto-brief:** State-changing commands (navigate, click, hover, press Enter/Tab, scroll, select, waitfor) auto-print a compact page summary showing URL, title, inputs, buttons, links, and total element counts. You usually don't need a separate `dom` call. Use `dom` only when you need the full page structure, `axtree -i` for a quick list of all interactive elements, or `axtree` for the full semantic tree.
+
+ ### Command Reference
+
+ | Command | Example |
+ |---------|---------|
+ | `navigate <url>` | `node webact.js navigate https://example.com` |
+ | `back` | `node webact.js back` |
+ | `forward` | `node webact.js forward` |
+ | `reload` | `node webact.js reload` |
+ | `dom [selector] [--full]` | `node webact.js dom` or `node webact.js dom .results` |
+ | `axtree [selector] [-i]` | `node webact.js axtree` or `node webact.js axtree -i` |
+ | `observe` | `node webact.js observe` |
+ | `screenshot` | `node webact.js screenshot` |
+ | `pdf [path]` | `node webact.js pdf` or `node webact.js pdf /tmp/page.pdf` |
+ | `click <selector>` | `node webact.js click button.submit` |
+ | `doubleclick <selector>` | `node webact.js doubleclick td.cell` |
+ | `rightclick <selector>` | `node webact.js rightclick .context-target` |
+ | `hover <selector>` | `node webact.js hover .menu-trigger` |
+ | `focus <selector>` | `node webact.js focus 'input[name=q]'` |
+ | `clear <selector>` | `node webact.js clear 'input[name=q]'` |
+ | `type <selector> <text>` | `node webact.js type 'input[name=q]' search query` |
+ | `keyboard <text>` | `node webact.js keyboard hello world` |
+ | `select <selector> <value>` | `node webact.js select select#country US` |
+ | `upload <selector> <file>` | `node webact.js upload 'input[type=file]' /tmp/photo.png` |
+ | `drag <from> <to>` | `node webact.js drag .card .dropzone` |
+ | `dialog <accept\|dismiss> [text]` | `node webact.js dialog accept` |
+ | `waitfor <selector> [ms]` | `node webact.js waitfor .dropdown 5000` |
+ | `waitfornav [ms]` | `node webact.js waitfornav` |
+ | `press <key\|combo>` | `node webact.js press Enter` or `node webact.js press Ctrl+A` |
+ | `scroll <target> [px]` | `node webact.js scroll down 500` or `node webact.js scroll top` |
+ | `eval <js>` | `node webact.js eval document.title` |
+ | `cookies [get\|set\|clear\|delete]` | `node webact.js cookies` or `node webact.js cookies set name val` |
+ | `console [show\|errors\|listen]` | `node webact.js console` or `node webact.js console errors` |
+ | `block <pattern>` | `node webact.js block images css` or `node webact.js block off` |
+ | `viewport <w> <h>` | `node webact.js viewport mobile` or `node webact.js viewport 1024 768` |
+ | `frames` | `node webact.js frames` |
+ | `frame <id\|selector>` | `node webact.js frame main` or `node webact.js frame iframe#embed` |
+ | `download [path\|list]` | `node webact.js download path /tmp/dl` or `node webact.js download list` |
+ | `tabs` | `node webact.js tabs` |
+ | `tab <id>` | `node webact.js tab ABC123` |
+ | `newtab [url]` | `node webact.js newtab https://example.com` |
+ | `close` | `node webact.js close` |
+ | `activate` | `node webact.js activate` |
+ | `minimize` | `node webact.js minimize` |
+
+ **`type` vs `keyboard`:** Use `type` to focus a specific input and fill it. Use `keyboard` to type at the current caret position - essential for rich text editors (Slack, Google Docs, Notion) where `type`'s focus call resets the cursor.
+
+ **`click` behavior:** Waits up to 5s for the element, scrolls it into view, then clicks. No manual waits needed for dynamic elements.
+
+ **`dialog` behavior:** Sets a one-shot auto-handler. Run BEFORE the action that triggers the dialog.
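The ordering requirement follows from the one-shot design: the response has to be armed before the dialog opens, and the first dialog that appears consumes it. A minimal sketch of that behavior, assuming a hypothetical `createDialogArmer` helper (this is not code from `webact.js`; at the protocol level, CDP reports dialogs through `Page.javascriptDialogOpening` and answers them with `Page.handleJavaScriptDialog`):

```javascript
// Hypothetical model of a one-shot dialog handler; not webact's actual code.
function createDialogArmer() {
  let pending = null; // at most one armed response at a time

  return {
    // Equivalent in spirit to `dialog accept [text]` or `dialog dismiss`.
    arm(action, promptText = "") {
      pending = { accept: action === "accept", promptText };
    },
    // Call when a dialog opens: returns the armed response once, then null.
    onDialogOpening() {
      const response = pending;
      pending = null;
      return response;
    },
  };
}

const armer = createDialogArmer();
armer.arm("accept", "yes");
console.log(armer.onDialogOpening()); // the armed response
console.log(armer.onDialogOpening()); // null: the handler was already used
```

If the triggering click ran first, the dialog would open with nothing armed, which is why the `dialog` call comes before the action.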
+
+ **`axtree` vs `dom`:** The accessibility tree shows semantic roles (button, link, heading, textbox) and accessible names - better for understanding page structure. Use `dom` when you need HTML structure/selectors; use `axtree` when you need to understand what's on the page.
+
+ **`axtree -i` (interactive mode):** Shows only actionable elements (buttons, links, inputs, etc.) as a flat numbered list. Most token-efficient way to see what you can interact with on a page - typically ~500 tokens vs ~4000 for full `dom`. After running `axtree -i`, use the ref numbers directly as selectors: `click 1`, `type 3 hello`. Refs are cached per URL and reused on revisits.
+
+ **`observe`:** Like `axtree -i` but formats each element as a ready-to-use command (e.g. `click 1`, `type 3 <text>`, `select 5 <value>`). Generates the ref map as a side effect.
+
+ **Ref-based targeting:** After `axtree -i` or `observe`, numeric refs work in all selector-accepting commands: `click`, `type`, `select`, `hover`, `focus`, `clear`, `doubleclick`, `rightclick`, `upload`, `drag`, `waitfor`, `dom`.
+
+ **`press` combos:** Supports modifier keys: `Ctrl+A` (select all), `Ctrl+C` (copy), `Meta+V` (paste on Mac), `Shift+Enter`, etc. Modifiers: Ctrl, Alt, Shift, Meta/Cmd.
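To make the combo syntax concrete: a string such as `Ctrl+Shift+A` has to become a key plus a modifier bit field before it can be sent over CDP. The sketch below is illustrative only and is not taken from `webact.js`; `parseCombo` and `MODIFIER_BITS` are hypothetical names. The bit values (Alt=1, Ctrl=2, Meta=4, Shift=8) are the ones CDP documents for the `modifiers` parameter of `Input.dispatchKeyEvent`.

```javascript
// Hypothetical helper, not part of webact: split a combo such as "Ctrl+Shift+A"
// into a key and the modifier bit field used by CDP's Input.dispatchKeyEvent.
const MODIFIER_BITS = new Map([
  ["alt", 1],
  ["ctrl", 2],
  ["meta", 4],
  ["cmd", 4], // treated as an alias for Meta
  ["shift", 8],
]);

function parseCombo(combo) {
  const parts = combo.split("+");
  const key = parts.pop(); // the last segment is the key itself
  let modifiers = 0;
  for (const part of parts) {
    const bit = MODIFIER_BITS.get(part.toLowerCase());
    if (bit === undefined) throw new Error(`Unknown modifier: ${part}`);
    modifiers |= bit;
  }
  return { key, modifiers };
}

console.log(parseCombo("Enter"));       // { key: 'Enter', modifiers: 0 }
console.log(parseCombo("Ctrl+A"));      // { key: 'A', modifiers: 2 }
console.log(parseCombo("Shift+Enter")); // { key: 'Enter', modifiers: 8 }
```

This sketch does not handle a literal `+` key; a real parser would need a rule for that case.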
+
+ **`scroll` targets:** `up`/`down` (default 400px, or specify pixels), `top`/`bottom`, or a CSS selector to scroll an element into view.
+
+ **`block` patterns:** Block resource types (`images`, `css`, `fonts`, `media`, `scripts`) or URL substrings. Speeds up page loads. Use `block off` to disable.
+
+ **`viewport` presets:** `mobile` (375x667), `iphone` (390x844), `ipad` (820x1180), `tablet` (768x1024), `desktop` (1280x800). Or specify exact width and height.
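The two argument forms (`viewport mobile` and `viewport 1024 768`) can be resolved with a small lookup. This is a sketch under stated assumptions: the preset sizes are copied from the paragraph above, while `resolveViewport` and `VIEWPORT_PRESETS` are hypothetical names and not webact's implementation.

```javascript
// Preset sizes are copied from the list above; the resolver is a hypothetical
// sketch, not webact's implementation.
const VIEWPORT_PRESETS = new Map([
  ["mobile", [375, 667]],
  ["iphone", [390, 844]],
  ["ipad", [820, 1180]],
  ["tablet", [768, 1024]],
  ["desktop", [1280, 800]],
]);

// Accepts the arguments after `viewport`, e.g. ["mobile"] or ["1024", "768"].
function resolveViewport(args) {
  if (args.length === 1 && VIEWPORT_PRESETS.has(args[0])) {
    const [width, height] = VIEWPORT_PRESETS.get(args[0]);
    return { width, height };
  }
  const [width, height] = args.map(Number);
  const valid =
    args.length === 2 && [width, height].every((n) => Number.isInteger(n) && n > 0);
  if (!valid) throw new Error(`Expected a preset name or <w> <h>, got: ${args.join(" ")}`);
  return { width, height };
}

console.log(resolveViewport(["iphone"]));      // { width: 390, height: 844 }
console.log(resolveViewport(["1024", "768"])); // { width: 1024, height: 768 }
```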
+
+ **`frames`:** Lists all frames/iframes on the page. Use `frame <id>` to switch context, `frame main` to return to the top frame.
+
+ ### Tab Isolation
+
+ Each session creates and owns its own tabs. Sessions never reuse tabs from other sessions or pre-existing tabs.
+
+ - `launch`/`connect` creates a **new blank tab** for the session
+ - `newtab` opens an additional tab within the session
+ - `tabs` only lists tabs owned by the current session
+ - `tab <id>` only switches to session-owned tabs
+ - `close` removes the tab from the session
+
+ This means two agents can work side by side in the same Chrome instance without interfering with each other.
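The ownership rules in the list above can be modeled with one set of tab IDs per session. The class below is a hypothetical illustration of those rules only; `SessionTabs` and the tab IDs are invented for the example, and how webact actually records ownership is not shown in this file.

```javascript
// Hypothetical model of per-session tab ownership; not webact's actual code.
class SessionTabs {
  constructor() {
    this.owned = new Set(); // tab IDs created by this session
    this.active = null;
  }
  // `launch`, `connect`, and `newtab` all create a tab the session then owns.
  open(tabId) {
    this.owned.add(tabId);
    this.active = tabId;
  }
  // `tabs` lists only session-owned tabs.
  list() {
    return [...this.owned];
  }
  // `tab <id>` refuses tabs that belong to someone else.
  switchTo(tabId) {
    if (!this.owned.has(tabId)) throw new Error(`Tab ${tabId} is not owned by this session`);
    this.active = tabId;
  }
  // `close` removes the tab from the session.
  close(tabId = this.active) {
    this.owned.delete(tabId);
    if (this.active === tabId) this.active = null;
  }
}

const agentA = new SessionTabs();
const agentB = new SessionTabs();
agentA.open("A1");
agentB.open("B1");
console.log(agentA.list()); // [ 'A1' ]
```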
+
+ ## The Perceive-Act Loop
+
+ When given a goal, follow this loop:
+
+ 1. **PLAN** - Break the goal into steps. Chain predictable sequences (click → type → press Enter) into a single command array.
+
+ 2. **ACT** - Write command JSON (or array), run `node <base-dir>/webact.js run <sessionId>`. Actions auto-print a page brief.
+
+ 3. **DECIDE** - Read the brief. Expected state? Continue. Login wall / CAPTCHA? Tell user. Need more detail? Use `dom`. Goal complete? Report.
+
+ 4. **REPEAT** until done or blocked.
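The four steps above can be sketched as a small driver. Everything here is an assumption made for illustration: `runLoop` is a hypothetical function, `act` stands in for running one webact command and capturing the printed brief, `decide` stands in for the agent's own judgment, and the brief strings are invented.

```javascript
// Hypothetical driver for the loop above; not part of webact.
// `act(step)` runs one command and returns the printed page brief.
// `decide(brief)` returns "continue", "blocked", or "done".
function runLoop(steps, act, decide) {
  const history = [];
  for (const step of steps) {
    const brief = act(step);       // ACT
    const verdict = decide(brief); // DECIDE
    history.push({ step, verdict });
    if (verdict !== "continue") return { status: verdict, history };
  }
  return { status: "done", history }; // every planned step ran without a blocker
}

// Stubbed example: the second step lands on a login wall.
const fakeBriefs = {
  "navigate https://example.com": "Example Domain",
  "click a.account": "Sign in to continue",
};
const result = runLoop(
  Object.keys(fakeBriefs), // PLAN
  (step) => fakeBriefs[step],
  (brief) => (brief.includes("Sign in") ? "blocked" : "continue")
);
console.log(result.status); // blocked
```

Stopping on the first non-`continue` verdict mirrors the "Stop when blocked" rule: the loop hands control back instead of running the remaining planned steps.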
+
+ ## Rules
+
+ <HARD-RULES>
+
+ 1. **Read the brief after acting.** State-changing commands auto-print a page brief. Read it before deciding your next step. Use `dom` only when the brief isn't enough (e.g., you need to find a specific element's selector in a complex page).
+
+ 2. **DOM before screenshot.** Always try `dom` first. Only use `screenshot` if DOM output is empty/insufficient (canvas apps, image-heavy layouts).
+
+ 3. **Report actual content.** When the goal is information retrieval, extract and present the actual text from the page. Do not summarize what you think is there - show what IS there.
+
+ 4. **Stop when blocked.** If you encounter a login wall, CAPTCHA, 2FA prompt, or cookie consent that blocks progress, first run `activate` to bring the browser window to the front so the user can see it, then tell the user. Do not guess credentials or attempt to bypass security. Once the blocker is resolved and you resume automation, run `minimize` before your next action so the browser doesn't steal focus from the user. Minimizing does not affect page focus - the active element and caret position are preserved.
+
+ 5. **Wait for dynamic content.** After clicks that trigger page loads, use `waitfornav` or `waitfor <selector>` before reading DOM.
+
+ 6. **Use CSS selectors for targeting.** When you need to click or type into a specific element, identify it from the DOM output using CSS selectors (id, class, aria-label, data-testid, or structural selectors).
+
+ 7. **Clean up tabs.** When you open a tab with `newtab` for a subtask, `close` it when you're done and switch back to your previous tab. Before reporting a task as complete, run `tabs` to check for any tabs you forgot to close. Don't leave orphaned tabs behind.
+
+ </HARD-RULES>
+
+ ## Getting Started
+
+ ```bash
+ # Launch Chrome and get a session ID
+ node <base-dir>/webact.js launch
+ # Output: Session: a1b2c3d4
+ # Command file: /tmp/webact-command-a1b2c3d4.json (path varies by OS)
+ ```
+
+ If Chrome is not running, `launch` starts a new instance automatically and minimizes it (macOS). All subsequent commands auto-discover the session. Use `activate` to bring the browser window to the front when needed.
+
+ ## Token Efficiency
+
+ The `dom` command returns a compact representation:
+ - Scripts, styles, SVGs, hidden elements are stripped
+ - Only interactive and structural tags are shown with their attributes
+ - Whitespace is collapsed
+ - Output is truncated to ~4000 chars by default
+
+ Use `dom <selector>` to scope to a specific part of the page when you know where to look. This saves significant tokens on large pages.
+
+ Use `--full` only when you need the complete DOM (rare).
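Two of the compaction steps listed in this section, whitespace collapsing and truncation at roughly 4000 characters, are easy to picture in code. The function below is a hypothetical sketch of that idea only; `compact` and the `[truncated]` marker are invented here, and webact's real `dom` output also strips and filters elements, which this does not attempt.

```javascript
// Hypothetical post-processing in the spirit of this section; webact's real
// `dom` output is built differently and may cut at another boundary.
function compact(text, limit = 4000) {
  const collapsed = text.replace(/\s+/g, " ").trim(); // collapse whitespace runs
  if (collapsed.length <= limit) return collapsed;
  return collapsed.slice(0, limit) + " [truncated]";
}

console.log(compact("<ul>\n   <li>One</li>\n   <li>Two</li>\n</ul>"));
// <ul> <li>One</li> <li>Two</li> </ul>
```

Because the cut is by character count, scoping with `dom <selector>` is what keeps the relevant part of a large page inside the limit.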
+
+ ## Finding Elements
+
+ Read the DOM output and identify elements by:
+ 1. **id**: `#search-input` - most reliable
+ 2. **data-testid**: `[data-testid="submit-btn"]`
+ 3. **aria-label**: `[aria-label="Search"]`
+ 4. **class**: `.nav-link`
+ 5. **structural**: `form input[type="email"]`
+ 6. **text-based** (via eval): use eval with `document.querySelector('button').textContent`
+
+ If a CSS selector doesn't work, use `eval` to find elements by text content:
+ ```bash
+ node webact.js eval "[...document.querySelectorAll('a')].find(a => a.textContent.includes('Sign in'))?.getAttribute('href')"
+ ```
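The `eval` example is double-quoted because the snippet contains characters the shell would otherwise interpret (`[`, `?`, `>`, single quotes). The same concern applies to attribute selectors such as `input[name=q]`: bash passes an unmatched glob through unchanged, but zsh by default stops with a "no matches found" error. A minimal sketch of building safely quoted command lines, assuming hypothetical helpers `shellQuote` and `buildCommand` (neither is part of webact) and POSIX single-quote rules:

```javascript
// Hypothetical helpers, not part of webact.
// POSIX single-quote escaping: wrap in '...' and rewrite each embedded ' as '\''.
function shellQuote(arg) {
  if (/^[A-Za-z0-9_\/:.,@%+-]+$/.test(arg)) return arg; // nothing to escape
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}

// Joins a webact invocation into one shell-safe command line.
function buildCommand(baseDir, command, ...args) {
  return ["node", `${baseDir}/webact.js`, command, ...args].map(shellQuote).join(" ");
}

console.log(buildCommand("/opt/webact", "click", "input[name=q]"));
// node /opt/webact/webact.js click 'input[name=q]'
```

The `/opt/webact` path is a placeholder for the skill's base directory. Quoting every argument that contains anything outside a conservative safe set is deliberately cautious: over-quoting is harmless, while under-quoting can change what the CLI receives.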
+
+ ## Common Patterns
+
+ All examples assume you've already run `node webact.js launch`.
+
+ **Navigate and read** (navigate auto-prints brief - no separate dom needed):
+ ```bash
+ node webact.js navigate https://news.ycombinator.com
+ ```
+
+ **Fill a form:**
+ ```bash
+ node webact.js click 'input[name=q]'
+ node webact.js type 'input[name=q]' search query
+ node webact.js press Enter
+ ```
+
+ **Rich text editors and @mentions:**
+ ```bash
+ node webact.js click .ql-editor
+ node webact.js keyboard Hello @alice
+ node webact.js waitfor "[data-qa='tab_complete_ui_item']" 5000
+ node webact.js click "[data-qa='tab_complete_ui_item']"
+ node webact.js keyboard " check this out"
+ ```
@@ -0,0 +1,7 @@
+ interface:
+ display_name: WebAct
+ short_description: Control any Chromium browser via DevTools Protocol
+ default_prompt: Use the browser to accomplish the given goal.
+
+ policy:
+ allow_implicit_invocation: true
package/package.json ADDED
@@ -0,0 +1,41 @@
+ {
+ "name": "@kilospark/webact",
+ "version": "2.5.0",
+ "description": "CLI for browser automation via Chrome DevTools Protocol",
+ "main": "webact.js",
+ "bin": {
+ "webact": "./webact.js"
+ },
+ "files": [
+ "webact.js",
+ "SKILL.md",
+ "agents/"
+ ],
+ "scripts": {
+ "build": "esbuild webact.src.js --bundle --platform=node --target=node18 --format=cjs --banner:js='#!/usr/bin/env node' --external:bufferutil --external:utf-8-validate --outfile=webact.js",
+ "test": "echo \"Error: no test specified\" && exit 1"
+ },
+ "keywords": [
+ "browser",
+ "automation",
+ "chrome",
+ "cdp",
+ "cli",
+ "agents"
+ ],
+ "author": "",
+ "license": "ISC",
+ "type": "commonjs",
+ "repository": {
+ "type": "git",
+ "url": "https://github.com/kxbnb/webact.git",
+ "directory": "skills/webact"
+ },
+ "engines": {
+ "node": ">=18.0.0"
+ },
+ "devDependencies": {
+ "esbuild": "^0.24.0",
+ "ws": "^8.19.0"
+ }
+ }