@akshayram1/omnibrowser-agent 0.2.29 → 0.2.32

package/docs/arch.md DELETED
@@ -1,220 +0,0 @@
1
- # omnibrowser-agent — Architecture
2
-
3
- > Local-first browser AI operator. Runs entirely in the browser — no API keys, no cloud costs, no data leaving your machine.
4
-
5
- ---
6
-
7
- ## Architecture Diagram
8
-
9
- ```mermaid
10
- flowchart TB
11
- subgraph DELIVERY["Delivery Layer"]
12
- EXT["🧩 Chrome Extension\npopup + background worker"]
13
- LIB["📦 npm Library\ncreateBrowserAgent()"]
14
- end
15
-
16
- subgraph ORCHESTRATION["Orchestration"]
17
- BG["background/index.ts\nSession & tick loop"]
18
- BA["BrowserAgent class\nrunLoop() / resume() / stop()"]
19
- end
20
-
21
- subgraph CORE["Core (src/core/)"]
22
- PL["planner.ts\nheuristicPlan() / webllm bridge\nplanNextAction()"]
23
- OB["observer.ts\ncollectSnapshot()\nDOM candidates + visibility filter"]
24
- EX["executor.ts\nexecuteAction()\nclick / type / navigate\nscroll / focus / wait"]
25
- end
26
-
27
- subgraph SHARED["Shared (src/shared/)"]
28
- CT["contracts.ts\nAgentAction · PageSnapshot\nAgentSession · PlannerResult"]
29
- SF["safety.ts\nassessRisk()\nsafe / review / blocked"]
30
- PA["parse-action.ts\nparseAction()\nparsePlannerResult()"]
31
- end
32
-
33
- subgraph OUTCOMES["Action Outcomes"]
34
- direction LR
35
- OK["✅ safe → execute"]
36
- RV["⚠️ review → needs approval"]
37
- BL["🚫 blocked → stop"]
38
- end
39
-
40
- subgraph PLANNERS["Planner Modes"]
41
- direction LR
42
- HP["Heuristic\nzero deps · offline\nregex patterns"]
43
- WL["WebLLM\non-device · WebGPU\nwindow.__browserAgentWebLLM"]
44
- end
45
-
46
- EXT --> BG
47
- LIB --> BA
48
- BG -. "chrome.tabs.sendMessage" .-> CORE
49
- BA --> CORE
50
-
51
- PL --> OB
52
- PL --> SHARED
53
- OB --> SHARED
54
- EX --> SHARED
55
-
56
- SF --> OUTCOMES
57
- PL --> PLANNERS
58
- ```
59
-
60
- ---
61
-
62
- ## Layer-by-layer explanation
63
-
64
- ### Delivery layer
65
-
66
- There are two ways to use omnibrowser-agent, and they share the same underlying engine.
67
-
68
- **Chrome extension** — Install by loading the `dist/` folder as an unpacked extension in Chrome. A popup UI lets you enter a goal, pick a mode, and click Start. The background service worker manages session state and orchestrates the tick loop across tabs.
69
-
70
- **npm library** — Embed agent logic directly into any web app. Import `createBrowserAgent()` from `@akshayram1/omnibrowser-agent`, pass a goal and config, and wire up event callbacks. No extension required.
71
-
72
- ---
73
-
74
- ### Orchestration
75
-
76
- **`background/index.ts`** (extension path) maintains a `Map<tabId, AgentSession>` and drives each session forward by sending `AGENT_TICK` messages to the active tab's content script. It handles `START_AGENT`, `APPROVE_ACTION`, `STOP_AGENT`, and `GET_STATUS` messages from the popup.
77
-
78
- **`BrowserAgent` class** (library path) runs the same tick loop in-process. It exposes `start()`, `resume()`, `stop()`, `isRunning`, and `hasPendingAction`, along with a full event callback API (`onStep`, `onApprovalRequired`, `onDone`, `onError`, `onMaxStepsReached`). Supports `AbortSignal` for external cancellation.
79
-
80
- ---
81
-
82
- ### Core (`src/core/`)
83
-
84
- These three modules are **shared** between the extension content script and the library. Neither delivery path duplicates them.
85
-
86
- | Module | Responsibility |
87
- |---|---|
88
- | `planner.ts` | Decides the next action given a goal, page snapshot, and history |
89
- | `observer.ts` | Reads the live DOM and returns a structured `PageSnapshot` |
90
- | `executor.ts` | Performs DOM actions and returns a result string |
91
-
92
- **`observer.ts` — `collectSnapshot()`**
93
- Queries all interactive elements (`a`, `button`, `input`, `textarea`, `select`, `[role=button]`, `[contenteditable]`), filters out invisible ones (hidden, `display:none`, zero dimensions), and prioritises in-viewport elements. Resolves accessible labels via `aria-labelledby`, `aria-label`, `for/id`, and wrapping `<label>`. Generates stable CSS selectors preferring `name`, `placeholder`, and `aria-label` attributes over fragile `:nth-of-type()` indices. Caps at 60 candidates. Returns `url`, `title`, `textPreview`, and `candidates[]`.
94
-
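The selector preference described above can be illustrated without a DOM. This is a minimal sketch, not the package's code: `ElementInfo`, `sketchSelector`, and the use of `JSON.stringify` to quote attribute values are assumptions made for illustration only.

```ts
// Hypothetical, DOM-free description of an element (not a type from contracts.ts).
interface ElementInfo {
  tag: string;
  name?: string;
  placeholder?: string;
  ariaLabel?: string;
  nthOfType: number; // 1-based position among siblings with the same tag
}

// Prefer stable attributes; use a positional selector only as a last resort.
function sketchSelector(el: ElementInfo): string {
  const stable: Array<[string, string | undefined]> = [
    ["name", el.name],
    ["placeholder", el.placeholder],
    ["aria-label", el.ariaLabel],
  ];
  for (const [attr, value] of stable) {
    if (value) {
      // JSON.stringify yields a double-quoted string with quotes and backslashes
      // escaped. This is a simplification of full CSS string escaping.
      return `${el.tag}[${attr}=${JSON.stringify(value)}]`;
    }
  }
  return `${el.tag}:nth-of-type(${el.nthOfType})`;
}
```

An attribute-based selector tends to survive re-renders that reorder siblings, which is presumably why positional indices are treated as the fallback.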
95
- **`planner.ts` — `planNextAction()`**
96
- Two modes:
97
- - *Heuristic* — pure regex. Matches `go to <url>`, `search for <x>`, `fill "<text>" in <field>`, `click <target>` patterns against the goal string, then falls back to filling the first visible input or clicking the first visible button.
98
- - *WebLLM* — delegates to `window.__browserAgentWebLLM.plan()`. The bridge is external — you wire it in. Accepts both legacy `AgentAction` returns and the new `PlannerResult` (with `evaluation`, `memory`, `nextGoal` reflection fields).
99
-
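The four goal patterns listed for the heuristic mode could be matched roughly as follows. This is a sketch under stated assumptions: `SketchIntent`, `sketchHeuristicPlan`, the exact regular expressions, and the `https://` default are all hypothetical, and the snapshot-based fallback (first visible input or button) is left out because it needs a `PageSnapshot`.

```ts
// Hypothetical intent shape for illustration; the real AgentAction may differ.
type SketchIntent =
  | { type: "navigate"; url: string }
  | { type: "search"; query: string }
  | { type: "type"; text: string; field: string }
  | { type: "click"; target: string }
  | { type: "none" };

function sketchHeuristicPlan(goal: string): SketchIntent {
  const g = goal.trim();
  let m: RegExpMatchArray | null;

  // go to <url>
  if ((m = g.match(/^go to\s+(\S+)/i))) {
    const raw = m[1] ?? "";
    // Assumption: bare hosts are upgraded to https.
    const url = /^https?:\/\//i.test(raw) ? raw : `https://${raw}`;
    return { type: "navigate", url };
  }
  // search for <x>
  if ((m = g.match(/^search for\s+(.+)$/i))) {
    return { type: "search", query: (m[1] ?? "").trim() };
  }
  // fill "<text>" in <field>
  if ((m = g.match(/^fill\s+"([^"]*)"\s+in\s+(.+)$/i))) {
    return { type: "type", text: m[1] ?? "", field: (m[2] ?? "").trim() };
  }
  // click <target>
  if ((m = g.match(/^click\s+(.+)$/i))) {
    return { type: "click", target: (m[1] ?? "").trim() };
  }
  // No pattern matched; the real planner would now consult the page snapshot.
  return { type: "none" };
}
```

Ordering matters in a planner like this: the first matching pattern wins, so more specific patterns should be tested before more general ones.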
100
- **`executor.ts` — `executeAction()`**
101
- Performs the planned action against the live DOM. It dispatches `InputEvent` with `bubbles: true` so that React/Vue controlled inputs receive proper framework events. It verifies that the element exists, that it is not disabled (for clicks), that the value was updated (for type), and that extracted text is non-empty. When a selector fails, it tries to recover via tag+attribute matching or a single-element shortcut before throwing. It throws on failure so the retry loop can feed `lastError` back to the planner.
102
-
103
- ---
104
-
105
- ### Shared (`src/shared/`)
106
-
107
- **`contracts.ts`** — All TypeScript interfaces and union types. The single source of truth for `AgentAction`, `PageSnapshot`, `AgentSession`, `PlannerResult`, `ContentResult`, and the library config/event types.
108
-
109
- **`safety.ts` — `assessRisk()`**
110
- Returns one of three risk levels for any action:
111
-
112
- | Level | Meaning | Examples |
113
- |---|---|---|
114
- | `safe` | Execute immediately | `navigate` to http/https, `click` neutral label, `scroll`, `wait`, `focus` |
115
- | `review` | Pause for human approval in `human-approved` mode | `extract`, `click`/`type` on labels matching delete/pay/submit/confirm/transfer |
116
- | `blocked` | Never execute | `navigate` to `javascript:`, `file:`, or malformed URLs |
117
-
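The table above maps onto a small classifier. The following is a simplified sketch, not the package's `assessRisk()`: the `SketchAction` shape, the keyword regex, and the choice to block every non-http(s) URL (the table only names `javascript:`, `file:`, and malformed URLs) are assumptions.

```ts
type RiskLevel = "safe" | "review" | "blocked";

// Hypothetical, simplified action shape; the real AgentAction may differ.
interface SketchAction {
  type: "navigate" | "click" | "type" | "extract" | "scroll" | "wait" | "focus";
  url?: string;
  label?: string;
}

// Labels that suggest a destructive or financial action.
const RISKY_LABEL = /\b(delete|pay|submit|confirm|transfer)\b/i;

function sketchAssessRisk(action: SketchAction): RiskLevel {
  switch (action.type) {
    case "navigate":
      // Assumption: only absolute http/https URLs with a host are safe.
      return /^https?:\/\/[^\s/]+/i.test(action.url ?? "") ? "safe" : "blocked";
    case "extract":
      return "review";
    case "click":
    case "type":
      return RISKY_LABEL.test(action.label ?? "") ? "review" : "safe";
    default:
      // scroll, wait, focus
      return "safe";
  }
}
```

An allow-list for URL schemes is the more conservative design: anything the sketch does not recognise is blocked rather than passed through.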
118
- **`parse-action.ts`** — Handles LLM output that may be wrapped in markdown fences, embedded in prose, or using the full reflection format `{ evaluation, memory, next_goal, action }`. Gracefully returns a `done` action on any parse failure so the loop never crashes.
119
-
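A tolerant parser of this kind might look like the sketch below. It is not the package's `parsePlannerResult()`: the type names, the "outermost braces" extraction strategy, the `reason` field on the fallback action, and the mapping of `next_goal` to `nextGoal` are assumptions for illustration.

```ts
// Hypothetical shapes; the real types live in contracts.ts and may differ.
interface SketchAction { type: string; [key: string]: unknown }
interface SketchPlannerResult {
  action: SketchAction;
  evaluation?: string;
  memory?: string;
  nextGoal?: string;
}

const isRecord = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null && !Array.isArray(v);

const asString = (v: unknown): string | undefined =>
  typeof v === "string" ? v : undefined;

function sketchParsePlannerResult(raw: string): SketchPlannerResult {
  const fallback: SketchPlannerResult = {
    action: { type: "done", reason: "unparseable planner output" },
  };
  // Take the outermost {...}; this skips markdown fences and surrounding prose.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return fallback;
  try {
    const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
    if (!isRecord(parsed)) return fallback;

    // Reflection format: { evaluation, memory, next_goal, action }
    const inner = parsed["action"];
    if (isRecord(inner)) {
      const innerType = inner["type"];
      if (typeof innerType !== "string") return fallback;
      return {
        action: { ...inner, type: innerType },
        evaluation: asString(parsed["evaluation"]),
        memory: asString(parsed["memory"]),
        nextGoal: asString(parsed["next_goal"]),
      };
    }
    // Legacy format: a bare action object.
    const bareType = parsed["type"];
    if (typeof bareType === "string") return { action: { ...parsed, type: bareType } };
    return fallback;
  } catch {
    // Invalid JSON: end the run gracefully instead of crashing the loop.
    return fallback;
  }
}
```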
120
- ---
121
-
122
- ### Planner modes
123
-
124
- | Mode | Description | When to use |
125
- |---|---|---|
126
- | `heuristic` | Zero-dependency regex-based planner. Works fully offline. | Simple, predictable goals — navigate, search, fill a field, click a button |
127
- | `webllm` | Delegates to a `window.__browserAgentWebLLM` bridge. Fully private, runs on-device via WebGPU. | Open-ended, multi-step, or language-heavy goals |
128
-
129
- ---
130
-
131
- ### Agent modes
132
-
133
- | Mode | Behaviour |
134
- |---|---|
135
- | `autonomous` | All `safe` and `review` actions execute without pause; `blocked` actions still stop the run |
136
- | `human-approved` | `review`-rated actions pause and emit `onApprovalRequired` — user must call `resume()` or click **Approve** in the popup |
137
-
138
- ---
139
-
140
- ### Data flow (one tick)
141
-
142
- ```
143
- goal + history
144
-       │
145
-       ▼
146
- observer.collectSnapshot()  ──→  PageSnapshot (url, title, candidates[])
147
-       │
148
-       ▼
149
- planner.planNextAction()  ──→  PlannerResult { action, evaluation?, memory?, nextGoal? }
150
-       │
151
-       ▼
152
- safety.assessRisk(action)  ──→  safe | review | blocked
153
-       │
154
-    ┌──┴──────────────────────┐
155
- blocked                 review (human-approved mode)
156
-    │                         │
157
-   stop            pause → user approves → resume
158
-       │
159
- safe / approved
160
-       │
161
-       ▼
162
- executor.executeAction(action)  ──→  result string
163
-       │
164
-       ▼
165
- session.history.push(result)
166
-       → next tick
167
- ```
168
-
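The tick in the diagram above can be expressed as a single function with the four stages injected. This is a synchronous sketch for readability (the real stages are likely asynchronous); `TickDeps`, `TickOutcome`, and `sketchTick` are hypothetical names, not part of the package.

```ts
type Risk = "safe" | "review" | "blocked";
type Mode = "autonomous" | "human-approved";

// Hypothetical stand-ins for the observer, planner, safety, and executor modules.
interface TickDeps<Snapshot, Action> {
  observe: () => Snapshot;
  plan: (goal: string, snapshot: Snapshot, history: string[]) => Action;
  assess: (action: Action) => Risk;
  execute: (action: Action) => string;
}

type TickOutcome<Action> =
  | { status: "executed"; result: string }
  | { status: "needs-approval"; action: Action }
  | { status: "blocked"; action: Action };

function sketchTick<S, A>(
  goal: string,
  mode: Mode,
  history: string[],
  deps: TickDeps<S, A>,
): TickOutcome<A> {
  const snapshot = deps.observe();
  const action = deps.plan(goal, snapshot, history);
  const risk = deps.assess(action);

  // blocked actions never run, in either mode.
  if (risk === "blocked") return { status: "blocked", action };
  // review actions pause only when a human is in the loop.
  if (risk === "review" && mode === "human-approved") {
    return { status: "needs-approval", action };
  }
  const result = deps.execute(action);
  history.push(result); // feeds the planner on the next tick
  return { status: "executed", result };
}
```

Passing the stages in as functions keeps the loop testable without a browser, which mirrors how the extension and library paths are described as sharing one engine.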
169
- ---
170
-
171
- ## Project structure
172
-
173
- ```
174
- src/
175
- ├── background/ Extension service worker — session management
176
- ├── content/ Extension content script — runs in page context
177
- ├── core/ Shared engine (planner, observer, executor)
178
- │ ├── planner.ts
179
- │ ├── observer.ts
180
- │ └── executor.ts
181
- ├── lib/ npm library entry — BrowserAgent class
182
- │ └── index.ts
183
- ├── popup/ Extension popup UI
184
- │ ├── index.html
185
- │ └── index.ts
186
- └── shared/ Types, safety, and parse utilities
187
-     ├── contracts.ts
188
-     ├── safety.ts
189
-     └── parse-action.ts
190
- ```
191
-
192
- ---
193
-
194
- ## Quick reference
195
-
196
- ```ts
197
- import { createBrowserAgent } from "@akshayram1/omnibrowser-agent";
198
-
199
- const agent = createBrowserAgent({
200
-   goal: "Search for contact John Smith in CRM",
201
-   mode: "human-approved", // or "autonomous"
202
-   planner: { kind: "heuristic" } // or "webllm"
203
- }, {
204
-   onStep: (result, session) => console.log(result.message),
205
-   onApprovalRequired: (action, session) => console.log("Review:", action),
206
-   onDone: (result, session) => console.log("Done:", result.message),
207
- });
208
-
209
- await agent.start();
210
-
211
- // After onApprovalRequired fires:
212
- await agent.resume();
213
-
214
- // Cancel at any time:
215
- agent.stop();
216
- ```
217
-
218
- ---
219
-
220
- *MIT © Akshay Chame — [github.com/akshayram1/omnibrowser-agent](https://github.com/akshayram1/omnibrowser-agent)*