webclaw-hybrid-engine-ln 1.0.3 → 1.0.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +43 -334
- package/package.json +1 -1
package/README.md
CHANGED

@@ -1,385 +1,94 @@
- # WebClaw Hybrid Engine
+ # WebClaw Hybrid Engine 🚀

- **
+ **The ultimate privacy-first web scraping bridge for AI agents.**

- **
- [Crawlee](https://crawlee.dev/) **Cheerio** fast path for static HTML, **Playwright** for SPAs and bot-heavy pages – 100% Node.js, Windows-friendly.
+ WebClaw runs entirely on your machine: a **zero-Docker**, **NPM-native** Node.js stack that turns complex pages into **clean Markdown** for LLMs – without shipping raw HTML or browsing context to a third-party scraper.
---

- ##
-
- Most scraping stacks force a bad tradeoff:
- - fast but blocked,
- - or robust but expensive and slow.
-
- This project uses a **smart hybrid architecture** to get both:
-
- 1. **Fast Path (Crawlee Cheerio)**
-    HTTP fetch + Cheerio parsing for static pages (~30s timeout per request).
- 2. **Stealth Path (Crawlee Playwright)**
-    Headless Chromium when HTML is too small, blocked, or SPA-like (Crawlee browser pool + your synced cookies).
- 3. **Optional extension bridge**
-    Set `WEBCLAW_EXTENSION_WS` for a WebSocket that returns `{ "html": "..." }` if Playwright fails.
- 4. **Token-First Output**
-    HTML is purified to Markdown (Readability + Turndown for `article` mode; full-body Turndown for `ecommerce`), then measured with `tiktoken`.
-
- Bottom line: you stop paying to send useless HTML noise into GPT/Claude.
-
- ---
-
- ## Key Features
-
- - **Hybrid extractor brain (Crawlee)**
-   - Auto-routing between **CheerioCrawler** and **PlaywrightCrawler** (~30s caps per level).
- - **Primary KPI: token reduction**
-   - Returns `raw_tokens`, `cleaned_tokens`, `tokens_saved`, `reduction_percentage`.
- - **Markdown purification pipeline**
-   - `jsdom` + `@mozilla/readability` + `turndown`.
- - **Anti-bot fallback support**
-   - Handles SPA shells, Cloudflare/Captcha-like failures via browser render path.
- - **Cookie sync-ready**
-   - Cookie API and `data/cookies.json` support for 1-click sync workflows (including Chrome extension integration).
- - **Zero-Docker Node.js setup**
-   - `npm install`, `npm run setup` (`npx playwright install chromium`), `npm start` from repo root.
- - **Built-in dashboard (API monitoring)**
-   - UI on `http://localhost:8822`: aggregate stats, paginated scrape history, cookie manager, exclude-URL list, OpenClaw skill installer, bilingual toggle (EN/VI).
- - **SQLite history at scale**
-   - Scrape history is persisted in `data/webclaw.db` (no flat `history.json` bottleneck).
- - **User settings (exclude URLs)**
-   - `data/settings.json` with `exclude_urls`; blocked URLs return `EXCLUDED_BY_USER` before any crawl (Cheerio/Playwright).
- - **OpenClaw agent integration**
-   - One-click auto-install into `~/.openclaw/skills/webclaw_scraper/SKILL.md` (Markdown skill, not a TS tool).
-
- ---
-
- ## Architecture
-
- Native Node.js hybrid runtime (no Docker required):
-
- - `services/gateway/src/orchestrator.js`
-   - **CheerioCrawler** → **PlaywrightCrawler** → optional **`extensionFallback.js`** (`WEBCLAW_EXTENSION_WS`)
- - `services/gateway/src/extensionFallback.js`
-   - Optional WebSocket bridge for a Chrome extension (`{ "type":"scrape","url" }` / `{ "html" }`)
- - `services/gateway/src/tokenMetrics.js`
-   - KPI calculation using `tiktoken`
- - `services/gateway/src/db.js`
-   - SQLite initialization + migrations + history/stats queries
- - `services/gateway/src/settings.js`
-   - Reads/writes `data/settings.json`; scrape path checks exclude list before orchestration
- - `services/gateway/src/templates/openclaw-skill.md`
-   - OpenClaw `SKILL.md` template (curl to the local API, `EXCLUDED_BY_USER` fallback rule)
-
- ---
-
- ## Quick Start
-
- ### Requirements
-
- - **Node.js 20+** and **npm**
- - **Git** (recommended) – clone from [github.com/ngoclinh15994/webclaw-gateway](https://github.com/ngoclinh15994/webclaw-gateway)
-
- ### Install and run (all platforms)
-
- **From npm (no clone):**
+ ## Quick start (the one-liner)

```bash
npx webclaw-hybrid-engine-ln
```

- Wait until the terminal
-
- **From source:** clone the official repo (review the code on GitHub first if you are security-conscious):
-
- ```bash
- git clone https://github.com/ngoclinh15994/webclaw-gateway.git
- cd webclaw-gateway
- ```
-
- From the **repository root**:
-
- ```bash
- npm install
- npm run setup
- npm start
- ```
-
- - **`npm run setup`** runs `npx playwright install chromium` (required for the Playwright crawler tier).
-
- Dashboard and API: **http://localhost:8822**
-
- ### Windows (convenience)
-
- ```bat
- Start_WebClaw.bat
- ```
-
- Runs `npm install`, `npm run setup`, then `npm start`.
+ Wait until the terminal shows **Ready on port 8822**, then open the dashboard at **http://localhost:8822**.

-
-
- ```bash
- chmod +x Start_WebClaw.sh
- ./Start_WebClaw.sh
- ```
+ - **100% Node.js native** – no Docker, no container runtime.
+ - **Privacy-first** – fetching, rendering, and Markdown conversion happen **locally**; your URLs and page content are not sent to WebClaw as a hosted service.

---

- ##
-
- ### POST `/api/v1/scrape`
-
- Endpoint:
-
- ```text
- http://localhost:8822/api/v1/scrape
- ```
-
- Request body:
-
- ```json
- {
-   "url": "https://example.com/article",
-   "mode": "auto",
-   "extract_mode": "article"
- }
- ```
-
- Optional **`extract_mode`**: `"article"` (default, Readability) or `"ecommerce"` (full body, no Readability). Omit or use `"article"` for most pages.
-
- Modes:
- - `auto`: Cheerio first, then Playwright (then optional extension WS if configured)
- - `fast_only`: Cheerio only (errors if HTML is too small / SPA-like)
- - `playwright_only`: Playwright only (then extension WS if Playwright fails)
-
- If the URL matches any string in `exclude_urls` (substring match, case-insensitive), the API returns immediately:
+ ## Background service (set & forget)

-
- {
-   "status": "error",
-   "message": "EXCLUDED_BY_USER: This URL is blacklisted by user settings. Please use your default browser tool."
- }
- ```
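The exclude gate documented above is a plain case-insensitive substring match against `exclude_urls`. A minimal sketch of that check (a hypothetical helper for illustration, not the engine's actual `settings.js` code):

```javascript
// Hypothetical helper mirroring the documented exclude-URL behaviour:
// a URL is blocked if any configured pattern occurs anywhere in it,
// compared case-insensitively.
function isExcluded(url, excludeUrls) {
  const haystack = url.toLowerCase();
  return excludeUrls.some((pattern) => haystack.includes(pattern.toLowerCase()));
}

console.log(isExcluded("https://www.YouTube.com/watch?v=abc", ["youtube.com", "localhost"])); // true
console.log(isExcluded("https://example.com/article", ["youtube.com", "localhost"])); // false
```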
-
- Example:
-
- ```bash
- curl -X POST "http://localhost:8822/api/v1/scrape" \
-   -H "Content-Type: application/json" \
-   -d '{"url":"https://example.com/article","mode":"auto"}'
- ```
-
- Success response shape:
-
- ```json
- {
-   "status": "success",
-   "extract_mode": "article",
-   "engine_used": "crawlee_cheerio",
-   "engine_label": "Crawlee Hybrid (Cheerio + Playwright)",
-   "data": {
-     "title": "Page title",
-     "markdown": "# Clean markdown"
-   },
-   "metrics": {
-     "raw_tokens": 12000,
-     "cleaned_tokens": 900,
-     "tokens_saved": 11100,
-     "reduction_percentage": 92.5
-   }
- }
- ```
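The `metrics` object in that response is simple arithmetic over the two token counts, which the engine measures with `tiktoken`. A sketch of how the reported fields relate to each other (the one-decimal rounding is an assumption for illustration):

```javascript
// Illustrative KPI arithmetic for the scrape response's `metrics` object.
// Token counting itself is done by the engine with tiktoken; this helper
// only shows the relationship between the reported fields.
function tokenMetrics(rawTokens, cleanedTokens) {
  const tokensSaved = rawTokens - cleanedTokens;
  // Percentage of raw tokens eliminated, rounded to one decimal place.
  const reductionPercentage = Math.round((tokensSaved / rawTokens) * 1000) / 10;
  return {
    raw_tokens: rawTokens,
    cleaned_tokens: cleanedTokens,
    tokens_saved: tokensSaved,
    reduction_percentage: reductionPercentage,
  };
}

console.log(tokenMetrics(12000, 900));
// { raw_tokens: 12000, cleaned_tokens: 900, tokens_saved: 11100, reduction_percentage: 92.5 }
```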
-
- ### GET `/api/v1/history`
-
- Returns recent history from SQLite, newest first.
-
- Query params:
- - `limit` (default `50`, max `200`)
- - `offset` (default `0`)
-
- Example:
+ To keep the engine running after you close the terminal and to survive reboots (with PM2's startup hook), use [PM2](https://pm2.keymetrics.io/):

```bash
-
+ npm install -g pm2
+ pm2 start npx --name "webclaw" -- webclaw-hybrid-engine-ln
+ pm2 save && pm2 startup
```

-
-
- Returns aggregate stats from SQLite.
+ Follow the on-screen instructions from `pm2 startup` once (so PM2 respawns your apps after a reboot).

-
-
- ```bash
- curl "http://localhost:8822/api/v1/stats"
- ```
-
- Response:
-
- ```json
- {
-   "status": "success",
-   "stats": {
-     "total_requests": 1234,
-     "total_tokens_saved": 9876543,
-     "overall_reduction_percentage": 85.2
-   }
- }
- ```
-
- ### GET `/api/v1/settings`
+ ---

-
+ ## Management commands

-
-
-
+ | Command | What it does |
+ |---------|--------------|
+ | `pm2 status webclaw` | Check whether the **webclaw** process is running |
+ | `pm2 stop webclaw` | Stop the engine (does not remove it from PM2's list) |
+ | `pm2 restart webclaw` | Restart the engine (e.g. after an update) |
+ | `pm2 delete webclaw` | Remove **webclaw** from PM2's process list |

-
+ For logs: `pm2 logs webclaw`.

-
- {
-   "status": "success",
-   "settings": {
-     "exclude_urls": ["youtube.com", "localhost"]
-   }
- }
- ```
+ ---

-
+ ## Integration with OpenClaw

-
+ Install the published skill for your OpenClaw / ClawHub workflow:

```bash
-
-   -H "Content-Type: application/json" \
-   -d '{"exclude_urls":["youtube.com"]}'
+ clawhub install webclaw-hybrid-engine-ln
```

-
+ The skill talks to **http://localhost:8822**. **The engine must be running** (foreground `npx` or PM2 **webclaw**) **on port 8822** before the agent can scrape.

-
-
- Behavior:
- - detects `~/.openclaw`
- - creates `~/.openclaw/skills/webclaw_scraper/` if needed
- - writes `SKILL.md` from the OpenClaw skill template
-
- Success:
-
- ```json
- {
-   "status": "success",
-   "message": "WebClaw Skill successfully installed into OpenClaw!"
- }
- ```
-
- If OpenClaw is not installed (`~/.openclaw` missing), returns `404` with an error message.
+ You can also install the skill from the local dashboard (**Install OpenClaw Skill**) or via `POST /api/v1/integrate/openclaw` when the engine is already up.

---

- ##
-
- ### GET `/api/v1/cookies`
- Returns cookie entries from `data/cookies.json`.
+ ## Why WebClaw?

-
-
-
- Request example:
-
- ```json
- {
-   "cookies": [
-     {
-       "domain": "portal.example.com",
-       "cookie_string": "session=abc123; cf_clearance=xyz"
-     }
-   ]
- }
- ```
-
- This format is designed for quick automation and browser-extension sync.
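For illustration, here is a hypothetical helper (not part of the package) that flattens browser-style cookie objects into the `cookie_string` shape used by `data/cookies.json`:

```javascript
// Hypothetical converter: browser extensions expose cookies as
// { name, value } objects; the engine's cookie file wants a single
// "name=value; name=value" string per domain.
function toCookieEntry(domain, cookies) {
  const cookie_string = cookies.map(({ name, value }) => `${name}=${value}`).join("; ");
  return { domain, cookie_string };
}

console.log(
  toCookieEntry("portal.example.com", [
    { name: "session", value: "abc123" },
    { name: "cf_clearance", value: "xyz" },
  ])
);
// { domain: 'portal.example.com', cookie_string: 'session=abc123; cf_clearance=xyz' }
```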
+ - **Hybrid engine** – Automatically uses a **fast Cheerio** path for static HTML and **Playwright** when pages are dynamic, SPA-heavy, or need a real browser context – powered by [Crawlee](https://crawlee.dev/).
+ - **Privacy-first** – Scraping, rendering, and Markdown extraction stay **on your machine**. You control cookies, blocklists, and data on disk.
+ - **Token-efficient** – Delivers **clean Markdown** (with readability-style extraction where appropriate) so agents ingest less noise – often **on the order of ~80% fewer tokens** versus sending raw HTML, depending on the site.

---

- ##
-
- Legacy `Dockerfile` / `docker-compose` samples live under **`.deprecated/docker/`** for reference only. The supported workflow is **native Node** (above).
+ ## API & dashboard

-
-
-
- http://localhost:8822/health
- ```
+ - **Scrape:** `POST http://localhost:8822/api/v1/scrape` with JSON body `{"url": "<url>", "mode": "auto"}` (optional `extract_mode`: `article` | `ecommerce`).
+ - **Health:** `GET http://localhost:8822/health`
+ - **Dashboard:** **http://localhost:8822** – history, stats, cookies, exclude URLs, OpenClaw skill installer.
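The scrape endpoint above can be called with any HTTP client. A minimal client sketch (the `scrapeRequest` helper is hypothetical, and the usage line assumes an engine already listening on port 8822):

```javascript
// Hypothetical request builder for POST /api/v1/scrape; it only assembles
// the fetch options, so it can be inspected without a running engine.
function scrapeRequest(url, mode = "auto", extractMode = "article") {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, mode, extract_mode: extractMode }),
  };
}

// Usage (requires the engine to be running):
// const res = await fetch("http://localhost:8822/api/v1/scrape",
//   scrapeRequest("https://example.com/article"));
// const { data, metrics } = await res.json();
```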

---

- ##
-
- Use `POST /api/v1/scrape` as a standard HTTP node:
- - Input: URL + mode
- - Output: clean Markdown + token reduction KPI
- - Branch on `engine_used` if you want analytics by path
+ ## Requirements

-
+ - **Node.js 20+**
+ - **npm** (for `npx`)

---

- ##
-
- This repository ships a ready-to-use OpenClaw **skill** (Markdown `SKILL.md`, installed under `~/.openclaw/skills/webclaw_scraper/`):
-
- - Template source: `services/gateway/src/templates/openclaw-skill.md`
- - Skill name: `webclaw-hybrid-engine-ln`
- - Auto-install endpoint: `POST /api/v1/integrate/openclaw`
- - The skill uses `curl` against `http://localhost:8822/api/v1/scrape` and documents fallback when the API returns `EXCLUDED_BY_USER`.
-
- Quick flow:
-
- 1. Clone and run the engine from the [official repository](https://github.com/ngoclinh15994/webclaw-gateway) (see **Quick Start** above).
- 2. Start WebClaw Hybrid Engine (`Start_WebClaw.bat` or `Start_WebClaw.sh`)
- 3. Click **Install OpenClaw Skill** in the dashboard (or call the API)
- 4. Restart OpenClaw so it reloads skills
+ ## License

-
- - `openclaw-skill/README.md`
+ Released under the [MIT License](https://opensource.org/licenses/MIT).

---

- ##
-
- ```text
- .
- ├── data/
- │   ├── cookies.json
- │   ├── settings.json
- │   └── webclaw.db
- ├── scripts/
- │   └── setup.js            # npx playwright install chromium
- ├── openclaw-skill/
- │   └── README.md
- ├── services/
- │   └── gateway/
- │       ├── src/
- │       │   ├── app.js
- │       │   ├── extensionFallback.js
- │       │   ├── orchestrator.js
- │       │   ├── db.js
- │       │   ├── settings.js
- │       │   └── templates/openclaw-skill.md
- │       ├── ui/
- │       └── package.json
- ├── package.json            # workspace root (npm start / setup)
- ├── Start_WebClaw.bat
- └── Start_WebClaw.sh
- ```
-
- ---
-
- ## Credits
-
- - Crawlee: [Apify Crawlee](https://crawlee.dev/)
- - Markdown stack: Mozilla Readability, Turndown, tiktoken
+ ## Repository

-
+ **https://github.com/ngoclinh15994/webclaw-gateway**