webclaw-hybrid-engine-ln 1.0.3 β†’ 1.0.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +43 -334
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -1,385 +1,94 @@
- # WebClaw Hybrid Engine
+ # WebClaw Hybrid Engine 🚀
 
- **Official repository:** [github.com/ngoclinh15994/webclaw-gateway](https://github.com/ngoclinh15994/webclaw-gateway)
+ **The ultimate privacy-first web scraping bridge for AI agents.**
 
- **Save up to 90% on LLM scraping costs with a hybrid stealth pipeline.**
- [Crawlee](https://crawlee.dev/) **Cheerio** fast path for static HTML, **Playwright** for SPAs and bot-heavy pages—100% Node.js, Windows-friendly.
+ WebClaw runs entirely on your machine: a **zero-Docker**, **NPM-native** Node.js stack that turns complex pages into **clean Markdown** for LLMs—without shipping raw HTML or browsing context to a third-party scraper.
 
  ---
 
- ## Why WebClaw Hybrid Engine?
-
- Most scraping stacks force a bad tradeoff:
- - fast but blocked,
- - or robust but expensive and slow.
-
- This project uses a **smart hybrid architecture** to get both:
-
- 1. **Fast Path (Crawlee Cheerio)**
- HTTP fetch + Cheerio parsing for static pages (≈30s timeout per request).
- 2. **Stealth Path (Crawlee Playwright)**
- Headless Chromium when HTML is too small, blocked, or SPA-like (Crawlee browser pool + your synced cookies).
- 3. **Optional extension bridge**
- Set `WEBCLAW_EXTENSION_WS` for a WebSocket that returns `{ "html": "..." }` if Playwright fails.
- 4. **Token-First Output**
- HTML is purified to Markdown (Readability + Turndown for `article` mode; full-body Turndown for `ecommerce`), then measured with `tiktoken`.
-
- Bottom line: you stop paying to send useless HTML noise into GPT/Claude.
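Editor's aside on the fast-path/stealth-path routing the removed README describes ("too small, blocked, or SPA-like"): that decision can be sketched as a small heuristic. The 500-character threshold, the marker list, and the function name below are illustrative assumptions, not the engine's actual internals:

```javascript
// Illustrative heuristic: is a plain-HTTP (Cheerio-tier) result good enough,
// or should the page be retried with a headless browser (Playwright tier)?
// The threshold and SPA markers are assumptions for this sketch only.
function needsBrowserRender(html) {
  if (!html || html.length < 500) return true; // too small: likely a JS shell
  const spaMarkers = ['id="root"', 'id="app"', 'window.__NUXT__'];
  const hasMarker = spaMarkers.some((m) => html.includes(m));
  // Strip scripts and tags; a near-empty body plus an SPA mount point
  // suggests the real content is rendered client-side.
  const bodyText = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, '')
    .trim();
  return hasMarker && bodyText.length < 200;
}
```

A real orchestrator would also consider HTTP status codes and anti-bot challenge pages, which this sketch ignores.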
-
- ---
-
- ## Key Features
-
- - **Hybrid extractor brain (Crawlee)**
- - Auto-routing between **CheerioCrawler** and **PlaywrightCrawler** (~30s caps per level).
- - **Primary KPI: token reduction**
- - Returns `raw_tokens`, `cleaned_tokens`, `tokens_saved`, `reduction_percentage`.
- - **Markdown purification pipeline**
- - `jsdom` + `@mozilla/readability` + `turndown`.
- - **Anti-bot fallback support**
- - Handles SPA shells, Cloudflare/Captcha-like failures via browser render path.
- - **Cookie sync-ready**
- - Cookie API and `data/cookies.json` support for 1-click sync workflows (including Chrome extension integration).
- - **Zero-Docker Node.js setup**
- - `npm install`, `npm run setup` (`npx playwright install chromium`), `npm start` from repo root.
- - **Built-in dashboard (API monitoring)**
- - UI on `http://localhost:8822`: aggregate stats, paginated scrape history, cookie manager, exclude-URL list, OpenClaw skill installer, bilingual toggle (EN/VI).
- - **SQLite history at scale**
- - Scrape history is persisted in `data/webclaw.db` (no flat `history.json` bottleneck).
- - **User settings (exclude URLs)**
- - `data/settings.json` with `exclude_urls`; blocked URLs return `EXCLUDED_BY_USER` before any crawl (Cheerio/Playwright).
- - **OpenClaw agent integration**
- - One-click auto-install into `~/.openclaw/skills/webclaw_scraper/SKILL.md` (Markdown skill, not a TS tool).
-
- ---
-
- ## Architecture
-
- Native Node.js hybrid runtime (no Docker required):
-
- - `services/gateway/src/orchestrator.js`
- - **CheerioCrawler** → **PlaywrightCrawler** → optional **`extensionFallback.js`** (`WEBCLAW_EXTENSION_WS`)
- - `services/gateway/src/extensionFallback.js`
- - Optional WebSocket bridge for a Chrome extension (`{ "type":"scrape","url" }` / `{ "html" }`)
- - `services/gateway/src/tokenMetrics.js`
- - KPI calculation using `tiktoken`
- - `services/gateway/src/db.js`
- - SQLite initialization + migrations + history/stats queries
- - `services/gateway/src/settings.js`
- - Reads/writes `data/settings.json`; scrape path checks exclude list before orchestration
- - `services/gateway/src/templates/openclaw-skill.md`
- - OpenClaw `SKILL.md` template (curl to the local API, `EXCLUDED_BY_USER` fallback rule)
-
- ---
-
- ## Quick Start
-
- ### Requirements
-
- - **Node.js 20+** and **npm**
- - **Git** (recommended) — clone from [github.com/ngoclinh15994/webclaw-gateway](https://github.com/ngoclinh15994/webclaw-gateway)
-
- ### Install and run (all platforms)
-
- **From npm (no clone):**
+ ## Quick start (the one-liner)
 
  ```bash
  npx webclaw-hybrid-engine-ln
  ```
 
- Wait until the terminal prints **Ready on port 8822**, then open **http://localhost:8822**.
-
- **From source:** clone the official repo (review the code on GitHub first if you are security-conscious):
-
- ```bash
- git clone https://github.com/ngoclinh15994/webclaw-gateway.git
- cd webclaw-gateway
- ```
-
- From the **repository root**:
-
- ```bash
- npm install
- npm run setup
- npm start
- ```
-
- - **`npm run setup`** runs `npx playwright install chromium` (required for the Playwright crawler tier).
-
- Dashboard and API: **http://localhost:8822**
-
- ### Windows (convenience)
-
- ```bat
- Start_WebClaw.bat
- ```
-
- Runs `npm install`, `npm run setup`, then `npm start`.
+ Wait until the terminal shows **Ready on port 8822**, then open the dashboard at **http://localhost:8822**.
 
- ### macOS / Linux (convenience)
-
- ```bash
- chmod +x Start_WebClaw.sh
- ./Start_WebClaw.sh
- ```
+ - **100% Node.js native** — no Docker, no container runtime.
+ - **Privacy-first** — fetching, rendering, and Markdown conversion happen **locally**; your URLs and page content are not sent to WebClaw as a hosted service.
 
  ---
 
- ## API Reference
-
- ### POST `/api/v1/scrape`
-
- Endpoint:
-
- ```text
- http://localhost:8822/api/v1/scrape
- ```
-
- Request body:
-
- ```json
- {
- "url": "https://example.com/article",
- "mode": "auto",
- "extract_mode": "article"
- }
- ```
-
- Optional **`extract_mode`**: `"article"` (default, Readability) or `"ecommerce"` (full body, no Readability). Omit or use `"article"` for most pages.
-
- Modes:
- - `auto`: Cheerio first, then Playwright (then optional extension WS if configured)
- - `fast_only`: Cheerio only (errors if HTML is too small / SPA-like)
- - `playwright_only`: Playwright only (then extension WS if Playwright fails)
-
- If the URL matches any string in `exclude_urls` (substring match, case-insensitive), the API returns immediately:
+ ## Background service (set & forget)
 
- ```json
- {
- "status": "error",
- "message": "EXCLUDED_BY_USER: This URL is blacklisted by user settings. Please use your default browser tool."
- }
- ```
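Editor's aside: the exclude rule the removed README documents (case-insensitive substring match against `exclude_urls`) is simple enough to sketch. The helper name is illustrative, not the engine's own:

```javascript
// Mirrors the documented exclude-URL behavior: a URL is refused when it
// contains any configured string, compared case-insensitively.
function isExcluded(url, excludeUrls) {
  const target = url.toLowerCase();
  return excludeUrls.some((pattern) => target.includes(pattern.toLowerCase()));
}

// isExcluded('https://www.YouTube.com/watch', ['youtube.com', 'localhost'])
// matches 'youtube.com', so the API would answer with EXCLUDED_BY_USER.
```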
-
- Example:
-
- ```bash
- curl -X POST "http://localhost:8822/api/v1/scrape" \
- -H "Content-Type: application/json" \
- -d '{"url":"https://example.com/article","mode":"auto"}'
- ```
-
- Success response shape:
-
- ```json
- {
- "status": "success",
- "extract_mode": "article",
- "engine_used": "crawlee_cheerio",
- "engine_label": "Crawlee Hybrid (Cheerio + Playwright)",
- "data": {
- "title": "Page title",
- "markdown": "# Clean markdown"
- },
- "metrics": {
- "raw_tokens": 12000,
- "cleaned_tokens": 900,
- "tokens_saved": 11100,
- "reduction_percentage": 92.5
- }
- }
- ```
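Editor's aside: the `metrics` object above is fully determined by the two token counts, so the KPI math can be reproduced in a few lines. Field names follow the response shape; the one-decimal rounding is an assumption for this sketch:

```javascript
// Recompute the KPI fields from the raw and cleaned token counts.
// tokens_saved = raw - cleaned; reduction_percentage = saved / raw * 100.
function tokenMetrics(rawTokens, cleanedTokens) {
  const saved = rawTokens - cleanedTokens;
  const pct = rawTokens > 0 ? (saved / rawTokens) * 100 : 0;
  return {
    raw_tokens: rawTokens,
    cleaned_tokens: cleanedTokens,
    tokens_saved: saved,
    reduction_percentage: Math.round(pct * 10) / 10, // assumed rounding
  };
}

// tokenMetrics(12000, 900) reproduces the documented example:
// tokens_saved 11100, reduction_percentage 92.5.
```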
-
- ### GET `/api/v1/history`
-
- Returns recent history from SQLite, newest first.
-
- Query params:
- - `limit` (default `50`, max `200`)
- - `offset` (default `0`)
-
- Example:
+ To keep the engine running after you close the terminal and to survive reboots (with PM2’s startup hook), use [PM2](https://pm2.keymetrics.io/):
 
  ```bash
- curl "http://localhost:8822/api/v1/history?limit=50&offset=0"
+ npm install -g pm2
+ pm2 start npx --name "webclaw" -- webclaw-hybrid-engine-ln
+ pm2 save && pm2 startup
  ```
 
- ### GET `/api/v1/stats`
-
- Returns aggregate stats from SQLite.
+ Follow the on-screen instructions from `pm2 startup` once (so PM2 respawns your apps after a reboot).
 
- Example:
-
- ```bash
- curl "http://localhost:8822/api/v1/stats"
- ```
-
- Response:
-
- ```json
- {
- "status": "success",
- "stats": {
- "total_requests": 1234,
- "total_tokens_saved": 9876543,
- "overall_reduction_percentage": 85.2
- }
- }
- ```
-
- ### GET `/api/v1/settings`
+ ---
 
- Returns user settings (`exclude_urls` list).
+ ## Management commands
 
- ```bash
- curl "http://localhost:8822/api/v1/settings"
- ```
+ | Command | What it does |
+ |--------|----------------|
+ | `pm2 status webclaw` | Check whether the **webclaw** process is running |
+ | `pm2 stop webclaw` | Stop the engine (does not remove it from PM2’s list) |
+ | `pm2 restart webclaw` | Restart the engine (e.g. after an update) |
+ | `pm2 delete webclaw` | Remove **webclaw** from PM2’s process list |
 
- Response:
+ For logs: `pm2 logs webclaw`.
 
- ```json
- {
- "status": "success",
- "settings": {
- "exclude_urls": ["youtube.com", "localhost"]
- }
- }
- ```
+ ---
 
- ### POST `/api/v1/settings`
+ ## Integration with OpenClaw
 
- Updates settings. Body must include `exclude_urls` (array of strings).
+ Install the published skill for your OpenClaw / ClawHub workflow:
 
  ```bash
- curl -X POST "http://localhost:8822/api/v1/settings" \
- -H "Content-Type: application/json" \
- -d '{"exclude_urls":["youtube.com"]}'
+ clawhub install webclaw-hybrid-engine-ln
  ```
 
- ### POST `/api/v1/integrate/openclaw`
+ The skill talks to **http://localhost:8822**. **The engine must be running** (foreground `npx` or PM2 **webclaw**) **on port 8822** before the agent can scrape.
 
- Automatically installs the OpenClaw skill file into the local user profile.
-
- Behavior:
- - detects `~/.openclaw`
- - creates `~/.openclaw/skills/webclaw_scraper/` if needed
- - writes `SKILL.md` from the OpenClaw skill template
-
- Success:
-
- ```json
- {
- "status": "success",
- "message": "WebClaw Skill successfully installed into OpenClaw!"
- }
- ```
-
- If OpenClaw is not installed (`~/.openclaw` missing), returns `404` with an error message.
+ You can also install the skill from the local dashboard (**Install OpenClaw Skill**) or via `POST /api/v1/integrate/openclaw` when the engine is already up.
 
  ---
 
- ## Cookie Manager API
-
- ### GET `/api/v1/cookies`
- Returns cookie entries from `data/cookies.json`.
+ ## Why WebClaw?
 
- ### POST `/api/v1/cookies`
- Stores cookie entries used for Playwright fallback.
-
- Request example:
-
- ```json
- {
- "cookies": [
- {
- "domain": "portal.example.com",
- "cookie_string": "session=abc123; cf_clearance=xyz"
- }
- ]
- }
- ```
-
- This format is designed for quick automation and browser-extension sync.
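Editor's aside on the removed cookie format: a `cookie_string` entry uses the standard `name=value; name2=value2` Cookie-header shape, so expanding it into per-cookie records (e.g. for seeding a browser context) is mechanical. The helper below is a hypothetical sketch, not part of the engine:

```javascript
// Hypothetical helper: expand one cookies.json entry into individual
// { name, value, domain } records. Assumes the documented
// "name=value; name2=value2" cookie_string format.
function expandCookieEntry(entry) {
  return entry.cookie_string.split(';').map((pair) => {
    const idx = pair.indexOf('='); // split on the first '=' only
    return {
      name: pair.slice(0, idx).trim(),
      value: pair.slice(idx + 1).trim(),
      domain: entry.domain,
    };
  });
}
```

A production version would also carry `path`, `secure`, and expiry fields, which this format does not encode.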
+ - **Hybrid engine** — Automatically uses a **fast Cheerio** path for static HTML and **Playwright** when pages are dynamic, SPA-heavy, or need a real browser context—powered by [Crawlee](https://crawlee.dev/).
+ - **Privacy-first** — Scraping, rendering, and Markdown extraction stay **on your machine**. You control cookies, blocklists, and data on disk.
+ - **Token-efficient** — Delivers **clean Markdown** (with readability-style extraction where appropriate) so agents ingest less noise—often **on the order of ~80% fewer tokens** versus sending raw HTML, depending on the site.
 
  ---
 
- ## Optional Docker (deprecated)
-
- Legacy `Dockerfile` / `docker-compose` samples live under **`.deprecated/docker/`** for reference only. The supported workflow is **native Node** (above).
+ ## API & dashboard
 
- Health check:
-
- ```text
- http://localhost:8822/health
- ```
+ - **Scrape:** `POST http://localhost:8822/api/v1/scrape` with JSON body `{"url": "<url>", "mode": "auto"}` (optional `extract_mode`: `article` | `ecommerce`).
+ - **Health:** `GET http://localhost:8822/health`
+ - **Dashboard:** **http://localhost:8822** — history, stats, cookies, exclude URLs, OpenClaw skill installer.
 
  ---
 
- ## For n8n / Automation Builders
-
- Use `POST /api/v1/scrape` as a standard HTTP node:
- - Input: URL + mode
- - Output: clean Markdown + token reduction KPI
- - Branch on `engine_used` if you want analytics by path
+ ## Requirements
 
- This is optimized for agents and workflows where token waste directly hits your cloud bill.
+ - **Node.js 20+**
+ - **npm** (for `npx`)
 
  ---
 
- ## OpenClaw Integration
-
- This repository ships a ready-to-use OpenClaw **skill** (Markdown `SKILL.md`, installed under `~/.openclaw/skills/webclaw_scraper/`):
-
- - Template source: `services/gateway/src/templates/openclaw-skill.md`
- - Skill name: `webclaw-hybrid-engine-ln`
- - Auto-install endpoint: `POST /api/v1/integrate/openclaw`
- - The skill uses `curl` against `http://localhost:8822/api/v1/scrape` and documents fallback when the API returns `EXCLUDED_BY_USER`.
-
- Quick flow:
-
- 1. Clone and run the engine from the [official repository](https://github.com/ngoclinh15994/webclaw-gateway) (see **Quick Start** above).
- 2. Start WebClaw Hybrid Engine (`Start_WebClaw.bat` or `Start_WebClaw.sh`)
- 3. Click `⚡ Install OpenClaw Skill` in dashboard (or call API)
- 4. Restart OpenClaw so it reloads skills
+ ## License
 
- Detailed notes:
- - `openclaw-skill/README.md`
+ Released under the [MIT License](https://opensource.org/licenses/MIT).
 
  ---
 
- ## Project Structure
-
- ```text
- .
- ├─ data/
- │ ├─ cookies.json
- │ ├─ settings.json
- │ └─ webclaw.db
- ├─ scripts/
- │ └─ setup.js # npx playwright install chromium
- ├─ openclaw-skill/
- │ └─ README.md
- ├─ services/
- │ └─ gateway/
- │ ├─ src/
- │ │ ├─ app.js
- │ │ ├─ extensionFallback.js
- │ │ ├─ orchestrator.js
- │ │ ├─ db.js
- │ │ ├─ settings.js
- │ │ └─ templates/openclaw-skill.md
- │ ├─ ui/
- │ └─ package.json
- ├─ package.json # workspace root (npm start / setup)
- ├─ Start_WebClaw.bat
- └─ Start_WebClaw.sh
- ```
-
- ---
-
- ## Credits
-
- - Crawlee: [Apify Crawlee](https://crawlee.dev/)
- - Markdown stack: Mozilla Readability, Turndown, tiktoken
+ ## Repository
 
- WebClaw Hybrid Engine is a **Node.js** orchestration layer around Crawlee and your local Playwright install.
+ **https://github.com/ngoclinh15994/webclaw-gateway**
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "webclaw-hybrid-engine-ln",
- "version": "1.0.3",
+ "version": "1.0.4",
  "description": "WebClaw Hybrid Engine — privacy-first local bridge for OpenClaw (Crawlee + Playwright).",
  "repository": {
  "type": "git",