webclaw-hybrid-engine-ln 1.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +385 -0
- package/bin/cli.js +17 -0
- package/package.json +45 -0
- package/scripts/setup.js +4 -0
- package/services/gateway/package.json +25 -0
- package/services/gateway/src/app.js +199 -0
- package/services/gateway/src/config.js +42 -0
- package/services/gateway/src/cookies.js +188 -0
- package/services/gateway/src/db.js +78 -0
- package/services/gateway/src/extensionFallback.js +76 -0
- package/services/gateway/src/markdown.js +96 -0
- package/services/gateway/src/orchestrator.js +210 -0
- package/services/gateway/src/settings.js +62 -0
- package/services/gateway/src/templates/openclaw-skill.md +50 -0
- package/services/gateway/src/tokenMetrics.js +40 -0
- package/services/gateway/ui/app.js +546 -0
- package/services/gateway/ui/index.html +154 -0
package/README.md
ADDED
|
@@ -0,0 +1,385 @@
|
|
|
1
|
+
# WebClaw Hybrid Engine
|
|
2
|
+
|
|
3
|
+
**Official repository:** [github.com/ngoclinh15994/webclaw-gateway](https://github.com/ngoclinh15994/webclaw-gateway)
|
|
4
|
+
|
|
5
|
+
**Save up to 90% on LLM scraping costs with a hybrid stealth pipeline.**
|
|
6
|
+
A [Crawlee](https://crawlee.dev/) **Cheerio** fast path for static HTML, plus **Playwright** for SPAs and bot-heavy pages—100% Node.js, Windows-friendly.
|
|
7
|
+
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
## Why WebClaw Hybrid Engine?
|
|
11
|
+
|
|
12
|
+
Most scraping stacks force a bad tradeoff:
|
|
13
|
+
- fast but blocked,
|
|
14
|
+
- or robust but expensive and slow.
|
|
15
|
+
|
|
16
|
+
This project uses a **smart hybrid architecture** to get both:
|
|
17
|
+
|
|
18
|
+
1. **Fast Path (Crawlee Cheerio)**
|
|
19
|
+
HTTP fetch + Cheerio parsing for static pages (≈30s timeout per request).
|
|
20
|
+
2. **Stealth Path (Crawlee Playwright)**
|
|
21
|
+
Headless Chromium when HTML is too small, blocked, or SPA-like (Crawlee browser pool + your synced cookies).
|
|
22
|
+
3. **Optional extension bridge**
|
|
23
|
+
Set `WEBCLAW_EXTENSION_WS` for a WebSocket that returns `{ "html": "..." }` if Playwright fails.
|
|
24
|
+
4. **Token-First Output**
|
|
25
|
+
HTML is purified to Markdown (Readability + Turndown for `article` mode; full-body Turndown for `ecommerce`), then measured with `tiktoken`.
|
|
26
|
+
|
|
27
|
+
Bottom line: you stop paying to send useless HTML noise into GPT/Claude.
|
|
28
|
+
|
|
29
|
+
---
|
|
30
|
+
|
|
31
|
+
## Key Features
|
|
32
|
+
|
|
33
|
+
- **Hybrid extractor brain (Crawlee)**
|
|
34
|
+
- Auto-routing between **CheerioCrawler** and **PlaywrightCrawler** (~30s caps per level).
|
|
35
|
+
- **Primary KPI: token reduction**
|
|
36
|
+
- Returns `raw_tokens`, `cleaned_tokens`, `tokens_saved`, `reduction_percentage`.
|
|
37
|
+
- **Markdown purification pipeline**
|
|
38
|
+
- `jsdom` + `@mozilla/readability` + `turndown`.
|
|
39
|
+
- **Anti-bot fallback support**
|
|
40
|
+
- Handles SPA shells, Cloudflare/Captcha-like failures via browser render path.
|
|
41
|
+
- **Cookie sync-ready**
|
|
42
|
+
- Cookie API and `data/cookies.json` support for 1-click sync workflows (including Chrome extension integration).
|
|
43
|
+
- **Zero-Docker Node.js setup**
|
|
44
|
+
- `npm install`, `npm run setup` (`npx playwright install chromium`), `npm start` from repo root.
|
|
45
|
+
- **Built-in dashboard (API monitoring)**
|
|
46
|
+
- UI on `http://localhost:8822`: aggregate stats, paginated scrape history, cookie manager, exclude-URL list, OpenClaw skill installer, bilingual toggle (EN/VI).
|
|
47
|
+
- **SQLite history at scale**
|
|
48
|
+
- Scrape history is persisted in `data/webclaw.db` (no flat `history.json` bottleneck).
|
|
49
|
+
- **User settings (exclude URLs)**
|
|
50
|
+
- `data/settings.json` with `exclude_urls`; blocked URLs return `EXCLUDED_BY_USER` before any crawl (Cheerio/Playwright).
|
|
51
|
+
- **OpenClaw agent integration**
|
|
52
|
+
- One-click auto-install into `~/.openclaw/skills/webclaw_scraper/SKILL.md` (Markdown skill, not a TS tool).
|
|
53
|
+
|
|
54
|
+
---
|
|
55
|
+
|
|
56
|
+
## Architecture
|
|
57
|
+
|
|
58
|
+
Native Node.js hybrid runtime (no Docker required):
|
|
59
|
+
|
|
60
|
+
- `services/gateway/src/orchestrator.js`
|
|
61
|
+
- **CheerioCrawler** → **PlaywrightCrawler** → optional **`extensionFallback.js`** (`WEBCLAW_EXTENSION_WS`)
|
|
62
|
+
- `services/gateway/src/extensionFallback.js`
|
|
63
|
+
- Optional WebSocket bridge for a Chrome extension (`{ "type":"scrape","url" }` / `{ "html" }`)
|
|
64
|
+
- `services/gateway/src/tokenMetrics.js`
|
|
65
|
+
- KPI calculation using `tiktoken`
|
|
66
|
+
- `services/gateway/src/db.js`
|
|
67
|
+
- SQLite initialization + migrations + history/stats queries
|
|
68
|
+
- `services/gateway/src/settings.js`
|
|
69
|
+
- Reads/writes `data/settings.json`; scrape path checks exclude list before orchestration
|
|
70
|
+
- `services/gateway/src/templates/openclaw-skill.md`
|
|
71
|
+
- OpenClaw `SKILL.md` template (curl to the local API, `EXCLUDED_BY_USER` fallback rule)
|
|
72
|
+
|
|
73
|
+
---
|
|
74
|
+
|
|
75
|
+
## Quick Start
|
|
76
|
+
|
|
77
|
+
### Requirements
|
|
78
|
+
|
|
79
|
+
- **Node.js 20+** and **npm**
|
|
80
|
+
- **Git** (recommended) — clone from [github.com/ngoclinh15994/webclaw-gateway](https://github.com/ngoclinh15994/webclaw-gateway)
|
|
81
|
+
|
|
82
|
+
### Install and run (all platforms)
|
|
83
|
+
|
|
84
|
+
**From npm (no clone):**
|
|
85
|
+
|
|
86
|
+
```bash
|
|
87
|
+
npx webclaw-hybrid-engine-ln
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
Wait until the terminal prints **Ready on port 8822**, then open **http://localhost:8822**.
|
|
91
|
+
|
|
92
|
+
**From source:** clone the official repo (review the code on GitHub first if you are security-conscious):
|
|
93
|
+
|
|
94
|
+
```bash
|
|
95
|
+
git clone https://github.com/ngoclinh15994/webclaw-gateway.git
|
|
96
|
+
cd webclaw-gateway
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
From the **repository root**:
|
|
100
|
+
|
|
101
|
+
```bash
|
|
102
|
+
npm install
|
|
103
|
+
npm run setup
|
|
104
|
+
npm start
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
- **`npm run setup`** runs `npx playwright install chromium` (required for the Playwright crawler tier).
|
|
108
|
+
|
|
109
|
+
Dashboard and API: **http://localhost:8822**
|
|
110
|
+
|
|
111
|
+
### Windows (convenience)
|
|
112
|
+
|
|
113
|
+
```bat
|
|
114
|
+
Start_WebClaw.bat
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
Runs `npm install`, `npm run setup`, then `npm start`.
|
|
118
|
+
|
|
119
|
+
### macOS / Linux (convenience)
|
|
120
|
+
|
|
121
|
+
```bash
|
|
122
|
+
chmod +x Start_WebClaw.sh
|
|
123
|
+
./Start_WebClaw.sh
|
|
124
|
+
```
|
|
125
|
+
|
|
126
|
+
---
|
|
127
|
+
|
|
128
|
+
## API Reference
|
|
129
|
+
|
|
130
|
+
### POST `/api/v1/scrape`
|
|
131
|
+
|
|
132
|
+
Endpoint:
|
|
133
|
+
|
|
134
|
+
```text
|
|
135
|
+
http://localhost:8822/api/v1/scrape
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
Request body:
|
|
139
|
+
|
|
140
|
+
```json
|
|
141
|
+
{
|
|
142
|
+
"url": "https://example.com/article",
|
|
143
|
+
"mode": "auto",
|
|
144
|
+
"extract_mode": "article"
|
|
145
|
+
}
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
Optional **`extract_mode`**: `"article"` (default, Readability) or `"ecommerce"` (full body, no Readability). Omit or use `"article"` for most pages.
|
|
149
|
+
|
|
150
|
+
Modes:
|
|
151
|
+
- `auto`: Cheerio first, then Playwright (then optional extension WS if configured)
|
|
152
|
+
- `fast_only`: Cheerio only (errors if HTML is too small / SPA-like)
|
|
153
|
+
- `playwright_only`: Playwright only (then extension WS if Playwright fails)
|
|
154
|
+
|
|
155
|
+
If the URL matches any string in `exclude_urls` (substring match, case-insensitive), the API returns immediately:
|
|
156
|
+
|
|
157
|
+
```json
|
|
158
|
+
{
|
|
159
|
+
"status": "error",
|
|
160
|
+
"message": "EXCLUDED_BY_USER: This URL is blacklisted by user settings. Please use your default browser tool."
|
|
161
|
+
}
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
Example:
|
|
165
|
+
|
|
166
|
+
```bash
|
|
167
|
+
curl -X POST "http://localhost:8822/api/v1/scrape" \
|
|
168
|
+
-H "Content-Type: application/json" \
|
|
169
|
+
-d '{"url":"https://example.com/article","mode":"auto"}'
|
|
170
|
+
```
|
|
171
|
+
|
|
172
|
+
Success response shape:
|
|
173
|
+
|
|
174
|
+
```json
|
|
175
|
+
{
|
|
176
|
+
"status": "success",
|
|
177
|
+
"extract_mode": "article",
|
|
178
|
+
"engine_used": "crawlee_cheerio",
|
|
179
|
+
"engine_label": "Crawlee Hybrid (Cheerio + Playwright)",
|
|
180
|
+
"data": {
|
|
181
|
+
"title": "Page title",
|
|
182
|
+
"markdown": "# Clean markdown"
|
|
183
|
+
},
|
|
184
|
+
"metrics": {
|
|
185
|
+
"raw_tokens": 12000,
|
|
186
|
+
"cleaned_tokens": 900,
|
|
187
|
+
"tokens_saved": 11100,
|
|
188
|
+
"reduction_percentage": 92.5
|
|
189
|
+
}
|
|
190
|
+
}
|
|
191
|
+
```
|
|
192
|
+
|
|
193
|
+
### GET `/api/v1/history`
|
|
194
|
+
|
|
195
|
+
Returns recent history from SQLite, newest first.
|
|
196
|
+
|
|
197
|
+
Query params:
|
|
198
|
+
- `limit` (default `50`, max `200`)
|
|
199
|
+
- `offset` (default `0`)
|
|
200
|
+
|
|
201
|
+
Example:
|
|
202
|
+
|
|
203
|
+
```bash
|
|
204
|
+
curl "http://localhost:8822/api/v1/history?limit=50&offset=0"
|
|
205
|
+
```
|
|
206
|
+
|
|
207
|
+
### GET `/api/v1/stats`
|
|
208
|
+
|
|
209
|
+
Returns aggregate stats from SQLite.
|
|
210
|
+
|
|
211
|
+
Example:
|
|
212
|
+
|
|
213
|
+
```bash
|
|
214
|
+
curl "http://localhost:8822/api/v1/stats"
|
|
215
|
+
```
|
|
216
|
+
|
|
217
|
+
Response:
|
|
218
|
+
|
|
219
|
+
```json
|
|
220
|
+
{
|
|
221
|
+
"status": "success",
|
|
222
|
+
"stats": {
|
|
223
|
+
"total_requests": 1234,
|
|
224
|
+
"total_tokens_saved": 9876543,
|
|
225
|
+
"overall_reduction_percentage": 85.2
|
|
226
|
+
}
|
|
227
|
+
}
|
|
228
|
+
```
|
|
229
|
+
|
|
230
|
+
### GET `/api/v1/settings`
|
|
231
|
+
|
|
232
|
+
Returns user settings (`exclude_urls` list).
|
|
233
|
+
|
|
234
|
+
```bash
|
|
235
|
+
curl "http://localhost:8822/api/v1/settings"
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
Response:
|
|
239
|
+
|
|
240
|
+
```json
|
|
241
|
+
{
|
|
242
|
+
"status": "success",
|
|
243
|
+
"settings": {
|
|
244
|
+
"exclude_urls": ["youtube.com", "localhost"]
|
|
245
|
+
}
|
|
246
|
+
}
|
|
247
|
+
```
|
|
248
|
+
|
|
249
|
+
### POST `/api/v1/settings`
|
|
250
|
+
|
|
251
|
+
Updates settings. Body must include `exclude_urls` (array of strings).
|
|
252
|
+
|
|
253
|
+
```bash
|
|
254
|
+
curl -X POST "http://localhost:8822/api/v1/settings" \
|
|
255
|
+
-H "Content-Type: application/json" \
|
|
256
|
+
-d '{"exclude_urls":["youtube.com"]}'
|
|
257
|
+
```
|
|
258
|
+
|
|
259
|
+
### POST `/api/v1/integrate/openclaw`
|
|
260
|
+
|
|
261
|
+
Automatically installs the OpenClaw skill file into the local user profile.
|
|
262
|
+
|
|
263
|
+
Behavior:
|
|
264
|
+
- detects `~/.openclaw`
|
|
265
|
+
- creates `~/.openclaw/skills/webclaw_scraper/` if needed
|
|
266
|
+
- writes `SKILL.md` from the OpenClaw skill template
|
|
267
|
+
|
|
268
|
+
Success:
|
|
269
|
+
|
|
270
|
+
```json
|
|
271
|
+
{
|
|
272
|
+
"status": "success",
|
|
273
|
+
"message": "WebClaw Skill successfully installed into OpenClaw!"
|
|
274
|
+
}
|
|
275
|
+
```
|
|
276
|
+
|
|
277
|
+
If OpenClaw is not installed (`~/.openclaw` missing), returns `404` with an error message.
|
|
278
|
+
|
|
279
|
+
---
|
|
280
|
+
|
|
281
|
+
## Cookie Manager API
|
|
282
|
+
|
|
283
|
+
### GET `/api/v1/cookies`
|
|
284
|
+
Returns cookie entries from `data/cookies.json`.
|
|
285
|
+
|
|
286
|
+
### POST `/api/v1/cookies`
|
|
287
|
+
Stores cookie entries used for Playwright fallback.
|
|
288
|
+
|
|
289
|
+
Request example:
|
|
290
|
+
|
|
291
|
+
```json
|
|
292
|
+
{
|
|
293
|
+
"cookies": [
|
|
294
|
+
{
|
|
295
|
+
"domain": "portal.example.com",
|
|
296
|
+
"cookie_string": "session=abc123; cf_clearance=xyz"
|
|
297
|
+
}
|
|
298
|
+
]
|
|
299
|
+
}
|
|
300
|
+
```
|
|
301
|
+
|
|
302
|
+
This format is designed for quick automation and browser-extension sync.
|
|
303
|
+
|
|
304
|
+
---
|
|
305
|
+
|
|
306
|
+
## Optional Docker (deprecated)
|
|
307
|
+
|
|
308
|
+
Legacy `Dockerfile` / `docker-compose` samples live under **`.deprecated/docker/`** for reference only. The supported workflow is **native Node** (above).
|
|
309
|
+
|
|
310
|
+
Health check:
|
|
311
|
+
|
|
312
|
+
```text
|
|
313
|
+
http://localhost:8822/health
|
|
314
|
+
```
|
|
315
|
+
|
|
316
|
+
---
|
|
317
|
+
|
|
318
|
+
## For n8n / Automation Builders
|
|
319
|
+
|
|
320
|
+
Use `POST /api/v1/scrape` as a standard HTTP node:
|
|
321
|
+
- Input: URL + mode
|
|
322
|
+
- Output: clean Markdown + token reduction KPI
|
|
323
|
+
- Branch on `engine_used` if you want analytics by path
|
|
324
|
+
|
|
325
|
+
This is optimized for agents and workflows where token waste directly hits your cloud bill.
|
|
326
|
+
|
|
327
|
+
---
|
|
328
|
+
|
|
329
|
+
## OpenClaw Integration
|
|
330
|
+
|
|
331
|
+
This repository ships a ready-to-use OpenClaw **skill** (Markdown `SKILL.md`, installed under `~/.openclaw/skills/webclaw_scraper/`):
|
|
332
|
+
|
|
333
|
+
- Template source: `services/gateway/src/templates/openclaw-skill.md`
|
|
334
|
+
- Skill name: `webclaw-hybrid-engine-ln`
|
|
335
|
+
- Auto-install endpoint: `POST /api/v1/integrate/openclaw`
|
|
336
|
+
- The skill uses `curl` against `http://localhost:8822/api/v1/scrape` and documents fallback when the API returns `EXCLUDED_BY_USER`.
|
|
337
|
+
|
|
338
|
+
Quick flow:
|
|
339
|
+
|
|
340
|
+
1. Clone and run the engine from the [official repository](https://github.com/ngoclinh15994/webclaw-gateway) (see **Quick Start** above).
|
|
341
|
+
2. Start WebClaw Hybrid Engine (`Start_WebClaw.bat` or `Start_WebClaw.sh`)
|
|
342
|
+
3. Click `⚡ Install OpenClaw Skill` in dashboard (or call API)
|
|
343
|
+
4. Restart OpenClaw so it reloads skills
|
|
344
|
+
|
|
345
|
+
Detailed notes:
|
|
346
|
+
- `openclaw-skill/README.md`
|
|
347
|
+
|
|
348
|
+
---
|
|
349
|
+
|
|
350
|
+
## Project Structure
|
|
351
|
+
|
|
352
|
+
```text
|
|
353
|
+
.
|
|
354
|
+
├─ data/
|
|
355
|
+
│ ├─ cookies.json
|
|
356
|
+
│ ├─ settings.json
|
|
357
|
+
│ └─ webclaw.db
|
|
358
|
+
├─ scripts/
|
|
359
|
+
│ └─ setup.js # npx playwright install chromium
|
|
360
|
+
├─ openclaw-skill/
|
|
361
|
+
│ └─ README.md
|
|
362
|
+
├─ services/
|
|
363
|
+
│ └─ gateway/
|
|
364
|
+
│ ├─ src/
|
|
365
|
+
│ │ ├─ app.js
|
|
366
|
+
│ │ ├─ extensionFallback.js
|
|
367
|
+
│ │ ├─ orchestrator.js
|
|
368
|
+
│ │ ├─ db.js
|
|
369
|
+
│ │ ├─ settings.js
|
|
370
|
+
│ │ └─ templates/openclaw-skill.md
|
|
371
|
+
│ ├─ ui/
|
|
372
|
+
│ └─ package.json
|
|
373
|
+
├─ package.json # workspace root (npm start / setup)
|
|
374
|
+
├─ Start_WebClaw.bat
|
|
375
|
+
└─ Start_WebClaw.sh
|
|
376
|
+
```
|
|
377
|
+
|
|
378
|
+
---
|
|
379
|
+
|
|
380
|
+
## Credits
|
|
381
|
+
|
|
382
|
+
- Crawlee: [Apify Crawlee](https://crawlee.dev/)
|
|
383
|
+
- Markdown stack: Mozilla Readability, Turndown, tiktoken
|
|
384
|
+
|
|
385
|
+
WebClaw Hybrid Engine is a **Node.js** orchestration layer around Crawlee and your local Playwright install.
|
package/bin/cli.js
ADDED
|
@@ -0,0 +1,17 @@
|
|
|
1
|
+
#!/usr/bin/env node
|
|
2
|
+
const { execSync } = require("child_process");
|
|
3
|
+
const path = require("path");
|
|
4
|
+
|
|
5
|
+
console.log("🚀 Booting WebClaw Hybrid Engine...");
|
|
6
|
+
|
|
7
|
+
try {
|
|
8
|
+
execSync("npx playwright install chromium", { stdio: "ignore" });
|
|
9
|
+
} catch {
|
|
10
|
+
console.warn(
|
|
11
|
+
"⚠️ Note: Could not auto-install Playwright browser. If scraping fails, manually run: npx playwright install chromium"
|
|
12
|
+
);
|
|
13
|
+
}
|
|
14
|
+
|
|
15
|
+
console.log("✅ Engine dependencies verified. Starting Gateway...");
|
|
16
|
+
|
|
17
|
+
require(path.join(__dirname, "../services/gateway/src/app.js"));
|
package/package.json
ADDED
|
@@ -0,0 +1,45 @@
|
|
|
1
|
+
{
|
|
2
|
+
"name": "webclaw-hybrid-engine-ln",
|
|
3
|
+
"version": "1.0.2",
|
|
4
|
+
"description": "WebClaw Hybrid Engine — privacy-first local bridge for OpenClaw (Crawlee + Playwright).",
|
|
5
|
+
"repository": {
|
|
6
|
+
"type": "git",
|
|
7
|
+
"url": "https://github.com/ngoclinh15994/webclaw-gateway.git"
|
|
8
|
+
},
|
|
9
|
+
"bin": {
|
|
10
|
+
"webclaw-hybrid-engine-ln": "./bin/cli.js"
|
|
11
|
+
},
|
|
12
|
+
"files": [
|
|
13
|
+
"bin",
|
|
14
|
+
"services/gateway/src",
|
|
15
|
+
"services/gateway/ui",
|
|
16
|
+
"services/gateway/package.json",
|
|
17
|
+
"scripts/setup.js"
|
|
18
|
+
],
|
|
19
|
+
"workspaces": [
|
|
20
|
+
"services/gateway"
|
|
21
|
+
],
|
|
22
|
+
"scripts": {
|
|
23
|
+
"setup": "node scripts/setup.js",
|
|
24
|
+
"start": "node services/gateway/src/app.js",
|
|
25
|
+
"dev": "nodemon services/gateway/src/app.js",
|
|
26
|
+
"webclaw": "node bin/cli.js"
|
|
27
|
+
},
|
|
28
|
+
"engines": {
|
|
29
|
+
"node": ">=20.0.0"
|
|
30
|
+
},
|
|
31
|
+
"dependencies": {
|
|
32
|
+
"@mozilla/readability": "^0.6.0",
|
|
33
|
+
"better-sqlite3": "^12.8.0",
|
|
34
|
+
"crawlee": "^3.15.3",
|
|
35
|
+
"express": "^4.21.2",
|
|
36
|
+
"jsdom": "^26.1.0",
|
|
37
|
+
"playwright": "^1.53.2",
|
|
38
|
+
"tiktoken": "^1.0.22",
|
|
39
|
+
"turndown": "^7.2.0",
|
|
40
|
+
"ws": "^8.18.0"
|
|
41
|
+
},
|
|
42
|
+
"devDependencies": {
|
|
43
|
+
"nodemon": "^3.1.9"
|
|
44
|
+
}
|
|
45
|
+
}
|
package/services/gateway/package.json
ADDED
|
@@ -0,0 +1,25 @@
|
|
|
1
|
+
{
|
|
2
|
+
"name": "webclaw-gateway",
|
|
3
|
+
"version": "0.1.0",
|
|
4
|
+
"description": "WebClaw Hybrid Engine service — stealth web scraping and LLM token reduction (Crawlee + Playwright).",
|
|
5
|
+
"main": "src/app.js",
|
|
6
|
+
"type": "commonjs",
|
|
7
|
+
"scripts": {
|
|
8
|
+
"start": "node src/app.js",
|
|
9
|
+
"dev": "node --watch src/app.js"
|
|
10
|
+
},
|
|
11
|
+
"engines": {
|
|
12
|
+
"node": ">=20.0.0"
|
|
13
|
+
},
|
|
14
|
+
"dependencies": {
|
|
15
|
+
"@mozilla/readability": "^0.6.0",
|
|
16
|
+
"better-sqlite3": "^12.8.0",
|
|
17
|
+
"crawlee": "^3.15.3",
|
|
18
|
+
"express": "^4.21.2",
|
|
19
|
+
"jsdom": "^26.1.0",
|
|
20
|
+
"playwright": "^1.53.2",
|
|
21
|
+
"tiktoken": "^1.0.22",
|
|
22
|
+
"turndown": "^7.2.0",
|
|
23
|
+
"ws": "^8.18.0"
|
|
24
|
+
}
|
|
25
|
+
}
|
|
package/services/gateway/src/app.js
ADDED
|
@@ -0,0 +1,199 @@
|
|
|
1
|
+
process.env.CRAWLEE_LOG_LEVEL = process.env.CRAWLEE_LOG_LEVEL || "WARNING";
|
|
2
|
+
|
|
3
|
+
const express = require("express");
|
|
4
|
+
const path = require("path");
|
|
5
|
+
const os = require("os");
|
|
6
|
+
const fs = require("fs/promises");
|
|
7
|
+
const fsSync = require("fs");
|
|
8
|
+
const { PORT } = require("./config");
|
|
9
|
+
const { runOrchestrator } = require("./orchestrator");
|
|
10
|
+
const { ensureCookiesFile, writeCookies, readCookies, normalizeCookiesForStorage } = require("./cookies");
|
|
11
|
+
const { ensureSettingsFile, readSettings, writeSettings, isExcludedUrl } = require("./settings");
|
|
12
|
+
const {
|
|
13
|
+
runMigrations,
|
|
14
|
+
insertScrapeHistory,
|
|
15
|
+
listScrapeHistory,
|
|
16
|
+
countScrapeHistory,
|
|
17
|
+
getAggregateStats
|
|
18
|
+
} = require("./db");
|
|
19
|
+
|
|
20
|
+
const app = express();
|
|
21
|
+
app.use(express.json({ limit: "1mb" }));
|
|
22
|
+
app.use(express.static(path.resolve(__dirname, "../ui")));
|
|
23
|
+
|
|
24
|
+
app.get("/health", (_, res) => {
|
|
25
|
+
res.json({ status: "ok" });
|
|
26
|
+
});
|
|
27
|
+
|
|
28
|
+
app.post("/api/v1/scrape", async (req, res) => {
|
|
29
|
+
try {
|
|
30
|
+
const { url, mode = "auto", extract_mode = "article" } = req.body || {};
|
|
31
|
+
if (!url) {
|
|
32
|
+
return res.status(400).json({ status: "error", message: "url is required" });
|
|
33
|
+
}
|
|
34
|
+
|
|
35
|
+
const settings = await readSettings();
|
|
36
|
+
if (isExcludedUrl(url, settings.exclude_urls)) {
|
|
37
|
+
return res.status(400).json({
|
|
38
|
+
status: "error",
|
|
39
|
+
message: "EXCLUDED_BY_USER: This URL is blacklisted by user settings. Please use your default browser tool."
|
|
40
|
+
});
|
|
41
|
+
}
|
|
42
|
+
|
|
43
|
+
if (!["auto", "fast_only", "playwright_only"].includes(mode)) {
|
|
44
|
+
return res.status(400).json({ status: "error", message: "invalid mode" });
|
|
45
|
+
}
|
|
46
|
+
if (!["article", "ecommerce"].includes(extract_mode)) {
|
|
47
|
+
return res.status(400).json({ status: "error", message: "invalid extract_mode" });
|
|
48
|
+
}
|
|
49
|
+
const result = await runOrchestrator({ url, mode, extract_mode });
|
|
50
|
+
if (result.status === "success" && result.metrics) {
|
|
51
|
+
insertScrapeHistory({
|
|
52
|
+
url,
|
|
53
|
+
engine_used: result.engine_used || "",
|
|
54
|
+
raw_tokens: Number(result.metrics.raw_tokens || 0),
|
|
55
|
+
cleaned_tokens: Number(result.metrics.cleaned_tokens || 0),
|
|
56
|
+
tokens_saved: Number(result.metrics.tokens_saved || 0),
|
|
57
|
+
reduction_percentage: Number(result.metrics.reduction_percentage || 0)
|
|
58
|
+
});
|
|
59
|
+
}
|
|
60
|
+
return res.json(result);
|
|
61
|
+
} catch (error) {
|
|
62
|
+
return res.status(500).json({
|
|
63
|
+
status: "error",
|
|
64
|
+
message: error.message || "Unexpected error"
|
|
65
|
+
});
|
|
66
|
+
}
|
|
67
|
+
});
|
|
68
|
+
|
|
69
|
+
app.get("/api/v1/history", (req, res) => {
|
|
70
|
+
try {
|
|
71
|
+
const requestedLimit = Number(req.query.limit || 50);
|
|
72
|
+
const requestedOffset = Number(req.query.offset || 0);
|
|
73
|
+
const limit = Math.min(Math.max(Number.isFinite(requestedLimit) ? requestedLimit : 50, 1), 200);
|
|
74
|
+
const offset = Math.max(Number.isFinite(requestedOffset) ? requestedOffset : 0, 0);
|
|
75
|
+
|
|
76
|
+
const items = listScrapeHistory(limit, offset);
|
|
77
|
+
const total = countScrapeHistory();
|
|
78
|
+
|
|
79
|
+
return res.json({
|
|
80
|
+
status: "success",
|
|
81
|
+
items,
|
|
82
|
+
pagination: { limit, offset, total }
|
|
83
|
+
});
|
|
84
|
+
} catch (error) {
|
|
85
|
+
return res.status(500).json({ status: "error", message: error.message });
|
|
86
|
+
}
|
|
87
|
+
});
|
|
88
|
+
|
|
89
|
+
app.get("/api/v1/stats", (_, res) => {
|
|
90
|
+
try {
|
|
91
|
+
const stats = getAggregateStats();
|
|
92
|
+
return res.json({ status: "success", stats });
|
|
93
|
+
} catch (error) {
|
|
94
|
+
return res.status(500).json({ status: "error", message: error.message });
|
|
95
|
+
}
|
|
96
|
+
});
|
|
97
|
+
|
|
98
|
+
app.get("/api/v1/integrate/openclaw/status", (_, res) => {
|
|
99
|
+
try {
|
|
100
|
+
const homeDir = os.homedir();
|
|
101
|
+
const openclawRoot = path.join(homeDir, ".openclaw");
|
|
102
|
+
const skillPath = path.join(openclawRoot, "skills", "webclaw_scraper", "SKILL.md");
|
|
103
|
+
const installed = fsSync.existsSync(skillPath);
|
|
104
|
+
const openclawRootExists = fsSync.existsSync(openclawRoot);
|
|
105
|
+
return res.json({ status: "success", installed, openclawRootExists });
|
|
106
|
+
} catch (error) {
|
|
107
|
+
return res.status(500).json({ status: "error", message: error.message });
|
|
108
|
+
}
|
|
109
|
+
});
|
|
110
|
+
|
|
111
|
+
app.post("/api/v1/integrate/openclaw", async (_, res) => {
|
|
112
|
+
try {
|
|
113
|
+
const homeDir = os.homedir();
|
|
114
|
+
const openclawRoot = path.join(homeDir, ".openclaw");
|
|
115
|
+
const targetDirectory = path.join(openclawRoot, "skills", "webclaw_scraper");
|
|
116
|
+
const targetFile = path.join(targetDirectory, "SKILL.md");
|
|
117
|
+
const templatePath = path.resolve(__dirname, "./templates/openclaw-skill.md");
|
|
118
|
+
|
|
119
|
+
try {
|
|
120
|
+
await fs.access(openclawRoot);
|
|
121
|
+
} catch {
|
|
122
|
+
return res.status(404).json({
|
|
123
|
+
status: "error",
|
|
124
|
+
message: "OpenClaw is not installed on this machine."
|
|
125
|
+
});
|
|
126
|
+
}
|
|
127
|
+
|
|
128
|
+
await fs.mkdir(targetDirectory, { recursive: true });
|
|
129
|
+
const templateContent = await fs.readFile(templatePath, "utf8");
|
|
130
|
+
await fs.writeFile(targetFile, templateContent, "utf8");
|
|
131
|
+
|
|
132
|
+
return res.json({
|
|
133
|
+
status: "success",
|
|
134
|
+
message: "WebClaw Skill successfully installed into OpenClaw!"
|
|
135
|
+
});
|
|
136
|
+
} catch (error) {
|
|
137
|
+
return res.status(500).json({ status: "error", message: error.message });
|
|
138
|
+
}
|
|
139
|
+
});
|
|
140
|
+
|
|
141
|
+
app.get("/api/v1/settings", async (_, res) => {
|
|
142
|
+
try {
|
|
143
|
+
const settings = await readSettings();
|
|
144
|
+
return res.json({ status: "success", settings });
|
|
145
|
+
} catch (error) {
|
|
146
|
+
return res.status(500).json({ status: "error", message: error.message });
|
|
147
|
+
}
|
|
148
|
+
});
|
|
149
|
+
|
|
150
|
+
app.post("/api/v1/settings", async (req, res) => {
|
|
151
|
+
try {
|
|
152
|
+
const payload = req.body || {};
|
|
153
|
+
if (!Array.isArray(payload.exclude_urls)) {
|
|
154
|
+
return res.status(400).json({ status: "error", message: "exclude_urls must be an array" });
|
|
155
|
+
}
|
|
156
|
+
const settings = await writeSettings(payload);
|
|
157
|
+
return res.json({ status: "success", settings });
|
|
158
|
+
} catch (error) {
|
|
159
|
+
return res.status(500).json({ status: "error", message: error.message });
|
|
160
|
+
}
|
|
161
|
+
});
|
|
162
|
+
|
|
163
|
+
app.get("/api/v1/cookies", async (_, res) => {
|
|
164
|
+
try {
|
|
165
|
+
const cookies = await readCookies();
|
|
166
|
+
return res.json({ status: "success", cookies });
|
|
167
|
+
} catch (error) {
|
|
168
|
+
return res.status(500).json({ status: "error", message: error.message });
|
|
169
|
+
}
|
|
170
|
+
});
|
|
171
|
+
|
|
172
|
+
app.post("/api/v1/cookies", async (req, res) => {
|
|
173
|
+
try {
|
|
174
|
+
const { cookies } = req.body || {};
|
|
175
|
+
if (!Array.isArray(cookies)) {
|
|
176
|
+
return res.status(400).json({ status: "error", message: "cookies must be an array" });
|
|
177
|
+
}
|
|
178
|
+
const normalized = normalizeCookiesForStorage(cookies);
|
|
179
|
+
await writeCookies(normalized);
|
|
180
|
+
return res.json({ status: "success", count: normalized.length });
|
|
181
|
+
} catch (error) {
|
|
182
|
+
return res.status(500).json({ status: "error", message: error.message });
|
|
183
|
+
}
|
|
184
|
+
});
|
|
185
|
+
|
|
186
|
+
async function start() {
|
|
187
|
+
await ensureCookiesFile();
|
|
188
|
+
await ensureSettingsFile();
|
|
189
|
+
runMigrations();
|
|
190
|
+
app.listen(PORT, () => {
|
|
191
|
+
console.log(`webclaw-hybrid-engine-ln listening on :${PORT}`);
|
|
192
|
+
console.log(`Ready on port ${PORT}`);
|
|
193
|
+
});
|
|
194
|
+
}
|
|
195
|
+
|
|
196
|
+
start().catch((err) => {
|
|
197
|
+
console.error("Failed to start gateway:", err);
|
|
198
|
+
process.exit(1);
|
|
199
|
+
});
|