webpeel 0.12.2 → 0.13.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -7,15 +7,15 @@
7
7
  <p align="center">
8
8
  <a href="https://www.npmjs.com/package/webpeel"><img src="https://img.shields.io/npm/v/webpeel.svg" alt="npm version"></a>
9
9
  <a href="https://pypi.org/project/webpeel/"><img src="https://img.shields.io/pypi/v/webpeel.svg" alt="PyPI version"></a>
10
- <a href="https://www.npmjs.com/package/webpeel"><img src="https://img.shields.io/npm/dm/webpeel.svg" alt="npm downloads"></a>
10
+ <a href="https://www.npmjs.com/package/webpeel"><img src="https://img.shields.io/npm/dm/webpeel.svg" alt="downloads"></a>
11
11
  <a href="https://github.com/webpeel/webpeel/stargazers"><img src="https://img.shields.io/github/stars/webpeel/webpeel.svg" alt="GitHub stars"></a>
12
12
  <a href="https://github.com/webpeel/webpeel/actions/workflows/ci.yml"><img src="https://github.com/webpeel/webpeel/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
13
- <a href="https://www.typescriptlang.org/"><img src="https://img.shields.io/badge/TypeScript-5.6-blue.svg" alt="TypeScript"></a>
14
- <a href="https://www.gnu.org/licenses/agpl-3.0"><img src="https://img.shields.io/badge/License-AGPL%20v3-blue.svg" alt="AGPL v3 License"></a>
13
+ <a href="https://www.gnu.org/licenses/agpl-3.0"><img src="https://img.shields.io/badge/License-AGPL%20v3-blue.svg" alt="AGPL v3"></a>
15
14
  </p>
16
15
 
17
16
  <p align="center">
18
- <b>Turn any web page into AI-ready markdown. Smart escalation. Stealth mode. Free to start.</b>
17
+ <strong>Reliable web access for AI agents.</strong><br>
18
+ Fetch any page · Extract structured data · Crawl entire sites · Deep research — one tool, three interfaces.
19
19
  </p>
20
20
 
21
21
  <p align="center">
@@ -28,345 +28,415 @@
28
28
 
29
29
  ---
30
30
 
31
- ## Quick Start
31
+ ## What is WebPeel?
32
32
 
33
- ```bash
34
- # Zero install — just run it
35
- npx webpeel https://news.ycombinator.com
36
- ```
37
-
38
- ```bash
39
- # Agent mode — JSON + budget + extraction in one flag
40
- npx webpeel https://example.com --agent
41
-
42
- # Site search — no URL knowledge needed (27 supported sites)
43
- npx webpeel search --site ebay "charizard card"
44
- npx webpeel search --site amazon "laptop stand" --table
33
+ WebPeel gives your AI agent reliable access to the web. Fetch any page, extract structured data, crawl entire sites, and research topics — all through a single CLI, API, or MCP server.
45
34
 
46
- # CSS schema extraction auto-detected by domain
47
- npx webpeel https://www.amazon.com/s?k=keyboard --json
48
- npx webpeel https://www.booking.com/searchresults.html --schema booking --json
49
- npx webpeel --list-schemas
50
-
51
- # LLM extraction — structured data from any page (BYOK)
52
- npx webpeel https://example.com/product --llm-extract "title, price, rating" --json
53
- npx webpeel https://hn.algolia.com --llm-extract "top 5 posts with scores" --llm-key $OPENAI_API_KEY
54
-
55
- # Hotel search — multi-source parallel search
56
- npx webpeel hotels "Paris" --checkin 2026-03-01 --checkout 2026-03-05 --sort price
57
- npx webpeel hotels "New York" --checkin 2026-04-10 --json
35
+ It automatically handles the hard parts: JavaScript rendering, bot detection, Cloudflare challenges, infinite scroll, pagination, and content noise. Your agent gets clean markdown. You don't think about the plumbing.
58
36
 
59
- # Browser profiles — persistent sessions across requests
60
- npx webpeel profile create myprofile
61
- npx webpeel https://protected-site.com --profile myprofile --stealth
62
- npx webpeel profile list
37
+ ---
63
38
 
64
- # Stealth mode (auto-detects & bypasses bot protection)
65
- npx webpeel https://protected-site.com --stealth
39
+ ## 🚀 Quick Start
66
40
 
67
- # Crawl a website
68
- npx webpeel crawl https://example.com --max-pages 20
41
+ **Three paths in, all free to try:**
69
42
 
70
- # Search the web
71
- npx webpeel search "best AI frameworks 2026"
43
+ ### CLI
72
44
 
73
- # Extract product listings automatically
74
- npx webpeel https://store.com/search --extract-all --json
45
+ ```bash
46
+ npx webpeel "https://news.ycombinator.com"
75
47
  ```
76
48
 
77
- First 25 fetches work instantly, no signup. After that, [sign up free](https://app.webpeel.dev/signup) for 125/week.
78
-
79
- ## Why WebPeel?
80
-
81
- | Feature | **WebPeel** | Firecrawl | Jina Reader | MCP Fetch |
82
- |---------|:-----------:|:---------:|:-----------:|:---------:|
83
- | **Free tier** | ✅ 125/wk recurring | 500 one-time | ❌ Cloud only | ✅ Unlimited |
84
- | **Smart escalation** | ✅ HTTP→Browser→Stealth | Manual | ❌ | ❌ |
85
- | **Challenge detection** | ✅ 7 vendors auto-detected | ❌ | ❌ | ❌ |
86
- | **Site search** | ✅ 27 sites built-in | ❌ | ❌ | ❌ |
87
- | **Stealth mode** | ✅ v2, all plans | ✅ | ⚠️ Limited | ❌ |
88
- | **Browser profiles** | ✅ Persistent sessions | ❌ | ❌ | ❌ |
89
- | **Hotel search** | ✅ Multi-source parallel | ❌ | ❌ | ❌ |
90
- | **CSS schema extraction** | ✅ 7 bundled + auto-detect | ❌ | ❌ | ❌ |
91
- | **LLM extraction** | ✅ BYOK, cost tracking | ⚠️ Cloud only | ❌ | ❌ |
92
- | **Firecrawl-compatible** | ✅ Drop-in replacement | ✅ Native | ❌ | ❌ |
93
- | **Self-hosting** | ✅ Docker compose | ⚠️ Complex | ❌ | N/A |
94
- | **Autonomous agent** | ✅ BYOK any LLM | ⚠️ Locked | ❌ | ❌ |
95
- | **Deep research** | ✅ Multi-source + BM25 | ⚠️ Cloud only | ❌ | ❌ |
96
- | **Content pruning** | ✅ 2-pass, 15-33% savings | ❌ | ❌ | ❌ |
97
- | **BM25 filtering** | ✅ Query-focused | ❌ | ❌ | ❌ |
98
- | **Python SDK** | ✅ `pip install` | ✅ | ❌ | ❌ |
99
- | **MCP tools** | ✅ 13 tools | ~6 | 0 | 1 |
100
- | **License** | ✅ AGPL-3.0 | AGPL-3.0 | Proprietary | MIT |
101
- | **Pricing** | **Free / $9 / $29** | $0 / $16 / $83 | Custom | Free |
102
-
103
- ## Benchmarks
104
-
105
- Evaluated on 30 real-world URLs across 6 categories (static, dynamic, SPA, protected, documents, international) against 6 competing web fetching APIs.
106
-
107
- | Metric | WebPeel | Next best |
108
- |--------|:-------:|:---------:|
109
- | **Success rate** | **100%** (30/30) | 93.3% (Firecrawl, Exa, LinkUp) |
110
- | **Content quality** | **92.3%** | 83.2% (Exa) |
111
-
112
- WebPeel is the only tool that successfully extracted content from all 30 test URLs. Full methodology and per-category breakdown: [webpeel.dev/blog/benchmarks](https://webpeel.dev/blog/benchmarks)
113
-
114
- ## Install
49
+ No install needed. First 25 fetches work without signup. [Get 500/week free](https://app.webpeel.dev/signup)
115
50
 
116
- ```bash
117
- # Node.js
118
- npm install webpeel # or: pnpm add webpeel
51
+ ### MCP Server (for Claude, Cursor, VS Code, Windsurf)
119
52
 
120
- # Python
121
- pip install webpeel
53
+ Add to your MCP config (`~/Library/Application Support/Claude/claude_desktop_config.json` on Mac):
122
54
 
123
- # Global CLI
124
- npm install -g webpeel
55
+ ```json
56
+ {
57
+ "mcpServers": {
58
+ "webpeel": {
59
+ "command": "npx",
60
+ "args": ["-y", "webpeel", "mcp"]
61
+ }
62
+ }
63
+ }
125
64
  ```
126
65
 
127
- ## Usage
128
-
129
- ### Node.js
66
+ [![Install in Claude Desktop](https://img.shields.io/badge/Install-Claude%20Desktop-5B3FFF?style=for-the-badge&logo=anthropic)](https://mcp.so/install/webpeel?for=claude)
67
+ [![Install in VS Code](https://img.shields.io/badge/Install-VS%20Code-007ACC?style=for-the-badge&logo=visualstudiocode)](https://mcp.so/install/webpeel?for=vscode)
130
68
 
131
- ```typescript
132
- import { peel } from 'webpeel';
69
+ ### REST API
133
70
 
134
- const result = await peel('https://example.com');
135
- console.log(result.content); // Clean markdown
136
- console.log(result.metadata); // { title, description, author, ... }
137
- console.log(result.tokens); // Estimated token count
138
-
139
- // With options
140
- const advanced = await peel('https://example.com', {
141
- render: true, // Browser for JS-heavy sites
142
- stealth: true, // Anti-bot stealth mode
143
- maxTokens: 4000, // Limit output
144
- includeTags: ['main'], // Filter HTML tags
145
- });
71
+ ```bash
72
+ curl "https://api.webpeel.dev/v1/fetch?url=https://example.com" \
73
+ -H "Authorization: Bearer wp_YOUR_KEY"
146
74
  ```
147
75
 
148
- ### Python
149
-
150
- ```python
151
- from webpeel import WebPeel
76
+ ---
152
77
 
153
- client = WebPeel() # Free tier, no key needed
78
+ ## Features
79
+
80
+ ### Core
81
+
82
+ | Feature | Description |
83
+ |---------|-------------|
84
+ | **Web Fetching** | Any URL → clean markdown, text, HTML, or JSON |
85
+ | **Smart Escalation** | Auto-upgrades: HTTP → Browser → Stealth. Uses the fastest method, escalates only when needed |
86
+ | **Content Pruning** | 2-pass HTML reduction — strips nav/footer/sidebar/ads automatically |
87
+ | **Token Budget** | Hard-cap output to N tokens. No surprises in your LLM bill |
88
+ | **Screenshots** | Full-page or viewport screenshots with a single flag |
89
+ | **Batch Mode** | Process multiple URLs concurrently |
90
+
91
+ ### AI Agent
92
+
93
+ | Feature | Description |
94
+ |---------|-------------|
95
+ | **MCP Server** | 12 tools for Claude Desktop, Cursor, VS Code, and Windsurf |
96
+ | **Deep Research** | Multi-hop agent: search → fetch → analyze → follow leads → synthesize |
97
+ | **Search** | Web search across 27+ structured sources |
98
+ | **Hotel Search** | Kayak, Booking.com, Google Travel, Expedia — in parallel |
99
+ | **Browser Profiles** | Persistent sessions for sites that require login |
100
+ | **Infinite Scroll** | Auto-scrolls lazy-loaded feeds until stable |
101
+ | **Actions** | Click, type, fill, select, hover, press, scroll — full browser automation |
102
+
103
+ ### Extraction
104
+
105
+ | Feature | Description |
106
+ |---------|-------------|
107
+ | **CSS Schema Extraction** | 7 built-in schemas (Amazon, Booking.com, eBay, Expedia, Hacker News, Walmart, Yelp) — auto-detected by domain |
108
+ | **JSON Schema Extraction** | Pass any JSON Schema and get back typed, structured data |
109
+ | **LLM Extraction (BYOK)** | Natural language → structured data using your own OpenAI-compatible key |
110
+ | **BM25 Filtering** | Query-focused content: only the parts relevant to your question |
111
+ | **Links / Images / Meta** | Extract just the links, images, or metadata from any page |
112
+
113
+ ### Anti-Bot
114
+
115
+ | Feature | Description |
116
+ |---------|-------------|
117
+ | **Stealth Mode** | Bypasses Cloudflare, PerimeterX, DataDome, Akamai, and more |
118
+ | **28 Auto-Stealth Domains** | Amazon, LinkedIn, Glassdoor, Zillow, and 24 more — stealth kicks in automatically |
119
+ | **Challenge Detection** | 7 bot-protection vendors detected and handled automatically |
120
+ | **Browser Fingerprinting** | Masks WebGL, navigator properties, canvas fingerprint |
121
+
122
+ ### Advanced
123
+
124
+ | Feature | Description |
125
+ |---------|-------------|
126
+ | **Crawl + Sitemap** | BFS/DFS crawling, sitemap discovery, robots.txt compliance, deduplication |
127
+ | **Site Map** | Map all URLs on a domain up to any depth |
128
+ | **Pagination** | Follow "Next" links automatically for N pages |
129
+ | **Chunking** | Split long content into LLM-sized pieces (fixed, semantic, or paragraph) |
130
+ | **Caching** | Local result cache with configurable TTL (`5m`, `1h`, `1d`) |
131
+ | **Geo-targeting** | ISO country code + language preferences per request |
132
+ | **Change Tracking** | Detect what changed between two fetches of the same page |
133
+ | **Brand Extraction** | Pull logo, colors, fonts, and social links from any site |
134
+ | **PDF Extraction** | Extract text from PDF documents |
135
+ | **Self-Hostable** | Docker Compose for full on-premise deployment |
136
+ | **Python SDK** | Sync + async client, `pip install webpeel` |
154
137
 
155
- result = client.scrape("https://example.com")
156
- print(result.content) # Clean markdown
138
+ ---
157
139
 
158
- results = client.search("python web scraping")
159
- job = client.crawl("https://docs.example.com", limit=100)
140
+ ## 🤖 MCP Integration
141
+
142
+ WebPeel exposes **13 tools** to your AI coding assistant:
143
+
144
+ | Tool | What it does |
145
+ |------|--------------|
146
+ | `webpeel_fetch` | Fetch any URL → markdown (smart escalation built in) |
147
+ | `webpeel_search` | Web search with structured results |
148
+ | `webpeel_batch` | Fetch multiple URLs concurrently |
149
+ | `webpeel_crawl` | Crawl a site with depth/page limits |
150
+ | `webpeel_map` | Discover all URLs on a domain |
151
+ | `webpeel_extract` | Structured extraction (CSS, JSON Schema, or LLM) |
152
+ | `webpeel_screenshot` | Screenshot any page |
153
+ | `webpeel_research` | Deep multi-hop research on a topic |
154
+ | `webpeel_summarize` | AI summary of any URL |
155
+ | `webpeel_answer` | Ask a question about a URL's content |
156
+ | `webpeel_change_track` | Detect changes between two fetches |
157
+ | `webpeel_brand` | Extract branding assets from a site |
158
+
159
+ <details>
160
+ <summary>Setup for each editor</summary>
161
+
162
+ **Claude Desktop** (`~/Library/Application Support/Claude/claude_desktop_config.json`):
163
+ ```json
164
+ {
165
+ "mcpServers": {
166
+ "webpeel": { "command": "npx", "args": ["-y", "webpeel", "mcp"] }
167
+ }
168
+ }
160
169
  ```
161
170
 
162
- Zero dependencies. Pure Python 3.8+. [Full SDK docs →](python-sdk/README.md)
163
-
164
- ### MCP Server
165
-
166
- 11 tools for Claude Desktop, Cursor, VS Code, and Windsurf:
171
+ **Cursor** (Settings MCP Servers):
172
+ ```json
173
+ {
174
+ "mcpServers": {
175
+ "webpeel": { "command": "npx", "args": ["-y", "webpeel", "mcp"] }
176
+ }
177
+ }
178
+ ```
167
179
 
168
- `webpeel_fetch` · `webpeel_search` · `webpeel_crawl` · `webpeel_map` · `webpeel_extract` · `webpeel_batch` · `webpeel_brand` · `webpeel_change_track` · `webpeel_summarize` · `webpeel_answer` · `webpeel_screenshot`
180
+ **VS Code** (`~/.vscode/mcp.json`):
181
+ ```json
182
+ {
183
+ "servers": {
184
+ "webpeel": { "command": "npx", "args": ["-y", "webpeel", "mcp"] }
185
+ }
186
+ }
187
+ ```
169
188
 
189
+ **Windsurf** (`~/.codeium/windsurf/mcp_config.json`):
170
190
  ```json
171
191
  {
172
192
  "mcpServers": {
173
- "webpeel": {
174
- "command": "npx",
175
- "args": ["-y", "webpeel", "mcp"]
176
- }
193
+ "webpeel": { "command": "npx", "args": ["-y", "webpeel", "mcp"] }
177
194
  }
178
195
  }
179
196
  ```
180
197
 
181
- [![Install in Claude Desktop](https://img.shields.io/badge/Install-Claude%20Desktop-5B3FFF?style=for-the-badge&logo=anthropic)](https://mcp.so/install/webpeel?for=claude)
182
- [![Install in VS Code](https://img.shields.io/badge/Install-VS%20Code-007ACC?style=for-the-badge&logo=visualstudiocode)](https://mcp.so/install/webpeel?for=vscode)
198
+ **Docker (stdio)**:
199
+ ```json
200
+ {
201
+ "mcpServers": {
202
+ "webpeel": { "command": "docker", "args": ["run", "-i", "--rm", "webpeel/mcp"] }
203
+ }
204
+ }
205
+ ```
206
+ </details>
183
207
 
184
- > **Where to add this config:** Claude Desktop → `~/Library/Application Support/Claude/claude_desktop_config.json` · Cursor → Settings → MCP Servers · VS Code → `~/.vscode/mcp.json` · Windsurf → `~/.codeium/windsurf/mcp_config.json`
208
+ ---
185
209
 
186
- ### Docker
210
+ ## 🔬 Deep Research
187
211
 
188
- **MCP Server (stdio for Claude Desktop, Cursor, Windsurf):**
212
+ Multi-hop research that thinks like a researcher, not a search engine:
189
213
 
190
214
  ```bash
191
- docker run -i webpeel/mcp
215
+ # Sources only — no API key needed
216
+ npx webpeel research "best practices for rate limiting APIs" --max-sources 8
217
+
218
+ # Full synthesis with LLM (BYOK)
219
+ npx webpeel research "compare Firecrawl vs Crawl4AI vs WebPeel" --llm-key sk-...
192
220
  ```
193
221
 
194
- **MCP Server (HTTP Streamable transport):**
222
+ **How it works:** Search → fetch top results → extract key passages (BM25) follow the most relevant links → synthesize across sources. No circular references, no duplicate content.
223
+
224
+ ---
225
+
226
+ ## 📦 Extraction
227
+
228
+ Three ways to get structured data out of any page:
229
+
230
+ ### CSS Schema (zero config, auto-detected)
195
231
 
196
232
  ```bash
197
- docker run -e MCP_HTTP_MODE=true -p 3100:3100 webpeel/mcp
233
+ # Auto-detects Amazon and applies the built-in schema
234
+ npx webpeel "https://www.amazon.com/s?k=mechanical+keyboard" --json
235
+
236
+ # Force a specific schema
237
+ npx webpeel "https://www.booking.com/searchresults.html?city=Paris" --schema booking --json
238
+
239
+ # List all built-in schemas
240
+ npx webpeel --list-schemas
198
241
  ```
199
242
 
200
- **API Server (Firecrawl-compatible REST API):**
243
+ Built-in schemas: `amazon` · `booking` · `ebay` · `expedia` · `hackernews` · `walmart` · `yelp`
244
+
245
+ ### JSON Schema (type-safe structured extraction)
201
246
 
202
247
  ```bash
203
- docker run -p 3000:3000 webpeel/api
248
+ npx webpeel "https://example.com/product" \
249
+ --extract-schema '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}' \
250
+ --llm-key sk-...
204
251
  ```
205
252
 
206
- **Self-Hosted (full stack with database):**
253
+ ### LLM Extraction (natural language, BYOK)
207
254
 
208
255
  ```bash
209
- git clone https://github.com/webpeel/webpeel.git
210
- cd webpeel && docker compose up
256
+ npx webpeel "https://hn.algolia.com" \
257
+ --llm-extract "top 10 posts with title, score, and comment count" \
258
+ --llm-key $OPENAI_API_KEY \
259
+ --json
211
260
  ```
212
261
 
213
- Full API at `http://localhost:3000`. AGPL-3.0 licensed. [Commercial licensing available](mailto:support@webpeel.dev).
262
+ <details>
263
+ <summary>Node.js extraction example</summary>
214
264
 
215
- **MCP config for Docker:**
265
+ ```typescript
266
+ import { peel } from 'webpeel';
216
267
 
217
- ```json
218
- {
219
- "mcpServers": {
220
- "webpeel": {
221
- "command": "docker",
222
- "args": ["run", "-i", "--rm", "webpeel/mcp"]
268
+ // CSS selector extraction
269
+ const result = await peel('https://news.ycombinator.com', {
270
+ extract: {
271
+ selectors: {
272
+ titles: '.titleline > a',
273
+ scores: '.score',
223
274
  }
224
275
  }
225
- }
226
- ```
276
+ });
277
+ console.log(result.extracted); // { titles: [...], scores: [...] }
227
278
 
228
- ## Features
279
+ // LLM extraction with JSON Schema
280
+ const product = await peel('https://example.com/product', {
281
+ llmExtract: 'title, price, rating, availability',
282
+ llmKey: process.env.OPENAI_API_KEY,
283
+ });
284
+ ```
285
+ </details>
229
286
 
230
- ### 🎯 Smart Escalation
287
+ ---
231
288
 
232
- Automatically uses the fastest method, escalates only when needed:
289
+ ## 🛡️ Stealth & Anti-Bot
233
290
 
234
- ```
235
- HTTP Fetch (200ms) → Browser Rendering (2s) → Stealth Mode (5s)
236
- 80% of sites 15% of sites 5% of sites
237
- ```
291
+ WebPeel detects 7 bot-protection vendors and handles them automatically:
238
292
 
239
- ### 🎭 Stealth Mode
293
+ - **Cloudflare** (JS challenge, Turnstile, Bot Management)
294
+ - **PerimeterX / HUMAN** (behavioral analysis)
295
+ - **DataDome** (ML-based bot detection)
296
+ - **Akamai Bot Manager**
297
+ - **Distil Networks**
298
+ - **reCAPTCHA / hCaptcha** (page-level detection)
299
+ - **Generic challenge pages**
240
300
 
241
- Bypass Cloudflare and bot detection. Masks browser fingerprints, navigator properties, WebGL vendor.
301
+ 28 high-protection domains (Amazon, LinkedIn, Glassdoor, Zillow, Ticketmaster, and more) automatically route through stealth mode — no flags needed.
242
302
 
243
303
  ```bash
244
- npx webpeel https://protected-site.com --stealth
304
+ # Explicitly enable stealth
305
+ npx webpeel "https://glassdoor.com/jobs" --stealth
306
+
307
+ # Auto-escalation (stealth triggers automatically on challenge detection)
308
+ npx webpeel "https://amazon.com/dp/ASIN"
245
309
  ```
246
310
 
247
- ### 🕷️ Crawl & Map
311
+ ---
248
312
 
249
- Crawl websites with link following, sitemap discovery, robots.txt compliance, and deduplication.
313
+ ## Benchmark
250
314
 
251
- ```bash
252
- npx webpeel crawl https://docs.example.com --max-pages 100
253
- npx webpeel map https://example.com --max-urls 5000
254
- ```
315
+ Evaluated on 30 real-world URLs across 6 categories (static, dynamic, SPA, protected, documents, international):
255
316
 
256
- ### 🔬 Deep Research
317
+ | | WebPeel | Next best |
318
+ |---|:---:|:---:|
319
+ | **Success rate** | **100%** (30/30) | 93.3% |
320
+ | **Content quality** | **92.3%** | 83.2% |
257
321
 
258
- Multi-source research with BM25 relevance ranking. No API key needed for sources mode.
322
+ WebPeel is the only tool that extracted content from all 30 test URLs. [Full methodology →](https://webpeel.dev/blog/benchmarks)
259
323
 
260
- ```bash
261
- # Get ranked sources with relevance scores
262
- npx webpeel research "best web scraping tools 2025" --max-sources 5
324
+ ---
263
325
 
264
- # Full synthesis with LLM (BYOK)
265
- npx webpeel research "compare Firecrawl vs Crawl4AI" --llm-key sk-...
266
- ```
326
+ ## 🆚 Comparison
327
+
328
+ | Feature | **WebPeel** | Firecrawl | Jina Reader | ScrapingBee | Tavily |
329
+ |---------|:-----------:|:---------:|:-----------:|:-----------:|:------:|
330
+ | **Free tier** | ✅ 500/wk recurring | ⚠️ 500 one-time | ❌ | ❌ | ⚠️ 1,000 one-time |
331
+ | **Smart escalation** | ✅ auto HTTP→browser→stealth | ❌ manual | ❌ | ❌ | ❌ |
332
+ | **Stealth mode** | ✅ all plans | ✅ | ❌ | ✅ paid | ❌ |
333
+ | **Challenge detection** | ✅ 7 vendors | ❌ | ❌ | ❌ | ❌ |
334
+ | **MCP tools** | ✅ 12 tools | ⚠️ ~6 | ❌ | ❌ | ✅ |
335
+ | **Deep research** | ✅ multi-hop + BM25 | ⚠️ cloud only | ❌ | ❌ | ✅ |
336
+ | **CSS schema extraction** | ✅ 7 bundled | ❌ | ❌ | ❌ | ❌ |
337
+ | **LLM extraction (BYOK)** | ✅ | ⚠️ cloud only | ❌ | ❌ | ❌ |
338
+ | **Site search (27+ sites)** | ✅ | ❌ | ❌ | ❌ | ⚠️ web only |
339
+ | **Hotel search** | ✅ 4 sources parallel | ❌ | ❌ | ❌ | ❌ |
340
+ | **Browser profiles** | ✅ persistent sessions | ❌ | ❌ | ❌ | ❌ |
341
+ | **Self-hosting** | ✅ Docker Compose | ⚠️ complex | ❌ | ❌ | ❌ |
342
+ | **Python SDK** | ✅ `pip install` | ✅ | ❌ | ✅ | ✅ |
343
+ | **Firecrawl-compatible API** | ✅ drop-in | ✅ native | ❌ | ❌ | ❌ |
344
+ | **License** | AGPL-3.0 | AGPL-3.0 | Proprietary | Proprietary | Proprietary |
345
+ | **Price** | **$0 / $9 / $29** | $0 / $16 / $83 | custom | $49 / $149 | $0 / $99 |
346
+
347
+ ---
348
+
349
+ ## 💳 Pricing
350
+
351
+ | Plan | Price | Weekly Fetches | Burst |
352
+ |------|------:|:--------------:|:-----:|
353
+ | **Free** | $0/mo | 500/wk | 50/hr |
354
+ | **Pro** | $9/mo | 1,250/wk | 100/hr |
355
+ | **Max** | $29/mo | 6,250/wk | 500/hr |
356
+
357
+ All features on all plans. Pro/Max add pay-as-you-go extra usage (fetch $0.002, search $0.001, stealth $0.01). Quota resets every Monday.
358
+
359
+ [Sign up free →](https://app.webpeel.dev/signup) · [Compare with Firecrawl →](https://webpeel.dev/migrate-from-firecrawl)
267
360
 
268
- ### 🧹 Token Efficiency
361
+ ---
269
362
 
270
- Save 15-77% on AI tokens automatically.
363
+ ## 🐍 Python SDK
271
364
 
272
365
  ```bash
273
- # Content pruning (default ON — strips nav/footer/sidebar)
274
- npx webpeel https://en.wikipedia.org/wiki/Web_scraping
366
+ pip install webpeel
367
+ ```
368
+
369
+ ```python
370
+ from webpeel import WebPeel
275
371
 
276
- # Query-focused filtering (BM25)
277
- npx webpeel https://en.wikipedia.org/wiki/Web_scraping --focus "legal issues"
372
+ client = WebPeel(api_key="wp_...") # or use WEBPEEL_API_KEY env var
278
373
 
279
- # Token budget (hard cap)
280
- npx webpeel https://en.wikipedia.org/wiki/Web_scraping --budget 3000
374
+ # Fetch a page
375
+ result = client.scrape("https://example.com")
376
+ print(result.content) # Clean markdown
377
+ print(result.metadata) # title, description, author, ...
281
378
 
282
- # Combined: prune → focus → budget = 77% savings
283
- npx webpeel https://en.wikipedia.org/wiki/Web_scraping --focus "legal" --budget 3000
379
+ # Search the web
380
+ results = client.search("latest AI research papers")
381
+
382
+ # Crawl a site
383
+ job = client.crawl("https://docs.example.com", limit=100)
384
+
385
+ # With browser + stealth
386
+ result = client.scrape("https://protected-site.com", render=True, stealth=True)
284
387
  ```
285
388
 
286
- ### 🤖 Autonomous Agent (BYOK)
389
+ Sync and async clients. Pure Python 3.8+, zero dependencies. [Full SDK docs →](python-sdk/README.md)
390
+
391
+ ---
287
392
 
288
- Give it a prompt, it researches the web using your own LLM key.
393
+ ## 🐳 Self-Hosting
289
394
 
290
395
  ```bash
291
- npx webpeel agent "Compare pricing of Notion vs Coda" --llm-key sk-...
396
+ git clone https://github.com/webpeel/webpeel.git
397
+ cd webpeel && docker compose up
292
398
  ```
293
399
 
294
- ### 📊 More Features
295
-
296
- | Feature | CLI | Node.js | Python | API |
297
- |---------|:---:|:-------:|:------:|:---:|
298
- | Web scraping | ✅ | ✅ | ✅ | ✅ |
299
- | Deep research | ✅ | ✅ | ✅ | ✅ |
300
- | Content pruning | ✅ | ✅ | ✅ | ✅ |
301
- | BM25 query filtering | ✅ | ✅ | — | ✅ |
302
- | Structured extraction | ✅ | ✅ | ✅ | ✅ |
303
- | CSS schema extraction | ✅ | ✅ | — | ✅ |
304
- | LLM extraction (BYOK) | ✅ | ✅ | ✅ | ✅ |
305
- | Page actions | ✅ | ✅ | ✅ | ✅ |
306
- | Browser profiles | ✅ | ✅ | — | — |
307
- | Screenshots | ✅ | ✅ | ✅ | ✅ |
308
- | Crawling | ✅ | ✅ | ✅ | ✅ |
309
- | Batch fetching | ✅ | ✅ | ✅ | ✅ |
310
- | Hotel search | ✅ | — | — | — |
311
- | Token budget | ✅ | ✅ | ✅ | ✅ |
312
- | Smart chunking | ✅ | ✅ | — | — |
313
- | Branding extraction | ✅ | ✅ | — | — |
314
- | Change tracking | ✅ | ✅ | — | — |
315
- | AI summarization | ✅ | ✅ | — | ✅ |
316
- | Batch processing | — | ✅ | — | ✅ |
317
- | PDF extraction | ✅ | ✅ | — | — |
318
-
319
- ## Integrations
320
-
321
- Works with **CrewAI**, **Dify**, and **n8n** via the Firecrawl-compatible API. LangChain & LlamaIndex integrations coming soon. [Integration docs →](https://webpeel.dev/docs)
322
-
323
- ## Hosted API
324
-
325
- Live at [`api.webpeel.dev`](https://api.webpeel.dev) — Firecrawl-compatible endpoints.
400
+ Full REST API available at `http://localhost:3000`. AGPL-3.0 licensed. [Self-hosting guide →](SELF_HOST.md)
326
401
 
402
+ **Just the MCP server:**
327
403
  ```bash
328
- # Fetch a page (free, no auth needed for first 25)
329
- curl "https://api.webpeel.dev/v1/fetch?url=https://example.com"
404
+ docker run -i webpeel/mcp
405
+ ```
330
406
 
331
- # With API key
332
- curl "https://api.webpeel.dev/v1/fetch?url=https://example.com" \
333
- -H "Authorization: Bearer wp_..."
407
+ **Just the API server:**
408
+ ```bash
409
+ docker run -p 3000:3000 webpeel/api
334
410
  ```
335
411
 
336
- ### Pricing
412
+ ---
337
413
 
338
- | Plan | Price | Weekly Fetches | Burst | Extra Usage |
339
- |------|------:|---------------:|:-----:|:-----------:|
340
- | **Free** | $0 | 125/wk | 25/hr | — |
341
- | **Pro** | $9/mo | 1,250/wk | 100/hr | ✅ from $0.001 |
342
- | **Max** | $29/mo | 6,250/wk | 500/hr | ✅ from $0.001 |
414
+ ## 📖 API Reference
343
415
 
344
- Extra credit costs: fetch $0.002, search $0.001, stealth $0.01. Resets every Monday. All features on all plans. [Compare with Firecrawl →](https://webpeel.dev/migrate-from-firecrawl)
416
+ Full OpenAPI spec at [`openapi.yaml`](openapi.yaml) and [`api.webpeel.dev`](https://api.webpeel.dev).
345
417
 
346
- ## Project Structure
418
+ ```bash
419
+ # Fetch
420
+ GET /v1/fetch?url=<url>
347
421
 
422
+ # Search
423
+ GET /v1/search?q=<query>
424
+
425
+ # Crawl
426
+ POST /v1/crawl { "url": "...", "limit": 100 }
427
+
428
+ # Map
429
+ GET /v1/map?url=<url>
430
+
431
+ # Extract
432
+ POST /v1/extract { "url": "...", "schema": { ... } }
348
433
  ```
349
- webpeel/
350
- ├── src/
351
- │ ├── core/ # Core library (fetcher, strategies, markdown, crawl, search)
352
- │ ├── mcp/ # MCP server (11 tools for AI assistants)
353
- │ ├── server/ # Express API server (hosted version)
354
- │ │ ├── routes/ # API route handlers
355
- │ │ ├── middleware/ # Auth, rate limiting, SSRF protection
356
- │ │ └── premium/ # Server-only premium features
357
- │ ├── tests/ # Vitest test suites
358
- │ ├── cli.ts # CLI entry point
359
- │ ├── index.ts # Library exports
360
- │ └── types.ts # TypeScript type definitions
361
- ├── python-sdk/ # Python SDK (PyPI: webpeel)
362
- ├── integrations/ # LangChain, LlamaIndex, CrewAI, Dify, n8n
363
- ├── site/ # Landing page (webpeel.dev)
364
- ├── dashboard/ # Next.js dashboard (app.webpeel.dev)
365
- ├── benchmarks/ # Performance comparison suite
366
- └── skills/ # AI agent skills (Claude Code, etc.)
367
- ```
368
434
 
369
- ## Development
435
+ [Full API reference →](https://webpeel.dev/docs/api-reference)
436
+
437
+ ---
438
+
439
+ ## 🤝 Contributing
370
440
 
371
441
  ```bash
372
442
  git clone https://github.com/webpeel/webpeel.git
@@ -375,11 +445,13 @@ npm install && npm run build
375
445
  npm test
376
446
  ```
377
447
 
378
- See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
448
+ - **Bug reports:** [Open an issue](https://github.com/webpeel/webpeel/issues)
449
+ - **Feature requests:** [Start a discussion](https://github.com/webpeel/webpeel/discussions)
450
+ - **Code:** See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines
379
451
 
380
- ## Links
452
+ The project has a comprehensive test suite. Please add tests for new features.
381
453
 
382
- [Documentation](https://webpeel.dev/docs) · [Playground](https://webpeel.dev/playground) · [API Reference](https://webpeel.dev/docs/api-reference) · [npm](https://www.npmjs.com/package/webpeel) · [PyPI](https://pypi.org/project/webpeel/) · [Migration Guide](https://webpeel.dev/migrate-from-firecrawl) · [Blog](https://webpeel.dev/blog) · [Discussions](https://github.com/webpeel/webpeel/discussions)
454
+ ---
383
455
 
384
456
  ## Star History
385
457
 
@@ -391,23 +463,20 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
391
463
  </picture>
392
464
  </a>
393
465
 
394
- ## License
395
-
396
- This project is licensed under the [GNU Affero General Public License v3.0 (AGPL-3.0)](https://www.gnu.org/licenses/agpl-3.0.html).
466
+ ---
397
467
 
398
- **What this means:**
399
- - ✅ Free to use, modify, and distribute
400
- - ✅ Free for personal and commercial use
401
- - ⚠️ If you run a modified version as a network service, you must release your source code under AGPL-3.0
468
+ ## License
402
469
 
403
- **Need a commercial license?** Contact us at [support@webpeel.dev](mailto:support@webpeel.dev) for proprietary/enterprise licensing.
470
+ [AGPL-3.0](LICENSE) free to use, modify, and distribute. If you run a modified version as a network service, you must release your source under AGPL-3.0.
404
471
 
405
- > **Note:** Versions 0.7.1 and earlier were released under MIT. Those releases remain MIT-licensed.
472
+ Need a commercial license? [support@webpeel.dev](mailto:support@webpeel.dev)
406
473
 
407
- © [WebPeel](https://github.com/webpeel)
474
+ > Versions 0.7.1 and earlier were released under MIT and remain MIT-licensed.
408
475
 
409
476
  ---
410
477
 
411
478
  <p align="center">
412
- <b>Like WebPeel?</b> <a href="https://github.com/webpeel/webpeel">⭐ Star us on GitHub</a> — it helps others discover the project!
479
+ If WebPeel saves you time, <a href="https://github.com/webpeel/webpeel"><strong>⭐ star the repo</strong></a> — it helps others find it.
413
480
  </p>
481
+
482
+ © [WebPeel](https://github.com/webpeel)