webpeel 0.12.3 → 0.13.0
- package/README.md +330 -261
- package/dist/core/search-provider.d.ts +6 -0
- package/dist/core/search-provider.d.ts.map +1 -1
- package/dist/core/search-provider.js +77 -1
- package/dist/core/search-provider.js.map +1 -1
- package/dist/server/routes/mcp.d.ts.map +1 -1
- package/dist/server/routes/mcp.js +370 -13
- package/dist/server/routes/mcp.js.map +1 -1
- package/dist/server/routes/search.d.ts.map +1 -1
- package/dist/server/routes/search.js +14 -1
- package/dist/server/routes/search.js.map +1 -1
- package/llms.txt +2 -2
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -7,15 +7,15 @@
|
|
|
7
7
|
<p align="center">
|
|
8
8
|
<a href="https://www.npmjs.com/package/webpeel"><img src="https://img.shields.io/npm/v/webpeel.svg" alt="npm version"></a>
|
|
9
9
|
<a href="https://pypi.org/project/webpeel/"><img src="https://img.shields.io/pypi/v/webpeel.svg" alt="PyPI version"></a>
|
|
10
|
-
<a href="https://www.npmjs.com/package/webpeel"><img src="https://img.shields.io/npm/dm/webpeel.svg" alt="
|
|
10
|
+
<a href="https://www.npmjs.com/package/webpeel"><img src="https://img.shields.io/npm/dm/webpeel.svg" alt="downloads"></a>
|
|
11
11
|
<a href="https://github.com/webpeel/webpeel/stargazers"><img src="https://img.shields.io/github/stars/webpeel/webpeel.svg" alt="GitHub stars"></a>
|
|
12
12
|
<a href="https://github.com/webpeel/webpeel/actions/workflows/ci.yml"><img src="https://github.com/webpeel/webpeel/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
|
|
13
|
-
<a href="https://www.
|
|
14
|
-
<a href="https://www.gnu.org/licenses/agpl-3.0"><img src="https://img.shields.io/badge/License-AGPL%20v3-blue.svg" alt="AGPL v3 License"></a>
|
|
13
|
+
<a href="https://www.gnu.org/licenses/agpl-3.0"><img src="https://img.shields.io/badge/License-AGPL%20v3-blue.svg" alt="AGPL v3"></a>
|
|
15
14
|
</p>
|
|
16
15
|
|
|
17
16
|
<p align="center">
|
|
18
|
-
<
|
|
17
|
+
<strong>Reliable web access for AI agents.</strong><br>
|
|
18
|
+
Fetch any page · Extract structured data · Crawl entire sites · Deep research — one tool, three interfaces.
|
|
19
19
|
</p>
|
|
20
20
|
|
|
21
21
|
<p align="center">
|
|
@@ -28,345 +28,415 @@
|
|
|
28
28
|
|
|
29
29
|
---
|
|
30
30
|
|
|
31
|
-
##
|
|
31
|
+
## What is WebPeel?
|
|
32
32
|
|
|
33
|
-
|
|
34
|
-
# Zero install — just run it
|
|
35
|
-
npx webpeel https://news.ycombinator.com
|
|
36
|
-
```
|
|
37
|
-
|
|
38
|
-
```bash
|
|
39
|
-
# Agent mode — JSON + budget + extraction in one flag
|
|
40
|
-
npx webpeel https://example.com --agent
|
|
41
|
-
|
|
42
|
-
# Site search — no URL knowledge needed (27 supported sites)
|
|
43
|
-
npx webpeel search --site ebay "charizard card"
|
|
44
|
-
npx webpeel search --site amazon "laptop stand" --table
|
|
33
|
+
WebPeel gives your AI agent reliable access to the web. Fetch any page, extract structured data, crawl entire sites, and research topics — all through a single CLI, API, or MCP server.
|
|
45
34
|
|
|
46
|
-
|
|
47
|
-
npx webpeel https://www.amazon.com/s?k=keyboard --json
|
|
48
|
-
npx webpeel https://www.booking.com/searchresults.html --schema booking --json
|
|
49
|
-
npx webpeel --list-schemas
|
|
50
|
-
|
|
51
|
-
# LLM extraction — structured data from any page (BYOK)
|
|
52
|
-
npx webpeel https://example.com/product --llm-extract "title, price, rating" --json
|
|
53
|
-
npx webpeel https://hn.algolia.com --llm-extract "top 5 posts with scores" --llm-key $OPENAI_API_KEY
|
|
54
|
-
|
|
55
|
-
# Hotel search — multi-source parallel search
|
|
56
|
-
npx webpeel hotels "Paris" --checkin 2026-03-01 --checkout 2026-03-05 --sort price
|
|
57
|
-
npx webpeel hotels "New York" --checkin 2026-04-10 --json
|
|
35
|
+
It automatically handles the hard parts: JavaScript rendering, bot detection, Cloudflare challenges, infinite scroll, pagination, and content noise. Your agent gets clean markdown. You don't think about the plumbing.
|
|
58
36
|
|
|
59
|
-
|
|
60
|
-
npx webpeel profile create myprofile
|
|
61
|
-
npx webpeel https://protected-site.com --profile myprofile --stealth
|
|
62
|
-
npx webpeel profile list
|
|
37
|
+
---
|
|
63
38
|
|
|
64
|
-
|
|
65
|
-
npx webpeel https://protected-site.com --stealth
|
|
39
|
+
## 🚀 Quick Start
|
|
66
40
|
|
|
67
|
-
|
|
68
|
-
npx webpeel crawl https://example.com --max-pages 20
|
|
41
|
+
**Three paths in, all free to try:**
|
|
69
42
|
|
|
70
|
-
|
|
71
|
-
npx webpeel search "best AI frameworks 2026"
|
|
43
|
+
### CLI
|
|
72
44
|
|
|
73
|
-
|
|
74
|
-
npx webpeel https://
|
|
45
|
+
```bash
|
|
46
|
+
npx webpeel "https://news.ycombinator.com"
|
|
75
47
|
```
|
|
76
48
|
|
|
77
|
-
First 25 fetches work
|
|
78
|
-
|
|
79
|
-
## Why WebPeel?
|
|
80
|
-
|
|
81
|
-
| Feature | **WebPeel** | Firecrawl | Jina Reader | MCP Fetch |
|
|
82
|
-
|---------|:-----------:|:---------:|:-----------:|:---------:|
|
|
83
|
-
| **Free tier** | ✅ 500/wk recurring | 500 one-time | ❌ Cloud only | ✅ Unlimited |
|
|
84
|
-
| **Smart escalation** | ✅ HTTP→Browser→Stealth | Manual | ❌ | ❌ |
|
|
85
|
-
| **Challenge detection** | ✅ 7 vendors auto-detected | ❌ | ❌ | ❌ |
|
|
86
|
-
| **Site search** | ✅ 27 sites built-in | ❌ | ❌ | ❌ |
|
|
87
|
-
| **Stealth mode** | ✅ v2, all plans | ✅ | ⚠️ Limited | ❌ |
|
|
88
|
-
| **Browser profiles** | ✅ Persistent sessions | ❌ | ❌ | ❌ |
|
|
89
|
-
| **Hotel search** | ✅ Multi-source parallel | ❌ | ❌ | ❌ |
|
|
90
|
-
| **CSS schema extraction** | ✅ 7 bundled + auto-detect | ❌ | ❌ | ❌ |
|
|
91
|
-
| **LLM extraction** | ✅ BYOK, cost tracking | ⚠️ Cloud only | ❌ | ❌ |
|
|
92
|
-
| **Firecrawl-compatible** | ✅ Drop-in replacement | ✅ Native | ❌ | ❌ |
|
|
93
|
-
| **Self-hosting** | ✅ Docker compose | ⚠️ Complex | ❌ | N/A |
|
|
94
|
-
| **Autonomous agent** | ✅ BYOK any LLM | ⚠️ Locked | ❌ | ❌ |
|
|
95
|
-
| **Deep research** | ✅ Multi-source + BM25 | ⚠️ Cloud only | ❌ | ❌ |
|
|
96
|
-
| **Content pruning** | ✅ 2-pass, 15-33% savings | ❌ | ❌ | ❌ |
|
|
97
|
-
| **BM25 filtering** | ✅ Query-focused | ❌ | ❌ | ❌ |
|
|
98
|
-
| **Python SDK** | ✅ `pip install` | ✅ | ❌ | ❌ |
|
|
99
|
-
| **MCP tools** | ✅ 13 tools | ~6 | 0 | 1 |
|
|
100
|
-
| **License** | ✅ AGPL-3.0 | AGPL-3.0 | Proprietary | MIT |
|
|
101
|
-
| **Pricing** | **Free / $9 / $29** | $0 / $16 / $83 | Custom | Free |
|
|
102
|
-
|
|
103
|
-
## Benchmarks
|
|
104
|
-
|
|
105
|
-
Evaluated on 30 real-world URLs across 6 categories (static, dynamic, SPA, protected, documents, international) against 6 competing web fetching APIs.
|
|
106
|
-
|
|
107
|
-
| Metric | WebPeel | Next best |
|
|
108
|
-
|--------|:-------:|:---------:|
|
|
109
|
-
| **Success rate** | **100%** (30/30) | 93.3% (Firecrawl, Exa, LinkUp) |
|
|
110
|
-
| **Content quality** | **92.3%** | 83.2% (Exa) |
|
|
111
|
-
|
|
112
|
-
WebPeel is the only tool that successfully extracted content from all 30 test URLs. Full methodology and per-category breakdown: [webpeel.dev/blog/benchmarks](https://webpeel.dev/blog/benchmarks)
|
|
113
|
-
|
|
114
|
-
## Install
|
|
49
|
+
No install needed. First 25 fetches work without signup. [Get 500/week free →](https://app.webpeel.dev/signup)
|
|
115
50
|
|
|
116
|
-
|
|
117
|
-
# Node.js
|
|
118
|
-
npm install webpeel # or: pnpm add webpeel
|
|
51
|
+
### MCP Server (for Claude, Cursor, VS Code, Windsurf)
|
|
119
52
|
|
|
120
|
-
|
|
121
|
-
pip install webpeel
|
|
53
|
+
Add to your MCP config (`~/Library/Application Support/Claude/claude_desktop_config.json` on Mac):
|
|
122
54
|
|
|
123
|
-
|
|
124
|
-
|
|
55
|
+
```json
|
|
56
|
+
{
|
|
57
|
+
"mcpServers": {
|
|
58
|
+
"webpeel": {
|
|
59
|
+
"command": "npx",
|
|
60
|
+
"args": ["-y", "webpeel", "mcp"]
|
|
61
|
+
}
|
|
62
|
+
}
|
|
63
|
+
}
|
|
125
64
|
```
|
|
126
65
|
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
### Node.js
|
|
66
|
+
[Install for Claude](https://mcp.so/install/webpeel?for=claude)
|
|
67
|
+
[Install for VS Code](https://mcp.so/install/webpeel?for=vscode)
|
|
130
68
|
|
|
131
|
-
|
|
132
|
-
import { peel } from 'webpeel';
|
|
69
|
+
### REST API
|
|
133
70
|
|
|
134
|
-
|
|
135
|
-
|
|
136
|
-
|
|
137
|
-
console.log(result.tokens); // Estimated token count
|
|
138
|
-
|
|
139
|
-
// With options
|
|
140
|
-
const advanced = await peel('https://example.com', {
|
|
141
|
-
render: true, // Browser for JS-heavy sites
|
|
142
|
-
stealth: true, // Anti-bot stealth mode
|
|
143
|
-
maxTokens: 4000, // Limit output
|
|
144
|
-
includeTags: ['main'], // Filter HTML tags
|
|
145
|
-
});
|
|
71
|
+
```bash
|
|
72
|
+
curl "https://api.webpeel.dev/v1/fetch?url=https://example.com" \
|
|
73
|
+
-H "Authorization: Bearer wp_YOUR_KEY"
|
|
146
74
|
```
|
|
147
75
|
|
|
148
|
-
|
|
149
|
-
|
|
150
|
-
```python
|
|
151
|
-
from webpeel import WebPeel
|
|
76
|
+
---
|
|
152
77
|
|
|
153
|
-
|
|
78
|
+
## ✨ Features
|
|
79
|
+
|
|
80
|
+
### Core
|
|
81
|
+
|
|
82
|
+
| Feature | Description |
|
|
83
|
+
|---------|-------------|
|
|
84
|
+
| **Web Fetching** | Any URL → clean markdown, text, HTML, or JSON |
|
|
85
|
+
| **Smart Escalation** | Auto-upgrades: HTTP → Browser → Stealth. Uses the fastest method, escalates only when needed |
|
|
86
|
+
| **Content Pruning** | 2-pass HTML reduction — strips nav/footer/sidebar/ads automatically |
|
|
87
|
+
| **Token Budget** | Hard-cap output to N tokens. No surprises in your LLM bill |
|
|
88
|
+
| **Screenshots** | Full-page or viewport screenshots with a single flag |
|
|
89
|
+
| **Batch Mode** | Process multiple URLs concurrently |
|
|
90
|
+
|
|
91
|
+
### AI Agent
|
|
92
|
+
|
|
93
|
+
| Feature | Description |
|
|
94
|
+
|---------|-------------|
|
|
95
|
+
| **MCP Server** | 12 tools for Claude Desktop, Cursor, VS Code, and Windsurf |
|
|
96
|
+
| **Deep Research** | Multi-hop agent: search → fetch → analyze → follow leads → synthesize |
|
|
97
|
+
| **Search** | Web search across 27+ structured sources |
|
|
98
|
+
| **Hotel Search** | Kayak, Booking.com, Google Travel, Expedia — in parallel |
|
|
99
|
+
| **Browser Profiles** | Persistent sessions for sites that require login |
|
|
100
|
+
| **Infinite Scroll** | Auto-scrolls lazy-loaded feeds until stable |
|
|
101
|
+
| **Actions** | Click, type, fill, select, hover, press, scroll — full browser automation |
|
|
102
|
+
|
|
103
|
+
### Extraction
|
|
104
|
+
|
|
105
|
+
| Feature | Description |
|
|
106
|
+
|---------|-------------|
|
|
107
|
+
| **CSS Schema Extraction** | 7 built-in schemas (Amazon, Booking.com, eBay, Expedia, Hacker News, Walmart, Yelp) — auto-detected by domain |
|
|
108
|
+
| **JSON Schema Extraction** | Pass any JSON Schema and get back typed, structured data |
|
|
109
|
+
| **LLM Extraction (BYOK)** | Natural language → structured data using your own OpenAI-compatible key |
|
|
110
|
+
| **BM25 Filtering** | Query-focused content: only the parts relevant to your question |
|
|
111
|
+
| **Links / Images / Meta** | Extract just the links, images, or metadata from any page |
|
|
112
|
+
|
|
113
|
+
### Anti-Bot
|
|
114
|
+
|
|
115
|
+
| Feature | Description |
|
|
116
|
+
|---------|-------------|
|
|
117
|
+
| **Stealth Mode** | Bypasses Cloudflare, PerimeterX, DataDome, Akamai, and more |
|
|
118
|
+
| **28 Auto-Stealth Domains** | Amazon, LinkedIn, Glassdoor, Zillow, and 24 more — stealth kicks in automatically |
|
|
119
|
+
| **Challenge Detection** | 7 bot-protection vendors detected and handled automatically |
|
|
120
|
+
| **Browser Fingerprinting** | Masks WebGL, navigator properties, canvas fingerprint |
|
|
121
|
+
|
|
122
|
+
### Advanced
|
|
123
|
+
|
|
124
|
+
| Feature | Description |
|
|
125
|
+
|---------|-------------|
|
|
126
|
+
| **Crawl + Sitemap** | BFS/DFS crawling, sitemap discovery, robots.txt compliance, deduplication |
|
|
127
|
+
| **Site Map** | Map all URLs on a domain up to any depth |
|
|
128
|
+
| **Pagination** | Follow "Next" links automatically for N pages |
|
|
129
|
+
| **Chunking** | Split long content into LLM-sized pieces (fixed, semantic, or paragraph) |
|
|
130
|
+
| **Caching** | Local result cache with configurable TTL (`5m`, `1h`, `1d`) |
|
|
131
|
+
| **Geo-targeting** | ISO country code + language preferences per request |
|
|
132
|
+
| **Change Tracking** | Detect what changed between two fetches of the same page |
|
|
133
|
+
| **Brand Extraction** | Pull logo, colors, fonts, and social links from any site |
|
|
134
|
+
| **PDF Extraction** | Extract text from PDF documents |
|
|
135
|
+
| **Self-Hostable** | Docker Compose for full on-premise deployment |
|
|
136
|
+
| **Python SDK** | Sync + async client, `pip install webpeel` |
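
The Caching row above quotes TTL values such as `5m`, `1h`, and `1d`. As a rough sketch of how strings in that shape could be turned into seconds (the `parse_ttl` helper and the `s` unit are assumptions for illustration, not WebPeel's own parser):

```python
import re

# Seconds per unit suffix. "m", "h", "d" match the examples in the table;
# "s" is an assumed extra for completeness.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_ttl(text):
    """Convert a TTL string like '5m' or '1d' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unrecognised TTL: {text!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]
```

Rejecting anything that does not match the pattern keeps a typo such as `5 min` from silently becoming a zero-second cache.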
|
|
154
137
|
|
|
155
|
-
|
|
156
|
-
print(result.content) # Clean markdown
|
|
138
|
+
---
|
|
157
139
|
|
|
158
|
-
|
|
159
|
-
|
|
140
|
+
## 🤖 MCP Integration
|
|
141
|
+
|
|
142
|
+
WebPeel exposes **12 tools** to your AI coding assistant:
|
|
143
|
+
|
|
144
|
+
| Tool | What it does |
|
|
145
|
+
|------|--------------|
|
|
146
|
+
| `webpeel_fetch` | Fetch any URL → markdown (smart escalation built in) |
|
|
147
|
+
| `webpeel_search` | Web search with structured results |
|
|
148
|
+
| `webpeel_batch` | Fetch multiple URLs concurrently |
|
|
149
|
+
| `webpeel_crawl` | Crawl a site with depth/page limits |
|
|
150
|
+
| `webpeel_map` | Discover all URLs on a domain |
|
|
151
|
+
| `webpeel_extract` | Structured extraction (CSS, JSON Schema, or LLM) |
|
|
152
|
+
| `webpeel_screenshot` | Screenshot any page |
|
|
153
|
+
| `webpeel_research` | Deep multi-hop research on a topic |
|
|
154
|
+
| `webpeel_summarize` | AI summary of any URL |
|
|
155
|
+
| `webpeel_answer` | Ask a question about a URL's content |
|
|
156
|
+
| `webpeel_change_track` | Detect changes between two fetches |
|
|
157
|
+
| `webpeel_brand` | Extract branding assets from a site |
|
|
158
|
+
|
|
159
|
+
<details>
|
|
160
|
+
<summary>Setup for each editor</summary>
|
|
161
|
+
|
|
162
|
+
**Claude Desktop** (`~/Library/Application Support/Claude/claude_desktop_config.json`):
|
|
163
|
+
```json
|
|
164
|
+
{
|
|
165
|
+
"mcpServers": {
|
|
166
|
+
"webpeel": { "command": "npx", "args": ["-y", "webpeel", "mcp"] }
|
|
167
|
+
}
|
|
168
|
+
}
|
|
160
169
|
```
|
|
161
170
|
|
|
162
|
-
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
166
|
-
|
|
171
|
+
**Cursor** (Settings → MCP Servers):
|
|
172
|
+
```json
|
|
173
|
+
{
|
|
174
|
+
"mcpServers": {
|
|
175
|
+
"webpeel": { "command": "npx", "args": ["-y", "webpeel", "mcp"] }
|
|
176
|
+
}
|
|
177
|
+
}
|
|
178
|
+
```
|
|
167
179
|
|
|
168
|
-
|
|
180
|
+
**VS Code** (`~/.vscode/mcp.json`):
|
|
181
|
+
```json
|
|
182
|
+
{
|
|
183
|
+
"servers": {
|
|
184
|
+
"webpeel": { "command": "npx", "args": ["-y", "webpeel", "mcp"] }
|
|
185
|
+
}
|
|
186
|
+
}
|
|
187
|
+
```
|
|
169
188
|
|
|
189
|
+
**Windsurf** (`~/.codeium/windsurf/mcp_config.json`):
|
|
170
190
|
```json
|
|
171
191
|
{
|
|
172
192
|
"mcpServers": {
|
|
173
|
-
"webpeel": {
|
|
174
|
-
"command": "npx",
|
|
175
|
-
"args": ["-y", "webpeel", "mcp"]
|
|
176
|
-
}
|
|
193
|
+
"webpeel": { "command": "npx", "args": ["-y", "webpeel", "mcp"] }
|
|
177
194
|
}
|
|
178
195
|
}
|
|
179
196
|
```
|
|
180
197
|
|
|
181
|
-
|
|
182
|
-
|
|
198
|
+
**Docker (stdio)**:
|
|
199
|
+
```json
|
|
200
|
+
{
|
|
201
|
+
"mcpServers": {
|
|
202
|
+
"webpeel": { "command": "docker", "args": ["run", "-i", "--rm", "webpeel/mcp"] }
|
|
203
|
+
}
|
|
204
|
+
}
|
|
205
|
+
```
|
|
206
|
+
</details>
|
|
183
207
|
|
|
184
|
-
|
|
208
|
+
---
|
|
185
209
|
|
|
186
|
-
|
|
210
|
+
## 🔬 Deep Research
|
|
187
211
|
|
|
188
|
-
|
|
212
|
+
Multi-hop research that thinks like a researcher, not a search engine:
|
|
189
213
|
|
|
190
214
|
```bash
|
|
191
|
-
|
|
215
|
+
# Sources only — no API key needed
|
|
216
|
+
npx webpeel research "best practices for rate limiting APIs" --max-sources 8
|
|
217
|
+
|
|
218
|
+
# Full synthesis with LLM (BYOK)
|
|
219
|
+
npx webpeel research "compare Firecrawl vs Crawl4AI vs WebPeel" --llm-key sk-...
|
|
192
220
|
```
|
|
193
221
|
|
|
194
|
-
**
|
|
222
|
+
**How it works:** Search → fetch top results → extract key passages (BM25) → follow the most relevant links → synthesize across sources. No circular references, no duplicate content.
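
To make the "extract key passages (BM25)" step concrete, here is a minimal, self-contained sketch of Okapi BM25 ranking. It is not WebPeel's implementation: the function names, the whitespace tokenizer, and the `k1`/`b` defaults are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Score each passage against the query with Okapi BM25."""
    tokenized = [p.lower().split() for p in passages]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency: how many passages contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: longer passages need more hits to rank high.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_passages(query, passages, limit=3):
    """Return up to `limit` passages with a non-zero score, best first."""
    scores = bm25_scores(query, passages)
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in ranked if score > 0][:limit]
```

Passages that share no term with the query score zero and are dropped, which is one simple way to keep unrelated content out of the synthesis step.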
|
|
223
|
+
|
|
224
|
+
---
|
|
225
|
+
|
|
226
|
+
## 📦 Extraction
|
|
227
|
+
|
|
228
|
+
Three ways to get structured data out of any page:
|
|
229
|
+
|
|
230
|
+
### CSS Schema (zero config, auto-detected)
|
|
195
231
|
|
|
196
232
|
```bash
|
|
197
|
-
|
|
233
|
+
# Auto-detects Amazon and applies the built-in schema
|
|
234
|
+
npx webpeel "https://www.amazon.com/s?k=mechanical+keyboard" --json
|
|
235
|
+
|
|
236
|
+
# Force a specific schema
|
|
237
|
+
npx webpeel "https://www.booking.com/searchresults.html?city=Paris" --schema booking --json
|
|
238
|
+
|
|
239
|
+
# List all built-in schemas
|
|
240
|
+
npx webpeel --list-schemas
|
|
198
241
|
```
|
|
199
242
|
|
|
200
|
-
|
|
243
|
+
Built-in schemas: `amazon` · `booking` · `ebay` · `expedia` · `hackernews` · `walmart` · `yelp`
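
Auto-detection by domain can be pictured as a hostname lookup. The sketch below is an assumption about the general approach, not WebPeel's code: the schema names come from the list above, while the domain table and the `detect_schema` helper are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical domain-to-schema table. Schema names are from the README;
# the domains are assumed for illustration.
SCHEMA_BY_DOMAIN = {
    "amazon.com": "amazon",
    "booking.com": "booking",
    "ebay.com": "ebay",
    "expedia.com": "expedia",
    "news.ycombinator.com": "hackernews",
    "walmart.com": "walmart",
    "yelp.com": "yelp",
}


def detect_schema(url):
    """Return the schema name for a URL's host, or None if nothing matches."""
    host = (urlparse(url).hostname or "").lower()
    for domain, schema in SCHEMA_BY_DOMAIN.items():
        # Match the domain itself or any subdomain, but not look-alikes
        # such as "notamazon.com".
        if host == domain or host.endswith("." + domain):
            return schema
    return None
```

Returning `None` for unknown hosts leaves room for the caller to fall back to an explicit `--schema` value.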
|
|
244
|
+
|
|
245
|
+
### JSON Schema (type-safe structured extraction)
|
|
201
246
|
|
|
202
247
|
```bash
|
|
203
|
-
|
|
248
|
+
npx webpeel "https://example.com/product" \
|
|
249
|
+
--extract-schema '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}' \
|
|
250
|
+
--llm-key sk-...
|
|
204
251
|
```
|
|
205
252
|
|
|
206
|
-
|
|
253
|
+
### LLM Extraction (natural language, BYOK)
|
|
207
254
|
|
|
208
255
|
```bash
|
|
209
|
-
|
|
210
|
-
|
|
256
|
+
npx webpeel "https://hn.algolia.com" \
|
|
257
|
+
--llm-extract "top 10 posts with title, score, and comment count" \
|
|
258
|
+
--llm-key $OPENAI_API_KEY \
|
|
259
|
+
--json
|
|
211
260
|
```
|
|
212
261
|
|
|
213
|
-
|
|
262
|
+
<details>
|
|
263
|
+
<summary>Node.js extraction example</summary>
|
|
214
264
|
|
|
215
|
-
|
|
265
|
+
```typescript
|
|
266
|
+
import { peel } from 'webpeel';
|
|
216
267
|
|
|
217
|
-
|
|
218
|
-
{
|
|
219
|
-
|
|
220
|
-
|
|
221
|
-
|
|
222
|
-
|
|
268
|
+
// CSS selector extraction
|
|
269
|
+
const result = await peel('https://news.ycombinator.com', {
|
|
270
|
+
extract: {
|
|
271
|
+
selectors: {
|
|
272
|
+
titles: '.titleline > a',
|
|
273
|
+
scores: '.score',
|
|
223
274
|
}
|
|
224
275
|
}
|
|
225
|
-
}
|
|
226
|
-
|
|
276
|
+
});
|
|
277
|
+
console.log(result.extracted); // { titles: [...], scores: [...] }
|
|
227
278
|
|
|
228
|
-
|
|
279
|
+
// LLM extraction from a natural-language field list
|
|
280
|
+
const product = await peel('https://example.com/product', {
|
|
281
|
+
llmExtract: 'title, price, rating, availability',
|
|
282
|
+
llmKey: process.env.OPENAI_API_KEY,
|
|
283
|
+
});
|
|
284
|
+
```
|
|
285
|
+
</details>
|
|
229
286
|
|
|
230
|
-
|
|
287
|
+
---
|
|
231
288
|
|
|
232
|
-
|
|
289
|
+
## 🛡️ Stealth & Anti-Bot
|
|
233
290
|
|
|
234
|
-
|
|
235
|
-
HTTP Fetch (200ms) → Browser Rendering (2s) → Stealth Mode (5s)
|
|
236
|
-
80% of sites 15% of sites 5% of sites
|
|
237
|
-
```
|
|
291
|
+
WebPeel detects 7 bot-protection vendors and handles them automatically:
|
|
238
292
|
|
|
239
|
-
|
|
293
|
+
- **Cloudflare** (JS challenge, Turnstile, Bot Management)
|
|
294
|
+
- **PerimeterX / HUMAN** (behavioral analysis)
|
|
295
|
+
- **DataDome** (ML-based bot detection)
|
|
296
|
+
- **Akamai Bot Manager**
|
|
297
|
+
- **Distil Networks**
|
|
298
|
+
- **reCAPTCHA / hCaptcha** (page-level detection)
|
|
299
|
+
- **Generic challenge pages**
|
|
240
300
|
|
|
241
|
-
|
|
301
|
+
28 high-protection domains (Amazon, LinkedIn, Glassdoor, Zillow, Ticketmaster, and more) automatically route through stealth mode — no flags needed.
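
The HTTP → browser → stealth escalation can be sketched as a loop over fetch tiers that stops at the first response that does not look like a challenge page. Everything below is illustrative: `FetchResult`, `looks_blocked`, the marker strings, and the status codes are assumptions, and the real detection logic is certainly more involved.

```python
from dataclasses import dataclass


@dataclass
class FetchResult:
    status: int
    body: str


# Illustrative markers only; real challenge pages differ by vendor.
CHALLENGE_MARKERS = ("just a moment", "verify you are human", "captcha")


def looks_blocked(result):
    """Heuristic: a blocking status code or a challenge phrase in the body."""
    if result.status in (403, 429, 503):
        return True
    lowered = result.body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def fetch_with_escalation(url, strategies):
    """Try each (name, fetcher) tier in order; return the first unblocked one.

    If every tier is blocked, the last attempt is returned so the caller
    can inspect it. An empty strategy list yields None.
    """
    last = None
    for name, fetcher in strategies:
        last = (name, fetcher(url))
        if not looks_blocked(last[1]):
            return last
    return last
```

Ordering the tiers from cheapest to most expensive means the slower browser and stealth paths only run when the plain HTTP attempt fails.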
|
|
242
302
|
|
|
243
303
|
```bash
|
|
244
|
-
|
|
304
|
+
# Explicitly enable stealth
|
|
305
|
+
npx webpeel "https://glassdoor.com/jobs" --stealth
|
|
306
|
+
|
|
307
|
+
# Auto-escalation (stealth triggers automatically on challenge detection)
|
|
308
|
+
npx webpeel "https://amazon.com/dp/ASIN"
|
|
245
309
|
```
|
|
246
310
|
|
|
247
|
-
|
|
311
|
+
---
|
|
248
312
|
|
|
249
|
-
|
|
313
|
+
## ⚡ Benchmark
|
|
250
314
|
|
|
251
|
-
|
|
252
|
-
npx webpeel crawl https://docs.example.com --max-pages 100
|
|
253
|
-
npx webpeel map https://example.com --max-urls 5000
|
|
254
|
-
```
|
|
315
|
+
Evaluated on 30 real-world URLs across 6 categories (static, dynamic, SPA, protected, documents, international):
|
|
255
316
|
|
|
256
|
-
|
|
317
|
+
| | WebPeel | Next best |
|
|
318
|
+
|---|:---:|:---:|
|
|
319
|
+
| **Success rate** | **100%** (30/30) | 93.3% |
|
|
320
|
+
| **Content quality** | **92.3%** | 83.2% |
|
|
257
321
|
|
|
258
|
-
|
|
322
|
+
WebPeel is the only tool that extracted content from all 30 test URLs. [Full methodology →](https://webpeel.dev/blog/benchmarks)
|
|
259
323
|
|
|
260
|
-
|
|
261
|
-
# Get ranked sources with relevance scores
|
|
262
|
-
npx webpeel research "best web scraping tools 2025" --max-sources 5
|
|
324
|
+
---
|
|
263
325
|
|
|
264
|
-
|
|
265
|
-
|
|
266
|
-
|
|
326
|
+
## 🆚 Comparison
|
|
327
|
+
|
|
328
|
+
| Feature | **WebPeel** | Firecrawl | Jina Reader | ScrapingBee | Tavily |
|
|
329
|
+
|---------|:-----------:|:---------:|:-----------:|:-----------:|:------:|
|
|
330
|
+
| **Free tier** | ✅ 500/wk recurring | ⚠️ 500 one-time | ❌ | ❌ | ⚠️ 1,000 one-time |
|
|
331
|
+
| **Smart escalation** | ✅ auto HTTP→browser→stealth | ❌ manual | ❌ | ❌ | ❌ |
|
|
332
|
+
| **Stealth mode** | ✅ all plans | ✅ | ❌ | ✅ paid | ❌ |
|
|
333
|
+
| **Challenge detection** | ✅ 7 vendors | ❌ | ❌ | ❌ | ❌ |
|
|
334
|
+
| **MCP tools** | ✅ 12 tools | ⚠️ ~6 | ❌ | ❌ | ✅ |
|
|
335
|
+
| **Deep research** | ✅ multi-hop + BM25 | ⚠️ cloud only | ❌ | ❌ | ✅ |
|
|
336
|
+
| **CSS schema extraction** | ✅ 7 bundled | ❌ | ❌ | ❌ | ❌ |
|
|
337
|
+
| **LLM extraction (BYOK)** | ✅ | ⚠️ cloud only | ❌ | ❌ | ❌ |
|
|
338
|
+
| **Site search (27+ sites)** | ✅ | ❌ | ❌ | ❌ | ⚠️ web only |
|
|
339
|
+
| **Hotel search** | ✅ 4 sources parallel | ❌ | ❌ | ❌ | ❌ |
|
|
340
|
+
| **Browser profiles** | ✅ persistent sessions | ❌ | ❌ | ❌ | ❌ |
|
|
341
|
+
| **Self-hosting** | ✅ Docker Compose | ⚠️ complex | ❌ | ❌ | ❌ |
|
|
342
|
+
| **Python SDK** | ✅ `pip install` | ✅ | ❌ | ✅ | ✅ |
|
|
343
|
+
| **Firecrawl-compatible API** | ✅ drop-in | ✅ native | ❌ | ❌ | ❌ |
|
|
344
|
+
| **License** | AGPL-3.0 | AGPL-3.0 | Proprietary | Proprietary | Proprietary |
|
|
345
|
+
| **Price** | **$0 / $9 / $29** | $0 / $16 / $83 | custom | $49 / $149 | $0 / $99 |
|
|
346
|
+
|
|
347
|
+
---
|
|
348
|
+
|
|
349
|
+
## 💳 Pricing
|
|
350
|
+
|
|
351
|
+
| Plan | Price | Weekly Fetches | Burst |
|
|
352
|
+
|------|------:|:--------------:|:-----:|
|
|
353
|
+
| **Free** | $0/mo | 500/wk | 50/hr |
|
|
354
|
+
| **Pro** | $9/mo | 1,250/wk | 100/hr |
|
|
355
|
+
| **Max** | $29/mo | 6,250/wk | 500/hr |
|
|
356
|
+
|
|
357
|
+
All features on all plans. Pro/Max add pay-as-you-go extra usage (fetch $0.002, search $0.001, stealth $0.01). Quota resets every Monday.
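
As a worked example of the pay-as-you-go rates, the sketch below adds up an overage bill. It assumes each request beyond the weekly quota is billed once at the listed rate for its type; the README does not spell out the exact billing rules, so treat `overage_cost` as an illustration only.

```python
from decimal import Decimal

# Per-request overage prices quoted above, in US dollars.
OVERAGE_PRICE = {
    "fetch": Decimal("0.002"),
    "search": Decimal("0.001"),
    "stealth": Decimal("0.01"),
}


def overage_cost(extra_requests):
    """Sum the charge for requests beyond the quota, keyed by request type."""
    return sum(
        (OVERAGE_PRICE[kind] * count for kind, count in extra_requests.items()),
        Decimal("0"),
    )
```

Under that assumption, 1,000 extra fetches, 500 extra searches, and 100 extra stealth requests come to $2.00 + $0.50 + $1.00 = $3.50. `Decimal` is used so the sub-cent prices add up exactly.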
|
|
358
|
+
|
|
359
|
+
[Sign up free →](https://app.webpeel.dev/signup) · [Compare with Firecrawl →](https://webpeel.dev/migrate-from-firecrawl)
|
|
267
360
|
|
|
268
|
-
|
|
361
|
+
---
|
|
269
362
|
|
|
270
|
-
|
|
363
|
+
## 🐍 Python SDK
|
|
271
364
|
|
|
272
365
|
```bash
|
|
273
|
-
|
|
274
|
-
|
|
366
|
+
pip install webpeel
|
|
367
|
+
```
|
|
368
|
+
|
|
369
|
+
```python
|
|
370
|
+
from webpeel import WebPeel
|
|
275
371
|
|
|
276
|
-
#
|
|
277
|
-
npx webpeel https://en.wikipedia.org/wiki/Web_scraping --focus "legal issues"
|
|
372
|
+
client = WebPeel(api_key="wp_...") # or use WEBPEEL_API_KEY env var
|
|
278
373
|
|
|
279
|
-
#
|
|
280
|
-
|
|
374
|
+
# Fetch a page
|
|
375
|
+
result = client.scrape("https://example.com")
|
|
376
|
+
print(result.content) # Clean markdown
|
|
377
|
+
print(result.metadata) # title, description, author, ...
|
|
281
378
|
|
|
282
|
-
#
|
|
283
|
-
|
|
379
|
+
# Search the web
|
|
380
|
+
results = client.search("latest AI research papers")
|
|
381
|
+
|
|
382
|
+
# Crawl a site
|
|
383
|
+
job = client.crawl("https://docs.example.com", limit=100)
|
|
384
|
+
|
|
385
|
+
# With browser + stealth
|
|
386
|
+
result = client.scrape("https://protected-site.com", render=True, stealth=True)
|
|
284
387
|
```
|
|
285
388
|
|
|
286
|
-
|
|
389
|
+
Sync and async clients. Pure Python 3.8+, zero dependencies. [Full SDK docs →](python-sdk/README.md)
|
|
390
|
+
|
|
391
|
+
---
|
|
287
392
|
|
|
288
|
-
|
|
393
|
+
## 🐳 Self-Hosting
|
|
289
394
|
|
|
290
395
|
```bash
|
|
291
|
-
|
|
396
|
+
git clone https://github.com/webpeel/webpeel.git
|
|
397
|
+
cd webpeel && docker compose up
|
|
292
398
|
```
|
|
293
399
|
|
|
294
|
-
|
|
295
|
-
|
|
296
|
-
| Feature | CLI | Node.js | Python | API |
|
|
297
|
-
|---------|:---:|:-------:|:------:|:---:|
|
|
298
|
-
| Web scraping | ✅ | ✅ | ✅ | ✅ |
|
|
299
|
-
| Deep research | ✅ | ✅ | ✅ | ✅ |
|
|
300
|
-
| Content pruning | ✅ | ✅ | ✅ | ✅ |
|
|
301
|
-
| BM25 query filtering | ✅ | ✅ | — | ✅ |
|
|
302
|
-
| Structured extraction | ✅ | ✅ | ✅ | ✅ |
|
|
303
|
-
| CSS schema extraction | ✅ | ✅ | — | ✅ |
|
|
304
|
-
| LLM extraction (BYOK) | ✅ | ✅ | ✅ | ✅ |
|
|
305
|
-
| Page actions | ✅ | ✅ | ✅ | ✅ |
|
|
306
|
-
| Browser profiles | ✅ | ✅ | — | — |
|
|
307
|
-
| Screenshots | ✅ | ✅ | ✅ | ✅ |
|
|
308
|
-
| Crawling | ✅ | ✅ | ✅ | ✅ |
|
|
309
|
-
| Batch fetching | ✅ | ✅ | ✅ | ✅ |
|
|
310
|
-
| Hotel search | ✅ | — | — | — |
|
|
311
|
-
| Token budget | ✅ | ✅ | ✅ | ✅ |
|
|
312
|
-
| Smart chunking | ✅ | ✅ | — | — |
|
|
313
|
-
| Branding extraction | ✅ | ✅ | — | — |
|
|
314
|
-
| Change tracking | ✅ | ✅ | — | — |
|
|
315
|
-
| AI summarization | ✅ | ✅ | — | ✅ |
|
|
316
|
-
| Batch processing | — | ✅ | — | ✅ |
|
|
317
|
-
| PDF extraction | ✅ | ✅ | — | — |
|
|
318
|
-
|
|
319
|
-
## Integrations
|
|
320
|
-
|
|
321
|
-
Works with **CrewAI**, **Dify**, and **n8n** via the Firecrawl-compatible API. LangChain & LlamaIndex integrations coming soon. [Integration docs →](https://webpeel.dev/docs)
|
|
322
|
-
|
|
323
|
-
## Hosted API
|
|
324
|
-
|
|
325
|
-
Live at [`api.webpeel.dev`](https://api.webpeel.dev) — Firecrawl-compatible endpoints.
|
|
400
|
+
Full REST API available at `http://localhost:3000`. AGPL-3.0 licensed. [Self-hosting guide →](SELF_HOST.md)
|
|
326
401
|
|
|
402
|
+
**Just the MCP server:**
|
|
327
403
|
```bash
|
|
328
|
-
|
|
329
|
-
|
|
404
|
+
docker run -i webpeel/mcp
|
|
405
|
+
```
|
|
330
406
|
|
|
331
|
-
|
|
332
|
-
|
|
333
|
-
|
|
407
|
+
**Just the API server:**
|
|
408
|
+
```bash
|
|
409
|
+
docker run -p 3000:3000 webpeel/api
|
|
334
410
|
```
|
|
335
411
|
|
|
336
|
-
|
|
412
|
+
---
|
|
337
413
|
|
|
338
|
-
|
|
339
|
-
|------|------:|---------------:|:-----:|:-----------:|
|
|
340
|
-
| **Free** | $0 | 500/wk | 50/hr | — |
|
|
341
|
-
| **Pro** | $9/mo | 1,250/wk | 100/hr | ✅ from $0.001 |
|
|
342
|
-
| **Max** | $29/mo | 6,250/wk | 500/hr | ✅ from $0.001 |
|
|
414
|
+
## 📖 API Reference
|
|
343
415
|
|
|
344
|
-
|
|
416
|
+
Full OpenAPI spec at [`openapi.yaml`](openapi.yaml) and [`api.webpeel.dev`](https://api.webpeel.dev).
|
|
345
417
|
|
|
346
|
-
|
|
418
|
+
```bash
|
|
419
|
+
# Fetch
|
|
420
|
+
GET /v1/fetch?url=<url>
|
|
347
421
|
|
|
422
|
+
# Search
|
|
423
|
+
GET /v1/search?q=<query>
|
|
424
|
+
|
|
425
|
+
# Crawl
|
|
426
|
+
POST /v1/crawl { "url": "...", "limit": 100 }
|
|
427
|
+
|
|
428
|
+
# Map
|
|
429
|
+
GET /v1/map?url=<url>
|
|
430
|
+
|
|
431
|
+
# Extract
|
|
432
|
+
POST /v1/extract { "url": "...", "schema": { ... } }
|
|
348
433
|
```
|
|
349
|
-
webpeel/
|
|
350
|
-
├── src/
|
|
351
|
-
│ ├── core/ # Core library (fetcher, strategies, markdown, crawl, search)
|
|
352
|
-
│ ├── mcp/ # MCP server (11 tools for AI assistants)
|
|
353
|
-
│ ├── server/ # Express API server (hosted version)
|
|
354
|
-
│ │ ├── routes/ # API route handlers
|
|
355
|
-
│ │ ├── middleware/ # Auth, rate limiting, SSRF protection
|
|
356
|
-
│ │ └── premium/ # Server-only premium features
|
|
357
|
-
│ ├── tests/ # Vitest test suites
|
|
358
|
-
│ ├── cli.ts # CLI entry point
|
|
359
|
-
│ ├── index.ts # Library exports
|
|
360
|
-
│ └── types.ts # TypeScript type definitions
|
|
361
|
-
├── python-sdk/ # Python SDK (PyPI: webpeel)
|
|
362
|
-
├── integrations/ # LangChain, LlamaIndex, CrewAI, Dify, n8n
|
|
363
|
-
├── site/ # Landing page (webpeel.dev)
|
|
364
|
-
├── dashboard/ # Next.js dashboard (app.webpeel.dev)
|
|
365
|
-
├── benchmarks/ # Performance comparison suite
|
|
366
|
-
└── skills/ # AI agent skills (Claude Code, etc.)
|
|
367
|
-
```
|
|
368
434
|
|
|
369
|
-
|
|
435
|
+
[Full API reference →](https://webpeel.dev/docs/api-reference)
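
For callers not using the SDK, the endpoints above can be reached with the standard library alone. This sketch only builds the requests; it assumes the crawl endpoint takes the same Bearer token as the fetch example and accepts a JSON body, neither of which is confirmed here. The `build_*` helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.webpeel.dev"


def build_fetch_request(target_url, api_key):
    """GET /v1/fetch with the target URL percent-encoded in the query string."""
    query = urlencode({"url": target_url})
    return Request(
        f"{API_BASE}/v1/fetch?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def build_crawl_request(target_url, api_key, limit=100):
    """POST /v1/crawl with a JSON body (Content-Type header is an assumption)."""
    payload = json.dumps({"url": target_url, "limit": limit}).encode("utf-8")
    return Request(
        f"{API_BASE}/v1/crawl",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Either request can then be sent with `urllib.request.urlopen(request)`. Encoding the target URL matters once it carries its own query string, since an unescaped `&` would otherwise be read as a second API parameter.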
|
|
436
|
+
|
|
437
|
+
---
|
|
438
|
+
|
|
439
|
+
## 🤝 Contributing
|
|
370
440
|
|
|
371
441
|
```bash
|
|
372
442
|
git clone https://github.com/webpeel/webpeel.git
|
|
@@ -375,11 +445,13 @@ npm install && npm run build
|
|
|
375
445
|
npm test
|
|
376
446
|
```
|
|
377
447
|
|
|
378
|
-
|
|
448
|
+
- **Bug reports:** [Open an issue](https://github.com/webpeel/webpeel/issues)
|
|
449
|
+
- **Feature requests:** [Start a discussion](https://github.com/webpeel/webpeel/discussions)
|
|
450
|
+
- **Code:** See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines
|
|
379
451
|
|
|
380
|
-
|
|
452
|
+
The project has a comprehensive test suite. Please add tests for new features.
|
|
381
453
|
|
|
382
|
-
|
|
454
|
+
---
|
|
383
455
|
|
|
384
456
|
## Star History
|
|
385
457
|
|
|
@@ -391,23 +463,20 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
|
|
|
391
463
|
</picture>
|
|
392
464
|
</a>
|
|
393
465
|
|
|
394
|
-
|
|
395
|
-
|
|
396
|
-
This project is licensed under the [GNU Affero General Public License v3.0 (AGPL-3.0)](https://www.gnu.org/licenses/agpl-3.0.html).
|
|
466
|
+
---
|
|
397
467
|
|
|
398
|
-
|
|
399
|
-
- ✅ Free to use, modify, and distribute
|
|
400
|
-
- ✅ Free for personal and commercial use
|
|
401
|
-
- ⚠️ If you run a modified version as a network service, you must release your source code under AGPL-3.0
|
|
468
|
+
## License
|
|
402
469
|
|
|
403
|
-
|
|
470
|
+
[AGPL-3.0](LICENSE) — free to use, modify, and distribute. If you run a modified version as a network service, you must release your source under AGPL-3.0.
|
|
404
471
|
|
|
405
|
-
|
|
472
|
+
Need a commercial license? [support@webpeel.dev](mailto:support@webpeel.dev)
|
|
406
473
|
|
|
407
|
-
|
|
474
|
+
> Versions 0.7.1 and earlier were released under MIT and remain MIT-licensed.
|
|
408
475
|
|
|
409
476
|
---
|
|
410
477
|
|
|
411
478
|
<p align="center">
|
|
412
|
-
|
|
479
|
+
If WebPeel saves you time, <a href="https://github.com/webpeel/webpeel"><strong>⭐ star the repo</strong></a> — it helps others find it.
|
|
413
480
|
</p>
|
|
481
|
+
|
|
482
|
+
© [WebPeel](https://github.com/webpeel)
|