crawl4agent-0.2.0-py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- crawl4agent-0.2.0.dist-info/METADATA +333 -0
- crawl4agent-0.2.0.dist-info/RECORD +16 -0
- crawl4agent-0.2.0.dist-info/WHEEL +5 -0
- crawl4agent-0.2.0.dist-info/entry_points.txt +2 -0
- crawl4agent-0.2.0.dist-info/licenses/LICENSE +559 -0
- crawl4agent-0.2.0.dist-info/top_level.txt +1 -0
- crawl4ai_mcp/__init__.py +3 -0
- crawl4ai_mcp/__main__.py +7 -0
- crawl4ai_mcp/config.py +52 -0
- crawl4ai_mcp/crawler.py +645 -0
- crawl4ai_mcp/golden_urls.py +13 -0
- crawl4ai_mcp/mcp_server.py +151 -0
- crawl4ai_mcp/openai_client.py +65 -0
- crawl4ai_mcp/searcher.py +467 -0
- crawl4ai_mcp/smoke_golden.py +52 -0
- crawl4ai_mcp/types.py +35 -0
@@ -0,0 +1,333 @@
Metadata-Version: 2.4
Name: crawl4agent
Version: 0.2.0
Summary: An MCP server built on crawl4ai for reliable webpage extraction
Author: crawl4ai-mcp contributors
License: AGPL-3.0-or-later
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: crawl4ai==0.6.2
Requires-Dist: mcp>=1.2.0
Requires-Dist: pydantic>=2.6.0
Requires-Dist: pydantic-settings>=2.2.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=5.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"
Dynamic: license-file

# crawl4ai-mcp

<div align="center">

[](https://www.gnu.org/licenses/agpl-3.0)
[](https://www.python.org/downloads/)
[](https://modelcontextprotocol.io)
[](https://playwright.dev)
[](https://github.com/unclecode/crawl4ai)
[](https://pypi.org/project/crawl4agent/)
[](https://github.com/pazyork/crawl4ai-mcp)

**A minimal MCP server for agent-friendly web extraction and search.**

Two tools: fetch real pages with Playwright + Crawl4AI, or search across 7 engines with automatic fallback.

</div>

---

## Quick entry

| Audience | Read this |
|---|---|
| Human developer | **[README.zh-CN.md](./README.zh-CN.md)** / **[README.md](./README.md)** |
| Living in the AI era, delegating your remaining sanity to an agent | **[README_AGENT.md](./README_AGENT.md)** |

## At a glance

| Item | Reality in this repo |
|---|---|
| MCP tools | **2 tools**: `fetch_urls` + `search_web` |
| Single-page fetch | `urls: ["https://example.com"]` |
| Web search | `search_web(query="...", engine="auto")` — 7 engines, auto fallback |
| Search engines | DuckDuckGo · Bing · Google · Yandex · Sogou · 360Search · Baidu |
| Output | `title + content + links + blocked + llm_used/llm_error` |
| Non-LLM mode | First-class, default, usable without any model |
| LLM mode | **Off by default**. Enabled only with `use_llm=true` + optional `llm_instruction` |
| Fallback | A missing or failed LLM call automatically falls back to the non-LLM result |
| Anti-bot realism | proxy / cookies / persistent profile / randomized browser behavior |
| License | **AGPL-3.0-or-later** |

---

## How it works

**Fetch flow:**

```mermaid
flowchart LR
    A[URL list] --> B[Playwright + Crawl4AI]
    B --> C{Fast path enough?}
    C -- Yes --> D[Markdown / HTML]
    C -- No --> E[Stronger fallback]
    E --> D
    D --> F{use_llm?}
    F -- No --> G[Return result]
    F -- Yes --> H[OpenAI-compatible cleanup]
    H --> I{LLM success?}
    I -- Yes --> J[Return enhanced result]
    I -- No --> G
```

**Search flow:**

```mermaid
flowchart LR
    A[query + engine] --> B{engine=auto?}
    B -- Yes --> C[Detect language]
    C --> D[Build engine plan]
    B -- No --> E[Use specified engine]
    D --> F[Try engines in order]
    E --> F
    F --> G{Results?}
    G -- Yes --> H[Aggregate + deduplicate]
    G -- No, next engine --> F
    H --> I[Return results]
```

---

## Why this project exists

Most generic “web fetch” tools either fail on JS-heavy pages or return too much boilerplate. This project focuses on four things:

- **Non-LLM quality first**: usable even with zero model config
- **Minimal MCP surface**: easier for agents, easier to maintain
- **Pragmatic anti-bot workflow**: proxy / cookies / persistent profile are first-class
- **Golden regression review**: full markdown outputs can be saved and inspected page by page

---

## Core capabilities

### Non-LLM mode

| Capability | Actual behavior |
|---|---|
| Rendering | Real browser rendering via Playwright |
| Extraction | Crawl4AI markdown/html extraction |
| Fallback | Fast path → stronger path when content is too thin |
| Cleanup | Remove obvious noise, compress blank runs, strip data-image placeholders |
| Site tuning | Medium / Claude Docs / GitHub and other mainstream sites |
| Block detection | `blocked=true` for likely verification/interstitial output (sketched below) |
| Batch control | Bounded concurrency via `concurrency` |

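The exact block-detection heuristic lives in `crawler.py` and is not spelled out in this README; a minimal illustrative sketch (the `looks_blocked` helper and marker phrases below are assumptions, not the project's actual logic) conveys the idea:

```python
import re

# Hypothetical marker phrases; the real heuristics are not documented here.
BLOCK_MARKERS = [
    "verify you are human",
    "checking your browser",
    "just a moment",
    "access denied",
]

def looks_blocked(markdown: str) -> bool:
    """Illustrative sketch: flag output that contains typical
    verification/interstitial phrases, or that is suspiciously thin
    while mentioning a challenge provider."""
    text = markdown.lower()
    too_thin = len(re.sub(r"\s+", "", text)) < 200
    return any(m in text for m in BLOCK_MARKERS) or (too_thin and "cloudflare" in text)
```
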
### Optional LLM mode

| Input | Meaning |
|---|---|
| `use_llm=true` | Turn on post-cleanup with an OpenAI-compatible model |
| `llm_instruction` | Tell the model what to keep / remove |

**Important reality check:**

- With `llm_instruction`, the prompt is **constraint-heavy** and biased toward preserving original lines.
- Without `llm_instruction`, the model does a more generic “clean readable markdown” pass.
- If the LLM call fails for any reason, the tool returns the original non-LLM extraction plus `llm_used=false` and `llm_error` (see the sketch below).

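Since graceful degradation is the core contract of LLM mode, here is a minimal sketch of it, assuming an OpenAI-compatible `/chat/completions` endpoint configured through the env vars documented later. This is illustrative, not the project's actual `openai_client.py`:

```python
import os
import httpx

async def maybe_enhance(content: str, instruction: str | None) -> dict:
    """Sketch: post-clean extracted markdown with an OpenAI-compatible
    model, falling back to the raw extraction on any failure."""
    base = os.environ.get("OPENAI_BASE_URL")
    key = os.environ.get("OPENAI_API_KEY")
    model = os.environ.get("OPENAI_MODEL")
    if not (base and key and model):
        return {"content": content, "llm_used": False,
                "llm_error": "LLM env not configured"}
    prompt = instruction or "Clean this page into readable markdown. Preserve original lines."
    try:
        async with httpx.AsyncClient(timeout=60) as client:
            resp = await client.post(
                f"{base.rstrip('/')}/chat/completions",
                headers={"Authorization": f"Bearer {key}"},
                json={"model": model, "messages": [
                    {"role": "system", "content": prompt},
                    {"role": "user", "content": content},
                ]},
            )
            resp.raise_for_status()
            cleaned = resp.json()["choices"][0]["message"]["content"]
        return {"content": cleaned, "llm_used": True, "llm_error": None}
    except Exception as exc:  # missing config, network error, bad response, ...
        return {"content": content, "llm_used": False, "llm_error": str(exc)}
```
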
---

## MCP Tools

### `fetch_urls`

```json
{
  "urls": ["https://a.com", "https://b.com"],
  "format": "markdown",
  "max_chars": 200000,
  "concurrency": 3,
  "use_llm": false,
  "llm_instruction": "keep only the tutorial body and in-body references"
}
```

Use a single-element list if you only need one page.

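For orientation, calling this tool from the `mcp` Python SDK (already a dependency of this package) looks roughly like the following. The tool name and arguments come from this README; the rest is standard MCP stdio-client boilerplate:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the server over stdio using its console-script entry point.
    server = StdioServerParameters(command="crawl4ai-mcp")
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "fetch_urls",
                arguments={"urls": ["https://example.com"], "format": "markdown"},
            )
            print(result.content)

asyncio.run(main())
```
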
### Return shape

| Field | Meaning |
|---|---|
| `url` | Original URL |
| `final_url` | Final resolved URL after redirects |
| `title` | Extracted title |
| `content` | Markdown or HTML |
| `content_format` | `markdown` or `html` |
| `links` | Normalized extracted links |
| `blocked` | Likely anti-bot / verification / denied result |
| `llm_used` | Whether LLM enhancement was actually applied |
| `llm_error` | Why the LLM step degraded |

### `search_web`

```json
{
  "query": "crawl4ai web scraping",
  "engine": "auto",
  "max_results": 10,
  "lang": ""
}
```

| Parameter | Default | Description |
|---|---|---|
| `query` | (required) | Search query string |
| `engine` | `auto` | `auto` or a specific engine name (e.g. `google`, `bing`, `duckduckgo`, `baidu`) |
| `max_results` | `10` | Maximum number of results |
| `lang` | `""` | Language hint (e.g. `en`, `zh-CN`) |

When `engine="auto"`, the server detects the query language, builds an engine plan, and tries engines in fallback order: DuckDuckGo → Bing → Google → Baidu. The first engine that returns results wins.

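A minimal sketch of that fallback loop, with hypothetical per-engine searcher callables standing in for the real implementations in `searcher.py`, and mirroring the return shape documented below:

```python
from collections.abc import Awaitable, Callable

# Hypothetical signature for one engine's searcher: (query, max_results) -> results.
EngineFn = Callable[[str, int], Awaitable[list[dict]]]

async def search_with_fallback(query: str, plan: dict[str, EngineFn],
                               max_results: int = 10) -> dict:
    """Illustrative: try each engine in plan order, record failures,
    and return the first non-empty batch of results."""
    tried: list[str] = []
    for name, engine in plan.items():
        try:
            results = await engine(query, max_results)
        except Exception:
            results = []  # treat any engine error as "no results, try next"
        if results:
            return {"engine": name, "query": query,
                    "results": results[:max_results],
                    "total": len(results[:max_results]),
                    "fallback_engines_tried": tried}
        tried.append(name)
    return {"engine": None, "query": query, "results": [],
            "total": 0, "fallback_engines_tried": tried}
```
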
#### Search return shape

| Field | Meaning |
|---|---|
| `engine` | Which engine actually returned results |
| `query` | Original query |
| `results` | List of `{title, url, snippet}` |
| `total` | Number of results |
| `fallback_engines_tried` | Engines that failed before the successful one |

---

## Anti-bot realism

The server ships with randomized browser behavior built in:

| Mechanism | Actual status |
|---|---|
| Random viewport | Yes |
| Random user agent mode | Yes, when an explicit UA is not provided |
| Delay jitter | Yes |
| `override_navigator` | Yes |
| `simulate_user` | Yes, in the stronger fallback mode |
| Proxy / cookies / persistent profile | Supported via env vars |
| Cloudflare bypass | Enhanced browser fingerprinting + configurable wait strategies |

**Note**: For overseas websites (Medium, ProductHunt, etc.), using a proxy is recommended. The server supports HTTP/HTTPS/SOCKS5 proxies via the `CRAWL4AI_MCP_PROXY` environment variable.

### Proxy input formats

`CRAWL4AI_MCP_PROXY` accepts any of these forms (normalization sketched below):

| Input | Interpreted as |
|---|---|
| `http://127.0.0.1:7890` | HTTP proxy |
| `https://127.0.0.1:7890` | HTTPS proxy |
| `socks5://127.0.0.1:7890` | SOCKS5 proxy |
| `socket5://127.0.0.1:7890` | Auto-normalized to `socks5://...` |
| `127.0.0.1:7890` | Auto-normalized to `http://127.0.0.1:7890` |
| `7890` | Auto-normalized to `http://127.0.0.1:7890` |

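Reconstructing those rules as code is straightforward. This is an illustrative re-implementation (the `normalize_proxy` helper is hypothetical, not necessarily what `config.py` does):

```python
def normalize_proxy(raw: str) -> str:
    """Sketch of the normalization rules in the table above."""
    value = raw.strip()
    if not value:
        return ""
    if value.isdigit():  # bare port -> local HTTP proxy
        return f"http://127.0.0.1:{value}"
    value = value.replace("socket5://", "socks5://", 1)  # common typo
    if "://" not in value:  # bare host:port -> assume HTTP
        return f"http://{value}"
    return value

assert normalize_proxy("7890") == "http://127.0.0.1:7890"
assert normalize_proxy("127.0.0.1:7890") == "http://127.0.0.1:7890"
assert normalize_proxy("socket5://127.0.0.1:7890") == "socks5://127.0.0.1:7890"
```
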
None of this amounts to “perfect stealth”, and the README does not claim it; what the server can honestly offer is **human-like randomization** and **practical anti-bot knobs**.

---

## Quickstart

### Conda

```bash
conda env create -f environment.yml
conda activate crawl4ai-mcp
python -m playwright install
crawl4ai-mcp
```

### venv

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e '.[dev]'
python -m playwright install
crawl4ai-mcp
```

---

## MCP server config example

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "crawl4ai-mcp",
      "env": {
        "CRAWL4AI_MCP_HEADLESS": "true",
        "CRAWL4AI_MCP_PROXY": "127.0.0.1:7890",
        "CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS": "30000",
        "CRAWL4AI_MCP_WAIT_UNTIL": "load",

        "OPENAI_BASE_URL": "https://your-openai-compatible-host",
        "OPENAI_API_KEY": "your-api-key",
        "OPENAI_MODEL": "your-model-name"
      }
    }
  }
}
```

LLM-related env vars are **optional**. `use_llm` is still **off by default** at call time. If any LLM env var is missing or invalid, or the model call fails, the server automatically falls back to non-LLM extraction.

---

## Runtime configuration

| Env var | Purpose |
|---|---|
| `CRAWL4AI_MCP_HEADLESS` | Run the browser headless |
| `CRAWL4AI_MCP_PROXY` | Upstream proxy; supports `http://`, `https://`, `socks5://`, `host:port`, and port-only forms |
| `CRAWL4AI_MCP_COOKIES_JSON` | Playwright storage state JSON |
| `CRAWL4AI_MCP_USE_PERSISTENT_CONTEXT` | Reuse a browser profile across runs |
| `CRAWL4AI_MCP_USER_DATA_DIR` | Profile directory |
| `CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS` | Maximum wait per navigation in milliseconds (default `30000`) |
| `CRAWL4AI_MCP_WAIT_UNTIL` | Page readiness strategy (default `load`) |
| `OPENAI_BASE_URL` | OpenAI-compatible base URL |
| `OPENAI_API_KEY` | API key |
| `OPENAI_MODEL` | Model name |

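Since the package depends on `pydantic-settings` and ships a `config.py`, these variables presumably map onto a settings model. A minimal sketch of that shape (the class name and defaults here are illustrative assumptions, not the project's actual values):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Crawl4aiMcpSettings(BaseSettings):
    """Illustrative mapping of the env vars above onto pydantic-settings."""
    model_config = SettingsConfigDict(env_prefix="CRAWL4AI_MCP_")

    headless: bool = True                  # CRAWL4AI_MCP_HEADLESS
    proxy: str = ""                        # CRAWL4AI_MCP_PROXY
    cookies_json: str = ""                 # CRAWL4AI_MCP_COOKIES_JSON
    use_persistent_context: bool = False   # CRAWL4AI_MCP_USE_PERSISTENT_CONTEXT
    user_data_dir: str = ""                # CRAWL4AI_MCP_USER_DATA_DIR
    navigation_timeout_ms: int = 30000     # CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS
    wait_until: str = "load"               # CRAWL4AI_MCP_WAIT_UNTIL

settings = Crawl4aiMcpSettings()  # reads CRAWL4AI_MCP_* from the environment
```
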
---

## Golden smoke regression

```bash
CRAWL4AI_MCP_SMOKE_DIR=./_golden_outputs .venv/bin/python -m crawl4ai_mcp.smoke_golden
```

This writes full markdown outputs to `_golden_outputs/` so you can inspect extraction quality page by page.

The golden set now includes the earlier baseline URLs plus `ainew.me`, `openclaw`, `watcha`, `producthunt`, `mydrivers`, `caihongtu`, `openrouter`, and mobile Douban. For sites outside mainland China, proxy-based verification is recommended.

Some overseas sites may still return Cloudflare or similar verification pages even when a proxy is configured. In those cases the server marks the result with `blocked=true`. The recommended escalation path: better proxy quality, valid cookies, or a persistent browser profile created after manual verification (see the sketch below).

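For the cookie route, one standard way to capture a manually verified session is Playwright's storage-state API; the resulting JSON is the kind of state `CRAWL4AI_MCP_COOKIES_JSON` expects. The target URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

# Pass a site's verification by hand once, then save the session state.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com")  # solve any challenge manually
    input("Press Enter once the page is fully loaded and verified... ")
    context.storage_state(path="storage_state.json")
    browser.close()
```
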
---

## Prior art

- Crawl4AI: <https://github.com/unclecode/crawl4ai>
- mcp-crawl4ai-rag: <https://github.com/coleam00/mcp-crawl4ai-rag>
- weidwonder/crawl4ai-mcp-server: <https://github.com/weidwonder/crawl4ai-mcp-server>
- WaterCrawl: <https://github.com/watercrawl/WaterCrawl>
- teracrawl: <https://github.com/BrowserCash/teracrawl>

---

## License

This project is licensed under **AGPL-3.0-or-later**.
@@ -0,0 +1,16 @@
crawl4agent-0.2.0.dist-info/licenses/LICENSE,sha256=q76n28XV0G2jnoJ1AAHDGQ5UA03tayDMJfIkiavcz-s,28779
crawl4ai_mcp/__init__.py,sha256=QwLGZDZhVE7dBKbOwCLRt3tRMXTtg8fYYHHDXi3sZ1c,49
crawl4ai_mcp/__main__.py,sha256=nYX5vCIfnhF3QWurxQTTX5gqH7OnaJjpaKoZvkegocA,108
crawl4ai_mcp/config.py,sha256=4tpCgkgYDU6tI-58blaGecnGLyPFnJyUWkxbE04KFaU,1718
crawl4ai_mcp/crawler.py,sha256=o_iEdbEOyeFHaSMxTXsbbiNOTM0qfnf7CiUdVxygHeM,21901
crawl4ai_mcp/golden_urls.py,sha256=hSA8XumUMXTjQkb0AIq4OiO1QqyxWDUu-GRGAJ8QQ_c,610
crawl4ai_mcp/mcp_server.py,sha256=AT0N0cj7e-khthGBnyJh6vJMZxduJHmf9EH_L6B5IFw,5570
crawl4ai_mcp/openai_client.py,sha256=jeDzP6wUBtbxz5vqU1YvXMPMrIWHDbchamKGkkBYbO0,1902
crawl4ai_mcp/searcher.py,sha256=DO4B8nadPhO_75tiY9Gs3fqOeo3RqqCCBiDlWUi_Yk4,16300
crawl4ai_mcp/smoke_golden.py,sha256=909rNrp9dH8ha27jNzgGl-FWBxKIROsECtLqPbX2GUw,1541
crawl4ai_mcp/types.py,sha256=9MlM8uzF57sspxRR4_oZDbtBJ0KwAdptpxC89CFZlqk,787
crawl4agent-0.2.0.dist-info/METADATA,sha256=OsHvut99ImsDxz6ifiQ2t5uS5MKieIVpx_s5PTkQZGs,10951
crawl4agent-0.2.0.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
crawl4agent-0.2.0.dist-info/entry_points.txt,sha256=RFIevJQ7FOVwGIA1ISB-1vNESczgn9H--DXRItQzouU,60
crawl4agent-0.2.0.dist-info/top_level.txt,sha256=vnjkvW11ahFhq93JqHBtcfuIWorAbXE9WL90RkFZ__o,13
crawl4agent-0.2.0.dist-info/RECORD,,