crawl4agent 0.2.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,333 @@
Metadata-Version: 2.4
Name: crawl4agent
Version: 0.2.0
Summary: An MCP server built on crawl4ai for reliable webpage extraction
Author: crawl4ai-mcp contributors
License: AGPL-3.0-or-later
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: crawl4ai==0.6.2
Requires-Dist: mcp>=1.2.0
Requires-Dist: pydantic>=2.6.0
Requires-Dist: pydantic-settings>=2.2.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=5.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"
Dynamic: license-file

# crawl4ai-mcp

<div align="center">

[![License: AGPL v3](https://img.shields.io/badge/license-AGPL--3.0--or--later-6f42c1)](https://www.gnu.org/licenses/agpl-3.0)
[![Python](https://img.shields.io/badge/python-3.10--3.12-3776AB?logo=python&logoColor=white)](https://www.python.org/downloads/)
[![MCP](https://img.shields.io/badge/protocol-MCP-0A7EA4)](https://modelcontextprotocol.io)
[![Playwright](https://img.shields.io/badge/browser-Playwright-2EAD33?logo=playwright&logoColor=white)](https://playwright.dev)
[![Crawl4AI](https://img.shields.io/badge/extractor-Crawl4AI-111827)](https://github.com/unclecode/crawl4ai)
[![PyPI](https://img.shields.io/pypi/v/crawl4agent)](https://pypi.org/project/crawl4agent/)
[![GitHub stars](https://img.shields.io/github/stars/pazyork/crawl4ai-mcp?style=social)](https://github.com/pazyork/crawl4ai-mcp)

**A minimal MCP server for agent-friendly web extraction and search.**

Two tools: fetch real pages with Playwright + Crawl4AI, or search across 7 engines with automatic fallback.

</div>

---

## Quick entry

| Audience | Read this |
|---|---|
| Human developer | **[README.zh-CN.md](./README.zh-CN.md)** / **[README.md](./README.md)** |
| Living in the AI era, delegating your remaining sanity to an agent | **[README_AGENT.md](./README_AGENT.md)** |

## At a glance

| Item | Reality in this repo |
|---|---|
| MCP tools | **2 tools**: `fetch_urls` + `search_web` |
| Single-page fetch | `urls: ["https://example.com"]` |
| Web search | `search_web(query="...", engine="auto")` — 7 engines, auto fallback |
| Search engines | DuckDuckGo · Bing · Google · Yandex · Sogou · 360Search · Baidu |
| Output | `title + content + links + blocked + llm_used/llm_error` |
| Non-LLM mode | First-class, default, usable without any model |
| LLM mode | **Off by default**. Enabled only with `use_llm=true` + optional `llm_instruction` |
| Fallback | Missing/failed LLM call automatically falls back to the non-LLM result |
| Anti-bot realism | proxy / cookies / persistent profile / randomized browser behavior |
| License | **AGPL-3.0-or-later** |

---

## How it works

**Fetch flow:**

```mermaid
flowchart LR
A[URL list] --> B[Playwright + Crawl4AI]
B --> C{Fast path enough?}
C -- Yes --> D[Markdown / HTML]
C -- No --> E[Stronger fallback]
E --> D
D --> F{use_llm?}
F -- No --> G[Return result]
F -- Yes --> H[OpenAI-compatible cleanup]
H --> I{LLM success?}
I -- Yes --> J[Return enhanced result]
I -- No --> G
```

**Search flow:**

```mermaid
flowchart LR
A[query + engine] --> B{engine=auto?}
B -- Yes --> C[Detect language]
C --> D[Build engine plan]
B -- No --> E[Use specified engine]
D --> F[Try engines in order]
E --> F
F --> G{Results?}
G -- Yes --> H[Aggregate + deduplicate]
G -- No, next engine --> F
H --> I[Return results]
```

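The "detect language → build engine plan" step in the search flow can be sketched roughly as follows. This is a hypothetical illustration, not the heuristic actually implemented in `crawl4ai_mcp/searcher.py`; the real engine names and ordering for non-English queries may differ.

```python
def build_engine_plan(query: str) -> list[str]:
    """Sketch of language-aware engine planning (illustrative only)."""
    # Assumption: CJK-heavy queries favor engines that handle Chinese well.
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in query)
    if has_cjk:
        # assumed order for Chinese queries -- not taken from the source
        return ["baidu", "sogou", "360search", "bing", "duckduckgo"]
    # default fallback order as documented below for engine="auto"
    return ["duckduckgo", "bing", "google", "baidu"]
```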
---

## Why this project exists

Most generic “web fetch” tools either fail on JS-heavy pages or return too much boilerplate. This project focuses on four things:

- **Non-LLM quality first**: usable even with zero model config
- **Minimal MCP surface**: easier for agents, easier to maintain
- **Pragmatic anti-bot workflow**: proxy / cookies / persistent profile are first-class
- **Golden regression review**: full markdown outputs can be saved and inspected page by page

---

## Core capabilities

### Non-LLM mode

| Capability | Actual behavior |
|---|---|
| Rendering | Real browser rendering via Playwright |
| Extraction | Crawl4AI markdown/HTML extraction |
| Fallback | Fast path → stronger path when content is too thin |
| Cleanup | Remove obvious noise, compress blank lines, strip data-image placeholders |
| Site tuning | Medium / Claude Docs / GitHub and other mainstream sites |
| Block detection | `blocked=true` for likely verification/interstitial output |
| Batch control | Bounded concurrency via `concurrency` |

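The cleanup row above describes ordinary text hygiene; a minimal sketch of that kind of post-processing looks like this (`tidy_markdown` is a hypothetical name, not the repo's actual function):

```python
import re

def tidy_markdown(md: str) -> str:
    # strip inline data-image placeholders left behind by extraction
    md = re.sub(r"!\[[^\]]*\]\(data:image/[^)]*\)", "", md)
    # compress runs of 3+ newlines down to a single blank line
    md = re.sub(r"\n{3,}", "\n\n", md)
    return md.strip()
```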
### Optional LLM mode

| Input | Meaning |
|---|---|
| `use_llm=true` | Turn on post-cleanup with an OpenAI-compatible model |
| `llm_instruction` | Tell the model what to keep / remove |

**Important reality check:**

- With `llm_instruction`, the prompt is **constraint-heavy** and biased toward preserving original lines.
- Without `llm_instruction`, the model does a more generic “clean readable markdown” pass.
- If the LLM call fails for any reason, the tool returns the original non-LLM extraction plus `llm_used=false` and `llm_error`.

---

## MCP Tools

### `fetch_urls`

```json
{
  "urls": ["https://a.com", "https://b.com"],
  "format": "markdown",
  "max_chars": 200000,
  "concurrency": 3,
  "use_llm": false,
  "llm_instruction": "keep only the tutorial body and in-body references"
}
```

Use a single-element list if you only need one page.

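Bounded batch fetching of the `urls` list can be sketched with a semaphore. This is illustrative only; `fetch_one` is a stand-in for the real per-URL crawl, not an API exported by this package:

```python
import asyncio

async def fetch_all(urls, fetch_one, concurrency=3):
    # at most `concurrency` fetches run at the same time
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with sem:
            return await fetch_one(url)

    # results come back in the same order as `urls`
    return await asyncio.gather(*(bounded(u) for u in urls))
```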
### Return shape

| Field | Meaning |
|---|---|
| `url` | Original URL |
| `final_url` | Final resolved URL after redirects |
| `title` | Extracted title |
| `content` | Markdown or HTML |
| `content_format` | `markdown` or `html` |
| `links` | Normalized extracted links |
| `blocked` | Likely anti-bot / verification / denied result |
| `llm_used` | Whether LLM enhancement was actually applied |
| `llm_error` | Why the LLM step degraded |

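A caller can branch on `blocked` and the LLM fields like this. This is a hypothetical consumer-side helper, not part of the server:

```python
def summarize_result(result: dict) -> str:
    # blocked pages carry no usable content; surface the remedies instead
    if result.get("blocked"):
        return f"{result['url']}: blocked (try a proxy, cookies, or a persistent profile)"
    note = ""
    # llm_error explains why the LLM step degraded to plain extraction
    if not result.get("llm_used") and result.get("llm_error"):
        note = f" [LLM skipped: {result['llm_error']}]"
    return f"{result.get('final_url', result['url'])}: {len(result.get('content', ''))} chars{note}"
```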
### `search_web`

```json
{
  "query": "crawl4ai web scraping",
  "engine": "auto",
  "max_results": 10,
  "lang": ""
}
```

| Parameter | Default | Description |
|---|---|---|
| `query` | (required) | Search query string |
| `engine` | `auto` | `auto` or a specific engine, e.g. `google`, `bing`, `duckduckgo`, `baidu` |
| `max_results` | `10` | Maximum number of results |
| `lang` | `""` | Language hint (e.g. `en`, `zh-CN`) |

When `engine="auto"`, the server tries engines in fallback order: DuckDuckGo → Bing → Google → Baidu. The first engine that returns results wins.

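The auto-fallback loop amounts to roughly this. A sketch only: the engine callables are stand-ins for the real scrapers, and the exact dedup/aggregation rules in `searcher.py` may differ:

```python
def search_with_fallback(query, engines, max_results=10):
    """Try engines in order; the first one that returns results wins.

    `engines` is a list of (name, fn) pairs, where fn(query) returns a
    list of {title, url, snippet} dicts or raises on failure.
    """
    tried = []
    for name, fn in engines:
        try:
            results = fn(query)
        except Exception:
            results = []
        if results:
            seen, deduped = set(), []
            for r in results:              # aggregate + deduplicate by URL
                if r["url"] not in seen:
                    seen.add(r["url"])
                    deduped.append(r)
            kept = deduped[:max_results]
            return {"engine": name, "query": query, "results": kept,
                    "total": len(kept), "fallback_engines_tried": tried}
        tried.append(name)
    return {"engine": None, "query": query, "results": [],
            "total": 0, "fallback_engines_tried": tried}
```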
195
+ #### Search return shape
196
+
197
+ | Field | Meaning |
198
+ |---|---|
199
+ | `engine` | Which engine actually returned results |
200
+ | `query` | Original query |
201
+ | `results` | List of `{title, url, snippet}` |
202
+ | `total` | Number of results |
203
+ | `fallback_engines_tried` | Engines that failed before the successful one |
204
+
205
+ ---
206
+
207
+ ## Anti-bot realism
208
+
209
+ The server already includes randomized browser behavior in code:
210
+
211
+ | Mechanism | Actual status |
212
+ |---|---|
213
+ | Random viewport | Yes |
214
+ | Random user agent mode | Yes, when explicit UA is not provided |
215
+ | Delay jitter | Yes |
216
+ | `override_navigator` | Yes |
217
+ | `simulate_user` | Yes, in stronger fallback mode |
218
+ | Proxy / cookies / persistent profile | Supported via env vars |
219
+ | Cloudflare bypass | Enhanced browser fingerprinting + configurable wait strategies |
220
+
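“Random viewport” and “delay jitter” are simple techniques; here is the general idea in miniature (illustrative values, not the repo's exact parameters):

```python
import random

# viewports that real desktop browsers commonly report
COMMON_VIEWPORTS = [(1280, 720), (1366, 768), (1440, 900), (1536, 864), (1920, 1080)]

def random_viewport():
    # a fresh plausible viewport per session avoids a fixed fingerprint
    return random.choice(COMMON_VIEWPORTS)

def jittered_delay(base_s=1.0, spread_s=0.4):
    # vary the delay so request timing does not look machine-regular
    return max(0.1, base_s + random.uniform(-spread_s, spread_s))
```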
**Note**: For overseas websites (Medium, ProductHunt, etc.), using a proxy is recommended. The server supports HTTP/HTTPS/SOCKS5 proxies via the `CRAWL4AI_MCP_PROXY` environment variable.

### Proxy input formats

`CRAWL4AI_MCP_PROXY` accepts all of these:

| Input | Interpreted as |
|---|---|
| `http://127.0.0.1:7890` | HTTP proxy |
| `https://127.0.0.1:7890` | HTTPS proxy |
| `socks5://127.0.0.1:7890` | SOCKS5 proxy |
| `socket5://127.0.0.1:7890` | Auto-normalized to `socks5://...` |
| `127.0.0.1:7890` | Auto-normalized to `http://127.0.0.1:7890` |
| `7890` | Auto-normalized to `http://127.0.0.1:7890` |

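The normalization rules in the table reduce to a small function. A sketch that reproduces the table's behavior; the function name and internals are hypothetical, not the package's actual code:

```python
def normalize_proxy(value: str) -> str:
    value = value.strip()
    if not value:
        return ""
    if value.isdigit():                       # "7890" -> local HTTP proxy
        return f"http://127.0.0.1:{value}"
    if value.startswith("socket5://"):        # common typo for socks5
        return "socks5://" + value[len("socket5://"):]
    if "://" not in value:                    # "host:port" -> HTTP proxy
        return f"http://{value}"
    return value                              # already a full proxy URL
```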
This project does not claim “perfect stealth”; what it can honestly claim is **human-like randomization** and **practical anti-bot knobs**.

---

## Quickstart

### Conda

```bash
conda env create -f environment.yml
conda activate crawl4ai-mcp
python -m playwright install
crawl4ai-mcp
```

### venv

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e '.[dev]'
python -m playwright install
crawl4ai-mcp
```

---

## MCP server config example

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "crawl4ai-mcp",
      "env": {
        "CRAWL4AI_MCP_HEADLESS": "true",
        "CRAWL4AI_MCP_PROXY": "127.0.0.1:7890",
        "CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS": "30000",
        "CRAWL4AI_MCP_WAIT_UNTIL": "load",

        "OPENAI_BASE_URL": "https://your-openai-compatible-host",
        "OPENAI_API_KEY": "your-api-key",
        "OPENAI_MODEL": "your-model-name"
      }
    }
  }
}
```

LLM-related env vars are **optional**. `use_llm` is still **off by default** at call time. If any LLM env var is missing or invalid, or the model call fails, the server automatically falls back to non-LLM extraction.

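That fallback contract can be sketched as follows. Illustrative only: `call_llm` stands in for the OpenAI-compatible client, and `enhance` is a hypothetical name:

```python
def enhance(raw_markdown: str, call_llm):
    """Return LLM-cleaned content, or fall back to the raw extraction."""
    try:
        cleaned = call_llm(raw_markdown)
    except Exception as exc:                  # missing/invalid env, API error, ...
        return {"content": raw_markdown, "llm_used": False, "llm_error": str(exc)}
    if not cleaned or not cleaned.strip():    # an empty response also degrades
        return {"content": raw_markdown, "llm_used": False, "llm_error": "empty response"}
    return {"content": cleaned, "llm_used": True, "llm_error": None}
```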
---

## Runtime configuration

| Env var | Purpose |
|---|---|
| `CRAWL4AI_MCP_HEADLESS` | Run the browser headless |
| `CRAWL4AI_MCP_PROXY` | Upstream proxy; supports `http://`, `https://`, `socks5://`, `host:port`, and port-only |
| `CRAWL4AI_MCP_COOKIES_JSON` | Playwright storage state JSON |
| `CRAWL4AI_MCP_USE_PERSISTENT_CONTEXT` | Reuse a browser profile |
| `CRAWL4AI_MCP_USER_DATA_DIR` | Profile directory |
| `CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS` | Maximum wait for a single navigation, default `30000` |
| `CRAWL4AI_MCP_WAIT_UNTIL` | Page readiness strategy, default `load` |
| `OPENAI_BASE_URL` | OpenAI-compatible base URL |
| `OPENAI_API_KEY` | API key |
| `OPENAI_MODEL` | Model name |

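Reading these variables with the stdlib looks like this. A minimal sketch: the package declares pydantic-settings as a dependency, so the actual config layer and field names likely differ:

```python
import os

def load_settings(env=None):
    # read the subset of CRAWL4AI_MCP_* vars shown in the table above
    env = os.environ if env is None else env
    return {
        "headless": env.get("CRAWL4AI_MCP_HEADLESS", "true").lower() == "true",
        "proxy": env.get("CRAWL4AI_MCP_PROXY", ""),
        "navigation_timeout_ms": int(env.get("CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS", "30000")),
        "wait_until": env.get("CRAWL4AI_MCP_WAIT_UNTIL", "load"),
    }
```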
---

## Golden smoke regression

```bash
CRAWL4AI_MCP_SMOKE_DIR=./_golden_outputs .venv/bin/python -m crawl4ai_mcp.smoke_golden
```

This writes full markdown outputs to `_golden_outputs/` so you can inspect extraction quality page by page.

The golden set now includes the earlier baseline URLs plus `ainew.me`, `openclaw`, `watcha`, `producthunt`, `mydrivers`, `caihongtu`, `openrouter`, and mobile Douban. For sites outside mainland China, proxy-based verification is recommended.

Some overseas sites may still return Cloudflare or similar verification pages even when a proxy is configured. In those cases the server marks them with `blocked=true`. The recommended remedies are better proxy quality, valid cookies, or a persistent browser profile created after manual verification.

---

## Prior art

- Crawl4AI: <https://github.com/unclecode/crawl4ai>
- mcp-crawl4ai-rag: <https://github.com/coleam00/mcp-crawl4ai-rag>
- weidwonder/crawl4ai-mcp-server: <https://github.com/weidwonder/crawl4ai-mcp-server>
- WaterCrawl: <https://github.com/watercrawl/WaterCrawl>
- teracrawl: <https://github.com/BrowserCash/teracrawl>

---

## License

This project is licensed under **AGPL-3.0-or-later**.
@@ -0,0 +1,16 @@
crawl4agent-0.2.0.dist-info/licenses/LICENSE,sha256=q76n28XV0G2jnoJ1AAHDGQ5UA03tayDMJfIkiavcz-s,28779
crawl4ai_mcp/__init__.py,sha256=QwLGZDZhVE7dBKbOwCLRt3tRMXTtg8fYYHHDXi3sZ1c,49
crawl4ai_mcp/__main__.py,sha256=nYX5vCIfnhF3QWurxQTTX5gqH7OnaJjpaKoZvkegocA,108
crawl4ai_mcp/config.py,sha256=4tpCgkgYDU6tI-58blaGecnGLyPFnJyUWkxbE04KFaU,1718
crawl4ai_mcp/crawler.py,sha256=o_iEdbEOyeFHaSMxTXsbbiNOTM0qfnf7CiUdVxygHeM,21901
crawl4ai_mcp/golden_urls.py,sha256=hSA8XumUMXTjQkb0AIq4OiO1QqyxWDUu-GRGAJ8QQ_c,610
crawl4ai_mcp/mcp_server.py,sha256=AT0N0cj7e-khthGBnyJh6vJMZxduJHmf9EH_L6B5IFw,5570
crawl4ai_mcp/openai_client.py,sha256=jeDzP6wUBtbxz5vqU1YvXMPMrIWHDbchamKGkkBYbO0,1902
crawl4ai_mcp/searcher.py,sha256=DO4B8nadPhO_75tiY9Gs3fqOeo3RqqCCBiDlWUi_Yk4,16300
crawl4ai_mcp/smoke_golden.py,sha256=909rNrp9dH8ha27jNzgGl-FWBxKIROsECtLqPbX2GUw,1541
crawl4ai_mcp/types.py,sha256=9MlM8uzF57sspxRR4_oZDbtBJ0KwAdptpxC89CFZlqk,787
crawl4agent-0.2.0.dist-info/METADATA,sha256=OsHvut99ImsDxz6ifiQ2t5uS5MKieIVpx_s5PTkQZGs,10951
crawl4agent-0.2.0.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
crawl4agent-0.2.0.dist-info/entry_points.txt,sha256=RFIevJQ7FOVwGIA1ISB-1vNESczgn9H--DXRItQzouU,60
crawl4agent-0.2.0.dist-info/top_level.txt,sha256=vnjkvW11ahFhq93JqHBtcfuIWorAbXE9WL90RkFZ__o,13
crawl4agent-0.2.0.dist-info/RECORD,,
@@ -0,0 +1,5 @@
Wheel-Version: 1.0
Generator: setuptools (82.0.1)
Root-Is-Purelib: true
Tag: py3-none-any

@@ -0,0 +1,2 @@
[console_scripts]
crawl4ai-mcp = crawl4ai_mcp.__main__:main