PyPI - stealthfetch - Versions diffs - 0.2.0__tar.gz - Mend

stealthfetch 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (44)

stealthfetch-0.2.0/.github/workflows/ci.yml +47 -0
stealthfetch-0.2.0/.github/workflows/publish.yml +34 -0
stealthfetch-0.2.0/.gitignore +37 -0
stealthfetch-0.2.0/CHANGELOG.md +24 -0
stealthfetch-0.2.0/CLAUDE.md +51 -0
stealthfetch-0.2.0/LICENSE +21 -0
stealthfetch-0.2.0/PKG-INFO +257 -0
stealthfetch-0.2.0/README.md +215 -0
stealthfetch-0.2.0/examples/async_usage.py +23 -0
stealthfetch-0.2.0/examples/basic_usage.py +8 -0
stealthfetch-0.2.0/examples/browser_mode.py +21 -0
stealthfetch-0.2.0/examples/mcp_config.json +7 -0
stealthfetch-0.2.0/pyproject.toml +94 -0
stealthfetch-0.2.0/skill/SKILL.md +69 -0
stealthfetch-0.2.0/skill/reference.md +95 -0
stealthfetch-0.2.0/src/stealthfetch/__init__.py +32 -0
stealthfetch-0.2.0/src/stealthfetch/_browsers/__init__.py +75 -0
stealthfetch-0.2.0/src/stealthfetch/_browsers/_camoufox.py +73 -0
stealthfetch-0.2.0/src/stealthfetch/_browsers/_constants.py +16 -0
stealthfetch-0.2.0/src/stealthfetch/_browsers/_patchright.py +67 -0
stealthfetch-0.2.0/src/stealthfetch/_compat.py +50 -0
stealthfetch-0.2.0/src/stealthfetch/_core.py +482 -0
stealthfetch-0.2.0/src/stealthfetch/_detect.py +90 -0
stealthfetch-0.2.0/src/stealthfetch/_errors.py +134 -0
stealthfetch-0.2.0/src/stealthfetch/cli.py +140 -0
stealthfetch-0.2.0/src/stealthfetch/mcp_server.py +112 -0
stealthfetch-0.2.0/src/stealthfetch/py.typed +0 -0
stealthfetch-0.2.0/tests/conftest.py +34 -0
stealthfetch-0.2.0/tests/fixtures/article.html +20 -0
stealthfetch-0.2.0/tests/fixtures/captcha.html +13 -0
stealthfetch-0.2.0/tests/fixtures/cloudflare_block.html +12 -0
stealthfetch-0.2.0/tests/fixtures/reddit_challenge.html +221 -0
stealthfetch-0.2.0/tests/fixtures/tables.html +23 -0
stealthfetch-0.2.0/tests/integration/conftest.py +24 -0
stealthfetch-0.2.0/tests/integration/test_live.py +32 -0
stealthfetch-0.2.0/tests/test_browsers.py +153 -0
stealthfetch-0.2.0/tests/test_cli.py +119 -0
stealthfetch-0.2.0/tests/test_compat.py +36 -0
stealthfetch-0.2.0/tests/test_convert.py +27 -0
stealthfetch-0.2.0/tests/test_core.py +373 -0
stealthfetch-0.2.0/tests/test_detect.py +126 -0
stealthfetch-0.2.0/tests/test_extract.py +71 -0
stealthfetch-0.2.0/tests/test_mcp.py +197 -0
stealthfetch-0.2.0/tests/test_validation.py +144 -0

stealthfetch-0.2.0/.github/workflows/ci.yml ADDED

@@ -0,0 +1,47 @@
+name: CI
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+          cache: pip
+      - run: pip install -e ".[dev]"
+      - run: ruff check src/ tests/
+      - run: mypy src/
+  test:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [ubuntu-latest, macos-latest]
+        python-version: ["3.10", "3.11", "3.12", "3.13"]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: pip
+      - run: pip install -e ".[dev]"
+      - run: pytest tests/ -v
+  integration:
+    if: github.event_name == 'push'
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+          cache: pip
+      - run: pip install -e ".[dev]"
+      - run: pytest tests/integration/ --run-integration -v

stealthfetch-0.2.0/.github/workflows/publish.yml ADDED

@@ -0,0 +1,34 @@
+name: Publish to PyPI
+on:
+  push:
+    tags: ["v*"]
+permissions:
+  id-token: write
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install -e ".[dev]"
+      - run: ruff check src/ tests/
+      - run: mypy src/
+      - run: pytest tests/ -v
+  publish:
+    needs: test
+    runs-on: ubuntu-latest
+    environment: pypi
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install build
+      - run: python -m build
+      - uses: pypa/gh-action-pypi-publish@release/v1

stealthfetch-0.2.0/.gitignore ADDED

@@ -0,0 +1,37 @@
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+*.egg
+dist/
+build/
+# Type checkers / linters
+.mypy_cache/
+.ruff_cache/
+# Testing
+.pytest_cache/
+htmlcov/
+.coverage
+coverage.xml
+# Environments
+.env
+.venv/
+venv/
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+*~
+# OS
+.DS_Store
+Thumbs.db
+# Project
+sacredtexts.md
+.claude/

stealthfetch-0.2.0/CHANGELOG.md ADDED

@@ -0,0 +1,24 @@
+# Changelog
+## 0.2.0 (2026-02-24)
+- Add `fetch_result()` / `afetch_result()` — same pipeline as `fetch_markdown`, returns `FetchResult` dataclass with `markdown` + metadata fields (`title`, `author`, `date`, `description`, `url`, `hostname`, `sitename`) extracted as a free side-effect of trafilatura parsing
+- Add `FetchResult` dataclass, exported from the top-level package
+- MCP server: add `include_metadata` parameter to `fetch_markdown` tool — when `True`, returns JSON with markdown and metadata instead of plain string
+## 0.1.0 (2026-02-24)
+Initial release.
+- 3-layer pipeline: fetch (curl_cffi) → extract (trafilatura) → convert (html-to-markdown)
+- Auto-escalation from HTTP to stealth browser on block detection
+- Browser backends: Camoufox (default) and Patchright (fallback)
+- Block detection: HTTP status codes, content-type awareness, pattern matching (Cloudflare, DataDome, PerimeterX, Akamai)
+- SSRF protection: rejects private IPs, non-http(s) schemes, DNS rebinding, redirect-chain exploits
+- CLI: `stealthfetch <url>` with proxy, timeout, headers, and output options
+- MCP server: `stealthfetch-mcp` with full parameter support
+- Async support: `afetch_markdown()`
+- Proxy support with optional authentication
+- Custom HTTP headers
+- Response size limit (50 MB)
+- Strict type hints (mypy strict) and full linting (ruff)
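
The SSRF bullet in the 0.1.0 changelog entry (private IPs, non-http(s) schemes, redirect chains) can be illustrated with a small standard-library sketch. This is not the code in `_errors.py`; the helper names `is_public_ip` and `is_url_allowed` are hypothetical, and DNS resolution is left to the caller so that the check itself stays deterministic.

```python
import ipaddress
from urllib.parse import urlsplit


def is_public_ip(addr: str) -> bool:
    """Return True only for addresses outside private and special-use ranges."""
    ip = ipaddress.ip_address(addr)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so the IPv4 rules apply.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_url_allowed(url: str, resolved_ips: list[str]) -> bool:
    """Hypothetical check: http(s) scheme, a hostname, and only public IPs.

    `resolved_ips` is whatever the caller's DNS lookup returned for the host.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    # An empty resolution result is treated as a failure, not a pass.
    return bool(resolved_ips) and all(is_public_ip(a) for a in resolved_ips)
```

To match the "validated twice" behavior described in the package notes, a caller would presumably run the same check on the initial URL and again on each URL reached through a redirect.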

stealthfetch-0.2.0/CLAUDE.md ADDED

@@ -0,0 +1,51 @@
+# StealthFetch
+URL in, LLM-ready markdown out. Orchestration layer over curl_cffi, trafilatura, html-to-markdown, Camoufox, and Patchright.
+## Architecture
+Three-layer pipeline in `src/stealthfetch/_core.py`: **fetch → extract → convert**.
+- **Fetch** — HTTP via curl_cffi with Chrome TLS fingerprint. Auto-escalates to stealth browser on block detection.
+- **Extract** — trafilatura strips nav, ads, boilerplate. Returns clean HTML.
+- **Convert** — html-to-markdown (Rust) produces final markdown.
+Key modules:
+- `_core.py` — pipeline + public API (`fetch_markdown`, `afetch_markdown`, `fetch_result`, `afetch_result`, `FetchResult`)
+- `_detect.py` — block detection heuristics. Strong patterns (vendor-specific, always checked) vs weak patterns (generic, checked only on small pages <15k chars)
+- `_errors.py` — exception hierarchy + SSRF URL/proxy validation (pre- and post-redirect)
+- `_compat.py` — feature detection for optional browser deps (non-cached, allows mid-process install)
+- `_browsers/` — browser backend abstraction. Dispatcher resolves "auto" → camoufox (preferred) or patchright
+- `cli.py` — CLI entry point
+- `mcp_server.py` — MCP server entry point (FastMCP, single `fetch_markdown` tool)
+## Public API
+4 functions: `fetch_markdown`, `afetch_markdown`, `fetch_result`, `afetch_result`
+1 dataclass: `FetchResult` (markdown, title, author, date, description, url, hostname, sitename)
+3 exceptions: `FetchError`, `ExtractionError`, `BrowserNotAvailable` (all inherit `StealthFetchError`)
+## Conventions
+- Strict mypy (`--strict` equivalent via pyproject.toml)
+- Ruff linting: E, F, W, I, UP, B, SIM, C4, RUF, PERF, LOG
+- Lazy imports for optional deps (browser backends, mcp) — keep startup fast
+- `_` prefix for all private modules
+- Async variants use `a` prefix (`afetch_markdown`)
+- CPU-bound work runs off the event loop via `asyncio.to_thread` in async paths
+## Commands
+```bash
+pytest                       # unit tests (156 tests)
+pytest --run-integration     # + live HTTP tests
+ruff check src/ tests/       # lint
+mypy src/                    # type check
+```
+## Design Decisions
+- HTTP-first, browser-only-when-needed — browsers are slow and detectable
+- Strong vs weak pattern split in `_detect.py` prevents false-positive escalation on large articles
+- SSRF validated twice: before request (literal IP + DNS resolution) and after redirects
+- `FetchResult` metadata comes free from trafilatura's existing parse — no extra HTTP calls
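
The strong versus weak pattern split described for `_detect.py` can be sketched as follows. This is a simplified illustration, not the module's actual code: the function name `looks_blocked` and the constant names are hypothetical, and the pattern lists contain only the examples quoted in the package's own documentation. The real module is also described as content-type aware, which this sketch omits.

```python
BLOCK_STATUS_CODES = {403, 429, 503}

# Vendor-specific markers: checked on every page, regardless of size.
STRONG_PATTERNS = ("_cf_chl_opt", "perimeterx")

# Generic phrases: only meaningful on small pages, where they are
# unlikely to be ordinary article text.
WEAK_PATTERNS = ("just a moment", "access denied")
WEAK_PATTERN_MAX_CHARS = 15_000


def looks_blocked(status_code: int, html: str) -> bool:
    """Return True when a response resembles an anti-bot block page."""
    # Fast first pass: status codes commonly used by block pages.
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = html.lower()
    if any(pattern in lowered for pattern in STRONG_PATTERNS):
        return True
    # Weak patterns are skipped on large pages to avoid false positives.
    if len(html) < WEAK_PATTERN_MAX_CHARS:
        return any(pattern in lowered for pattern in WEAK_PATTERNS)
    return False
```

The size gate is the part that matters: a long article that happens to contain "access denied" is not escalated, while a short challenge page with the same phrase is.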

stealthfetch-0.2.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 leba01
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

stealthfetch-0.2.0/PKG-INFO ADDED

@@ -0,0 +1,257 @@
+Metadata-Version: 2.4
+Name: stealthfetch
+Version: 0.2.0
+Summary: URL in, LLM-ready markdown out. Stealth fetch with anti-bot bypass.
+Project-URL: Homepage, https://github.com/leba01/stealthfetch
+Project-URL: Repository, https://github.com/leba01/stealthfetch
+Project-URL: Issues, https://github.com/leba01/stealthfetch/issues
+Author: leba01
+License-Expression: MIT
+License-File: LICENSE
+Keywords: anti-bot,llm,markdown,scraping,stealth
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Internet :: WWW/HTTP
+Classifier: Topic :: Text Processing :: Markup :: Markdown
+Classifier: Typing :: Typed
+Requires-Python: >=3.10
+Requires-Dist: curl-cffi>=0.14.0
+Requires-Dist: html-to-markdown>=2.25.0
+Requires-Dist: trafilatura>=1.8.0
+Provides-Extra: browser
+Requires-Dist: camoufox[geoip]>=0.4.11; extra == 'browser'
+Requires-Dist: patchright>=1.50; extra == 'browser'
+Provides-Extra: camoufox
+Requires-Dist: camoufox[geoip]>=0.4.11; extra == 'camoufox'
+Provides-Extra: dev
+Requires-Dist: mypy>=1.13; extra == 'dev'
+Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
+Requires-Dist: pytest>=8.0; extra == 'dev'
+Requires-Dist: ruff>=0.9.0; extra == 'dev'
+Provides-Extra: mcp
+Requires-Dist: mcp>=1.26.0; extra == 'mcp'
+Provides-Extra: patchright
+Requires-Dist: patchright>=1.50; extra == 'patchright'
+Description-Content-Type: text/markdown
+# StealthFetch
+[![CI](https://github.com/leba01/stealthfetch/actions/workflows/ci.yml/badge.svg)](https://github.com/leba01/stealthfetch/actions/workflows/ci.yml)
+[![PyPI](https://img.shields.io/pypi/v/stealthfetch)](https://pypi.org/project/stealthfetch/)
+[![Python](https://img.shields.io/pypi/pyversions/stealthfetch)](https://pypi.org/project/stealthfetch/)
+[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
+URL in, LLM-ready markdown out.
+```python
+from stealthfetch import fetch_markdown
+md = fetch_markdown("https://en.wikipedia.org/wiki/Web_scraping")
+```
+Fetches any web page, strips nav, ads, and boilerplate, returns clean markdown. If the site blocks you, it auto-escalates to a stealth browser. One function, no config.
+StealthFetch doesn't reinvent the hard parts: [curl_cffi](https://github.com/lexiforest/curl_cffi), [trafilatura](https://github.com/adbar/trafilatura), [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown), [Camoufox](https://github.com/daijro/camoufox), and [Patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright) do the heavy lifting. StealthFetch is the orchestration layer: wiring them together, detecting blocks, deciding when to escalate, and handling the security concerns most tools skip.
+## How It Works
+```
+URL
+ │
+ ▼
+┌───────────────────────────────────────────┐
+│  FETCH          curl_cffi                 │
+│                 Chrome TLS fingerprint    │
+│                 ↓ blocked?                │
+│                 auto-escalate to stealth  │
+│                 browser (Camoufox /       │
+│                 Patchright)               │
+└─────────────────┬─────────────────────────┘
+                  │
+┌─────────────────▼─────────────────────────┐
+│  EXTRACT        trafilatura               │
+│                 strips nav, ads,          │
+│                 boilerplate               │
+└─────────────────┬─────────────────────────┘
+                  │
+┌─────────────────▼─────────────────────────┐
+│  CONVERT        html-to-markdown (Rust)   │
+└─────────────────┬─────────────────────────┘
+                  │
+                  ▼
+               markdown
+```
+Each layer is one library call. The libraries do the hard work.
+## What StealthFetch Owns
+### Block Detection
+Most anti-bot systems give themselves away before you ever see a captcha. StealthFetch uses status codes (403, 429, 503) as a fast first pass, then pattern-matches HTML signatures from Cloudflare, DataDome, PerimeterX, and Akamai. The trick is knowing when *not* to check: vendor-specific signatures (like `_cf_chl_opt` or `perimeterx`) are always checked because they never appear in real content. Generic phrases like "just a moment" or "access denied" are only checked on small pages (< 15k chars) since on a real article those strings are just words.
+### Auto-Escalation
+Headless browsers are slow, heavy, and detectable in their own right. An HTTP request with a Chrome TLS fingerprint (via curl_cffi) gets through most sites just fine, so StealthFetch always tries HTTP first and only spins up a stealth browser when the response actually looks blocked. The interesting part isn't the browser itself; it's the decision of *when* to use it.
+### SSRF Protection
+Most scraping tools ([including ones with 60-85k GitHub stars](https://www.bluerock.io/post/mcp-furi-microsoft-markitdown-vulnerabilities)) trust whatever URL you hand them. StealthFetch doesn't. A hostname that resolves to `127.0.0.1`? Rejected. A redirect chain that bounces through three domains and lands on a private IP? Caught. IPv4-mapped IPv6 bypasses and link-local addresses are covered too: every target is validated before the request goes out, and again after redirects resolve.
+## Works On
+Most sites return clean markdown in **under a second**. Sites that fight back (Reddit, Amazon) get auto-escalated to a stealth browser — takes **5–8 seconds** but you don't have to think about it.
+| Site | What You Get |
+|------|-------------|
+| Wikipedia, Reuters, BBC News, TechCrunch | Articles and news — straight through |
+| Hacker News | Threads and comments |
+| Stack Overflow | Q&A with code blocks |
+| Medium | Articles — Cloudflare-protected, but no false-positive escalation (passive JS, not a block page) |
+| Reddit | Blocked by challenge page → auto-escalates to browser |
+| Amazon | Blocked by CAPTCHA → auto-escalates to browser |
+## Install
+Try it — no install needed (requires [uv](https://docs.astral.sh/uv/getting-started/installation/)):
+```bash
+uvx stealthfetch https://en.wikipedia.org/wiki/Web_scraping
+```
+Install as a library:
+```bash
+pip install stealthfetch
+```
+> **Note:** trafilatura brings ~20 transitive dependencies (lxml, charset-normalizer, etc.). Total install is ~50 packages.
+Add stealth browser support (required for auto-escalation):
+```bash
+pip install "stealthfetch[browser]"
+camoufox fetch
+```
+## CLI
+```bash
+stealthfetch https://en.wikipedia.org/wiki/Web_scraping
+stealthfetch https://spa-app.com -m browser
+stealthfetch https://example.com --no-links --no-tables
+stealthfetch https://example.com --header "Cookie: session=abc"
+```
+## MCP Server
+StealthFetch is an [MCP](https://modelcontextprotocol.io/) server — any MCP client (Claude Desktop, Claude Code, Cursor, etc.) can call it as a tool to fetch web pages as markdown.
+No install needed — add this to your MCP client config:
+```json
+{
+  "mcpServers": {
+    "stealthfetch": {
+      "command": "uvx",
+      "args": ["--from", "stealthfetch[mcp]", "stealthfetch-mcp"]
+    }
+  }
+}
+```
+Or if you prefer a persistent install:
+```bash
+pip install "stealthfetch[mcp]"
+```
+```json
+{
+  "mcpServers": {
+    "stealthfetch": {
+      "command": "stealthfetch-mcp"
+    }
+  }
+}
+```
+## API
+### `fetch_markdown(url, **kwargs) -> str`
+Also available as `afetch_markdown` — same signature, async. Extraction and conversion run off the event loop via `asyncio.to_thread`.
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `url` | `str` | required | URL to fetch |
+| `method` | `str` | `"auto"` | `"auto"`, `"http"`, or `"browser"` |
+| `browser_backend` | `str` | `"auto"` | `"auto"`, `"camoufox"`, or `"patchright"` |
+| `include_links` | `bool` | `True` | Preserve hyperlinks |
+| `include_images` | `bool` | `False` | Preserve image references |
+| `include_tables` | `bool` | `True` | Preserve tables |
+| `timeout` | `int` | `30` | Timeout in seconds |
+| `proxy` | `dict` | `None` | `{"server": "...", "username": "...", "password": "..."}` |
+| `headers` | `dict` | `None` | Additional HTTP headers |
+### `fetch_result(url, **kwargs) -> FetchResult`
+Same fetch/extract/convert pipeline as `fetch_markdown`, but returns a structured dataclass with the markdown **and** page metadata extracted as a free side-effect of parsing.
+```python
+from stealthfetch import fetch_result
+r = fetch_result("https://en.wikipedia.org/wiki/Web_scraping", method="http")
+print(r.title)       # "Web scraping"
+print(r.author)      # "Wikipedia contributors" (when available)
+print(r.date)        # ISO 8601 date (when available)
+print(r.markdown[:200])
+```
+`FetchResult` fields:
+| Field | Type | Description |
+|-------|------|-------------|
+| `markdown` | `str` | Cleaned markdown content |
+| `title` | `str \| None` | Page title |
+| `author` | `str \| None` | Author name |
+| `date` | `str \| None` | Publication date (ISO 8601 when available) |
+| `description` | `str \| None` | Meta description |
+| `url` | `str \| None` | Canonical URL (may differ from input) |
+| `hostname` | `str \| None` | Hostname |
+| `sitename` | `str \| None` | Publisher name |
+To get a plain dict: `dataclasses.asdict(result)`.
+`afetch_result` has the same signature, async.
+## Optional Dependencies
+| Extra | What it adds |
+|-------|-------------|
+| `stealthfetch[camoufox]` | Camoufox stealth Firefox |
+| `stealthfetch[patchright]` | Patchright stealth Chromium |
+| `stealthfetch[browser]` | Both |
+| `stealthfetch[mcp]` | MCP server |
+Python 3.10+. Tested on 3.10–3.13, Linux and macOS.
+## Roadmap
+Things that would make sense if this gets traction:
+- **Homebrew tap** — `brew install stealthfetch` for people who don't want to think about Python
+- **Docker image** — bundle browser backends pre-installed, no `camoufox fetch` step, plays well with [Docker's MCP Catalog](https://docs.docker.com/ai/mcp-catalog-and-toolkit/)
+Contributions welcome.
+## License
+MIT
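
The HTTP-first escalation described in the README can be sketched as a small dispatcher. This is an illustration under stated assumptions rather than the logic in `_core.py`: `fetch_html` and its injected callables are hypothetical, the `method` values mirror the documented `"auto"`, `"http"`, and `"browser"` options, and raising `RuntimeError` when `method="http"` is blocked is an assumption (the package documents its own `FetchError` type).

```python
from collections.abc import Callable

# Hypothetical signature: an HTTP fetcher returns (status_code, html).
HttpFetcher = Callable[[str], tuple[int, str]]


def fetch_html(
    url: str,
    http_fetch: HttpFetcher,
    browser_fetch: Callable[[str], str],
    looks_blocked: Callable[[int, str], bool],
    method: str = "auto",
) -> str:
    """Try plain HTTP first; use the browser only when the response looks blocked."""
    if method not in ("auto", "http", "browser"):
        raise ValueError(f"unknown method: {method!r}")
    if method == "browser":
        return browser_fetch(url)

    status, html = http_fetch(url)
    if not looks_blocked(status, html):
        return html
    if method == "http":
        # Assumption: with escalation disabled, a blocked response is an error.
        raise RuntimeError(f"blocked over HTTP: {url}")
    return browser_fetch(url)
```

Passing the fetchers and the block detector as callables keeps the sketch runnable without network access and makes the escalation decision easy to test in isolation.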