PyPI - switchback - Versions diffs - 0.1.0__tar.gz → 0.4.0__tar.gz - Mend

switchback 0.1.0__tar.gz → 0.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (53)

{switchback-0.1.0 → switchback-0.4.0}/.env.example RENAMED

@@ -14,6 +14,26 @@ OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 # ── Search (Tier-0 SearXNG, query → URLs) ───────────────────────────────────
 SEARXNG_URL=http://localhost:8888
+# ── Output format ───────────────────────────────────────────────────────────
+# Shape of the scraped content. Default markdown is byte-identical to before;
+# override per-call with scrape(fmt=...), the CLI --format flag, or the /scrape
+# {"format": ...} field. html-family results land under a "html" key (instead of
+# "markdown") in the CLI/server JSON.
+#   markdown          whole-page markdown (default)
+#   markdown_trimmed  markdown with extra ad/nav/boilerplate lines removed
+#   html              raw HTML exactly as fetched (no cleaning)
+#   html_selectors    cleaned HTML (boilerplate strip + per-domain drop/selector)
+# Note: the API/PDF tiers (arXiv synth, PDF→text) have no HTML, so html formats
+# fall back to their text for those sources.
+SCRAPER_OUTPUT_FORMAT=markdown
+# ── Tier 2 · Cloudflare solver (cloudscraper) ───────────────────────────────
+# Needs the 3.x Enhanced Edition fork (see README); with the frozen PyPI build
+# the tier reports `unavailable` and fails fast. Wall-clock cap on a single solve
+# so an unsolvable challenge can't eat the per-URL deadline before the browser
+# tier runs. Lower (e.g. 12) if Tier 2 rarely wins on your hosts.
+SCRAPER_CLOUDSCRAPER_TIMEOUT_S=25
 # ── Tier 2.5 · Jina Reader (r.jina.ai) ──────────────────────────────────────
 # Optional: keyless works at 20 RPM. A key gives 500 RPM + a 10M-token grant.
 JINA_API_KEY=
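Reviewer note: the comment block above describes a precedence (a per-call value overrides `SCRAPER_OUTPUT_FORMAT`, which overrides the `markdown` default). The diff does not show how the package resolves or validates the value, so the following is only a minimal sketch of that precedence. `resolve_format` and `VALID_FORMATS` are hypothetical names, and falling back to `markdown` on an unknown value is an assumption, not confirmed behaviour.

```python
import os

# Format names as listed in the .env.example comment block.
VALID_FORMATS = ("markdown", "markdown_trimmed", "html", "html_selectors")


def resolve_format(per_call=None, env=None):
    """Hypothetical helper: per-call value wins, then the env var, then markdown."""
    env = os.environ if env is None else env
    chosen = per_call or env.get("SCRAPER_OUTPUT_FORMAT") or "markdown"
    chosen = chosen.strip().lower()
    # Assumption: an unknown value falls back to the default instead of raising.
    return chosen if chosen in VALID_FORMATS else "markdown"
```

Passing `env` explicitly keeps the helper easy to exercise without touching the real process environment.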

switchback-0.4.0/CHANGELOG.md ADDED

@@ -0,0 +1,94 @@
+# Changelog
+All notable changes to this project are documented here. Format loosely follows
+[Keep a Changelog](https://keepachangelog.com/); this project uses semantic-ish
+versioning while pre-1.0.
+## [Unreleased]
+## [0.4.0] - 2026-06-29
+### Added
+- **Configurable per-tier retries** — a tier can now re-attempt before falling
+  through to the next, more capable one. `SCRAPER_TIER_RETRIES` (global, default
+  `0` = off; `N` → up to `1+N` tries per tier), per-tier overrides
+  `SCRAPER_TIER_RETRIES_<TIER>` (e.g. `SCRAPER_TIER_RETRIES_TIER3_BROWSER=2`), and
+  `SCRAPER_TIER_RETRY_ON` (retryable failure classes; default
+  `timeout,rate_limited,connection` — widen to include `botwall,http_block` behind
+  a rotating residential proxy, where each retry gets a fresh IP). Retries stay
+  bounded by `SCRAPER_DEADLINE_S`, and intermediate retries are traced/logged but
+  **not** persisted to the botwall policy DB, so they never inflate the
+  self-healing skip / `needs_egress` counters. Default `0` keeps behaviour
+  unchanged. Enabling retries on the paid Firecrawl tier bills per attempt.
+### Fixed
+- **Quality gate rejects content shells** — the gate no longer passes a page just
+  because it clears the length floor; thin "shell" pages (nav/boilerplate with no
+  real article body) are now treated as a tier miss so the cascade falls through.
+- **Paid last-resort budget reserve** — `SCRAPER_FIRECRAWL_FALLBACK_AFTER_S`
+  (default 25s) stops starting local tiers once enough of the per-URL deadline has
+  elapsed and an enabled paid tier is still ahead, so a hard host can't burn the
+  whole budget before Firecrawl gets a turn.
+## [0.3.0] - 2026-06-27
+### Added
+- **`unavailable` tier outcome** — when a tier's optional dependency is missing,
+  the wrong version, or not installed yet (frozen PyPI `cloudscraper` instead of
+  the 3.x stealth fork; patchright's Chromium not downloaded during an async
+  cold-start install), the tier now fails fast (~0ms) with a distinct
+  `unavailable` outcome carrying the exact install command, logged once per tier.
+  It ranks above bot-wall in the verdict, so an environment problem is no longer
+  masked as `botwall` — and a missing Tier 2 dependency no longer burns the
+  per-URL solve budget before the browser tier runs.
+- **`switchback --doctor`** — preflight tier-readiness check (doubles as a
+  healthcheck: exit 0 when the capable tiers are ready). Reports whether
+  cloudscraper is the stealth-capable 3.x fork, patchright + Chromium are
+  installed, Camoufox/Node are present, and Firecrawl is configured. Built for
+  cold-start deploys where the browser is installed by a background thread after
+  boot.
+### Docs
+- README **Production / cold-start deployment** section and a `.env.example`
+  Tier 2 block: install `patchright install chromium` in the post-boot step, the
+  cloudscraper 3.x fork requirement, Node.js for Tier 2 concurrency, and the
+  `SCRAPER_CLOUDSCRAPER_TIMEOUT_S` budget knob.
+## [0.2.0] - 2026-06-25
+### Added
+- **Selectable output formats** — `SCRAPER_OUTPUT_FORMAT` (or per-call
+  `scrape(fmt=...)`, CLI `--format`, `/scrape` `{"format": ...}`) selects the
+  content shape: `markdown` (default, unchanged), `markdown_trimmed` (extra
+  ad/nav/boilerplate removed), `html` (raw), or `html_selectors` (cleaned HTML
+  with per-domain `drop`/`selector` applied). Default output is byte-identical;
+  html-family results use a `html` JSON key instead of `markdown`.
+## [0.1.0] - 2026-06-23
+### Added
+- **Challenge-type learning** — bot-walls are classified by vendor (Cloudflare,
+  DataDome, Akamai, PerimeterX, Incapsula, Google) and counted per host in the
+  botwall DB; the vendor is attached to each event and OTel span (`scrape.challenge`).
+- **Metrics & reporting** — `switchback.reporting` rolls the event log + botwall DB
+  into cost-savings-vs-Firecrawl, coverage, overall/per-tier/per-domain latency
+  (mean/median/min/max/p50/p95), outcomes, error codes by domain, and challenges
+  by domain. Exposed via `GET /metrics` and `GET /metrics/domains` (both accept
+  `?minutes=N`).
+- **Periodic flagging** — `python -m switchback.flags` emits a cron-friendly digest
+  (domains stuck on Firecrawl, escalated to egress, most-challenged) to logs/OTel.
+- **Content cache** — optional URL→result cache (`SCRAPER_CONTENT_TTL_S`, sqlite,
+  off by default) short-circuits re-scrapes before any tier runs.
+- **Login-session refresh** — `SCRAPER_LOGIN_HOOK` (`pkg.module:func`) refreshes a
+  dead logged-in session on demand; cookies overlay every tier and persist.
+- **Exponential backoff** — between-tier backoff with jitter after rate-limit /
+  timeout (`SCRAPER_BACKOFF_BASE_MS` / `SCRAPER_BACKOFF_MAX_MS`, off by default).
+- **Per-domain extraction prefs** — `config/extraction.json` (CSS scope selector +
+  extra drops) applied automatically in the normalize step for every tier.
+- **Session traces** — opt-in Playwright trace capture (`SCRAPER_TRACE_SESSION=1`)
+  for browser tiers, with `GET/DELETE /traces` management endpoints.
+### Changed
+- Tier 2's `cloudscraper` moved from a core dependency (which pinned a git-URL
+  fork PyPI can't publish) to the `cloudflare` extra; see the README for installing
+  the 3.x Enhanced Edition fork for full stealth.
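Reviewer note: the 0.4.0 entry describes three retry knobs (`SCRAPER_TIER_RETRIES`, `SCRAPER_TIER_RETRIES_<TIER>`, `SCRAPER_TIER_RETRY_ON`) and says retries stay bounded by `SCRAPER_DEADLINE_S`. The orchestrator code is not part of the hunks shown here, so the sketch below only illustrates how those rules could combine. All function names are hypothetical, and treating malformed values as "off" is an assumption.

```python
import os

# Default retryable classes, as stated in the changelog entry.
DEFAULT_RETRY_ON = "timeout,rate_limited,connection"


def tier_retries(tier, env=None):
    """Retries for a tier: per-tier override first, then the global value, then 0."""
    env = os.environ if env is None else env
    raw = (env.get(f"SCRAPER_TIER_RETRIES_{tier.upper()}")
           or env.get("SCRAPER_TIER_RETRIES")
           or "0")
    try:
        return max(0, int(raw))
    except ValueError:
        return 0  # assumption: a malformed value behaves like "off"


def retry_classes(env=None):
    """Set of failure classes eligible for a same-tier retry."""
    env = os.environ if env is None else env
    raw = env.get("SCRAPER_TIER_RETRY_ON") or DEFAULT_RETRY_ON
    return {c.strip() for c in raw.split(",") if c.strip()}


def should_retry(tier, failure_class, tries_so_far, elapsed_s, deadline_s, env=None):
    """True when another attempt on the same tier is still allowed."""
    if elapsed_s >= deadline_s:          # retries never outlive the per-URL deadline
        return False
    if failure_class not in retry_classes(env):
        return False
    return tries_so_far < 1 + tier_retries(tier, env)  # N retries -> up to 1+N tries
```

With `SCRAPER_TIER_RETRIES_TIER3_BROWSER=2`, this sketch allows up to three attempts on that tier and leaves every other tier on the global value.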

{switchback-0.1.0 → switchback-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: switchback
-Version: 0.1.0
+Version: 0.4.0
 Summary: One cost-ordered scrape cascade (HTTP → stealth browser → paid), shared by every tool.
 Author-email: Akash Kodavuru <akash@theaklabs.com>
 License: MIT
@@ -75,8 +75,8 @@ Dynamic: license-file
 Give it a URL; it tries the cheapest way to get clean Markdown first and only escalates
 to a heavier (slower, costlier) tier when the cheap one is walled. Stops at the first success.
-[![PyPI](https://img.shields.io/pypi/v/switchback.svg)](https://pypi.org/project/switchback/)
-[![Python](https://img.shields.io/pypi/pyversions/switchback.svg)](https://pypi.org/project/switchback/)
+[![PyPI](https://img.shields.io/pypi/v/switchback)](https://pypi.org/project/switchback/)
+[![Python](https://img.shields.io/pypi/pyversions/switchback)](https://pypi.org/project/switchback/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 [![CI](https://github.com/akash-kr/switchback/actions/workflows/ci.yml/badge.svg)](https://github.com/akash-kr/switchback/actions/workflows/ci.yml)
@@ -163,6 +163,34 @@ pip install "cloudscraper @ git+https://github.com/VeNoMouS/cloudscraper@3.0.0"
 Or run the whole thing as a container:
 `docker build -t switchback . && docker run -p 8799:8799 switchback`.
+### Production / cold-start deployment
+The two heavy tiers pull dependencies that often can't be baked into a base image
+and land *after* boot (e.g. an async install thread on Azure). Until they're
+ready, those tiers report **`unavailable`** (a distinct outcome carrying the exact
+fix) and the cascade falls through — they are never silently skipped. Checklist:
+- **Tier 3 is the real workhorse for Cloudflare/JS sites** — make sure its browser
+  is installed: `patchright install chromium` (note: **patchright**, not vanilla
+  `playwright`). On a cold start, run this in your post-boot install step/thread;
+  Tier 3 flips to ready once it finishes.
+- **Tier 2 needs the cloudscraper 3.x fork** (above) to attempt stealth. With the
+  frozen PyPI `cloudscraper` it reports `unavailable` and fails fast (no wasted
+  solve budget) instead of erroring mid-cascade. Tier 2 is a *weak* solver for
+  modern Cloudflare — treat it as a cheap try before the browser, not the primary.
+- **Install Node.js** for Tier 2's v3 JS-VM challenges — faster and thread-safe
+  vs. the pure-Python js2py fallback (relevant under concurrent load).
+- **Bound Tier 2's solve budget** with `SCRAPER_CLOUDSCRAPER_TIMEOUT_S` (default
+  `25`) so an unsolvable challenge can't eat the per-URL deadline before the
+  browser tier runs. Lower it (e.g. `12`) if Tier 2 rarely wins on your hosts.
+**Verify readiness on the box** with the preflight check (doubles as a healthcheck
+— exit 0 when the capable tiers are ready):
+```bash
+switchback --doctor          # or: python -m switchback --doctor
+```
 ## Use it from your app
 Three interchangeable entry points — all return the same shape
@@ -269,7 +297,9 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 <details>
 <summary><b>Tunables</b> — budgets, timeouts, caches, backoff</summary>
+- `SCRAPER_OUTPUT_FORMAT` — output shape: `markdown` (default) · `markdown_trimmed` · `html` · `html_selectors` (see [Output formats](#output-formats))
 - `SCRAPER_DEADLINE_S` — per-URL budget (45s)
+- `SCRAPER_FIRECRAWL_FALLBACK_AFTER_S` — after this many seconds on a URL, stop trying the local tiers and fall back to Firecrawl, so a hard host doesn't burn the whole deadline before the paid last resort gets a turn (25s; 0 = off)
 - `SCRAPER_CAMOUFOX_TIMEOUT_MS` — (45000)
 - `SCRAPER_BROWSER_CONCURRENCY` — max simultaneous headless browsers (default 1)
 - `SCRAPER_BOTWALL_URL_SKIP_COOLDOWN_H` — auto-skip re-test window (24h; 0 = never)
@@ -278,6 +308,8 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 - `SCRAPER_DISABLE_SESSION_CACHE` — turn off cf_clearance reuse
 - `SCRAPER_CONTENT_TTL_S` — URL→result cache TTL (**0 = off**; set e.g. 86400 to skip re-scraping a page within a day)
 - `SCRAPER_BACKOFF_BASE_MS` / `SCRAPER_BACKOFF_MAX_MS` — exponential backoff between tiers after a rate-limit/timeout (base 0 = off)
+- `SCRAPER_TIER_RETRIES` — same-tier retries before falling through (default 0 = off; `N` → up to `1+N` tries per tier), with per-tier overrides `SCRAPER_TIER_RETRIES_<TIER>` (e.g. `SCRAPER_TIER_RETRIES_TIER3_BROWSER=2`)
+- `SCRAPER_TIER_RETRY_ON` — failure classes eligible for a same-tier retry (default `timeout,rate_limited,connection`; widen to include `botwall,http_block` behind a rotating residential proxy so each retry gets a fresh IP). Retries are bounded by `SCRAPER_DEADLINE_S`; enabling them on the paid Firecrawl tier bills per attempt
 - `SCRAPER_LOGIN_HOOK` — `pkg.module:func` returning `{cookie: value}` for a host (see [Logged-in sessions](#logged-in-sessions))
 - `SCRAPER_EXTRACTION_FILE` — per-domain extraction prefs JSON (default `config/extraction.json`)
 - `SCRAPER_TRACE_SESSION` — opt-in: capture a Playwright trace (screenshots + DOM + network) per browser-tier attempt, written to `state/traces/`
@@ -297,6 +329,34 @@ trace zip to `state/traces/`. Manage them over HTTP — `GET /traces` (list),
 `GET /traces/{id}` (download), `DELETE /traces/{id}` — and open one with
 `playwright show-trace <zip>`. Off by default (traces are MBs each).
+### Output formats
+Markdown is the default and is unchanged. Pick a different shape globally with
+`SCRAPER_OUTPUT_FORMAT`, or per call:
+```python
+from switchback import scrape
+scrape(["https://example.com/article"])                    # markdown (default)
+scrape(["https://example.com/article"], fmt="html")        # raw HTML
+scrape(["https://example.com/article"], fmt="markdown_trimmed")
+```
+```bash
+switchback --format html_selectors https://example.com/article
+curl -s localhost:8799/scrape -d '{"urls":["https://example.com"],"format":"html"}'
+```
+| format | what you get |
+| --- | --- |
+| `markdown` | whole-page markdown (boilerplate stripped + per-domain prefs) — **default** |
+| `markdown_trimmed` | markdown with extra ad/nav/boilerplate lines removed |
+| `html` | the raw HTML exactly as fetched, untouched |
+| `html_selectors` | cleaned HTML (boilerplate strip + per-domain `drop`/`selector`), not converted |
+The chosen content rides in the result's `markdown` field; in the CLI/server JSON
+the key is `markdown` for markdown formats and `html` for html formats. The
+API/PDF tiers (arXiv synth, PDF→text) have no HTML, so html formats fall back to
+their text for those sources.
 ### Per-domain extraction
 Markdown of the whole page is the default. To scope a site to its content node or
 strip site-specific noise, declare prefs per host in `config/extraction.json`

{switchback-0.1.0 → switchback-0.4.0}/README.md RENAMED

@@ -16,8 +16,8 @@
 Give it a URL; it tries the cheapest way to get clean Markdown first and only escalates
 to a heavier (slower, costlier) tier when the cheap one is walled. Stops at the first success.
-[![PyPI](https://img.shields.io/pypi/v/switchback.svg)](https://pypi.org/project/switchback/)
-[![Python](https://img.shields.io/pypi/pyversions/switchback.svg)](https://pypi.org/project/switchback/)
+[![PyPI](https://img.shields.io/pypi/v/switchback)](https://pypi.org/project/switchback/)
+[![Python](https://img.shields.io/pypi/pyversions/switchback)](https://pypi.org/project/switchback/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 [![CI](https://github.com/akash-kr/switchback/actions/workflows/ci.yml/badge.svg)](https://github.com/akash-kr/switchback/actions/workflows/ci.yml)
@@ -104,6 +104,34 @@ pip install "cloudscraper @ git+https://github.com/VeNoMouS/cloudscraper@3.0.0"
 Or run the whole thing as a container:
 `docker build -t switchback . && docker run -p 8799:8799 switchback`.
+### Production / cold-start deployment
+The two heavy tiers pull dependencies that often can't be baked into a base image
+and land *after* boot (e.g. an async install thread on Azure). Until they're
+ready, those tiers report **`unavailable`** (a distinct outcome carrying the exact
+fix) and the cascade falls through — they are never silently skipped. Checklist:
+- **Tier 3 is the real workhorse for Cloudflare/JS sites** — make sure its browser
+  is installed: `patchright install chromium` (note: **patchright**, not vanilla
+  `playwright`). On a cold start, run this in your post-boot install step/thread;
+  Tier 3 flips to ready once it finishes.
+- **Tier 2 needs the cloudscraper 3.x fork** (above) to attempt stealth. With the
+  frozen PyPI `cloudscraper` it reports `unavailable` and fails fast (no wasted
+  solve budget) instead of erroring mid-cascade. Tier 2 is a *weak* solver for
+  modern Cloudflare — treat it as a cheap try before the browser, not the primary.
+- **Install Node.js** for Tier 2's v3 JS-VM challenges — faster and thread-safe
+  vs. the pure-Python js2py fallback (relevant under concurrent load).
+- **Bound Tier 2's solve budget** with `SCRAPER_CLOUDSCRAPER_TIMEOUT_S` (default
+  `25`) so an unsolvable challenge can't eat the per-URL deadline before the
+  browser tier runs. Lower it (e.g. `12`) if Tier 2 rarely wins on your hosts.
+**Verify readiness on the box** with the preflight check (doubles as a healthcheck
+— exit 0 when the capable tiers are ready):
+```bash
+switchback --doctor          # or: python -m switchback --doctor
+```
 ## Use it from your app
 Three interchangeable entry points — all return the same shape
@@ -210,7 +238,9 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 <details>
 <summary><b>Tunables</b> — budgets, timeouts, caches, backoff</summary>
+- `SCRAPER_OUTPUT_FORMAT` — output shape: `markdown` (default) · `markdown_trimmed` · `html` · `html_selectors` (see [Output formats](#output-formats))
 - `SCRAPER_DEADLINE_S` — per-URL budget (45s)
+- `SCRAPER_FIRECRAWL_FALLBACK_AFTER_S` — after this many seconds on a URL, stop trying the local tiers and fall back to Firecrawl, so a hard host doesn't burn the whole deadline before the paid last resort gets a turn (25s; 0 = off)
 - `SCRAPER_CAMOUFOX_TIMEOUT_MS` — (45000)
 - `SCRAPER_BROWSER_CONCURRENCY` — max simultaneous headless browsers (default 1)
 - `SCRAPER_BOTWALL_URL_SKIP_COOLDOWN_H` — auto-skip re-test window (24h; 0 = never)
@@ -219,6 +249,8 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 - `SCRAPER_DISABLE_SESSION_CACHE` — turn off cf_clearance reuse
 - `SCRAPER_CONTENT_TTL_S` — URL→result cache TTL (**0 = off**; set e.g. 86400 to skip re-scraping a page within a day)
 - `SCRAPER_BACKOFF_BASE_MS` / `SCRAPER_BACKOFF_MAX_MS` — exponential backoff between tiers after a rate-limit/timeout (base 0 = off)
+- `SCRAPER_TIER_RETRIES` — same-tier retries before falling through (default 0 = off; `N` → up to `1+N` tries per tier), with per-tier overrides `SCRAPER_TIER_RETRIES_<TIER>` (e.g. `SCRAPER_TIER_RETRIES_TIER3_BROWSER=2`)
+- `SCRAPER_TIER_RETRY_ON` — failure classes eligible for a same-tier retry (default `timeout,rate_limited,connection`; widen to include `botwall,http_block` behind a rotating residential proxy so each retry gets a fresh IP). Retries are bounded by `SCRAPER_DEADLINE_S`; enabling them on the paid Firecrawl tier bills per attempt
 - `SCRAPER_LOGIN_HOOK` — `pkg.module:func` returning `{cookie: value}` for a host (see [Logged-in sessions](#logged-in-sessions))
 - `SCRAPER_EXTRACTION_FILE` — per-domain extraction prefs JSON (default `config/extraction.json`)
 - `SCRAPER_TRACE_SESSION` — opt-in: capture a Playwright trace (screenshots + DOM + network) per browser-tier attempt, written to `state/traces/`
@@ -238,6 +270,34 @@ trace zip to `state/traces/`. Manage them over HTTP — `GET /traces` (list),
 `GET /traces/{id}` (download), `DELETE /traces/{id}` — and open one with
 `playwright show-trace <zip>`. Off by default (traces are MBs each).
+### Output formats
+Markdown is the default and is unchanged. Pick a different shape globally with
+`SCRAPER_OUTPUT_FORMAT`, or per call:
+```python
+from switchback import scrape
+scrape(["https://example.com/article"])                    # markdown (default)
+scrape(["https://example.com/article"], fmt="html")        # raw HTML
+scrape(["https://example.com/article"], fmt="markdown_trimmed")
+```
+```bash
+switchback --format html_selectors https://example.com/article
+curl -s localhost:8799/scrape -d '{"urls":["https://example.com"],"format":"html"}'
+```
+| format | what you get |
+| --- | --- |
+| `markdown` | whole-page markdown (boilerplate stripped + per-domain prefs) — **default** |
+| `markdown_trimmed` | markdown with extra ad/nav/boilerplate lines removed |
+| `html` | the raw HTML exactly as fetched, untouched |
+| `html_selectors` | cleaned HTML (boilerplate strip + per-domain `drop`/`selector`), not converted |
+The chosen content rides in the result's `markdown` field; in the CLI/server JSON
+the key is `markdown` for markdown formats and `html` for html formats. The
+API/PDF tiers (arXiv synth, PDF→text) have no HTML, so html formats fall back to
+their text for those sources.
 ### Per-domain extraction
 Markdown of the whole page is the default. To scope a site to its content node or
 strip site-specific noise, declare prefs per host in `config/extraction.json`
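Reviewer note: the new `SCRAPER_FIRECRAWL_FALLBACK_AFTER_S` tunable in this README diff is described as a budget reserve: once that many seconds have elapsed on a URL, stop starting local tiers so the paid tier still gets a turn. The gate itself is not in the hunks shown, so this is a sketch of the described rule only. `should_start_local_tier` is a hypothetical name, and the `paid_tier_ahead` flag models the changelog's "an enabled paid tier is still ahead" condition.

```python
def should_start_local_tier(elapsed_s, fallback_after_s=25.0, paid_tier_ahead=True):
    """Hypothetical gate: may the cascade still start another local tier?"""
    if fallback_after_s <= 0:
        return True   # 0 = off, per the tunable's description
    if not paid_tier_ahead:
        return True   # nothing to reserve budget for
    return elapsed_s < fallback_after_s
```

Under the defaults (45s deadline, 25s reserve threshold), this would leave roughly 20 seconds for the paid tier on a host that stalls every local tier.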

{switchback-0.1.0 → switchback-0.4.0}/clients/python_client.py RENAMED

@@ -59,9 +59,10 @@ def _service_up() -> bool:
         return False
-def _cli_scrape(urls: list[str]) -> list[dict]:
+def _cli_scrape(urls: list[str], fmt: str | None = None) -> list[dict]:
+    flag = ["--format", fmt] if fmt else []
     proc = subprocess.run(
-        [sys.executable, "-m", "switchback", *urls],
+        [sys.executable, "-m", "switchback", *flag, *urls],
         cwd=ENGINE_DIR, capture_output=True, text=True,
     )
     if proc.returncode not in (0, 1):  # 1 == "no successes", still valid JSON ([])
@@ -69,15 +70,20 @@ def _cli_scrape(urls: list[str]) -> list[dict]:
     return json.loads(proc.stdout or "[]")
-def scrape(urls: str | list[str]) -> list[dict]:
-    """Scrape one or many URLs through the engine cascade. Successes only."""
+def scrape(urls: str | list[str], fmt: str | None = None) -> list[dict]:
+    """Scrape one or many URLs through the engine cascade. Successes only.
+    fmt selects the output format (markdown | markdown_trimmed | html |
+    html_selectors); None uses the engine default (markdown). For html formats the
+    content lands under a "html" key instead of "markdown"."""
     if isinstance(urls, str):
         urls = [urls]
     if not urls:
         return []
     if _service_up():
-        return _http_post("/scrape", {"urls": urls})
-    return _cli_scrape(urls)
+        payload = {"urls": urls, "format": fmt} if fmt else {"urls": urls}
+        return _http_post("/scrape", payload)
+    return _cli_scrape(urls, fmt)
 def search(query: str) -> list[dict]:
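Reviewer note: since the client's `scrape()` now returns dicts whose content sits under `"html"` for html formats and under `"markdown"` otherwise, callers that used to read `result["markdown"]` unconditionally may need a small accessor. A minimal sketch follows; `content_of` is a hypothetical helper and the sample dicts in the usage note are made up for illustration.

```python
def content_of(result):
    """Return (key, text) for one result dict from the CLI/server JSON."""
    for key in ("markdown", "html"):
        if key in result:
            return key, result[key]
    raise KeyError("result has neither a 'markdown' nor an 'html' key")
```

For example, `content_of({"url": "https://example.com", "html": "<p>hi</p>"})` yields the `"html"` key and its text, while a default-format result yields `"markdown"`.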

{switchback-0.1.0 → switchback-0.4.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "switchback"
-version = "0.1.0"
+version = "0.4.0"
 description = "One cost-ordered scrape cascade (HTTP → stealth browser → paid), shared by every tool."
 readme = "README.md"
 requires-python = ">=3.10"

{switchback-0.1.0 → switchback-0.4.0}/switchback/api.py RENAMED

@@ -10,27 +10,30 @@ from __future__ import annotations
 import sys
+from .normalize import output_key
 from .orchestrator import ScrapeOutcome, ScrapeResult, TierAttempt, run, run_detailed
 from .search import search  # re-export: query → URLs (SearXNG)
-def scrape(urls: str | list[str]) -> list[ScrapeResult]:
+def scrape(urls: str | list[str], fmt: str | None = None) -> list[ScrapeResult]:
     """Scrape one or many URLs through the cascade. Returns successes only.
+    fmt selects the output format (markdown | markdown_trimmed | html |
+    html_selectors); None uses the SCRAPER_OUTPUT_FORMAT default (markdown).
     For failures with classified reasons + the per-tier cascade, use
     scrape_detailed()."""
     if isinstance(urls, str):
         urls = [urls]
-    return run(urls)
+    return run(urls, fmt)
-def scrape_detailed(urls: str | list[str]) -> list[ScrapeOutcome]:
+def scrape_detailed(urls: str | list[str], fmt: str | None = None) -> list[ScrapeOutcome]:
     """Like scrape() but returns a ScrapeOutcome per URL — successes *and*
     failures, each with final_outcome, error_class, status_code, and the
-    per-tier attempts that were made."""
+    per-tier attempts that were made. fmt as in scrape()."""
     if isinstance(urls, str):
         urls = [urls]
-    return run_detailed(urls)
+    return run_detailed(urls, fmt)
 def _main() -> int:
@@ -50,14 +53,20 @@ def _main() -> int:
                 _k = _k.strip()
                 if _k and _k not in _os.environ:
                     _os.environ[_k] = _v.strip()
-    usage = ("usage: switchback <url> [<url> ...]\n"
+    usage = ("usage: switchback [--format FMT] <url> [<url> ...]\n"
              "       switchback --search <query ...>\n"
-             "       (or: python -m switchback <url> ...)")
+             "       switchback --doctor\n"
+             "       (or: python -m switchback <url> ...)\n"
+             "  FMT: markdown (default) | markdown_trimmed | html | html_selectors")
     # --help/-h is an explicit request: usage to stdout, exit 0 (don't treat it
     # as a URL to scrape). Check before any work so it stays fast and side-effect-free.
     if any(a in ("--help", "-h") for a in sys.argv[1:]):
         print(usage)
         return 0
+    # --doctor: preflight tier-readiness report (no scrape). Side-effect-free.
+    if "--doctor" in sys.argv[1:]:
+        from .doctor import report
+        return report()
     logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
     setup_logs()  # also ship logs to the OTLP backend when configured
     if len(sys.argv) < 2:
@@ -69,9 +78,25 @@ def _main() -> int:
             [{"title": h.title, "url": h.url, "snippet": h.snippet} for h in hits],
             indent=2))
         return 0 if hits else 1
-    results = scrape(sys.argv[1:])
+    # Optional --format / --format=FMT flag; everything else is a URL.
+    fmt: str | None = None
+    rest: list[str] = []
+    argv = sys.argv[1:]
+    i = 0
+    while i < len(argv):
+        a = argv[i]
+        if a == "--format" and i + 1 < len(argv):
+            fmt = argv[i + 1]; i += 2; continue
+        if a.startswith("--format="):
+            fmt = a.split("=", 1)[1]; i += 1; continue
+        rest.append(a); i += 1
+    if not rest:
+        print(usage, file=sys.stderr)
+        return 2
+    results = scrape(rest, fmt=fmt)
     print(json.dumps(
-        [{"url": r.url, "source_method": r.source_method, "markdown": r.markdown}
+        [{"url": r.url, "source_method": r.source_method,
+          output_key(r.format): r.markdown}
          for r in results],
         indent=2))
     return 0 if results else 1
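Reviewer note: the new `--format` handling in `_main()` is a hand-rolled loop, so it is worth isolating to check edge cases. The function below copies that loop into a standalone form (`split_args` is a name introduced here, not one from the package) with the semicolon-joined statements split onto separate lines.

```python
def split_args(argv):
    """Standalone copy of the --format loop in _main(): returns (fmt, rest)."""
    fmt = None
    rest = []
    i = 0
    while i < len(argv):
        a = argv[i]
        if a == "--format" and i + 1 < len(argv):
            fmt = argv[i + 1]      # "--format FMT": consume two arguments
            i += 2
            continue
        if a.startswith("--format="):
            fmt = a.split("=", 1)[1]   # "--format=FMT": consume one argument
            i += 1
            continue
        rest.append(a)             # everything else is treated as a URL
        i += 1
    return fmt, rest
```

One consequence visible in this copy: a trailing `--format` with no value fails the `i + 1 < len(argv)` check and is appended to `rest`, so it would be passed along as if it were a URL. Whether that is intended is not stated in the diff.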

{switchback-0.1.0 → switchback-0.4.0}/switchback/content_cache.py RENAMED

@@ -36,11 +36,14 @@ def enabled() -> bool:
     return _TTL_S > 0
-def _norm(url: str) -> str:
-    """Drop the fragment; everything else is significant (query strings select
-    content)."""
+def _norm(url: str, fmt: str = "markdown") -> str:
+    """Cache key: URL with the fragment dropped (query strings select content).
+    Non-default output formats are namespaced so an html result is never served
+    for a markdown request; the default `markdown` key is unprefixed, so existing
+    caches and the default path are unchanged."""
     p = urlsplit(url)
-    return urlunsplit((p.scheme, p.netloc, p.path, p.query, ""))
+    key = urlunsplit((p.scheme, p.netloc, p.path, p.query, ""))
+    return key if fmt == "markdown" else f"{fmt}\x00{key}"
 def _conn() -> sqlite3.Connection:
@@ -58,8 +61,8 @@ def _conn() -> sqlite3.Connection:
     return _CONN
-def get(url: str) -> tuple[str, str] | None:
-    """Return ``(markdown, source_method)`` for a fresh cache hit, else None."""
+def get(url: str, fmt: str = "markdown") -> tuple[str, str] | None:
+    """Return ``(content, source_method)`` for a fresh cache hit, else None."""
     if not enabled():
         return None
     conn = _conn()  # NB: acquires _LOCK itself — must be outside the lock below
@@ -67,7 +70,7 @@ def get(url: str) -> tuple[str, str] | None:
         with _LOCK:
             row = conn.execute(
                 "SELECT markdown, source_method, ts FROM cache WHERE url=?",
-                (_norm(url),)).fetchone()
+                (_norm(url, fmt),)).fetchone()
     except Exception as e:
         logger.warning(f"content_cache: read failed: {e}")
         return None
@@ -79,7 +82,7 @@ def get(url: str) -> tuple[str, str] | None:
     return markdown, source_method
-def put(url: str, markdown: str, source_method: str) -> None:
+def put(url: str, markdown: str, source_method: str, fmt: str = "markdown") -> None:
     """Store a successful scrape. No-op when disabled."""
     if not enabled():
         return
@@ -88,7 +91,7 @@ def put(url: str, markdown: str, source_method: str) -> None:
         with _LOCK:
             conn.execute("INSERT OR REPLACE INTO cache (url, markdown, source_method, ts) "
                          "VALUES (?, ?, ?, ?)",
-                         (_norm(url), markdown, source_method, time.time()))
+                         (_norm(url, fmt), markdown, source_method, time.time()))
             conn.commit()
     except Exception as e:
         logger.warning(f"content_cache: write failed: {e}")
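Reviewer note: the cache-key change is small enough to check in isolation. The function below reproduces the new `_norm()` body under the name `cache_key` (renamed here only to make the snippet standalone) so the namespacing can be seen directly.

```python
from urllib.parse import urlsplit, urlunsplit


def cache_key(url, fmt="markdown"):
    """Same logic as the new _norm(): drop the fragment, namespace non-default formats."""
    p = urlsplit(url)
    key = urlunsplit((p.scheme, p.netloc, p.path, p.query, ""))
    # The default format keeps the bare URL, so rows written by 0.1.0 still match.
    return key if fmt == "markdown" else f"{fmt}\x00{key}"
```

Note that every non-default format, including `markdown_trimmed`, gets its own prefix, so each format caches independently of the others. The NUL separator cannot appear in a valid URL, which keeps prefixed keys from colliding with plain ones.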

switchback-0.4.0/switchback/doctor.py ADDED

@@ -0,0 +1,59 @@
+"""Preflight readiness check — `switchback doctor`.
+Reports which tiers can actually run on this box and, when one can't, the exact
+fix. Built for cold-start deploys (e.g. Azure) where the stealth browser is
+installed by a background thread *after* boot: run this to confirm the tiers are
+live before sending traffic, or to see why Tier 2/3 aren't catching anything.
+Exit code: 0 if both capable local tiers (cloudscraper + browser) are ready,
+else 1 — so it doubles as a healthcheck.
+"""
+from __future__ import annotations
+import os
+import shutil
+from .tiers import tier2_cloudscraper, tier3_browser
+def _camoufox() -> tuple[bool, str]:
+    if os.getenv("SCRAPER_DISABLE_CAMOUFOX"):
+        return False, "off (SCRAPER_DISABLE_CAMOUFOX set)"
+    try:
+        import camoufox  # noqa: F401
+    except ImportError:
+        return False, 'not installed — pip install "switchback[camoufox]" && camoufox fetch'
+    return True, "camoufox installed"
+def probe() -> list[tuple[str, bool, str]]:
+    """(label, ok, detail) for each tier/dependency that matters at runtime."""
+    cs_ok, cs_detail = tier2_cloudscraper.available()
+    br_ok, br_detail = tier3_browser.available()
+    node = shutil.which("node")
+    return [
+        ("tier2_cloudscraper", cs_ok, cs_detail),
+        ("tier3_browser", br_ok, br_detail),
+        ("tier3b_camoufox", *_camoufox()),
+        ("node (tier2 v3 concurrency)", bool(node),
+         node or "not on PATH — Tier 2 falls back to slower, thread-fragile js2py"),
+        ("tier4_firecrawl", bool(os.getenv("FIRECRAWL_API_KEY")),
+         "FIRECRAWL_API_KEY set" if os.getenv("FIRECRAWL_API_KEY")
+         else "off (no FIRECRAWL_API_KEY)"),
+    ]
+def report() -> int:
+    rows = probe()
+    print("switchback doctor — tier readiness\n")
+    for label, ok, detail in rows:
+        mark = "OK  " if ok else "MISS"
+        print(f"  [{mark}] {label:30} {detail}")
+    cs_ok = rows[0][1]
+    br_ok = rows[1][1]
+    if cs_ok and br_ok:
+        print("\nCapable tiers ready.")
+        return 0
+    print("\nOne or more capable tiers are unavailable (see above). On a cold "
+          "start this may resolve once the async install thread finishes.")
+    return 1
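Reviewer note: because `report()` returns 0 only when both `tier2_cloudscraper` and `tier3_browser` are ready, a cold-start deploy can poll the doctor command until the background install finishes. The module docstring writes the command as `switchback doctor`, but the check in `api.py` looks for the `--doctor` flag, so the flag form is used in the usage note. The sketch below is one way to poll; `wait_until_ready` is a hypothetical helper, and the attempt count and delay are arbitrary illustrative values.

```python
import subprocess
import sys
import time


def wait_until_ready(cmd, attempts=10, delay_s=5.0):
    """Re-run a readiness command until it exits 0 or the attempts run out."""
    for n in range(attempts):
        if subprocess.run(cmd, capture_output=True).returncode == 0:
            return True
        if n < attempts - 1:
            time.sleep(delay_s)   # give the background install time to progress
    return False
```

A deploy script could call `wait_until_ready([sys.executable, "-m", "switchback", "--doctor"])` and hold traffic until it returns `True`.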
