PyPI - switchback - Versions diffs - 0.2.0__tar.gz → 0.4.0__tar.gz - Mend

switchback 0.2.0tar.gz → 0.4.0tar.gz

Files changed (52)

{switchback-0.2.0 → switchback-0.4.0}/.env.example RENAMED

@@ -27,6 +27,13 @@ SEARXNG_URL=http://localhost:8888
 # fall back to their text for those sources.
 SCRAPER_OUTPUT_FORMAT=markdown
+# ── Tier 2 · Cloudflare solver (cloudscraper) ───────────────────────────────
+# Needs the 3.x Enhanced Edition fork (see README); with the frozen PyPI build
+# the tier reports `unavailable` and fails fast. Wall-clock cap on a single solve
+# so an unsolvable challenge can't eat the per-URL deadline before the browser
+# tier runs. Lower (e.g. 12) if Tier 2 rarely wins on your hosts.
+SCRAPER_CLOUDSCRAPER_TIMEOUT_S=25
 # ── Tier 2.5 · Jina Reader (r.jina.ai) ──────────────────────────────────────
 # Optional: keyless works at 20 RPM. A key gives 500 RPM + a 10M-token grant.
 JINA_API_KEY=

switchback-0.4.0/CHANGELOG.md ADDED

@@ -0,0 +1,94 @@
+# Changelog
+All notable changes to this project are documented here. Format loosely follows
+[Keep a Changelog](https://keepachangelog.com/); this project uses semantic-ish
+versioning while pre-1.0.
+## [Unreleased]
+## [0.4.0] - 2026-06-29
+### Added
+- **Configurable per-tier retries** — a tier can now re-attempt before falling
+  through to the next, more capable one. `SCRAPER_TIER_RETRIES` (global, default
+  `0` = off; `N` → up to `1+N` tries per tier), per-tier overrides
+  `SCRAPER_TIER_RETRIES_<TIER>` (e.g. `SCRAPER_TIER_RETRIES_TIER3_BROWSER=2`), and
+  `SCRAPER_TIER_RETRY_ON` (retryable failure classes; default
+  `timeout,rate_limited,connection` — widen to include `botwall,http_block` behind
+  a rotating residential proxy, where each retry gets a fresh IP). Retries stay
+  bounded by `SCRAPER_DEADLINE_S`, and intermediate retries are traced/logged but
+  **not** persisted to the botwall policy DB, so they never inflate the
+  self-healing skip / `needs_egress` counters. Default `0` keeps behaviour
+  unchanged. Enabling retries on the paid Firecrawl tier bills per attempt.
+### Fixed
+- **Quality gate rejects content shells** — the gate no longer passes a page just
+  because it clears the length floor; thin "shell" pages (nav/boilerplate with no
+  real article body) are now treated as a tier miss so the cascade falls through.
+- **Paid last-resort budget reserve** — `SCRAPER_FIRECRAWL_FALLBACK_AFTER_S`
+  (default 25s) stops starting local tiers once enough of the per-URL deadline has
+  elapsed and an enabled paid tier is still ahead, so a hard host can't burn the
+  whole budget before Firecrawl gets a turn.
+## [0.3.0] - 2026-06-27
+### Added
+- **`unavailable` tier outcome** — when a tier's optional dependency is missing,
+  the wrong version, or not installed yet (frozen PyPI `cloudscraper` instead of
+  the 3.x stealth fork; patchright's Chromium not downloaded during an async
+  cold-start install), the tier now fails fast (~0ms) with a distinct
+  `unavailable` outcome carrying the exact install command, logged once per tier.
+  It ranks above bot-wall in the verdict, so an environment problem is no longer
+  masked as `botwall` — and a missing Tier 2 dependency no longer burns the
+  per-URL solve budget before the browser tier runs.
+- **`switchback --doctor`** — preflight tier-readiness check (doubles as a
+  healthcheck: exit 0 when the capable tiers are ready). Reports whether
+  cloudscraper is the stealth-capable 3.x fork, patchright + Chromium are
+  installed, Camoufox/Node are present, and Firecrawl is configured. Built for
+  cold-start deploys where the browser is installed by a background thread after
+  boot.
+### Docs
+- README **Production / cold-start deployment** section and a `.env.example`
+  Tier 2 block: install `patchright install chromium` in the post-boot step, the
+  cloudscraper 3.x fork requirement, Node.js for Tier 2 concurrency, and the
+  `SCRAPER_CLOUDSCRAPER_TIMEOUT_S` budget knob.
+## [0.2.0] - 2026-06-25
+### Added
+- **Selectable output formats** — `SCRAPER_OUTPUT_FORMAT` (or per-call
+  `scrape(fmt=...)`, CLI `--format`, `/scrape` `{"format": ...}`) selects the
+  content shape: `markdown` (default, unchanged), `markdown_trimmed` (extra
+  ad/nav/boilerplate removed), `html` (raw), or `html_selectors` (cleaned HTML
+  with per-domain `drop`/`selector` applied). Default output is byte-identical;
+  html-family results use a `html` JSON key instead of `markdown`.
+## [0.1.0] - 2026-06-23
+### Added
+- **Challenge-type learning** — bot-walls are classified by vendor (Cloudflare,
+  DataDome, Akamai, PerimeterX, Incapsula, Google) and counted per host in the
+  botwall DB; the vendor is attached to each event and OTel span (`scrape.challenge`).
+- **Metrics & reporting** — `switchback.reporting` rolls the event log + botwall DB
+  into cost-savings-vs-Firecrawl, coverage, overall/per-tier/per-domain latency
+  (mean/median/min/max/p50/p95), outcomes, error codes by domain, and challenges
+  by domain. Exposed via `GET /metrics` and `GET /metrics/domains` (both accept
+  `?minutes=N`).
+- **Periodic flagging** — `python -m switchback.flags` emits a cron-friendly digest
+  (domains stuck on Firecrawl, escalated to egress, most-challenged) to logs/OTel.
+- **Content cache** — optional URL→result cache (`SCRAPER_CONTENT_TTL_S`, sqlite,
+  off by default) short-circuits re-scrapes before any tier runs.
+- **Login-session refresh** — `SCRAPER_LOGIN_HOOK` (`pkg.module:func`) refreshes a
+  dead logged-in session on demand; cookies overlay every tier and persist.
+- **Exponential backoff** — between-tier backoff with jitter after rate-limit /
+  timeout (`SCRAPER_BACKOFF_BASE_MS` / `SCRAPER_BACKOFF_MAX_MS`, off by default).
+- **Per-domain extraction prefs** — `config/extraction.json` (CSS scope selector +
+  extra drops) applied automatically in the normalize step for every tier.
+- **Session traces** — opt-in Playwright trace capture (`SCRAPER_TRACE_SESSION=1`)
+  for browser tiers, with `GET/DELETE /traces` management endpoints.
+### Changed
+- Tier 2's `cloudscraper` moved from a core dependency (which pinned a git-URL
+  fork PyPI can't publish) to the `cloudflare` extra; see the README for installing
+  the 3.x Enhanced Edition fork for full stealth.
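
The 0.3.0 entry says `unavailable` ranks above bot-wall in the verdict. A minimal sketch of how such a priority-based verdict could be picked is shown here. The priorities are the ones visible in the `orchestrator.py` hunk of this diff (the hunk elides some entries), while the function name `dominant_failure`, its signature, and the tie-breaking rule are assumptions for illustration, not the package's actual implementation.

```python
# Priorities copied from the visible part of the orchestrator.py hunk;
# classes elided by the diff are not guessed here.
FAILURE_PRIORITY = {
    "unavailable": 6,
    "botwall": 5, "http_block": 5,
    "rate_limited": 4, "short_content": 4,
    "timeout": 3, "connection": 3,
    "error": 1,
}

# Outcomes that do not count as failures (from the same hunk).
NON_FAILURE = {"ok", "not_applicable", "disabled", "skipped_for_budget"}


def dominant_failure(outcomes):
    """Return the most explanatory failure class, or None if nothing failed.

    Assumptions: unknown classes rank lowest (0), and on a tie the earliest
    attempt wins (max() returns the first maximal element).
    """
    failures = [o for o in outcomes if o not in NON_FAILURE]
    if not failures:
        return None
    return max(failures, key=lambda o: FAILURE_PRIORITY.get(o, 0))


# A missing dependency outranks the wall seen by a later tier.
print(dominant_failure(["unavailable", "botwall", "error"]))  # unavailable
```

With this ordering, a cascade whose capable tiers could not run reports an operator-fixable environment problem rather than blaming the target site.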

{switchback-0.2.0 → switchback-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: switchback
-Version: 0.2.0
+Version: 0.4.0
 Summary: One cost-ordered scrape cascade (HTTP → stealth browser → paid), shared by every tool.
 Author-email: Akash Kodavuru <akash@theaklabs.com>
 License: MIT
@@ -163,6 +163,34 @@ pip install "cloudscraper @ git+https://github.com/VeNoMouS/cloudscraper@3.0.0"
 Or run the whole thing as a container:
 `docker build -t switchback . && docker run -p 8799:8799 switchback`.
+### Production / cold-start deployment
+The two heavy tiers pull dependencies that often can't be baked into a base image
+and land *after* boot (e.g. an async install thread on Azure). Until they're
+ready, those tiers report **`unavailable`** (a distinct outcome carrying the exact
+fix) and the cascade falls through — they are never silently skipped. Checklist:
+- **Tier 3 is the real workhorse for Cloudflare/JS sites** — make sure its browser
+  is installed: `patchright install chromium` (note: **patchright**, not vanilla
+  `playwright`). On a cold start, run this in your post-boot install step/thread;
+  Tier 3 flips to ready once it finishes.
+- **Tier 2 needs the cloudscraper 3.x fork** (above) to attempt stealth. With the
+  frozen PyPI `cloudscraper` it reports `unavailable` and fails fast (no wasted
+  solve budget) instead of erroring mid-cascade. Tier 2 is a *weak* solver for
+  modern Cloudflare — treat it as a cheap try before the browser, not the primary.
+- **Install Node.js** for Tier 2's v3 JS-VM challenges — faster and thread-safe
+  vs. the pure-Python js2py fallback (relevant under concurrent load).
+- **Bound Tier 2's solve budget** with `SCRAPER_CLOUDSCRAPER_TIMEOUT_S` (default
+  `25`) so an unsolvable challenge can't eat the per-URL deadline before the
+  browser tier runs. Lower it (e.g. `12`) if Tier 2 rarely wins on your hosts.
+**Verify readiness on the box** with the preflight check (doubles as a healthcheck
+— exit 0 when the capable tiers are ready):
+```bash
+switchback --doctor          # or: python -m switchback --doctor
+```
 ## Use it from your app
 Three interchangeable entry points — all return the same shape
@@ -271,6 +299,7 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 - `SCRAPER_OUTPUT_FORMAT` — output shape: `markdown` (default) · `markdown_trimmed` · `html` · `html_selectors` (see [Output formats](#output-formats))
 - `SCRAPER_DEADLINE_S` — per-URL budget (45s)
+- `SCRAPER_FIRECRAWL_FALLBACK_AFTER_S` — after this many seconds on a URL, stop trying the local tiers and fall back to Firecrawl, so a hard host doesn't burn the whole deadline before the paid last resort gets a turn (25s; 0 = off)
 - `SCRAPER_CAMOUFOX_TIMEOUT_MS` — (45000)
 - `SCRAPER_BROWSER_CONCURRENCY` — max simultaneous headless browsers (default 1)
 - `SCRAPER_BOTWALL_URL_SKIP_COOLDOWN_H` — auto-skip re-test window (24h; 0 = never)
@@ -279,6 +308,8 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 - `SCRAPER_DISABLE_SESSION_CACHE` — turn off cf_clearance reuse
 - `SCRAPER_CONTENT_TTL_S` — URL→result cache TTL (**0 = off**; set e.g. 86400 to skip re-scraping a page within a day)
 - `SCRAPER_BACKOFF_BASE_MS` / `SCRAPER_BACKOFF_MAX_MS` — exponential backoff between tiers after a rate-limit/timeout (base 0 = off)
+- `SCRAPER_TIER_RETRIES` — same-tier retries before falling through (default 0 = off; `N` → up to `1+N` tries per tier), with per-tier overrides `SCRAPER_TIER_RETRIES_<TIER>` (e.g. `SCRAPER_TIER_RETRIES_TIER3_BROWSER=2`)
+- `SCRAPER_TIER_RETRY_ON` — failure classes eligible for a same-tier retry (default `timeout,rate_limited,connection`; widen to include `botwall,http_block` behind a rotating residential proxy so each retry gets a fresh IP). Retries are bounded by `SCRAPER_DEADLINE_S`; enabling them on the paid Firecrawl tier bills per attempt
 - `SCRAPER_LOGIN_HOOK` — `pkg.module:func` returning `{cookie: value}` for a host (see [Logged-in sessions](#logged-in-sessions))
 - `SCRAPER_EXTRACTION_FILE` — per-domain extraction prefs JSON (default `config/extraction.json`)
 - `SCRAPER_TRACE_SESSION` — opt-in: capture a Playwright trace (screenshots + DOM + network) per browser-tier attempt, written to `state/traces/`
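
To see how `SCRAPER_DEADLINE_S` and `SCRAPER_FIRECRAWL_FALLBACK_AFTER_S` interact, the following sketch simulates one URL on which every tier fails. It is a simplified model of the documented rules, not the package's code: `plan_cascade`, the `tier1_http` name, and all durations are hypothetical, and retries and backoff are ignored.

```python
def plan_cascade(tiers, deadline_s=45.0, fallback_after_s=25.0):
    """Simulate which tiers get started on a URL where every tier fails.

    tiers: list of (name, duration_s, is_paid) in cascade order.
    Returns a list of (name, decision) pairs.
    """
    elapsed = 0.0
    plan = []
    for i, (name, duration_s, is_paid) in enumerate(tiers):
        paid_ahead = any(paid for _, _, paid in tiers[i + 1:])
        # Budget reserve: skip local tiers once the threshold has elapsed
        # and a paid tier is still waiting further down the cascade.
        if fallback_after_s and not is_paid and elapsed >= fallback_after_s and paid_ahead:
            plan.append((name, "skipped_for_budget"))
            continue
        # Deadline: stop before starting another non-paid tier.
        if elapsed >= deadline_s and not is_paid:
            plan.append((name, "deadline_exceeded"))
            break
        plan.append((name, "ran"))
        elapsed += duration_s
    return plan


# Hypothetical durations in seconds; only the last tier is paid.
EXAMPLE_TIERS = [
    ("tier1_http", 2, False),
    ("tier2_cloudscraper", 25, False),
    ("tier3_browser", 15, False),
    ("tier3b_camoufox", 15, False),
    ("tier4_firecrawl", 10, True),
]

for name, decision in plan_cascade(EXAMPLE_TIERS):
    print(f"{name:20} {decision}")
```

In this example the first two tiers use 27s, which is past the 25s threshold, so both browser tiers are marked `skipped_for_budget` and Firecrawl starts with about 18s of the 45s deadline left.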

{switchback-0.2.0 → switchback-0.4.0}/README.md RENAMED

@@ -104,6 +104,34 @@ pip install "cloudscraper @ git+https://github.com/VeNoMouS/cloudscraper@3.0.0"
 Or run the whole thing as a container:
 `docker build -t switchback . && docker run -p 8799:8799 switchback`.
+### Production / cold-start deployment
+The two heavy tiers pull dependencies that often can't be baked into a base image
+and land *after* boot (e.g. an async install thread on Azure). Until they're
+ready, those tiers report **`unavailable`** (a distinct outcome carrying the exact
+fix) and the cascade falls through — they are never silently skipped. Checklist:
+- **Tier 3 is the real workhorse for Cloudflare/JS sites** — make sure its browser
+  is installed: `patchright install chromium` (note: **patchright**, not vanilla
+  `playwright`). On a cold start, run this in your post-boot install step/thread;
+  Tier 3 flips to ready once it finishes.
+- **Tier 2 needs the cloudscraper 3.x fork** (above) to attempt stealth. With the
+  frozen PyPI `cloudscraper` it reports `unavailable` and fails fast (no wasted
+  solve budget) instead of erroring mid-cascade. Tier 2 is a *weak* solver for
+  modern Cloudflare — treat it as a cheap try before the browser, not the primary.
+- **Install Node.js** for Tier 2's v3 JS-VM challenges — faster and thread-safe
+  vs. the pure-Python js2py fallback (relevant under concurrent load).
+- **Bound Tier 2's solve budget** with `SCRAPER_CLOUDSCRAPER_TIMEOUT_S` (default
+  `25`) so an unsolvable challenge can't eat the per-URL deadline before the
+  browser tier runs. Lower it (e.g. `12`) if Tier 2 rarely wins on your hosts.
+**Verify readiness on the box** with the preflight check (doubles as a healthcheck
+— exit 0 when the capable tiers are ready):
+```bash
+switchback --doctor          # or: python -m switchback --doctor
+```
 ## Use it from your app
 Three interchangeable entry points — all return the same shape
@@ -212,6 +240,7 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 - `SCRAPER_OUTPUT_FORMAT` — output shape: `markdown` (default) · `markdown_trimmed` · `html` · `html_selectors` (see [Output formats](#output-formats))
 - `SCRAPER_DEADLINE_S` — per-URL budget (45s)
+- `SCRAPER_FIRECRAWL_FALLBACK_AFTER_S` — after this many seconds on a URL, stop trying the local tiers and fall back to Firecrawl, so a hard host doesn't burn the whole deadline before the paid last resort gets a turn (25s; 0 = off)
 - `SCRAPER_CAMOUFOX_TIMEOUT_MS` — (45000)
 - `SCRAPER_BROWSER_CONCURRENCY` — max simultaneous headless browsers (default 1)
 - `SCRAPER_BOTWALL_URL_SKIP_COOLDOWN_H` — auto-skip re-test window (24h; 0 = never)
@@ -220,6 +249,8 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 - `SCRAPER_DISABLE_SESSION_CACHE` — turn off cf_clearance reuse
 - `SCRAPER_CONTENT_TTL_S` — URL→result cache TTL (**0 = off**; set e.g. 86400 to skip re-scraping a page within a day)
 - `SCRAPER_BACKOFF_BASE_MS` / `SCRAPER_BACKOFF_MAX_MS` — exponential backoff between tiers after a rate-limit/timeout (base 0 = off)
+- `SCRAPER_TIER_RETRIES` — same-tier retries before falling through (default 0 = off; `N` → up to `1+N` tries per tier), with per-tier overrides `SCRAPER_TIER_RETRIES_<TIER>` (e.g. `SCRAPER_TIER_RETRIES_TIER3_BROWSER=2`)
+- `SCRAPER_TIER_RETRY_ON` — failure classes eligible for a same-tier retry (default `timeout,rate_limited,connection`; widen to include `botwall,http_block` behind a rotating residential proxy so each retry gets a fresh IP). Retries are bounded by `SCRAPER_DEADLINE_S`; enabling them on the paid Firecrawl tier bills per attempt
 - `SCRAPER_LOGIN_HOOK` — `pkg.module:func` returning `{cookie: value}` for a host (see [Logged-in sessions](#logged-in-sessions))
 - `SCRAPER_EXTRACTION_FILE` — per-domain extraction prefs JSON (default `config/extraction.json`)
 - `SCRAPER_TRACE_SESSION` — opt-in: capture a Playwright trace (screenshots + DOM + network) per browser-tier attempt, written to `state/traces/`
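
The retry settings can be read as "`N` retries means up to `1+N` tries, and only for failure classes listed in `SCRAPER_TIER_RETRY_ON`". The sketch below illustrates that rule in isolation. `run_tier` and the `fetch` callable returning `(outcome, content)` are hypothetical stand-ins, and the deadline bound that the real cascade applies is left out for brevity.

```python
DEFAULT_RETRY_ON = "timeout,rate_limited,connection"


def run_tier(fetch, retries=0, retry_on=DEFAULT_RETRY_ON):
    """Call fetch() up to 1 + retries times; return (content, outcomes).

    fetch is a hypothetical callable returning (outcome, content), where the
    outcome "ok" means success. Non-retryable failures stop immediately.
    """
    retryable = {o.strip() for o in retry_on.split(",") if o.strip()}
    outcomes = []
    for attempt in range(retries + 1):
        outcome, content = fetch()
        outcomes.append(outcome)
        if outcome == "ok":
            return content, outcomes
        if attempt == retries or outcome not in retryable:
            break  # fall through to the next tier
    return None, outcomes


# A timeout is retried by default, so the second try can succeed.
scripted = iter([("timeout", None), ("ok", "# page")])
print(run_tier(lambda: next(scripted), retries=2))

# A bot-wall is not retried unless retry_on is widened to include it.
print(run_tier(lambda: ("botwall", None), retries=2))
```

Widening `retry_on` to `"timeout,rate_limited,connection,botwall,http_block"` would make the second call use all three tries, which only helps when each try leaves from a different IP.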

{switchback-0.2.0 → switchback-0.4.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "switchback"
-version = "0.2.0"
+version = "0.4.0"
 description = "One cost-ordered scrape cascade (HTTP → stealth browser → paid), shared by every tool."
 readme = "README.md"
 requires-python = ">=3.10"

{switchback-0.2.0 → switchback-0.4.0}/switchback/api.py RENAMED

@@ -55,6 +55,7 @@ def _main() -> int:
                     _os.environ[_k] = _v.strip()
     usage = ("usage: switchback [--format FMT] <url> [<url> ...]\n"
              "       switchback --search <query ...>\n"
+             "       switchback --doctor\n"
              "       (or: python -m switchback <url> ...)\n"
              "  FMT: markdown (default) | markdown_trimmed | html | html_selectors")
     # --help/-h is an explicit request: usage to stdout, exit 0 (don't treat it
@@ -62,6 +63,10 @@ def _main() -> int:
     if any(a in ("--help", "-h") for a in sys.argv[1:]):
         print(usage)
         return 0
+    # --doctor: preflight tier-readiness report (no scrape). Side-effect-free.
+    if "--doctor" in sys.argv[1:]:
+        from .doctor import report
+        return report()
     logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
     setup_logs()  # also ship logs to the OTLP backend when configured
     if len(sys.argv) < 2:

switchback-0.4.0/switchback/doctor.py ADDED

@@ -0,0 +1,59 @@
+"""Preflight readiness check — `switchback --doctor`.
+Reports which tiers can actually run on this box and, when one can't, the exact
+fix. Built for cold-start deploys (e.g. Azure) where the stealth browser is
+installed by a background thread *after* boot: run this to confirm the tiers are
+live before sending traffic, or to see why Tier 2/3 aren't catching anything.
+Exit code: 0 if both capable local tiers (cloudscraper + browser) are ready,
+else 1 — so it doubles as a healthcheck.
+"""
+from __future__ import annotations
+import os
+import shutil
+from .tiers import tier2_cloudscraper, tier3_browser
+def _camoufox() -> tuple[bool, str]:
+    if os.getenv("SCRAPER_DISABLE_CAMOUFOX"):
+        return False, "off (SCRAPER_DISABLE_CAMOUFOX set)"
+    try:
+        import camoufox  # noqa: F401
+    except ImportError:
+        return False, 'not installed — pip install "switchback[camoufox]" && camoufox fetch'
+    return True, "camoufox installed"
+def probe() -> list[tuple[str, bool, str]]:
+    """(label, ok, detail) for each tier/dependency that matters at runtime."""
+    cs_ok, cs_detail = tier2_cloudscraper.available()
+    br_ok, br_detail = tier3_browser.available()
+    node = shutil.which("node")
+    return [
+        ("tier2_cloudscraper", cs_ok, cs_detail),
+        ("tier3_browser", br_ok, br_detail),
+        ("tier3b_camoufox", *_camoufox()),
+        ("node (tier2 v3 concurrency)", bool(node),
+         node or "not on PATH — Tier 2 falls back to slower, thread-fragile js2py"),
+        ("tier4_firecrawl", bool(os.getenv("FIRECRAWL_API_KEY")),
+         "FIRECRAWL_API_KEY set" if os.getenv("FIRECRAWL_API_KEY")
+         else "off (no FIRECRAWL_API_KEY)"),
+    ]
+def report() -> int:
+    rows = probe()
+    print("switchback doctor — tier readiness\n")
+    for label, ok, detail in rows:
+        mark = "OK  " if ok else "MISS"
+        print(f"  [{mark}] {label:30} {detail}")
+    cs_ok = rows[0][1]
+    br_ok = rows[1][1]
+    if cs_ok and br_ok:
+        print("\nCapable tiers ready.")
+        return 0
+    print("\nOne or more capable tiers are unavailable (see above). On a cold "
+          "start this may resolve once the async install thread finishes.")
+    return 1
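
Because the doctor's exit code doubles as a healthcheck, a cold-start deploy could poll it before accepting traffic. The following is a sketch of such a gate, assuming the `python -m switchback --doctor` invocation from the README; `wait_until_ready` and its timeout and interval defaults are hypothetical and not part of the package.

```python
import subprocess
import sys
import time

# Invocation taken from the README; the helper around it is illustrative.
DOCTOR_CMD = [sys.executable, "-m", "switchback", "--doctor"]


def wait_until_ready(cmd, timeout_s=300.0, interval_s=5.0):
    """Re-run cmd until it exits 0 or timeout_s elapses.

    Returns True as soon as the command succeeds, False on timeout.
    """
    stop_at = time.monotonic() + timeout_s
    while True:
        done = subprocess.run(cmd, stdout=subprocess.DEVNULL)
        if done.returncode == 0:
            return True
        # Give up if another sleep would carry us past the timeout.
        if time.monotonic() + interval_s > stop_at:
            return False
        time.sleep(interval_s)


# Typical use in a post-boot script (not executed here):
#     if not wait_until_ready(DOCTOR_CMD):
#         raise SystemExit("capable tiers still unavailable")
```

Note that the doctor returns 0 only when both the cloudscraper tier and the browser tier are ready, so a deployment that intentionally runs without the cloudscraper fork would never pass this gate.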

{switchback-0.2.0 → switchback-0.4.0}/switchback/orchestrator.py RENAMED

@@ -19,7 +19,8 @@ from dataclasses import dataclass, field
 from . import content_cache, egress, session_cache
 from .normalize import active_format, output_format_scope
 from .policy import botwall
-from .policy.gates import BotWall, RateLimited, ShortContent, classify_error, host_of
+from .policy.gates import (BotWall, RateLimited, ShortContent, Unavailable,
+                           classify_error, host_of)
 from .tiers import TIERS, INDEX
 from .tracing import Attr, flush, span
@@ -31,6 +32,15 @@ logger = logging.getLogger(__name__)
 # tiers fail fast, while still bounding the worst case.
 _DEADLINE_S = float(os.getenv("SCRAPER_DEADLINE_S", "45"))
+# Fall back to Firecrawl after this many seconds on a URL. On a hard host the
+# cheaper tiers can burn the whole deadline (e.g. cloudscraper's ~25s timeout +
+# two browser solves), so the cascade would hit the deadline and quit *before*
+# ever trying the one tier that reliably works. Once this much time has elapsed,
+# we stop starting more local tiers and jump straight to Firecrawl — so the
+# safety net actually gets a turn. Default 25s leaves ~20s of the 45s deadline
+# for Firecrawl. Only applies when a paid, enabled tier is still ahead; 0 = off.
+_FIRECRAWL_FALLBACK_AFTER_S = float(os.getenv("SCRAPER_FIRECRAWL_FALLBACK_AFTER_S", "25"))
 # Exponential backoff between tiers after a *transient* failure (rate_limited /
 # timeout) — gives a rate limiter or a slow origin a moment before the next tier
 # hammers it. Disabled by default (base 0) so behaviour is unchanged until opted
@@ -50,14 +60,43 @@ def _maybe_backoff(transient_n: int, deadline: float) -> None:
         return
     time.sleep(delay)
+# Configurable same-tier retries. A failing tier normally falls straight through
+# to the next, more capable one; these let a tier re-attempt first. Off by default
+# (0 retries) so behaviour is unchanged until opted in. Read at call time (not
+# import) so a caller/test can set the env per run.
+#   SCRAPER_TIER_RETRIES            global extra attempts per tier (N → 1+N tries)
+#   SCRAPER_TIER_RETRIES_<TIER>     per-tier override, <TIER> = uppercased NAME
+#   SCRAPER_TIER_RETRY_ON           failure classes eligible for a retry
+_DEFAULT_RETRY_ON = "timeout,rate_limited,connection"
+def _retries_for(name: str) -> int:
+    """Extra attempts for a tier: its per-tier override, else the global default."""
+    raw = os.getenv(f"SCRAPER_TIER_RETRIES_{name.upper()}",
+                    os.getenv("SCRAPER_TIER_RETRIES", "0"))
+    try:
+        return max(0, int(raw))
+    except ValueError:
+        logger.warning(f"invalid retry count {raw!r} for {name}; using 0")
+        return 0
+def _retryable_outcomes() -> set[str]:
+    raw = os.getenv("SCRAPER_TIER_RETRY_ON", _DEFAULT_RETRY_ON)
+    return {o.strip() for o in raw.split(",") if o.strip()}
 # Per-attempt outcomes that aren't real failures (don't carry a failure reason).
-_NON_FAILURE = ("ok", "not_applicable", "disabled")
+_NON_FAILURE = ("ok", "not_applicable", "disabled", "skipped_for_budget")
 # How explanatory each failure class is, for picking the reason that best
 # describes why a URL failed. A real wall (403 / bot-wall) outranks a trailing
 # config error (e.g. Firecrawl with no API key → "error"), so the verdict points
 # at the actual blocker rather than the last thing that happened to throw.
 _FAILURE_PRIORITY = {
+    # A missing/old/not-yet-installed tier dependency is an operator-fixable
+    # environment problem; rank it above site walls so it surfaces as the verdict
+    # instead of being masked as "botwall" when the capable tiers can't run.
+    "unavailable": 6,
     "botwall": 5, "http_block": 5,
     "rate_limited": 4, "short_content": 4,
     "timeout": 3, "connection": 3,
@@ -65,6 +104,10 @@ _FAILURE_PRIORITY = {
     "error": 1,
 }
+# Tiers whose dependency we've already warned about this process — the install
+# hint is logged once at WARNING, not per-URL across a whole batch.
+_unavail_warned: set[str] = set()
 @dataclass
 class ScrapeResult:
@@ -139,11 +182,17 @@ def _start_index(url: str, db: dict) -> int:
 def _record_failure(sp, attempts, db, url, tier_name, outcome, exc, status, dt,
-                    challenge=None):
+                    challenge=None, persist=True):
     """Annotate the span, persist to botwall, and append the attempt — for one
     failed tier attempt. Shared by every except branch so classification,
     tracing, and the event log never drift apart. `challenge` names the bot-wall
-    vendor when one was served, so the policy can learn it per host."""
+    vendor when one was served, so the policy can learn it per host.
+    `persist=False` is for an intermediate same-tier retry: the attempt is still
+    traced, logged, and appended (so retries are observable), but it does NOT
+    touch the policy DB — otherwise a single URL's retries would inflate the
+    self-healing failure counters (URL skip / needs_egress) and over-count misses.
+    Only the final per-tier outcome persists."""
     msg = f"{type(exc).__name__}: {exc}"
     sp.set(Attr.OUTCOME, outcome)
     sp.set(Attr.ERROR, msg)
@@ -151,10 +200,13 @@ def _record_failure(sp, attempts, db, url, tier_name, outcome, exc, status, dt,
     sp.set(Attr.CHALLENGE, challenge)
     sp.set(Attr.STATUS_CODE, status)
     sp.set(Attr.LATENCY_MS, dt)
-    botwall.record(db, url, tier_name, outcome, error=msg, latency_ms=dt,
-                   status_code=status, challenge=challenge)
+    if persist:
+        botwall.record(db, url, tier_name, outcome, error=msg, latency_ms=dt,
+                       status_code=status, challenge=challenge)
     # A wall on a host we had a cached cf_clearance for means the cookie is stale
-    # or IP-mismatched: drop it so the next attempt re-solves instead of replaying.
+    # or IP-mismatched: drop it so the next attempt (a same-tier retry or the next
+    # tier) re-solves instead of replaying. Safe on intermediate retries too — it's
+    # a cache drop, not a policy counter.
     if outcome in ("botwall", "http_block"):
         session_cache.forget(url)
     attempts.append(TierAttempt(tier_name, outcome, msg, status, dt))
@@ -214,6 +266,17 @@ def _run_one(url: str, db: dict) -> ScrapeOutcome:
             return res
+def _enabled_paid_ahead(i: int) -> bool:
+    """Is there a paid, currently-enabled tier after index i? (i.e. a last-resort
+    worth reserving budget for.)"""
+    for tier in TIERS[i + 1:]:
+        if getattr(tier, "PAID", False):
+            disabled_fn = getattr(tier, "disabled", None)
+            if not (disabled_fn and disabled_fn()):
+                return True
+    return False
 def _run_cascade(url, host, db, root, t0, deadline) -> ScrapeOutcome:
     attempts: list[TierAttempt] = []
     transient = 0  # count of rate_limited/timeout misses so far (drives backoff)
@@ -241,8 +304,25 @@ def _run_cascade(url, host, db, root, t0, deadline) -> ScrapeOutcome:
             attempts.append(TierAttempt(tier.NAME, "not_applicable"))
             continue
-        # Limit: stop before starting another tier if we're out of budget.
-        if time.monotonic() >= deadline:
+        # Fall back to Firecrawl: once enough time has elapsed on this URL and a
+        # paid enabled tier is still ahead, skip this (non-paid) tier and any
+        # others so the paid tier actually gets a turn instead of the cascade
+        # dying on the deadline mid-browser-solve.
+        if (_FIRECRAWL_FALLBACK_AFTER_S and not getattr(tier, "PAID", False)
+                and (time.monotonic() - t0) >= _FIRECRAWL_FALLBACK_AFTER_S
+                and _enabled_paid_ahead(i)):
+            logger.info(
+                f"{tier.NAME} skipped after {_FIRECRAWL_FALLBACK_AFTER_S}s to "
+                f"fall back to Firecrawl (last resort): {url}")
+            attempts.append(TierAttempt(tier.NAME, "skipped_for_budget"))
+            continue
+        # Limit: stop before starting another tier if we're out of budget. The
+        # paid last resort is exempt — if the cascade reached it, let it run even
+        # a touch over the deadline rather than quit with nothing (it has its own
+        # internal timeout). Non-paid tiers with a paid tier ahead were already
+        # skipped above, so this only ever quits when no paid tier remains.
+        if time.monotonic() >= deadline and not getattr(tier, "PAID", False):
             total = int((time.monotonic() - t0) * 1000)
             ec, sc = _dominant_failure(attempts)
             root.set(Attr.OUTCOME, "deadline_exceeded")
@@ -259,60 +339,89 @@ def _run_cascade(url, host, db, root, t0, deadline) -> ScrapeOutcome:
                                  status_code=sc, latency_ms=total, attempts=attempts)
         paid = getattr(tier, "PAID", False)
-        with span(tier.NAME, **{Attr.HOST: host, Attr.TIER: tier.NAME}) as sp:
-            if paid:
-                # Count every invocation so the host can be promoted to skip.
-                botwall.record(db, url, tier.NAME, "firecrawl_used")
-            ts = time.monotonic()
-            try:
-                md = tier.fetch(url)
-            except BotWall as e:
-                dt = int((time.monotonic() - ts) * 1000)
-                _record_failure(sp, attempts, db, url, tier.NAME, "botwall", e, None, dt,
-                                challenge=getattr(e, "vendor", None))
-                continue
-            except ShortContent as e:
-                dt = int((time.monotonic() - ts) * 1000)
-                _record_failure(sp, attempts, db, url, tier.NAME, "short_content", e, None, dt)
-                continue
-            except RateLimited as e:
-                dt = int((time.monotonic() - ts) * 1000)
-                _record_failure(sp, attempts, db, url, tier.NAME, "rate_limited", e, 429, dt)
-                transient += 1
-                _maybe_backoff(transient, deadline)
-                continue
-            except Exception as e:
+        retries = _retries_for(tier.NAME)
+        retryable = _retryable_outcomes()
+        # Same-tier retry loop: 1 base attempt + N configured retries. A retryable
+        # failure with budget left re-attempts this tier; anything else (or the
+        # last attempt) falls through to the next tier. Each attempt is its own span.
+        for attempt in range(retries + 1):
+            with span(tier.NAME, **{Attr.HOST: host, Attr.TIER: tier.NAME}) as sp:
+                if paid:
+                    # Count every invocation so the host can be promoted to skip
+                    # (and to reflect real per-attempt spend on retries).
+                    botwall.record(db, url, tier.NAME, "firecrawl_used")
+                ts = time.monotonic()
+                outcome = exc = status = challenge = unavailable_exc = None
+                try:
+                    md = tier.fetch(url)
+                except BotWall as e:
+                    outcome, exc, challenge = "botwall", e, getattr(e, "vendor", None)
+                except ShortContent as e:
+                    outcome, exc = "short_content", e
+                except RateLimited as e:
+                    outcome, exc, status = "rate_limited", e, 429
+                except Unavailable as e:
+                    outcome, unavailable_exc = "unavailable", e
+                except Exception as e:
+                    exc = e
+                    outcome, status = classify_error(e)
                 dt = int((time.monotonic() - ts) * 1000)
-                error_class, status = classify_error(e)
-                _record_failure(sp, attempts, db, url, tier.NAME, error_class, e, status, dt)
-                if error_class in _TRANSIENT:
-                    transient += 1
-                    _maybe_backoff(transient, deadline)
-                continue
-            dt = int((time.monotonic() - ts) * 1000)
-            if md is None:  # tier not applicable (e.g. no API mirror)
-                sp.set(Attr.OUTCOME, "not_applicable")
-                sp.set(Attr.LATENCY_MS, dt)
-                attempts.append(TierAttempt(tier.NAME, "not_applicable", latency_ms=dt))
-                continue
-            total = int((time.monotonic() - t0) * 1000)
-            sp.set(Attr.OUTCOME, "ok")
-            sp.set(Attr.MD_LEN, len(md))
-            sp.set(Attr.SOURCE, tier.NAME)
-            sp.set(Attr.LATENCY_MS, dt)
-            botwall.record(db, url, tier.NAME, "ok", md_len=len(md), latency_ms=dt)
-            content_cache.put(url, md, tier.NAME, active_format())
-            root.set(Attr.OUTCOME, "ok")
-            root.set(Attr.SOURCE, tier.NAME)
-            root.set(Attr.LATENCY_MS, total)
-            attempts.append(TierAttempt(tier.NAME, "ok", latency_ms=dt))
-            logger.info(
-                f"{tier.NAME} OK {url} md_len={len(md)} {dt}ms (total {total}ms)")
-            return ScrapeOutcome(url, True, markdown=md, source_method=tier.NAME,
-                                 final_outcome="ok", latency_ms=total,
-                                 format=active_format(), attempts=attempts)
+                if outcome == "unavailable":
+                    # Tier dependency missing/old/not-installed-yet. An environment
+                    # problem, not a host trait — never retried and never taught to
+                    # botwall; warn once per tier with the exact fix instead of
+                    # spamming every URL.
+                    sp.set(Attr.OUTCOME, "unavailable")
+                    sp.set(Attr.ERROR, str(unavailable_exc))
+                    sp.set(Attr.ERROR_CLASS, "unavailable")
+                    sp.set(Attr.LATENCY_MS, dt)
+                    attempts.append(TierAttempt(tier.NAME, "unavailable",
+                                                str(unavailable_exc), None, dt))
+                    if tier.NAME not in _unavail_warned:
+                        _unavail_warned.add(tier.NAME)
+                        logger.warning(f"{tier.NAME} unavailable: {unavailable_exc}")
+                    break  # fall through to the next tier
+                if outcome is not None:  # this attempt failed
+                    # Retry the same tier only if attempts remain, the failure is
+                    # retryable, and there's budget left for another shot.
+                    do_retry = (attempt < retries and outcome in retryable
+                                and time.monotonic() < deadline)
+                    # Intermediate retries are traced/logged but not persisted to
+                    # the policy DB — only the final per-tier outcome counts.
+                    _record_failure(sp, attempts, db, url, tier.NAME, outcome, exc,
+                                    status, dt, challenge=challenge, persist=not do_retry)
+                    if do_retry:
+                        _maybe_backoff(attempt + 1, deadline)  # space the retry
+                        continue
+                    if outcome in _TRANSIENT:
+                        transient += 1
+                        _maybe_backoff(transient, deadline)
+                    break  # fall through to the next tier
+                if md is None:  # tier not applicable (e.g. no API mirror)
+                    sp.set(Attr.OUTCOME, "not_applicable")
+                    sp.set(Attr.LATENCY_MS, dt)
+                    attempts.append(TierAttempt(tier.NAME, "not_applicable", latency_ms=dt))
+                    break
+                total = int((time.monotonic() - t0) * 1000)
+                sp.set(Attr.OUTCOME, "ok")
+                sp.set(Attr.MD_LEN, len(md))
+                sp.set(Attr.SOURCE, tier.NAME)
+                sp.set(Attr.LATENCY_MS, dt)
+                botwall.record(db, url, tier.NAME, "ok", md_len=len(md), latency_ms=dt)
+                content_cache.put(url, md, tier.NAME, active_format())
+                root.set(Attr.OUTCOME, "ok")
+                root.set(Attr.SOURCE, tier.NAME)
+                root.set(Attr.LATENCY_MS, total)
+                attempts.append(TierAttempt(tier.NAME, "ok", latency_ms=dt))
+                logger.info(
+                    f"{tier.NAME} OK {url} md_len={len(md)} {dt}ms (total {total}ms)")
+                return ScrapeOutcome(url, True, markdown=md, source_method=tier.NAME,
+                                     final_outcome="ok", latency_ms=total,
+                                     format=active_format(), attempts=attempts)
     total = int((time.monotonic() - t0) * 1000)
     ec, sc = _dominant_failure(attempts)
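
The same-tier retry decision in the hunk above can be read in isolation. The following is a minimal sketch, not the package's code: `run_tier` and the `fetch` callable returning `(outcome, value)` are hypothetical names introduced here, and backoff, tracing and policy-DB persistence are left out. The default retryable set is taken from the documented `SCRAPER_TIER_RETRY_ON` default.

```python
import time
from typing import Callable, Optional, Tuple

# Documented default for SCRAPER_TIER_RETRY_ON.
RETRYABLE = frozenset({"timeout", "rate_limited", "connection"})


def run_tier(fetch: Callable[[], Tuple[Optional[str], Optional[str]]],
             retries: int,
             deadline: float,
             retryable: frozenset = RETRYABLE) -> Tuple[Optional[str], Optional[str], int]:
    """Hypothetical helper: call `fetch` up to 1 + retries times.

    `fetch` returns (outcome, value); an outcome of None means success.
    Returns (final_outcome, value, tries_used).
    """
    tries = 0
    outcome, value = None, None
    for attempt in range(retries + 1):
        tries += 1
        outcome, value = fetch()
        if outcome is None:
            break  # success: stop immediately
        if outcome == "unavailable":
            break  # environment problem: never retried
        # Same three conditions as the diff: attempts remain, the failure
        # class is retryable, and the per-URL deadline has not passed.
        do_retry = (attempt < retries and outcome in retryable
                    and time.monotonic() < deadline)
        if not do_retry:
            break  # fall through to the next tier
    return outcome, value, tries
```

With `retries=2`, a tier that keeps timing out is tried three times, while a `botwall` outcome falls through after one try unless that class has been added to the retryable set.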

{switchback-0.2.0 → switchback-0.4.0}/switchback/policy/gates.py RENAMED Viewed

@@ -140,6 +140,50 @@ def classify_error(exc: BaseException) -> tuple[str, int | None]:
     return "error", status
+# A page can clear the length gate yet carry no article: a media page whose body
+# never rendered (headline + "Loading video…") or a nav/listing shell that is
+# almost all links. Length alone can't tell "1600 chars of nav links" from "1600
+# chars of prose", so these high-precision checks reject the shell. A false
+# positive (rejecting a real article) is worse than missing an exotic shell, so
+# the thresholds are deliberately conservative — validated to reject NONE of a
+# 90-URL real-content sample while catching the unrendered-media / nav-shell
+# cases that otherwise pass as false-positive "successes".
+_PLACEHOLDER_HEAD_MARKERS = (
+    "loading video",          # video page whose player never hydrated (headline only)
+)
+_NAV_SHELL_LINK_DENSITY = 0.65   # words-inside-links / total words, above which …
+_NAV_SHELL_MAX_TEXT = 600        # … and with this few chars of real text, it's a shell
+def _link_density(md: str) -> float:
+    """Fraction of words that live inside markdown links — a nav/listing shell is
+    nearly all links; an article is mostly prose."""
+    words = md.split()
+    if not words:
+        return 1.0
+    link_words = sum(len(m.split()) for m in re.findall(r"\[([^\]]+)\]\(", md))
+    return link_words / len(words)
+def _nonlink_text_len(md: str) -> int:
+    """Chars of real text once markdown links, URLs and formatting are stripped."""
+    t = re.sub(r"\[[^\]]*\]\([^)]*\)", "", md)   # [text](url)
+    t = re.sub(r"https?://\S+", "", t)
+    t = re.sub(r"[#*>`|!\-]", " ", t)
+    return len(re.sub(r"\s+", " ", t).strip())
+def _content_shell_reason(md: str) -> str | None:
+    """Reason if `md` cleared the length gate but is not an article (media
+    placeholder in the head, or a mostly-links nav/listing shell), else None."""
+    head = md[:_BOTWALL_HEAD_CHARS].lower()
+    if any(m in head for m in _PLACEHOLDER_HEAD_MARKERS):
+        return "unrendered media placeholder"
+    if _link_density(md) > _NAV_SHELL_LINK_DENSITY and _nonlink_text_len(md) < _NAV_SHELL_MAX_TEXT:
+        return "nav/listing shell (mostly links)"
+    return None
 def check(url: str, md: str | None) -> str:
     """Return md if it clears the gates, else raise BotWall / ShortContent."""
     vendor = classify_botwall(md)
@@ -149,6 +193,11 @@ def check(url: str, md: str | None) -> str:
     n = len(md) if md else 0
     if n < gate:
         raise ShortContent(f"body too short: {n} < {gate}")
+    # Length cleared, but is it actually content? Reject shells/placeholders so a
+    # tier falls through instead of returning a confident false-positive success.
+    shell = _content_shell_reason(md or "")
+    if shell:
+        raise ShortContent(f"no article content: {shell}")
     return md
@@ -156,6 +205,16 @@ class ShortContent(RuntimeError):
     """Content fetched but below the quality gate — treated as a tier miss."""
+class Unavailable(RuntimeError):
+    """A tier can't run because an optional dependency is missing, the wrong
+    version, or not installed yet (e.g. cloudscraper 1.2.71 instead of the 3.x
+    stealth fork; patchright's Chromium not downloaded yet during an async
+    cold-start install). Distinct from a tier *failure*: the tier never got to
+    attempt the URL. Surfaced as its own `unavailable` outcome so an environment
+    problem isn't masked as a generic error or a site bot-wall. The message
+    carries the exact fix (e.g. `patchright install chromium`)."""
 class BotWall(RuntimeError):
     """Content fetched but it's a bot-wall / block interstitial (e.g. Cloudflare
     "Just a moment...") rather than the real page — treated as a tier miss so the

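To see how the content-shell gate in `gates.py` behaves, the helpers can be copied into a standalone script and run on small inputs. This is a sketch under assumptions: the regexes and thresholds are taken from the diff, but `HEAD_CHARS` stands in for `_BOTWALL_HEAD_CHARS`, whose real value is not shown here, and the public function names are chosen for illustration only.

```python
import re
from typing import Optional

PLACEHOLDER_HEAD_MARKERS = ("loading video",)
NAV_SHELL_LINK_DENSITY = 0.65
NAV_SHELL_MAX_TEXT = 600
HEAD_CHARS = 2000  # assumed value; the diff does not show _BOTWALL_HEAD_CHARS


def link_density(md: str) -> float:
    """Fraction of whitespace-separated words that sit inside [text](url) labels."""
    words = md.split()
    if not words:
        return 1.0
    link_words = sum(len(m.split()) for m in re.findall(r"\[([^\]]+)\]\(", md))
    return link_words / len(words)


def nonlink_text_len(md: str) -> int:
    """Characters left after removing links, bare URLs and markdown punctuation."""
    t = re.sub(r"\[[^\]]*\]\([^)]*\)", "", md)
    t = re.sub(r"https?://\S+", "", t)
    t = re.sub(r"[#*>`|!\-]", " ", t)
    return len(re.sub(r"\s+", " ", t).strip())


def content_shell_reason(md: str) -> Optional[str]:
    """Reason string if `md` looks like a placeholder or nav shell, else None."""
    head = md[:HEAD_CHARS].lower()
    if any(marker in head for marker in PLACEHOLDER_HEAD_MARKERS):
        return "unrendered media placeholder"
    if (link_density(md) > NAV_SHELL_LINK_DENSITY
            and nonlink_text_len(md) < NAV_SHELL_MAX_TEXT):
        return "nav/listing shell (mostly links)"
    return None


nav = "[Home](/) [World News](/world) [Sports](/sports) [Contact](/contact)"
article = ("This is a paragraph of ordinary prose with one "
           "[link](https://example.com) in it. ") * 20

print(content_shell_reason(nav))      # every word is link text, no prose left
print(content_shell_reason(article))  # 1 link word in 12, so density is low
```

Both conditions of the nav-shell check must hold, so a link-heavy page that still carries 600 or more characters of plain text is not rejected.
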
{switchback-0.2.0 → switchback-0.4.0}/switchback/tiers/tier2_cloudscraper.py RENAMED Viewed

@@ -22,13 +22,40 @@ import threading
 from .. import egress, session_cache
 from ..egress import requests_proxies
 from ..normalize import html_to_markdown
-from ..policy.gates import check
+from ..policy.gates import Unavailable, check
 logger = logging.getLogger(__name__)
 NAME = "tier2_cloudscraper"
 PAID = False
+# Install hint surfaced when cloudscraper is missing or is the frozen PyPI 1.2.71
+# (v1/v2, no stealth) instead of the 3.x Enhanced Edition this tier needs.
+_INSTALL_HINT = ('pip install "cloudscraper @ '
+                 'git+https://github.com/VeNoMouS/cloudscraper@3.0.0"')
+def available() -> tuple[bool, str]:
+    """Whether cloudscraper is importable *and* the stealth-capable 3.x fork.
+    Returns (ok, detail). Used by `fetch` (to fail fast with a clear reason
+    instead of wasting the solve budget) and by `switchback doctor`.
+    Discriminates by major version: the Enhanced Edition fork this tier needs is
+    3.x; PyPI is frozen at 1.2.71 (v1/v2, no stealth, rejects `enable_stealth`)."""
+    try:
+        import cloudscraper
+    except ImportError:
+        return False, f"cloudscraper not installed — {_INSTALL_HINT}"
+    ver = getattr(cloudscraper, "__version__", "0")
+    try:
+        major = int(ver.split(".")[0])
+    except (ValueError, AttributeError):
+        major = 0
+    if major < 3:
+        return False, (f"cloudscraper {ver} has no stealth support (frozen PyPI "
+                       f"v1/v2) — {_INSTALL_HINT}")
+    return True, f"cloudscraper {ver}"
 # Wall-clock cap on the whole solve. cloudscraper 3.x *attempts* interactive
 # Turnstile and can loop for minutes on a challenge it can't clear — far past the
 # per-request socket timeout. Capping it here lets the cascade fall through to the
@@ -78,6 +105,11 @@ def _interpreter_opts() -> dict:
 def _make_scraper():
+    ok, detail = available()
+    if not ok:
+        # Fail fast (~0ms) with the exact fix, rather than wasting the solve
+        # budget or surfacing a cryptic TypeError mid-cascade.
+        raise Unavailable(detail)
     import cloudscraper
     # enable_stealth / auto_refresh_on_403 are on by default in 3.x; we pass the
     # stealth tuning explicitly. No UA override: cloudscraper derives a UA (and
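
The major-version check used by `available()` in the Tier 2 hunk can be exercised on its own. A minimal sketch, assuming hypothetical helper names (`major_version`, `is_stealth_capable`) that do not exist in the package; only the parsing rule and the `>= 3` threshold come from the diff.

```python
def major_version(ver: str) -> int:
    """Leading integer of a dotted version string, or 0 if it cannot be parsed.

    Mirrors the diff: a non-numeric first component (ValueError) or a
    non-string value (AttributeError) is treated as major version 0.
    """
    try:
        return int(ver.split(".")[0])
    except (ValueError, AttributeError):
        return 0


def is_stealth_capable(ver: str) -> bool:
    """True for the 3.x fork; False for the frozen PyPI 1.2.71 or anything unparseable."""
    return major_version(ver) >= 3


for candidate in ("1.2.71", "3.0.0", "dev"):
    print(candidate, is_stealth_capable(candidate))
```

One consequence of this rule is that a version string with a prefix, such as `v3.0.0`, parses as major 0 and would be reported as lacking stealth support.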

{switchback-0.2.0 → switchback-0.4.0}/switchback/tiers/tier3_browser.py RENAMED Viewed

@@ -10,21 +10,59 @@ auth-walled pages (set BU_CDP_URL).
 """
 from __future__ import annotations
+import os
 from . import _browser
 from .. import session_cache, session_trace
 from ..concurrency import browser_slot
 from ..egress import playwright_proxy, add_wire_bytes
 from ..normalize import html_to_markdown
-from ..policy.gates import check
+from ..policy.gates import Unavailable, check
 NAME = "tier3_browser"
 PAID = False
+# Install hint surfaced when patchright or its Chromium isn't ready — notably
+# during an async cold-start install (the browser binary lands after boot).
+_INSTALL_HINT = 'pip install "switchback[browser]" && patchright install chromium'
+def available() -> tuple[bool, str]:
+    """Whether patchright is importable *and* its Chromium is downloaded.
+    Returns (ok, detail). On a cold start where the browser is installed by a
+    background thread, this flips to True once that finishes. Used by `fetch`
+    (clear `unavailable` reason instead of a buried launch error) and by
+    `switchback doctor`."""
+    try:
+        from patchright.sync_api import sync_playwright
+    except ImportError:
+        return False, f"patchright not installed — {_INSTALL_HINT}"
+    try:
+        with sync_playwright() as p:
+            exe = p.chromium.executable_path
+    except Exception as e:  # pragma: no cover — driver start is environment-specific
+        return False, f"patchright driver error: {e}"
+    if not exe or not os.path.exists(exe):
+        return False, f"patchright Chromium not installed — {_INSTALL_HINT}"
+    return True, "patchright + Chromium ready"
 def fetch(url: str, timeout_ms: int = 15000) -> str:
-    from patchright.sync_api import sync_playwright
+    try:
+        from patchright.sync_api import sync_playwright
+    except ImportError:
+        raise Unavailable(f"patchright not installed — {_INSTALL_HINT}")
     with browser_slot(NAME), sync_playwright() as p:
-        browser = p.chromium.launch(headless=True, proxy=playwright_proxy())
+        try:
+            browser = p.chromium.launch(headless=True, proxy=playwright_proxy())
+        except Exception as e:
+            # Chromium not downloaded yet (cold-start window) reads as a launch
+            # error; surface it as unavailable + the fix, not a generic failure.
+            msg = str(e)
+            if "Executable doesn't exist" in msg or "patchright install" in msg:
+                raise Unavailable(
+                    f"patchright Chromium not installed — {_INSTALL_HINT}")
+            raise
         ctx = None
         try:
             # No user_agent override: patchright ships a real, internally
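
The Tier 3 change translates one specific launch failure into `Unavailable` and lets everything else propagate. The pattern can be sketched without a browser installed. Assumptions: `Unavailable` below is a local stand-in for `switchback.policy.gates.Unavailable`, and `is_missing_browser` and `launch_or_unavailable` are hypothetical helpers; the two marker substrings and the install hint are taken from the diff.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

INSTALL_HINT = 'pip install "switchback[browser]" && patchright install chromium'
MISSING_BROWSER_MARKERS = ("Executable doesn't exist", "patchright install")


class Unavailable(RuntimeError):
    """Local stand-in for switchback.policy.gates.Unavailable."""


def is_missing_browser(exc: BaseException) -> bool:
    """True if the error message looks like 'browser binary not downloaded'."""
    msg = str(exc)
    return any(marker in msg for marker in MISSING_BROWSER_MARKERS)


def launch_or_unavailable(launch: Callable[[], T]) -> T:
    """Run `launch`; re-raise a missing-browser error as Unavailable with the fix."""
    try:
        return launch()
    except Exception as e:
        if is_missing_browser(e):
            raise Unavailable(
                f"patchright Chromium not installed: {INSTALL_HINT}") from e
        raise  # any other launch failure stays a normal tier failure
```

Matching on message substrings is brittle if the upstream wording changes, which may be why the diff also adds a separate `available()` probe that checks the executable path directly.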

{switchback-0.2.0 → switchback-0.4.0}/switchback.egg-info/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: switchback
-Version: 0.2.0
+Version: 0.4.0
 Summary: One cost-ordered scrape cascade (HTTP → stealth browser → paid), shared by every tool.
 Author-email: Akash Kodavuru <akash@theaklabs.com>
 License: MIT
@@ -163,6 +163,34 @@ pip install "cloudscraper @ git+https://github.com/VeNoMouS/cloudscraper@3.0.0"
 Or run the whole thing as a container:
 `docker build -t switchback . && docker run -p 8799:8799 switchback`.
+### Production / cold-start deployment
+The two heavy tiers pull dependencies that often can't be baked into a base image
+and land *after* boot (e.g. an async install thread on Azure). Until they're
+ready, those tiers report **`unavailable`** (a distinct outcome carrying the exact
+fix) and the cascade falls through — they are never silently skipped. Checklist:
+- **Tier 3 is the real workhorse for Cloudflare/JS sites** — make sure its browser
+  is installed: `patchright install chromium` (note: **patchright**, not vanilla
+  `playwright`). On a cold start, run this in your post-boot install step/thread;
+  Tier 3 flips to ready once it finishes.
+- **Tier 2 needs the cloudscraper 3.x fork** (above) to attempt stealth. With the
+  frozen PyPI `cloudscraper` it reports `unavailable` and fails fast (no wasted
+  solve budget) instead of erroring mid-cascade. Tier 2 is a *weak* solver for
+  modern Cloudflare — treat it as a cheap try before the browser, not the primary.
+- **Install Node.js** for Tier 2's v3 JS-VM challenges — faster and thread-safe
+  vs. the pure-Python js2py fallback (relevant under concurrent load).
+- **Bound Tier 2's solve budget** with `SCRAPER_CLOUDSCRAPER_TIMEOUT_S` (default
+  `25`) so an unsolvable challenge can't eat the per-URL deadline before the
+  browser tier runs. Lower it (e.g. `12`) if Tier 2 rarely wins on your hosts.
+**Verify readiness on the box** with the preflight check (doubles as a healthcheck
+— exit 0 when the capable tiers are ready):
+```bash
+switchback --doctor          # or: python -m switchback --doctor
+```
 ## Use it from your app
 Three interchangeable entry points — all return the same shape
@@ -271,6 +299,7 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 - `SCRAPER_OUTPUT_FORMAT` — output shape: `markdown` (default) · `markdown_trimmed` · `html` · `html_selectors` (see [Output formats](#output-formats))
 - `SCRAPER_DEADLINE_S` — per-URL budget (45s)
+- `SCRAPER_FIRECRAWL_FALLBACK_AFTER_S` — after this many seconds on a URL, stop trying the local tiers and fall back to Firecrawl, so a hard host doesn't burn the whole deadline before the paid last resort gets a turn (25s; 0 = off)
 - `SCRAPER_CAMOUFOX_TIMEOUT_MS` — (45000)
 - `SCRAPER_BROWSER_CONCURRENCY` — max simultaneous headless browsers (default 1)
 - `SCRAPER_BOTWALL_URL_SKIP_COOLDOWN_H` — auto-skip re-test window (24h; 0 = never)
@@ -279,6 +308,8 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 - `SCRAPER_DISABLE_SESSION_CACHE` — turn off cf_clearance reuse
 - `SCRAPER_CONTENT_TTL_S` — URL→result cache TTL (**0 = off**; set e.g. 86400 to skip re-scraping a page within a day)
 - `SCRAPER_BACKOFF_BASE_MS` / `SCRAPER_BACKOFF_MAX_MS` — exponential backoff between tiers after a rate-limit/timeout (base 0 = off)
+- `SCRAPER_TIER_RETRIES` — same-tier retries before falling through (default 0 = off; `N` → up to `1+N` tries per tier), with per-tier overrides `SCRAPER_TIER_RETRIES_<TIER>` (e.g. `SCRAPER_TIER_RETRIES_TIER3_BROWSER=2`)
+- `SCRAPER_TIER_RETRY_ON` — failure classes eligible for a same-tier retry (default `timeout,rate_limited,connection`; widen to include `botwall,http_block` behind a rotating residential proxy so each retry gets a fresh IP). Retries are bounded by `SCRAPER_DEADLINE_S`; enabling them on the paid Firecrawl tier bills per attempt
 - `SCRAPER_LOGIN_HOOK` — `pkg.module:func` returning `{cookie: value}` for a host (see [Logged-in sessions](#logged-in-sessions))
 - `SCRAPER_EXTRACTION_FILE` — per-domain extraction prefs JSON (default `config/extraction.json`)
 - `SCRAPER_TRACE_SESSION` — opt-in: capture a Playwright trace (screenshots + DOM + network) per browser-tier attempt, written to `state/traces/`
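
The retry settings listed above combine a global value with per-tier overrides. One plausible way to resolve them is sketched below. This is an illustration only: `tier_retries` and `retry_on` are hypothetical functions, the upper-casing of the tier name is inferred from the `SCRAPER_TIER_RETRIES_TIER3_BROWSER` example, and treating malformed or negative values as `0` is an assumption rather than documented behavior.

```python
import os
from typing import Mapping

DEFAULT_RETRY_ON = "timeout,rate_limited,connection"


def tier_retries(tier_name: str, env: Mapping[str, str] = os.environ) -> int:
    """Retries for one tier: the per-tier override wins over the global value."""
    raw = env.get(f"SCRAPER_TIER_RETRIES_{tier_name.upper()}",
                  env.get("SCRAPER_TIER_RETRIES", "0"))
    try:
        return max(0, int(raw))
    except ValueError:
        return 0  # assumed fallback for malformed input


def retry_on(env: Mapping[str, str] = os.environ) -> frozenset:
    """Failure classes eligible for a same-tier retry, parsed from a comma list."""
    raw = env.get("SCRAPER_TIER_RETRY_ON", DEFAULT_RETRY_ON)
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


env = {"SCRAPER_TIER_RETRIES": "1", "SCRAPER_TIER_RETRIES_TIER3_BROWSER": "2"}
print(tier_retries("tier3_browser", env))  # per-tier override applies
print(tier_retries("tier1_http", env))     # falls back to the global value
```

Passing the environment as a mapping keeps the helpers easy to test without mutating `os.environ`.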

{switchback-0.2.0 → switchback-0.4.0}/switchback.egg-info/SOURCES.txt RENAMED Viewed

@@ -16,6 +16,7 @@ switchback/__main__.py
 switchback/api.py
 switchback/concurrency.py
 switchback/content_cache.py
+switchback/doctor.py
 switchback/egress.py
 switchback/extract.py
 switchback/flags.py

switchback-0.2.0/CHANGELOG.md DELETED Viewed

@@ -1,46 +0,0 @@
-# Changelog
-All notable changes to this project are documented here. Format loosely follows
-[Keep a Changelog](https://keepachangelog.com/); this project uses semantic-ish
-versioning while pre-1.0.
-## [Unreleased]
-## [0.2.0] - 2026-06-25
-### Added
-- **Selectable output formats** — `SCRAPER_OUTPUT_FORMAT` (or per-call
-  `scrape(fmt=...)`, CLI `--format`, `/scrape` `{"format": ...}`) selects the
-  content shape: `markdown` (default, unchanged), `markdown_trimmed` (extra
-  ad/nav/boilerplate removed), `html` (raw), or `html_selectors` (cleaned HTML
-  with per-domain `drop`/`selector` applied). Default output is byte-identical;
-  html-family results use a `html` JSON key instead of `markdown`.
-## [0.1.0] - 2026-06-23
-### Added
-- **Challenge-type learning** — bot-walls are classified by vendor (Cloudflare,
-  DataDome, Akamai, PerimeterX, Incapsula, Google) and counted per host in the
-  botwall DB; the vendor is attached to each event and OTel span (`scrape.challenge`).
-- **Metrics & reporting** — `switchback.reporting` rolls the event log + botwall DB
-  into cost-savings-vs-Firecrawl, coverage, overall/per-tier/per-domain latency
-  (mean/median/min/max/p50/p95), outcomes, error codes by domain, and challenges
-  by domain. Exposed via `GET /metrics` and `GET /metrics/domains` (both accept
-  `?minutes=N`).
-- **Periodic flagging** — `python -m switchback.flags` emits a cron-friendly digest
-  (domains stuck on Firecrawl, escalated to egress, most-challenged) to logs/OTel.
-- **Content cache** — optional URL→result cache (`SCRAPER_CONTENT_TTL_S`, sqlite,
-  off by default) short-circuits re-scrapes before any tier runs.
-- **Login-session refresh** — `SCRAPER_LOGIN_HOOK` (`pkg.module:func`) refreshes a
-  dead logged-in session on demand; cookies overlay every tier and persist.
-- **Exponential backoff** — between-tier backoff with jitter after rate-limit /
-  timeout (`SCRAPER_BACKOFF_BASE_MS` / `SCRAPER_BACKOFF_MAX_MS`, off by default).
-- **Per-domain extraction prefs** — `config/extraction.json` (CSS scope selector +
-  extra drops) applied automatically in the normalize step for every tier.
-- **Session traces** — opt-in Playwright trace capture (`SCRAPER_TRACE_SESSION=1`)
-  for browser tiers, with `GET/DELETE /traces` management endpoints.
-### Changed
-- Tier 2's `cloudscraper` moved from a core dependency (which pinned a git-URL
-  fork PyPI can't publish) to the `cloudflare` extra; see the README for installing
-  the 3.x Enhanced Edition fork for full stealth.

{switchback-0.2.0 → switchback-0.4.0}/CONTRIBUTING.md RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/LICENSE RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/MANIFEST.in RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/NOTICE RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/SECURITY.md RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/clients/node_bridge.md RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/clients/python_client.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/config/botwall_skip_urls.txt RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/config/extraction.example.json RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/setup.cfg RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/__init__.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/__main__.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/concurrency.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/content_cache.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/egress.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/extract.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/flags.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/normalize.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/policy/__init__.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/policy/botwall.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/py.typed RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/reporting.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/search.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/server.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/session_cache.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/session_trace.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/tiers/__init__.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/tiers/_browser.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/tiers/tier0_apis.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/tiers/tier1_http.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/tiers/tier3b_camoufox.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/tiers/tier4_firecrawl.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/tiers/tier_residential.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback/tracing.py RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback.egg-info/dependency_links.txt RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback.egg-info/entry_points.txt RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback.egg-info/requires.txt RENAMED Viewed

File without changes

{switchback-0.2.0 → switchback-0.4.0}/switchback.egg-info/top_level.txt RENAMED Viewed

File without changes
