switchback 0.1.0.tar.gz → 0.2.0.tar.gz

Files changed (51)

{switchback-0.1.0 → switchback-0.2.0}/.env.example RENAMED

@@ -14,6 +14,19 @@ OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 # ── Search (Tier-0 SearXNG, query → URLs) ───────────────────────────────────
 SEARXNG_URL=http://localhost:8888
+# ── Output format ───────────────────────────────────────────────────────────
+# Shape of the scraped content. Default markdown is byte-identical to before;
+# override per-call with scrape(fmt=...), the CLI --format flag, or the /scrape
+# {"format": ...} field. html-family results land under a "html" key (instead of
+# "markdown") in the CLI/server JSON.
+#   markdown          whole-page markdown (default)
+#   markdown_trimmed  markdown with extra ad/nav/boilerplate lines removed
+#   html              raw HTML exactly as fetched (no cleaning)
+#   html_selectors    cleaned HTML (boilerplate strip + per-domain drop/selector)
+# Note: the API/PDF tiers (arXiv synth, PDF→text) have no HTML, so html formats
+# fall back to their text for those sources.
+SCRAPER_OUTPUT_FORMAT=markdown
 # ── Tier 2.5 · Jina Reader (r.jina.ai) ──────────────────────────────────────
 # Optional: keyless works at 20 RPM. A key gives 500 RPM + a 10M-token grant.
 JINA_API_KEY=

{switchback-0.1.0 → switchback-0.2.0}/CHANGELOG.md RENAMED

@@ -6,6 +6,18 @@ versioning while pre-1.0.
 ## [Unreleased]
+## [0.2.0] - 2026-06-25
+### Added
+- **Selectable output formats** — `SCRAPER_OUTPUT_FORMAT` (or per-call
+  `scrape(fmt=...)`, CLI `--format`, `/scrape` `{"format": ...}`) selects the
+  content shape: `markdown` (default, unchanged), `markdown_trimmed` (extra
+  ad/nav/boilerplate removed), `html` (raw), or `html_selectors` (cleaned HTML
+  with per-domain `drop`/`selector` applied). Default output is byte-identical;
+  html-family results use a `html` JSON key instead of `markdown`.
+## [0.1.0] - 2026-06-23
 ### Added
 - **Challenge-type learning** — bot-walls are classified by vendor (Cloudflare,
   DataDome, Akamai, PerimeterX, Incapsula, Google) and counted per host in the

{switchback-0.1.0 → switchback-0.2.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: switchback
-Version: 0.1.0
+Version: 0.2.0
 Summary: One cost-ordered scrape cascade (HTTP → stealth browser → paid), shared by every tool.
 Author-email: Akash Kodavuru <akash@theaklabs.com>
 License: MIT
@@ -75,8 +75,8 @@ Dynamic: license-file
 Give it a URL; it tries the cheapest way to get clean Markdown first and only escalates
 to a heavier (slower, costlier) tier when the cheap one is walled. Stops at the first success.
-[![PyPI](https://img.shields.io/pypi/v/switchback.svg)](https://pypi.org/project/switchback/)
-[![Python](https://img.shields.io/pypi/pyversions/switchback.svg)](https://pypi.org/project/switchback/)
+[![PyPI](https://img.shields.io/pypi/v/switchback)](https://pypi.org/project/switchback/)
+[![Python](https://img.shields.io/pypi/pyversions/switchback)](https://pypi.org/project/switchback/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 [![CI](https://github.com/akash-kr/switchback/actions/workflows/ci.yml/badge.svg)](https://github.com/akash-kr/switchback/actions/workflows/ci.yml)
@@ -269,6 +269,7 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 <details>
 <summary><b>Tunables</b> — budgets, timeouts, caches, backoff</summary>
+- `SCRAPER_OUTPUT_FORMAT` — output shape: `markdown` (default) · `markdown_trimmed` · `html` · `html_selectors` (see [Output formats](#output-formats))
 - `SCRAPER_DEADLINE_S` — per-URL budget (45s)
 - `SCRAPER_CAMOUFOX_TIMEOUT_MS` — (45000)
 - `SCRAPER_BROWSER_CONCURRENCY` — max simultaneous headless browsers (default 1)
@@ -297,6 +298,34 @@ trace zip to `state/traces/`. Manage them over HTTP — `GET /traces` (list),
 `GET /traces/{id}` (download), `DELETE /traces/{id}` — and open one with
 `playwright show-trace <zip>`. Off by default (traces are MBs each).
+### Output formats
+Markdown is the default and is unchanged. Pick a different shape globally with
+`SCRAPER_OUTPUT_FORMAT`, or per call:
+```python
+from switchback import scrape
+scrape(["https://example.com/article"])                    # markdown (default)
+scrape(["https://example.com/article"], fmt="html")        # raw HTML
+scrape(["https://example.com/article"], fmt="markdown_trimmed")
+```
+```bash
+switchback --format html_selectors https://example.com/article
+curl -s localhost:8799/scrape -d '{"urls":["https://example.com"],"format":"html"}'
+```
+| format | what you get |
+| --- | --- |
+| `markdown` | whole-page markdown (boilerplate stripped + per-domain prefs) — **default** |
+| `markdown_trimmed` | markdown with extra ad/nav/boilerplate lines removed |
+| `html` | the raw HTML exactly as fetched, untouched |
+| `html_selectors` | cleaned HTML (boilerplate strip + per-domain `drop`/`selector`), not converted |
+The chosen content rides in the result's `markdown` field; in the CLI/server JSON
+the key is `markdown` for markdown formats and `html` for html formats. The
+API/PDF tiers (arXiv synth, PDF→text) have no HTML, so html formats fall back to
+their text for those sources.
 ### Per-domain extraction
 Markdown of the whole page is the default. To scope a site to its content node or
 strip site-specific noise, declare prefs per host in `config/extraction.json`

{switchback-0.1.0 → switchback-0.2.0}/README.md RENAMED

@@ -16,8 +16,8 @@
 Give it a URL; it tries the cheapest way to get clean Markdown first and only escalates
 to a heavier (slower, costlier) tier when the cheap one is walled. Stops at the first success.
-[![PyPI](https://img.shields.io/pypi/v/switchback.svg)](https://pypi.org/project/switchback/)
-[![Python](https://img.shields.io/pypi/pyversions/switchback.svg)](https://pypi.org/project/switchback/)
+[![PyPI](https://img.shields.io/pypi/v/switchback)](https://pypi.org/project/switchback/)
+[![Python](https://img.shields.io/pypi/pyversions/switchback)](https://pypi.org/project/switchback/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 [![CI](https://github.com/akash-kr/switchback/actions/workflows/ci.yml/badge.svg)](https://github.com/akash-kr/switchback/actions/workflows/ci.yml)
@@ -210,6 +210,7 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 <details>
 <summary><b>Tunables</b> — budgets, timeouts, caches, backoff</summary>
+- `SCRAPER_OUTPUT_FORMAT` — output shape: `markdown` (default) · `markdown_trimmed` · `html` · `html_selectors` (see [Output formats](#output-formats))
 - `SCRAPER_DEADLINE_S` — per-URL budget (45s)
 - `SCRAPER_CAMOUFOX_TIMEOUT_MS` — (45000)
 - `SCRAPER_BROWSER_CONCURRENCY` — max simultaneous headless browsers (default 1)
@@ -238,6 +239,34 @@ trace zip to `state/traces/`. Manage them over HTTP — `GET /traces` (list),
 `GET /traces/{id}` (download), `DELETE /traces/{id}` — and open one with
 `playwright show-trace <zip>`. Off by default (traces are MBs each).
+### Output formats
+Markdown is the default and is unchanged. Pick a different shape globally with
+`SCRAPER_OUTPUT_FORMAT`, or per call:
+```python
+from switchback import scrape
+scrape(["https://example.com/article"])                    # markdown (default)
+scrape(["https://example.com/article"], fmt="html")        # raw HTML
+scrape(["https://example.com/article"], fmt="markdown_trimmed")
+```
+```bash
+switchback --format html_selectors https://example.com/article
+curl -s localhost:8799/scrape -d '{"urls":["https://example.com"],"format":"html"}'
+```
+| format | what you get |
+| --- | --- |
+| `markdown` | whole-page markdown (boilerplate stripped + per-domain prefs) — **default** |
+| `markdown_trimmed` | markdown with extra ad/nav/boilerplate lines removed |
+| `html` | the raw HTML exactly as fetched, untouched |
+| `html_selectors` | cleaned HTML (boilerplate strip + per-domain `drop`/`selector`), not converted |
+The chosen content rides in the result's `markdown` field; in the CLI/server JSON
+the key is `markdown` for markdown formats and `html` for html formats. The
+API/PDF tiers (arXiv synth, PDF→text) have no HTML, so html formats fall back to
+their text for those sources.
 ### Per-domain extraction
 Markdown of the whole page is the default. To scope a site to its content node or
 strip site-specific noise, declare prefs per host in `config/extraction.json`

{switchback-0.1.0 → switchback-0.2.0}/clients/python_client.py RENAMED

@@ -59,9 +59,10 @@ def _service_up() -> bool:
         return False
-def _cli_scrape(urls: list[str]) -> list[dict]:
+def _cli_scrape(urls: list[str], fmt: str | None = None) -> list[dict]:
+    flag = ["--format", fmt] if fmt else []
     proc = subprocess.run(
-        [sys.executable, "-m", "switchback", *urls],
+        [sys.executable, "-m", "switchback", *flag, *urls],
         cwd=ENGINE_DIR, capture_output=True, text=True,
     )
     if proc.returncode not in (0, 1):  # 1 == "no successes", still valid JSON ([])
@@ -69,15 +70,20 @@ def _cli_scrape(urls: list[str]) -> list[dict]:
     return json.loads(proc.stdout or "[]")
-def scrape(urls: str | list[str]) -> list[dict]:
-    """Scrape one or many URLs through the engine cascade. Successes only."""
+def scrape(urls: str | list[str], fmt: str | None = None) -> list[dict]:
+    """Scrape one or many URLs through the engine cascade. Successes only.
+    fmt selects the output format (markdown | markdown_trimmed | html |
+    html_selectors); None uses the engine default (markdown). For html formats the
+    content lands under a "html" key instead of "markdown"."""
     if isinstance(urls, str):
         urls = [urls]
     if not urls:
         return []
     if _service_up():
-        return _http_post("/scrape", {"urls": urls})
-    return _cli_scrape(urls)
+        payload = {"urls": urls, "format": fmt} if fmt else {"urls": urls}
+        return _http_post("/scrape", payload)
+    return _cli_scrape(urls, fmt)
 def search(query: str) -> list[dict]:

{switchback-0.1.0 → switchback-0.2.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "switchback"
-version = "0.1.0"
+version = "0.2.0"
 description = "One cost-ordered scrape cascade (HTTP → stealth browser → paid), shared by every tool."
 readme = "README.md"
 requires-python = ">=3.10"

{switchback-0.1.0 → switchback-0.2.0}/switchback/api.py RENAMED

@@ -10,27 +10,30 @@ from __future__ import annotations
 import sys
+from .normalize import output_key
 from .orchestrator import ScrapeOutcome, ScrapeResult, TierAttempt, run, run_detailed
 from .search import search  # re-export: query → URLs (SearXNG)
-def scrape(urls: str | list[str]) -> list[ScrapeResult]:
+def scrape(urls: str | list[str], fmt: str | None = None) -> list[ScrapeResult]:
     """Scrape one or many URLs through the cascade. Returns successes only.
+    fmt selects the output format (markdown | markdown_trimmed | html |
+    html_selectors); None uses the SCRAPER_OUTPUT_FORMAT default (markdown).
     For failures with classified reasons + the per-tier cascade, use
     scrape_detailed()."""
     if isinstance(urls, str):
         urls = [urls]
-    return run(urls)
+    return run(urls, fmt)
-def scrape_detailed(urls: str | list[str]) -> list[ScrapeOutcome]:
+def scrape_detailed(urls: str | list[str], fmt: str | None = None) -> list[ScrapeOutcome]:
     """Like scrape() but returns a ScrapeOutcome per URL — successes *and*
     failures, each with final_outcome, error_class, status_code, and the
-    per-tier attempts that were made."""
+    per-tier attempts that were made. fmt as in scrape()."""
     if isinstance(urls, str):
         urls = [urls]
-    return run_detailed(urls)
+    return run_detailed(urls, fmt)
 def _main() -> int:
@@ -50,9 +53,10 @@ def _main() -> int:
                 _k = _k.strip()
                 if _k and _k not in _os.environ:
                     _os.environ[_k] = _v.strip()
-    usage = ("usage: switchback <url> [<url> ...]\n"
+    usage = ("usage: switchback [--format FMT] <url> [<url> ...]\n"
              "       switchback --search <query ...>\n"
-             "       (or: python -m switchback <url> ...)")
+             "       (or: python -m switchback <url> ...)\n"
+             "  FMT: markdown (default) | markdown_trimmed | html | html_selectors")
     # --help/-h is an explicit request: usage to stdout, exit 0 (don't treat it
     # as a URL to scrape). Check before any work so it stays fast and side-effect-free.
     if any(a in ("--help", "-h") for a in sys.argv[1:]):
@@ -69,9 +73,25 @@ def _main() -> int:
             [{"title": h.title, "url": h.url, "snippet": h.snippet} for h in hits],
             indent=2))
         return 0 if hits else 1
-    results = scrape(sys.argv[1:])
+    # Optional --format / --format=FMT flag; everything else is a URL.
+    fmt: str | None = None
+    rest: list[str] = []
+    argv = sys.argv[1:]
+    i = 0
+    while i < len(argv):
+        a = argv[i]
+        if a == "--format" and i + 1 < len(argv):
+            fmt = argv[i + 1]; i += 2; continue
+        if a.startswith("--format="):
+            fmt = a.split("=", 1)[1]; i += 1; continue
+        rest.append(a); i += 1
+    if not rest:
+        print(usage, file=sys.stderr)
+        return 2
+    results = scrape(rest, fmt=fmt)
     print(json.dumps(
-        [{"url": r.url, "source_method": r.source_method, "markdown": r.markdown}
+        [{"url": r.url, "source_method": r.source_method,
+          output_key(r.format): r.markdown}
          for r in results],
         indent=2))
     return 0 if results else 1
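
The `--format` handling in the hunk above can be exercised in isolation. The following is a minimal sketch that mirrors the loop from `_main()`; the function name `split_format_flag` is hypothetical and not part of the package.

```python
from __future__ import annotations


def split_format_flag(argv: list[str]) -> tuple[str | None, list[str]]:
    """Separate an optional --format flag from the positional URLs.

    Hypothetical helper: mirrors the while-loop shown in the api.py hunk.
    """
    fmt: str | None = None
    rest: list[str] = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        # "--format FMT": consume the flag and the value that follows it.
        if arg == "--format" and i + 1 < len(argv):
            fmt = argv[i + 1]
            i += 2
            continue
        # "--format=FMT": the value is everything after the first "=".
        if arg.startswith("--format="):
            fmt = arg.split("=", 1)[1]
            i += 1
            continue
        # Anything else is treated as a URL.
        rest.append(arg)
        i += 1
    return fmt, rest


spaced = split_format_flag(["--format", "html", "https://example.com"])
joined = split_format_flag(["--format=markdown_trimmed", "https://example.com"])
dangling = split_format_flag(["https://example.com", "--format"])
```

One consequence of this loop, as written, is that a trailing `--format` with no value falls through to the URL list rather than raising a usage error.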

{switchback-0.1.0 → switchback-0.2.0}/switchback/content_cache.py RENAMED

@@ -36,11 +36,14 @@ def enabled() -> bool:
     return _TTL_S > 0
-def _norm(url: str) -> str:
-    """Drop the fragment; everything else is significant (query strings select
-    content)."""
+def _norm(url: str, fmt: str = "markdown") -> str:
+    """Cache key: URL with the fragment dropped (query strings select content).
+    Non-default output formats are namespaced so an html result is never served
+    for a markdown request; the default `markdown` key is unprefixed, so existing
+    caches and the default path are unchanged."""
     p = urlsplit(url)
-    return urlunsplit((p.scheme, p.netloc, p.path, p.query, ""))
+    key = urlunsplit((p.scheme, p.netloc, p.path, p.query, ""))
+    return key if fmt == "markdown" else f"{fmt}\x00{key}"
 def _conn() -> sqlite3.Connection:
@@ -58,8 +61,8 @@ def _conn() -> sqlite3.Connection:
     return _CONN
-def get(url: str) -> tuple[str, str] | None:
-    """Return ``(markdown, source_method)`` for a fresh cache hit, else None."""
+def get(url: str, fmt: str = "markdown") -> tuple[str, str] | None:
+    """Return ``(content, source_method)`` for a fresh cache hit, else None."""
     if not enabled():
         return None
     conn = _conn()  # NB: acquires _LOCK itself — must be outside the lock below
@@ -67,7 +70,7 @@ def get(url: str) -> tuple[str, str] | None:
         with _LOCK:
             row = conn.execute(
                 "SELECT markdown, source_method, ts FROM cache WHERE url=?",
-                (_norm(url),)).fetchone()
+                (_norm(url, fmt),)).fetchone()
     except Exception as e:
         logger.warning(f"content_cache: read failed: {e}")
         return None
@@ -79,7 +82,7 @@ def get(url: str) -> tuple[str, str] | None:
     return markdown, source_method
-def put(url: str, markdown: str, source_method: str) -> None:
+def put(url: str, markdown: str, source_method: str, fmt: str = "markdown") -> None:
     """Store a successful scrape. No-op when disabled."""
     if not enabled():
         return
@@ -88,7 +91,7 @@ def put(url: str, markdown: str, source_method: str) -> None:
         with _LOCK:
             conn.execute("INSERT OR REPLACE INTO cache (url, markdown, source_method, ts) "
                          "VALUES (?, ?, ?, ?)",
-                         (_norm(url), markdown, source_method, time.time()))
+                         (_norm(url, fmt), markdown, source_method, time.time()))
             conn.commit()
     except Exception as e:
         logger.warning(f"content_cache: write failed: {e}")
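
To illustrate the cache-key namespacing described in the `_norm` docstring, here is a small standalone sketch. It reproduces the key construction from the hunk above under the hypothetical name `cache_key`; the URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def cache_key(url: str, fmt: str = "markdown") -> str:
    """Hypothetical stand-in for _norm(): drop the fragment, prefix non-default formats."""
    p = urlsplit(url)
    key = urlunsplit((p.scheme, p.netloc, p.path, p.query, ""))
    # The default format keeps the bare URL, so rows written before 0.2.0 still match.
    return key if fmt == "markdown" else f"{fmt}\x00{key}"


default_key = cache_key("https://example.com/a?page=2#top")
html_key = cache_key("https://example.com/a?page=2#top", "html")
trimmed_key = cache_key("https://example.com/a?page=2#top", "markdown_trimmed")
```

Because a NUL byte cannot appear in a normalized URL, the prefixed keys should not collide with any unprefixed one. Note that `markdown_trimmed` is namespaced as well, so only plain `markdown` shares keys with the pre-0.2.0 cache.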

switchback-0.2.0/switchback/normalize.py ADDED

@@ -0,0 +1,183 @@
+"""Shared content normalization — HTML→Markdown and PDF→text.
+Ported from musings-by-hermes/scripts/muse_helpers.py (the most mature version):
+strips boilerplate, promotes lazy-loaded images, resolves relative URLs.
+"""
+from __future__ import annotations
+import io
+import logging
+import os
+import re
+import threading
+from contextlib import contextmanager
+logger = logging.getLogger(__name__)
+UA = ("Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 "
+      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
+# ── Output format ───────────────────────────────────────────────────────────
+# Default is markdown (today's behavior, byte-identical). Opt into other shapes
+# globally via SCRAPER_OUTPUT_FORMAT, or per-call via the scrape(fmt=...) /
+# --format / {"format": ...} overrides which set output_format_scope().
+#   markdown          whole-page markdown (default)
+#   markdown_trimmed  markdown with extra ad/nav/boilerplate lines removed
+#   html              raw HTML exactly as fetched (no cleaning)
+#   html_selectors    cleaned HTML (boilerplate strip + per-domain drop/selector),
+#                     not converted to markdown
+VALID_FORMATS = ("markdown", "markdown_trimmed", "html", "html_selectors")
+def _validate(fmt: str | None) -> str:
+    f = (fmt or "").strip().lower()
+    if f not in VALID_FORMATS:
+        logger.warning(f"unknown output format {fmt!r}; using 'markdown' "
+                       f"(valid: {', '.join(VALID_FORMATS)})")
+        return "markdown"
+    return f
+OUTPUT_FORMAT = _validate(os.getenv("SCRAPER_OUTPUT_FORMAT", "markdown"))
+# Per-thread active-format override. Thread-local by construction (like egress):
+# the orchestrator sets it on the worker thread that also runs the tier fetch +
+# this module's conversion, so concurrent server requests can't bleed formats.
+_scope = threading.local()
+@contextmanager
+def output_format_scope(fmt: str | None):
+    """Set the active output format for the enclosed work (per-thread). A falsy
+    fmt means 'use the SCRAPER_OUTPUT_FORMAT default'. Always restored on exit."""
+    prev = getattr(_scope, "fmt", None)
+    _scope.fmt = _validate(fmt) if fmt else None
+    try:
+        yield
+    finally:
+        _scope.fmt = prev
+def active_format() -> str:
+    """The format in effect for the current thread: the per-call override if set,
+    else the SCRAPER_OUTPUT_FORMAT default."""
+    return getattr(_scope, "fmt", None) or OUTPUT_FORMAT
+def output_key(fmt: str) -> str:
+    """The JSON/result key for a format's content family: html-family → "html",
+    markdown-family → "markdown". Lets the default path stay {"...","markdown"}."""
+    return "html" if fmt.startswith("html") else "markdown"
+def _clean_html(html: str, base_url: str | None = None) -> str:
+    """Return cleaned HTML: boilerplate stripped, per-domain drop/selector applied
+    (see switchback.extract), lazy-load image attrs promoted, relative image/link
+    URLs resolved against base_url. On any failure returns `html` unchanged."""
+    try:
+        from bs4 import BeautifulSoup
+        from urllib.parse import urljoin
+        from .extract import prefs_for
+        prefs = prefs_for(base_url)
+        soup = BeautifulSoup(html or "", "html.parser")
+        for tag in soup(["script", "style", "noscript", "nav", "header",
+                         "footer", "aside", "form", "iframe"]):
+            tag.decompose()
+        # Per-domain: remove configured noise, then scope to the content node.
+        for sel in prefs.get("drop", []):
+            for tag in soup.select(sel):
+                tag.decompose()
+        selector = prefs.get("selector")
+        if selector:
+            node = soup.select_one(selector)
+            if node is not None:
+                soup = BeautifulSoup(str(node), "html.parser")
+            else:
+                logger.debug(f"extract: selector {selector!r} matched nothing for {base_url}")
+        for img in soup.find_all("img"):
+            src = (img.get("src") or img.get("data-src")
+                   or img.get("data-original") or img.get("data-lazy-src"))
+            if not src and img.get("srcset"):
+                src = img["srcset"].split(",")[0].strip().split(" ")[0]
+            if src:
+                if base_url:
+                    src = urljoin(base_url, src)
+                img["src"] = src
+        if base_url:
+            for a in soup.find_all("a", href=True):
+                a["href"] = urljoin(base_url, a["href"])
+        return str(soup)
+    except Exception as e:
+        logger.debug(f"soup pre-clean skipped: {e}")
+        return html
+# Lines that markdown_trimmed drops: standalone images, link-only/nav rows, and
+# short promotional boilerplate. Conservative on purpose — prose is never touched.
+_TRIM_IMG_RE = re.compile(r"^!\[[^\]]*\]\([^)]*\)$")
+_TRIM_LINKS_ONLY_RE = re.compile(r"^(?:[-*>]\s*)?(?:\[[^\]]*\]\([^)]*\)[\s|·•\-–—]*)+$")
+_TRIM_BOILERPLATE_RE = re.compile(
+    r"^(subscribe|sign\s*up|sign\s*in|log\s*in|logout|newsletter|advertisement|"
+    r"accept\s+all|cookie|follow\s+us|share\s+this)\b", re.I)
+def _trim_markdown(md: str) -> str:
+    """Markdown minus common ad/nav/boilerplate noise. Drops only standalone-image
+    lines, link-only/nav rows, and short promotional boilerplate lines; keeps all
+    prose. Collapses 3+ blank lines to one."""
+    kept: list[str] = []
+    for line in md.splitlines():
+        s = line.strip()
+        if not s:
+            kept.append("")
+            continue
+        if _TRIM_IMG_RE.match(s) or _TRIM_LINKS_ONLY_RE.match(s):
+            continue
+        if len(s) <= 60 and _TRIM_BOILERPLATE_RE.match(s):
+            continue
+        kept.append(line)
+    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
+def render(html: str, base_url: str | None = None, fmt: str | None = None) -> str:
+    """Render fetched HTML in `fmt` (default: the active output format).
+    - markdown          whole-page markdown (boilerplate strip + per-domain prefs)
+    - markdown_trimmed  markdown with extra ad/nav/boilerplate lines removed
+    - html              the raw HTML, untouched
+    - html_selectors    cleaned HTML (per-domain prefs applied), not converted
+    """
+    fmt = _validate(fmt) if fmt is not None else active_format()
+    if fmt == "html":
+        return html or ""
+    if fmt == "html_selectors":
+        return (_clean_html(html, base_url) or "").strip()
+    try:
+        from markdownify import markdownify
+        cleaned = _clean_html(html, base_url)
+        md = (markdownify(cleaned, heading_style="ATX", code_language="",
+                          bullets="-", strip=["script", "style"]) or "").strip()
+        return _trim_markdown(md) if fmt == "markdown_trimmed" else md
+    except Exception as e:
+        logger.warning(f"markdownify failed: {e}")
+        return (html or "").strip()
+def html_to_markdown(html: str, base_url: str | None = None) -> str:
+    """Render `html` in the active output format (default markdown). Name kept for
+    back-compat: every tier calls this, so it automatically honors the selected
+    SCRAPER_OUTPUT_FORMAT / per-call format with no per-tier changes."""
+    return render(html, base_url)
+def pdf_bytes_to_text(data: bytes) -> str:
+    """Extract text from PDF bytes. In-memory only — nothing written to disk."""
+    from pypdf import PdfReader
+    buf = io.BytesIO(data)
+    try:
+        reader = PdfReader(buf)
+        return "\n\n".join((p.extract_text() or "") for p in reader.pages).strip()
+    finally:
+        buf.close()
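
The effect of `markdown_trimmed` is easiest to see on a small input. The sketch below copies the three regexes and the filtering loop from the added file into a standalone function (named `trim_markdown` here, without the leading underscore); the sample document is invented for illustration, and the en/em dashes in the link-row pattern are written as `\u2013\u2014` escapes, which should match the same characters.

```python
import re

# Patterns copied from the normalize.py hunk above.
TRIM_IMG_RE = re.compile(r"^!\[[^\]]*\]\([^)]*\)$")
TRIM_LINKS_ONLY_RE = re.compile(
    r"^(?:[-*>]\s*)?(?:\[[^\]]*\]\([^)]*\)[\s|·•\-\u2013\u2014]*)+$")
TRIM_BOILERPLATE_RE = re.compile(
    r"^(subscribe|sign\s*up|sign\s*in|log\s*in|logout|newsletter|advertisement|"
    r"accept\s+all|cookie|follow\s+us|share\s+this)\b", re.I)


def trim_markdown(md: str) -> str:
    """Drop standalone images, link-only rows, and short boilerplate lines."""
    kept: list[str] = []
    for line in md.splitlines():
        s = line.strip()
        if not s:
            kept.append("")
            continue
        if TRIM_IMG_RE.match(s) or TRIM_LINKS_ONLY_RE.match(s):
            continue
        if len(s) <= 60 and TRIM_BOILERPLATE_RE.match(s):
            continue
        kept.append(line)
    # Runs of blank lines left behind by dropped rows collapse to one blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()


# Invented sample input, not taken from the package.
sample = (
    "# Title\n\n"
    "![hero](https://example.com/a.png)\n\n"
    "[Home](/) | [About](/about)\n\n"
    "Subscribe to our newsletter\n\n"
    "Real prose with a [link](https://example.com) inside.\n"
)
trimmed = trim_markdown(sample)
print(trimmed)
```

The prose line survives even though it contains a link, because the link-only pattern requires the whole line to consist of links and separators.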

{switchback-0.1.0 → switchback-0.2.0}/switchback/orchestrator.py RENAMED

@@ -17,6 +17,7 @@ import time
 from dataclasses import dataclass, field
 from . import content_cache, egress, session_cache
+from .normalize import active_format, output_format_scope
 from .policy import botwall
 from .policy.gates import BotWall, RateLimited, ShortContent, classify_error, host_of
 from .tiers import TIERS, INDEX
@@ -68,8 +69,9 @@ _FAILURE_PRIORITY = {
 @dataclass
 class ScrapeResult:
     url: str
-    markdown: str
+    markdown: str       # the rendered content (format named by `format`)
     source_method: str  # tier NAME that won
+    format: str = "markdown"  # markdown | markdown_trimmed | html | html_selectors
 @dataclass
@@ -95,6 +97,7 @@ class ScrapeOutcome:
     latency_ms: int | None = None
     egress: str = "direct"         # "egress" if routed via SCRAPER_EGRESS_PROXY, else "direct"
     wire_bytes: int = 0            # bytes transferred over the network (cost basis for proxy GB)
+    format: str = "markdown"       # output format of `markdown` (the content field)
     attempts: list[TierAttempt] = field(default_factory=list)
@@ -180,7 +183,7 @@ def _run_one(url: str, db: dict) -> ScrapeOutcome:
         if botwall.is_url_skipped(url, db):
             return _skipped(url, root, "url_excluded",
                             db.get("urls", {}).get(url, {}).get("reason", ""))
-        hit = content_cache.get(url)
+        hit = content_cache.get(url, active_format())
         if hit:
             md, method = hit
             root.set(Attr.OUTCOME, "cache_hit")
@@ -188,7 +191,7 @@ def _run_one(url: str, db: dict) -> ScrapeOutcome:
             root.set(Attr.MD_LEN, len(md))
             logger.info(f"cache_hit {url} (was {method})")
             return ScrapeOutcome(url, True, markdown=md, source_method=method,
-                                 final_outcome="ok")
+                                 final_outcome="ok", format=active_format())
         # A needs_egress host runs the whole cascade in the egress scope, so the
         # tiers route through SCRAPER_EGRESS_PROXY (when set); easy hosts stay
         # direct and never spend residential bandwidth.
@@ -300,7 +303,7 @@ def _run_cascade(url, host, db, root, t0, deadline) -> ScrapeOutcome:
             sp.set(Attr.SOURCE, tier.NAME)
             sp.set(Attr.LATENCY_MS, dt)
             botwall.record(db, url, tier.NAME, "ok", md_len=len(md), latency_ms=dt)
-            content_cache.put(url, md, tier.NAME)
+            content_cache.put(url, md, tier.NAME, active_format())
             root.set(Attr.OUTCOME, "ok")
             root.set(Attr.SOURCE, tier.NAME)
             root.set(Attr.LATENCY_MS, total)
@@ -308,7 +311,8 @@ def _run_cascade(url, host, db, root, t0, deadline) -> ScrapeOutcome:
             logger.info(
                 f"{tier.NAME} OK {url} md_len={len(md)} {dt}ms (total {total}ms)")
             return ScrapeOutcome(url, True, markdown=md, source_method=tier.NAME,
-                                 final_outcome="ok", latency_ms=total, attempts=attempts)
+                                 final_outcome="ok", latency_ms=total,
+                                 format=active_format(), attempts=attempts)
     total = int((time.monotonic() - t0) * 1000)
     ec, sc = _dominant_failure(attempts)
@@ -323,21 +327,24 @@ def _run_cascade(url, host, db, root, t0, deadline) -> ScrapeOutcome:
                          latency_ms=total, attempts=attempts)
-def run_detailed(urls: list[str]) -> list[ScrapeOutcome]:
+def run_detailed(urls: list[str], fmt: str | None = None) -> list[ScrapeOutcome]:
     """Scrape each URL; return a full ScrapeOutcome (success or failure with the
-    per-tier cascade and a classified reason) for every URL."""
+    per-tier cascade and a classified reason) for every URL.
+    fmt overrides SCRAPER_OUTPUT_FORMAT for this call (None = use the default)."""
     db = botwall.load_db()
     out = []
     try:
-        for url in urls:
-            out.append(_run_one(url, db))
+        with output_format_scope(fmt):
+            for url in urls:
+                out.append(_run_one(url, db))
     finally:
         botwall.save_db(db)
         flush()
     return out
-def run(urls: list[str]) -> list[ScrapeResult]:
+def run(urls: list[str], fmt: str | None = None) -> list[ScrapeResult]:
     """Successes only (backward-compatible). Use run_detailed() for failures."""
-    return [ScrapeResult(o.url, o.markdown, o.source_method)
-            for o in run_detailed(urls) if o.ok]
+    return [ScrapeResult(o.url, o.markdown, o.source_method, o.format)
+            for o in run_detailed(urls, fmt) if o.ok]

{switchback-0.1.0 → switchback-0.2.0}/switchback/server.py RENAMED

@@ -25,15 +25,17 @@ from pydantic import BaseModel
 from . import session_trace
 from .api import scrape
+from .normalize import output_key
 from .reporting import build_report, domain_report
 from .search import search
 from .tracing import setup_logs
-app = FastAPI(title="switchback", version="0.1.0")
+app = FastAPI(title="switchback", version="0.2.0")
 class ScrapeRequest(BaseModel):
     urls: list[str]
+    format: str | None = None  # markdown (default) | markdown_trimmed | html | html_selectors
 @app.get("/healthz")
@@ -43,9 +45,12 @@ def healthz() -> dict:
 @app.post("/scrape")
 def scrape_endpoint(req: ScrapeRequest) -> list[dict]:
-    """Run URLs through the cascade. Returns successes only (failed URLs omitted)."""
-    return [{"url": r.url, "source_method": r.source_method, "markdown": r.markdown}
-            for r in scrape(req.urls)]
+    """Run URLs through the cascade. Returns successes only (failed URLs omitted).
+    Optional "format" selects the output shape; the content key is "markdown" for
+    markdown formats and "html" for html formats."""
+    return [{"url": r.url, "source_method": r.source_method,
+             output_key(r.format): r.markdown}
+            for r in scrape(req.urls, fmt=req.format)]
 @app.get("/search")
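
Since the `/scrape` response now carries its content under either `markdown` or `html`, a caller that accepts any format has to check both keys. A minimal sketch of that follows; the helper `content_of` and the sample items are hypothetical, shaped like the dicts built in `scrape_endpoint`.

```python
def content_of(item: dict) -> tuple[str, str]:
    """Return (key, content) for one /scrape result item.

    Hypothetical client-side helper: the key is "markdown" for markdown
    formats and "html" for html formats, per the endpoint docstring.
    """
    for key in ("markdown", "html"):
        if key in item:
            return key, item[key]
    raise KeyError("result has neither 'markdown' nor 'html'")


# Illustrative items only; real responses come from the server or CLI.
md_item = {"url": "https://example.com", "source_method": "tier1", "markdown": "# Hi"}
html_item = {"url": "https://example.com", "source_method": "tier1", "html": "<h1>Hi</h1>"}

md_key, md_body = content_of(md_item)
html_key, html_body = content_of(html_item)
```

The tier name `tier1` above is a placeholder; the diff only shows `tier4_firecrawl` by name.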

{switchback-0.1.0 → switchback-0.2.0}/switchback/tiers/tier4_firecrawl.py RENAMED

@@ -9,6 +9,7 @@ from __future__ import annotations
 import os
 import threading
+from ..normalize import active_format, render
 from ..policy.gates import check
 NAME = "tier4_firecrawl"
@@ -19,12 +20,18 @@ def disabled() -> bool:
     return bool(os.getenv("SCRAPER_DISABLE_FIRECRAWL"))
-def _scrape(url: str) -> str:
+def _scrape(url: str, fmt: str) -> str:
     from firecrawl import Firecrawl
     app = Firecrawl(api_key=os.environ["FIRECRAWL_API_KEY"])
-    doc = app.scrape(url, formats=["markdown"])
+    if fmt == "markdown":
+        doc = app.scrape(url, formats=["markdown"])
+        d = doc.model_dump() if hasattr(doc, "model_dump") else (doc if isinstance(doc, dict) else {})
+        return check(url, (d.get("markdown") or "").strip())
+    # Non-default formats: fetch HTML and derive every shape through normalize, so
+    # html / html_selectors / markdown_trimmed match the rest of the cascade.
+    doc = app.scrape(url, formats=["html"])
     d = doc.model_dump() if hasattr(doc, "model_dump") else (doc if isinstance(doc, dict) else {})
-    return check(url, (d.get("markdown") or "").strip())
+    return check(url, render(d.get("html") or "", base_url=url, fmt=fmt))
 def fetch(url: str) -> str:
@@ -32,11 +39,13 @@ def fetch(url: str) -> str:
     # the calling thread, which then makes a later sync-Playwright browser tier in
     # the same batch raise "Sync API inside the asyncio loop". A worker thread
     # confines that loop so the browser tiers stay usable across a multi-URL run.
+    # active_format() is thread-local, so read it here (main thread) and pass it in.
     box: dict = {}
+    fmt = active_format()
     def work():
         try:
-            box["md"] = _scrape(url)
+            box["md"] = _scrape(url, fmt)
         except BaseException as e:  # noqa: BLE001 — re-raised to the caller below
             box["err"] = e
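
Reviewer note: the new comment in `fetch` says `active_format()` is thread-local, which is why the value is read before the worker thread starts and then passed to `_scrape`. The sketch below reproduces that pitfall with the standard library only. `_state`, `set_format`, and this `active_format` are hypothetical stand-ins built on `threading.local`; the real helpers live in `switchback.normalize` and may differ.

```python
import threading

_state = threading.local()  # each thread sees its own attributes


def set_format(fmt: str) -> None:
    _state.fmt = fmt


def active_format() -> str:
    # Falls back to the default when the current thread never set a format.
    return getattr(_state, "fmt", "markdown")


def fetch_in_worker(capture_first: bool) -> str:
    """Run work() in a worker thread and report the format it observed."""
    box: dict = {}
    fmt = active_format() if capture_first else None  # read in the calling thread

    def work():
        # Without the captured value, the worker reads its own (empty) thread-local.
        box["fmt"] = fmt if capture_first else active_format()

    t = threading.Thread(target=work)
    t.start()
    t.join()
    return box["fmt"]


set_format("html")
print(fetch_in_worker(capture_first=False))  # markdown
print(fetch_in_worker(capture_first=True))   # html
```

Reading the value in the caller and passing it as an argument, as the diff does, avoids silently falling back to the default inside the worker.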

{switchback-0.1.0 → switchback-0.2.0}/switchback.egg-info/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: switchback
-Version: 0.1.0
+Version: 0.2.0
 Summary: One cost-ordered scrape cascade (HTTP → stealth browser → paid), shared by every tool.
 Author-email: Akash Kodavuru <akash@theaklabs.com>
 License: MIT
@@ -75,8 +75,8 @@ Dynamic: license-file
 Give it a URL; it tries the cheapest way to get clean Markdown first and only escalates
 to a heavier (slower, costlier) tier when the cheap one is walled. Stops at the first success.
-[![PyPI](https://img.shields.io/pypi/v/switchback.svg)](https://pypi.org/project/switchback/)
-[![Python](https://img.shields.io/pypi/pyversions/switchback.svg)](https://pypi.org/project/switchback/)
+[![PyPI](https://img.shields.io/pypi/v/switchback)](https://pypi.org/project/switchback/)
+[![Python](https://img.shields.io/pypi/pyversions/switchback)](https://pypi.org/project/switchback/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 [![CI](https://github.com/akash-kr/switchback/actions/workflows/ci.yml/badge.svg)](https://github.com/akash-kr/switchback/actions/workflows/ci.yml)
@@ -269,6 +269,7 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
 <details>
 <summary><b>Tunables</b> — budgets, timeouts, caches, backoff</summary>
+- `SCRAPER_OUTPUT_FORMAT` — output shape: `markdown` (default) · `markdown_trimmed` · `html` · `html_selectors` (see [Output formats](#output-formats))
 - `SCRAPER_DEADLINE_S` — per-URL budget (45s)
 - `SCRAPER_CAMOUFOX_TIMEOUT_MS` — (45000)
 - `SCRAPER_BROWSER_CONCURRENCY` — max simultaneous headless browsers (default 1)
@@ -297,6 +298,34 @@ trace zip to `state/traces/`. Manage them over HTTP — `GET /traces` (list),
 `GET /traces/{id}` (download), `DELETE /traces/{id}` — and open one with
 `playwright show-trace <zip>`. Off by default (traces are MBs each).
+### Output formats
+Markdown is the default and is unchanged. Pick a different shape globally with
+`SCRAPER_OUTPUT_FORMAT`, or per call:
+```python
+from switchback import scrape
+scrape(["https://example.com/article"])                    # markdown (default)
+scrape(["https://example.com/article"], fmt="html")        # raw HTML
+scrape(["https://example.com/article"], fmt="markdown_trimmed")
+```
+```bash
+switchback --format html_selectors https://example.com/article
+curl -s localhost:8799/scrape -d '{"urls":["https://example.com"],"format":"html"}'
+```
+| format | what you get |
+| --- | --- |
+| `markdown` | whole-page markdown (boilerplate stripped + per-domain prefs) — **default** |
+| `markdown_trimmed` | markdown with extra ad/nav/boilerplate lines removed |
+| `html` | the raw HTML exactly as fetched, untouched |
+| `html_selectors` | cleaned HTML (boilerplate strip + per-domain `drop`/`selector`), not converted |
+The chosen content rides in the result's `markdown` field; in the CLI/server JSON
+the key is `markdown` for markdown formats and `html` for html formats. The
+API/PDF tiers (arXiv synth, PDF→text) have no HTML, so html formats fall back to
+their text for those sources.
 ### Per-domain extraction
 Markdown of the whole page is the default. To scope a site to its content node or
 strip site-specific noise, declare prefs per host in `config/extraction.json`
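
Reviewer note: the Output formats section added above describes a global setting (`SCRAPER_OUTPUT_FORMAT`) and a per-call override (`fmt=...`, `--format`, or the `format` field). One plausible precedence is sketched below. `resolve_format` and `FORMATS` are hypothetical names, and rejecting unknown values with `ValueError` is an assumption; the diff does not show how the package validates the setting.

```python
import os

# The four format names listed in the README table.
FORMATS = ("markdown", "markdown_trimmed", "html", "html_selectors")


def resolve_format(fmt=None, env=None) -> str:
    """Per-call value wins, then the environment variable, then markdown."""
    env = os.environ if env is None else env
    chosen = fmt or env.get("SCRAPER_OUTPUT_FORMAT") or "markdown"
    if chosen not in FORMATS:
        # Assumption: unknown names are an error rather than a silent fallback.
        raise ValueError(f"unknown output format: {chosen!r}")
    return chosen


print(resolve_format(env={}))                                          # markdown
print(resolve_format(env={"SCRAPER_OUTPUT_FORMAT": "html"}))           # html
print(resolve_format("markdown_trimmed",
                     env={"SCRAPER_OUTPUT_FORMAT": "html"}))           # markdown_trimmed
```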

switchback-0.1.0/switchback/normalize.py DELETED Viewed

@@ -1,81 +0,0 @@
-"""Shared content normalization — HTML→Markdown and PDF→text.
-Ported from musings-by-hermes/scripts/muse_helpers.py (the most mature version):
-strips boilerplate, promotes lazy-loaded images, resolves relative URLs.
-"""
-from __future__ import annotations
-import io
-import logging
-logger = logging.getLogger(__name__)
-UA = ("Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 "
-      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
-def html_to_markdown(html: str, base_url: str | None = None) -> str:
-    """HTML → Markdown, preserving images/blockquotes/code.
-    - Strips script/style/nav/header/footer/aside boilerplate.
-    - Applies any per-domain extraction prefs (scope selector / extra drops),
-      see switchback.extract.
-    - Promotes lazy-load attrs (data-src, data-original, srcset) to src.
-    - Resolves relative image/link URLs against base_url.
-    """
-    try:
-        from markdownify import markdownify
-        try:
-            from bs4 import BeautifulSoup
-            from urllib.parse import urljoin
-            from .extract import prefs_for
-            prefs = prefs_for(base_url)
-            soup = BeautifulSoup(html or "", "html.parser")
-            for tag in soup(["script", "style", "noscript", "nav", "header",
-                             "footer", "aside", "form", "iframe"]):
-                tag.decompose()
-            # Per-domain: remove configured noise, then scope to the content node.
-            for sel in prefs.get("drop", []):
-                for tag in soup.select(sel):
-                    tag.decompose()
-            selector = prefs.get("selector")
-            if selector:
-                node = soup.select_one(selector)
-                if node is not None:
-                    soup = BeautifulSoup(str(node), "html.parser")
-                else:
-                    logger.debug(f"extract: selector {selector!r} matched nothing for {base_url}")
-            for img in soup.find_all("img"):
-                src = (img.get("src") or img.get("data-src")
-                       or img.get("data-original") or img.get("data-lazy-src"))
-                if not src and img.get("srcset"):
-                    src = img["srcset"].split(",")[0].strip().split(" ")[0]
-                if src:
-                    if base_url:
-                        src = urljoin(base_url, src)
-                    img["src"] = src
-            if base_url:
-                for a in soup.find_all("a", href=True):
-                    a["href"] = urljoin(base_url, a["href"])
-            html = str(soup)
-        except Exception as e:
-            logger.debug(f"soup pre-clean skipped: {e}")
-        md = markdownify(html, heading_style="ATX", code_language="",
-                         bullets="-", strip=["script", "style"])
-        return (md or "").strip()
-    except Exception as e:
-        logger.warning(f"markdownify failed: {e}")
-        return (html or "").strip()
-def pdf_bytes_to_text(data: bytes) -> str:
-    """Extract text from PDF bytes. In-memory only — nothing written to disk."""
-    from pypdf import PdfReader
-    buf = io.BytesIO(data)
-    try:
-        reader = PdfReader(buf)
-        return "\n\n".join((p.extract_text() or "") for p in reader.pages).strip()
-    finally:
-        buf.close()
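
Reviewer note: the removed `html_to_markdown` promoted lazy-loaded images by falling back from `src` to `data-src`, `data-original`, and `data-lazy-src`, then to the first `srcset` candidate, and finally resolved the result against `base_url`. The sketch below isolates that selection logic on a plain attribute dict so it runs without BeautifulSoup. `pick_img_src` is a hypothetical helper name, not a function from the package.

```python
from urllib.parse import urljoin


def pick_img_src(attrs: dict, base_url=None):
    """Return the best image URL for a tag's attributes, or None."""
    src = (attrs.get("src") or attrs.get("data-src")
           or attrs.get("data-original") or attrs.get("data-lazy-src"))
    if not src and attrs.get("srcset"):
        # First srcset candidate, with its width/density descriptor dropped.
        src = attrs["srcset"].split(",")[0].strip().split(" ")[0]
    if src and base_url:
        src = urljoin(base_url, src)  # resolve relative paths
    return src or None


print(pick_img_src({"data-src": "/img/a.png"}, "https://example.com/post/1"))
print(pick_img_src({"srcset": "small.jpg 480w, large.jpg 1080w"},
                   "https://example.com/post/"))
print(pick_img_src({}))
```

Note that the first `srcset` candidate is usually the smallest image, so this approach favours having some image over the best one.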

{switchback-0.1.0 → switchback-0.2.0}/CONTRIBUTING.md RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/LICENSE RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/MANIFEST.in RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/NOTICE RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/SECURITY.md RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/clients/node_bridge.md RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/config/botwall_skip_urls.txt RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/config/extraction.example.json RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/setup.cfg RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/__init__.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/__main__.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/concurrency.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/egress.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/extract.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/flags.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/policy/__init__.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/policy/botwall.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/policy/gates.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/py.typed RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/reporting.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/search.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/session_cache.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/session_trace.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/tiers/__init__.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/tiers/_browser.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/tiers/tier0_apis.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/tiers/tier1_http.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/tiers/tier2_cloudscraper.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/tiers/tier3_browser.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/tiers/tier3b_camoufox.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/tiers/tier_residential.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback/tracing.py RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback.egg-info/SOURCES.txt RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback.egg-info/dependency_links.txt RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback.egg-info/entry_points.txt RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback.egg-info/requires.txt RENAMED Viewed

File without changes

{switchback-0.1.0 → switchback-0.2.0}/switchback.egg-info/top_level.txt RENAMED Viewed

File without changes
