rolling-reader 0.2.0__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,116 @@
1
+ Metadata-Version: 2.4
2
+ Name: rolling-reader
3
+ Version: 0.3.0
4
+ Summary: Local-first web scraper that automatically rolls through HTTP → browser → JS state extraction
5
+ License: MIT
6
+ Requires-Python: >=3.11
7
+ Requires-Dist: beautifulsoup4>=4.14
8
+ Requires-Dist: httpx>=0.28
9
+ Requires-Dist: playwright>=1.44
10
+ Requires-Dist: typer>=0.12
11
+ Description-Content-Type: text/markdown
12
+
13
+ # rolling-reader
14
+
15
+ Local-first web scraper that automatically rolls through HTTP → browser → JS state extraction.
16
+
17
+ ## Install
18
+
19
+ ```bash
20
+ pip install rolling-reader
21
+ playwright install chromium # required for Level 2 / Level 3
22
+ ```
23
+
24
+ Python 3.11+. No Node.js required.
25
+
26
+ ## Quick start
27
+
28
+ **Static page** (Level 1, no browser needed):
29
+
30
+ ```bash
31
+ rr https://news.ycombinator.com/
32
+ ```
33
+
34
+ **SPA or login-required page** (Level 2, reuses your existing Chrome session):
35
+
36
+ ```bash
37
+ # 1. Start Chrome with remote debugging enabled (see section below)
38
+ # 2. Run the command — Level 2 is selected automatically
39
+ rr https://app.example.com/dashboard
40
+ ```
41
+
42
+ **Output as Markdown:**
43
+
44
+ ```bash
45
+ rr https://example.com --output md
46
+ ```
47
+
48
+ ## How it works
49
+
50
+ | Level | Trigger | Speed |
51
+ |-------|---------|-------|
52
+ | 1 HTTP | Standard SSR page, no JS rendering needed | ~500 ms |
53
+ | 2 CDP | SPA, JS rendering required, or auth-gated | ~3 s |
54
+ | 3 JS State | Next.js / Nuxt / Redux / Remix state variable detected | ~1 s (3–4x faster than Level 2 DOM parse) |
55
+
56
+ The dispatcher probes each level in order and stops at the first one that returns usable content. Level 3 is attempted after Level 2 attaches to the browser — if a known JS state variable is found, DOM parsing is skipped entirely.
57
+
58
+ ## Starting Chrome for Level 2 / Level 3
59
+
60
+ Chrome must be running with remote debugging before invoking Level 2 or Level 3:
61
+
62
+ ```bash
63
+ # macOS
64
+ open -a "Google Chrome" --args --remote-debugging-port=9222
65
+
66
+ # Windows
67
+ chrome --remote-debugging-port=9222
68
+
69
+ # Linux
70
+ google-chrome --remote-debugging-port=9222
71
+ ```
72
+
73
+ The existing Chrome session (including cookies and local storage) is reused — no separate login step required.
74
+
75
+ ## CLI options
76
+
77
+ | Flag | Values | Description |
78
+ |------|--------|-------------|
79
+ | `--output` | `json`, `md` | Output format (default: plain text) |
80
+ | `--force-level` | `1`, `2`, `3` | Skip auto-detection, force a specific level |
81
+ | `--json-path` | dot-notation string | Extract a nested key from JSON output, e.g. `title` or `props.pageProps` |
82
+ | `--no-cache` | — | Disable response cache |
83
+ | `--cdp` | URL | Chrome DevTools endpoint for Level 2 / Level 3 (default: `http://localhost:9222`) |
84
+ | `--verbose` | — | Print level selection reasoning and timing |
85
+
86
+ ## Why not X
87
+
88
+ | Tool | Limitation |
89
+ |------|-----------|
90
+ | **Scrapling** | Cannot reuse an existing logged-in Chrome session; no JS state extraction |
91
+ | **Firecrawl** | Cloud API — data leaves your machine, metered pricing |
92
+ | **Jina Reader** | Cloud API — data leaves your machine, metered pricing |
93
+ | **rolling-reader** | Fully local, reuses your Chrome session and cookies, free forever |
94
+
95
+ ## Supported JS state variables (v0.3)
96
+
97
+ The following `window.*` variables are probed automatically for Level 3 extraction:
98
+
99
+ - `window.__NEXT_DATA__` — Next.js (Vercel ecosystem)
100
+ - `window.__NUXT__` — Nuxt.js
101
+ - `window.__PRELOADED_STATE__` — Redux / custom
102
+ - `window.__INITIAL_STATE__` — various frameworks
103
+ - `window.__REDUX_STATE__` — Redux explicit naming
104
+ - `window.__APP_STATE__` — various frameworks
105
+ - `window.__STATE__` — generic
106
+ - `window.__STORE__` — MobX / custom
107
+ - `window.APP_STATE` — no-underscore variant
108
+ - `window.initialState` — camelCase variant
109
+ - `window.__remixContext` — Remix
110
+ - `window.__staticRouterHydrationData` — React Router v6 SSR
111
+
112
+ Unknown variables matching the pattern `window.VAR = {…}` are also detected via regex scan.
113
+
114
+ ## License
115
+
116
+ MIT
@@ -0,0 +1,104 @@
1
+ # rolling-reader
2
+
3
+ Local-first web scraper that automatically rolls through HTTP → browser → JS state extraction.
4
+
5
+ ## Install
6
+
7
+ ```bash
8
+ pip install rolling-reader
9
+ playwright install chromium # required for Level 2 / Level 3
10
+ ```
11
+
12
+ Python 3.11+. No Node.js required.
13
+
14
+ ## Quick start
15
+
16
+ **Static page** (Level 1, no browser needed):
17
+
18
+ ```bash
19
+ rr https://news.ycombinator.com/
20
+ ```
21
+
22
+ **SPA or login-required page** (Level 2, reuses your existing Chrome session):
23
+
24
+ ```bash
25
+ # 1. Start Chrome with remote debugging enabled (see section below)
26
+ # 2. Run the command — Level 2 is selected automatically
27
+ rr https://app.example.com/dashboard
28
+ ```
29
+
30
+ **Output as Markdown:**
31
+
32
+ ```bash
33
+ rr https://example.com --output md
34
+ ```
35
+
36
+ ## How it works
37
+
38
+ | Level | Trigger | Speed |
39
+ |-------|---------|-------|
40
+ | 1 HTTP | Standard SSR page, no JS rendering needed | ~500 ms |
41
+ | 2 CDP | SPA, JS rendering required, or auth-gated | ~3 s |
42
+ | 3 JS State | Next.js / Nuxt / Redux / Remix state variable detected | ~1 s (3–4x faster than Level 2 DOM parse) |
43
+
44
+ The dispatcher probes each level in order and stops at the first one that returns usable content. Level 3 is attempted after Level 2 attaches to the browser — if a known JS state variable is found, DOM parsing is skipped entirely.
45
+
46
+ ## Starting Chrome for Level 2 / Level 3
47
+
48
+ Chrome must be running with remote debugging before invoking Level 2 or Level 3:
49
+
50
+ ```bash
51
+ # macOS
52
+ open -a "Google Chrome" --args --remote-debugging-port=9222
53
+
54
+ # Windows
55
+ chrome --remote-debugging-port=9222
56
+
57
+ # Linux
58
+ google-chrome --remote-debugging-port=9222
59
+ ```
60
+
61
+ The existing Chrome session (including cookies and local storage) is reused — no separate login step required.
62
+
63
+ ## CLI options
64
+
65
+ | Flag | Values | Description |
66
+ |------|--------|-------------|
67
+ | `--output` | `json`, `md` | Output format (default: plain text) |
68
+ | `--force-level` | `1`, `2`, `3` | Skip auto-detection, force a specific level |
69
+ | `--json-path` | dot-notation string | Extract a nested key from JSON output, e.g. `title` or `props.pageProps` |
70
+ | `--no-cache` | — | Disable response cache |
71
+ | `--cdp` | URL | Chrome DevTools endpoint for Level 2 / Level 3 (default: `http://localhost:9222`) |
72
+ | `--verbose` | — | Print level selection reasoning and timing |
73
+
74
+ ## Why not X
75
+
76
+ | Tool | Limitation |
77
+ |------|-----------|
78
+ | **Scrapling** | Cannot reuse an existing logged-in Chrome session; no JS state extraction |
79
+ | **Firecrawl** | Cloud API — data leaves your machine, metered pricing |
80
+ | **Jina Reader** | Cloud API — data leaves your machine, metered pricing |
81
+ | **rolling-reader** | Fully local, reuses your Chrome session and cookies, free forever |
82
+
83
+ ## Supported JS state variables (v0.3)
84
+
85
+ The following `window.*` variables are probed automatically for Level 3 extraction:
86
+
87
+ - `window.__NEXT_DATA__` — Next.js (Vercel ecosystem)
88
+ - `window.__NUXT__` — Nuxt.js
89
+ - `window.__PRELOADED_STATE__` — Redux / custom
90
+ - `window.__INITIAL_STATE__` — various frameworks
91
+ - `window.__REDUX_STATE__` — Redux explicit naming
92
+ - `window.__APP_STATE__` — various frameworks
93
+ - `window.__STATE__` — generic
94
+ - `window.__STORE__` — MobX / custom
95
+ - `window.APP_STATE` — no-underscore variant
96
+ - `window.initialState` — camelCase variant
97
+ - `window.__remixContext` — Remix
98
+ - `window.__staticRouterHydrationData` — React Router v6 SSR
99
+
100
+ Unknown variables matching the pattern `window.VAR = {…}` are also detected via regex scan.
101
+
102
+ ## License
103
+
104
+ MIT
@@ -4,8 +4,10 @@ build-backend = "hatchling.build"
4
4
 
5
5
  [project]
6
6
  name = "rolling-reader"
7
- version = "0.2.0"
8
- description = "Local-first CLI web scraper with automatic strategy selection"
7
+ version = "0.3.0"
8
+ description = "Local-first web scraper that automatically rolls through HTTP → browser → JS state extraction"
9
+ readme = "README.md"
10
+ license = { text = "MIT" }
9
11
  requires-python = ">=3.11"
10
12
  dependencies = [
11
13
  "httpx>=0.28",
@@ -0,0 +1,162 @@
1
+ """
2
+ rolling_reader/cli.py
3
+ =================
4
+ CLI 入口(typer)
5
+
6
+ 用法:
7
+ rr <url>
8
+ rr <url> --output md
9
+ rr <url> --force-level 2
10
+ rr <url> --json-path props.pageProps
11
+ rr <url> --no-cache --verbose
12
+ """
13
+
14
+ from __future__ import annotations
15
+
16
+ import asyncio
17
+ import json
18
+ import sys
19
+ from enum import Enum
20
+ from typing import Optional
21
+
22
+ import typer
23
+
24
+ from rolling_reader.dispatcher import dispatch
25
+ from rolling_reader.models import ExtractionError
26
+
27
+ app = typer.Typer(
28
+ name="rr",
29
+ help="Local-first web scraper — automatically selects HTTP, CDP, or JS state extraction.",
30
+ add_completion=False,
31
+ )
32
+
33
+
34
+ class OutputFormat(str, Enum):
35
+ json = "json"
36
+ md = "md"
37
+
38
+
39
+ def _resolve_json_path(data: dict, path: str):
40
+ """
41
+ 按 dot-notation 路径从字典中取值。
42
+ 例:path="props.pageProps.title" → data["props"]["pageProps"]["title"]
43
+ 路径不存在时返回 None。
44
+ """
45
+ current = data
46
+ for key in path.split("."):
47
+ if isinstance(current, dict) and key in current:
48
+ current = current[key]
49
+ else:
50
+ return None
51
+ return current
52
+
53
+
54
+ @app.command()
55
+ def scrape(
56
+ url: str = typer.Argument(..., help="Target URL to scrape"),
57
+ output: OutputFormat = typer.Option(
58
+ OutputFormat.json,
59
+ "--output", "-o",
60
+ help="Output format: json (default) or md (markdown)",
61
+ ),
62
+ force_level: Optional[int] = typer.Option(
63
+ None,
64
+ "--force-level", "-l",
65
+ help="Force a specific extraction level (1=HTTP, 2=CDP, 3=JS state)",
66
+ min=1,
67
+ max=3,
68
+ ),
69
+ json_path: Optional[str] = typer.Option(
70
+ None,
71
+ "--json-path", "-p",
72
+ help=(
73
+ "Dot-notation path into the result JSON. "
74
+ "Top-level keys: url, level, title, text, links, elapsed_ms. "
75
+ "For Level 3 JS state results, dig into nested data, e.g. --json-path text"
76
+ ),
77
+ ),
78
+ no_cache: bool = typer.Option(
79
+ False,
80
+ "--no-cache",
81
+ help="Bypass profile cache and always re-explore the best strategy",
82
+ ),
83
+ cdp_endpoint: str = typer.Option(
84
+ "http://localhost:9222",
85
+ "--cdp",
86
+ help="Chrome DevTools endpoint (default: http://localhost:9222)",
87
+ ),
88
+ verbose: bool = typer.Option(
89
+ False,
90
+ "--verbose", "-v",
91
+ help="Print escalation steps to stderr",
92
+ ),
93
+ ) -> None:
94
+ """Scrape a URL and output structured data."""
95
+
96
+ try:
97
+ result = asyncio.run(
98
+ dispatch(
99
+ url,
100
+ force_level=force_level,
101
+ cdp_endpoint=cdp_endpoint,
102
+ verbose=verbose,
103
+ use_cache=not no_cache,
104
+ )
105
+ )
106
+ except ExtractionError as e:
107
+ _print_error(e)
108
+ raise typer.Exit(code=1)
109
+ except KeyboardInterrupt:
110
+ raise typer.Exit(code=130)
111
+
112
+ # ── --json-path:从结果中提取指定字段 ──────────────────────────────────
113
+ if json_path:
114
+ value = _resolve_json_path(result.to_dict(), json_path)
115
+ if value is None:
116
+ typer.echo(
117
+ f"Error: path '{json_path}' not found in result. "
118
+ f"Available top-level keys: {', '.join(result.to_dict().keys())}",
119
+ err=True,
120
+ )
121
+ raise typer.Exit(code=1)
122
+ # 字符串直接输出,其他类型序列化为 JSON
123
+ if isinstance(value, str):
124
+ typer.echo(value)
125
+ else:
126
+ typer.echo(json.dumps(value, ensure_ascii=False, indent=2))
127
+ return
128
+
129
+ # ── 正常输出 ────────────────────────────────────────────────────────────
130
+ if output == OutputFormat.json:
131
+ typer.echo(result.to_json())
132
+ else:
133
+ typer.echo(result.to_markdown())
134
+
135
+
136
+ def _print_error(e: ExtractionError) -> None:
137
+ """格式化错误输出,给出可行动的提示。"""
138
+ reason = e.reason or str(e)
139
+
140
+ # Chrome 未启动
141
+ if "Chrome is not available" in reason or "Cannot connect to Chrome" in reason:
142
+ typer.echo(
143
+ "\nError: Chrome is not running with remote debugging enabled.\n\n"
144
+ "Start Chrome first:\n"
145
+ " macOS: open -a 'Google Chrome' --args --remote-debugging-port=9222\n"
146
+ " Windows: chrome --remote-debugging-port=9222\n"
147
+ " Linux: google-chrome --remote-debugging-port=9222\n",
148
+ err=True,
149
+ )
150
+ return
151
+
152
+ # 超时
153
+ if "timeout" in reason.lower():
154
+ typer.echo(
155
+ f"\nError: Request timed out — {reason}\n\n"
156
+ "Try: rr <url> --force-level 2 (use Chrome for slow-loading pages)\n",
157
+ err=True,
158
+ )
159
+ return
160
+
161
+ # 通用
162
+ typer.echo(f"\nError: {reason}\n", err=True)
@@ -1,9 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: rolling-reader
3
- Version: 0.2.0
4
- Summary: Local-first CLI web scraper with automatic strategy selection
5
- Requires-Python: >=3.11
6
- Requires-Dist: beautifulsoup4>=4.14
7
- Requires-Dist: httpx>=0.28
8
- Requires-Dist: playwright>=1.44
9
- Requires-Dist: typer>=0.12
@@ -1,83 +0,0 @@
1
- """
2
- rolling_reader/cli.py
3
- =================
4
- CLI 入口(typer)
5
-
6
- 用法:
7
- scrape <url>
8
- scrape <url> --output md
9
- scrape <url> --force-level 2
10
- scrape <url> --verbose
11
- """
12
-
13
- from __future__ import annotations
14
-
15
- import asyncio
16
- import sys
17
- from enum import Enum
18
- from typing import Optional
19
-
20
- import typer
21
-
22
- from rolling_reader.dispatcher import dispatch
23
- from rolling_reader.models import ExtractionError
24
-
25
- app = typer.Typer(
26
- name="scrape",
27
- help="Local-first web scraper with automatic strategy selection.",
28
- add_completion=False,
29
- )
30
-
31
-
32
- class OutputFormat(str, Enum):
33
- json = "json"
34
- md = "md"
35
-
36
-
37
- @app.command()
38
- def scrape(
39
- url: str = typer.Argument(..., help="Target URL to scrape"),
40
- output: OutputFormat = typer.Option(
41
- OutputFormat.json,
42
- "--output", "-o",
43
- help="Output format: json (default) or md (markdown)",
44
- ),
45
- force_level: Optional[int] = typer.Option(
46
- None,
47
- "--force-level", "-l",
48
- help="Force a specific extraction level (1=HTTP, 2=CDP)",
49
- min=1,
50
- max=2,
51
- ),
52
- cdp_endpoint: str = typer.Option(
53
- "http://localhost:9222",
54
- "--cdp",
55
- help="Chrome DevTools endpoint",
56
- ),
57
- verbose: bool = typer.Option(
58
- False,
59
- "--verbose", "-v",
60
- help="Print escalation steps to stderr",
61
- ),
62
- ) -> None:
63
- """Scrape a URL and output structured data."""
64
-
65
- try:
66
- result = asyncio.run(
67
- dispatch(
68
- url,
69
- force_level=force_level,
70
- cdp_endpoint=cdp_endpoint,
71
- verbose=verbose,
72
- )
73
- )
74
- except ExtractionError as e:
75
- typer.echo(f"Error: {e.reason}", err=True)
76
- raise typer.Exit(code=1)
77
- except KeyboardInterrupt:
78
- raise typer.Exit(code=130)
79
-
80
- if output == OutputFormat.json:
81
- typer.echo(result.to_json())
82
- else:
83
- typer.echo(result.to_markdown())