scrapingbee-cli 1.2.2.tar.gz → 1.3.0.tar.gz (PyPI version diff)

Files changed (38)

{scrapingbee_cli-1.2.2/src/scrapingbee_cli.egg-info → scrapingbee_cli-1.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: scrapingbee-cli
-Version: 1.2.2
+Version: 1.3.0
 Summary: Command-line client for the ScrapingBee API: scrape pages (single or batch), crawl sites, check usage/credits, and use Google Search, Fast Search, Amazon, Walmart, YouTube, and ChatGPT from the terminal.
 Author: ScrapingBee
 License-Expression: MIT
@@ -47,9 +47,17 @@ Command-line client for the [ScrapingBee](https://www.scrapingbee.com/) API: scr
 ## Installation
+**Recommended** — install with [uv](https://docs.astral.sh/uv/) (no virtual environment needed):
+```bash
+curl -LsSf https://astral.sh/uv/install.sh | sh
+uv tool install scrapingbee-cli
+```
+**Alternative** — install with pip in a virtual environment:
 ```bash
 pip install scrapingbee-cli
-# or (isolated): pipx install scrapingbee-cli
 ```
 From source: clone the repo and run `pip install -e .` in the project root.
@@ -88,7 +96,7 @@ scrapingbee [command] [arguments] [options]
 | `amazon-product` / `amazon-search` | Amazon product and search |
 | `walmart-search` / `walmart-product` | Walmart search and product |
 | `youtube-search` / `youtube-metadata` | YouTube search and video metadata |
-| `chatgpt` | ChatGPT API |
+| `chatgpt` | ChatGPT API (`--search true` for web-enhanced responses) |
 | `export` | Merge batch/crawl output to ndjson, txt, or csv (with --flatten, --columns) |
 | `schedule` | Schedule commands via cron (--name, --list, --stop) |
@@ -125,8 +133,18 @@ scrapingbee schedule --every 1d --name price-tracker scrape --input-file product
 scrapingbee schedule --list
 ```
+## Security
+The `--post-process`, `--on-complete`, and `schedule` commands execute arbitrary shell commands on your machine. These features are **disabled by default** and require explicit human setup to enable.
+For advanced features setup, see the Security section in our [CLI documentation](https://www.scrapingbee.com/documentation/cli/).
+**Do not enable these features in AI agent environments** where commands may be constructed from scraped web content. ScrapingBee is not responsible for any damages caused by shell execution features. Use at your own discretion.
 ## More information
+- **[CLI Documentation](https://www.scrapingbee.com/documentation/cli/)** – Full CLI reference with pipelines, parameters, and examples.
+- **[Advanced usage examples](docs/advanced-usage.md)** – Shell piping, command chaining, batch workflows, monitoring scripts, NDJSON streaming, screenshots, Google search patterns, LLM chunking, and more.
 - **[ScrapingBee API documentation](https://www.scrapingbee.com/documentation/)** – Parameters, response formats, credit costs, and best practices.
 - **Claude / AI agents:** This repo includes a [Claude Skill](https://github.com/ScrapingBee/scrapingbee-cli/tree/main/skills/scrapingbee-cli) and [Claude Plugin](.claude-plugin/) for agent use with file-based output and security rules.

{scrapingbee_cli-1.2.2 → scrapingbee_cli-1.3.0}/README.md RENAMED

@@ -10,9 +10,17 @@ Command-line client for the [ScrapingBee](https://www.scrapingbee.com/) API: scr
 ## Installation
+**Recommended** — install with [uv](https://docs.astral.sh/uv/) (no virtual environment needed):
+```bash
+curl -LsSf https://astral.sh/uv/install.sh | sh
+uv tool install scrapingbee-cli
+```
+**Alternative** — install with pip in a virtual environment:
 ```bash
 pip install scrapingbee-cli
-# or (isolated): pipx install scrapingbee-cli
 ```
 From source: clone the repo and run `pip install -e .` in the project root.
@@ -51,7 +59,7 @@ scrapingbee [command] [arguments] [options]
 | `amazon-product` / `amazon-search` | Amazon product and search |
 | `walmart-search` / `walmart-product` | Walmart search and product |
 | `youtube-search` / `youtube-metadata` | YouTube search and video metadata |
-| `chatgpt` | ChatGPT API |
+| `chatgpt` | ChatGPT API (`--search true` for web-enhanced responses) |
 | `export` | Merge batch/crawl output to ndjson, txt, or csv (with --flatten, --columns) |
 | `schedule` | Schedule commands via cron (--name, --list, --stop) |
@@ -88,8 +96,18 @@ scrapingbee schedule --every 1d --name price-tracker scrape --input-file product
 scrapingbee schedule --list
 ```
+## Security
+The `--post-process`, `--on-complete`, and `schedule` commands execute arbitrary shell commands on your machine. These features are **disabled by default** and require explicit human setup to enable.
+For advanced features setup, see the Security section in our [CLI documentation](https://www.scrapingbee.com/documentation/cli/).
+**Do not enable these features in AI agent environments** where commands may be constructed from scraped web content. ScrapingBee is not responsible for any damages caused by shell execution features. Use at your own discretion.
 ## More information
+- **[CLI Documentation](https://www.scrapingbee.com/documentation/cli/)** – Full CLI reference with pipelines, parameters, and examples.
+- **[Advanced usage examples](docs/advanced-usage.md)** – Shell piping, command chaining, batch workflows, monitoring scripts, NDJSON streaming, screenshots, Google search patterns, LLM chunking, and more.
 - **[ScrapingBee API documentation](https://www.scrapingbee.com/documentation/)** – Parameters, response formats, credit costs, and best practices.
 - **Claude / AI agents:** This repo includes a [Claude Skill](https://github.com/ScrapingBee/scrapingbee-cli/tree/main/skills/scrapingbee-cli) and [Claude Plugin](.claude-plugin/) for agent use with file-based output and security rules.

{scrapingbee_cli-1.2.2 → scrapingbee_cli-1.3.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "scrapingbee-cli"
-version = "1.2.2"
+version = "1.3.0"
 description = "Command-line client for the ScrapingBee API: scrape pages (single or batch), crawl sites, check usage/credits, and use Google Search, Fast Search, Amazon, Walmart, YouTube, and ChatGPT from the terminal."
 readme = "README.md"
 license = "MIT"

scrapingbee_cli-1.3.0/src/scrapingbee_cli/__init__.py ADDED

@@ -0,0 +1,16 @@
+"""ScrapingBee CLI - Command-line client for the ScrapingBee API."""
+import platform
+import sys
+__version__ = "1.3.0"
+def user_agent() -> str:
+    """Build a descriptive User-Agent string for API requests.
+    Format: scrapingbee-cli/1.2.3 Python/3.12.0 (Darwin arm64)
+    """
+    py = f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}"
+    os_info = f"{platform.system()} {platform.machine()}"
+    return f"scrapingbee-cli/{__version__} Python/{py} ({os_info})"
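
The new `user_agent()` helper replaces the fixed `ScrapingBee/CLI` header with a string that carries the CLI version, Python version, and platform. A minimal standalone sketch of the same construction follows; the function body is copied from the hunk above, while `UA_PATTERN` is a hypothetical check added here for illustration and is not part of the package.

```python
import platform
import re
import sys

__version__ = "1.3.0"  # value taken from the diff above


def user_agent() -> str:
    """Same construction as the function added in __init__.py."""
    py = f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}"
    os_info = f"{platform.system()} {platform.machine()}"
    return f"scrapingbee-cli/{__version__} Python/{py} ({os_info})"


# Hypothetical pattern describing the documented shape; not part of the package.
UA_PATTERN = re.compile(r"^scrapingbee-cli/\d+\.\d+\.\d+ Python/\d+\.\d+\.\d+ \(.*\)$")

ua = user_agent()
print(ua)  # the exact value depends on the interpreter and machine
```

The printed value varies by host, so only the overall shape is checked rather than a literal string.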

scrapingbee_cli-1.3.0/src/scrapingbee_cli/audit.py ADDED

@@ -0,0 +1,60 @@
+"""Audit logging for exec features (--post-process, --on-complete, schedule).
+Logs every shell command execution to a fixed location for forensics
+and guard skill monitoring.
+"""
+from __future__ import annotations
+from datetime import datetime, timezone
+from pathlib import Path
+AUDIT_LOG_PATH = Path.home() / ".config" / "scrapingbee-cli" / "audit.log"
+MAX_LINES = 10_000
+def log_exec(
+    feature: str,
+    command: str,
+    *,
+    input_source: str = "",
+    output_dir: str = "",
+) -> None:
+    """Append an entry to the audit log.
+    Format: ISO_TIMESTAMP | FEATURE | COMMAND | INPUT | OUTPUT_DIR
+    """
+    AUDIT_LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
+    timestamp = datetime.now(timezone.utc).isoformat()
+    entry = f"{timestamp} | {feature} | {command} | {input_source} | {output_dir}\n"
+    try:
+        with open(AUDIT_LOG_PATH, "a", encoding="utf-8") as f:
+            f.write(entry)
+        _rotate_if_needed()
+    except OSError:
+        pass
+def read_audit_log(n: int = 50) -> str:
+    """Read the last N lines of the audit log."""
+    if not AUDIT_LOG_PATH.is_file():
+        return "No audit log found."
+    try:
+        with open(AUDIT_LOG_PATH, encoding="utf-8") as f:
+            lines = f.readlines()
+        recent = lines[-n:] if len(lines) > n else lines
+        return "".join(recent)
+    except OSError:
+        return "Could not read audit log."
+def _rotate_if_needed() -> None:
+    """Keep only the last MAX_LINES entries."""
+    try:
+        with open(AUDIT_LOG_PATH, encoding="utf-8") as f:
+            lines = f.readlines()
+        if len(lines) > MAX_LINES:
+            with open(AUDIT_LOG_PATH, "w", encoding="utf-8") as f:
+                f.writelines(lines[-MAX_LINES:])
+    except OSError:
+        pass
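
Each audit entry is a single line of five fields joined by `" | "`. Because the logged command is a shell string and may itself contain a pipe, a plain `split(" | ")` can yield more than five parts. The sketch below shows one way a reader of the log might recover the fields: take the first two from the left and the last two from the right. `AuditEntry` and `parse_audit_line` are hypothetical names, not part of the package, and the approach assumes `input_source` and `output_dir` do not themselves contain the separator.

```python
from __future__ import annotations

from typing import NamedTuple

SEP = " | "  # separator used by log_exec in the diff above


class AuditEntry(NamedTuple):  # hypothetical container, not part of the package
    timestamp: str
    feature: str
    command: str
    input_source: str
    output_dir: str


def parse_audit_line(line: str) -> AuditEntry | None:
    """Split one audit line into its five fields, or return None if malformed."""
    text = line.rstrip("\n")  # keep trailing spaces: empty fields rely on them
    head = text.split(SEP, 2)
    if len(head) != 3:
        return None
    timestamp, feature, rest = head
    tail = rest.rsplit(SEP, 2)
    if len(tail) != 3:
        return None
    command, input_source, output_dir = tail
    return AuditEntry(timestamp, feature, command, input_source, output_dir)


# Sample line in the documented format, with empty INPUT and OUTPUT_DIR fields.
sample = "2025-01-01T00:00:00+00:00 | post-process | jq .title | sort |  | \n"
entry = parse_audit_line(sample)
print(entry.command)  # → jq .title | sort
```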

{scrapingbee_cli-1.2.2 → scrapingbee_cli-1.3.0}/src/scrapingbee_cli/batch.py RENAMED

@@ -297,10 +297,27 @@ async def _fetch_usage_async(api_key: str) -> dict:
         return parse_usage(body)
+# Cache usage API responses to avoid hitting the 6 calls/min rate limit.
+_usage_cache: dict | None = None
+_usage_cache_time: float = 0
+_USAGE_CACHE_TTL = 30  # seconds
 def get_batch_usage(api_key_flag: str | None) -> dict:
-    """Return usage info (max_concurrency, credits) from usage API."""
+    """Return usage info (max_concurrency, credits) from usage API.
+    Caches the result for 30 seconds to avoid hitting the usage API
+    rate limit (6 calls/min).
+    """
+    global _usage_cache, _usage_cache_time  # noqa: PLW0603
+    now = time.monotonic()
+    if _usage_cache is not None and (now - _usage_cache_time) < _USAGE_CACHE_TTL:
+        return _usage_cache
     key = get_api_key(api_key_flag)
-    return asyncio.run(_fetch_usage_async(key))
+    result = asyncio.run(_fetch_usage_async(key))
+    _usage_cache = result
+    _usage_cache_time = now
+    return result
 MIN_CREDITS_TO_RUN_BATCH = 100
@@ -688,6 +705,13 @@ def apply_post_process(body: bytes, cmd: str) -> bytes:
     """Run shell command with body as stdin, return stdout. On failure, return original body."""
     import subprocess
+    from .audit import log_exec
+    from .exec_gate import require_exec
+    require_exec("--post-process", cmd)
+    log_exec("post-process", cmd)
+    click.echo(f"⚠ Executing: {cmd.split()[0] if cmd.split() else cmd} (whitelisted)", err=True)
     try:
         result = subprocess.run(
             cmd,
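
The first hunk of `batch.py` caches the usage response in a single module-level variable for 30 seconds. As far as the hunk shows, the cached value is not keyed by API key, so a second key used within the same process and window would receive the first key's usage data. The sketch below illustrates the same time-to-live idea with a per-key cache and an injectable clock so the expiry can be exercised without sleeping. `TTLCache`, `fake_fetch`, and the sample usage values are hypothetical and are not part of scrapingbee-cli.

```python
from __future__ import annotations

import time
from typing import Callable


class TTLCache:
    """Tiny time-based cache; hypothetical helper, not part of scrapingbee-cli."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], dict]) -> dict:
        """Return the cached value for key, calling fetch() only when stale."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and (now - hit[0]) < self._ttl:
            return hit[1]
        value = fetch()
        self._entries[key] = (now, value)
        return value


# Demonstration with a fake clock instead of real time.
fake_now = [0.0]
calls: list[int] = []


def fake_fetch() -> dict:
    calls.append(1)
    return {"max_concurrency": 5, "credits": 1000}  # made-up values


cache = TTLCache(ttl=30, clock=lambda: fake_now[0])
cache.get_or_fetch("key-a", fake_fetch)  # miss: fetches
fake_now[0] = 10.0
cache.get_or_fetch("key-a", fake_fetch)  # hit: 10s old, still fresh
fake_now[0] = 31.0
cache.get_or_fetch("key-a", fake_fetch)  # miss: 31s old, expired
print(len(calls))  # → 2
```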

{scrapingbee_cli-1.2.2 → scrapingbee_cli-1.3.0}/src/scrapingbee_cli/cli_utils.py RENAMED

@@ -105,7 +105,7 @@ def _batch_options(f: Any) -> Any:
         "post_process",
         type=str,
         default=None,
-        help="Batch: pipe each result through a shell command (e.g. 'jq .title').",
+        help="[Advanced] Batch: pipe each result through a shell command (e.g. 'jq .title'). Requires unsafe mode.",
     )(f)
     f = click.option(
         "--update-csv",
@@ -132,7 +132,7 @@ def _batch_options(f: Any) -> Any:
         "on_complete",
         type=str,
         default=None,
-        help="Batch: shell command to run after completion.",
+        help="[Advanced] Batch: shell command to run after completion. Requires unsafe mode.",
     )(f)
     f = click.option("--retries", type=int, default=3, help="Retry on errors (default: 3).")(f)
     f = click.option(
@@ -195,6 +195,28 @@ def _resolve_dotpath(obj: Any, keys: list[str]) -> Any:
     return cur
+def _collect_dotpaths(obj: Any, prefix: str = "", max_depth: int = 4) -> list[str]:
+    """Recursively collect all valid dot-paths from a JSON object.
+    For arrays, peeks into the first element. Caps at *max_depth* to
+    avoid huge output on deeply nested structures.
+    """
+    if max_depth <= 0:
+        return []
+    paths: list[str] = []
+    if isinstance(obj, dict):
+        for key in obj.keys():
+            full = f"{prefix}.{key}" if prefix else key
+            paths.append(full)
+            paths.extend(_collect_dotpaths(obj[key], full, max_depth - 1))
+    elif isinstance(obj, list) and obj:
+        # Peek into first element to show available sub-paths
+        first = obj[0] if isinstance(obj[0], dict) else None
+        if first:
+            paths.extend(_collect_dotpaths(first, prefix, max_depth - 1))
+    return paths
 def _extract_field_values(data: bytes, path: str) -> bytes:
     """Extract values from JSON data using a dot-path expression.
@@ -215,6 +237,14 @@ def _extract_field_values(data: bytes, path: str) -> bytes:
     result = _resolve_dotpath(obj, keys)
     if result is None:
+        paths = _collect_dotpaths(obj)
+        hint = ""
+        if paths:
+            hint = "\n  Available paths:\n    " + "\n    ".join(paths)
+        click.echo(
+            f"Warning: --extract-field '{path}' did not match any data.{hint}",
+            err=True,
+        )
         return b""
     if isinstance(result, list):
         values = [str(v) for v in result if v is not None]
@@ -524,6 +554,13 @@ async def scrape_with_escalation(
     return data, headers, status_code
+def ensure_url_scheme(url: str) -> str:
+    """Prepend https:// if the URL has no scheme (like curl/httpie do)."""
+    if url and not url.startswith(("http://", "https://", "ftp://")):
+        return "https://" + url
+    return url
 def prepare_batch_inputs(inputs: list[str], obj: dict) -> list[str]:
     """Apply --deduplicate and --sample to batch inputs."""
     from .batch import deduplicate_inputs, sample_inputs
@@ -558,11 +595,17 @@ def run_on_complete(
     import os
     import subprocess
+    from .audit import log_exec
+    from .exec_gate import require_exec
+    require_exec("--on-complete", cmd)
+    log_exec("on-complete", cmd, output_dir=output_dir)
+    click.echo(f"⚠ Executing: {cmd.split()[0] if cmd.split() else cmd} (whitelisted)", err=True)
     env = os.environ.copy()
     env["SCRAPINGBEE_OUTPUT_DIR"] = output_dir
     env["SCRAPINGBEE_SUCCEEDED"] = str(succeeded)
     env["SCRAPINGBEE_FAILED"] = str(failed)
-    click.echo(f"[on-complete] Running: {cmd}", err=True)
     result = subprocess.run(cmd, shell=True, env=env)  # noqa: S602
     if result.returncode != 0:
         click.echo(f"[on-complete] Exit code: {result.returncode}", err=True)
@@ -578,6 +621,7 @@ def write_output(
     extract_field: str | None = None,
     fields: str | None = None,
     command: str | None = None,
+    credit_cost: int | None = None,
 ) -> None:
     """Write response data to file or stdout; optionally print verbose headers.
@@ -604,11 +648,14 @@ def write_output(
                     click.echo(f"{label}: {val}", err=True)
                     if key == "spb-cost":
                         spb_cost_present = True
-        if not spb_cost_present and command:
-            from scrapingbee_cli.credits import ESTIMATED_CREDITS
-            if command in ESTIMATED_CREDITS:
-                click.echo(f"Credit Cost (estimated): {ESTIMATED_CREDITS[command]}", err=True)
+        if not spb_cost_present:
+            if credit_cost is not None:
+                click.echo(f"Credit Cost: {credit_cost}", err=True)
+            elif command:
+                from scrapingbee_cli.credits import ESTIMATED_CREDITS
+                if command in ESTIMATED_CREDITS:
+                    click.echo(f"Credit Cost (estimated): {ESTIMATED_CREDITS[command]}", err=True)
         click.echo("---", err=True)
     if extract_field:
         data = _extract_field_values(data, extract_field)
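
Two of the helpers added to `cli_utils.py` are easy to try in isolation. The sketch below reproduces `_collect_dotpaths` and `ensure_url_scheme` from the hunks above (the first is renamed without the leading underscore) and runs them on inputs that are made up for illustration; the `payload` dictionary is hypothetical and does not represent a real API response.

```python
from __future__ import annotations

from typing import Any


def collect_dotpaths(obj: Any, prefix: str = "", max_depth: int = 4) -> list[str]:
    """Collect dot-paths from nested dicts; for lists, inspect only the first item."""
    if max_depth <= 0:
        return []
    paths: list[str] = []
    if isinstance(obj, dict):
        for key in obj.keys():
            full = f"{prefix}.{key}" if prefix else key
            paths.append(full)
            paths.extend(collect_dotpaths(obj[key], full, max_depth - 1))
    elif isinstance(obj, list) and obj:
        first = obj[0] if isinstance(obj[0], dict) else None
        if first:
            paths.extend(collect_dotpaths(first, prefix, max_depth - 1))
    return paths


def ensure_url_scheme(url: str) -> str:
    """Prepend https:// when the URL does not start with a known scheme."""
    if url and not url.startswith(("http://", "https://", "ftp://")):
        return "https://" + url
    return url


# Hypothetical payload, only to show which paths the hint would list.
payload = {"title": "x", "results": [{"url": "u", "meta": {"rank": 1}}]}
paths = collect_dotpaths(payload)
print(paths)
# → ['title', 'results', 'results.url', 'results.meta', 'results.meta.rank']

print(ensure_url_scheme("example.com/a"))       # → https://example.com/a
print(ensure_url_scheme("http://example.com"))  # → http://example.com
```

Note that the prefix check is case-sensitive as written, so an input such as `HTTP://example.com` would have `https://` prepended.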

{scrapingbee_cli-1.2.2 → scrapingbee_cli-1.3.0}/src/scrapingbee_cli/client.py RENAMED

@@ -10,6 +10,7 @@ from typing import Any
 import aiohttp
 import certifi
+from . import user_agent
 from .config import BASE_URL
@@ -45,7 +46,7 @@ class Client:
         self._session = aiohttp.ClientSession(
             connector=connector,
             timeout=timeout,
-            headers={"User-Agent": "ScrapingBee/CLI"},
+            headers={"User-Agent": user_agent()},
         )
         return self
@@ -539,12 +540,22 @@ class Client:
     async def chatgpt(
         self,
         prompt: str,
+        search: bool | None = None,
+        add_html: bool | None = None,
+        country_code: str | None = None,
         retries: int = 3,
         backoff: float = 2.0,
     ) -> tuple[bytes, dict, int]:
+        params: dict[str, object] = {"prompt": prompt}
+        if search:
+            params["search"] = "true"
+        if add_html is not None:
+            params["add_html"] = str(add_html).lower()
+        if country_code is not None:
+            params["country_code"] = country_code
         return await self._get_with_retry(
             "/chatgpt",
-            {"prompt": prompt},
+            params,
             retries=retries,
             backoff=backoff,
         )

{scrapingbee_cli-1.2.2 → scrapingbee_cli-1.3.0}/src/scrapingbee_cli/commands/__init__.py RENAMED

@@ -35,3 +35,6 @@ def register_commands(cli: click.Group) -> None:
     chatgpt.register(cli)
     export.register(cli)
     schedule.register(cli)
+    from . import unsafe
+    unsafe.register(cli)

{scrapingbee_cli-1.2.2 → scrapingbee_cli-1.3.0}/src/scrapingbee_cli/commands/amazon.py RENAMED

@@ -167,6 +167,8 @@ def amazon_product_cmd(
                 backoff=obj.get("backoff", 2.0) or 2.0,
             )
         check_api_response(data, status_code)
+        from ..credits import amazon_credits
         write_output(
             data,
             headers,
@@ -176,6 +178,7 @@ def amazon_product_cmd(
             extract_field=obj.get("extract_field"),
             fields=obj.get("fields"),
             command="amazon-product",
+            credit_cost=amazon_credits(parse_bool(light_request)),
         )
     asyncio.run(_single())
@@ -338,6 +341,8 @@ def amazon_search_cmd(
                 backoff=obj.get("backoff", 2.0) or 2.0,
             )
         check_api_response(data, status_code)
+        from ..credits import amazon_credits
         write_output(
             data,
             headers,
@@ -347,6 +352,7 @@ def amazon_search_cmd(
             extract_field=obj.get("extract_field"),
             fields=obj.get("fields"),
             command="amazon-search",
+            credit_cost=amazon_credits(parse_bool(light_request)),
         )
     asyncio.run(_single())
