scrape-cli 1.2.0.tar.gz → 1.2.2.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/PKG-INFO +67 -9
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/README.md +65 -2
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/pyproject.toml +2 -2
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/scrape_cli/__init__.py +1 -1
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/scrape_cli/scrape.py +10 -6
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/scrape_cli.egg-info/PKG-INFO +67 -9
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/scrape_cli.egg-info/SOURCES.txt +2 -2
- scrape_cli-1.2.2/tests/test_scrape.py +205 -0
- scrape_cli-1.2.0/setup.py +0 -37
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/scrape_cli.egg-info/dependency_links.txt +0 -0
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/scrape_cli.egg-info/entry_points.txt +0 -0
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/scrape_cli.egg-info/requires.txt +0 -0
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/scrape_cli.egg-info/top_level.txt +0 -0
- {scrape_cli-1.2.0 → scrape_cli-1.2.2}/setup.cfg +0 -0
{scrape_cli-1.2.0 → scrape_cli-1.2.2}/PKG-INFO

@@ -1,24 +1,20 @@
 Metadata-Version: 2.4
 Name: scrape_cli
-Version: 1.2.0
+Version: 1.2.2
 Summary: It's a command-line tool to extract HTML elements using an XPath query or CSS3 selector.
-Home-page: https://github.com/aborruso/scrape-cli
-Author: Andrea Borruso
 Author-email: Andrea Borruso <aborruso@gmail.com>
 Project-URL: Homepage, https://github.com/aborruso/scrape-cli
 Classifier: Programming Language :: Python :: 3
 Classifier: Operating System :: OS Independent
-Requires-Python: >=3.6
+Requires-Python: >=3.8
 Description-Content-Type: text/markdown
 Requires-Dist: cssselect
 Requires-Dist: lxml
 Requires-Dist: requests
-Dynamic: author
-Dynamic: home-page
-Dynamic: requires-python
 
 [](https://pypi.org/project/scrape-cli/)
 [](https://pypi.org/project/scrape-cli/)
+[](https://deepwiki.com/aborruso/scrape-cli)
 
 # scrape cli
 

The remaining hunks (@@ -52,7 +48,7 @@, @@ -80,7 +76,15 @@ and @@ -226,6 +230,60 @@) are identical to the README.md diff below, since PKG-INFO embeds the README as the long description.
{scrape_cli-1.2.0 → scrape_cli-1.2.2}/README.md

@@ -1,5 +1,6 @@
 [](https://pypi.org/project/scrape-cli/)
 [](https://pypi.org/project/scrape-cli/)
+[](https://deepwiki.com/aborruso/scrape-cli)
 
 # scrape cli
 
@@ -33,7 +34,7 @@ uv tool install scrape-cli
 uv pip install scrape-cli
 
 # Or run temporarily without installing
-uvx scrape-cli --help
+uvx --from scrape-cli scrape --help
 ```
 
 ### Using pip
@@ -61,7 +62,15 @@ pip install -e .
 
 ### Using the Test HTML File
 
-In the `resources` directory you'll find a `test.html` file that you can use to test various scraping scenarios.
+In the `resources` directory you'll find a `test.html` file that you can use to test various scraping scenarios.
+
+**Note**: You can also test directly from the URL without cloning the repository:
+
+```bash
+scrape -e "h1" https://raw.githubusercontent.com/aborruso/scrape-cli/refs/heads/master/resources/test.html
+```
+
+Here are some examples:
 
 1. Extract all table data:
 
@@ -207,6 +216,60 @@ scrape -te 'h1, h2, h3' resources/test.html
 
 The `-t` option automatically excludes text from `<script>` and `<style>` tags and cleans up whitespace for better readability.
 
+### JSON Output Integration
+
+You can integrate scrape-cli with [xq](https://github.com/kislyuk/yq) (part of yq) to convert HTML output to structured JSON:
+
+```bash
+# Extract and convert to JSON (requires -b for complete HTML)
+scrape -be "a.external-link" resources/test.html | xq .
+```
+
+Output:
+
+```json
+{
+  "html": {
+    "body": {
+      "a": {
+        "@href": "https://example.com",
+        "@class": "external-link",
+        "#text": "Example Link"
+      }
+    }
+  }
+}
+```
+
+Table extraction example:
+
+```bash
+scrape -be "table.data-table td" resources/test.html | xq .
+```
+
+Output:
+
+```json
+{
+  "html": {
+    "body": {
+      "td": [
+        "1",
+        "John Doe",
+        "john@example.com",
+        "2",
+        "Jane Smith",
+        "jane@example.com"
+      ]
+    }
+  }
+}
+```
+
+**Note**: The `-b` flag is mandatory to produce valid HTML with `<html>`, `<head>` and `<body>` tags.
+
+Useful for JSON-based pipelines, APIs, databases, and processing with jq/DuckDB.
+
 Some notes on the commands:
 
 - `-e` to set the query
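A note on the `uvx` change above: the package is published as `scrape-cli`, but the command it installs is named `scrape` (see the `console_scripts` entry point in the deleted `setup.py` further down), so the old `uvx scrape-cli` invocation looks for an executable that does not exist. A minimal check of the two forms, assuming uv is installed:

```bash
# Old form: uvx infers the command name from the package name
# ("scrape-cli"), which does not match the installed executable.
uvx scrape-cli --help

# New form: --from names the package, then the real command follows.
uvx --from scrape-cli scrape --help
```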
{scrape_cli-1.2.0 → scrape_cli-1.2.2}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "scrape_cli"
-version = "1.2.0"
+version = "1.2.2"
 description = "It's a command-line tool to extract HTML elements using an XPath query or CSS3 selector."
 readme = "README.md"
 authors = [
@@ -14,7 +14,7 @@ classifiers = [
     "Programming Language :: Python :: 3",
     "Operating System :: OS Independent",
 ]
-requires-python = ">=3.6"
+requires-python = ">=3.8"
 dependencies = [
     "cssselect",
     "lxml",
{scrape_cli-1.2.0 → scrape_cli-1.2.2}/scrape_cli/scrape.py

@@ -63,13 +63,13 @@ def is_xpath(expression):
     - Expressions wrapped in parentheses that contain XPath syntax
     """
     expr = expression.strip()
-
+
     # Direct XPath patterns
     if expr.startswith('/') or expr.startswith('//'):
         return True
     if '::' in expr:
         return True
-
+
     # Handle expressions wrapped in parentheses
     if expr.startswith('(') and expr.endswith(')'):
         # Remove outer parentheses and check inner content
@@ -78,7 +78,7 @@ def is_xpath(expression):
             return True
         if '::' in inner_expr:
             return True
-
+
     # Additional XPath indicators
     # Check for XPath-specific patterns that CSS doesn't have
     if '//' in expr or expr.startswith('/'):
@@ -91,7 +91,7 @@ def is_xpath(expression):
         return True
     if re.search(r'\b(ancestor|descendant|following|preceding|parent|child)::', expr):  # XPath axes
         return True
-
+
     return False
 
 def main():
@@ -128,6 +128,8 @@ def main():
     parser.add_argument('-r', '--rawinput', action='store_true', default=False,
                         help="Do not parse HTML before passing to etree (useful for CData)")
     parser.add_argument('--check-existence', dest='check_existence', action='store_true')
+    parser.add_argument('-u', '--user-agent', default=None,
+                        help="Custom User-Agent string for HTTP requests")
     args = parser.parse_args()
 
     # Check that at least one expression is provided by the user (unless using -t option)
@@ -142,7 +144,9 @@ def main():
     if args.html.startswith('http://') or args.html.startswith('https://'):
         # If the input is a URL, download the HTML content
         try:
-            response = requests.get(args.html, timeout=30)
+            ua = args.user_agent or "Mozilla/5.0 (compatible; scrape-cli/1.0)"
+            headers = {"User-Agent": ua}
+            response = requests.get(args.html, headers=headers, timeout=30)
             response.raise_for_status()
             inp = response.content
         except requests.RequestException as e:
@@ -189,7 +193,7 @@ def main():
             meta = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head)
             if meta:
                 return meta.group(1)
-        except:
+        except Exception:
             pass
         return None
 
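The new `-u/--user-agent` option applies only to URL inputs; as the hunk above shows, omitting it falls back to `Mozilla/5.0 (compatible; scrape-cli/1.0)`. A quick sketch of the flag in use (the target URL and UA string are just placeholders):

```bash
# Fetch over HTTP with a custom User-Agent header (new in 1.2.x);
# without -u, the default "Mozilla/5.0 (compatible; scrape-cli/1.0)" is sent.
scrape -e "//h1" -u "MyProject/1.0 (+https://example.org/bot)" https://example.com
```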
{scrape_cli-1.2.0 → scrape_cli-1.2.2}/scrape_cli.egg-info/PKG-INFO

Identical to the PKG-INFO diff at the top of this page: the egg-info copy is regenerated from the same metadata at build time.
{scrape_cli-1.2.0 → scrape_cli-1.2.2}/scrape_cli.egg-info/SOURCES.txt

@@ -1,6 +1,5 @@
 README.md
 pyproject.toml
-setup.py
 scrape_cli/__init__.py
 scrape_cli/scrape.py
 scrape_cli.egg-info/PKG-INFO
@@ -8,4 +7,5 @@ scrape_cli.egg-info/SOURCES.txt
 scrape_cli.egg-info/dependency_links.txt
 scrape_cli.egg-info/entry_points.txt
 scrape_cli.egg-info/requires.txt
-scrape_cli.egg-info/top_level.txt
+scrape_cli.egg-info/top_level.txt
+tests/test_scrape.py
scrape_cli-1.2.2/tests/test_scrape.py ADDED

@@ -0,0 +1,205 @@
+import subprocess
+import sys
+import threading
+from pathlib import Path
+from http.server import BaseHTTPRequestHandler, HTTPServer
+
+ROOT = Path(__file__).resolve().parents[1]
+TEST_HTML = ROOT / "resources" / "test.html"
+sys.path.insert(0, str(ROOT))
+
+from scrape_cli.scrape import is_xpath
+
+
+def run_scrape(*args, input_data=None):
+    cmd = [sys.executable, "-m", "scrape_cli.scrape", *args]
+    return subprocess.run(
+        cmd,
+        capture_output=True,
+        text=True,
+        cwd=ROOT,
+        input=input_data,
+    )
+
+
+def run_test_server(html_bytes):
+    class Handler(BaseHTTPRequestHandler):
+        def do_GET(self):
+            self.send_response(200)
+            self.send_header("Content-Type", "text/html; charset=utf-8")
+            self.end_headers()
+            self.wfile.write(html_bytes)
+
+        def log_message(self, format, *args):
+            return
+
+    server = HTTPServer(("127.0.0.1", 0), Handler)
+    thread = threading.Thread(target=server.serve_forever, daemon=True)
+    thread.start()
+    return server, thread
+
+
+def test_is_xpath_true_patterns():
+    candidates = [
+        "//div",
+        "/html/body/div",
+        "(//div)[1]",
+        "//a/@href",
+        "//li[2]",
+        "ancestor::div",
+        "descendant::span",
+        "//p/text()",
+    ]
+
+    for expression in candidates:
+        assert is_xpath(expression) is True
+
+
+def test_is_xpath_false_css_patterns():
+    candidates = [
+        "div.content > a.link",
+        "a[href*='/about']",
+        "input[type='email']",
+        "ul.items-list li:first-child",
+        "div.class1.class2",
+    ]
+
+    for expression in candidates:
+        assert is_xpath(expression) is False
+
+
+def test_xpath_parentheses_extracts_first_match():
+    result = run_scrape(str(TEST_HTML), "-e", "(//ul[@class='items-list']/li)[1]", "-t")
+
+    assert result.returncode == 0
+    assert result.stdout.strip() == "First item"
+
+
+def test_css_attribute_selector_is_not_misclassified_as_xpath():
+    result = run_scrape(str(TEST_HTML), "-e", ".resource-links a[href*='github.com']", "-t")
+
+    assert result.returncode == 0
+    assert result.stdout.strip() == "GitHub Repository"
+
+
+def test_check_existence_true_and_false():
+    found = run_scrape(str(TEST_HTML), "-e", "//h1", "--check-existence")
+    missing = run_scrape(str(TEST_HTML), "-e", "//this-node-does-not-exist", "--check-existence")
+
+    assert found.returncode == 0
+    assert missing.returncode == 1
+
+
+def test_encoding_meta_charset_iso_8859_1(tmp_path):
+    html = """<!doctype html>
+<html>
+<head><meta charset=\"iso-8859-1\"></head>
+<body><p>Perch\xe9</p></body>
+</html>
+""".encode("iso-8859-1")
+    sample = tmp_path / "latin1.html"
+    sample.write_bytes(html)
+
+    result = run_scrape(str(sample), "-e", "//p/text()", "-t")
+
+    assert result.returncode == 0
+    assert result.stdout.strip() == "Perché"
+
+
+def test_argument_extracts_attribute_value():
+    result = run_scrape(str(TEST_HTML), "-e", "//a[@class='external-link']", "-a", "href")
+
+    assert result.returncode == 0
+    assert result.stdout.strip() == "https://example.com"
+
+
+def test_body_flag_wraps_output_in_html_body():
+    result = run_scrape(str(TEST_HTML), "-e", "//h1", "-b")
+
+    assert result.returncode == 0
+    assert result.stdout.startswith("<!DOCTYPE html>\n<html>\n<body>\n")
+    assert result.stdout.strip().endswith("</body>\n</html>")
+    assert "<h1 id=\"main-title\">Welcome to the Test Page</h1>" in result.stdout
+
+
+def test_text_flag_without_expression_extracts_body_and_skips_script():
+    result = run_scrape(str(TEST_HTML), "-t")
+
+    assert result.returncode == 0
+    assert "Welcome to the Test Page" in result.stdout
+    assert "document.getElementById('dynamic-content')" not in result.stdout
+
+
+def test_short_check_existence_flag_x():
+    found = run_scrape(str(TEST_HTML), "-e", "//table", "-x")
+    missing = run_scrape(str(TEST_HTML), "-e", "//definitely-not-here", "-x")
+
+    assert found.returncode == 0
+    assert missing.returncode == 1
+
+
+def test_rawinput_parses_xml_without_html_parser():
+    xml_data = "<root><item>one</item><item>two</item></root>"
+    result = run_scrape("-e", "//item[2]/text()", "-r", input_data=xml_data)
+
+    assert result.returncode == 0
+    assert result.stdout.strip() == "two"
+
+
+def test_stdin_input_works_when_no_html_argument():
+    html_data = "<html><body><p>stdin-ok</p></body></html>"
+    result = run_scrape("-e", "//p/text()", "-t", input_data=html_data)
+
+    assert result.returncode == 0
+    assert result.stdout.strip() == "stdin-ok"
+
+
+def test_empty_stdin_returns_error():
+    result = run_scrape("-e", "//p", input_data="")
+
+    assert result.returncode == 1
+    assert "Error: No input received from stdin" in result.stdout
+
+
+def test_missing_file_returns_error():
+    result = run_scrape("resources/this-file-does-not-exist.html", "-e", "//p")
+
+    assert result.returncode == 1
+    assert "was not found" in result.stdout
+
+
+def test_missing_expression_without_text_returns_error():
+    result = run_scrape(str(TEST_HTML))
+
+    assert result.returncode == 1
+    assert "you must provide at least one XPath query or CSS3 selector" in result.stderr
+
+
+def test_incorrect_eb_order_exits_with_specific_message():
+    result = run_scrape("-eb")
+
+    assert result.returncode == 1
+    assert "Please use -be instead of -eb." in result.stderr
+
+
+def test_invalid_css_selector_fails_conversion():
+    result = run_scrape(str(TEST_HTML), "-e", "div[")
+
+    assert result.returncode == 1
+    assert "Error converting CSS selector to XPath" in result.stdout
+
+
+def test_url_input_downloads_and_extracts_text():
+    html_bytes = TEST_HTML.read_bytes()
+    server, thread = run_test_server(html_bytes)
+
+    try:
+        url = f"http://127.0.0.1:{server.server_address[1]}"
+        result = run_scrape(url, "-e", "//h1/text()", "-t")
+    finally:
+        server.shutdown()
+        server.server_close()
+        thread.join(timeout=2)
+
+    assert result.returncode == 0
+    assert result.stdout.strip() == "Welcome to the Test Page"
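The suite relies on the pytest `tmp_path` fixture and bare `assert` tests, so pytest is the expected runner. From a repository checkout, something like:

```bash
# Run the new end-to-end suite against the in-repo resources/test.html
# (assumes pytest is installed in the environment).
python -m pytest tests/test_scrape.py -v
```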
scrape_cli-1.2.0/setup.py DELETED

@@ -1,37 +0,0 @@
-# setup.py
-from setuptools import setup
-from pathlib import Path
-
-# Read the README
-this_directory = Path(__file__).parent
-long_description = (this_directory / "README.md").read_text(encoding="utf-8")
-
-setup(
-    name="scrape_cli",
-    version="1.1.9",
-    description="It's a command-line tool to extract HTML elements using an XPath query or CSS3 selector.",
-    long_description=long_description,
-    long_description_content_type="text/markdown",
-    author="Andrea Borruso",
-    author_email="aborruso@gmail.com",
-    url="https://github.com/aborruso/scrape-cli",
-    license="MIT",
-    packages=["scrape_cli"],
-    package_dir={"scrape_cli": "scrape_cli"},
-    entry_points={
-        'console_scripts': [
-            'scrape=scrape_cli.scrape:main',
-        ],
-    },
-    install_requires=[
-        "cssselect",
-        "lxml",
-        "requests"
-    ],
-    classifiers=[
-        "Programming Language :: Python :: 3",
-        "License :: OSI Approved :: MIT License",
-        "Operating System :: OS Independent",
-    ],
-    python_requires='>=3.6',
-)
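With `setup.py` removed, packaging is driven entirely by `pyproject.toml` and the `setuptools.build_meta` backend it declares. A sketch of producing the sdist and wheel under that setup (assumes the `build` package is available):

```bash
# PEP 517 build via the backend declared in pyproject.toml.
python -m pip install build
python -m build
```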
The remaining files (scrape_cli.egg-info/dependency_links.txt, entry_points.txt, requires.txt, top_level.txt and setup.cfg) are unchanged.