scrapling-0.1.tar.gz → scrapling-0.1.2.tar.gz
- {scrapling-0.1 → scrapling-0.1.2}/MANIFEST.in +2 -0
- {scrapling-0.1/scrapling.egg-info → scrapling-0.1.2}/PKG-INFO +7 -5
- {scrapling-0.1 → scrapling-0.1.2}/README.md +6 -3
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/__init__.py +1 -1
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/parser.py +20 -3
- {scrapling-0.1 → scrapling-0.1.2/scrapling.egg-info}/PKG-INFO +7 -5
- {scrapling-0.1 → scrapling-0.1.2}/setup.cfg +1 -1
- {scrapling-0.1 → scrapling-0.1.2}/setup.py +2 -3
- {scrapling-0.1 → scrapling-0.1.2}/LICENSE +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/custom_types.py +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/mixins.py +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/storage_adaptors.py +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/translator.py +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/utils.py +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling.egg-info/SOURCES.txt +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling.egg-info/dependency_links.txt +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling.egg-info/not-zip-safe +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling.egg-info/requires.txt +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling.egg-info/top_level.txt +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/tests/test_all_functions.py +0 -0
{scrapling-0.1/scrapling.egg-info → scrapling-0.1.2}/PKG-INFO

@@ -1,12 +1,12 @@
 Metadata-Version: 2.1
 Name: scrapling
-Version: 0.1
+Version: 0.1.2
 Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
 Author-email: karim.shoair@pm.me
 License: BSD
-Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/
+Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/tree/main/docs
 Project-URL: Source, https://github.com/D4Vinci/Scrapling
 Project-URL: Tracker, https://github.com/D4Vinci/Scrapling/issues
 Classifier: Operating System :: OS Independent
@@ -20,7 +20,6 @@ Classifier: Topic :: Text Processing :: Markup :: HTML
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Programming Language :: Python :: 3.6
 Classifier: Programming Language :: Python :: 3.7
 Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
@@ -40,7 +39,7 @@ Requires-Dist: orjson>=3
 Requires-Dist: tldextract

 # 🕷️ Scrapling: Lightning-Fast, Adaptive Web Scraping for Python
-[](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [](https://badge.fury.io/py/Scrapling) [](https://pypi.org/project/scrapling/) [](https://pepy.tech/project/scrapling)

 Dealing with failing web scrapers due to website changes? Meet Scrapling.

@@ -78,7 +77,7 @@ products = page.css('.product', auto_match=True) # Still finds them!

 ## Getting Started

-Let's walk through a basic example that demonstrates small group of Scrapling's core features:
+Let's walk through a basic example that demonstrates a small group of Scrapling's core features:

 ```python
 import requests
@@ -456,6 +455,9 @@ Of course, you can find elements by text/regex, find similar elements in a more
 ### Is Scrapling thread-safe?
 Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.

+## Sponsors
+[](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
+
 ## Contributing
 Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!
{scrapling-0.1 → scrapling-0.1.2}/README.md

@@ -1,5 +1,5 @@
 # 🕷️ Scrapling: Lightning-Fast, Adaptive Web Scraping for Python
-[](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [](https://badge.fury.io/py/Scrapling) [](https://pypi.org/project/scrapling/) [](https://pepy.tech/project/scrapling)

 Dealing with failing web scrapers due to website changes? Meet Scrapling.

@@ -37,7 +37,7 @@ products = page.css('.product', auto_match=True) # Still finds them!

 ## Getting Started

-Let's walk through a basic example that demonstrates small group of Scrapling's core features:
+Let's walk through a basic example that demonstrates a small group of Scrapling's core features:

 ```python
 import requests
@@ -415,6 +415,9 @@ Of course, you can find elements by text/regex, find similar elements in a more
 ### Is Scrapling thread-safe?
 Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.

+## Sponsors
+[](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
+
 ## Contributing
 Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!

@@ -431,4 +434,4 @@ This project includes code adapted from:
 - In the auto-matching save process, the unique properties of the first element from the selection results are the only ones that get saved. So if the selector you are using selects different elements on the page that are in different locations, auto-matching will probably return to you the first element only when you relocate it later. This doesn't include combined CSS selectors (Using commas to combine more than one selector for example) as these selectors get separated and each selector gets executed alone.
 - Currently, Scrapling is not compatible with async/await.

-<div align="center"><small>Made with ❤️ by Karim Shoair</small></div><br>
+<div align="center"><small>Made with ❤️ by Karim Shoair</small></div><br>
{scrapling-0.1 → scrapling-0.1.2}/scrapling/__init__.py

@@ -3,7 +3,7 @@ from scrapling.parser import Adaptor, Adaptors
 from scrapling.custom_types import TextHandler, AttributesHandler

 __author__ = "Karim Shoair (karim.shoair@pm.me)"
-__version__ = "0.1"
+__version__ = "0.1.2"
 __copyright__ = "Copyright (c) 2024 Karim Shoair"

{scrapling-0.1 → scrapling-0.1.2}/scrapling/parser.py

@@ -78,7 +78,7 @@ class Adaptor(SelectorsGeneration):

         parser = html.HTMLParser(
             # https://lxml.de/api/lxml.etree.HTMLParser-class.html
-            recover=True, remove_blank_text=True, remove_comments=(keep_comments is
+            recover=True, remove_blank_text=True, remove_comments=(keep_comments is False), encoding=encoding,
             compact=True, huge_tree=huge_tree, default_doctype=True
         )
         self._root = etree.fromstring(body, parser=parser, base_url=url)
@@ -142,7 +142,8 @@ class Adaptor(SelectorsGeneration):
         if issubclass(type(element), html.HtmlMixin):
             return self.__class__(
                 root=element, url=self.url, encoding=self.encoding, auto_match=self.__auto_match_enabled,
-                keep_comments=
+                keep_comments=True,  # if the comments are already removed in initialization, no need to try to delete them in sub-elements
+                huge_tree=self.__huge_tree_enabled, debug=self.__debug
             )
         return element

@@ -186,7 +187,23 @@ class Adaptor(SelectorsGeneration):
     def text(self) -> TextHandler:
         """Get text content of the element"""
         if not self.__text:
-            self.__text = TextHandler(self._root.text)
+            if self.__keep_comments:
+                if not self.children:
+                    # If the user chose to keep comments, remove comments from text
+                    # Escape lxml default behaviour and remove comments like this `<span>CONDITION: <!-- -->Excellent</span>`
+                    # This issue is present in parsel/scrapy as well so no need to repeat it here so the user can run regex on the full text.
+                    code = self.html_content
+                    parser = html.HTMLParser(
+                        recover=True, remove_blank_text=True, remove_comments=True, encoding=self.encoding,
+                        compact=True, huge_tree=self.__huge_tree_enabled, default_doctype=True
+                    )
+                    fragment_root = html.fragment_fromstring(code, parser=parser)
+                    self.__text = TextHandler(fragment_root.text)
+                else:
+                    self.__text = TextHandler(self._root.text)
+            else:
+                # If the user already chose to not keep comments then all is good
+                self.__text = TextHandler(self._root.text)
         return self.__text

     def get_all_text(self, separator: str = "\n", strip: bool = False, ignore_tags: Tuple = ('script', 'style',), valid_values: bool = True) -> TextHandler:
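The `text` change above works around an lxml parsing quirk: when comments are kept in the tree, an element's `.text` attribute stops at the first comment node, so text split by an inline comment comes back incomplete. A minimal standalone sketch of that behaviour (not Scrapling's code; the sample markup is made up, borrowing the example from the diff's own comment):

```python
# Demonstrates the lxml behaviour the parser.py change works around:
# with comments kept, an element's .text ends at the first comment node,
# while re-parsing the same markup with remove_comments=True merges the
# surrounding text back into one piece.
from lxml import html

fragment = '<span>CONDITION: <!-- hidden -->Excellent</span>'

# Comments kept: the comment node splits the text, so .text holds only
# the part before the comment (the rest lives in the comment's .tail).
keeping = html.fragment_fromstring(
    fragment, parser=html.HTMLParser(remove_comments=False)
)
print(repr(keeping.text))  # 'CONDITION: '

# Comments stripped at parse time: adjacent text merges back together,
# which is what the new .text property does for childless elements.
stripped = html.fragment_fromstring(
    fragment, parser=html.HTMLParser(remove_comments=True)
)
print(repr(stripped.text))  # 'CONDITION: Excellent'
```

This is why the diff re-parses `self.html_content` with `remove_comments=True` only when `keep_comments` was requested at initialization: the original tree must retain its comments, so the merge has to happen on a throwaway copy.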
{scrapling-0.1 → scrapling-0.1.2/scrapling.egg-info}/PKG-INFO

@@ -1,12 +1,12 @@
 Metadata-Version: 2.1
 Name: scrapling
-Version: 0.1
+Version: 0.1.2
 Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
 Author-email: karim.shoair@pm.me
 License: BSD
-Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/
+Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/tree/main/docs
 Project-URL: Source, https://github.com/D4Vinci/Scrapling
 Project-URL: Tracker, https://github.com/D4Vinci/Scrapling/issues
 Classifier: Operating System :: OS Independent
@@ -20,7 +20,6 @@ Classifier: Topic :: Text Processing :: Markup :: HTML
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Programming Language :: Python :: 3.6
 Classifier: Programming Language :: Python :: 3.7
 Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
@@ -40,7 +39,7 @@ Requires-Dist: orjson>=3
 Requires-Dist: tldextract

 # 🕷️ Scrapling: Lightning-Fast, Adaptive Web Scraping for Python
-[](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [](https://badge.fury.io/py/Scrapling) [](https://pypi.org/project/scrapling/) [](https://pepy.tech/project/scrapling)

 Dealing with failing web scrapers due to website changes? Meet Scrapling.

@@ -78,7 +77,7 @@ products = page.css('.product', auto_match=True) # Still finds them!

 ## Getting Started

-Let's walk through a basic example that demonstrates small group of Scrapling's core features:
+Let's walk through a basic example that demonstrates a small group of Scrapling's core features:

 ```python
 import requests
@@ -456,6 +455,9 @@ Of course, you can find elements by text/regex, find similar elements in a more
 ### Is Scrapling thread-safe?
 Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.

+## Sponsors
+[](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
+
 ## Contributing
 Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!
{scrapling-0.1 → scrapling-0.1.2}/setup.py

@@ -6,7 +6,7 @@ with open("README.md", "r", encoding="utf-8") as fh:

 setup(
     name="scrapling",
-    version="0.1",
+    version="0.1.2",
     description="""Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 simplifies the process of extracting data from websites, even when they undergo structural changes, and offers
 impressive speed improvements over many popular scraping tools.""",
@@ -36,7 +36,6 @@ setup(
         "Topic :: Software Development :: Libraries :: Python Modules",
         "Programming Language :: Python :: 3",
         "Programming Language :: Python :: 3 :: Only",
-        "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
         "Programming Language :: Python :: 3.8",
         "Programming Language :: Python :: 3.9",
@@ -58,7 +57,7 @@ setup(
     python_requires=">=3.7",
     url="https://github.com/D4Vinci/Scrapling",
     project_urls={
-        "Documentation": "https://github.com/D4Vinci/Scrapling/
+        "Documentation": "https://github.com/D4Vinci/Scrapling/tree/main/docs",  # For now
         "Source": "https://github.com/D4Vinci/Scrapling",
         "Tracker": "https://github.com/D4Vinci/Scrapling/issues",
     }