scrapling-0.1.tar.gz → scrapling-0.1.2.tar.gz

@@ -1,4 +1,6 @@
 include LICENSE
+include *.db
+include scrapling/*.db
 include scrapling/py.typed
 
 recursive-exclude * __pycache__
@@ -1,12 +1,12 @@
 Metadata-Version: 2.1
 Name: scrapling
-Version: 0.1
+Version: 0.1.2
 Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
 Author-email: karim.shoair@pm.me
 License: BSD
-Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/Docs
+Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/tree/main/docs
 Project-URL: Source, https://github.com/D4Vinci/Scrapling
 Project-URL: Tracker, https://github.com/D4Vinci/Scrapling/issues
 Classifier: Operating System :: OS Independent
@@ -20,7 +20,6 @@ Classifier: Topic :: Text Processing :: Markup :: HTML
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Programming Language :: Python :: 3.6
 Classifier: Programming Language :: Python :: 3.7
 Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
@@ -40,7 +39,7 @@ Requires-Dist: orjson>=3
 Requires-Dist: tldextract
 
 # 🕷️ Scrapling: Lightning-Fast, Adaptive Web Scraping for Python
-[![PyPI version](https://badge.fury.io/py/scrapling.svg)](https://badge.fury.io/py/scrapling) [![Supported Python versions](https://img.shields.io/pypi/pyversions/scrapling.svg)](https://pypi.org/project/scrapling/) [![License](https://img.shields.io/badge/License-BSD--3-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
+[![Tests](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg)](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [![PyPI version](https://badge.fury.io/py/Scrapling.svg)](https://badge.fury.io/py/Scrapling) [![Supported Python versions](https://img.shields.io/pypi/pyversions/scrapling.svg)](https://pypi.org/project/scrapling/) [![PyPI Downloads](https://static.pepy.tech/badge/scrapling)](https://pepy.tech/project/scrapling)
 
 Dealing with failing web scrapers due to website changes? Meet Scrapling.
 
@@ -78,7 +77,7 @@ products = page.css('.product', auto_match=True) # Still finds them!
 
 ## Getting Started
 
-Let's walk through a basic example that demonstrates small group of Scrapling's core features:
+Let's walk through a basic example that demonstrates a small group of Scrapling's core features:
 
 ```python
 import requests
@@ -456,6 +455,9 @@ Of course, you can find elements by text/regex, find similar elements in a more
 ### Is Scrapling thread-safe?
 Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.
 
+## Sponsors
+[![Capsolver Banner](https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/CapSolver.png)](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
+
 ## Contributing
 Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!
 
@@ -1,5 +1,5 @@
 # 🕷️ Scrapling: Lightning-Fast, Adaptive Web Scraping for Python
-[![PyPI version](https://badge.fury.io/py/scrapling.svg)](https://badge.fury.io/py/scrapling) [![Supported Python versions](https://img.shields.io/pypi/pyversions/scrapling.svg)](https://pypi.org/project/scrapling/) [![License](https://img.shields.io/badge/License-BSD--3-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
+[![Tests](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg)](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [![PyPI version](https://badge.fury.io/py/Scrapling.svg)](https://badge.fury.io/py/Scrapling) [![Supported Python versions](https://img.shields.io/pypi/pyversions/scrapling.svg)](https://pypi.org/project/scrapling/) [![PyPI Downloads](https://static.pepy.tech/badge/scrapling)](https://pepy.tech/project/scrapling)
 
 Dealing with failing web scrapers due to website changes? Meet Scrapling.
 
@@ -37,7 +37,7 @@ products = page.css('.product', auto_match=True) # Still finds them!
 
 ## Getting Started
 
-Let's walk through a basic example that demonstrates small group of Scrapling's core features:
+Let's walk through a basic example that demonstrates a small group of Scrapling's core features:
 
 ```python
 import requests
@@ -415,6 +415,9 @@ Of course, you can find elements by text/regex, find similar elements in a more
 ### Is Scrapling thread-safe?
 Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.
 
+## Sponsors
+[![Capsolver Banner](https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/CapSolver.png)](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
+
 ## Contributing
 Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!
 
@@ -431,4 +434,4 @@ This project includes code adapted from:
 - In the auto-matching save process, the unique properties of the first element from the selection results are the only ones that get saved. So if the selector you are using selects different elements on the page that are in different locations, auto-matching will probably return to you the first element only when you relocate it later. This doesn't include combined CSS selectors (Using commas to combine more than one selector for example) as these selectors get separated and each selector gets executed alone.
 - Currently, Scrapling is not compatible with async/await.
 
-<div align="center"><small>Made with ❤️ by Karim Shoair</small></div><br>
+<div align="center"><small>Made with ❤️ by Karim Shoair</small></div><br>
@@ -3,7 +3,7 @@ from scrapling.parser import Adaptor, Adaptors
 from scrapling.custom_types import TextHandler, AttributesHandler
 
 __author__ = "Karim Shoair (karim.shoair@pm.me)"
-__version__ = "0.1"
+__version__ = "0.1.2"
 __copyright__ = "Copyright (c) 2024 Karim Shoair"
 
 
@@ -78,7 +78,7 @@ class Adaptor(SelectorsGeneration):
 
         parser = html.HTMLParser(
             # https://lxml.de/api/lxml.etree.HTMLParser-class.html
-            recover=True, remove_blank_text=True, remove_comments=(keep_comments is True), encoding=encoding,
+            recover=True, remove_blank_text=True, remove_comments=(keep_comments is False), encoding=encoding,
             compact=True, huge_tree=huge_tree, default_doctype=True
         )
         self._root = etree.fromstring(body, parser=parser, base_url=url)
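The one-character change above fixes an inverted predicate: with `remove_comments=(keep_comments is True)`, asking to keep comments actually stripped them during parsing. A minimal standalone sketch of the corrected behaviour against lxml (the sample document and the `parse` helper are illustrative, not Scrapling code):

```python
from lxml import etree, html

DOC = b"<html><body><span>CONDITION: <!-- hidden -->Excellent</span></body></html>"

def parse(keep_comments: bool) -> str:
    # Corrected predicate: strip comments only when the user did NOT ask to keep them
    parser = html.HTMLParser(recover=True, remove_comments=(keep_comments is False))
    root = etree.fromstring(DOC, parser=parser)
    return etree.tostring(root, encoding="unicode")

print("<!--" in parse(keep_comments=True))   # comments survive the parse
print("<!--" in parse(keep_comments=False))  # comments are stripped
```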
@@ -142,7 +142,8 @@ class Adaptor(SelectorsGeneration):
         if issubclass(type(element), html.HtmlMixin):
             return self.__class__(
                 root=element, url=self.url, encoding=self.encoding, auto_match=self.__auto_match_enabled,
-                keep_comments=self.__keep_comments, huge_tree=self.__huge_tree_enabled, debug=self.__debug
+                keep_comments=True,  # if the comments are already removed in initialization, no need to try to delete them in sub-elements
+                huge_tree=self.__huge_tree_enabled, debug=self.__debug
             )
         return element
 
@@ -186,7 +187,23 @@
     def text(self) -> TextHandler:
         """Get text content of the element"""
        if not self.__text:
-            self.__text = TextHandler(self._root.text)
+            if self.__keep_comments:
+                if not self.children:
+                    # If the user chose to keep comments, remove them from the text anyway.
+                    # Escape lxml's default behaviour, which cuts the text at comments like `<span>CONDITION: <!-- -->Excellent</span>`
+                    # This issue is present in parsel/scrapy as well, so it's handled here so the user can run regex on the full text.
+                    code = self.html_content
+                    parser = html.HTMLParser(
+                        recover=True, remove_blank_text=True, remove_comments=True, encoding=self.encoding,
+                        compact=True, huge_tree=self.__huge_tree_enabled, default_doctype=True
+                    )
+                    fragment_root = html.fragment_fromstring(code, parser=parser)
+                    self.__text = TextHandler(fragment_root.text)
+                else:
+                    self.__text = TextHandler(self._root.text)
+            else:
+                # The user already chose not to keep comments, so all is good
+                self.__text = TextHandler(self._root.text)
         return self.__text
 
     def get_all_text(self, separator: str = "\n", strip: bool = False, ignore_tags: Tuple = ('script', 'style',), valid_values: bool = True) -> TextHandler:
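The re-parse above exists because lxml's `.text` only covers the text before an element's first child node, so an inline comment truncates it ("Excellent" ends up in the comment's `.tail`). A small sketch of the problem and of the remove-comments re-parse that restores the full text (the sample snippet is illustrative):

```python
from lxml import html

SNIPPET = "<span>CONDITION: <!-- -->Excellent</span>"

# Default lxml behaviour: the comment becomes a child node, so .text
# only holds the text that appears before it
plain = html.fragment_fromstring(SNIPPET)

# The fix: re-parse the element's HTML with remove_comments=True so the
# surrounding text nodes merge back into one uninterrupted string
parser = html.HTMLParser(recover=True, remove_comments=True)
clean = html.fragment_fromstring(SNIPPET, parser=parser)

print(repr(plain.text))  # truncated at the comment
print(repr(clean.text))  # full text, comment gone
```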
@@ -1,12 +1,12 @@
 Metadata-Version: 2.1
 Name: scrapling
-Version: 0.1
+Version: 0.1.2
 Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
 Author-email: karim.shoair@pm.me
 License: BSD
-Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/Docs
+Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/tree/main/docs
 Project-URL: Source, https://github.com/D4Vinci/Scrapling
 Project-URL: Tracker, https://github.com/D4Vinci/Scrapling/issues
 Classifier: Operating System :: OS Independent
@@ -20,7 +20,6 @@ Classifier: Topic :: Text Processing :: Markup :: HTML
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Programming Language :: Python :: 3.6
 Classifier: Programming Language :: Python :: 3.7
 Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
@@ -40,7 +39,7 @@ Requires-Dist: orjson>=3
 Requires-Dist: tldextract
 
 # 🕷️ Scrapling: Lightning-Fast, Adaptive Web Scraping for Python
-[![PyPI version](https://badge.fury.io/py/scrapling.svg)](https://badge.fury.io/py/scrapling) [![Supported Python versions](https://img.shields.io/pypi/pyversions/scrapling.svg)](https://pypi.org/project/scrapling/) [![License](https://img.shields.io/badge/License-BSD--3-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
+[![Tests](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg)](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [![PyPI version](https://badge.fury.io/py/Scrapling.svg)](https://badge.fury.io/py/Scrapling) [![Supported Python versions](https://img.shields.io/pypi/pyversions/scrapling.svg)](https://pypi.org/project/scrapling/) [![PyPI Downloads](https://static.pepy.tech/badge/scrapling)](https://pepy.tech/project/scrapling)
 
 Dealing with failing web scrapers due to website changes? Meet Scrapling.
 
@@ -78,7 +77,7 @@ products = page.css('.product', auto_match=True) # Still finds them!
 
 ## Getting Started
 
-Let's walk through a basic example that demonstrates small group of Scrapling's core features:
+Let's walk through a basic example that demonstrates a small group of Scrapling's core features:
 
 ```python
 import requests
@@ -456,6 +455,9 @@ Of course, you can find elements by text/regex, find similar elements in a more
 ### Is Scrapling thread-safe?
 Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.
 
+## Sponsors
+[![Capsolver Banner](https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/CapSolver.png)](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
+
 ## Contributing
 Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!
 
@@ -1,6 +1,6 @@
 [metadata]
 name = scrapling
-version = 0.1
+version = 0.1.2
 author = Karim Shoair
 author_email = karim.shoair@pm.me
 description = Scrapling is a powerful, flexible, adaptive, and high-performance web scraping library for Python.
@@ -6,7 +6,7 @@ with open("README.md", "r", encoding="utf-8") as fh:
 
 setup(
     name="scrapling",
-    version="0.1",
+    version="0.1.2",
     description="""Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 simplifies the process of extracting data from websites, even when they undergo structural changes, and offers
 impressive speed improvements over many popular scraping tools.""",
@@ -36,7 +36,6 @@ setup(
         "Topic :: Software Development :: Libraries :: Python Modules",
         "Programming Language :: Python :: 3",
         "Programming Language :: Python :: 3 :: Only",
-        "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
         "Programming Language :: Python :: 3.8",
         "Programming Language :: Python :: 3.9",
@@ -58,7 +57,7 @@ setup(
     python_requires=">=3.7",
     url="https://github.com/D4Vinci/Scrapling",
     project_urls={
-        "Documentation": "https://github.com/D4Vinci/Scrapling/Docs", # For now
+        "Documentation": "https://github.com/D4Vinci/Scrapling/tree/main/docs", # For now
         "Source": "https://github.com/D4Vinci/Scrapling",
         "Tracker": "https://github.com/D4Vinci/Scrapling/issues",
     }
File without changes