scrapling-0.1.tar.gz → scrapling-0.1.2.tar.gz
- {scrapling-0.1 → scrapling-0.1.2}/MANIFEST.in +2 -0
- {scrapling-0.1/scrapling.egg-info → scrapling-0.1.2}/PKG-INFO +7 -5
- {scrapling-0.1 → scrapling-0.1.2}/README.md +6 -3
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/__init__.py +1 -1
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/parser.py +20 -3
- {scrapling-0.1 → scrapling-0.1.2/scrapling.egg-info}/PKG-INFO +7 -5
- {scrapling-0.1 → scrapling-0.1.2}/setup.cfg +1 -1
- {scrapling-0.1 → scrapling-0.1.2}/setup.py +2 -3
- {scrapling-0.1 → scrapling-0.1.2}/LICENSE +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/custom_types.py +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/mixins.py +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/storage_adaptors.py +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/translator.py +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling/utils.py +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling.egg-info/SOURCES.txt +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling.egg-info/dependency_links.txt +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling.egg-info/not-zip-safe +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling.egg-info/requires.txt +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/scrapling.egg-info/top_level.txt +0 -0
- {scrapling-0.1 → scrapling-0.1.2}/tests/test_all_functions.py +0 -0
{scrapling-0.1/scrapling.egg-info → scrapling-0.1.2}/PKG-INFO

@@ -1,12 +1,12 @@
 Metadata-Version: 2.1
 Name: scrapling
-Version: 0.1
+Version: 0.1.2
 Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
 Author-email: karim.shoair@pm.me
 License: BSD
-Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/
+Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/tree/main/docs
 Project-URL: Source, https://github.com/D4Vinci/Scrapling
 Project-URL: Tracker, https://github.com/D4Vinci/Scrapling/issues
 Classifier: Operating System :: OS Independent
@@ -20,7 +20,6 @@ Classifier: Topic :: Text Processing :: Markup :: HTML
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Programming Language :: Python :: 3.6
 Classifier: Programming Language :: Python :: 3.7
 Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
@@ -40,7 +39,7 @@ Requires-Dist: orjson>=3
 Requires-Dist: tldextract

 # 🕷️ Scrapling: Lightning-Fast, Adaptive Web Scraping for Python
-[](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [](https://badge.fury.io/py/Scrapling) [](https://pypi.org/project/scrapling/) [](https://pepy.tech/project/scrapling)

 Dealing with failing web scrapers due to website changes? Meet Scrapling.

@@ -78,7 +77,7 @@ products = page.css('.product', auto_match=True) # Still finds them!

 ## Getting Started

-Let's walk through a basic example that demonstrates small group of Scrapling's core features:
+Let's walk through a basic example that demonstrates a small group of Scrapling's core features:

 ```python
 import requests
@@ -456,6 +455,9 @@ Of course, you can find elements by text/regex, find similar elements in a more
 ### Is Scrapling thread-safe?
 Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.

+## Sponsors
+[](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
+
 ## Contributing
 Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!
{scrapling-0.1 → scrapling-0.1.2}/README.md

@@ -1,5 +1,5 @@
 # 🕷️ Scrapling: Lightning-Fast, Adaptive Web Scraping for Python
-[](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [](https://badge.fury.io/py/Scrapling) [](https://pypi.org/project/scrapling/) [](https://pepy.tech/project/scrapling)

 Dealing with failing web scrapers due to website changes? Meet Scrapling.

@@ -37,7 +37,7 @@ products = page.css('.product', auto_match=True) # Still finds them!

 ## Getting Started

-Let's walk through a basic example that demonstrates small group of Scrapling's core features:
+Let's walk through a basic example that demonstrates a small group of Scrapling's core features:

 ```python
 import requests
@@ -415,6 +415,9 @@ Of course, you can find elements by text/regex, find similar elements in a more
 ### Is Scrapling thread-safe?
 Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.

+## Sponsors
+[](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
+
 ## Contributing
 Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!

@@ -431,4 +434,4 @@ This project includes code adapted from:
 - In the auto-matching save process, the unique properties of the first element from the selection results are the only ones that get saved. So if the selector you are using selects different elements on the page that are in different locations, auto-matching will probably return to you the first element only when you relocate it later. This doesn't include combined CSS selectors (Using commas to combine more than one selector for example) as these selectors get separated and each selector gets executed alone.
 - Currently, Scrapling is not compatible with async/await.

-<div align="center"><small>Made with ❤️ by Karim Shoair</small></div><br>
+<div align="center"><small>Made with ❤️ by Karim Shoair</small></div><br>
{scrapling-0.1 → scrapling-0.1.2}/scrapling/__init__.py

@@ -3,7 +3,7 @@ from scrapling.parser import Adaptor, Adaptors
 from scrapling.custom_types import TextHandler, AttributesHandler

 __author__ = "Karim Shoair (karim.shoair@pm.me)"
-__version__ = "0.1"
+__version__ = "0.1.2"
 __copyright__ = "Copyright (c) 2024 Karim Shoair"

{scrapling-0.1 → scrapling-0.1.2}/scrapling/parser.py

@@ -78,7 +78,7 @@ class Adaptor(SelectorsGeneration):

         parser = html.HTMLParser(
             # https://lxml.de/api/lxml.etree.HTMLParser-class.html
-            recover=True, remove_blank_text=True, remove_comments=(keep_comments is
+            recover=True, remove_blank_text=True, remove_comments=(keep_comments is False), encoding=encoding,
             compact=True, huge_tree=huge_tree, default_doctype=True
         )
         self._root = etree.fromstring(body, parser=parser, base_url=url)
@@ -142,7 +142,8 @@ class Adaptor(SelectorsGeneration):
         if issubclass(type(element), html.HtmlMixin):
             return self.__class__(
                 root=element, url=self.url, encoding=self.encoding, auto_match=self.__auto_match_enabled,
-                keep_comments=
+                keep_comments=True,  # if the comments are already removed in initialization, no need to try to delete them in sub-elements
+                huge_tree=self.__huge_tree_enabled, debug=self.__debug
             )
         return element

@@ -186,7 +187,23 @@ class Adaptor(SelectorsGeneration):
     def text(self) -> TextHandler:
         """Get text content of the element"""
         if not self.__text:
-            self.__text = TextHandler(self._root.text)
+            if self.__keep_comments:
+                if not self.children:
+                    # If the user chose to keep comments, remove comments from text
+                    # Escape lxml default behaviour and remove comments like this `<span>CONDITION: <!-- -->Excellent</span>`
+                    # This issue is present in parsel/scrapy as well so no need to repeat it here so the user can run regex on the full text.
+                    code = self.html_content
+                    parser = html.HTMLParser(
+                        recover=True, remove_blank_text=True, remove_comments=True, encoding=self.encoding,
+                        compact=True, huge_tree=self.__huge_tree_enabled, default_doctype=True
+                    )
+                    fragment_root = html.fragment_fromstring(code, parser=parser)
+                    self.__text = TextHandler(fragment_root.text)
+                else:
+                    self.__text = TextHandler(self._root.text)
+            else:
+                # If the user already chose to not keep comments then all is good
+                self.__text = TextHandler(self._root.text)
         return self.__text

     def get_all_text(self, separator: str = "\n", strip: bool = False, ignore_tags: Tuple = ('script', 'style',), valid_values: bool = True) -> TextHandler:
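The `text` change above works around an lxml parsing quirk: when comments are kept in the tree, an element's `.text` attribute stops at the first comment node, so text split by an inline comment comes back incomplete. A minimal standalone sketch of that behaviour (not Scrapling's code; the sample markup is made up, borrowing the example from the diff's own comment):

```python
# Demonstrates the lxml behaviour the parser.py change works around:
# with comments kept, an element's .text ends at the first comment node,
# while re-parsing the same markup with remove_comments=True merges the
# surrounding text back into one piece.
from lxml import html

fragment = '<span>CONDITION: <!-- hidden -->Excellent</span>'

# Comments kept: the comment node splits the text, so .text holds only
# the part before the comment (the rest lives in the comment's .tail).
keeping = html.fragment_fromstring(
    fragment, parser=html.HTMLParser(remove_comments=False)
)
print(repr(keeping.text))  # 'CONDITION: '

# Comments stripped at parse time: adjacent text merges back together,
# which is what the new .text property does for childless elements.
stripped = html.fragment_fromstring(
    fragment, parser=html.HTMLParser(remove_comments=True)
)
print(repr(stripped.text))  # 'CONDITION: Excellent'
```

This is why the diff re-parses `self.html_content` with `remove_comments=True` only when `keep_comments` was requested at initialization: the original tree must retain its comments, so the merge has to happen on a throwaway copy.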
{scrapling-0.1 → scrapling-0.1.2/scrapling.egg-info}/PKG-INFO

@@ -1,12 +1,12 @@
 Metadata-Version: 2.1
 Name: scrapling
-Version: 0.1
+Version: 0.1.2
 Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 Home-page: https://github.com/D4Vinci/Scrapling
 Author: Karim Shoair
 Author-email: karim.shoair@pm.me
 License: BSD
-Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/
+Project-URL: Documentation, https://github.com/D4Vinci/Scrapling/tree/main/docs
 Project-URL: Source, https://github.com/D4Vinci/Scrapling
 Project-URL: Tracker, https://github.com/D4Vinci/Scrapling/issues
 Classifier: Operating System :: OS Independent
@@ -20,7 +20,6 @@ Classifier: Topic :: Text Processing :: Markup :: HTML
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Programming Language :: Python :: 3.6
 Classifier: Programming Language :: Python :: 3.7
 Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
@@ -40,7 +39,7 @@ Requires-Dist: orjson>=3
 Requires-Dist: tldextract

 # 🕷️ Scrapling: Lightning-Fast, Adaptive Web Scraping for Python
-[](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [](https://badge.fury.io/py/Scrapling) [](https://pypi.org/project/scrapling/) [](https://pepy.tech/project/scrapling)

 Dealing with failing web scrapers due to website changes? Meet Scrapling.

@@ -78,7 +77,7 @@ products = page.css('.product', auto_match=True) # Still finds them!

 ## Getting Started

-Let's walk through a basic example that demonstrates small group of Scrapling's core features:
+Let's walk through a basic example that demonstrates a small group of Scrapling's core features:

 ```python
 import requests
@@ -456,6 +455,9 @@ Of course, you can find elements by text/regex, find similar elements in a more
 ### Is Scrapling thread-safe?
 Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.

+## Sponsors
+[](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
+
 ## Contributing
 Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!
{scrapling-0.1 → scrapling-0.1.2}/setup.py

@@ -6,7 +6,7 @@ with open("README.md", "r", encoding="utf-8") as fh:

 setup(
     name="scrapling",
-    version="0.1",
+    version="0.1.2",
     description="""Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
 simplifies the process of extracting data from websites, even when they undergo structural changes, and offers
 impressive speed improvements over many popular scraping tools.""",
@@ -36,7 +36,6 @@ setup(
         "Topic :: Software Development :: Libraries :: Python Modules",
         "Programming Language :: Python :: 3",
         "Programming Language :: Python :: 3 :: Only",
-        "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
         "Programming Language :: Python :: 3.8",
         "Programming Language :: Python :: 3.9",
@@ -58,7 +57,7 @@ setup(
     python_requires=">=3.7",
     url="https://github.com/D4Vinci/Scrapling",
     project_urls={
-        "Documentation": "https://github.com/D4Vinci/Scrapling/
+        "Documentation": "https://github.com/D4Vinci/Scrapling/tree/main/docs",  # For now
         "Source": "https://github.com/D4Vinci/Scrapling",
         "Tracker": "https://github.com/D4Vinci/Scrapling/issues",
     }