scrapling 0.2.5__tar.gz → 0.2.6__tar.gz

Files changed (49)
  1. {scrapling-0.2.5/scrapling.egg-info → scrapling-0.2.6}/PKG-INFO +6 -3
  2. {scrapling-0.2.5 → scrapling-0.2.6}/README.md +4 -1
  3. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/__init__.py +1 -1
  4. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/pw.py +15 -9
  5. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/fingerprints.py +1 -1
  6. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/fetchers.py +7 -5
  7. {scrapling-0.2.5 → scrapling-0.2.6/scrapling.egg-info}/PKG-INFO +6 -3
  8. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling.egg-info/requires.txt +1 -1
  9. {scrapling-0.2.5 → scrapling-0.2.6}/setup.cfg +1 -1
  10. {scrapling-0.2.5 → scrapling-0.2.6}/setup.py +2 -2
  11. {scrapling-0.2.5 → scrapling-0.2.6}/LICENSE +0 -0
  12. {scrapling-0.2.5 → scrapling-0.2.6}/MANIFEST.in +0 -0
  13. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/core/__init__.py +0 -0
  14. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/core/_types.py +0 -0
  15. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/core/custom_types.py +0 -0
  16. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/core/mixins.py +0 -0
  17. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/core/storage_adaptors.py +0 -0
  18. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/core/translator.py +0 -0
  19. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/core/utils.py +0 -0
  20. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/defaults.py +0 -0
  21. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/__init__.py +0 -0
  22. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/camo.py +0 -0
  23. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/constants.py +0 -0
  24. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/static.py +0 -0
  25. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/__init__.py +0 -0
  26. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/bypasses/navigator_plugins.js +0 -0
  27. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/bypasses/notification_permission.js +0 -0
  28. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/bypasses/pdf_viewer.js +0 -0
  29. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/bypasses/playwright_fingerprint.js +0 -0
  30. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/bypasses/screen_props.js +0 -0
  31. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/bypasses/webdriver_fully.js +0 -0
  32. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/bypasses/window_chrome.js +0 -0
  33. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/custom.py +0 -0
  34. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/engines/toolbelt/navigation.py +0 -0
  35. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/parser.py +0 -0
  36. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling/py.typed +0 -0
  37. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling.egg-info/SOURCES.txt +0 -0
  38. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling.egg-info/dependency_links.txt +0 -0
  39. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling.egg-info/not-zip-safe +0 -0
  40. {scrapling-0.2.5 → scrapling-0.2.6}/scrapling.egg-info/top_level.txt +0 -0
  41. {scrapling-0.2.5 → scrapling-0.2.6}/tests/__init__.py +0 -0
  42. {scrapling-0.2.5 → scrapling-0.2.6}/tests/fetchers/__init__.py +0 -0
  43. {scrapling-0.2.5 → scrapling-0.2.6}/tests/fetchers/test_camoufox.py +0 -0
  44. {scrapling-0.2.5 → scrapling-0.2.6}/tests/fetchers/test_httpx.py +0 -0
  45. {scrapling-0.2.5 → scrapling-0.2.6}/tests/fetchers/test_playwright.py +0 -0
  46. {scrapling-0.2.5 → scrapling-0.2.6}/tests/fetchers/test_utils.py +0 -0
  47. {scrapling-0.2.5 → scrapling-0.2.6}/tests/parser/__init__.py +0 -0
  48. {scrapling-0.2.5 → scrapling-0.2.6}/tests/parser/test_automatch.py +0 -0
  49. {scrapling-0.2.5 → scrapling-0.2.6}/tests/parser/test_general.py +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: scrapling
- Version: 0.2.5
+ Version: 0.2.6
  Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
  Home-page: https://github.com/D4Vinci/Scrapling
  Author: Karim Shoair
@@ -39,7 +39,7 @@ Requires-Dist: w3lib
  Requires-Dist: orjson>=3
  Requires-Dist: tldextract
  Requires-Dist: httpx[brotli,zstd]
- Requires-Dist: playwright
+ Requires-Dist: playwright==1.48
  Requires-Dist: rebrowser-playwright
  Requires-Dist: camoufox>=0.3.10
  Requires-Dist: browserforge
@@ -336,9 +336,11 @@ Using this Fetcher class, you can make requests with:
  * Mimics some of the real browsers' properties by injecting several JS files and using custom options.
  * Using custom flags on launch to hide Playwright even more and make it faster.
  * Generates real browser's headers of the same type and same user OS then append it to the request's headers.
- 3) Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
+ 3) Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
  4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.

+ > Hence using the `real_chrome` argument requires that you have chrome browser installed on your device
+
  Add that to a lot of controlling/hiding options as you will see in the arguments list below.

  <details><summary><strong>Expand this for the complete list of arguments</strong></summary>
@@ -360,6 +362,7 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
  | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
  | disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
  | stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
+ | real_chrome | If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. | ✔️ |
  | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
  | nstbrowser_mode | Enables NSTBrowser mode, **it have to be used with `cdp_url` argument or it will get completely ignored.** | ✔️ |
  | nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | ✔️ |
@@ -290,9 +290,11 @@ Using this Fetcher class, you can make requests with:
  * Mimics some of the real browsers' properties by injecting several JS files and using custom options.
  * Using custom flags on launch to hide Playwright even more and make it faster.
  * Generates real browser's headers of the same type and same user OS then append it to the request's headers.
- 3) Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
+ 3) Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
  4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.

+ > Hence using the `real_chrome` argument requires that you have chrome browser installed on your device
+
  Add that to a lot of controlling/hiding options as you will see in the arguments list below.

  <details><summary><strong>Expand this for the complete list of arguments</strong></summary>
@@ -314,6 +316,7 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
  | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
  | disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
  | stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
+ | real_chrome | If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. | ✔️ |
  | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
  | nstbrowser_mode | Enables NSTBrowser mode, **it have to be used with `cdp_url` argument or it will get completely ignored.** | ✔️ |
  | nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | ✔️ |
@@ -4,7 +4,7 @@ from scrapling.parser import Adaptor, Adaptors
  from scrapling.core.custom_types import TextHandler, AttributesHandler

  __author__ = "Karim Shoair (karim.shoair@pm.me)"
- __version__ = "0.2.5"
+ __version__ = "0.2.6"
  __copyright__ = "Copyright (c) 2024 Karim Shoair"


@@ -27,11 +27,12 @@ class PlaywrightEngine:
  page_action: Callable = do_nothing,
  wait_selector: Optional[str] = None,
  wait_selector_state: Optional[str] = 'attached',
- stealth: bool = False,
- hide_canvas: bool = True,
- disable_webgl: bool = False,
+ stealth: Optional[bool] = False,
+ real_chrome: Optional[bool] = False,
+ hide_canvas: Optional[bool] = False,
+ disable_webgl: Optional[bool] = False,
  cdp_url: Optional[str] = None,
- nstbrowser_mode: bool = False,
+ nstbrowser_mode: Optional[bool] = False,
  nstbrowser_config: Optional[Dict] = None,
  google_search: Optional[bool] = True,
  extra_headers: Optional[Dict[str, str]] = None,
@@ -51,6 +52,7 @@ class PlaywrightEngine:
  :param wait_selector: Wait for a specific css selector to be in a specific state.
  :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
  :param stealth: Enables stealth mode, check the documentation to see what stealth mode does currently.
+ :param real_chrome: If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it.
  :param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
  :param disable_webgl: Disables WebGL and WebGL 2.0 support entirely.
  :param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP.
@@ -67,6 +69,7 @@ class PlaywrightEngine:
  self.stealth = bool(stealth)
  self.hide_canvas = bool(hide_canvas)
  self.disable_webgl = bool(disable_webgl)
+ self.real_chrome = bool(real_chrome)
  self.google_search = bool(google_search)
  self.extra_headers = extra_headers or {}
  self.proxy = construct_proxy_dict(proxy)
@@ -119,7 +122,8 @@ class PlaywrightEngine:
  :param url: Target url.
  :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`
  """
- if not self.stealth:
+ if not self.stealth or self.real_chrome:
+ # Because rebrowser_playwright doesn't play well with real browsers
  from playwright.sync_api import sync_playwright
  else:
  from rebrowser_playwright.sync_api import sync_playwright
@@ -130,8 +134,8 @@ class PlaywrightEngine:
  extra_headers = {}
  useragent = self.useragent
  else:
- extra_headers = generate_headers(browser_mode=True)
- useragent = extra_headers.get('User-Agent')
+ extra_headers = {}
+ useragent = generate_headers(browser_mode=True).get('User-Agent')

  # Prepare the flags before diving
  flags = DEFAULT_STEALTH_FLAGS
@@ -146,9 +150,11 @@ class PlaywrightEngine:
  browser = p.chromium.connect_over_cdp(endpoint_url=cdp_url)
  else:
  if self.stealth:
- browser = p.chromium.launch(headless=self.headless, args=flags, ignore_default_args=['--enable-automation'], chromium_sandbox=True)
+ browser = p.chromium.launch(
+ headless=self.headless, args=flags, ignore_default_args=['--enable-automation'], chromium_sandbox=True, channel='chrome' if self.real_chrome else 'chromium'
+ )
  else:
- browser = p.chromium.launch(headless=self.headless, ignore_default_args=['--enable-automation'])
+ browser = p.chromium.launch(headless=self.headless, ignore_default_args=['--enable-automation'], channel='chrome' if self.real_chrome else 'chromium')

  # Creating the context
  if self.stealth:
@@ -67,7 +67,7 @@ def generate_headers(browser_mode: bool = False) -> Dict:
  # So we don't raise any inconsistency red flags while websites fingerprinting us
  os_name = get_os_name()
  return HeaderGenerator(
- browser=[Browser(name='chrome', min_version=128)],
+ browser=[Browser(name='chrome', min_version=130)],
  os=os_name, # None is ignored
  device='desktop'
  ).generate()
@@ -138,7 +138,7 @@ class PlayWrightFetcher(BaseFetcher):
  2) Mimics some of the real browsers' properties by injecting several JS files and using custom options.
  3) Using custom flags on launch to hide Playwright even more and make it faster.
  4) Generates real browser's headers of the same type and same user OS then append it to the request.
- - Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
+ - Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
  - NSTBrowser's docker browserless option by passing the CDP URL and enabling `nstbrowser_mode` option.

  > Note that these are the main options with PlayWright but it can be mixed together.
@@ -146,12 +146,12 @@ class PlayWrightFetcher(BaseFetcher):
  def fetch(
  self, url: str, headless: Union[bool, str] = True, disable_resources: bool = None,
  useragent: Optional[str] = None, network_idle: Optional[bool] = False, timeout: Optional[float] = 30000,
- page_action: Callable = do_nothing, wait_selector: Optional[str] = None, wait_selector_state: Optional[str] = 'attached',
- hide_canvas: bool = True, disable_webgl: bool = False, extra_headers: Optional[Dict[str, str]] = None, google_search: Optional[bool] = True,
+ page_action: Optional[Callable] = do_nothing, wait_selector: Optional[str] = None, wait_selector_state: Optional[str] = 'attached',
+ hide_canvas: Optional[bool] = False, disable_webgl: Optional[bool] = False, extra_headers: Optional[Dict[str, str]] = None, google_search: Optional[bool] = True,
  proxy: Optional[Union[str, Dict[str, str]]] = None,
- stealth: bool = False,
+ stealth: Optional[bool] = False, real_chrome: Optional[bool] = False,
  cdp_url: Optional[str] = None,
- nstbrowser_mode: bool = False, nstbrowser_config: Optional[Dict] = None,
+ nstbrowser_mode: Optional[bool] = False, nstbrowser_config: Optional[Dict] = None,
  ) -> Response:
  """Opens up a browser and do your request based on your chosen options below.

@@ -167,6 +167,7 @@ class PlayWrightFetcher(BaseFetcher):
  :param wait_selector: Wait for a specific css selector to be in a specific state.
  :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
  :param stealth: Enables stealth mode, check the documentation to see what stealth mode does currently.
+ :param real_chrome: If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it.
  :param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
  :param disable_webgl: Disables WebGL and WebGL 2.0 support entirely.
  :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name.
@@ -184,6 +185,7 @@ class PlayWrightFetcher(BaseFetcher):
  cdp_url=cdp_url,
  headless=headless,
  useragent=useragent,
+ real_chrome=real_chrome,
  page_action=page_action,
  hide_canvas=hide_canvas,
  network_idle=network_idle,
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: scrapling
- Version: 0.2.5
+ Version: 0.2.6
  Summary: Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
  Home-page: https://github.com/D4Vinci/Scrapling
  Author: Karim Shoair
@@ -39,7 +39,7 @@ Requires-Dist: w3lib
  Requires-Dist: orjson>=3
  Requires-Dist: tldextract
  Requires-Dist: httpx[brotli,zstd]
- Requires-Dist: playwright
+ Requires-Dist: playwright==1.48
  Requires-Dist: rebrowser-playwright
  Requires-Dist: camoufox>=0.3.10
  Requires-Dist: browserforge
@@ -336,9 +336,11 @@ Using this Fetcher class, you can make requests with:
  * Mimics some of the real browsers' properties by injecting several JS files and using custom options.
  * Using custom flags on launch to hide Playwright even more and make it faster.
  * Generates real browser's headers of the same type and same user OS then append it to the request's headers.
- 3) Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
+ 3) Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
  4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.

+ > Hence using the `real_chrome` argument requires that you have chrome browser installed on your device
+
  Add that to a lot of controlling/hiding options as you will see in the arguments list below.

  <details><summary><strong>Expand this for the complete list of arguments</strong></summary>
@@ -360,6 +362,7 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
  | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
  | disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
  | stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
+ | real_chrome | If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. | ✔️ |
  | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
  | nstbrowser_mode | Enables NSTBrowser mode, **it have to be used with `cdp_url` argument or it will get completely ignored.** | ✔️ |
  | nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | ✔️ |
@@ -5,7 +5,7 @@ w3lib
  orjson>=3
  tldextract
  httpx[brotli,zstd]
- playwright
+ playwright==1.48
  rebrowser-playwright
  camoufox>=0.3.10
  browserforge
@@ -1,6 +1,6 @@
  [metadata]
  name = scrapling
- version = 0.2.5
+ version = 0.2.6
  author = Karim Shoair
  author_email = karim.shoair@pm.me
  description = Scrapling is an undetectable, powerful, flexible, adaptive, and high-performance web scraping library for Python.
@@ -6,7 +6,7 @@ with open("README.md", "r", encoding="utf-8") as fh:

  setup(
  name="scrapling",
- version="0.2.5",
+ version="0.2.6",
  description="""Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
  simplifies the process of extracting data from websites, even when they undergo structural changes, and offers
  impressive speed improvements over many popular scraping tools.""",
@@ -55,7 +55,7 @@ setup(
  "orjson>=3",
  "tldextract",
  'httpx[brotli,zstd]',
- 'playwright',
+ 'playwright==1.48', # Temporary because currently All libraries that provide CDP patches doesn't support playwright 1.49 yet
  'rebrowser-playwright',
  'camoufox>=0.3.10',
  'browserforge',
File without changes
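
Note on the `scrapling/engines/pw.py` hunks: the new `real_chrome` flag affects two decisions, which Playwright package is imported and which browser `channel` is launched. The following is a minimal sketch of that branching as a standalone function, for illustration only. `select_backend` is a hypothetical helper that does not exist in Scrapling; the strings it returns simply name the module and the `channel` value that appear in the diff.

```python
from typing import Tuple


def select_backend(stealth: bool = False, real_chrome: bool = False) -> Tuple[str, str]:
    """Hypothetical helper mirroring the 0.2.6 branching shown in the pw.py diff.

    Returns (module_name, channel). Neither value is looked up or imported here;
    they are plain strings naming what the engine would pick.
    """
    # Stock playwright is used when stealth is off, and also whenever
    # real_chrome is on, because (per the comment added in the diff)
    # rebrowser_playwright doesn't play well with real browsers.
    if not stealth or real_chrome:
        module = "playwright"
    else:
        module = "rebrowser_playwright"

    # The launch call passes channel='chrome' for an installed Chrome,
    # otherwise 'chromium'.
    channel = "chrome" if real_chrome else "chromium"
    return module, channel


if __name__ == "__main__":
    # Print the full decision table for the two flags.
    for stealth in (False, True):
        for real_chrome in (False, True):
            print(stealth, real_chrome, select_backend(stealth, real_chrome))
```

In the package itself, the `fetchers.py` hunk exposes the flag as the `real_chrome` keyword of `PlayWrightFetcher.fetch`. The `channel` value only applies on the launch path; the diff shows a separate `connect_over_cdp` branch where no browser is launched.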