stealth-requests 0.1.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- stealth_requests-0.1/PKG-INFO +110 -0
- stealth_requests-0.1/README.md +93 -0
- stealth_requests-0.1/pyproject.toml +28 -0
- stealth_requests-0.1/setup.cfg +4 -0
- stealth_requests-0.1/setup.py +27 -0
- stealth_requests-0.1/stealth_requests/__init__.py +16 -0
- stealth_requests-0.1/stealth_requests/response.py +108 -0
- stealth_requests-0.1/stealth_requests/session.py +154 -0
- stealth_requests-0.1/stealth_requests.egg-info/PKG-INFO +110 -0
- stealth_requests-0.1/stealth_requests.egg-info/SOURCES.txt +11 -0
- stealth_requests-0.1/stealth_requests.egg-info/dependency_links.txt +1 -0
- stealth_requests-0.1/stealth_requests.egg-info/requires.txt +6 -0
- stealth_requests-0.1/stealth_requests.egg-info/top_level.txt +1 -0
stealth_requests-0.1/PKG-INFO

@@ -0,0 +1,110 @@
Metadata-Version: 2.1
Name: stealth-requests
Version: 0.1
Summary: Make HTTP requests exactly like a browser.
Home-page: https://github.com/jpjacobpadilla/Stealth-Requests
Author: Jacob Padilla
Author-email: Jacob Padilla <jp@jacobpadilla.com>
License: MIT
Project-URL: Homepage, https://github.com/jpjacobpadilla/Stealth-Requests
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: curl_cffi
Provides-Extra: parsers
Requires-Dist: lxml; extra == "parsers"
Requires-Dist: html2text; extra == "parsers"
Requires-Dist: beautifulsoup4; extra == "parsers"

<p align="center">
    <img src="https://github.com/jpjacobpadilla/Stealth-Requests/blob/7f83b67a0d62a932663d8216bad7d25971c90aaf/logo.png">
</p>

<h1 align="center">Stay Undetected While Scraping the Web.</h1>

### The All-In-One Solution to Web Scraping:
- Mimic the headers sent by a browser when going to a website (GET requests)
- Automatically handle and update the Referer header & client hint headers
- Mask the TLS fingerprint of the request using the [curl_cffi](https://curl-cffi.readthedocs.io/en/latest/) package
- Automatically parse metadata from HTML responses, such as the page title, description, thumbnail, author, etc.
- Easily get an [lxml](https://lxml.de/apidoc/lxml.html) tree or [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) object from the HTTP response

### Sending Requests

This package mimics the API of the `requests` package, so it can be used in essentially the same way.

You can send one-off requests like this:

```python
import stealth_requests as requests

resp = requests.get(link)
```

Or you can use a `StealthSession` object, which keeps track of certain headers between requests, such as the `Referer` header:

```python
from stealth_requests import StealthSession

with StealthSession() as s:
    resp = s.get(link)
```

When sending a one-off request or creating a session, you can specify the type of browser you want the request to mimic: either `safari` or `chrome` (the default).

### Sending Requests With Asyncio

This package supports asyncio in the same way as the `requests` package:

```python
from stealth_requests import AsyncStealthSession

async with AsyncStealthSession(impersonate='chrome') as s:
    resp = await s.get(link)
```

Or, if you don't need a stealth session for a one-off async request, you can use curl_cffi's `AsyncSession` directly:

```python
from curl_cffi.requests import AsyncSession

async with AsyncSession() as s:
    resp = await s.post(link, data=...)
```

### Getting Response Metadata

The response returned from this package is a `StealthResponse`, which has all of the same methods and attributes as a standard `requests` response, plus a few added features. One is automatic parsing of metadata from the HTML head. The metadata can be accessed through the `meta` attribute, which gives you the following fields (when available on the scraped page):

- title: str
- description: str
- thumbnail: str
- author: str
- keywords: tuple[str]
- twitter_handle: str
- robots: tuple[str]
- canonical: str

Here's an example of how to get the title of a page:

```python
import stealth_requests as requests

resp = requests.get(link)
print(resp.meta.title)
```

### Parsing Responses

To make parsing HTML easier, this project also integrates two popular parsing packages: `lxml` and `BeautifulSoup4`. To use these add-ons, install the parsers extra: `pip install stealth_requests[parsers]`.

To get an lxml tree, use `resp.tree()`; to get a BeautifulSoup object, use the `resp.soup()` method.

For simple parsing, the following convenience methods are available directly on the `StealthResponse` object:

- `iterlinks`: Iterate over all links in an HTML response
- `itertext`: Iterate over all text in an HTML response
- `text_content`: Get all text content in an HTML response
- `xpath`: Run XPath expressions directly, without getting your own lxml tree

### Getting an HTML Response in Markdown Format

Sometimes it's easier to work with a webpage as Markdown instead of HTML. After sending a GET request, use the `resp.markdown()` method to convert the response to Markdown.
stealth_requests-0.1/README.md

@@ -0,0 +1,93 @@
<p align="center">
    <img src="https://github.com/jpjacobpadilla/Stealth-Requests/blob/7f83b67a0d62a932663d8216bad7d25971c90aaf/logo.png">
</p>

<h1 align="center">Stay Undetected While Scraping the Web.</h1>

### The All-In-One Solution to Web Scraping:
- Mimic the headers sent by a browser when going to a website (GET requests)
- Automatically handle and update the Referer header & client hint headers
- Mask the TLS fingerprint of the request using the [curl_cffi](https://curl-cffi.readthedocs.io/en/latest/) package
- Automatically parse metadata from HTML responses, such as the page title, description, thumbnail, author, etc.
- Easily get an [lxml](https://lxml.de/apidoc/lxml.html) tree or [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) object from the HTTP response

### Sending Requests

This package mimics the API of the `requests` package, so it can be used in essentially the same way.

You can send one-off requests like this:

```python
import stealth_requests as requests

resp = requests.get(link)
```

Or you can use a `StealthSession` object, which keeps track of certain headers between requests, such as the `Referer` header:

```python
from stealth_requests import StealthSession

with StealthSession() as s:
    resp = s.get(link)
```
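The per-host Referer bookkeeping the session does between requests can be sketched in isolation. This is a hypothetical stand-alone version (the `RefererTracker` name and the Google seed value are illustrative, mirroring but not taken from the package's internals): each host's previous URL becomes the `Referer` for the next request to that host.

```python
from collections import defaultdict
from urllib.parse import urlparse

class RefererTracker:
    """Minimal sketch: remember the last URL requested per host and
    use it as the Referer for the next request to that host."""

    def __init__(self, default_referer: str = 'https://www.google.com/'):
        # Hosts we have never visited fall back to a plausible seed referer.
        self._last_url = defaultdict(lambda: default_referer)

    def headers_for(self, url: str) -> dict:
        host = urlparse(url).netloc
        headers = {'Host': host, 'Referer': self._last_url[host]}
        self._last_url[host] = url  # the next request to this host refers back here
        return headers

tracker = RefererTracker()
first = tracker.headers_for('https://example.com/page1')
second = tracker.headers_for('https://example.com/page2')
print(first['Referer'])   # the default seed referer
print(second['Referer'])  # https://example.com/page1
```

This makes a crawl through a site look like a sequence of clicks rather than a series of cold requests.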
When sending a one-off request or creating a session, you can specify the type of browser you want the request to mimic: either `safari` or `chrome` (the default).

### Sending Requests With Asyncio

This package supports asyncio in the same way as the `requests` package:

```python
from stealth_requests import AsyncStealthSession

async with AsyncStealthSession(impersonate='chrome') as s:
    resp = await s.get(link)
```

Or, if you don't need a stealth session for a one-off async request, you can use curl_cffi's `AsyncSession` directly:

```python
from curl_cffi.requests import AsyncSession

async with AsyncSession() as s:
    resp = await s.post(link, data=...)
```

### Getting Response Metadata

The response returned from this package is a `StealthResponse`, which has all of the same methods and attributes as a standard `requests` response, plus a few added features. One is automatic parsing of metadata from the HTML head. The metadata can be accessed through the `meta` attribute, which gives you the following fields (when available on the scraped page):

- title: str
- description: str
- thumbnail: str
- author: str
- keywords: tuple[str]
- twitter_handle: str
- robots: tuple[str]
- canonical: str

Here's an example of how to get the title of a page:

```python
import stealth_requests as requests

resp = requests.get(link)
print(resp.meta.title)
```
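The tuple-valued fields (`keywords`, `robots`) come from comma-separated meta tag content. A minimal sketch of that split-and-strip step (a stand-alone illustration; the function name here is hypothetical, not the package's public API):

```python
def format_meta_list(content: str) -> tuple:
    """Split comma-separated meta tag content into a tuple of trimmed items."""
    return tuple(item.strip() for item in content.split(','))

keywords = format_meta_list('web scraping, http, python')
print(keywords)  # ('web scraping', 'http', 'python')
```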

### Parsing Responses

To make parsing HTML easier, this project also integrates two popular parsing packages: `lxml` and `BeautifulSoup4`. To use these add-ons, install the parsers extra: `pip install stealth_requests[parsers]`.

To get an lxml tree, use `resp.tree()`; to get a BeautifulSoup object, use the `resp.soup()` method.

For simple parsing, the following convenience methods are available directly on the `StealthResponse` object:

- `iterlinks`: Iterate over all links in an HTML response
- `itertext`: Iterate over all text in an HTML response
- `text_content`: Get all text content in an HTML response
- `xpath`: Run XPath expressions directly, without getting your own lxml tree

### Getting an HTML Response in Markdown Format

Sometimes it's easier to work with a webpage as Markdown instead of HTML. After sending a GET request, use the `resp.markdown()` method to convert the response to Markdown.
stealth_requests-0.1/pyproject.toml

@@ -0,0 +1,28 @@
[build-system]
requires = ["setuptools>=42", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "stealth-requests"
version = "0.1"
description = "Make HTTP requests exactly like a browser."
readme = "README.md"
requires-python = ">=3.6"
license = { text = "MIT" }
authors = [
    { name = "Jacob Padilla", email = "jp@jacobpadilla.com" }
]
keywords = [""]
urls = { "Homepage" = "https://github.com/jpjacobpadilla/Stealth-Requests" }

dependencies = ["curl_cffi"]

[project.optional-dependencies]
parsers = [
    "lxml",
    "html2text",
    "beautifulsoup4"
]

[tool.setuptools.packages.find]
include = ["stealth_requests"]
stealth_requests-0.1/setup.py

@@ -0,0 +1,27 @@
from setuptools import setup


with open('README.md', 'r', encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='stealth-requests',
    description='Make HTTP requests exactly like a browser.',
    version='0.1',
    packages=['stealth_requests'],
    install_requires=['curl_cffi'],
    extras_require={
        'parsers': [
            'lxml',
            'html2text',
            'beautifulsoup4'
        ]
    },
    author='Jacob Padilla',
    author_email='jp@jacobpadilla.com',
    url='https://github.com/jpjacobpadilla/Stealth-Requests',
    license='MIT',
    long_description=long_description,
    long_description_content_type='text/markdown',
    keywords=''
)
stealth_requests-0.1/stealth_requests/__init__.py

@@ -0,0 +1,16 @@
from functools import partial
from .session import StealthSession, AsyncStealthSession
from curl_cffi.requests import *


def request(method: str, url: str, *args, **kwargs) -> Response:
    with StealthSession() as s:
        return s.request(method, url, *args, **kwargs)

head = partial(request, "HEAD")
get = partial(request, "GET")
post = partial(request, "POST")
put = partial(request, "PUT")
patch = partial(request, "PATCH")
delete = partial(request, "DELETE")
options = partial(request, "OPTIONS")
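The module-level `get`/`post` helpers above are just `functools.partial` wrappers that pre-fill the HTTP verb of a generic `request` function. The pattern in isolation (with a hypothetical stub standing in for the real session call):

```python
from functools import partial

def request(method: str, url: str, **kwargs) -> str:
    # Stub standing in for StealthSession.request; a real implementation
    # would open a session and perform the HTTP call.
    return f'{method} {url}'

# One partial per HTTP verb, mirroring the module's public API.
get = partial(request, 'GET')
post = partial(request, 'POST')

print(get('https://example.com'))  # GET https://example.com
```

This keeps the verb-specific helpers to one line each while all real logic lives in `request`.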
stealth_requests-0.1/stealth_requests/response.py

@@ -0,0 +1,108 @@
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from lxml.html import HtmlElement
    from bs4 import BeautifulSoup


@dataclass
class Metadata:
    title: str | None
    description: str | None
    thumbnail: str | None
    author: str | None
    keywords: tuple[str, ...] | None
    twitter_handle: str | None
    robots: tuple[str, ...] | None
    canonical: str | None

PARSER_IMPORT_SOLUTION = "Install it using 'pip install stealth-requests[parsers]'."


class StealthResponse:
    def __init__(self, resp):
        self._response = resp

        self._tree = None
        self._important_meta_tags = None

    def __getattr__(self, name):
        return getattr(self._response, name)

    def _get_tree(self):
        try:
            from lxml import html
        except ImportError:
            raise ImportError(f'Lxml is not installed. {PARSER_IMPORT_SOLUTION}')

        self._tree = html.fromstring(self.content)
        return self._tree

    def tree(self) -> HtmlElement:
        return self._tree or self._get_tree()

    def soup(self, parser: str = 'html.parser') -> BeautifulSoup:
        try:
            from bs4 import BeautifulSoup
        except ImportError:
            raise ImportError(f'BeautifulSoup4 is not installed. {PARSER_IMPORT_SOLUTION}')

        return BeautifulSoup(self.content, parser)

    def markdown(self) -> str:
        try:
            import html2text
        except ImportError:
            raise ImportError(f'Html2text is required for markdown extraction. {PARSER_IMPORT_SOLUTION}')

        text_maker = html2text.HTML2Text()
        text_maker.ignore_links = True
        return text_maker.handle(str(self.soup()))

    def xpath(self, xp: str):
        return self.tree().xpath(xp)

    def iterlinks(self, *args, **kwargs):
        return self.tree().iterlinks(*args, **kwargs)

    def itertext(self, *args, **kwargs):
        return self.tree().itertext(*args, **kwargs)

    def text_content(self, *args, **kwargs):
        return self.tree().text_content(*args, **kwargs)

    @staticmethod
    def _format_meta_list(content: str) -> tuple[str, ...]:
        items = content.split(',')
        return tuple(item.strip() for item in items)

    def _set_important_meta_tags(self) -> Metadata:
        tree = self.tree()

        title = tree.xpath('//head/title/text()')
        description = tree.xpath('//head/meta[@name="description"]/@content')
        thumbnail = tree.xpath('//head/meta[@property="og:image"]/@content')
        author = tree.xpath('//head/meta[@name="author"]/@content')
        keywords = tree.xpath('//head/meta[@name="keywords"]/@content')
        twitter_handle = tree.xpath('//head/meta[@name="twitter:site"]/@content')
        robots = tree.xpath('//head/meta[@name="robots"]/@content')
        canonical = tree.xpath('//head/link[@rel="canonical"]/@href')  # <link> carries its URL in href, not content

        self._important_meta_tags = Metadata(
            title=title[0] if title else None,
            description=description[0] if description else None,
            thumbnail=thumbnail[0] if thumbnail else None,
            author=author[0] if author else None,
            keywords=self._format_meta_list(keywords[0]) if keywords else None,
            twitter_handle=twitter_handle[0] if twitter_handle else None,
            robots=self._format_meta_list(robots[0]) if robots else None,
            canonical=canonical[0] if canonical else None
        )
        return self._important_meta_tags

    @property
    def meta(self):
        return self._important_meta_tags or self._set_important_meta_tags()
stealth_requests-0.1/stealth_requests/session.py

@@ -0,0 +1,154 @@
import os
import random
import json
from dataclasses import dataclass
from urllib.parse import urlparse
from collections import defaultdict
from functools import partialmethod

from .response import StealthResponse

from curl_cffi.requests.session import Session, AsyncSession
from curl_cffi.requests.models import Response


@dataclass
class ClientProfile:
    user_agent: str
    sec_ch_ua: str
    sec_ch_ua_mobile: str
    sec_ch_ua_platform: str


class BaseStealthSession:
    def __init__(
        self,
        client_profile: ClientProfile = None,
        impersonate: str = 'chrome124',
        **kwargs
    ):
        if impersonate.lower() in ('chrome', 'chrome124'):
            impersonate = 'chrome124'
        elif impersonate.lower() in ('safari', 'safari_17_0', 'safari17'):
            impersonate = 'safari17_0'

        self.profile = client_profile or BaseStealthSession.create_profile(impersonate)
        self.last_request_url = defaultdict(lambda: 'https://www.google.com/')

        super().__init__(
            headers=self.initialize_chrome_headers()
            if impersonate == 'chrome124'
            else self.initialize_safari_headers(),
            impersonate=impersonate,
            **kwargs
        )

    def __enter__(self):
        return self

    def __exit__(self, *_):
        self.close()
        return False

    async def __aenter__(self):
        return self

    async def __aexit__(self, *_):
        self.close()
        return False

    @staticmethod
    def create_profile(impersonate: str) -> ClientProfile:
        file_path = os.path.join(os.path.dirname(__file__), 'profiles.json')

        with open(file_path, encoding='utf-8', mode='r') as file:
            user_agents = json.load(file)

        assert impersonate in user_agents, f'Please choose one of the supported profiles: {list(user_agents)}'

        # Client hint headers only apply to the Chrome profile
        # (note the normalized key is 'chrome124', with no underscore).
        return ClientProfile(
            user_agent=random.choice(user_agents[impersonate]),
            sec_ch_ua='"Not A;Brand";v="99", "Chromium";v="124", "Google Chrome";v="124"' if impersonate == 'chrome124' else None,
            sec_ch_ua_mobile='?0' if impersonate == 'chrome124' else None,
            sec_ch_ua_platform='"macOS"' if impersonate == 'chrome124' else None
        )

    def initialize_chrome_headers(self) -> dict[str, str]:
        return {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
            "Accept-Encoding": "gzip, deflate, br, zstd",
            "Accept-Language": "en-US,en;q=0.9",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Pragma": "no-cache",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": self.profile.user_agent,
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "same-origin",
            "Sec-Fetch-User": "?1",
            "sec-ch-ua": self.profile.sec_ch_ua,
            "sec-ch-ua-mobile": self.profile.sec_ch_ua_mobile,
            "sec-ch-ua-platform": self.profile.sec_ch_ua_platform,
        }

    def initialize_safari_headers(self) -> dict[str, str]:
        return {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.9",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Pragma": "no-cache",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "same-origin",
            "User-Agent": self.profile.user_agent
        }

    def get_dynamic_headers(self, url: str) -> dict[str, str]:
        parsed_url = urlparse(url)
        host = parsed_url.netloc

        headers = {
            "Host": host,
            "Referer": self.last_request_url[host]
        }

        self.last_request_url[host] = url
        return headers


class StealthSession(BaseStealthSession, Session):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def request(self, method: str, url: str, *args, **kwargs) -> Response:
        headers = self.get_dynamic_headers(url) | kwargs.pop('headers', {})
        resp = Session.request(self, method, url, *args, headers=headers, **kwargs)
        return StealthResponse(resp)

    head = partialmethod(request, "HEAD")
    get = partialmethod(request, "GET")
    post = partialmethod(request, "POST")
    put = partialmethod(request, "PUT")
    patch = partialmethod(request, "PATCH")
    delete = partialmethod(request, "DELETE")
    options = partialmethod(request, "OPTIONS")


class AsyncStealthSession(BaseStealthSession, AsyncSession):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    async def request(self, method: str, url: str, *args, **kwargs) -> Response:
        headers = self.get_dynamic_headers(url) | kwargs.pop('headers', {})
        resp = await AsyncSession.request(self, method, url, *args, headers=headers, **kwargs)
        return StealthResponse(resp)

    head = partialmethod(request, "HEAD")
    get = partialmethod(request, "GET")
    post = partialmethod(request, "POST")
    put = partialmethod(request, "PUT")
    patch = partialmethod(request, "PATCH")
    delete = partialmethod(request, "DELETE")
    options = partialmethod(request, "OPTIONS")
stealth_requests-0.1/stealth_requests.egg-info/PKG-INFO

@@ -0,0 +1,110 @@
(identical to stealth_requests-0.1/PKG-INFO above)
stealth_requests-0.1/stealth_requests.egg-info/SOURCES.txt

@@ -0,0 +1,11 @@
README.md
pyproject.toml
setup.py
stealth_requests/__init__.py
stealth_requests/response.py
stealth_requests/session.py
stealth_requests.egg-info/PKG-INFO
stealth_requests.egg-info/SOURCES.txt
stealth_requests.egg-info/dependency_links.txt
stealth_requests.egg-info/requires.txt
stealth_requests.egg-info/top_level.txt
stealth_requests-0.1/stealth_requests.egg-info/dependency_links.txt

@@ -0,0 +1 @@

stealth_requests-0.1/stealth_requests.egg-info/top_level.txt

@@ -0,0 +1 @@
stealth_requests