phantomfetch 0.1.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- phantomfetch-0.1.2/PKG-INFO +404 -0
- phantomfetch-0.1.2/README.md +373 -0
- phantomfetch-0.1.2/pyproject.toml +94 -0
- phantomfetch-0.1.2/src/phantomfetch/__init__.py +82 -0
- phantomfetch-0.1.2/src/phantomfetch/cache.py +259 -0
- phantomfetch-0.1.2/src/phantomfetch/engines/__init__.py +8 -0
- phantomfetch-0.1.2/src/phantomfetch/engines/base.py +31 -0
- phantomfetch-0.1.2/src/phantomfetch/engines/browser/__init__.py +10 -0
- phantomfetch-0.1.2/src/phantomfetch/engines/browser/actions.py +183 -0
- phantomfetch-0.1.2/src/phantomfetch/engines/browser/baas.py +39 -0
- phantomfetch-0.1.2/src/phantomfetch/engines/browser/base.py +181 -0
- phantomfetch-0.1.2/src/phantomfetch/engines/browser/cdp.py +452 -0
- phantomfetch-0.1.2/src/phantomfetch/engines/curl.py +291 -0
- phantomfetch-0.1.2/src/phantomfetch/fetch.py +452 -0
- phantomfetch-0.1.2/src/phantomfetch/pool.py +65 -0
- phantomfetch-0.1.2/src/phantomfetch/telemetry.py +35 -0
- phantomfetch-0.1.2/src/phantomfetch/types.py +200 -0
@@ -0,0 +1,404 @@
Metadata-Version: 2.3
Name: phantomfetch
Version: 0.1.2
Summary: High-performance agentic web scraping library combining curl-cffi speed with Playwright browser capabilities
Keywords: web-scraping,playwright,curl-cffi,async,browser-automation,http-client,agentic,anti-detection
Author: CosmicBull
Author-email: CosmicBull <cosmicbull@example.com>
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Framework :: AsyncIO
Requires-Dist: curl-cffi>=0.13.0
Requires-Dist: msgspec>=0.20.0
Requires-Dist: playwright>=1.56.0
Requires-Dist: rusticsoup>=0.3.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: opentelemetry-api>=1.38.0
Requires-Dist: loguru>=0.7.3
Requires-Python: >=3.13
Project-URL: Changelog, https://github.com/iristech-systems/PhantomFetch/blob/main/CHANGELOG.md
Project-URL: Documentation, https://github.com/iristech-systems/PhantomFetch#readme
Project-URL: Homepage, https://github.com/iristech-systems/PhantomFetch
Project-URL: Issues, https://github.com/iristech-systems/PhantomFetch/issues
Project-URL: Repository, https://github.com/iristech-systems/PhantomFetch
Description-Content-Type: text/markdown

# PhantomFetch

[](https://www.python.org/downloads/)
[](https://opensource.org/licenses/MIT)
[](https://github.com/astral-sh/ruff)

**PhantomFetch** is a high-performance, agentic web scraping library for Python. It seamlessly combines the speed of `curl-cffi` with the capabilities of `Playwright`, offering a unified API for all your data extraction needs.

## Why PhantomFetch?

Most web scraping forces a choice between speed (httpx, requests) and browser capabilities (Playwright, Selenium). PhantomFetch gives you **both** with a unified interface:

| Feature | PhantomFetch | requests/httpx | Playwright/Selenium |
|---------|--------------|----------------|---------------------|
| **Speed** | ⚡ Fast (curl-cffi) | ⚡ Fast | 🐌 Slow |
| **JavaScript Support** | ✅ Yes (Playwright) | ❌ No | ✅ Yes |
| **Anti-Detection** | ✅ Built-in | ❌ No | ⚠️ Manual |
| **Smart Caching** | ✅ Configurable | ❌ No | ❌ No |
| **Proxy Rotation** | ✅ Automatic | ⚠️ Manual | ⚠️ Manual |
| **Async-First** | ✅ Yes | ⚠️ Partial | ✅ Yes |
| **Unified API** | ✅ One interface | N/A | N/A |
| **OpenTelemetry** | ✅ Built-in | ❌ No | ❌ No |

**Key Benefits:**
- 🎯 **Start Fast, Scale Smart**: Use curl for quick requests, switch to a browser when needed
- 🧠 **Intelligent**: Automatic retry logic, exponential backoff, fingerprint rotation
- 📊 **Production-Ready**: Built-in observability, caching, and error handling
- 🛠️ **Developer-Friendly**: Intuitive API, comprehensive type hints, rich documentation

## Features

- 🚀 **Unified API**: Switch between `curl` (fast, lightweight) and `browser` (JavaScript-capable) engines with a single parameter
- 🧠 **Smart Caching**: Configurable caching strategies (`all`, `resources`, `conservative`) to speed up development and save bandwidth
- 🤖 **Agentic Actions**: Define browser interactions (click, scroll, input, wait) declaratively
- 🛡️ **Anti-Detection**: Built-in support for proxy rotation and fingerprinting protection (via `curl-cffi`)
- ⚡ **Async First**: Built on `asyncio` for high concurrency
- 🔄 **Smart Retries**: Configurable retry logic with exponential backoff
- 🍪 **Cookie Management**: Automatic cookie handling across engines
- 📊 **Observability**: OpenTelemetry integration out of the box

## Installation

```bash
pip install phantomfetch
# or with uv (recommended)
uv pip install phantomfetch
```

After installation, install Playwright browsers:
```bash
playwright install chromium
```

## Quick Start

### Basic Fetch (Curl Engine)

```python
import asyncio
from phantomfetch import Fetcher

async def main():
    async with Fetcher() as f:
        response = await f.fetch("https://httpbin.org/get")
        print(response.json())

if __name__ == "__main__":
    asyncio.run(main())
```

### Browser Fetch with Caching

Use the `resources` strategy to cache static assets (images, CSS, scripts) while keeping the main page fresh.

```python
from phantomfetch import Fetcher, FileSystemCache

async def main():
    # Cache sub-resources to speed up subsequent fetches
    cache = FileSystemCache(strategy="resources")

    async with Fetcher(browser_engine="cdp", cache=cache) as f:
        # First run: downloads everything
        resp = await f.fetch("https://example.com", engine="browser")

        # Second run: uses cached resources, only fetches main HTML
        resp = await f.fetch("https://example.com", engine="browser")
        print(resp.text)
```

### Browser Actions

Perform interactions like clicking, scrolling, and taking screenshots:

```python
from phantomfetch import Fetcher

actions = [
    {"action": "wait", "selector": "#search-input"},
    {"action": "input", "selector": "#search-input", "value": "phantomfetch"},
    {"action": "click", "selector": "#search-button"},
    {"action": "wait_for_load"},
    {"action": "screenshot", "value": "search_results.png"},
]

async with Fetcher(browser_engine="cdp") as f:
    resp = await f.fetch("https://example.com", actions=actions, engine="browser")
    # Screenshot saved to search_results.png
```

### Advanced: Retry Configuration

Fine-tune retry behavior per request:

```python
from phantomfetch import Fetcher

async with Fetcher() as f:
    # Custom retry logic for flaky endpoints
    resp = await f.fetch(
        "https://api.example.com/data",
        max_retries=5,  # Override default retries
        timeout=60.0,   # Longer timeout for slow APIs
    )
```
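The exponential backoff behind these retries can be sketched independently of PhantomFetch's internals. The base delay, cap, and jitter below are illustrative values, not the library's actual defaults:

```python
import random

def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Illustrative exponential backoff with jitter: base * 2**attempt, capped."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        # Add jitter so many clients retrying at once don't synchronize
        delays.append(delay + random.uniform(0, delay / 2))
    return delays

print(backoff_delays(5))  # five increasing, jittered delays
```

Capping the delay keeps a long retry chain from stalling for minutes, and jitter avoids thundering-herd retries against an already struggling endpoint.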

### Cookie Handling

Pass cookies to any engine and retrieve them from the response:

```python
from phantomfetch import Fetcher, Cookie

async with Fetcher() as f:
    # Set cookies
    resp = await f.fetch(
        "https://httpbin.org/cookies",
        cookies={"session_id": "secret_token"},
    )
    print(resp.json())

    # Get cookies (including from redirects)
    resp = await f.fetch("https://httpbin.org/cookies/set/foo/bar")
    for cookie in resp.cookies:
        print(f"{cookie.name}: {cookie.value}")
```

## Configuration

### Caching Strategies

- **`all`**: Caches everything, including the main document. Good for offline development
- **`resources`** (default): Caches sub-resources (images, styles, scripts) but fetches the main document fresh. Best for scraping dynamic sites
- **`conservative`**: Caches only heavy static assets like images and fonts

Example:
```python
from phantomfetch import FileSystemCache, Fetcher

cache = FileSystemCache(
    cache_dir=".cache",
    strategy="resources",
)

async with Fetcher(cache=cache) as f:
    # Resources will be cached automatically
    resp = await f.fetch("https://example.com", engine="browser")
```
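A common way filesystem caches map URLs to files is by hashing the URL. This is a minimal sketch of the idea only, not PhantomFetch's actual on-disk layout:

```python
import hashlib
from pathlib import Path

def cache_path(cache_dir: str, url: str) -> Path:
    # Hash the URL so arbitrary characters never reach the filesystem
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Shard by the first two hex chars to keep directories small
    return Path(cache_dir) / digest[:2] / digest

print(cache_path(".cache", "https://example.com/style.css"))
```

Because the key is a pure function of the URL, repeat fetches of the same resource resolve to the same file with no index to maintain.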

### Proxy Rotation

Multiple proxy strategies are available:

```python
from phantomfetch import Fetcher, Proxy

proxies = [
    Proxy(url="http://user:pass@proxy1:8080", location="US", weight=2),
    Proxy(url="http://user:pass@proxy2:8080", location="EU", weight=1),
]

async with Fetcher(
    proxies=proxies,
    proxy_strategy="round_robin",  # or "random", "sticky", "geo_match"
) as f:
    resp = await f.fetch("https://example.com")
```
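Under a weighted rotation like the one above, a proxy with `weight=2` should be handed out twice as often as one with `weight=1`. A minimal weighted round-robin sketch, independent of PhantomFetch's actual selector:

```python
from itertools import cycle

def weighted_rotation(proxies: list[tuple[str, int]]):
    """Yield proxy URLs in proportion to their integer weights."""
    # Repeat each URL `weight` times, then cycle forever
    expanded = [url for url, weight in proxies for _ in range(weight)]
    return cycle(expanded)

rotation = weighted_rotation([("http://proxy1:8080", 2), ("http://proxy2:8080", 1)])
picks = [next(rotation) for _ in range(6)]
print(picks)  # proxy1 appears twice as often as proxy2
```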

### Observability (OpenTelemetry)

PhantomFetch is fully instrumented with OpenTelemetry:

```python
from phantomfetch import Fetcher
from phantomfetch.telemetry import configure_telemetry

# Set up OTel with a custom service name
configure_telemetry(service_name="my-scraper")

async with Fetcher() as f:
    await f.fetch("https://example.com")
    # Spans are created and exported automatically
```

Or use standard OpenTelemetry environment variables:
```bash
export OTEL_SERVICE_NAME="my-scraper"
export OTEL_TRACES_EXPORTER="console"
python my_scraper.py
```

## Troubleshooting

### Playwright Installation Issues

If you encounter browser-related errors:
```bash
# Install all browsers
playwright install

# Or just chromium (recommended)
playwright install chromium

# Check the installed CLI version
playwright --version
```

### SSL Certificate Errors

Certificate handling is delegated to the underlying engines, and verification is on by default:

```python
# Note: Only relax certificate checks in development!
async with Fetcher() as f:
    # SSL verification is handled by curl-cffi and Playwright
    # For the curl engine, certificates are validated by default
    resp = await f.fetch("https://self-signed.badssl.com/")
```

### Memory Issues with Caching

If the cache grows too large:
```python
from phantomfetch import FileSystemCache

cache = FileSystemCache(cache_dir=".cache")

# Manually clear expired entries
cache.clear_expired()

# Or just delete the cache directory
import shutil
shutil.rmtree(".cache", ignore_errors=True)
```

### Browser Engine Not Working

Common issues:
1. **Playwright not installed**: Run `playwright install chromium`
2. **Marimo notebook issues**: Browser engines may not work in some notebook environments
3. **Port conflicts**: CDP uses random ports, but firewall rules might block them

Debug with:
```python
import logging

# Enable verbose logging before starting the fetcher
logging.basicConfig(level=logging.DEBUG)

async with Fetcher(browser_engine="cdp") as f:
    resp = await f.fetch("https://example.com", engine="browser")
```

### Rate Limiting / 429 Errors

Use retry configuration and delays:
```python
import asyncio

async with Fetcher(max_retries=5) as f:
    for url in urls:
        resp = await f.fetch(url)
        await asyncio.sleep(1)  # Be nice to servers
```
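If strictly sequential fetching with a fixed sleep is too slow, a bounded-concurrency pattern keeps the request rate polite while still overlapping work. A generic asyncio sketch, where the `fetch_one` stub stands in for a real `f.fetch` call:

```python
import asyncio

async def fetch_one(url: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a real network call
    return f"fetched {url}"

async def fetch_all(urls: list[str], limit: int = 3) -> list[str]:
    sem = asyncio.Semaphore(limit)  # at most `limit` requests in flight

    async def bounded(url: str) -> str:
        async with sem:
            return await fetch_one(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(fetch_all([f"https://example.com/{i}" for i in range(10)]))
print(len(results))
```

Tune `limit` to what the target site tolerates; a semaphore of 1 plus a sleep inside `bounded` degrades gracefully back to the sequential pattern above.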

### Scrapeless Session Recording

When using Scrapeless's CDP endpoint for session recording, PhantomFetch automatically reuses existing browser windows:

```python
async with Fetcher(
    browser_engine="cdp",
    browser_engine_config={
        "cdp_endpoint": "wss://YOUR_SESSION.scrapeless.com/chrome/cdp"
        # use_existing_page=True (default) ensures recording compatibility
    },
) as f:
    # Uses the existing window - Scrapeless records this! ✅
    resp = await f.fetch("https://example.com", engine="browser")
```

**Why this matters**: Scrapeless can only record a single window. By default (`use_existing_page=True`), PhantomFetch detects and reuses the existing browser page in your Scrapeless session instead of creating new windows.

**To disable** (not recommended for recording): Set `use_existing_page=False` in `browser_engine_config`.

See [`examples/scrapeless_cdp_recording.py`](examples/scrapeless_cdp_recording.py) for a complete example.

## Next Steps

Ready to dive deeper? Here's what to explore:

1. **[Examples](examples/)** - See retry configuration and advanced patterns
2. **[CHANGELOG](CHANGELOG.md)** - See what's new
3. **[Contributing Guide](CONTRIBUTING.md)** - Help improve PhantomFetch

## Community & Support

- **🐛 Found a bug?** [Open an issue](https://github.com/iristech-systems/PhantomFetch/issues/new?template=bug_report.md)
- **💡 Have a feature idea?** [Request a feature](https://github.com/iristech-systems/PhantomFetch/issues/new?template=feature_request.md)
- **❓ Questions?** [Start a discussion](https://github.com/iristech-systems/PhantomFetch/discussions)
- **📖 Documentation issues?** [Improve the docs](https://github.com/iristech-systems/PhantomFetch/edit/main/README.md)

## Contributing

We love contributions! PhantomFetch is built by developers, for developers. Whether you're:
- 🐛 Fixing bugs
- ✨ Adding features
- 📝 Improving documentation
- 🧪 Writing tests

Check out our [Contributing Guide](CONTRIBUTING.md) to get started!

### Quick Start for Contributors

```bash
# Clone and setup
git clone https://github.com/iristech-systems/PhantomFetch.git
cd PhantomFetch
uv sync
uv run pre-commit install

# Run tests
uv run pytest

# Make changes and commit
git checkout -b feature/amazing-feature
# ... make changes ...
uv run pre-commit run --all-files
git commit -m "feat: add amazing feature"
```

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

Built on the shoulders of giants:
- [curl-cffi](https://github.com/yifeikong/curl-cffi) - Amazing curl bindings with anti-detection
- [Playwright](https://playwright.dev) - Best-in-class browser automation
- [msgspec](https://jcristharif.com/msgspec/) - Fast serialization
- [OpenTelemetry](https://opentelemetry.io/) - Observability standard

Special thanks to all [contributors](https://github.com/iristech-systems/PhantomFetch/graphs/contributors) who help make PhantomFetch better!

---

<div align="center">

**Made with ❤️ for the web scraping community**

[⭐ Star us on GitHub](https://github.com/iristech-systems/PhantomFetch) • [📦 Install from PyPI](https://pypi.org/project/phantomfetch/)

</div>