phantomfetch 0.1.2 (tar.gz)

Metadata-Version: 2.3
Name: phantomfetch
Version: 0.1.2
Summary: High-performance agentic web scraping library combining curl-cffi speed with Playwright browser capabilities
Keywords: web-scraping,playwright,curl-cffi,async,browser-automation,http-client,agentic,anti-detection
Author: CosmicBull
Author-email: CosmicBull <cosmicbull@example.com>
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Framework :: AsyncIO
Requires-Dist: curl-cffi>=0.13.0
Requires-Dist: msgspec>=0.20.0
Requires-Dist: playwright>=1.56.0
Requires-Dist: rusticsoup>=0.3.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: opentelemetry-api>=1.38.0
Requires-Dist: loguru>=0.7.3
Requires-Python: >=3.13
Project-URL: Changelog, https://github.com/iristech-systems/PhantomFetch/blob/main/CHANGELOG.md
Project-URL: Documentation, https://github.com/iristech-systems/PhantomFetch#readme
Project-URL: Homepage, https://github.com/iristech-systems/PhantomFetch
Project-URL: Issues, https://github.com/iristech-systems/PhantomFetch/issues
Project-URL: Repository, https://github.com/iristech-systems/PhantomFetch
Description-Content-Type: text/markdown

# PhantomFetch

[![Python 3.13+](https://img.shields.io/badge/python-3.13+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

**PhantomFetch** is a high-performance, agentic web scraping library for Python. It combines the speed of `curl-cffi` with the browser capabilities of `Playwright` behind a single, unified API.

## Why PhantomFetch?

Most web scraping forces a choice between speed (httpx, requests) and browser capabilities (Playwright, Selenium). PhantomFetch gives you **both** with a unified interface:

| Feature | PhantomFetch | requests/httpx | Playwright/Selenium |
|---------|--------------|----------------|---------------------|
| **Speed** | ⚡ Fast (curl-cffi) | ⚡ Fast | 🐌 Slow |
| **JavaScript Support** | ✅ Yes (Playwright) | ❌ No | ✅ Yes |
| **Anti-Detection** | ✅ Built-in | ❌ No | ⚠️ Manual |
| **Smart Caching** | ✅ Configurable | ❌ No | ❌ No |
| **Proxy Rotation** | ✅ Automatic | ⚠️ Manual | ⚠️ Manual |
| **Async-First** | ✅ Yes | ⚠️ Partial | ✅ Yes |
| **Unified API** | ✅ One interface | N/A | N/A |
| **OpenTelemetry** | ✅ Built-in | ❌ No | ❌ No |

**Key Benefits:**
- 🎯 **Start Fast, Scale Smart**: Use curl for quick requests, switch to the browser when needed
- 🧠 **Intelligent**: Automatic retry logic, exponential backoff, fingerprint rotation
- 🚀 **Production-Ready**: Built-in observability, caching, and error handling
- 🛠️ **Developer-Friendly**: Intuitive API, comprehensive type hints, rich documentation

## Features

- 🚀 **Unified API**: Switch between `curl` (fast, lightweight) and `browser` (JavaScript-capable) engines with a single parameter (see the sketch below)
- 🧠 **Smart Caching**: Configurable caching strategies (`all`, `resources`, `conservative`) to speed up development and save bandwidth
- 🤖 **Agentic Actions**: Define browser interactions (click, scroll, input, wait) declaratively
- 🛡️ **Anti-Detection**: Built-in support for proxy rotation and fingerprinting protection (via `curl-cffi`)
- ⚡ **Async First**: Built on `asyncio` for high concurrency
- 🔄 **Smart Retries**: Configurable retry logic with exponential backoff
- 🍪 **Cookie Management**: Automatic cookie handling across engines
- 📊 **Observability**: OpenTelemetry integration out of the box

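Because both engines sit behind the same `fetch` call, escalating from plain HTTP to a full browser is a one-argument change. A minimal sketch using the `engine` parameter from the examples below (the URLs are placeholders):

```python
from phantomfetch import Fetcher

async with Fetcher(browser_engine="cdp") as f:
    # Fast path: the default curl-cffi engine
    listing = await f.fetch("https://example.com/catalog")

    # Same client, JavaScript-capable path: the Playwright browser engine
    app_page = await f.fetch("https://example.com/app", engine="browser")
```
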
## Installation

```bash
pip install phantomfetch
# or with uv (recommended)
uv pip install phantomfetch
```

After installation, install Playwright browsers:
```bash
playwright install chromium
```

## Quick Start

### Basic Fetch (Curl Engine)

```python
import asyncio
from phantomfetch import Fetcher

async def main():
    async with Fetcher() as f:
        response = await f.fetch("https://httpbin.org/get")
        print(response.json())

if __name__ == "__main__":
    asyncio.run(main())
```
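
The examples in this README rely on a small response surface: `resp.json()`, `resp.text`, and `resp.cookies` all appear in later sections. A quick sketch of inspecting a response using only those attributes (not an exhaustive API):

```python
from phantomfetch import Fetcher

async with Fetcher() as f:
    resp = await f.fetch("https://httpbin.org/get")

    data = resp.json()      # parsed JSON body
    body = resp.text        # raw body as text
    cookies = resp.cookies  # cookies collected for this response
```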

### Browser Fetch with Caching

Use the `resources` strategy to cache static assets (images, CSS, scripts) while keeping the main page fresh.

```python
from phantomfetch import Fetcher, FileSystemCache

async def main():
    # Cache sub-resources to speed up subsequent fetches
    cache = FileSystemCache(strategy="resources")

    async with Fetcher(browser_engine="cdp", cache=cache) as f:
        # First run: downloads everything
        resp = await f.fetch("https://example.com", engine="browser")

        # Second run: uses cached resources, only fetches the main HTML
        resp = await f.fetch("https://example.com", engine="browser")
        print(resp.text)
```

### Browser Actions

Perform interactions like clicking, scrolling, and taking screenshots:

```python
from phantomfetch import Fetcher

actions = [
    {"action": "wait", "selector": "#search-input"},
    {"action": "input", "selector": "#search-input", "value": "phantomfetch"},
    {"action": "click", "selector": "#search-button"},
    {"action": "wait_for_load"},
    {"action": "screenshot", "value": "search_results.png"},
]

async with Fetcher(browser_engine="cdp") as f:
    resp = await f.fetch("https://example.com", actions=actions, engine="browser")
    # Screenshot saved to search_results.png
```

### Advanced: Retry Configuration

Fine-tune retry behavior per request:

```python
from phantomfetch import Fetcher

async with Fetcher() as f:
    # Custom retry logic for flaky endpoints
    resp = await f.fetch(
        "https://api.example.com/data",
        max_retries=5,  # Override default retries
        timeout=60.0,   # Longer timeout for slow APIs
    )
```

### Cookie Handling

Pass cookies to any engine and retrieve them from the response:

```python
from phantomfetch import Fetcher

async with Fetcher() as f:
    # Set cookies
    resp = await f.fetch(
        "https://httpbin.org/cookies",
        cookies={"session_id": "secret_token"},
    )
    print(resp.json())

    # Get cookies (including from redirects)
    resp = await f.fetch("https://httpbin.org/cookies/set/foo/bar")
    for cookie in resp.cookies:
        print(f"{cookie.name}: {cookie.value}")
```

## Configuration

### Caching Strategies

- **`all`**: Caches everything, including the main document. Good for offline development.
- **`resources`** (default): Caches sub-resources (images, styles, scripts) but fetches the main document fresh. Best for scraping dynamic sites.
- **`conservative`**: Caches only heavy static assets like images and fonts.

Example:

```python
from phantomfetch import FileSystemCache, Fetcher

cache = FileSystemCache(
    cache_dir=".cache",
    strategy="resources",
)

async with Fetcher(cache=cache) as f:
    # Resources will be cached automatically
    resp = await f.fetch("https://example.com", engine="browser")
```

### Proxy Rotation

Multiple proxy strategies are available:

```python
from phantomfetch import Fetcher, Proxy

proxies = [
    Proxy(url="http://user:pass@proxy1:8080", location="US", weight=2),
    Proxy(url="http://user:pass@proxy2:8080", location="EU", weight=1),
]

async with Fetcher(
    proxies=proxies,
    proxy_strategy="round_robin",  # or "random", "sticky", "geo_match"
) as f:
    resp = await f.fetch("https://example.com")
```

### Observability (OpenTelemetry)

PhantomFetch is fully instrumented with OpenTelemetry:

```python
from phantomfetch import Fetcher
from phantomfetch.telemetry import configure_telemetry

# Set up OTel with a custom service name
configure_telemetry(service_name="my-scraper")

async with Fetcher() as f:
    await f.fetch("https://example.com")
    # Spans are created and exported automatically
```

Or use standard OpenTelemetry environment variables:
```bash
export OTEL_SERVICE_NAME="my-scraper"
export OTEL_TRACES_EXPORTER="console"
python my_scraper.py
```
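
If you would rather wire up tracing by hand, the standard OpenTelemetry SDK setup also works. A sketch using only stock OpenTelemetry APIs; note that the SDK ships separately (`pip install opentelemetry-sdk`), since PhantomFetch itself only depends on `opentelemetry-api`:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print every span to stdout; swap in an OTLP exporter for production
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```

Instrumentation built on `opentelemetry-api` records against the global tracer provider, so `Fetcher` spans should show up on the console exporter.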

## Troubleshooting

### Playwright Installation Issues

If you encounter browser-related errors:
```bash
# Install all browsers
playwright install

# Or just chromium (recommended)
playwright install chromium

# Show the installer's options
playwright install --help
```

### SSL Certificate Errors

Certificate verification is enabled by default in both engines (curl-cffi for `curl`, Playwright for `browser`), so hosts with self-signed or otherwise invalid certificates will fail:
```python
from phantomfetch import Fetcher

async with Fetcher() as f:
    # This host uses a self-signed certificate, so the fetch is
    # expected to fail verification with the default settings
    resp = await f.fetch("https://self-signed.badssl.com/")
```

If you control the host, trust its certificate at the OS level for development rather than weakening verification.

### Memory Issues with Caching

If the cache grows too large:
```python
import shutil

from phantomfetch import FileSystemCache

cache = FileSystemCache(cache_dir=".cache")

# Manually clear expired entries
cache.clear_expired()

# Or just delete the cache directory
shutil.rmtree(".cache", ignore_errors=True)
```

### Browser Engine Not Working

Common issues:
1. **Playwright not installed**: Run `playwright install chromium`
2. **Marimo notebook issues**: Browser engines may not work in some notebook environments
3. **Port conflicts**: CDP uses random ports, but firewall rules might block them

Debug with:
```python
import logging

from phantomfetch import Fetcher

# Enable verbose logging before creating the Fetcher
logging.basicConfig(level=logging.DEBUG)

async with Fetcher(browser_engine="cdp") as f:
    resp = await f.fetch("https://example.com", engine="browser")
```

### Rate Limiting / 429 Errors

Use retry configuration and delays between requests:
```python
import asyncio

from phantomfetch import Fetcher

urls = ["https://example.com/page1", "https://example.com/page2"]  # your targets

async with Fetcher(max_retries=5) as f:
    for url in urls:
        resp = await f.fetch(url)
        await asyncio.sleep(1)  # Be nice to servers
```
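
If sequential fetching with sleeps is too slow, bound the concurrency instead of dropping the delays. A sketch using a plain `asyncio.Semaphore`; it assumes a single `Fetcher` can serve concurrent `fetch` calls, in line with the library's async-first design:

```python
import asyncio

from phantomfetch import Fetcher

async def fetch_all(urls: list[str], limit: int = 5):
    # At most `limit` requests in flight at any moment
    sem = asyncio.Semaphore(limit)

    async with Fetcher(max_retries=5) as f:
        async def fetch_one(url: str):
            async with sem:
                return await f.fetch(url)

        return await asyncio.gather(*(fetch_one(u) for u in urls))
```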

### Scrapeless Session Recording

When using Scrapeless's CDP endpoint for session recording, PhantomFetch automatically reuses existing browser windows:

```python
from phantomfetch import Fetcher

async with Fetcher(
    browser_engine="cdp",
    browser_engine_config={
        "cdp_endpoint": "wss://YOUR_SESSION.scrapeless.com/chrome/cdp"
        # use_existing_page=True (default) ensures recording compatibility
    }
) as f:
    # Uses the existing window - Scrapeless records this! ✓
    resp = await f.fetch("https://example.com", engine="browser")
```

**Why this matters**: Scrapeless can only record a single window. By default (`use_existing_page=True`), PhantomFetch detects and reuses the existing browser page in your Scrapeless session instead of creating new windows.

**To disable** (not recommended for recording): Set `use_existing_page=False` in `browser_engine_config`.

See [`examples/scrapeless_cdp_recording.py`](examples/scrapeless_cdp_recording.py) for a complete example.

## Next Steps

Ready to dive deeper? Here's what to explore:

1. **[Examples](examples/)** - See retry configuration and advanced patterns
2. **[CHANGELOG](CHANGELOG.md)** - See what's new
3. **[Contributing Guide](CONTRIBUTING.md)** - Help improve PhantomFetch

## Community & Support

- 🐛 **Found a bug?** [Open an issue](https://github.com/iristech-systems/PhantomFetch/issues/new?template=bug_report.md)
- 💡 **Have a feature idea?** [Request a feature](https://github.com/iristech-systems/PhantomFetch/issues/new?template=feature_request.md)
- ❓ **Questions?** [Start a discussion](https://github.com/iristech-systems/PhantomFetch/discussions)
- 📖 **Documentation issues?** [Improve the docs](https://github.com/iristech-systems/PhantomFetch/edit/main/README.md)

## Contributing

We love contributions! PhantomFetch is built by developers, for developers. Whether you're:
- 🐛 Fixing bugs
- ✨ Adding features
- 📝 Improving documentation
- 🧪 Writing tests

every contribution counts. Check out our [Contributing Guide](CONTRIBUTING.md) to get started!

### Quick Start for Contributors

```bash
# Clone and setup
git clone https://github.com/iristech-systems/PhantomFetch.git
cd phantomfetch
uv sync
uv run pre-commit install

# Run tests
uv run pytest

# Make changes and commit
git checkout -b feature/amazing-feature
# ... make changes ...
uv run pre-commit run --all-files
git commit -m "feat: add amazing feature"
```

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

Built on the shoulders of giants:
- [curl-cffi](https://github.com/yifeikong/curl-cffi) - Amazing curl bindings with anti-detection
- [Playwright](https://playwright.dev) - Best-in-class browser automation
- [msgspec](https://jcristharif.com/msgspec/) - Fast serialization
- [OpenTelemetry](https://opentelemetry.io/) - Observability standard

Special thanks to all [contributors](https://github.com/iristech-systems/PhantomFetch/graphs/contributors) who help make PhantomFetch better!

---

<div align="center">

**Made with ❤️ for the web scraping community**

[⭐ Star us on GitHub](https://github.com/iristech-systems/PhantomFetch) • [📦 Install from PyPI](https://pypi.org/project/phantomfetch/)

</div>