ghostfetch 1.0.0__tar.gz
- ghostfetch-1.0.0/LICENSE +21 -0
- ghostfetch-1.0.0/PKG-INFO +661 -0
- ghostfetch-1.0.0/README.md +623 -0
- ghostfetch-1.0.0/ghostfetch/__init__.py +17 -0
- ghostfetch-1.0.0/ghostfetch/cli.py +198 -0
- ghostfetch-1.0.0/ghostfetch/client.py +211 -0
- ghostfetch-1.0.0/ghostfetch/fetch.py +103 -0
- ghostfetch-1.0.0/ghostfetch/mcp_server.py +255 -0
- ghostfetch-1.0.0/ghostfetch.egg-info/PKG-INFO +661 -0
- ghostfetch-1.0.0/ghostfetch.egg-info/SOURCES.txt +23 -0
- ghostfetch-1.0.0/ghostfetch.egg-info/dependency_links.txt +1 -0
- ghostfetch-1.0.0/ghostfetch.egg-info/entry_points.txt +2 -0
- ghostfetch-1.0.0/ghostfetch.egg-info/requires.txt +12 -0
- ghostfetch-1.0.0/ghostfetch.egg-info/top_level.txt +2 -0
- ghostfetch-1.0.0/pyproject.toml +60 -0
- ghostfetch-1.0.0/setup.cfg +4 -0
- ghostfetch-1.0.0/src/__init__.py +0 -0
- ghostfetch-1.0.0/src/api/__init__.py +0 -0
- ghostfetch-1.0.0/src/core/__init__.py +0 -0
- ghostfetch-1.0.0/src/core/job_manager.py +237 -0
- ghostfetch-1.0.0/src/core/scraper.py +316 -0
- ghostfetch-1.0.0/src/core/stealth_utils.py +219 -0
- ghostfetch-1.0.0/src/sdk/client.py +84 -0
- ghostfetch-1.0.0/src/utils/__init__.py +0 -0
- ghostfetch-1.0.0/src/utils/config.py +38 -0
ghostfetch-1.0.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Arsalan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
ghostfetch-1.0.0/PKG-INFO
ADDED
@@ -0,0 +1,661 @@
Metadata-Version: 2.4
Name: ghostfetch
Version: 1.0.0
Summary: A stealthy headless browser service for AI agents. Bypasses anti-bot protections to fetch content and convert to clean Markdown.
Author-email: Arsalan Shah <iarsalanshah@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/iArsalanshah/GhostFetch
Project-URL: Documentation, https://github.com/iArsalanshah/GhostFetch#readme
Project-URL: Repository, https://github.com/iArsalanshah/GhostFetch
Project-URL: Issues, https://github.com/iArsalanshah/GhostFetch/issues
Keywords: ai-agents,web-scraping,stealth,playwright,markdown,llm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: playwright>=1.40.0
Requires-Dist: fastapi>=0.100.0
Requires-Dist: uvicorn>=0.23.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: html2text>=2020.1.16
Requires-Dist: requests>=2.31.0
Requires-Dist: prometheus-client>=0.17.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: httpx>=0.24.0; extra == "dev"
Dynamic: license-file

# GhostFetch

A stealthy headless browser service for AI agents. Bypasses anti-bot protections to fetch content from sites like X.com and converts it to clean Markdown.

## Features

- **Zero Setup**: Install with pip; browsers auto-install on first run
- **Synchronous API**: A single request returns content directly (no polling needed)
- **Ghost Protocol**: Advanced proxy rotation and cohesive browser fingerprinting
- **Stealth Browsing**: Uses Playwright with custom flags and canvas noise injection
- **Markdown Output**: Automatically converts HTML to Markdown for easy LLM consumption
- **Metadata Extraction**: Automatically extracts title, author, publish date, and images
- **X.com Support**: Logic to wait for dynamic content on Twitter/X
- **Async Job Queue**: Process multiple requests concurrently with intelligent retry
- **Persistent Sessions**: Cookie/localStorage persistence per domain
- **Webhook Callbacks**: Get notified via HTTP when jobs complete
- **GitHub Integration**: Post results directly to GitHub issues
- **Dual Mode**: CLI tool or REST API service
- **Docker Ready**: Pre-configured Docker setup with docker-compose

## Quick Start

### For AI Agents (Simplest)

```bash
# Install from source
pip install -e .

# Fetch any URL (auto-installs browsers on first run)
ghostfetch "https://x.com/user/status/123"

# Or use the Python SDK
python -c "from ghostfetch import fetch; print(fetch('https://example.com')['markdown'])"
```

### For API Usage

```bash
# Start the server
ghostfetch serve

# Fetch synchronously (blocks until done)
curl "http://localhost:8000/fetch/sync?url=https://example.com"
```

## Installation

### Option 1: Docker Hub (Fastest)

```bash
# Pull and run
docker run -p 8000:8000 iarsalanshah/ghostfetch

# Or with docker-compose
docker-compose up
```

### Option 2: pip install

```bash
# From PyPI (when published)
pip install ghostfetch

# Or from source
git clone https://github.com/iArsalanshah/GhostFetch.git
cd GhostFetch
pip install -e .

# Browsers install automatically on first use, or run:
ghostfetch setup
```

### Option 3: Manual Setup

```bash
cd GhostFetch

# Create virtual environment (optional)
python3 -m venv venv
source venv/bin/activate

# Install packages & browser
pip install -r requirements.txt
playwright install chromium
```

## Usage

### 1. CLI Mode (Zero Setup)

Using the `ghostfetch` CLI (after pip install):

```bash
# Basic fetch
ghostfetch "https://x.com/user/status/123"

# JSON output (for parsing)
ghostfetch "https://example.com" --json

# Metadata only
ghostfetch "https://example.com" --metadata-only

# Quiet mode (no progress messages)
ghostfetch "https://example.com" --quiet
```

Using the legacy module directly:

```bash
python -m src.core.scraper "https://x.com/user/status/123"
```

Output:

```
--- Metadata ---
{
  "title": "...",
  "author": "...",
  "publish_date": "...",
  "images": [...]
}

--- Markdown ---
[converted markdown content]
```

### 2. API Mode (Service for Agents)

Start the server:

```bash
# Using CLI
ghostfetch serve

# Or directly
python main.py
```

The server will start at `http://localhost:8000`.

## API Endpoints

### Synchronous Fetch (Recommended for AI Agents)

- **POST** `/fetch/sync` — blocks until content is ready
- **GET** `/fetch/sync?url=...` — same, but via query parameter

**Example (POST):**

```bash
curl -X POST "http://localhost:8000/fetch/sync" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "timeout": 60}'
```

**Example (GET):**

```bash
curl "http://localhost:8000/fetch/sync?url=https://example.com"
```

**Response:**

```json
{
  "metadata": {
    "title": "Example Domain",
    "author": "",
    "publish_date": "",
    "images": []
  },
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples..."
}
```
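
From Python, the GET form of the sync endpoint can be called with the standard library alone. A minimal sketch — the `/fetch/sync` path, `url`/`timeout` parameters, and response shape come from the examples above; the helper names are illustrative:

```python
import json
import urllib.parse
import urllib.request


def sync_fetch_url(base, target, timeout=None):
    """Build the GET /fetch/sync request URL for a target page."""
    params = {"url": target}
    if timeout is not None:
        params["timeout"] = str(timeout)
    return base.rstrip("/") + "/fetch/sync?" + urllib.parse.urlencode(params)


def sync_fetch(base, target, timeout=None):
    """Call the sync endpoint and return the parsed {metadata, markdown} dict."""
    with urllib.request.urlopen(sync_fetch_url(base, target, timeout)) as resp:
        return json.load(resp)


# URL construction alone needs no running server:
print(sync_fetch_url("http://localhost:8000", "https://example.com", 60))
```

Note that the target URL must be percent-encoded when passed as a query parameter, which `urlencode` handles for you.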

### Async Fetch (For Background Processing)

- **POST** `/fetch` (returns `202 Accepted`)
- **Body**:

```json
{
  "url": "https://example.com",
  "callback_url": "https://your-server.com/webhook",
  "github_issue": 123
}
```

**Example:**

```bash
curl -X POST "http://localhost:8000/fetch" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://x.com/user/status/123"}'
```

**Response:**

```json
{
  "job_id": "a1b2c3d4-e5f6-7890",
  "url": "https://x.com/user/status/123",
  "status": "queued"
}
```

### Check Job Status

- **GET** `/job/{job_id}`

**Example:**

```bash
curl "http://localhost:8000/job/a1b2c3d4-e5f6-7890"
```

**Response (Completed):**

```json
{
  "id": "a1b2c3d4-e5f6-7890",
  "url": "https://x.com/mrnacknack/status/2016134416897360212",
  "status": "completed",
  "result": {
    "metadata": {
      "title": "...",
      "author": "...",
      "publish_date": "...",
      "images": [...]
    },
    "markdown": "..."
  },
  "created_at": 1706000000,
  "started_at": 1706000001,
  "completed_at": 1706000010
}
```
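
A poller only needs to branch on the `status` field of this record. A small helper, sketched from the fields shown above (`status`, `result.markdown`, and the `error` field used for failed jobs); the function name is illustrative:

```python
def interpret_job(job):
    """Map a /job/{id} record to (done, markdown).

    Returns (False, None) while the job is still queued or processing,
    (True, markdown) on success, and raises on failure.
    """
    status = job.get("status")
    if status == "completed":
        return True, job["result"]["markdown"]
    if status == "failed":
        raise RuntimeError("GhostFetch job failed: %s" % job.get("error"))
    return False, None  # "queued" or "processing": keep polling


done, md = interpret_job({"status": "completed", "result": {"markdown": "# Hi"}})
```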

### Health Check

- **GET** `/health`

**Response:**

```json
{
  "status": "ok",
  "browser_connected": true,
  "active_jobs_queue": 2,
  "active_browser_contexts": 1,
  "concurrency_limit": 2
}
```

## Integration Examples

### Python Agent with Job Polling

```python
import requests
import time

def fetch_content_async(url):
    # Submit job
    response = requests.post(
        "http://localhost:8000/fetch",
        json={"url": url}
    )
    job_id = response.json()["job_id"]

    # Poll until completed
    while True:
        job_response = requests.get(f"http://localhost:8000/job/{job_id}")
        job = job_response.json()

        if job["status"] == "completed":
            return job["result"]["markdown"]
        elif job["status"] == "failed":
            raise Exception(f"Job failed: {job['error']}")

        time.sleep(1)  # Poll every second
```

### Using Webhook Callbacks

```python
import requests

# Your webhook endpoint receives:
# POST to callback_url with:
# {
#   "job_id": "...",
#   "url": "...",
#   "status": "completed",
#   "data": {"metadata": {...}, "markdown": "..."},
#   "error": null,
#   "error_details": null
# }

requests.post(
    "http://localhost:8000/fetch",
    json={
        "url": "https://example.com",
        "callback_url": "https://your-server.com/webhooks/ghostfetch"
    }
)
```
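
On the receiving side, whatever framework you use, the handler boils down to validating that payload and branching on `status`. A framework-free sketch — the field names follow the payload shape documented in the comment above, while the helper itself is illustrative, not part of GhostFetch:

```python
def parse_callback(payload):
    """Validate a GhostFetch webhook payload and extract the useful parts."""
    missing = [k for k in ("job_id", "url", "status") if k not in payload]
    if missing:
        raise ValueError("malformed callback, missing: %s" % ", ".join(missing))
    if payload["status"] == "completed":
        data = payload.get("data") or {}
        return {"ok": True,
                "markdown": data.get("markdown", ""),
                "metadata": data.get("metadata", {})}
    # Failed jobs carry the reason in "error" (with detail in "error_details").
    return {"ok": False, "error": payload.get("error")}
```

Plug this into the request handler of your web framework of choice and respond 200 quickly, doing any heavy processing out of band.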

### GitHub Integration

When you include a `github_issue` parameter, GhostFetch will post results as comments:

```python
requests.post(
    "http://localhost:8000/fetch",
    json={
        "url": "https://example.com",
        "github_issue": 42  # Post result as a comment on issue #42
    }
)
```

**Requires:**
- GitHub CLI (`gh` command) installed
- `GITHUB_TOKEN` environment variable set
- `GITHUB_REPO` configured

## Integration with AI Agents

Your agent can submit a fetch job and poll for results:

```python
import requests
import time

def fetch_blocked_content(url):
    response = requests.post(
        "http://localhost:8000/fetch",
        json={"url": url}
    )
    job_id = response.json()["job_id"]

    # Poll for completion
    max_retries = 60
    for _ in range(max_retries):
        result = requests.get(f"http://localhost:8000/job/{job_id}").json()
        if result["status"] == "completed":
            return result["result"]["markdown"]
        elif result["status"] == "failed":
            return f"Error: {result['error']}"
        time.sleep(1)

    return "Timeout waiting for result"
```

## Configuration

GhostFetch is configured via environment variables (see `src/utils/config.py`) or the `proxies.txt` file.

- **Proxies**: Add one proxy per line to `proxies.txt` in the format `http://user:pass@host:port`.
- **Strategy**: Set `PROXY_STRATEGY` to `round_robin` or `random`.
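
Parsing that `proxies.txt` line format takes only the standard library. A sketch, assuming the URL format from the bullet above and emitting dicts in the `server`/`username`/`password` shape that Playwright accepts for its proxy option (the function names are illustrative, not GhostFetch internals):

```python
import itertools
from urllib.parse import urlsplit


def parse_proxy(line):
    """Split one proxies.txt line (http://user:pass@host:port) into parts."""
    parts = urlsplit(line.strip())
    proxy = {"server": "%s://%s:%s" % (parts.scheme, parts.hostname, parts.port)}
    if parts.username:  # credentials are optional
        proxy["username"] = parts.username
        proxy["password"] = parts.password or ""
    return proxy


def round_robin(lines):
    """Yield parsed proxies forever, as PROXY_STRATEGY=round_robin would."""
    for line in itertools.cycle(lines):
        yield parse_proxy(line)
```

A `random` strategy would simply replace `itertools.cycle` with `random.choice` over the parsed list.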

### Environment Variables

```bash
# API Server
GHOSTFETCH_HOST=0.0.0.0
GHOSTFETCH_PORT=8000

# Scraper Settings
MAX_CONCURRENT_BROWSERS=2    # Number of concurrent browser contexts
MIN_DOMAIN_DELAY=10          # Minimum seconds between requests to the same domain
MAX_REQUESTS_PER_BROWSER=50  # Restart browser after N requests
MAX_RETRIES=3                # Retry attempts for failed requests

# Sync Endpoint Settings
SYNC_TIMEOUT_DEFAULT=120     # Default timeout for /fetch/sync (seconds)
MAX_SYNC_TIMEOUT=300         # Maximum allowed timeout (5 minutes)

# GitHub Integration
GITHUB_REPO=iArsalanshah/GhostFetch  # Owner/repo for issue comments

# Persistence
DATABASE_URL=sqlite:///./storage/jobs.db
STORAGE_DIR=storage

# Job Lifecycle
JOB_TTL_SECONDS=86400        # Delete completed jobs after 24 hours
```
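
A loader along the lines of `src/utils/config.py` would read these with defaults matching the values documented above. The sketch below is illustrative (the helper and the clamping rule are assumptions, not GhostFetch's exact code):

```python
import os


def env_int(name, default):
    """Read an integer setting from the environment, falling back to default."""
    raw = os.getenv(name)
    return default if raw is None else int(raw)


# Defaults mirror the documented values above.
MAX_CONCURRENT_BROWSERS = env_int("MAX_CONCURRENT_BROWSERS", 2)
MIN_DOMAIN_DELAY = env_int("MIN_DOMAIN_DELAY", 10)
SYNC_TIMEOUT_DEFAULT = env_int("SYNC_TIMEOUT_DEFAULT", 120)
MAX_SYNC_TIMEOUT = env_int("MAX_SYNC_TIMEOUT", 300)


def effective_sync_timeout(requested=None):
    """Clamp a caller-supplied /fetch/sync timeout to the allowed maximum."""
    return min(requested or SYNC_TIMEOUT_DEFAULT, MAX_SYNC_TIMEOUT)
```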

### Docker Environment

Create a `.env` file for docker-compose:

```bash
MAX_CONCURRENT_BROWSERS=2
MIN_DOMAIN_DELAY=10
GITHUB_REPO=your-org/your-repo
JOB_TTL_SECONDS=86400
```

Then run:

```bash
docker-compose --env-file .env up
```

## Specific Handling

- **X/Twitter**: The scraper waits for `[data-testid="tweetText"]` to ensure the tweet content is loaded before capturing.

## ⚠️ Important: Rate Limiting & Ethics

This tool bypasses anti-bot protections. **Use responsibly:**

- **Respect robots.txt** - Check site policies before scraping
- **Implement delays** - Use `MIN_DOMAIN_DELAY` (default: 10 seconds) to avoid overloading servers
- **Throttle requests** - Reduce `MAX_CONCURRENT_BROWSERS` for high-volume scraping
- **Terms of Service** - Ensure your use complies with the target site's ToS
- **Authentication** - When possible, use authorized access instead of bypassing protections

### Recommended Settings for Production

```bash
# Conservative (respectful scraping)
MIN_DOMAIN_DELAY=30
MAX_CONCURRENT_BROWSERS=1

# Moderate
MIN_DOMAIN_DELAY=15
MAX_CONCURRENT_BROWSERS=2

# Aggressive (only for your own content)
MIN_DOMAIN_DELAY=5
MAX_CONCURRENT_BROWSERS=4
```
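
The `MIN_DOMAIN_DELAY` gate amounts to tracking the last request time per domain and computing how long to sleep before the next one. A deterministic sketch of that bookkeeping (the class and method names are illustrative, not GhostFetch internals):

```python
from urllib.parse import urlsplit


class DomainThrottle:
    """Compute per-domain wait times so requests stay min_delay seconds apart."""

    def __init__(self, min_delay=10):
        self.min_delay = min_delay
        self.last_seen = {}  # domain -> timestamp of the most recent fetch

    def wait_needed(self, url, now):
        """Seconds to sleep before fetching url at time `now`; records the fetch."""
        domain = urlsplit(url).hostname
        wait = 0.0
        if domain in self.last_seen:
            wait = max(0.0, self.min_delay - (now - self.last_seen[domain]))
        self.last_seen[domain] = now + wait  # the fetch happens after the wait
        return wait


t = DomainThrottle(min_delay=10)
t.wait_needed("https://example.com/a", now=0)  # first hit on a domain: no wait
t.wait_needed("https://example.com/b", now=4)  # 6 seconds early: wait 6
```

Passing `now` explicitly keeps the logic testable; production code would call `time.monotonic()` and `time.sleep(wait)`.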

## Production Deployment Guide

### 1. Proxy Support (Recommended for High-Volume)

For serious stealth, rotate through residential proxies:

```python
# Configure proxies.txt with your proxy list.
# GhostFetch will automatically rotate and track health.
```

**Recommended proxy providers:**
- BrightData (datacenter/residential)
- ScrapingBee (cloud-based)
- Oxylabs (residential networks)
- Local proxy rotation with tools like `scrapy-proxy-pool`

### 2. Caching Layer (Reduce Redundant Requests)

For repeated fetches, implement Redis caching:

```python
import json

import redis

cache = redis.Redis(host='localhost', port=6379)

async def fetch_with_cache(url, ttl=3600):
    cached = cache.get(url)
    if cached:
        return json.loads(cached)

    # `scraper` is your running GhostFetch scraper instance
    result = await scraper.fetch(url)
    cache.setex(url, ttl, json.dumps(result))
    return result
```

**Docker Compose with Redis:**

```yaml
services:
  ghostfetch:
    build: .
    ports:
      - "8000:8000"
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
```

### 3. Security & Authentication

Add API key authentication before exposing the service publicly:

```python
import os

from fastapi import Header, HTTPException

VALID_API_KEYS = set(os.getenv("API_KEYS", "").split(","))

@app.post("/fetch")
async def fetch_endpoint(request: FetchRequest, x_api_key: str = Header(None)):
    if not x_api_key or x_api_key not in VALID_API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    # ... rest of endpoint
```

Usage:

```bash
curl -X POST "http://localhost:8000/fetch" \
  -H "x-api-key: your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

### 4. Monitoring & Observability

**Log rotation** (automatically configured):
- Logs stored in `storage/scraper.log`
- Max 5MB per file, keeps 5 backups
- Check for errors: `tail -f storage/scraper.log | grep ERROR`

**Database queries for analytics:**

```bash
sqlite3 storage/jobs.db "SELECT status, COUNT(*) FROM jobs GROUP BY status;"
```

**Health check monitoring:**

```bash
while true; do
  curl http://localhost:8000/health | jq .
  sleep 30
done
```

### 5. Model Context Protocol (MCP)

GhostFetch includes an MCP server for integration with Claude Desktop and other MCP-aware agents.

Configuration (`claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "ghostfetch": {
      "command": "python",
      "args": ["-m", "ghostfetch.mcp_server"],
      "env": {
        "SYNC_TIMEOUT_DEFAULT": "120"
      }
    }
  }
}
```

This exposes a `ghostfetch` tool to the agent:
- `url`: The URL to fetch
- `context_id`: Optional session ID
- `timeout`: Optional timeout (seconds)

## Performance & Monitoring

### Logging

Logs are written to `storage/scraper.log` with rotation (5MB max):
- Stream output to console (INFO level)
- File output with detailed format

### Load Testing

Run the included load tests:

```bash
# Python async load test
python scripts/load_test.py
```

### Database

Job history is stored in `storage/jobs.db` (SQLite):
- Persistent across restarts
- Automatic cleanup of old jobs (configurable TTL)
- Query jobs directly for analytics/debugging

## Troubleshooting

**Playwright Error: Executable doesn't exist**

If you see an error about the browser executable not being found, run:

```bash
playwright install chromium
```

**Timeout Errors**

If fetching times out, the cause is usually a slow network or heavy anti-bot protections. Try:
- Increasing the timeout in `src/core/scraper.py` (default: 60000 ms)
- Increasing `MIN_DOMAIN_DELAY` to avoid rate limiting

**Job Stuck in "Processing"**

Check `storage/scraper.log` for errors. If the job remains stuck, restart the service.

**GitHub Comments Not Posting**

Ensure:
- The `gh` CLI is installed: `brew install gh` (macOS) or `apt install gh` (Linux)
- You're authenticated: `gh auth login`
- `GITHUB_REPO` is set correctly
- `GITHUB_TOKEN` is in your environment

**High Memory Usage**

Reduce `MAX_CONCURRENT_BROWSERS` or `MAX_REQUESTS_PER_BROWSER` in the configuration.

## Publishing Setup

### Docker Hub

To enable automated Docker image publishing:

1. Create a Docker Hub account and repository (`your-username/ghostfetch`)
2. Generate an access token at https://hub.docker.com/settings/security
3. Add these secrets to your GitHub repository:
   - `DOCKERHUB_USERNAME`: Your Docker Hub username
   - `DOCKERHUB_TOKEN`: Your access token

Images are published automatically on pushes to `main` and on version tags.

### PyPI (Trusted Publishing)

To enable automated PyPI publishing:

1. Go to https://pypi.org/manage/account/publishing/
2. Add a new pending publisher:
   - **PyPI Project Name**: `ghostfetch`
   - **Owner**: `iArsalanshah`
   - **Repository**: `GhostFetch`
   - **Workflow name**: `pypi-publish.yml`
   - **Environment**: `pypi`
3. Create a GitHub Release to trigger publishing

No API tokens are needed; this uses OIDC trusted publishing.

## Legal Disclaimer

GhostFetch is provided for educational and research purposes only. Users are solely responsible for ensuring their use complies with:

1. The Terms of Service of target websites
2. Applicable laws regarding data access and automation (including the CFAA in the US)
3. The robots.txt and scraping policies of target domains

This tool should not be used to:
- Scrape private or authenticated content without authorization
- Circumvent security measures on sites where such circumvention violates applicable law
- Violate the Terms of Service of social media platforms (including X/Twitter)

The authors assume no liability for misuse of this software.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.