arcfetch 0.0.1 → 1.1.0
- package/README.md +308 -85
- package/cli.ts +264 -15
- package/index.ts +258 -13
- package/package.json +12 -2
- package/src/config/defaults.ts +1 -1
- package/src/config/index.ts +3 -3
- package/src/config/loader.ts +2 -2
- package/src/core/cache.ts +116 -37
- package/src/core/extractor.ts +1 -1
- package/src/core/index.ts +4 -4
- package/src/core/pipeline.ts +4 -4
- package/src/core/playwright/index.ts +2 -2
- package/src/core/playwright/local.ts +2 -2
- package/src/core/playwright/manager.ts +3 -3
- package/src/utils/version.ts +3 -3
package/README.md
CHANGED
|
@@ -1,119 +1,185 @@
|
|
|
1
1
|
# arcfetch
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
[](https://www.npmjs.org/package/arcfetch)
|
|
4
|
+
[](https://opensource.org/licenses/MIT)
|
|
5
|
+
|
|
6
|
+
**Zero-config URL fetching** that converts web pages to clean markdown with automatic JavaScript rendering fallback.
|
|
7
|
+
|
|
8
|
+
Perfect for AI workflows, research, and documentation. Fetches URLs, extracts article content using Mozilla Readability, and caches as markdown with 90-95% token reduction.
|
|
9
|
+
|
|
10
|
+
## Why arcfetch?
|
|
11
|
+
|
|
12
|
+
| Problem | Solution |
|
|
13
|
+
|---------|----------|
|
|
14
|
+
| **JS-heavy sites return blank** | Auto-detects and retries with Playwright |
|
|
15
|
+
| **Too much HTML clutter** | Mozilla Readability extracts just the article |
|
|
16
|
+
| **High token costs for LLMs** | 90-95% token reduction vs raw HTML |
|
|
17
|
+
| **No good caching story** | Temp → Docs workflow for easy curation |
|
|
18
|
+
| **Hard to integrate** | Works as CLI or MCP server with zero setup |
|
|
4
19
|
|
|
5
20
|
## Features
|
|
6
21
|
|
|
7
22
|
- **Smart Fetching**: Simple HTTP first, automatic Playwright fallback for JS-heavy sites
|
|
8
|
-
- **Quality Gates**: Configurable quality thresholds with automatic retry
|
|
23
|
+
- **Quality Gates**: Configurable quality thresholds (0-100) with automatic retry
|
|
9
24
|
- **Clean Markdown**: Mozilla Readability + Turndown for 90-95% token reduction
|
|
10
25
|
- **Temp → Docs Workflow**: Cache to temp folder, promote to docs when ready
|
|
11
26
|
- **CLI & MCP**: Available as command-line tool and MCP server
|
|
27
|
+
- **Multiple Output Formats**: Plain text, JSON, filepath, or summary
|
|
28
|
+
- **Configurable Thresholds**: Set quality minimums and retry strategies
|
|
29
|
+
|
|
30
|
+
## Quick Start
|
|
31
|
+
|
|
32
|
+
### No Installation Required (npx/bunx)
|
|
33
|
+
|
|
34
|
+
```bash
|
|
35
|
+
# Fetch and display markdown
|
|
36
|
+
npx arcfetch fetch https://example.com/article
|
|
37
|
+
|
|
38
|
+
# Get just the filepath (for scripts)
|
|
39
|
+
npx arcfetch fetch https://example.com -o path
|
|
40
|
+
|
|
41
|
+
# With pretty output
|
|
42
|
+
bunx arcfetch fetch https://example.com --pretty
|
|
43
|
+
```
|
|
12
44
|
|
|
13
|
-
|
|
45
|
+
### Global Installation
|
|
14
46
|
|
|
15
47
|
```bash
|
|
16
|
-
# For users
|
|
17
48
|
npm install -g arcfetch
|
|
18
49
|
|
|
19
|
-
#
|
|
50
|
+
# Then use directly
|
|
51
|
+
arcfetch fetch https://example.com/article
|
|
52
|
+
```
|
|
53
|
+
|
|
54
|
+
### Development
|
|
55
|
+
|
|
56
|
+
```bash
|
|
57
|
+
git clone https://github.com/yourusername/arcfetch.git
|
|
58
|
+
cd arcfetch
|
|
20
59
|
bun install
|
|
60
|
+
bun run cli.ts fetch https://example.com
|
|
21
61
|
```
|
|
22
62
|
|
|
23
|
-
##
|
|
63
|
+
## CLI Usage
|
|
24
64
|
|
|
25
|
-
###
|
|
65
|
+
### Basic Commands
|
|
26
66
|
|
|
27
67
|
```bash
|
|
28
|
-
# Fetch
|
|
68
|
+
# Fetch and display markdown (default output)
|
|
29
69
|
arcfetch fetch https://example.com/article
|
|
30
70
|
|
|
31
|
-
# List cached references
|
|
71
|
+
# List all cached references
|
|
32
72
|
arcfetch list
|
|
33
73
|
|
|
34
|
-
# Promote to docs
|
|
74
|
+
# Promote from temp to permanent docs
|
|
35
75
|
arcfetch promote REF-001
|
|
36
76
|
|
|
37
|
-
# Delete a reference
|
|
77
|
+
# Delete a cached reference
|
|
38
78
|
arcfetch delete REF-001
|
|
79
|
+
|
|
80
|
+
# Show current configuration
|
|
81
|
+
arcfetch config
|
|
39
82
|
```
|
|
40
83
|
|
|
41
|
-
###
|
|
84
|
+
### Output Formats
|
|
42
85
|
|
|
43
|
-
|
|
86
|
+
```bash
|
|
87
|
+
# Plain text (LLM-friendly, default)
|
|
88
|
+
arcfetch fetch https://example.com -o text
|
|
44
89
|
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
"mcpServers": {
|
|
48
|
-
"arcfetch": {
|
|
49
|
-
"command": "bun",
|
|
50
|
-
"args": ["run", "/path/to/arcfetch/index.ts"]
|
|
51
|
-
}
|
|
52
|
-
}
|
|
53
|
-
}
|
|
54
|
-
```
|
|
90
|
+
# Just the filepath (for scripts)
|
|
91
|
+
arcfetch fetch https://example.com -o path
|
|
55
92
|
|
|
56
|
-
|
|
93
|
+
# Summary: REF-ID|filepath
|
|
94
|
+
arcfetch fetch https://example.com -o summary
|
|
95
|
+
|
|
96
|
+
# Structured JSON
|
|
97
|
+
arcfetch fetch https://example.com -o json
|
|
57
98
|
|
|
58
|
-
|
|
99
|
+
# Human-friendly with emojis
|
|
100
|
+
arcfetch fetch https://example.com --pretty
|
|
101
|
+
```
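The `summary` format above is described as `REF-ID|filepath`. A minimal TypeScript sketch of how a calling script might split that line follows. The `parseSummary` helper and the `FetchSummary` type are hypothetical names introduced here for illustration and are not part of arcfetch; the only assumption taken from the README is the single `|` separator.

```typescript
// Hypothetical helper (not part of arcfetch): splits the "REF-ID|filepath"
// line that the README describes for `-o summary`.
interface FetchSummary {
  refId: string;
  filepath: string;
}

function parseSummary(line: string): FetchSummary {
  const trimmed = line.trim();
  // Split on the first "|" only, so a filepath that itself contains "|" survives.
  const sep = trimmed.indexOf("|");
  if (sep <= 0 || sep === trimmed.length - 1) {
    throw new Error(`Unexpected summary line: ${trimmed}`);
  }
  return {
    refId: trimmed.slice(0, sep),
    filepath: trimmed.slice(sep + 1),
  };
}
```

Splitting on the first separator rather than every `|` keeps the helper tolerant of unusual paths; the ref ID is assumed never to contain `|`.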
|
|
59
102
|
|
|
60
|
-
|
|
103
|
+
### Advanced Options
|
|
61
104
|
|
|
62
105
|
```bash
|
|
63
|
-
|
|
106
|
+
# Add search query as metadata
|
|
107
|
+
arcfetch fetch https://example.com -q "machine learning"
|
|
64
108
|
|
|
65
|
-
|
|
66
|
-
|
|
67
|
-
-o, --output <format> Output: text, json, summary (default: text)
|
|
68
|
-
-v, --verbose Show detailed output
|
|
69
|
-
--pretty Human-friendly output with emojis
|
|
70
|
-
--min-quality <n> Minimum quality score 0-100 (default: 60)
|
|
71
|
-
--temp-dir <path> Temp folder (default: .tmp)
|
|
72
|
-
--docs-dir <path> Docs folder (default: docs/ai/references)
|
|
73
|
-
--wait-strategy <mode> Playwright wait strategy: networkidle, domcontentloaded, load
|
|
74
|
-
--force-playwright Skip simple fetch and use Playwright directly
|
|
75
|
-
```
|
|
109
|
+
# Set minimum quality threshold (default: 60)
|
|
110
|
+
arcfetch fetch https://example.com --min-quality 80
|
|
76
111
|
|
|
77
|
-
|
|
112
|
+
# Force Playwright (skip simple fetch)
|
|
113
|
+
arcfetch fetch https://example.com --force-playwright
|
|
78
114
|
|
|
79
|
-
|
|
115
|
+
# Use faster wait strategy for simple sites
|
|
116
|
+
arcfetch fetch https://example.com --wait-strategy load
|
|
80
117
|
|
|
81
|
-
|
|
82
|
-
arcfetch
|
|
118
|
+
# Custom directories
|
|
119
|
+
arcfetch fetch https://example.com --temp-dir .cache --docs-dir content
|
|
120
|
+
|
|
121
|
+
# Verbose output for debugging
|
|
122
|
+
arcfetch fetch https://example.com -v
|
|
83
123
|
```
|
|
84
124
|
|
|
85
|
-
|
|
125
|
+
## MCP Server
|
|
86
126
|
|
|
87
|
-
|
|
127
|
+
### Installation (Recommended: npx/bunx)
|
|
88
128
|
|
|
89
|
-
|
|
90
|
-
arcfetch promote <ref-id>
|
|
91
|
-
```
|
|
129
|
+
Add to your Claude Code MCP configuration (`~/.config/claude-code/mcp_config.json`):
|
|
92
130
|
|
|
93
|
-
|
|
131
|
+
```json
|
|
132
|
+
{
|
|
133
|
+
"mcpServers": {
|
|
134
|
+
"arcfetch": {
|
|
135
|
+
"command": "npx",
|
|
136
|
+
"args": ["arcfetch"]
|
|
137
|
+
}
|
|
138
|
+
}
|
|
139
|
+
}
|
|
140
|
+
```
|
|
94
141
|
|
|
95
|
-
|
|
142
|
+
Or using bunx (faster):
|
|
96
143
|
|
|
97
|
-
```
|
|
98
|
-
|
|
144
|
+
```json
|
|
145
|
+
{
|
|
146
|
+
"mcpServers": {
|
|
147
|
+
"arcfetch": {
|
|
148
|
+
"command": "bunx",
|
|
149
|
+
"args": ["arcfetch"]
|
|
150
|
+
}
|
|
151
|
+
}
|
|
152
|
+
}
|
|
99
153
|
```
|
|
100
154
|
|
|
101
|
-
###
|
|
155
|
+
### Local Development
|
|
102
156
|
|
|
103
|
-
|
|
104
|
-
|
|
105
|
-
|
|
106
|
-
arcfetch
|
|
157
|
+
```json
|
|
158
|
+
{
|
|
159
|
+
"mcpServers": {
|
|
160
|
+
"arcfetch": {
|
|
161
|
+
"command": "bun",
|
|
162
|
+
"args": ["run", "/path/to/arcfetch/index.ts"],
|
|
163
|
+
"cwd": "/path/to/arcfetch"
|
|
164
|
+
}
|
|
165
|
+
}
|
|
166
|
+
}
|
|
107
167
|
```
|
|
108
168
|
|
|
109
|
-
|
|
169
|
+
### MCP Tools
|
|
170
|
+
|
|
171
|
+
| Tool | Parameters | Description |
|
|
172
|
+
|------|------------|-------------|
|
|
173
|
+
| `fetch_url` | `url`, `query?`, `minQuality?`, `forcePlaywright?` | Fetch URL with auto JS fallback |
|
|
174
|
+
| `list_cached` | - | List all cached references |
|
|
175
|
+
| `promote_reference` | `refId` | Move from temp to docs folder |
|
|
176
|
+
| `delete_cached` | `refId` | Delete a cached reference |
|
|
110
177
|
|
|
111
|
-
|
|
112
|
-
|
|
113
|
-
|
|
114
|
-
|
|
115
|
-
|
|
116
|
-
| `delete_cached` | Delete a cached reference |
|
|
178
|
+
Example MCP usage:
|
|
179
|
+
```
|
|
180
|
+
User: Fetch https://example.com/article for me
|
|
181
|
+
Claude: [Calls fetch_url tool]
|
|
182
|
+
```
|
|
117
183
|
|
|
118
184
|
## Configuration
|
|
119
185
|
|
|
@@ -167,46 +233,203 @@ URL → Simple Fetch → Quality Check
|
|
|
167
233
|
|
|
168
234
|
## Playwright Wait Strategies
|
|
169
235
|
|
|
170
|
-
| Strategy |
|
|
171
|
-
|
|
172
|
-
| `networkidle` |
|
|
173
|
-
| `domcontentloaded` |
|
|
174
|
-
| `load` |
|
|
236
|
+
| Strategy | Speed | Reliability | Best For |
|
|
237
|
+
|----------|-------|-------------|----------|
|
|
238
|
+
| `networkidle` | Slowest | Highest | JS-heavy apps, dynamic content |
|
|
239
|
+
| `domcontentloaded` | Medium | Medium | Most SPAs, modern sites |
|
|
240
|
+
| `load` | Fastest | Basic | Static sites, simple pages |
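The three strategy names in the table match values that Playwright accepts for the `waitUntil` option of `page.goto`. As a hedged sketch, a wrapper script could validate a user-supplied value before passing it to `--wait-strategy`. The names `WAIT_STRATEGIES`, `WaitStrategy`, and `isWaitStrategy` are hypothetical and are not taken from the arcfetch source.

```typescript
// Hypothetical validation helper (not part of arcfetch): restricts a string
// to the three wait strategies listed in the README table.
const WAIT_STRATEGIES = ["networkidle", "domcontentloaded", "load"] as const;
type WaitStrategy = (typeof WAIT_STRATEGIES)[number];

function isWaitStrategy(value: string): value is WaitStrategy {
  // Compare against each allowed name; anything else is rejected.
  return WAIT_STRATEGIES.some((s) => s === value);
}
```

A type guard like this lets the compiler narrow the value, so a typo such as `networkIdle` is caught before a fetch is attempted.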
|
|
175
241
|
|
|
176
|
-
##
|
|
242
|
+
## Quality Pipeline
|
|
177
243
|
|
|
178
244
|
```
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
245
|
+
URL → Simple Fetch → Quality Check (0-100)
|
|
246
|
+
│
|
|
247
|
+
┌───────────────┼───────────────┐
|
|
248
|
+
▼ ▼ ▼
|
|
249
|
+
Score ≥ 85 60-84 < 60
|
|
250
|
+
│ │ │
|
|
251
|
+
▼ ▼ ▼
|
|
252
|
+
Save Try Playwright Try Playwright
|
|
253
|
+
use best Score ≥ 60?
|
|
254
|
+
Yes → Save
|
|
255
|
+
No → Error
|
|
256
|
+
```
|
|
257
|
+
|
|
258
|
+
**Default thresholds:**
|
|
259
|
+
- `minScore`: 60 - Content below this is rejected
|
|
260
|
+
- `jsRetryThreshold`: 85 - Above this, skip Playwright entirely
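The decision flow in the diagram can be sketched in TypeScript as follows. This is an illustration under stated assumptions and not the package's actual implementation: `decide`, `Thresholds`, `Decision`, and the `playwrightScore` callback are hypothetical names, the retry is modeled synchronously for brevity, and the `>= 85` comparison follows the diagram. Only `minScore`, `jsRetryThreshold`, and their defaults come from the README.

```typescript
// Hypothetical sketch of the README's quality pipeline (not arcfetch's code).
interface Thresholds {
  minScore: number;         // content scoring below this is rejected
  jsRetryThreshold: number; // at or above this, Playwright is skipped
}

type Decision =
  | { action: "save"; score: number; source: "simple" | "playwright" }
  | { action: "error"; score: number };

const DEFAULTS: Thresholds = { minScore: 60, jsRetryThreshold: 85 };

function decide(
  simpleScore: number,
  playwrightScore: () => number, // called only when a retry is needed
  t: Thresholds = DEFAULTS,
): Decision {
  // High-quality simple fetch: save without launching Playwright.
  if (simpleScore >= t.jsRetryThreshold) {
    return { action: "save", score: simpleScore, source: "simple" };
  }
  // Otherwise retry with Playwright and keep whichever result scored higher.
  const retryScore = playwrightScore();
  const best = Math.max(simpleScore, retryScore);
  if (best < t.minScore) {
    return { action: "error", score: best };
  }
  const source = retryScore > simpleScore ? "playwright" : "simple";
  return { action: "save", score: best, source };
}
```

Passing the Playwright score as a callback keeps the expensive browser step lazy, which mirrors the "skip Playwright entirely" branch for scores at or above `jsRetryThreshold`.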
|
|
261
|
+
|
|
262
|
+
## Real-World Examples
|
|
263
|
+
|
|
264
|
+
### AI Research Workflow
|
|
182
265
|
|
|
183
|
-
|
|
184
|
-
|
|
266
|
+
```bash
|
|
267
|
+
# Fetch multiple articles for research
|
|
268
|
+
npx arcfetch fetch https://arxiv.org/abs/2301.00001 -q "LLM research"
|
|
269
|
+
npx arcfetch fetch https://openai.com/research/gpt-4 -q "GPT-4"
|
|
270
|
+
|
|
271
|
+
# List all fetched
|
|
272
|
+
npx arcfetch list
|
|
273
|
+
|
|
274
|
+
# Promote the best ones to docs
|
|
275
|
+
npx arcfetch promote REF-001
|
|
276
|
+
npx arcfetch promote REF-002
|
|
185
277
|
```
|
|
186
278
|
|
|
187
|
-
|
|
279
|
+
### Script Integration
|
|
280
|
+
|
|
281
|
+
```bash
|
|
282
|
+
#!/bin/bash
|
|
283
|
+
# fetch-and-process.sh
|
|
284
|
+
|
|
285
|
+
# Fetch and get filepath
|
|
286
|
+
filepath=$(npx arcfetch fetch https://example.com -o path)
|
|
287
|
+
|
|
288
|
+
# Process with other tools
|
|
289
|
+
cat "$filepath" | other-tool
|
|
290
|
+
|
|
291
|
+
# Or get just the ref ID
|
|
292
|
+
summary=$(npx arcfetch fetch https://example.com -o summary)
|
|
293
|
+
ref_id=$(echo "$summary" | cut -d'|' -f1)
|
|
188
294
|
|
|
189
|
-
|
|
295
|
+
# Promote if it meets quality standards
|
|
296
|
+
if npx arcfetch promote "$ref_id"; then
|
|
297
|
+
echo "Successfully promoted $ref_id"
|
|
298
|
+
fi
|
|
299
|
+
```
|
|
300
|
+
|
|
301
|
+
### Handling JS-Heavy Sites
|
|
190
302
|
|
|
191
303
|
```bash
|
|
192
|
-
|
|
304
|
+
# Modern React/Vue/Angular apps
|
|
305
|
+
arcfetch fetch https://spa-example.com --force-playwright --wait-strategy networkidle
|
|
306
|
+
|
|
307
|
+
# Simple blogs (use faster strategy)
|
|
308
|
+
arcfetch fetch https://blog.example.com --wait-strategy load
|
|
309
|
+
|
|
310
|
+
# Unknown site (let arcfetch decide)
|
|
311
|
+
arcfetch fetch https://unknown-site.com
|
|
193
312
|
```
|
|
194
313
|
|
|
195
|
-
###
|
|
314
|
+
### Bulk Fetching with JSON Output
|
|
196
315
|
|
|
197
316
|
```bash
|
|
198
|
-
|
|
317
|
+
# Fetch multiple URLs and parse JSON
|
|
318
|
+
for url in "${urls[@]}"; do
|
|
319
|
+
arcfetch fetch "$url" -o json >> results.json
|
|
320
|
+
done
|
|
321
|
+
|
|
322
|
+
# Or use jq to extract specific fields
|
|
323
|
+
arcfetch fetch https://example.com -o json | jq '.filepath'
|
|
324
|
+
```
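Note that appending with `>>` in the loop above yields several JSON documents back to back in `results.json`, which a single `JSON.parse` call would reject. The sketch below shows one way a consumer might split such a stream. It assumes each invocation emits exactly one top-level JSON object (the `jq '.filepath'` example suggests an object with a `filepath` field, but the full output shape is not documented here). `splitJsonObjects` is a hypothetical helper and not part of arcfetch.

```typescript
// Hypothetical helper (not part of arcfetch): splits a stream of
// concatenated top-level JSON objects and parses each one.
function splitJsonObjects(stream: string): unknown[] {
  const values: unknown[] = [];
  let depth = 0;        // current brace nesting level outside strings
  let start = -1;       // index where the current top-level object began
  let inString = false; // true while scanning inside a JSON string
  let escaped = false;  // true right after a backslash inside a string

  for (let i = 0; i < stream.length; i++) {
    const ch = stream.charAt(i);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue; // braces inside strings are ignored
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      if (depth === 0) start = i;
      depth++;
    } else if (ch === "}" && depth > 0) {
      depth--;
      if (depth === 0 && start !== -1) {
        values.push(JSON.parse(stream.slice(start, i + 1)));
        start = -1;
      }
    }
  }
  return values;
}
```

Tracking string and escape state means braces inside values (for example in a page title) do not break the split, whether the objects are emitted on one line or pretty-printed.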
|
|
325
|
+
|
|
326
|
+
## Troubleshooting
|
|
327
|
+
|
|
328
|
+
### "Playwright not found" Error
|
|
329
|
+
|
|
330
|
+
**Problem:** Playwright fails to launch
|
|
331
|
+
|
|
332
|
+
**Solution:**
|
|
333
|
+
```bash
|
|
334
|
+
# If using npm globally
|
|
335
|
+
npm install -g playwright
|
|
336
|
+
|
|
337
|
+
# If using npx (auto-installed)
|
|
338
|
+
npx arcfetch fetch https://example.com --force-playwright
|
|
339
|
+
```
|
|
340
|
+
|
|
341
|
+
### Low Quality Score
|
|
342
|
+
|
|
343
|
+
**Problem:** Content is rejected due to low quality
|
|
344
|
+
|
|
345
|
+
**Solution:**
|
|
346
|
+
```bash
|
|
347
|
+
# Lower the threshold temporarily
|
|
348
|
+
arcfetch fetch https://example.com --min-quality 40
|
|
349
|
+
|
|
350
|
+
# Or force Playwright (often produces better results)
|
|
351
|
+
arcfetch fetch https://example.com --force-playwright
|
|
352
|
+
```
|
|
353
|
+
|
|
354
|
+
### Timeout on Slow Sites
|
|
355
|
+
|
|
356
|
+
**Problem:** Site takes too long to load
|
|
357
|
+
|
|
358
|
+
**Solution:**
|
|
359
|
+
```bash
|
|
360
|
+
# Use faster wait strategy
|
|
361
|
+
arcfetch fetch https://example.com --wait-strategy load
|
|
362
|
+
|
|
363
|
+
# Combine with force-playwright for JS sites
|
|
364
|
+
arcfetch fetch https://example.com --force-playwright --wait-strategy domcontentloaded
|
|
365
|
+
```
|
|
366
|
+
|
|
367
|
+
### MCP Server Not Connecting
|
|
368
|
+
|
|
369
|
+
**Problem:** Claude Code can't connect to MCP server
|
|
370
|
+
|
|
371
|
+
**Solution:**
|
|
372
|
+
```bash
|
|
373
|
+
# Test if the MCP server works manually
|
|
374
|
+
npx arcfetch fetch https://example.com
|
|
375
|
+
|
|
376
|
+
# Check your MCP config path
|
|
377
|
+
# macOS: ~/.config/claude-code/mcp_config.json
|
|
378
|
+
# Linux: ~/.config/claude-code/mcp_config.json
|
|
379
|
+
# Windows: %APPDATA%\claude-code\mcp_config.json
|
|
380
|
+
```
|
|
381
|
+
|
|
382
|
+
## Comparison
|
|
383
|
+
|
|
384
|
+
| Feature | arcfetch | html-to-markdown | url-to-markdown | playwright-extra |
|
|
385
|
+
|---------|----------|------------------|-----------------|------------------|
|
|
386
|
+
| Auto JS fallback | ✅ | ❌ | ❌ | Manual |
|
|
387
|
+
| Quality scoring | ✅ | ❌ | ❌ | ❌ |
|
|
388
|
+
| Temp → Docs workflow | ✅ | ❌ | ❌ | ❌ |
|
|
389
|
+
| MCP server | ✅ | ❌ | ❌ | ❌ |
|
|
390
|
+
| Multiple output formats | ✅ | ❌ | Some | ❌ |
|
|
391
|
+
| Zero-config | ✅ | ✅ | ✅ | ❌ |
|
|
392
|
+
| Playwright included | ✅ | ❌ | ❌ | Manual setup |
|
|
393
|
+
|
|
394
|
+
## Architecture
|
|
395
|
+
|
|
199
396
|
```
|
|
397
|
+
┌─────────────────────────────────────────────────────────┐
|
|
398
|
+
│ CLI / MCP Interface │
|
|
399
|
+
└─────────────────────────────────────────────────────────┘
|
|
400
|
+
│
|
|
401
|
+
▼
|
|
402
|
+
┌─────────────────────────────────────────────────────────┐
|
|
403
|
+
│ Core Pipeline │
|
|
404
|
+
│ 1. Simple HTTP Fetch │
|
|
405
|
+
│ 2. Extract with Readability + Turndown │
|
|
406
|
+
│ 3. Validate Quality Score │
|
|
407
|
+
│ 4. Conditional Playwright Retry │
|
|
408
|
+
│ 5. Cache with Frontmatter Metadata │
|
|
409
|
+
└─────────────────────────────────────────────────────────┘
|
|
410
|
+
│
|
|
411
|
+
┌───────────────┼───────────────┐
|
|
412
|
+
▼ ▼ ▼
|
|
413
|
+
┌───────────┐ ┌───────────┐ ┌───────────┐
|
|
414
|
+
│ Cache │ │Playwright │ │ Validator │
|
|
415
|
+
│ Manager │ │ Manager │ │ │
|
|
416
|
+
└───────────┘ └───────────┘ └───────────┘
|
|
417
|
+
```
|
|
418
|
+
|
|
419
|
+
## Contributing
|
|
200
420
|
|
|
201
|
-
|
|
421
|
+
Contributions welcome! Please read our contributing guidelines and submit pull requests to the main branch.
|
|
422
|
+
|
|
423
|
+
### Development Setup
|
|
202
424
|
|
|
203
425
|
```bash
|
|
204
|
-
|
|
205
|
-
|
|
206
|
-
|
|
207
|
-
|
|
426
|
+
git clone https://github.com/yourusername/arcfetch.git
|
|
427
|
+
cd arcfetch
|
|
428
|
+
bun install
|
|
429
|
+
bun test # Run tests
|
|
430
|
+
bun run typecheck # Type checking
|
|
208
431
|
```
|
|
209
432
|
|
|
210
433
|
## License
|
|
211
434
|
|
|
212
|
-
MIT
|
|
435
|
+
MIT License - see LICENSE file for details
|