arcfetch 0.0.1 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,119 +1,185 @@
1
1
  # arcfetch
2
2
 
3
- Fetch URLs, extract clean article content, and cache as markdown. Supports automatic JavaScript rendering fallback via Playwright.
3
+ [![npm version](https://badge.fury.io/js/arcfetch.svg)](https://www.npmjs.org/package/arcfetch)
4
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
5
+
6
+ **Zero-config URL fetching** that converts web pages to clean markdown with automatic JavaScript rendering fallback.
7
+
8
+ Perfect for AI workflows, research, and documentation. Fetches URLs, extracts article content using Mozilla Readability, and caches as markdown with 90-95% token reduction.
9
+
10
+ ## Why arcfetch?
11
+
12
+ | Problem | Solution |
13
+ |---------|----------|
14
+ | **JS-heavy sites return blank** | Auto-detects and retries with Playwright |
15
+ | **Too much HTML clutter** | Mozilla Readability extracts just the article |
16
+ | **High token costs for LLMs** | 90-95% token reduction vs raw HTML |
17
+ | **No good caching story** | Temp → Docs workflow for easy curation |
18
+ | **Hard to integrate** | Works as CLI or MCP server with zero setup |
4
19
 
5
20
  ## Features
6
21
 
7
22
  - **Smart Fetching**: Simple HTTP first, automatic Playwright fallback for JS-heavy sites
8
- - **Quality Gates**: Configurable quality thresholds with automatic retry
23
+ - **Quality Gates**: Configurable quality thresholds (0-100) with automatic retry
9
24
  - **Clean Markdown**: Mozilla Readability + Turndown for 90-95% token reduction
10
25
  - **Temp → Docs Workflow**: Cache to temp folder, promote to docs when ready
11
26
  - **CLI & MCP**: Available as command-line tool and MCP server
27
+ - **Multiple Output Formats**: Plain text, JSON, filepath, or summary
28
+ - **Configurable Thresholds**: Set quality minimums and retry strategies
29
+
30
+ ## Quick Start
31
+
32
+ ### No Installation Required (npx/bunx)
33
+
34
+ ```bash
35
+ # Fetch and display markdown
36
+ npx arcfetch fetch https://example.com/article
37
+
38
+ # Get just the filepath (for scripts)
39
+ npx arcfetch fetch https://example.com -o path
40
+
41
+ # With pretty output
42
+ bunx arcfetch fetch https://example.com --pretty
43
+ ```
12
44
 
13
- ## Installation
45
+ ### Global Installation
14
46
 
15
47
  ```bash
16
- # For users
17
48
  npm install -g arcfetch
18
49
 
19
- # For development
50
+ # Then use directly
51
+ arcfetch fetch https://example.com/article
52
+ ```
53
+
54
+ ### Development
55
+
56
+ ```bash
57
+ git clone https://github.com/yourusername/arcfetch.git
58
+ cd arcfetch
20
59
  bun install
60
+ bun run cli.ts fetch https://example.com
21
61
  ```
22
62
 
23
- ## Quick Start
63
+ ## CLI Usage
24
64
 
25
- ### CLI
65
+ ### Basic Commands
26
66
 
27
67
  ```bash
28
- # Fetch a URL
68
+ # Fetch and display markdown (default output)
29
69
  arcfetch fetch https://example.com/article
30
70
 
31
- # List cached references
71
+ # List all cached references
32
72
  arcfetch list
33
73
 
34
- # Promote to docs folder
74
+ # Promote from temp to permanent docs
35
75
  arcfetch promote REF-001
36
76
 
37
- # Delete a reference
77
+ # Delete a cached reference
38
78
  arcfetch delete REF-001
79
+
80
+ # Show current configuration
81
+ arcfetch config
39
82
  ```
40
83
 
41
- ### MCP Server
84
+ ### Output Formats
42
85
 
43
- Add to your Claude Code MCP configuration:
86
+ ```bash
87
+ # Plain text (LLM-friendly, default)
88
+ arcfetch fetch https://example.com -o text
44
89
 
45
- ```json
46
- {
47
- "mcpServers": {
48
- "arcfetch": {
49
- "command": "bun",
50
- "args": ["run", "/path/to/arcfetch/index.ts"]
51
- }
52
- }
53
- }
54
- ```
90
+ # Just the filepath (for scripts)
91
+ arcfetch fetch https://example.com -o path
55
92
 
56
- ## CLI Commands
93
+ # Summary: REF-ID|filepath
94
+ arcfetch fetch https://example.com -o summary
95
+
96
+ # Structured JSON
97
+ arcfetch fetch https://example.com -o json
57
98
 
58
- ### fetch
99
+ # Human-friendly with emojis
100
+ arcfetch fetch https://example.com --pretty
101
+ ```
59
102
 
60
- Fetch URL and save to temp folder.
103
+ ### Advanced Options
61
104
 
62
105
  ```bash
63
- arcfetch fetch <url> [options]
106
+ # Add search query as metadata
107
+ arcfetch fetch https://example.com -q "machine learning"
64
108
 
65
- Options:
66
- -q, --query <text> Search query (saved as metadata)
67
- -o, --output <format> Output: text, json, summary (default: text)
68
- -v, --verbose Show detailed output
69
- --pretty Human-friendly output with emojis
70
- --min-quality <n> Minimum quality score 0-100 (default: 60)
71
- --temp-dir <path> Temp folder (default: .tmp)
72
- --docs-dir <path> Docs folder (default: docs/ai/references)
73
- --wait-strategy <mode> Playwright wait strategy: networkidle, domcontentloaded, load
74
- --force-playwright Skip simple fetch and use Playwright directly
75
- ```
109
+ # Set minimum quality threshold (default: 60)
110
+ arcfetch fetch https://example.com --min-quality 80
76
111
 
77
- ### list
112
+ # Force Playwright (skip simple fetch)
113
+ arcfetch fetch https://example.com --force-playwright
78
114
 
79
- List all cached references.
115
+ # Use faster wait strategy for simple sites
116
+ arcfetch fetch https://example.com --wait-strategy load
80
117
 
81
- ```bash
82
- arcfetch list [-o json]
118
+ # Custom directories
119
+ arcfetch fetch https://example.com --temp-dir .cache --docs-dir content
120
+
121
+ # Verbose output for debugging
122
+ arcfetch fetch https://example.com -v
83
123
  ```
84
124
 
85
- ### promote
125
+ ## MCP Server
86
126
 
87
- Move reference from temp to docs folder.
127
+ ### Installation (Recommended: npx/bunx)
88
128
 
89
- ```bash
90
- arcfetch promote <ref-id>
91
- ```
129
+ Add to your Claude Code MCP configuration (`~/.config/claude-code/mcp_config.json`):
92
130
 
93
- ### delete
131
+ ```json
132
+ {
133
+ "mcpServers": {
134
+ "arcfetch": {
135
+ "command": "npx",
136
+ "args": ["arcfetch"]
137
+ }
138
+ }
139
+ }
140
+ ```
94
141
 
95
- Delete a cached reference.
142
+ Or using bunx (faster):
96
143
 
97
- ```bash
98
- arcfetch delete <ref-id>
144
+ ```json
145
+ {
146
+ "mcpServers": {
147
+ "arcfetch": {
148
+ "command": "bunx",
149
+ "args": ["arcfetch"]
150
+ }
151
+ }
152
+ }
99
153
  ```
100
154
 
101
- ### config
155
+ ### Local Development
102
156
 
103
- Show current configuration.
104
-
105
- ```bash
106
- arcfetch config
157
+ ```json
158
+ {
159
+ "mcpServers": {
160
+ "arcfetch": {
161
+ "command": "bun",
162
+ "args": ["run", "/path/to/arcfetch/index.ts"],
163
+ "cwd": "/path/to/arcfetch"
164
+ }
165
+ }
166
+ }
107
167
  ```
108
168
 
109
- ## MCP Tools
169
+ ### MCP Tools
170
+
171
+ | Tool | Parameters | Description |
172
+ |------|------------|-------------|
173
+ | `fetch_url` | `url`, `query?`, `minQuality?`, `forcePlaywright?` | Fetch URL with auto JS fallback |
174
+ | `list_cached` | - | List all cached references |
175
+ | `promote_reference` | `refId` | Move from temp to docs folder |
176
+ | `delete_cached` | `refId` | Delete a cached reference |
110
177
 
111
- | Tool | Description |
112
- |------|-------------|
113
- | `fetch_url` | Fetch URL with auto JS fallback, save to temp |
114
- | `list_cached` | List all cached references |
115
- | `promote_reference` | Move from temp to docs folder |
116
- | `delete_cached` | Delete a cached reference |
178
+ Example MCP usage:
179
+ ```
180
+ User: Fetch https://example.com/article for me
181
+ Claude: [Calls fetch_url tool]
182
+ ```
117
183
 
118
184
  ## Configuration
119
185
 
@@ -167,46 +233,203 @@ URL → Simple Fetch → Quality Check
167
233
 
168
234
  ## Playwright Wait Strategies
169
235
 
170
- | Strategy | Description |
171
- |----------|-------------|
172
- | `networkidle` | Wait until network is idle (slowest, most reliable) |
173
- | `domcontentloaded` | Wait until DOM is loaded (faster) |
174
- | `load` | Wait until page load event completes (fastest) |
236
+ | Strategy | Speed | Reliability | Best For |
237
+ |----------|-------|-------------|----------|
238
+ | `networkidle` | Slowest | Highest | JS-heavy apps, dynamic content |
239
+ | `domcontentloaded` | Medium | Medium | Most SPAs, modern sites |
240
+ | `load` | Fastest | Basic | Static sites, simple pages |
175
241
 
176
- ## File Structure
242
+ ## Quality Pipeline
177
243
 
178
244
  ```
179
- .tmp/ # Temporary cache (default)
180
- REF-001-article-title.md
181
- REF-002-another-article.md
245
+ URL Simple Fetch → Quality Check (0-100)
246
+
247
+ ┌───────────────┼───────────────┐
248
+ ▼ ▼ ▼
249
+ Score ≥ 85 60-84 < 60
250
+ │ │ │
251
+ ▼ ▼ ▼
252
+ Save Try Playwright Try Playwright
253
+ use best Score ≥ 60?
254
+ Yes → Save
255
+ No → Error
256
+ ```
257
+
258
+ **Default thresholds:**
259
+ - `minScore`: 60 - Content below this is rejected
260
+ - `jsRetryThreshold`: 85 - Above this, skip Playwright entirely
261
+
262
+ ## Real-World Examples
263
+
264
+ ### AI Research Workflow
182
265
 
183
- docs/ai/references/ # Permanent docs (after promote)
184
- REF-001-article-title.md
266
+ ```bash
267
+ # Fetch multiple articles for research
268
+ npx arcfetch fetch https://arxiv.org/abs/2301.00001 -q "LLM research"
269
+ npx arcfetch fetch https://openai.com/research/gpt-4 -q "GPT-4"
270
+
271
+ # List all fetched
272
+ npx arcfetch list
273
+
274
+ # Promote the best ones to docs
275
+ npx arcfetch promote REF-001
276
+ npx arcfetch promote REF-002
185
277
  ```
186
278
 
187
- ## Examples
279
+ ### Script Integration
280
+
281
+ ```bash
282
+ #!/bin/bash
283
+ # fetch-and-process.sh
284
+
285
+ # Fetch and get filepath
286
+ filepath=$(npx arcfetch fetch https://example.com -o path)
287
+
288
+ # Process with other tools
289
+ cat "$filepath" | other-tool
290
+
291
+ # Or get just the ref ID
292
+ summary=$(npx arcfetch fetch https://example.com -o summary)
293
+ ref_id=$(echo "$summary" | cut -d'|' -f1)
188
294
 
189
- ### Force Playwright for JS-heavy sites
295
+ # Promote if it meets quality standards
296
+ if npx arcfetch promote "$ref_id"; then
297
+ echo "Successfully promoted $ref_id"
298
+ fi
299
+ ```
300
+
301
+ ### Handling JS-Heavy Sites
190
302
 
191
303
  ```bash
192
- arcfetch fetch https://spa-heavy-site.com --force-playwright --wait-strategy domcontentloaded
304
+ # Modern React/Vue/Angular apps
305
+ arcfetch fetch https://spa-example.com --force-playwright --wait-strategy networkidle
306
+
307
+ # Simple blogs (use faster strategy)
308
+ arcfetch fetch https://blog.example.com --wait-strategy load
309
+
310
+ # Unknown site (let arcfetch decide)
311
+ arcfetch fetch https://unknown-site.com
193
312
  ```
194
313
 
195
- ### Fetch and get JSON output
314
+ ### Bulk Fetching with JSON Output
196
315
 
197
316
  ```bash
198
- arcfetch fetch https://example.com -o json
317
+ # Fetch multiple URLs and parse JSON
318
+ for url in "${urls[@]}"; do
319
+ arcfetch fetch "$url" -o json >> results.json
320
+ done
321
+
322
+ # Or use jq to extract specific fields
323
+ arcfetch fetch https://example.com -o json | jq '.filepath'
324
+ ```
325
+
326
+ ## Troubleshooting
327
+
328
+ ### "Playwright not found" Error
329
+
330
+ **Problem:** Playwright fails to launch
331
+
332
+ **Solution:**
333
+ ```bash
334
+ # If using npm globally
335
+ npm install -g playwright
336
+
337
+ # If using npx (auto-installed)
338
+ npx arcfetch fetch https://example.com --force-playwright
339
+ ```
340
+
341
+ ### Low Quality Score
342
+
343
+ **Problem:** Content is rejected due to low quality
344
+
345
+ **Solution:**
346
+ ```bash
347
+ # Lower the threshold temporarily
348
+ arcfetch fetch https://example.com --min-quality 40
349
+
350
+ # Or force Playwright (often produces better results)
351
+ arcfetch fetch https://example.com --force-playwright
352
+ ```
353
+
354
+ ### Timeout on Slow Sites
355
+
356
+ **Problem:** Site takes too long to load
357
+
358
+ **Solution:**
359
+ ```bash
360
+ # Use faster wait strategy
361
+ arcfetch fetch https://example.com --wait-strategy load
362
+
363
+ # Combine with force-playwright for JS sites
364
+ arcfetch fetch https://example.com --force-playwright --wait-strategy domcontentloaded
365
+ ```
366
+
367
+ ### MCP Server Not Connecting
368
+
369
+ **Problem:** Claude Code can't connect to MCP server
370
+
371
+ **Solution:**
372
+ ```bash
373
+ # Test if the MCP server works manually
374
+ npx arcfetch fetch https://example.com
375
+
376
+ # Check your MCP config path
377
+ # macOS: ~/.config/claude-code/mcp_config.json
378
+ # Linux: ~/.config/claude-code/mcp_config.json
379
+ # Windows: %APPDATA%\claude-code\mcp_config.json
380
+ ```
381
+
382
+ ## Comparison
383
+
384
+ | Feature | arcfetch | html-to-markdown | url-to-markdown | playwright-extra |
385
+ |---------|----------|------------------|-----------------|------------------|
386
+ | Auto JS fallback | ✅ | ❌ | ❌ | Manual |
387
+ | Quality scoring | ✅ | ❌ | ❌ | ❌ |
388
+ | Temp → Docs workflow | ✅ | ❌ | ❌ | ❌ |
389
+ | MCP server | ✅ | ❌ | ❌ | ❌ |
390
+ | Multiple output formats | ✅ | ❌ | Some | ❌ |
391
+ | Zero-config | ✅ | ✅ | ✅ | ❌ |
392
+ | Playwright included | ✅ | ❌ | ❌ | Manual setup |
393
+
394
+ ## Architecture
395
+
199
396
  ```
397
+ ┌─────────────────────────────────────────────────────────┐
398
+ │ CLI / MCP Interface │
399
+ └─────────────────────────────────────────────────────────┘
400
+
401
+
402
+ ┌─────────────────────────────────────────────────────────┐
403
+ │ Core Pipeline │
404
+ │ 1. Simple HTTP Fetch │
405
+ │ 2. Extract with Readability + Turndown │
406
+ │ 3. Validate Quality Score │
407
+ │ 4. Conditional Playwright Retry │
408
+ │ 5. Cache with Frontmatter Metadata │
409
+ └─────────────────────────────────────────────────────────┘
410
+
411
+ ┌───────────────┼───────────────┐
412
+ ▼ ▼ ▼
413
+ ┌───────────┐ ┌───────────┐ ┌───────────┐
414
+ │ Cache │ │Playwright │ │ Validator │
415
+ │ Manager │ │ Manager │ │ │
416
+ └───────────┘ └───────────┘ └───────────┘
417
+ ```
418
+
419
+ ## Contributing
200
420
 
201
- ### Use in scripts
421
+ Contributions welcome! Please read our contributing guidelines and submit pull requests to the main branch.
422
+
423
+ ### Development Setup
202
424
 
203
425
  ```bash
204
- # Get just the ref ID and path
205
- result=$(arcfetch fetch https://example.com -o summary)
206
- ref_id=$(echo $result | cut -d'|' -f1)
207
- filepath=$(echo $result | cut -d'|' -f2)
426
+ git clone https://github.com/yourusername/arcfetch.git
427
+ cd arcfetch
428
+ bun install
429
+ bun test # Run tests
430
+ bun run typecheck # Type checking
208
431
  ```
209
432
 
210
433
  ## License
211
434
 
212
- MIT
435
+ MIT License - see LICENSE file for details