npm - arcfetch - Versions diffs - 0.0.1 → 1.1.0 - Mend

arcfetch 0.0.1 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15) hide show

package/README.md +308 -85
package/cli.ts +264 -15
package/index.ts +258 -13
package/package.json +12 -2
package/src/config/defaults.ts +1 -1
package/src/config/index.ts +3 -3
package/src/config/loader.ts +2 -2
package/src/core/cache.ts +116 -37
package/src/core/extractor.ts +1 -1
package/src/core/index.ts +4 -4
package/src/core/pipeline.ts +4 -4
package/src/core/playwright/index.ts +2 -2
package/src/core/playwright/local.ts +2 -2
package/src/core/playwright/manager.ts +3 -3
package/src/utils/version.ts +3 -3

package/README.md CHANGED Viewed

@@ -1,119 +1,185 @@
 # arcfetch
-Fetch URLs, extract clean article content, and cache as markdown. Supports automatic JavaScript rendering fallback via Playwright.
+[![npm version](https://badge.fury.io/js/arcfetch.svg)](https://www.npmjs.org/package/arcfetch)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+**Zero-config URL fetching** that converts web pages to clean markdown with automatic JavaScript rendering fallback.
+Perfect for AI workflows, research, and documentation. Fetches URLs, extracts article content using Mozilla Readability, and caches as markdown with 90-95% token reduction.
+## Why arcfetch?
+| Problem | Solution |
+|---------|----------|
+| **JS-heavy sites return blank** | Auto-detects and retries with Playwright |
+| **Too much HTML clutter** | Mozilla Readability extracts just the article |
+| **High token costs for LLMs** | 90-95% token reduction vs raw HTML |
+| **No good caching story** | Temp → Docs workflow for easy curation |
+| **Hard to integrate** | Works as CLI or MCP server with zero setup |
 ## Features
 - **Smart Fetching**: Simple HTTP first, automatic Playwright fallback for JS-heavy sites
-- **Quality Gates**: Configurable quality thresholds with automatic retry
+- **Quality Gates**: Configurable quality thresholds (0-100) with automatic retry
 - **Clean Markdown**: Mozilla Readability + Turndown for 90-95% token reduction
 - **Temp → Docs Workflow**: Cache to temp folder, promote to docs when ready
 - **CLI & MCP**: Available as command-line tool and MCP server
+- **Multiple Output Formats**: Plain text, JSON, filepath, or summary
+- **Configurable Thresholds**: Set quality minimums and retry strategies
+## Quick Start
+### No Installation Required (npx/bunx)
+```bash
+# Fetch and display markdown
+npx arcfetch fetch https://example.com/article
+# Get just the filepath (for scripts)
+npx arcfetch fetch https://example.com -o path
+# With pretty output
+bunx arcfetch fetch https://example.com --pretty
+```
-## Installation
+### Global Installation
 ```bash
-# For users
 npm install -g arcfetch
-# For development
+# Then use directly
+arcfetch fetch https://example.com/article
+```
+### Development
+```bash
+git clone https://github.com/yourusername/arcfetch.git
+cd arcfetch
 bun install
+bun run cli.ts fetch https://example.com
 ```
-## Quick Start
+## CLI Usage
-### CLI
+### Basic Commands
 ```bash
-# Fetch a URL
+# Fetch and display markdown (default output)
 arcfetch fetch https://example.com/article
-# List cached references
+# List all cached references
 arcfetch list
-# Promote to docs folder
+# Promote from temp to permanent docs
 arcfetch promote REF-001
-# Delete a reference
+# Delete a cached reference
 arcfetch delete REF-001
+# Show current configuration
+arcfetch config
 ```
-### MCP Server
+### Output Formats
-Add to your Claude Code MCP configuration:
+```bash
+# Plain text (LLM-friendly, default)
+arcfetch fetch https://example.com -o text
-```json
-{
-  "mcpServers": {
-    "arcfetch": {
-      "command": "bun",
-      "args": ["run", "/path/to/arcfetch/index.ts"]
-    }
-  }
-}
-```
+# Just the filepath (for scripts)
+arcfetch fetch https://example.com -o path
-## CLI Commands
+# Summary: REF-ID|filepath
+arcfetch fetch https://example.com -o summary
+# Structured JSON
+arcfetch fetch https://example.com -o json
-### fetch
+# Human-friendly with emojis
+arcfetch fetch https://example.com --pretty
+```
-Fetch URL and save to temp folder.
+### Advanced Options
 ```bash
-arcfetch fetch <url> [options]
+# Add search query as metadata
+arcfetch fetch https://example.com -q "machine learning"
-Options:
-  -q, --query <text>        Search query (saved as metadata)
-  -o, --output <format>     Output: text, json, summary (default: text)
-  -v, --verbose             Show detailed output
-  --pretty                  Human-friendly output with emojis
-  --min-quality <n>         Minimum quality score 0-100 (default: 60)
-  --temp-dir <path>         Temp folder (default: .tmp)
-  --docs-dir <path>         Docs folder (default: docs/ai/references)
-  --wait-strategy <mode>    Playwright wait strategy: networkidle, domcontentloaded, load
-  --force-playwright        Skip simple fetch and use Playwright directly
-```
+# Set minimum quality threshold (default: 60)
+arcfetch fetch https://example.com --min-quality 80
-### list
+# Force Playwright (skip simple fetch)
+arcfetch fetch https://example.com --force-playwright
-List all cached references.
+# Use faster wait strategy for simple sites
+arcfetch fetch https://example.com --wait-strategy load
-```bash
-arcfetch list [-o json]
+# Custom directories
+arcfetch fetch https://example.com --temp-dir .cache --docs-dir content
+# Verbose output for debugging
+arcfetch fetch https://example.com -v
 ```
-### promote
+## MCP Server
-Move reference from temp to docs folder.
+### Installation (Recommended: npx/bunx)
-```bash
-arcfetch promote <ref-id>
-```
+Add to your Claude Code MCP configuration (`~/.config/claude-code/mcp_config.json`):
-### delete
+```json
+{
+  "mcpServers": {
+    "arcfetch": {
+      "command": "npx",
+      "args": ["arcfetch"]
+    }
+  }
+}
+```
-Delete a cached reference.
+Or using bunx (faster):
-```bash
-arcfetch delete <ref-id>
+```json
+{
+  "mcpServers": {
+    "arcfetch": {
+      "command": "bunx",
+      "args": ["arcfetch"]
+    }
+  }
+}
 ```
-### config
+### Local Development
-Show current configuration.
-```bash
-arcfetch config
+```json
+{
+  "mcpServers": {
+    "arcfetch": {
+      "command": "bun",
+      "args": ["run", "/path/to/arcfetch/index.ts"],
+      "cwd": "/path/to/arcfetch"
+    }
+  }
+}
 ```
-## MCP Tools
+### MCP Tools
+| Tool | Parameters | Description |
+|------|------------|-------------|
+| `fetch_url` | `url`, `query?`, `minQuality?`, `forcePlaywright?` | Fetch URL with auto JS fallback |
+| `list_cached` | - | List all cached references |
+| `promote_reference` | `refId` | Move from temp to docs folder |
+| `delete_cached` | `refId` | Delete a cached reference |
-| Tool | Description |
-|------|-------------|
-| `fetch_url` | Fetch URL with auto JS fallback, save to temp |
-| `list_cached` | List all cached references |
-| `promote_reference` | Move from temp to docs folder |
-| `delete_cached` | Delete a cached reference |
+Example MCP usage:
+```
+User: Fetch https://example.com/article for me
+Claude: [Calls fetch_url tool]
+```
 ## Configuration
@@ -167,46 +233,203 @@ URL → Simple Fetch → Quality Check
 ## Playwright Wait Strategies
-| Strategy | Description |
-|----------|-------------|
-| `networkidle` | Wait until network is idle (slowest, most reliable) |
-| `domcontentloaded` | Wait until DOM is loaded (faster) |
-| `load` | Wait until page load event completes (fastest) |
+| Strategy | Speed | Reliability | Best For |
+|----------|-------|-------------|----------|
+| `networkidle` | Slowest | Highest | JS-heavy apps, dynamic content |
+| `domcontentloaded` | Medium | Medium | Most SPAs, modern sites |
+| `load` | Fastest | Basic | Static sites, simple pages |
-## File Structure
+## Quality Pipeline
 ```
-.tmp/                          # Temporary cache (default)
-  REF-001-article-title.md
-  REF-002-another-article.md
+URL → Simple Fetch → Quality Check (0-100)
+                         │
+         ┌───────────────┼───────────────┐
+         ▼               ▼               ▼
+     Score ≥ 85      60-84           < 60
+         │               │               │
+         ▼               ▼               ▼
+       Save        Try Playwright   Try Playwright
+                   use best         Score ≥ 60?
+                                    Yes → Save
+                                    No → Error
+```
+**Default thresholds:**
+- `minScore`: 60 - Content below this is rejected
+- `jsRetryThreshold`: 85 - Above this, skip Playwright entirely
+## Real-World Examples
+### AI Research Workflow
-docs/ai/references/            # Permanent docs (after promote)
-  REF-001-article-title.md
+```bash
+# Fetch multiple articles for research
+npx arcfetch fetch https://arxiv.org/abs/2301.00001 -q "LLM research"
+npx arcfetch fetch https://openai.com/research/gpt-4 -q "GPT-4"
+# List all fetched
+npx arcfetch list
+# Promote the best ones to docs
+npx arcfetch promote REF-001
+npx arcfetch promote REF-002
 ```
-## Examples
+### Script Integration
+```bash
+#!/bin/bash
+# fetch-and-process.sh
+# Fetch and get filepath
+filepath=$(npx arcfetch fetch https://example.com -o path)
+# Process with other tools
+cat "$filepath" | other-tool
+# Or get just the ref ID
+summary=$(npx arcfetch fetch https://example.com -o summary)
+ref_id=$(echo "$summary" | cut -d'|' -f1)
-### Force Playwright for JS-heavy sites
+# Promote if it meets quality standards
+if npx arcfetch promote "$ref_id"; then
+  echo "Successfully promoted $ref_id"
+fi
+```
+### Handling JS-Heavy Sites
 ```bash
-arcfetch fetch https://spa-heavy-site.com --force-playwright --wait-strategy domcontentloaded
+# Modern React/Vue/Angular apps
+arcfetch fetch https://spa-example.com --force-playwright --wait-strategy networkidle
+# Simple blogs (use faster strategy)
+arcfetch fetch https://blog.example.com --wait-strategy load
+# Unknown site (let arcfetch decide)
+arcfetch fetch https://unknown-site.com
 ```
-### Fetch and get JSON output
+### Bulk Fetching with JSON Output
 ```bash
-arcfetch fetch https://example.com -o json
+# Fetch multiple URLs and parse JSON
+for url in "${urls[@]}"; do
+  arcfetch fetch "$url" -o json >> results.json
+done
+# Or use jq to extract specific fields
+arcfetch fetch https://example.com -o json | jq '.filepath'
+```
+## Troubleshooting
+### "Playwright not found" Error
+**Problem:** Playwright fails to launch
+**Solution:**
+```bash
+# If using npm globally
+npm install -g playwright
+# If using npx (auto-installed)
+npx arcfetch fetch https://example.com --force-playwright
+```
+### Low Quality Score
+**Problem:** Content is rejected due to low quality
+**Solution:**
+```bash
+# Lower the threshold temporarily
+arcfetch fetch https://example.com --min-quality 40
+# Or force Playwright (often produces better results)
+arcfetch fetch https://example.com --force-playwright
+```
+### Timeout on Slow Sites
+**Problem:** Site takes too long to load
+**Solution:**
+```bash
+# Use faster wait strategy
+arcfetch fetch https://example.com --wait-strategy load
+# Combine with force-playwright for JS sites
+arcfetch fetch https://example.com --force-playwright --wait-strategy domcontentloaded
+```
+### MCP Server Not Connecting
+**Problem:** Claude Code can't connect to MCP server
+**Solution:**
+```bash
+# Test if the MCP server works manually
+npx arcfetch fetch https://example.com
+# Check your MCP config path
+# macOS: ~/.config/claude-code/mcp_config.json
+# Linux: ~/.config/claude-code/mcp_config.json
+# Windows: %APPDATA%\claude-code\mcp_config.json
+```
+## Comparison
+| Feature | arcfetch | html-to-markdown | url-to-markdown | playwright-extra |
+|---------|----------|------------------|-----------------|------------------|
+| Auto JS fallback | ✅ | ❌ | ❌ | Manual |
+| Quality scoring | ✅ | ❌ | ❌ | ❌ |
+| Temp → Docs workflow | ✅ | ❌ | ❌ | ❌ |
+| MCP server | ✅ | ❌ | ❌ | ❌ |
+| Multiple output formats | ✅ | ❌ | Some | ❌ |
+| Zero-config | ✅ | ✅ | ✅ | ❌ |
+| Playwright included | ✅ | ❌ | ❌ | Manual setup |
+## Architecture
 ```
+┌─────────────────────────────────────────────────────────┐
+│                     CLI / MCP Interface                  │
+└─────────────────────────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────┐
+│                    Core Pipeline                         │
+│  1. Simple HTTP Fetch                                    │
+│  2. Extract with Readability + Turndown                  │
+│  3. Validate Quality Score                               │
+│  4. Conditional Playwright Retry                         │
+│  5. Cache with Frontmatter Metadata                      │
+└─────────────────────────────────────────────────────────┘
+                            │
+            ┌───────────────┼───────────────┐
+            ▼               ▼               ▼
+      ┌───────────┐  ┌───────────┐  ┌───────────┐
+      │   Cache   │  │Playwright │  │ Validator │
+      │   Manager │  │  Manager  │  │           │
+      └───────────┘  └───────────┘  └───────────┘
+```
+## Contributing
-### Use in scripts
+Contributions welcome! Please read our contributing guidelines and submit pull requests to the main branch.
+### Development Setup
 ```bash
-# Get just the ref ID and path
-result=$(arcfetch fetch https://example.com -o summary)
-ref_id=$(echo $result | cut -d'|' -f1)
-filepath=$(echo $result | cut -d'|' -f2)
+git clone https://github.com/yourusername/arcfetch.git
+cd arcfetch
+bun install
+bun test          # Run tests
+bun run typecheck # Type checking
 ```
 ## License
-MIT
+MIT License - see LICENSE file for details