npm - doc-fetch-cli - Versions diffs - 1.0.2 - Mend

doc-fetch-cli 1.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (34)

package/README.md +193 -0
package/SECURITY.md +84 -0
package/bin/doc-fetch.js +37 -0
package/bin/install.js +171 -0
package/cmd/docfetch/main.go +54 -0
package/dist/doc_fetch-1.0.1-py3-none-any.whl +0 -0
package/dist/doc_fetch-1.0.1.tar.gz +0 -0
package/doc-fetch +0 -0
package/doc-fetch_darwin_amd64 +0 -0
package/doc-fetch_linux_amd64 +0 -0
package/doc-fetch_windows_amd64.exe +0 -0
package/doc_fetch/__init__.py +6 -0
package/doc_fetch/__main__.py +7 -0
package/doc_fetch/cli.py +113 -0
package/doc_fetch.egg-info/PKG-INFO +224 -0
package/doc_fetch.egg-info/SOURCES.txt +19 -0
package/doc_fetch.egg-info/dependency_links.txt +1 -0
package/doc_fetch.egg-info/entry_points.txt +2 -0
package/doc_fetch.egg-info/not-zip-safe +1 -0
package/doc_fetch.egg-info/top_level.txt +1 -0
package/docs/usage.md +67 -0
package/examples/golang-example.sh +12 -0
package/go.mod +11 -0
package/go.sum +38 -0
package/package.json +18 -0
package/pkg/fetcher/classifier.go +50 -0
package/pkg/fetcher/describer.go +61 -0
package/pkg/fetcher/fetcher.go +332 -0
package/pkg/fetcher/html2md.go +71 -0
package/pkg/fetcher/llmtxt.go +36 -0
package/pkg/fetcher/validator.go +109 -0
package/pkg/fetcher/writer.go +32 -0
package/pyproject.toml +37 -0
package/setup.py +158 -0

package/README.md ADDED

@@ -0,0 +1,193 @@
+# DocFetch - Dynamic Documentation Fetcher 📚
+**Transform entire documentation sites into AI-ready, single-file markdown with intelligent LLM.txt indexing**
+Most AIs can't navigate documentation like humans do. They can't scroll through sections, click sidebar links, or explore related pages. **DocFetch solves this fundamental problem** by converting entire documentation sites into comprehensive, clean markdown files that contain every section and piece of information in a format that LLMs love.
+## 🚀 Why DocFetch is Essential for AI Development
+### 🤖 **AI/LLM Optimization**
+- **Single-file consumption**: No more fragmented context across multiple pages
+- **Clean, structured markdown**: Perfect token efficiency for LLM context windows
+- **Intelligent LLM.txt generation**: AI-friendly index with semantic categorization
+- **Noise removal**: Automatically strips navigation, headers, footers, ads, and buttons
+### ⚡ **Developer Productivity**
+- **One command automation**: Replace hours of manual copy-pasting with a single CLI command
+- **Complete documentation access**: Give your AI agents full access to official documentation
+- **Consistent formatting**: Uniform structure across different documentation sites
+- **Version control friendly**: Markdown files work perfectly with Git
+### 🎯 **Smart Content Intelligence**
+- **Automatic page classification**: Identifies APIs, guides, references, and examples
+- **Semantic descriptions**: Generates concise, relevant descriptions for each section
+- **URL preservation**: Maintains original source links for verification
+- **Adaptive content extraction**: Works with diverse documentation site structures
+### 🔧 **Production Ready**
+- **Concurrent fetching**: Fast downloads with configurable concurrency
+- **Respectful crawling**: Honors robots.txt and includes rate limiting
+- **Cross-platform**: Works on Windows, macOS, and Linux
+- **Multiple installation options**: PyPI, NPM, Go install, or direct binary download
+## 📦 Installation
+### PyPI (Recommended for Python developers) ✨ NEW
+```bash
+pip install doc-fetch
+```
+### NPM (Recommended for JavaScript/Node.js developers)
+```bash
+npm install -g doc-fetch
+```
+### Go (For Go developers)
+```bash
+go install github.com/AlphaTechini/doc-fetch/cmd/docfetch@latest
+```
+### Direct Binary Download
+Visit [Releases](https://github.com/AlphaTechini/doc-fetch/releases) and download your platform's binary.
+## 🎯 Usage
+### Basic Usage
+```bash
+# Fetch entire documentation site to single markdown file
+doc-fetch --url https://golang.org/doc/ --output ./docs/golang-full.md
+# With LLM.txt generation for AI optimization
+doc-fetch --url https://react.dev/learn --output docs.md --llm-txt
+```
+### Advanced Usage
+```bash
+# Comprehensive documentation fetch with all features
+doc-fetch \
+  --url https://docs.example.com \
+  --output ./internal/docs.md \
+  --depth 4 \
+  --concurrent 10 \
+  --llm-txt \
+  --user-agent "MyBot/1.0"
+```
+### Command Options
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--url` | Base URL to fetch documentation from | **Required** |
+| `--output` | Output file path | `docs.md` |
+| `--depth` | Maximum crawl depth | `2` |
+| `--concurrent` | Number of concurrent fetchers | `3` |
+| `--llm-txt` | Generate AI-friendly llm.txt index | `false` |
+| `--user-agent` | Custom user agent string | `DocFetch/1.0` |
+## 📁 Output Files
+When using `--llm-txt`, DocFetch generates two files:
+### `docs.md` - Complete Documentation
+```markdown
+# Documentation
+This file contains documentation fetched by DocFetch.
+---
+## Getting Started
+This guide covers installation, setup, and first program...
+---
+## Language Specification
+Complete Go language specification and syntax...
+```
+### `docs.llm.txt` - AI-Friendly Index
+```txt
+# llm.txt - AI-friendly documentation index
+[GUIDE] Getting Started
+https://golang.org/doc/install
+Covers installation, setup, and first program.
+[REFERENCE] Language Specification
+https://golang.org/ref/spec
+Complete Go language specification and syntax.
+[API] net/http
+https://pkg.go.dev/net/http
+HTTP client/server implementation.
+```
+## 🌟 Real-World Examples
+### Fetch Go Documentation
+```bash
+doc-fetch --url https://golang.org/doc/ --output ./docs/go-documentation.md --depth 4 --llm-txt
+```
+### Fetch React Documentation
+```bash
+doc-fetch --url https://react.dev/learn --output ./docs/react-learn.md --concurrent 10 --llm-txt
+```
+### Fetch Your Own Project Docs
+```bash
+doc-fetch --url https://your-project.com/docs/ --output ./internal/docs.md --llm-txt
+```
+## 🤖 How LLM.txt Supercharges Your AI
+The generated `llm.txt` file acts as a **semantic roadmap** for your AI agents:
+1. **Precise Navigation**: Agents can query specific sections without scanning entire documents
+2. **Context Awareness**: Know whether they're looking at an API reference vs. a tutorial
+3. **Efficient Retrieval**: Jump directly to relevant content based on query intent
+4. **Source Verification**: Always maintain links back to original documentation
+**Example AI Prompt Enhancement:**
+```
+Instead of: "What does the net/http package do?"
+Your AI can now: "Check the [API] net/http section in llm.txt for HTTP client/server implementation details"
+```
+## 🏗️ How It Works
+1. **Link Discovery**: Parses the base URL to find all internal documentation links
+2. **Content Fetching**: Downloads all pages concurrently with respect for robots.txt
+3. **HTML Cleaning**: Removes non-content elements (navigation, headers, footers, etc.)
+4. **Markdown Conversion**: Converts cleaned HTML to structured markdown
+5. **Intelligent Classification**: Categorizes pages as API, GUIDE, REFERENCE, or EXAMPLE
+6. **Description Generation**: Creates concise, relevant descriptions for each section
+7. **Single File Output**: Combines all documentation into one comprehensive file
+8. **LLM.txt Generation**: Creates AI-friendly index with semantic categorization
+## 🚀 Future Features
+- **Incremental updates**: Only fetch changed pages on subsequent runs
+- **Custom selectors**: Allow users to specify content areas for different sites
+- **Multiple formats**: Support PDF, JSON, and other output formats
+- **Token counting**: Estimate token usage for LLM context planning
+- **Advanced classification**: Machine learning-based page type detection
+## 💡 Why This Exists
+Traditional documentation sites are designed for **human navigation**, not **AI consumption**. When working with LLMs, you often need to manually copy-paste multiple sections or provide incomplete context. DocFetch automates this process, giving your AI agents complete access to documentation without the manual overhead.
+**Stop wasting time copying documentation. Start building AI agents with complete knowledge.**
+## 🤝 Contributing
+Contributions are welcome! Please open an issue or pull request on GitHub.
+## 📄 License
+MIT License
+---
+**Built with ❤️ for AI developers who deserve better documentation access**
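
The `docs.llm.txt` sample in the README suggests a simple record layout: a `[CATEGORY] Title` line, then a URL line, then a description. The sketch below parses that layout in Python. It assumes the generated file follows the README sample exactly; the names `IndexEntry` and `parse_llm_txt` are hypothetical and not part of the package.

```python
import re
from dataclasses import dataclass

# Matches header lines such as "[API] net/http"
ENTRY_HEADER = re.compile(r"^\[([A-Z]+)\]\s+(.+)$")


@dataclass
class IndexEntry:
    category: str
    title: str
    url: str = ""
    description: str = ""


def parse_llm_txt(text):
    """Parse the llm.txt layout shown in the README into IndexEntry objects."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the leading comment line
        header = ENTRY_HEADER.match(line)
        if header:
            current = IndexEntry(category=header.group(1), title=header.group(2))
            entries.append(current)
        elif current is None:
            continue  # text before the first entry is ignored
        elif not current.url:
            current.url = line  # first line after the header is the URL
        else:
            # any further lines are treated as description text
            current.description = (current.description + " " + line).strip()
    return entries


# Sample copied from the README's docs.llm.txt example
SAMPLE = """\
# llm.txt - AI-friendly documentation index
[GUIDE] Getting Started
https://golang.org/doc/install
Covers installation, setup, and first program.
[REFERENCE] Language Specification
https://golang.org/ref/spec
Complete Go language specification and syntax.
[API] net/http
https://pkg.go.dev/net/http
HTTP client/server implementation.
"""

for entry in parse_llm_txt(SAMPLE):
    print(entry.category, entry.title, entry.url)
```

An agent could filter the parsed entries by `category` before deciding which section of the combined markdown file to load.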

package/SECURITY.md ADDED

@@ -0,0 +1,84 @@
+# Security Policy
+## Security Features
+DocFetch includes several built-in security protections:
+### ✅ Path Traversal Protection
+- Output files can only be written within the current working directory
+- Paths containing parent-directory segments (`../`) are blocked
+- Absolute paths outside the current directory are rejected
+### ✅ SSRF (Server-Side Request Forgery) Protection
+- Only HTTP/HTTPS URLs are allowed
+- Private IP addresses (192.168.x.x, 10.x.x.x, etc.) are blocked
+- Localhost and loopback addresses are blocked
+- Internal network access is prevented
+### ✅ Rate Limiting
+- Maximum 10 requests per second to avoid overwhelming servers
+- Respectful crawling behavior
+### ✅ Input Validation
+- URL validation and sanitization
+- Output path validation
+- Parameter bounds checking (max depth: 10, max workers: 20)
+### ✅ Content Safety
+- HTML content is cleaned of scripts and dangerous elements
+- XSS patterns are filtered out
+- Only safe markdown is generated
+## Safe Usage Guidelines
+### Command Line Usage
+```bash
+# ✅ SAFE - relative path in current directory
+doc-fetch --url https://example.com --output docs.md
+# ✅ SAFE - subdirectory in current directory
+doc-fetch --url https://example.com --output ./docs/site.md
+# ❌ BLOCKED - path traversal attempt
+doc-fetch --url https://example.com --output ../../etc/passwd
+# ❌ BLOCKED - absolute path outside current directory
+doc-fetch --url https://example.com --output /tmp/malicious.md
+```
+### URL Restrictions
+```bash
+# ✅ SAFE - public HTTPS site
+doc-fetch --url https://golang.org/doc/ --output docs.md
+# ❌ BLOCKED - private IP address
+doc-fetch --url http://192.168.1.1/admin --output docs.md
+# ❌ BLOCKED - localhost
+doc-fetch --url http://localhost:8080/api --output docs.md
+# ❌ BLOCKED - non-HTTP protocol
+doc-fetch --url file:///etc/passwd --output docs.md
+```
+## Reporting Security Issues
+If you discover a security vulnerability in DocFetch, please:
+1. **Do not disclose publicly** until it's been addressed
+2. Contact the maintainer directly at [your email]
+3. Provide detailed reproduction steps
+4. Allow reasonable time for patch development
+## Security Updates
+Security patches will be released as soon as possible after vulnerability confirmation. Users are encouraged to keep DocFetch updated to the latest version.
+## Dependencies Security
+DocFetch uses the following dependencies with known security track records:
+- `github.com/PuerkitoBio/goquery` - HTML parsing
+- `github.com/yuin/goldmark` - Markdown processing
+- Standard Go libraries (`net/http`, `sync`, etc.)
+All dependencies are regularly audited and kept up-to-date.
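
The URL restrictions described in SECURITY.md (HTTP/HTTPS only, no private, loopback, or localhost targets) can be sketched with Python's standard `ipaddress` module, which classifies address ranges without hand-written prefix lists. This is an illustration under stated assumptions, not a reproduction of the package's own validator; the function name `is_blocked_url` is hypothetical. The sketch only inspects literal addresses, so a DNS name that resolves to a private address is not caught here.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_url(url):
    """Return True when the URL should be rejected under the rules above."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return True  # e.g. file:// or ftp://
    host = parts.hostname  # lowercased, IPv6 brackets already removed
    if not host:
        return True
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: a DNS name. Resolving it is out of scope here.
        return False
    return addr.is_private or addr.is_loopback or addr.is_link_local


for candidate in (
    "https://golang.org/doc/",
    "http://192.168.1.1/admin",
    "http://localhost:8080/api",
    "file:///etc/passwd",
    "http://[::1]/",
):
    print(candidate, "blocked" if is_blocked_url(candidate) else "allowed")
```

Parsing the host as an address avoids two pitfalls of string-prefix matching: a public name such as `10.example.com` is not rejected by accident, and bracketed IPv6 literals are still recognised.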

package/bin/doc-fetch.js ADDED

@@ -0,0 +1,37 @@
+#!/usr/bin/env node
+const { spawn } = require('child_process');
+const path = require('path');
+const os = require('os');
+// Determine binary path based on platform
+const binDir = path.join(__dirname, '..');
+const binaryName = os.platform() === 'win32' ? 'doc-fetch.exe' : 'doc-fetch';
+const binaryPath = path.join(binDir, binaryName);
+// Check if binary exists
+if (!require('fs').existsSync(binaryPath)) {
+  console.error('❌ doc-fetch binary not found!');
+  console.error('💡 Please run: npm install doc-fetch');
+  process.exit(1);
+}
+const args = process.argv.slice(2);
+// Spawn the Go binary
+const child = spawn(binaryPath, args, {
+  stdio: 'inherit'
+});
+child.on('error', (err) => {
+  if (err.code === 'ENOENT') {
+    console.error('❌ doc-fetch binary not found!');
+    console.error('💡 Please run: npm install doc-fetch');
+  } else {
+    console.error('❌ Failed to start doc-fetch:', err.message);
+  }
+  process.exit(1);
+});
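+child.on('exit', (code) => {
+  // A null code means the child was killed by a signal; report failure rather than 0
+  process.exit(code === null ? 1 : code);
+});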
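
A wrapper that forwards its child's exit status also has to decide what to report when the child is killed by a signal and therefore has no exit code of its own. The sketch below shows one common convention (128 plus the signal number, as POSIX shells report it) using Python's `subprocess`, where a negative `returncode` means termination by signal. The helper names `wrapper_exit_code` and `run_and_forward` are hypothetical and not part of the package.

```python
import subprocess
import sys


def wrapper_exit_code(returncode):
    """Map a subprocess return code to the status a wrapper should exit with."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N
        return 128 + (-returncode)
    return returncode


def run_and_forward(argv):
    """Run argv, wait for it, and return the status to pass to sys.exit()."""
    completed = subprocess.run(argv, check=False)
    return wrapper_exit_code(completed.returncode)


if __name__ == "__main__":
    # Demo: run a child that exits with status 7 and forward that status
    status = run_and_forward([sys.executable, "-c", "import sys; sys.exit(7)"])
    print("child status:", status)
```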

package/bin/install.js ADDED

@@ -0,0 +1,171 @@
+#!/usr/bin/env node
+const os = require('os');
+const path = require('path');
+const { execSync } = require('child_process');
+const fs = require('fs');
+const https = require('https');
+const { pipeline } = require('stream');
+const { promisify } = require('util');
+const finished = promisify(pipeline);
+// Security: Validate and sanitize output paths
+function validateOutputPath(filePath) {
+    // Resolve to absolute path
+    const absPath = path.resolve(filePath);
+    // Get current working directory
+    const cwd = process.cwd();
+    // Compare path segments rather than string prefixes, so a sibling
+    // directory such as "<cwd>-other" is not mistaken for a subdirectory
+    const rel = path.relative(cwd, absPath);
+    if (rel === '..' || rel.startsWith('..' + path.sep) || path.isAbsolute(rel)) {
+        throw new Error('Output path must be within current directory or subdirectories');
+    }
+    // Block parent-directory segments and home-directory shorthand
+    if (filePath.split(/[\\/]/).includes('..') || filePath.startsWith('~')) {
+        throw new Error('Paths containing ".." segments or starting with "~" are not allowed');
+    }
+    return absPath;
+}
+// Security: Validate URL
+function validateURL(urlStr) {
+    try {
+        const url = new URL(urlStr);
+        // Only allow HTTP/HTTPS
+        if (url.protocol !== 'http:' && url.protocol !== 'https:') {
+            throw new Error('Only HTTP and HTTPS URLs are allowed');
+        }
+        // Check for private IP ranges (basic SSRF protection)
+        const hostname = url.hostname.toLowerCase();
+        const privatePatterns = [
+            '127.',      // localhost
+            '192.168.',  // private network
+            '10.',       // private network
+            '172.16.', '172.17.', '172.18.', '172.19.',
+            '172.20.', '172.21.', '172.22.', '172.23.',
+            '172.24.', '172.25.', '172.26.', '172.27.',
+            '172.28.', '172.29.', '172.30.', '172.31.',
+            'localhost',
+            '[::1]'      // IPv6 localhost (URL.hostname keeps the brackets)
+        ];
+        for (const pattern of privatePatterns) {
+            if (hostname.startsWith(pattern)) {
+                throw new Error('Private/internal URLs are not allowed');
+            }
+        }
+        return true;
+    } catch (error) {
+        throw new Error(`Invalid URL: ${error.message}`);
+    }
+}
+async function getBinaryUrl() {
+    const platform = os.platform();
+    const arch = os.arch();
+    // Map to Go build targets
+    let goos, goarch;
+    switch(platform) {
+        case 'win32': goos = 'windows'; break;
+        case 'darwin': goos = 'darwin'; break;
+        default: goos = 'linux';
+    }
+    switch(arch) {
+        case 'x64': goarch = 'amd64'; break;
+        case 'arm64': goarch = 'arm64'; break;
+        default: goarch = 'amd64';
+    }
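+    // Windows binaries carry an .exe suffix (e.g. doc-fetch_windows_amd64.exe)
+    const ext = platform === 'win32' ? '.exe' : '';
+    return {
+        url: `https://github.com/AlphaTechini/doc-fetch/releases/download/v1.0.0/doc-fetch_${goos}_${goarch}${ext}`,
+        filename: platform === 'win32' ? 'doc-fetch.exe' : 'doc-fetch'
+    };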
+}
+async function downloadBinary() {
+    const binDir = path.join(__dirname, '..');
+    if (!fs.existsSync(binDir)) {
+        fs.mkdirSync(binDir, { recursive: true });
+    }
+    const { url, filename } = await getBinaryUrl();
+    // Validate the download URL
+    validateURL(url);
+    // Validate and sanitize the binary path
+    const binaryPath = validateOutputPath(path.join(binDir, filename));
+    console.log('📥 Downloading doc-fetch binary...');
+    console.log(`   Platform: ${os.platform()} ${os.arch()}`);
+    console.log(`   URL: ${url}`);
+    try {
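+        // GitHub release URLs answer with a redirect to the asset host, and
+        // https.get() does not follow redirects on its own
+        const get = (target, redirectsLeft) => new Promise((resolve, reject) => {
+            const req = https.get(target, (res) => {
+                const { statusCode, headers } = res;
+                if (statusCode >= 300 && statusCode < 400 && headers.location && redirectsLeft > 0) {
+                    res.resume(); // discard the redirect body
+                    resolve(get(new URL(headers.location, target).toString(), redirectsLeft - 1));
+                } else if (statusCode === 404) {
+                    reject(new Error('Binary not found for your platform'));
+                } else if (statusCode !== 200) {
+                    reject(new Error(`HTTP ${statusCode}: ${res.statusMessage}`));
+                } else {
+                    resolve(res);
+                }
+            }).on('error', reject);
+            // Set timeout for security
+            req.setTimeout(30000, () => {
+                req.destroy(new Error('Download timeout'));
+            });
+        });
+        const response = await get(url, 5);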
+        // Create write stream with secure permissions
+        const writeStream = fs.createWriteStream(binaryPath, { mode: 0o700 });
+        await finished(response, writeStream);
+        console.log('✅ Binary downloaded successfully!');
+        return true;
+    } catch (error) {
+        console.error('❌ Failed to download binary:', error.message);
+        console.log('🔄 Falling back to Go build from source...');
+        // Fallback: build from source if Go is available
+        try {
+            execSync('go version', { stdio: 'pipe' });
+            console.log('🏗️  Building from source...');
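+            // Validate build path (reuse the platform-specific filename so Windows gets doc-fetch.exe)
+            const buildPath = validateOutputPath(path.join(binDir, filename));
+            // Quote the path so install locations containing spaces still work
+            execSync(`go build -o "${buildPath}" ./cmd/docfetch`, {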
+                stdio: 'inherit',
+                cwd: path.join(__dirname, '..')
+            });
+            console.log('✅ Built successfully from source!');
+            return true;
+        } catch (buildError) {
+            console.error('❌ Go not found or build failed.');
+            console.error('💡 Please install Go (https://golang.org/dl/) or download the binary manually.');
+            return false;
+        }
+    }
+}
+// Run the installation
+(async () => {
+    try {
+        const success = await downloadBinary();
+        if (!success) {
+            process.exit(1);
+        }
+    } catch (error) {
+        console.error('Installation failed:', error.message);
+        process.exit(1);
+    }
+})();
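
A string-prefix comparison is a fragile way to decide whether one path lies inside another: `/srv/project-evil` starts with `/srv/project` yet is a sibling, not a child. The sketch below shows a segment-aware containment check with Python's `pathlib`. The base directory `/srv/project` and the function name `is_within` are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def is_within(base, candidate):
    """Return True if candidate, resolved relative to base, stays inside base."""
    base_resolved = Path(base).resolve()
    # Joining with an absolute candidate discards base, which is what we want:
    # absolute paths are then judged on their own location.
    target = (base_resolved / candidate).resolve()
    return target == base_resolved or base_resolved in target.parents


BASE = "/srv/project"  # hypothetical working directory

for candidate in (
    "docs.md",
    "./docs/site.md",
    "../../etc/passwd",
    "/tmp/malicious.md",
    "/srv/project-evil/x",
):
    verdict = "allowed" if is_within(BASE, candidate) else "blocked"
    print(candidate, verdict)
```

Because the comparison works on whole path components, the sibling directory case is rejected even though its string form shares the base directory's prefix.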

package/cmd/docfetch/main.go ADDED

@@ -0,0 +1,54 @@
+package main
+import (
+	"flag"
+	"log"
+	"strings"
+	"github.com/AlphaTechini/doc-fetch/pkg/fetcher"
+)
+func main() {
+	url := flag.String("url", "", "Base URL to fetch documentation from")
+	output := flag.String("output", "docs.md", "Output file path")
+	depth := flag.Int("depth", 2, "Maximum crawl depth")
+	concurrent := flag.Int("concurrent", 3, "Concurrent fetchers")
+	userAgent := flag.String("user-agent", "DocFetch/1.0", "Custom user agent")
+	llmTxt := flag.Bool("llm-txt", false, "Generate llm.txt index file")
+	flag.Parse()
+	if *url == "" {
+		log.Fatal("Error: URL is required\nUsage: doc-fetch --url <base-url> --output <file-path>")
+	}
+	config := fetcher.Config{
+		BaseURL:        *url,
+		OutputPath:     *output,
+		MaxDepth:       *depth,
+		Workers:        *concurrent,
+		UserAgent:      *userAgent,
+		GenerateLLMTxt: *llmTxt,
+	}
+	// Validate configuration for security
+	if err := fetcher.ValidateConfig(&config); err != nil {
+		log.Fatalf("Configuration error: %v", err)
+	}
+	err := fetcher.Run(config)
+	if err != nil {
+		log.Fatalf("Failed to fetch documentation: %v", err)
+	}
+	log.Printf("Documentation successfully saved to %s", *output)
+	if *llmTxt {
+		// TrimSuffix leaves the path unchanged when it does not end in ".md"
+		llmTxtPath := strings.TrimSuffix(*output, ".md") + ".llm.txt"
+		log.Printf("LLM.txt index generated: %s", llmTxtPath)
+	}
+}
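
The index path that `main.go` logs is derived from `--output`: a trailing `.md` is swapped for `.llm.txt`, and any other name simply gets `.llm.txt` appended. A minimal Python sketch of that rule follows; the function name `llm_txt_path` is hypothetical, and it mirrors only the logging logic shown here, not necessarily the path the fetcher writes to.

```python
def llm_txt_path(output):
    """Derive the llm.txt index path from the markdown output path."""
    suffix = ".md"
    if output.endswith(suffix):
        return output[: -len(suffix)] + ".llm.txt"
    return output + ".llm.txt"


for path in ("docs.md", "./docs/golang-full.md", "notes"):
    print(path, "->", llm_txt_path(path))
```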

package/dist/doc_fetch-1.0.1-py3-none-any.whl ADDED

Binary file

package/dist/doc_fetch-1.0.1.tar.gz ADDED

Binary file

package/doc-fetch ADDED

Binary file

package/doc-fetch_darwin_amd64 ADDED

Binary file

package/doc-fetch_linux_amd64 ADDED

Binary file

package/doc-fetch_windows_amd64.exe ADDED

Binary file

package/doc_fetch/__init__.py ADDED

@@ -0,0 +1,6 @@
+"""
+DocFetch - Dynamic documentation fetching CLI for AI/LLM consumption.
+This package provides a Python wrapper around the Go-based DocFetch binary,
+enabling easy installation and usage via pip.
+"""

package/doc_fetch/__main__.py ADDED

@@ -0,0 +1,7 @@
+"""
+Module entry point for doc-fetch.
+"""
+from .cli import main
+if __name__ == "__main__":
+    main()

package/doc_fetch/cli.py ADDED

@@ -0,0 +1,113 @@
+#!/usr/bin/env python3
+"""
+DocFetch CLI wrapper for Python.
+This module provides a Python interface to the Go-based DocFetch binary.
+It handles downloading the appropriate binary for your platform and
+executing it with the provided arguments.
+"""
+import os
+import sys
+import subprocess
+import platform
+from pathlib import Path
+# Get the directory where this script is located
+SCRIPT_DIR = Path(__file__).parent
+BIN_DIR = SCRIPT_DIR / "bin"
+BINARY_NAME = None
+def get_binary_name():
+    """Get the appropriate binary name for the current platform."""
+    system = platform.system().lower()
+    machine = platform.machine().lower()
+    # Map machine architectures
+    arch_map = {
+        'x86_64': 'amd64',
+        'amd64': 'amd64',
+        'arm64': 'arm64',
+        'aarch64': 'arm64'
+    }
+    arch = arch_map.get(machine, 'amd64')
+    if system == 'windows':
+        return f'doc-fetch_windows_{arch}.exe'
+    elif system == 'darwin':
+        return f'doc-fetch_darwin_{arch}'
+    else:  # linux and others
+        return f'doc-fetch_linux_{arch}'
+def download_binary():
+    """Download the appropriate binary from GitHub releases."""
+    import urllib.request
+    import ssl
+    binary_name = get_binary_name()
+    binary_path = BIN_DIR / binary_name
+    # Create bin directory if it doesn't exist
+    BIN_DIR.mkdir(exist_ok=True)
+    # URL for the binary
+    url = f"https://github.com/AlphaTechini/doc-fetch/releases/download/v1.0.0/{binary_name}"
+    print(f"📥 Downloading doc-fetch binary for {platform.system()} {platform.machine()}...")
+    print(f"   URL: {url}")
+    try:
+        # Create SSL context to handle certificates
+        ssl_context = ssl.create_default_context()
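+        # Download to a temporary name first so an interrupted write
+        # never leaves a truncated file at binary_path
+        tmp_path = binary_path.with_name(binary_path.name + ".part")
+        with urllib.request.urlopen(url, context=ssl_context) as response:
+            with open(tmp_path, 'wb') as f:
+                f.write(response.read())
+        os.replace(tmp_path, binary_path)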
+        # Make executable on Unix-like systems
+        if platform.system() != 'Windows':
+            os.chmod(binary_path, 0o755)
+        print("✅ Binary downloaded successfully!")
+        return binary_path
+    except Exception as e:
+        print(f"❌ Failed to download binary: {e}")
+        print("💡 Please ensure you have internet access and can reach GitHub.")
+        sys.exit(1)
+def main():
+    """Main entry point for the doc-fetch CLI."""
+    # Get binary path
+    binary_name = get_binary_name()
+    binary_path = BIN_DIR / binary_name
+    # Download binary if it doesn't exist
+    if not binary_path.exists():
+        binary_path = download_binary()
+    # Execute the binary with all arguments
+    try:
+        result = subprocess.run([str(binary_path)] + sys.argv[1:], check=False)
+        sys.exit(result.returncode)
+    except FileNotFoundError:
+        print("❌ doc-fetch binary not found!")
+        print("💡 This shouldn't happen. Please reinstall the package.")
+        sys.exit(1)
+    except KeyboardInterrupt:
+        print("\n⚠️  Interrupted by user")
+        sys.exit(130)
+    except Exception as e:
+        print(f"❌ Failed to execute doc-fetch: {e}")
+        sys.exit(1)
+if __name__ == "__main__":
+    main()
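
A download helper like `download_binary` can stream its data in chunks and move the file into place only once it is complete, so a later run never finds a half-written binary. The sketch below shows that pattern with the standard library only. The function name `save_atomically` is hypothetical, and the chunk source is any iterable of bytes (for a real download it could be repeated `response.read(65536)` calls).

```python
import os
import tempfile
from pathlib import Path


def save_atomically(chunks, dest, mode=0o755):
    """Write chunks to a temp file beside dest, then rename it into place."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=dest.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.chmod(tmp_name, mode)
        os.replace(tmp_name, dest)  # dest appears only when fully written
    except BaseException:
        # Remove the partial temp file on any failure, then re-raise
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return dest


if __name__ == "__main__":
    target_dir = tempfile.mkdtemp()
    saved = save_atomically([b"hello ", b"world"], os.path.join(target_dir, "bin", "tool"))
    print(saved, saved.stat().st_size, "bytes")
```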