@techwavedev/agi-agent-kit 1.1.3

Files changed (196)

package/templates/skills/core/webcrawler/SKILL.md ADDED

@@ -0,0 +1,292 @@
+---
+name: webcrawler
+description: "Documentation harvesting agent for crawling and extracting content from documentation websites. Use for crawling documentation sites and extracting all pages about a subject, building offline knowledge bases from online docs, harvesting API references, tutorials, or guides from documentation portals, creating structured markdown exports from multi-page documentation, and downloading and organizing technical docs for embedding or RAG pipelines. Supports recursive crawling with depth control, content filtering, and structured output."
+---
+# Webcrawler Skill
+Intelligent documentation harvesting agent that recursively crawls documentation websites and extracts structured content about specific subjects.
+> **Last Updated:** 2026-01-23
+---
+## Quick Start
+```bash
+# Crawl Python documentation about async/await
+python skills/webcrawler/scripts/crawl_docs.py \
+  --url "https://docs.python.org/3/library/asyncio.html" \
+  --subject "asyncio" \
+  --depth 2 \
+  --output .tmp/docs/python-asyncio/
+# Crawl React documentation
+python skills/webcrawler/scripts/crawl_docs.py \
+  --url "https://react.dev/" \
+  --subject "React" \
+  --depth 3 \
+  --output .tmp/docs/react/
+# Extract only API reference pages
+python skills/webcrawler/scripts/crawl_docs.py \
+  --url "https://expressjs.com/en/4x/api.html" \
+  --subject "Express API" \
+  --filter "api" \
+  --output .tmp/docs/express-api/
+```
+---
+## Core Workflow
+1. **Initialize Crawl** — Provide base URL and subject focus
+2. **Discover Pages** — Recursively find all linked documentation pages
+3. **Filter Content** — Keep only pages matching the subject criteria
+4. **Extract Content** — Convert HTML to clean markdown
+5. **Organize Output** — Structure files in a navigable hierarchy
+6. **Generate Index** — Create a master index with all harvested pages
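
The discovery step above could be sketched as a breadth-first traversal with a depth cap and a page cap. This is only an illustration under stated assumptions, not the code in `crawl_docs.py`: the names `discover_pages` and `get_links` are hypothetical, and the link graph is supplied by the caller so the sketch runs without network access.

```python
from collections import deque
from typing import Callable, Iterable


def discover_pages(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 100,
) -> list[str]:
    """Breadth-first page discovery with a depth cap and a page cap."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:  # skip pages that are already queued or visited
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real HTTP fetches (hypothetical data).
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": [],
    "/a/1": ["/a/1/x"],
}
pages = discover_pages("/", lambda u: site.get(u, []), max_depth=2)
print(pages)
```

Breadth-first order means a low `--depth` value yields the top-level sections first, which matches the "start shallow" advice.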
+---
+## Scripts
+### `crawl_docs.py` — Main Documentation Crawler
+The primary crawling script that handles recursive page discovery and content extraction.
+```bash
+python skills/webcrawler/scripts/crawl_docs.py \
+  --url <base-url>           # Starting URL (required)
+  --subject <topic>          # Subject focus for filtering (required)
+  --output <directory>       # Output directory (default: .tmp/crawled/)
+  --depth <n>                # Max crawl depth (default: 2)
+  --filter <pattern>         # URL path filter pattern (optional)
+  --delay <seconds>          # Delay between requests (default: 0.5)
+  --max-pages <n>            # Maximum pages to crawl (default: 100)
+  --same-domain              # Stay within same domain (default: true)
+  --include-code             # Preserve code blocks (default: true)
+  --format <md|json|both>    # Output format (default: both)
+```
+**Outputs:**
+- `index.md` — Master index with links to all pages
+- `pages/*.md` — Individual markdown files per page
+- `metadata.json` — Crawl metadata and page inventory
+- `content.json` — Structured JSON with all extracted content
+### `extract_page.py` — Single Page Extractor
+Extract content from a single documentation page.
+```bash
+python skills/webcrawler/scripts/extract_page.py \
+  --url <page-url>           # Page to extract (required)
+  --output <file>            # Output file (default: stdout)
+  --format <md|json>         # Output format (default: md)
+  --include-links            # Include internal links (default: true)
+```
+### `filter_docs.py` — Post-Crawl Filtering
+Filter already-crawled documentation by subject or pattern.
+```bash
+python skills/webcrawler/scripts/filter_docs.py \
+  --input <crawl-dir>        # Crawled docs directory (required)
+  --subject <topic>          # Subject to filter for (required)
+  --output <directory>       # Filtered output directory (required)
+  --threshold <0.0-1.0>      # Relevance threshold (default: 0.3)
+```
+---
+## Configuration
+### Rate Limiting & Politeness
+The crawler respects `robots.txt` and implements polite crawling:
+- **Default delay**: 0.5s between requests
+- **User-Agent**: Identifies itself as a documentation harvester
+- **robots.txt**: Honored by default (disable with `--ignore-robots`)
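
As a rough sketch of what honoring `robots.txt` involves, the standard library's `urllib.robotparser` can answer whether a URL may be fetched and which delay the site requests. The rules are parsed from an inline string here so the example runs offline. The user-agent string `docs-harvester` and the sample rules are assumptions made for illustration, not values taken from the crawler.

```python
from urllib.robotparser import RobotFileParser

# Sample rules, inlined so the sketch needs no network access (hypothetical content).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "docs-harvester"  # ASSUMPTION: placeholder user-agent name
DEFAULT_DELAY = 0.5            # the documented default delay in seconds

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

allowed = rp.can_fetch(USER_AGENT, "https://example.com/docs/intro")
blocked = rp.can_fetch(USER_AGENT, "https://example.com/private/keys")
# Prefer the site's Crawl-delay when it sets one, else fall back to the default.
delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
print(allowed, blocked, delay)
```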
+### Domain Handling
+| Mode                 | Behavior                                     |
+| -------------------- | -------------------------------------------- |
+| `--same-domain`      | Only crawl pages on the starting domain      |
+| `--same-path`        | Only crawl pages under the starting URL path |
+| `--allow-subdomains` | Include subdomains (e.g., api.example.com)   |
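
One way to picture the three modes is as a scope check applied to every discovered link. The sketch below uses only `urllib.parse`. The function name `in_scope` and the `mode` strings are hypothetical, and the same-path branch assumes the starting URL points at a directory-like path such as `/docs/`.

```python
from urllib.parse import urlsplit


def in_scope(url: str, start_url: str, mode: str = "same-domain") -> bool:
    """Return True if `url` falls inside the crawl scope selected by `mode`."""
    target, start = urlsplit(url), urlsplit(start_url)
    t_host = (target.hostname or "").lower()
    s_host = (start.hostname or "").lower()
    if mode == "same-domain":
        return t_host == s_host
    if mode == "same-path":
        # Compare directory-style prefixes so /docs/ does not match /docs-old/.
        prefix = start.path if start.path.endswith("/") else start.path + "/"
        path = target.path if target.path.endswith("/") else target.path + "/"
        return t_host == s_host and path.startswith(prefix)
    if mode == "allow-subdomains":
        return t_host == s_host or t_host.endswith("." + s_host)
    raise ValueError(f"unknown mode: {mode}")


start = "https://example.com/docs/"
print(in_scope("https://api.example.com/v1", start))                      # same-domain
print(in_scope("https://api.example.com/v1", start, "allow-subdomains"))  # subdomains allowed
print(in_scope("https://example.com/docs-old/x", start, "same-path"))     # outside the path
```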
+### Content Extraction
+The crawler extracts page content as follows:
+1. **Main content detection** — Finds `<main>`, `<article>`, or content containers
+2. **Navigation removal** — Strips headers, footers, sidebars
+3. **Code preservation** — Maintains code blocks with language hints
+4. **Link normalization** — Converts relative links to absolute
+5. **Image handling** — Optionally downloads and references images
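
The link-normalization step can be illustrated with the standard library alone. This sketch resolves a relative `href` against the page URL, drops the fragment, and lowercases the host so that the same page reached through different links maps to a single URL. The function name `normalize_link` is hypothetical and the crawler's actual rules may differ.

```python
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def normalize_link(page_url: str, href: str) -> str:
    """Resolve `href` against `page_url` and return a canonical absolute URL."""
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    parts = urlsplit(absolute)
    path = parts.path or "/"  # treat "https://host" and "https://host/" as the same page
    # Hosts are case-insensitive; paths and queries are left untouched.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


print(normalize_link("https://Docs.Example.com/guide/intro.html", "../api/index.html#top"))
print(normalize_link("https://example.com/a/", "b?x=1#frag"))
```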
+---
+## Output Structure
+```
+.tmp/docs/<subject>/
+├── index.md              # Master index with TOC
+├── metadata.json         # Crawl metadata
+├── content.json          # Structured JSON export
+└── pages/
+    ├── getting-started.md
+    ├── installation.md
+    ├── api-reference.md
+    ├── configuration/
+    │   ├── basic.md
+    │   └── advanced.md
+    └── troubleshooting.md
+```
+### Index Format
+```markdown
+# <Subject> Documentation
+> Crawled from: <base-url>
+> Pages: <count>
+> Date: <timestamp>
+## Table of Contents
+- [Getting Started](pages/getting-started.md)
+- [Installation](pages/installation.md)
+- [API Reference](pages/api-reference.md)
+- Configuration
+  - [Basic](pages/configuration/basic.md)
+  - [Advanced](pages/configuration/advanced.md)
+- [Troubleshooting](pages/troubleshooting.md)
+```
+---
+## Common Workflows
+### 1. Harvest API Documentation
+```bash
+# Crawl API docs with deep recursion
+python skills/webcrawler/scripts/crawl_docs.py \
+  --url "https://api.example.com/docs" \
+  --subject "Example API" \
+  --depth 4 \
+  --filter "/api/" \
+  --output .tmp/docs/example-api/
+```
+### 2. Build RAG Knowledge Base
+```bash
+# Crawl and export as JSON for embedding
+python skills/webcrawler/scripts/crawl_docs.py \
+  --url "https://docs.example.com" \
+  --subject "Example Docs" \
+  --depth 3 \
+  --format json \
+  --output .tmp/rag/example/
+# content.json can then be fed to an embedding pipeline
+```
+### 3. Offline Documentation Mirror
+```bash
+# Full documentation harvest
+python skills/webcrawler/scripts/crawl_docs.py \
+  --url "https://kubernetes.io/docs/concepts/" \
+  --subject "Kubernetes Concepts" \
+  --depth 5 \
+  --max-pages 500 \
+  --include-images \
+  --output .tmp/docs/k8s-concepts/
+```
+### 4. Focused Topic Extraction
+```bash
+# Crawl, then filter to specific topic
+python skills/webcrawler/scripts/crawl_docs.py \
+  --url "https://developer.hashicorp.com/terraform/docs" \
+  --subject "Terraform" \
+  --depth 3 \
+  --output .tmp/docs/terraform-full/
+# Filter to AWS provider only
+python skills/webcrawler/scripts/filter_docs.py \
+  --input .tmp/docs/terraform-full/ \
+  --subject "AWS Provider" \
+  --output .tmp/docs/terraform-aws/
+```
+---
+## Best Practices
+### Crawling
+1. **Start shallow** — Begin with `--depth 1` to test, then increase
+2. **Use filters** — Narrow scope with `--filter` patterns
+3. **Set page limits** — Use `--max-pages` to prevent runaway crawls
+4. **Respect rate limits** — Increase `--delay` for slower servers
+### Content Quality
+1. **Subject focus** — Be specific with `--subject` for better filtering
+2. **Review index** — Check `index.md` to verify crawl coverage
+3. **Post-filter** — Use `filter_docs.py` to refine results
+### Storage
+1. **Use `.tmp/`** — Store crawled docs in the temp directory
+2. **Organize by subject** — Create subdirectories per topic
+3. **Version with dates** — Add timestamps for recurring crawls
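
For recurring crawls, a dated directory per subject keeps earlier harvests intact. A minimal sketch follows; the helper name `dated_output_dir` and the slug rule are assumptions made for illustration.

```python
from datetime import date
from pathlib import Path


def dated_output_dir(subject: str, day: date, root: str = ".tmp/docs") -> Path:
    """Build <root>/<subject-slug>/<YYYY-MM-DD> for one crawl run."""
    slug = "-".join(subject.lower().split())  # "Kubernetes Concepts" -> "kubernetes-concepts"
    return Path(root) / slug / day.isoformat()


out_dir = dated_output_dir("Kubernetes Concepts", date(2026, 1, 23))
print(out_dir.as_posix())
```

The resulting path can be passed to `--output`.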
+---
+## Troubleshooting
+| Issue                   | Cause                       | Solution                                |
+| ----------------------- | --------------------------- | --------------------------------------- |
+| **403 Forbidden**       | Blocked by server           | Increase delay, check robots.txt        |
+| **Empty pages**         | JavaScript-rendered content | Use `--render-js` (requires Playwright) |
+| **Too many pages**      | Unbounded crawl             | Lower depth, use filters                |
+| **Duplicate content**   | Same page via multiple URLs | Handled by default (URL normalization)  |
+| **Missing code blocks** | Extraction issue            | Check `--include-code` is enabled       |
+---
+## Dependencies
+Required Python packages:
+```bash
+pip install requests beautifulsoup4 html2text lxml
+# Optional for JavaScript rendering:
+pip install playwright && playwright install
+```
+---
+## Related Skills
+- **[qdrant-memory](../qdrant-memory/SKILL.md)** — Store crawled docs in vector database for RAG
+- **[pdf-reader](../pdf-reader/SKILL.md)** — Extract text from PDF documentation
+---
+## External Resources
+- [Scrapy Documentation](https://docs.scrapy.org/) — For complex crawling needs
+- [html2text](https://github.com/Alir3z4/html2text) — HTML to Markdown conversion
+- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) — HTML parsing

package/templates/skills/core/webcrawler/references/advanced_crawling.md ADDED

@@ -0,0 +1,181 @@
+# Advanced Crawling Reference
+## JavaScript-Rendered Pages
+Some documentation sites render content with JavaScript. For these, use Playwright:
+```bash
+# Install Playwright
+pip install playwright
+playwright install chromium
+# Use in crawl_docs.py with --render-js flag (future feature)
+```
+### Manual Extraction with Playwright
+```python
+from playwright.sync_api import sync_playwright
+def extract_js_rendered(url: str) -> str:
+    with sync_playwright() as p:
+        browser = p.chromium.launch()
+        page = browser.new_page()
+        page.goto(url, wait_until='networkidle')
+        content = page.content()
+        browser.close()
+        return content
+```
+---
+## Rate Limiting Strategies
+### Exponential Backoff
+```python
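+import random
+import time
+import requests
+def fetch_with_backoff(url, max_retries=3):
+    """GET `url`, retrying on HTTP 429 and network errors with exponential backoff."""
+    for attempt in range(max_retries):
+        try:
+            response = requests.get(url, timeout=10)
+        except requests.exceptions.RequestException:
+            if attempt == max_retries - 1:
+                raise
+            time.sleep(2 ** attempt)
+            continue
+        if response.status_code != 429:
+            return response
+        # 429 Too Many Requests: wait with jitter before the next attempt
+        if attempt < max_retries - 1:
+            time.sleep((2 ** attempt) + random.uniform(0, 1))
+    # Fail loudly instead of silently returning None when every attempt was rate-limited
+    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")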
+```
+### Respecting Crawl-Delay
+```python
+import time
+from urllib.robotparser import RobotFileParser
+rp = RobotFileParser()
+rp.set_url("https://example.com/robots.txt")
+rp.read()  # fetches and parses robots.txt over the network
+crawl_delay = rp.crawl_delay("*")  # None when no Crawl-delay directive applies
+if crawl_delay:
+    time.sleep(crawl_delay)
+---
+## Content Extraction Patterns
+### Documentation Site Patterns
+| Site Type       | Content Selector       | Notes                     |
+| --------------- | ---------------------- | ------------------------- |
+| **ReadTheDocs** | `.document`, `.body`   | Standard Sphinx output    |
+| **GitBook**     | `.page-inner`          | Modern docs platform      |
+| **Docusaurus**  | `.markdown`, `article` | React-based docs          |
+| **MkDocs**      | `.md-content`          | Python-based docs         |
+| **Notion**      | `.notion-page-content` | Requires special handling |
+| **Confluence**  | `#main-content`        | Enterprise wiki           |
+### Handling Dynamic Navigation
+Some sites build their navigation with JavaScript, so links can be missing from the raw HTML. To discover pages on such sites:
+1. Extract sitemap from `sitemap.xml` if available
+2. Parse navigation elements for all page links
+3. Follow `next`/`prev` pagination links
+```python
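+import requests
+from bs4 import BeautifulSoup
+def get_sitemap_urls(base_url: str) -> list[str]:
+    """Return every <loc> URL listed in the site's sitemap.xml."""
+    sitemap_url = f"{base_url.rstrip('/')}/sitemap.xml"  # avoid a double slash
+    response = requests.get(sitemap_url, timeout=10)
+    response.raise_for_status()  # a missing sitemap raises instead of parsing an error page
+    soup = BeautifulSoup(response.content, 'lxml-xml')  # XML parser; needs the lxml package
+    return [loc.text.strip() for loc in soup.find_all('loc')]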
+```
+---
+## Large Documentation Sets
+For documentation with 500+ pages:
+1. **Use depth limits** — Start with `--depth 1` to get main sections
+2. **Section by section** — Crawl each major section separately
+3. **Resume capability** — Check `metadata.json` for already-crawled pages
+4. **Parallel crawling** — Use async requests (not implemented in base script)
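
The resume idea could look roughly like the sketch below, which reads a previous run's `metadata.json` and skips URLs it already lists. The schema is an assumption: the layout of `metadata.json` is not documented here, so the `{"pages": [{"url": ...}]}` shape and the helper names `load_crawled_urls` and `pending_urls` are hypothetical and should be adapted to the real file.

```python
import json
import tempfile
from pathlib import Path


def load_crawled_urls(crawl_dir: str) -> set[str]:
    """Return URLs recorded by a previous crawl, or an empty set if none exist."""
    meta_path = Path(crawl_dir) / "metadata.json"
    if not meta_path.exists():
        return set()
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    # ASSUMPTION: pages are stored as {"pages": [{"url": "..."}, ...]}.
    return {page["url"] for page in meta.get("pages", []) if "url" in page}


def pending_urls(candidates: list[str], crawl_dir: str) -> list[str]:
    """Keep only the candidates that the previous crawl has not recorded."""
    done = load_crawled_urls(crawl_dir)
    return [url for url in candidates if url not in done]


# Demo with a temporary crawl directory and made-up URLs.
with tempfile.TemporaryDirectory() as crawl_dir:
    meta = {"pages": [{"url": "https://docs.example.com/a"}]}
    Path(crawl_dir, "metadata.json").write_text(json.dumps(meta), encoding="utf-8")
    todo = pending_urls(
        ["https://docs.example.com/a", "https://docs.example.com/b"], crawl_dir
    )
print(todo)
```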
+### Memory-Efficient Streaming
+```python
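+# For very large crawls, write each page immediately instead of buffering.
+# discover_pages, extract_page, and save_immediately are placeholders for
+# your own discovery, extraction, and persistence helpers.
+def crawl_streaming(url, output_dir):
+    for page in discover_pages(url):  # ideally a generator, so URLs are not buffered either
+        content = extract_page(page)
+        save_immediately(content, output_dir)
+        # `content` is rebound on the next iteration, so only one page stays in memory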
+```
+---
+## Integration with RAG Pipelines
+### Chunking Strategy
+After crawling, chunk documents for embedding:
+```python
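+def chunk_document(content: str, chunk_size: int = 500) -> list[str]:
+    """Split a document into word chunks that overlap by a quarter of chunk_size."""
+    if chunk_size < 1:
+        raise ValueError("chunk_size must be at least 1")
+    words = content.split()
+    chunks = []
+    overlap = chunk_size // 4
+    step = chunk_size - overlap  # each chunk starts `step` words after the previous one
+    for i in range(0, len(words), step):
+        chunks.append(' '.join(words[i:i + chunk_size]))
+        if i + chunk_size >= len(words):
+            break  # this chunk reached the end; another would only repeat its tail
+    return chunks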
+```
+### Metadata Preservation
+Keep source URLs with chunks for citation:
+```python
+{
+    "text": "chunk content...",
+    "metadata": {
+        "source_url": "https://docs.example.com/page",
+        "title": "Page Title",
+        "section": "Getting Started"
+    }
+}
+```
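
A small helper can attach that metadata to every chunk of a page in one pass. This is a sketch under stated assumptions: `build_chunk_records` is a hypothetical name, and `chunk_index` is an extra field added here for illustration that does not appear in the format above.

```python
def build_chunk_records(chunks, source_url, title, section):
    """Pair each text chunk with the metadata needed to cite its source page."""
    return [
        {
            "text": chunk,
            "metadata": {
                "source_url": source_url,
                "title": title,
                "section": section,
                "chunk_index": index,  # ASSUMPTION: extra field, not part of the format above
            },
        }
        for index, chunk in enumerate(chunks)
    ]


records = build_chunk_records(
    ["Install the package.", "Run the quick start."],
    source_url="https://docs.example.com/page",
    title="Page Title",
    section="Getting Started",
)
print(records[1]["metadata"])
```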
+---
+## Troubleshooting
+### Common Issues
+| Problem                 | Solution                                           |
+| ----------------------- | -------------------------------------------------- |
+| **403 Forbidden**       | Add realistic User-Agent, increase delay           |
+| **Cloudflare blocking** | Use Playwright with stealth plugin                 |
+| **CAPTCHA**             | Cannot bypass; manual intervention required        |
+| **Session-based auth**  | Export cookies, use `--cookies` option             |
+| **Infinite scroll**     | Use Playwright to scroll and wait for content      |
+| **Rate limiting (429)** | Implement exponential backoff, respect Retry-After |
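
Respecting `Retry-After` means handling both forms the header can take: a number of seconds or an HTTP date. The sketch below converts either form into a wait time using only the standard library. The function name `retry_after_seconds` and the one-second fallback are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=1.0):
    """Convert a Retry-After header value into a non-negative wait in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():  # delta-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds("120"))
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))
```

The returned value can replace the computed backoff whenever the server supplies the header.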
+### Debugging
+Enable verbose mode to trace crawl behavior:
+```bash
+python skills/webcrawler/scripts/crawl_docs.py --url "..." --subject "..." -v 2>&1 | tee crawl.log
+```