npm - @pi-unipi/web-api - Versions diffs - 0.1.14 → 0.1.16 - Mend

@pi-unipi/web-api 0.1.14 → 0.1.16

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

package/README.md +81 -114
package/package.json +9 -2
package/skills/web/SKILL.md +54 -11
package/src/engine/constants.ts +36 -0
package/src/engine/dependencies.ts +145 -0
package/src/engine/dom.ts +266 -0
package/src/engine/extract.ts +642 -0
package/src/engine/format.ts +306 -0
package/src/engine/profiles.ts +102 -0
package/src/engine/types.ts +169 -0
package/src/index.ts +9 -2
package/src/providers/base.ts +9 -1
package/src/settings.ts +70 -4
package/src/tools.ts +281 -24
package/src/tui/progress.ts +168 -0
package/src/tui/result.ts +173 -0
package/src/tui/settings-dialog.ts +168 -0

package/README.md CHANGED

@@ -1,46 +1,85 @@
 # @pi-unipi/web-api
-Web search, read, and summarize tools with provider-based backend selection for Pi coding agent.
+Web search, page reading, and content summarization for the agent. The read path uses a local smart-fetch engine by default — free, no API key, browser-grade TLS fingerprinting that bypasses Cloudflare.
-## Overview
+Paid providers (SerpAPI, Tavily, Firecrawl, Perplexity) are available as fallbacks. DuckDuckGo and Jina work out of the box for search.
-`@pi-unipi/web-api` provides agent tools for web access:
+## Commands
-- **web_search** — Search the web using various providers
-- **web_read** — Extract content from URLs
-- **web_llm_summarize** — Summarize web content using LLM
+| Command | Description |
+|---------|-------------|
+| `/unipi:web-settings` | Configure providers, API keys, and smart-fetch defaults |
+| `/unipi:web-cache-clear` | Clear all cached web content |
-Providers are ranked by capability and cost, allowing smart auto-selection.
+## Special Triggers
-## Features
+Workflow skills detect web-api and inject web tools for research-type commands:
-- **Provider-based architecture** — Multiple search/read providers with unified interface
-- **Smart selection** — Auto-select cheapest available provider
-- **API key management** — Interactive TUI for key configuration
-- **Caching** — Web content cached with configurable TTL
-- **Subagent integration** — Tools automatically available to spawned subagents
+| Skill | What Changes |
+|-------|--------------|
+| `research` | Full web search, read, summarize |
+| `gather-context` | External documentation lookup |
+| `consultant` | Industry best practices research |
+| `subagents` (explore) | Web research in parallel |
-## Installation
+The footer and info-screen don't display web-api data — it's a tool package, not a state package.
-```bash
-npm install @pi-unipi/web-api
+## Agent Tools
+| Tool | Description |
+|------|-------------|
+| `web_search` | Search the web via provider |
+| `multi_web_content_read` | Extract content from URLs (smart-fetch or provider) |
+| `web_llm_summarize` | Summarize web content via LLM |
+### web_search
+```
+# Auto-select cheapest provider
+web_search(query: "TypeScript generics")
+# Use specific provider
+web_search(query: "latest AI research", source: 4)  # Tavily
+```
+### multi_web_content_read
+```
+# Single URL (smart-fetch engine by default)
+multi_web_content_read(url: "https://example.com/article")
+# Batch URLs
+multi_web_content_read(url: ["https://example.com/a", "https://example.com/b"])
+# Provider fallback (Jina Reader)
+multi_web_content_read(url: "https://example.com/article", source: 1)
+# Custom options
+multi_web_content_read(url: "https://example.com/article", format: "json", maxChars: 10000)
 ```
-Add to your pi configuration:
+### web_llm_summarize
-```json
-{
-  "pi": {
-    "extensions": [
-      "node_modules/@pi-unipi/web-api/src/index.ts"
-    ]
-  }
-}
 ```
+web_llm_summarize(url: "https://example.com/long-article")
+web_llm_summarize(url: "https://example.com/research", prompt: "Extract key findings")
+```
+## Smart-Fetch Engine
+Local content extraction pipeline — no API key required:
+| Component | Purpose |
+|-----------|---------|
+| **wreq-js** | Browser-grade TLS fingerprinting (bypasses Cloudflare) |
+| **defuddle** | Intelligent content extraction from HTML |
+| **linkedom** | Server-side DOM parsing |
+Outputs clean markdown with metadata (title, author, site, word count). Supports batch concurrent fetching with progress.
 ## Providers
-### Search Providers
+### Search
 | Provider | Rank | Cost | API Key |
 |----------|------|------|---------|
@@ -50,32 +89,27 @@ Add to your pi configuration:
 | Tavily | 4 | Paid | Required |
 | Perplexity | 5 | Paid | Required |
-### Read Providers
+### Read
 | Provider | Rank | Cost | API Key |
 |----------|------|------|---------|
+| Smart-Fetch Engine | 0 | Free | No |
 | Jina AI Reader | 1 | Freemium | Optional |
 | Firecrawl | 2 | Paid | Required |
 | Perplexity | 3 | Paid | Required |
-### Summarize Providers
+### Summarize
 | Provider | Rank | Cost | API Key |
 |----------|------|------|---------|
 | Perplexity | 1 | Paid | Required |
 | LLM Summarize | 2 | LLM tokens | No |
-## Configuration
+## Configurables
 ### API Keys
-Configure API keys via the interactive settings command:
-```
-/unipi:web-settings
-```
-Or set environment variables:
+Configure via `/unipi:web-settings` (interactive TUI) or environment variables:
 ```bash
 export SERPAPI_KEY="your-key"
@@ -85,97 +119,30 @@ export PERPLEXITY_API_KEY="your-key"
 export JINA_API_KEY="your-key"
 ```
-### Settings Files
-- **Auth:** `~/.unipi/config/web-api/auth.json` (API keys, gitignored)
-- **Config:** `~/.unipi/config/web-api/config.json` (provider settings)
+Providers auto-enable when you add a valid API key.
-## Usage
+### Smart-Fetch Defaults
-### Web Search
-```
-# Auto-select cheapest provider
-web_search(query: "TypeScript generics")
+Configure browser profile, OS, max chars, timeout via `/unipi:web-settings → "Smart Fetch Defaults"`.
-# Use specific provider
-web_search(query: "latest AI research", source: 4)  # Tavily
-```
-### Web Read
-```
-# Auto-select provider
-web_read(url: "https://example.com/article")
-# Use specific provider
-web_read(url: "https://example.com/spa", source: 2)  # Firecrawl
-```
-### Web Summarize
-```
-# Auto-summarize
-web_llm_summarize(url: "https://example.com/long-article")
-# Custom prompt
-web_llm_summarize(url: "https://example.com/research", prompt: "Extract key findings")
-```
-## Commands
-### /unipi:web-settings
-Interactive settings dialog for managing providers and API keys.
-- **Auto-enable on key input** — provider is automatically enabled when you add a valid API key (no extra toggle step)
-- **Cursor memory** — last configured provider moves to the top of the list when you return to the menu
-### /unipi:web-cache-clear
+### Settings Files
-Clear all cached web content.
+- **Auth:** `~/.unipi/config/web-api/auth.json` (API keys, gitignored)
+- **Config:** `~/.unipi/config/web-api/config.json` (provider settings, smart-fetch defaults)
-## Cache
+### Cache
 - Default TTL: 1 hour
 - Cache location: `~/.unipi/config/web-api/cache/`
-- Automatic for web_read operations
+- Automatic for all read operations
 ## Troubleshooting
-### No provider available
-If you see "No search provider available":
-1. Run `/unipi:web-settings`
-2. Add API keys for paid providers (they auto-enable on key input)
-3. Or manually enable a free provider
-### API key invalid
+**No provider available:** Run `/unipi:web-settings` and add API keys or enable a free provider.
-If API key validation fails:
+**Smart-fetch fails:** Try a different browser profile (`browser: "chrome_133"`) or a provider fallback (`source: 1`).
-1. Check the key is correct
-2. Verify the key has sufficient permissions
-3. Check provider status at their website
-### Rate limiting
-If you hit rate limits:
-1. Add an API key for higher limits
-2. Use a different provider
-3. Wait and retry
-## Development
-```bash
-# Type check
-npm run typecheck
-# Build
-npm run build
-```
+**Rate limiting:** Add an API key for higher limits, use smart-fetch (no limits), or try a different provider.
 ## License
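
Reviewer note on batch reads: the new README says the engine supports "batch concurrent fetching with progress", and SKILL.md later in this diff lists a `batchConcurrency` option (default 8). The implementation lives in `extract.ts`, which is not shown in this section, so the following is only a minimal sketch of how a concurrency-limited batch with a progress callback might look. `mapWithConcurrency`, `Settled`, and `onProgress` are hypothetical names, not identifiers from the package.

```typescript
// Hypothetical sketch, not the package's code: process `items` with at most
// `limit` workers in flight, recording each outcome without aborting the batch.
type Settled<R> =
  | { status: "fulfilled"; value: R }
  | { status: "rejected"; reason: unknown };

async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
  onProgress?: (done: number, total: number) => void,
): Promise<Settled<R>[]> {
  const results: Settled<R>[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item
  let done = 0; // number of finished items, for progress reporting

  async function runner(): Promise<void> {
    while (true) {
      const index = next++;
      if (index >= items.length) return;
      try {
        const value = await worker(items[index], index);
        results[index] = { status: "fulfilled", value };
      } catch (reason) {
        results[index] = { status: "rejected", reason };
      }
      done++;
      if (onProgress) onProgress(done, items.length);
    }
  }

  // Start up to `limit` runners; each pulls items until none remain.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

Results keep the input order, so a failed URL shows up as a `rejected` entry at its own index rather than failing the whole batch. Whether the package reports per-URL failures this way is an assumption.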

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@pi-unipi/web-api",
-  "version": "0.1.14",
+  "version": "0.1.16",
   "description": "Web search, read, and summarize tools with provider-based backend selection for Pi coding agent",
   "type": "module",
   "license": "MIT",
@@ -38,13 +38,20 @@
     "README.md"
   ],
   "dependencies": {
-    "@pi-unipi/core": "*"
+    "@pi-unipi/core": "*",
+    "defuddle": "^0.18.1",
+    "linkedom": "^0.18.12",
+    "lodash": "^4.17.21",
+    "mime-types": "^2.1.35",
+    "wreq-js": "^2.3.0"
   },
   "peerDependencies": {
     "@mariozechner/pi-coding-agent": "*",
+    "@mariozechner/pi-tui": "*",
     "@sinclair/typebox": "*"
   },
   "devDependencies": {
+    "@types/lodash": "^4.17.24",
     "@types/node": "^25.6.0"
   }
 }

package/skills/web/SKILL.md CHANGED

@@ -5,7 +5,7 @@ description: "Web search, read, and summarize tools with provider-based backend"
 # Web Tools
-Use these tools to access web content. Providers are ranked by capability and cost.
+Use these tools to access web content. The read path uses a local smart-fetch engine by default — free, fast, and no API key required.
 ## web_search
@@ -24,21 +24,43 @@ web_search(query: "TypeScript generics tutorial")
 web_search(query: "latest AI research", source: 4)  # Use Tavily
 ```
-## web_read
+## multi_web_content_read
-Read URL content. Lower `source` = simpler providers.
+Read and extract content from URLs. Uses the **smart-fetch engine** by default (source=0 or omitted) — free, local, no API key required. Supports single URL or batch URLs.
-- **Basic extraction:** source 1 (Jina Reader)
-- **Advanced crawling:** source 2 (Firecrawl)
+**Default behavior (source=0):**
+- Browser-grade TLS fingerprinting via wreq-js
+- Intelligent content extraction via defuddle
+- Returns clean markdown with metadata (title, author, site, word count)
+- No API key required
 **Parameters:**
-- `url` (required): URL to read
-- `source` (optional): Provider selection (1-3)
+- `url` (required): Single URL string or array of URLs for batch
+- `source` (optional): Provider selection (0=smart-fetch, 1=Jina Reader, 2=Firecrawl, 3=Perplexity)
+- `browser` (optional): TLS fingerprint profile (default: chrome_145)
+- `os` (optional): OS fingerprint (default: windows)
+- `format` (optional): Output format — markdown, html, text, json (default: markdown)
+- `maxChars` (optional): Maximum content characters (default: 50000)
+- `timeoutMs` (optional): Request timeout in ms (default: 15000)
+- `removeImages` (optional): Strip image references (default: false)
+- `includeReplies` (optional): Include comments/replies (default: extractors)
+- `proxy` (optional): Proxy URL
+- `batchConcurrency` (optional): Concurrent requests for batch (default: 8)
+- `verbose` (optional): Include metadata header (default: true)
 **Examples:**
 ```
-web_read(url: "https://example.com/article")
-web_read(url: "https://example.com/spa", source: 2)  # Use Firecrawl
+# Single URL (uses smart-fetch engine by default)
+multi_web_content_read(url: "https://example.com/article")
+# Batch URLs
+multi_web_content_read(url: ["https://example.com/a", "https://example.com/b"])
+# Use provider fallback (Jina Reader)
+multi_web_content_read(url: "https://example.com/article", source: 1)
+# Custom options
+multi_web_content_read(url: "https://example.com/article", format: "json", maxChars: 10000)
 ```
 ## web_llm_summarize
@@ -61,7 +83,7 @@ web_llm_summarize(url: "https://example.com/research", prompt: "Extract key find
 ## Provider Selection
-- Omit `source` for auto-selection (cheapest available)
+- Omit `source` for auto-selection (smart-fetch engine for read, cheapest for search)
 - Specify `source` number for specific provider
 - If provider unavailable, tool throws descriptive error
@@ -75,6 +97,7 @@ web_llm_summarize(url: "https://example.com/research", prompt: "Extract key find
 5. Perplexity (paid)
 **Read providers:**
+0. **Smart-Fetch Engine** (free, local) — default
 1. Jina AI Reader (freemium)
 2. Firecrawl (paid)
 3. Perplexity (paid)
@@ -83,8 +106,27 @@ web_llm_summarize(url: "https://example.com/research", prompt: "Extract key find
 1. Perplexity (paid)
 2. LLM Summarize (uses pi's LLM)
+## Smart-Fetch Engine
+The smart-fetch engine is a local content extraction pipeline:
+- **wreq-js**: Browser-grade TLS fingerprinting (bypasses Cloudflare, etc.)
+- **defuddle**: Intelligent content extraction from HTML
+- **linkedom**: Server-side DOM parsing
+**Features:**
+- No API key required
+- Browser-level anti-bot bypass
+- Clean markdown output with metadata
+- Batch concurrent fetching with progress
+- Client-side meta redirect following
+- Multiple output formats
+**Configure defaults** via `/unipi:web-settings` → "Smart Fetch Defaults"
 ## Cost Awareness
+- **Smart-Fetch Engine:** Free (read only, no API key)
 - **DuckDuckGo:** Free (search only)
 - **Jina:** Freemium (search + read)
 - **SerpAPI/Tavily:** Paid (search)
@@ -98,6 +140,7 @@ Configure providers via `/unipi:web-settings` command.
 - Add/remove API keys
 - Enable/disable providers
+- Configure smart-fetch defaults
 - View provider status
 ## Cache
@@ -105,4 +148,4 @@ Configure providers via `/unipi:web-settings` command.
 Web content is cached for 1 hour by default.
 - Clear cache: `/unipi:web-cache-clear`
-- Cache is automatic for web_read operations
+- Cache includes smart-fetch results (keyed by URL + browser + format + maxChars)
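
Reviewer note on the cache key: SKILL.md now states that smart-fetch results are cached "keyed by URL + browser + format + maxChars". The key construction itself is not visible in this section, so the sketch below only illustrates one way such a composite key could be built. `CacheKeyParts` and `smartFetchCacheKey` are hypothetical names, and the JSON encoding is an assumption.

```typescript
// Hypothetical sketch: derive a deterministic cache key from the four fields
// that SKILL.md names. The real key format in the package may differ.
interface CacheKeyParts {
  url: string;
  browser: string;
  format: string;
  maxChars: number;
}

function smartFetchCacheKey(parts: CacheKeyParts): string {
  // A JSON array keeps field boundaries unambiguous, unlike plain
  // concatenation, where "a" + "bc" and "ab" + "c" would collide.
  return JSON.stringify([parts.url, parts.browser, parts.format, parts.maxChars]);
}
```

One consequence of keying on these fields is that the same URL read with a different `format` or `maxChars` is a cache miss. Options outside the key, such as `removeImages`, would share an entry under this scheme.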

package/src/engine/constants.ts ADDED

@@ -0,0 +1,36 @@
+/**
+ * @unipi/web-api — Engine Constants
+ *
+ * Default values for the smart-fetch engine.
+ */
+/** Default browser TLS fingerprint profile */
+export const DEFAULT_BROWSER = "chrome_145";
+/** Default OS fingerprint */
+export const DEFAULT_OS = "windows";
+/** Default maximum content length in characters */
+export const DEFAULT_MAX_CHARS = 50000;
+/** Default request timeout in milliseconds */
+export const DEFAULT_TIMEOUT_MS = 15000;
+/** Default batch concurrency */
+export const DEFAULT_BATCH_CONCURRENCY = 8;
+/** Default removeImages setting */
+export const DEFAULT_REMOVE_IMAGES = false;
+/** Default includeReplies setting */
+export const DEFAULT_INCLUDE_REPLIES: boolean | "extractors" = "extractors";
+/** Default output format */
+export const DEFAULT_FORMAT = "markdown" as const;
+/** Default HTTP headers */
+export const DEFAULT_HEADERS: Record<string, string> = {
+  Accept:
+    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
+  "Accept-Language": "en-US,en;q=0.9",
+};
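
Reviewer note on defaults: these constants match the defaults documented for `multi_web_content_read` in SKILL.md (`chrome_145`, `windows`, 50000 characters, 15000 ms, concurrency 8). How the engine merges caller options with them is not shown in this section. A minimal sketch, assuming a plain "defaults, then defined overrides" merge, could look like the following. `DEFAULTS`, `FetchOptions`, and `resolveOptions` are hypothetical names, and the values are copied from the constants above.

```typescript
// Values copied from constants.ts in this diff; the merge logic is an assumption.
const DEFAULTS = {
  browser: "chrome_145",
  os: "windows",
  format: "markdown",
  maxChars: 50000,
  timeoutMs: 15000,
  batchConcurrency: 8,
  removeImages: false,
  includeReplies: "extractors" as boolean | "extractors",
};

type FetchOptions = typeof DEFAULTS;

function resolveOptions(overrides: Partial<FetchOptions> = {}): FetchOptions {
  const resolved: Record<string, unknown> = { ...DEFAULTS };
  for (const key of Object.keys(overrides) as (keyof FetchOptions)[]) {
    const value = overrides[key];
    // Skip explicit `undefined` so an omitted tool parameter cannot erase a default.
    if (value !== undefined) resolved[key] = value;
  }
  return resolved as FetchOptions;
}
```

For example, `resolveOptions({ format: "json", maxChars: 10000 })` keeps the default `timeoutMs` of 15000 while applying the two overrides.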

package/src/engine/dependencies.ts ADDED

@@ -0,0 +1,145 @@
+/**
+ * @unipi/web-api — Runtime Dependencies
+ *
+ * Lazy-loaded dependencies for the smart-fetch engine.
+ * Uses dynamic imports to handle optional native binding failures gracefully.
+ */
+let wreqModule: any = null;
+let defuddleModule: any = null;
+let lodashModule: any = null;
+let mimeTypesModule: any = null;
+/**
+ * Get the wreq-js module.
+ * Throws a helpful error if the module is not available.
+ *
+ * @returns wreq-js module
+ */
+export async function getWreq(): Promise<any> {
+  if (wreqModule) {
+    return wreqModule;
+  }
+  try {
+    // Use dynamic import for ESM compatibility
+    wreqModule = await import("wreq-js");
+    return wreqModule;
+  } catch (error) {
+    throw new Error(
+      `wreq-js is not available. ` +
+      `This is required for browser-grade TLS fingerprinting. ` +
+      `Run: npm install wreq-js\n` +
+      `Error: ${error instanceof Error ? error.message : String(error)}`
+    );
+  }
+}
+/**
+ * Get the defuddle module.
+ * Throws a helpful error if the module is not available.
+ *
+ * @returns defuddle module
+ */
+export async function getDefuddle(): Promise<any> {
+  if (defuddleModule) {
+    return defuddleModule;
+  }
+  try {
+    defuddleModule = await import("defuddle");
+    return defuddleModule;
+  } catch (error) {
+    throw new Error(
+      `defuddle is not available. ` +
+      `This is required for intelligent content extraction. ` +
+      `Run: npm install defuddle\n` +
+      `Error: ${error instanceof Error ? error.message : String(error)}`
+    );
+  }
+}
+/**
+ * Get the lodash module.
+ *
+ * @returns lodash module
+ */
+export async function getLodash(): Promise<any> {
+  if (lodashModule) {
+    return lodashModule;
+  }
+  try {
+    lodashModule = await import("lodash");
+    return lodashModule;
+  } catch (error) {
+    throw new Error(
+      `lodash is not available. ` +
+      `Run: npm install lodash\n` +
+      `Error: ${error instanceof Error ? error.message : String(error)}`
+    );
+  }
+}
+/**
+ * Get the mime-types module.
+ *
+ * @returns mime-types module
+ */
+export async function getMimeTypes(): Promise<any> {
+  if (mimeTypesModule) {
+    return mimeTypesModule;
+  }
+  try {
+    mimeTypesModule = await import("mime-types");
+    return mimeTypesModule;
+  } catch (error) {
+    throw new Error(
+      `mime-types is not available. ` +
+      `Run: npm install mime-types\n` +
+      `Error: ${error instanceof Error ? error.message : String(error)}`
+    );
+  }
+}
+/**
+ * Check if all required dependencies are available.
+ *
+ * @returns true if all deps are available
+ */
+export async function checkDependencies(): Promise<{
+  available: boolean;
+  missing: string[];
+}> {
+  const missing: string[] = [];
+  try {
+    await getWreq();
+  } catch {
+    missing.push("wreq-js");
+  }
+  try {
+    await getDefuddle();
+  } catch {
+    missing.push("defuddle");
+  }
+  try {
+    await getLodash();
+  } catch {
+    missing.push("lodash");
+  }
+  try {
+    await getMimeTypes();
+  } catch {
+    missing.push("mime-types");
+  }
+  return {
+    available: missing.length === 0,
+    missing,
+  };
+}
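
Reviewer note on the loaders: the four `get*` functions above repeat one pattern: return the cached module if present, otherwise `import()` it and wrap any failure in an install hint. As an illustration only, the pattern can be expressed once as a factory. `lazyModule` is a hypothetical helper, not part of the package, and caching the promise instead of the resolved module is a design choice of this sketch.

```typescript
// Hypothetical sketch: a memoizing loader factory that mirrors the error
// message shape used by the getters in this file.
function lazyModule<T>(
  name: string,
  load: () => Promise<T>,
  purpose?: string,
): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      cached = load().catch((error: unknown) => {
        cached = undefined; // allow a retry after a failed load
        const detail = error instanceof Error ? error.message : String(error);
        throw new Error(
          `${name} is not available. ` +
            (purpose ? `${purpose} ` : "") +
            `Run: npm install ${name}\n` +
            `Error: ${detail}`,
        );
      });
    }
    return cached;
  };
}
```

Under this sketch, a getter would be declared as `lazyModule("wreq-js", () => import("wreq-js"), "This is required for browser-grade TLS fingerprinting.")`. Note also that `checkDependencies` resolves to an object with `available` and `missing`, so callers would read `result.missing` to report which modules failed to load.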