npm - @alexion42/pi-web-search - Versions diffs - 0.1.0 - Mend

@alexion42/pi-web-search 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (22)

package/.pi/tasks/tasks-019e595f-0b95-7b09-9237-a0c6fbbda360.json +4 -0
package/CHANGELOG.md +18 -0
package/LICENSE +21 -0
package/README.md +88 -0
package/TOOLS.md +103 -0
package/activity.ts +101 -0
package/banner.png +0 -0
package/code-search.ts +107 -0
package/exa.ts +520 -0
package/extract.ts +342 -0
package/github-api.ts +196 -0
package/github-extract.ts +634 -0
package/index.ts +885 -0
package/package.json +46 -0
package/pdf-extract.ts +192 -0
package/pi-web-fetch-demo.mp4 +0 -0
package/rsc-extract.ts +338 -0
package/search.ts +49 -0
package/storage.ts +71 -0
package/test/pdf-extract.test.mjs +95 -0
package/types.ts +20 -0
package/utils.ts +44 -0

package/.pi/tasks/tasks-019e595f-0b95-7b09-9237-a0c6fbbda360.json ADDED

@@ -0,0 +1,4 @@
+{
+  "nextId": 4,
+  "tasks": []
+}

package/CHANGELOG.md ADDED

@@ -0,0 +1,18 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+## [0.1.0] - 2026-05-24
+Initial release as `@alexion42/pi-web-search`.
+Lean fork of [nicobailon/pi-web-access](https://github.com/nicobailon/pi-web-access). Stripped out: Perplexity, Gemini API, Gemini Web, YouTube/video analysis, browser-cookie auth, curator UI, summary review, and librarian skill. What remains: Exa-only search, GitHub cloning, PDF extraction, and URL fetching via Readability → RSC → Jina.
+### Included
+- `web_search` — Exa search with synthesized answers (direct API or zero-config MCP)
+- `code_search` — Code/docs search via Exa MCP
+- `fetch_content` — URL content extraction with GitHub cloning and PDF support
+- `get_search_content` — Retrieve stored search/fetch results
+- `/search` — Interactive command to browse stored results
+- Activity monitor (`Ctrl+Shift+W`)

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 Nico Bailon
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md ADDED

@@ -0,0 +1,88 @@
+# Pi Web Search
+Lean web search for Pi, powered by Exa. A lean fork of [nicobailon/pi-web-access](https://github.com/nicobailon/pi-web-access).
+[![npm version](https://img.shields.io/npm/v/@alexion42/pi-web-search?style=for-the-badge)](https://www.npmjs.com/package/@alexion42/pi-web-search)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT)
+## Install
+```bash
+pi install npm:@alexion42/pi-web-search
+```
+Works immediately with no API keys — Exa MCP provides zero-config search. For direct API access, add your key to `~/.pi/web-search.json`:
+```json
+{
+  "exaApiKey": "exa-..."
+}
+```
+Requires Pi v0.37.3+.
+## What's Available
+| Tool | Description |
+|------|-------------|
+| `web_search` | Search the web via Exa with synthesized answers and source citations |
+| `code_search` | Search for code examples, docs, and API references via Exa MCP |
+| `fetch_content` | Extract readable content from URLs, GitHub repos (cloned locally), and PDFs |
+| `get_search_content` | Retrieve stored content from previous searches or fetches |
+| `/search` | Interactive command to browse stored search results |
+| Activity monitor | `Ctrl+Shift+W` to view live request/response activity |
+Content extraction uses a robust fallback chain: Readability → RSC parser → Jina Reader. Full parameter reference and examples are in [TOOLS.md](./TOOLS.md).
+## Configuration
+All config lives in `~/.pi/web-search.json`. Every field is optional.
+```json
+{
+  "exaApiKey": "exa-...",
+  "githubClone": {
+    "enabled": true,
+    "maxRepoSizeMB": 350,
+    "cloneTimeoutSeconds": 30,
+    "clonePath": "/tmp/pi-github-repos"
+  },
+  "shortcuts": {
+    "activity": "ctrl+shift+w"
+  }
+}
+```
+`EXA_API_KEY` env var takes precedence over config file values. Config changes require a Pi restart.
+## How It Works
+```
+web_search(query)
+  → Exa (direct API with key, MCP without)
+fetch_content(url)
+  → GitHub URL?  Clone repo, return file contents + local path
+  → HTTP fetch → PDF? Extract text, save to ~/Downloads/
+               → HTML? Readability → RSC parser → Jina Reader
+               → Text/JSON/Markdown? Return directly
+```
+## Limitations
+- PDFs are text-extracted only (no OCR for scanned documents).
+- GitHub branch names with slashes may misresolve file paths; the clone still works and the agent can navigate manually.
+- Non-code GitHub URLs (issues, PRs, wiki) fall through to normal web extraction.
+## Comparison with Other Pi Search Extensions
+| Extension | Backends | Differentiator |
+|-----------|----------|----------------|
+| [nicobailon/pi-web-access](https://github.com/nicobailon/pi-web-access) | Perplexity, Gemini, Exa | Upstream — curator, YouTube, video |
+| [ronnieops/pi-search-hub](https://github.com/ronnieops/pi-search-hub) | 12 backends | RRF combine mode, auto-fallback |
+| [code-yeongyu/pi-websearch](https://github.com/code-yeongyu/pi-websearch) | 11 backends + native OpenAI/Anthropic | Provider routing, keyless DDG |
+| [ayagmar/pi-codex-web-search](https://github.com/ayagmar/pi-codex-web-search) | OpenAI Codex CLI | Wraps local `codex` CLI |
+| [iaptsiauri/pi-surf](https://github.com/iaptsiauri/pi-surf) | Brave + custom providers | Scout subagent, pluggable providers |
+| [NicoAvanzDev/pi-web-extension](https://github.com/NicoAvanzDev/pi-web-extension) | Brave, DDG (keyless) | Prompt steering, token-aware |
+This project's niche: **GitHub cloning + Exa MCP zero-config in a lean package**. No multi-provider routing, no browser UI, no video.
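
The README above states that the `EXA_API_KEY` environment variable takes precedence over `exaApiKey` in `~/.pi/web-search.json`. A minimal sketch of that lookup order follows. `resolveExaApiKey` and `WebSearchConfig` are hypothetical names, and treating blank strings as unset is an assumption, since the package's actual config loader is not shown in this part of the diff.

```typescript
// Hypothetical shape of the relevant part of ~/.pi/web-search.json.
interface WebSearchConfig {
  exaApiKey?: string;
}

// Return the key to use, preferring the environment over the config file.
// Assumption: empty or whitespace-only values count as "not set".
function resolveExaApiKey(
  env: Record<string, string | undefined>,
  config: WebSearchConfig,
): string | undefined {
  const fromEnv = env.EXA_API_KEY?.trim();
  if (fromEnv) return fromEnv;

  const fromConfig = config.exaApiKey?.trim();
  return fromConfig ? fromConfig : undefined;
}

// Environment wins when both are present.
const chosen = resolveExaApiKey({ EXA_API_KEY: "env-key" }, { exaApiKey: "file-key" });

// With neither source set, the README says search falls back to Exa MCP.
const none = resolveExaApiKey({}, {});
```

In a real process the first argument would typically be `process.env`; passing it in explicitly keeps the sketch easy to test.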

package/TOOLS.md ADDED

@@ -0,0 +1,103 @@
+# Tools Reference
+Detailed parameter reference and usage examples for all tools provided by Pi Web Search.
+## web_search
+Search the web via Exa. Returns an AI-synthesized answer with source citations.
+```typescript
+web_search({ query: "TypeScript best practices 2025" })
+web_search({ queries: ["query 1", "query 2"] })
+web_search({ query: "latest news", numResults: 10, recencyFilter: "week" })
+web_search({ query: "...", domainFilter: ["github.com"] })
+web_search({ query: "...", includeContent: true })
+```
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `query` | `string` | Single search query. For research tasks, prefer `queries` with multiple varied angles instead. |
+| `queries` | `string[]` | Multiple queries searched in sequence, each returning its own synthesized answer. |
+| `numResults` | `number` | Results per query (default: 5, max: 20) |
+| `recencyFilter` | `string` | Filter by recency: `day`, `week`, `month`, or `year` |
+| `domainFilter` | `string[]` | Limit to specific domains (prefix with `-` to exclude, e.g. `["-twitter.com"]`) |
+| `includeContent` | `boolean` | Fetch full page content from sources in background |
+**Tips:** For comprehensive research, use 2-4 varied queries instead of one broad query. Each query gets its own synthesized answer.
+## code_search
+Search for code examples, documentation, and API references via Exa MCP. No API key required.
+```typescript
+code_search({ query: "React useEffect cleanup pattern" })
+code_search({ query: "Express middleware error handling", maxTokens: 10000 })
+```
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `query` | `string` | Programming question, API, library, or debugging topic to search for |
+| `maxTokens` | `number` | Maximum tokens of code/documentation context to return (default: 5000, max: 50000) |
+## fetch_content
+Fetch URL(s) and extract readable content as markdown. Automatically detects and handles GitHub repos, PDFs, and regular web pages.
+```typescript
+fetch_content({ url: "https://example.com/article" })
+fetch_content({ urls: ["url1", "url2", "url3"] })
+fetch_content({ url: "https://github.com/owner/repo" })
+fetch_content({ url: "https://example.com/doc.pdf" })
+```
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `url` | `string` | Single URL to fetch |
+| `urls` | `string[]` | Multiple URLs (fetched in parallel) |
+| `forceClone` | `boolean` | Force cloning large GitHub repositories that exceed the size threshold |
+**GitHub repos:** GitHub URLs are cloned locally instead of scraped. The agent gets real file contents and a local path to explore with `read` and `bash`. Root URLs return the repo tree + README, `/tree/` paths return directory listings, `/blob/` paths return file contents. Repos over 350MB get a lightweight API-based view (override with `forceClone: true`).
+**PDFs:** PDF URLs are extracted as text and saved to `~/Downloads/` as markdown. Text-based extraction only — no OCR.
+**Fallback chain:** Readability → RSC parser (Next.js) → Jina Reader (JS-rendered pages). Handles SPAs, JS-heavy pages, and anti-bot protections transparently.
+## get_search_content
+Retrieve stored content from previous `web_search` or `fetch_content` calls. Content over 30,000 chars is truncated in tool responses but stored in full for retrieval here.
+```typescript
+get_search_content({ responseId: "abc123", urlIndex: 0 })
+get_search_content({ responseId: "abc123", url: "https://..." })
+get_search_content({ responseId: "abc123", query: "original query" })
+```
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `responseId` | `string` | The response ID from a previous `web_search` or `fetch_content` call |
+| `query` | `string` | Get content for this specific query (from `web_search` results) |
+| `queryIndex` | `number` | Get content for query at this index (0-based) |
+| `url` | `string` | Get content for this specific URL (from `fetch_content` results) |
+| `urlIndex` | `number` | Get content for URL at this index (0-based) |
+## /search
+Interactive command to browse stored search results from the current session. Lists all results with their response IDs for easy retrieval. Supports viewing details and deleting results.
+```
+/search
+```
+## Activity Monitor
+Toggle with `Ctrl+Shift+W` (configurable via `shortcuts.activity` in config) to see live request/response activity:
+```
+─── Web Search Activity ────────────────────────────────────
+  API  "typescript best practices"     200    2.1s ✓
+  GET  docs.example.com/article        200    0.8s ✓
+  GET  blog.example.com/post           404    0.3s ✗
+────────────────────────────────────────────────────────────
+```
+Shows the last 10 API calls and URL fetches with status codes, timing, and rate limit usage. Auto-clears on session switch.
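
The `domainFilter` row in TOOLS.md above says that entries prefixed with `-` are exclusions. One way such a list could be split into include and exclude sets is sketched below. `splitDomainFilter` is a hypothetical helper; how the package itself parses the filter lives in files outside this part of the diff.

```typescript
// Split a domainFilter array into domains to include and domains to exclude.
// Entries starting with "-" are treated as exclusions, per TOOLS.md.
function splitDomainFilter(filter: string[]): { include: string[]; exclude: string[] } {
  const include: string[] = [];
  const exclude: string[] = [];

  for (const raw of filter) {
    const entry = raw.trim();
    // Assumption: blank entries and a lone "-" carry no domain and are skipped.
    if (entry === "" || entry === "-") continue;

    if (entry.startsWith("-")) {
      exclude.push(entry.slice(1));
    } else {
      include.push(entry);
    }
  }

  return { include, exclude };
}

const parts = splitDomainFilter(["github.com", "-twitter.com"]);
// parts.include is ["github.com"], parts.exclude is ["twitter.com"]
```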

package/activity.ts ADDED

@@ -0,0 +1,101 @@
+// Types
+export interface ActivityEntry {
+	id: string;
+	type: "api" | "fetch";
+	startTime: number;
+	endTime?: number;
+	// For API calls
+	query?: string;
+	// For URL fetches
+	url?: string;
+	// Result - status is number (HTTP code) or null (pending/network error)
+	status: number | null;
+	error?: string;
+}
+export interface RateLimitInfo {
+	used: number;
+	max: number;
+	oldestTimestamp: number | null;
+	windowMs: number;
+}
+export class ActivityMonitor {
+	private entries: ActivityEntry[] = [];
+	private readonly maxEntries = 10;
+	private listeners = new Set<() => void>();
+	private rateLimitInfo: RateLimitInfo = { used: 0, max: 10, oldestTimestamp: null, windowMs: 60000 };
+	private nextId = 1;
+	logStart(partial: Omit<ActivityEntry, "id" | "startTime" | "status">): string {
+		const id = `act-${this.nextId++}`;
+		const entry: ActivityEntry = {
+			...partial,
+			id,
+			startTime: Date.now(),
+			status: null,
+		};
+		this.entries.push(entry);
+		if (this.entries.length > this.maxEntries) {
+			this.entries.shift();
+		}
+		this.notify();
+		return id;
+	}
+	logComplete(id: string, status: number): void {
+		const entry = this.entries.find((e) => e.id === id);
+		if (entry) {
+			entry.endTime = Date.now();
+			entry.status = status;
+			this.notify();
+		}
+	}
+	logError(id: string, error: string): void {
+		const entry = this.entries.find((e) => e.id === id);
+		if (entry) {
+			entry.endTime = Date.now();
+			entry.error = error;
+			this.notify();
+		}
+	}
+	getEntries(): readonly ActivityEntry[] {
+		return this.entries;
+	}
+	getRateLimitInfo(): RateLimitInfo {
+		return this.rateLimitInfo;
+	}
+	updateRateLimit(info: RateLimitInfo): void {
+		this.rateLimitInfo = info;
+		this.notify();
+	}
+	onUpdate(callback: () => void): () => void {
+		this.listeners.add(callback);
+		return () => this.listeners.delete(callback);
+	}
+	clear(): void {
+		this.entries = [];
+		this.rateLimitInfo = { used: 0, max: 10, oldestTimestamp: null, windowMs: 60000 };
+		this.notify();
+	}
+	private notify(): void {
+		for (const cb of this.listeners) {
+			try {
+				cb();
+			} catch {
+				// Ignore listener errors so one failing callback cannot block the others.
+			}
+		}
+	}
+}
+export const activityMonitor = new ActivityMonitor();
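
`ActivityMonitor.logStart` above caps the log at `maxEntries` (10) by pushing the new entry and then shifting the oldest one off the front. The sketch below isolates that eviction pattern in a small generic class. `BoundedLog` is a hypothetical name and is not part of the package.

```typescript
// Hypothetical generic version of the bounded list kept by ActivityMonitor.
class BoundedLog<T> {
  private items: T[] = [];
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  // Append an item, then evict the oldest one if the cap is exceeded.
  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.maxEntries) {
      this.items.shift();
    }
  }

  // Returns undefined when no stored item matches, e.g. after eviction.
  find(predicate: (item: T) => boolean): T | undefined {
    return this.items.find(predicate);
  }

  size(): number {
    return this.items.length;
  }
}

const log = new BoundedLog<{ id: string }>(10);
for (let i = 1; i <= 12; i++) {
  log.push({ id: `act-${i}` });
}

const evicted = log.find((e) => e.id === "act-1"); // evicted, so undefined
const oldest = log.find(() => true);               // first remaining entry: act-3
```

One consequence visible in the source: `logComplete` and `logError` look entries up by id and do nothing when the lookup fails, so a request that finishes after its entry has been evicted is silently dropped from the display.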

package/banner.png ADDED

Binary file

package/code-search.ts ADDED

@@ -0,0 +1,107 @@
+import { activityMonitor } from "./activity.js";
+import { callExaMcp } from "./exa.js";
+const CODE_CONTEXT_TOOL = "get_code_context_exa";
+const WEB_SEARCH_TOOL = "web_search_exa";
+const DEFAULT_MAX_TOKENS = 5000;
+let codeContextToolMissing = false;
+function isMissingMcpToolError(message: string): boolean {
+	const normalized = message.toLowerCase();
+	return normalized.includes("tool") && normalized.includes("not found");
+}
+function buildFallbackQuery(query: string): string {
+	const normalized = query.toLowerCase();
+	const hasCodeTerms = /\b(api|code|docs?|documentation|example|github|implementation|library|source|stackoverflow|stack overflow)\b/.test(normalized);
+	return hasCodeTerms ? query : `${query} code examples documentation GitHub Stack Overflow official docs`;
+}
+function maxTokensToResultCount(maxTokens: number): number {
+	return Math.min(20, Math.max(5, Math.ceil(maxTokens / 1000)));
+}
+function trimApproxTokens(text: string, maxTokens: number): string {
+	const maxCharacters = Math.max(1000, maxTokens * 4);
+	if (text.length <= maxCharacters) return text;
+	return `${text.slice(0, maxCharacters).trimEnd()}\n\n[Truncated by code_search to approximately ${maxTokens} tokens.]`;
+}
+async function executeFallbackSearch(query: string, maxTokens: number, signal?: AbortSignal): Promise<string> {
+	const text = await callExaMcp(
+		WEB_SEARCH_TOOL,
+		{
+			query: buildFallbackQuery(query),
+			numResults: maxTokensToResultCount(maxTokens),
+			livecrawl: "fallback",
+			type: "auto",
+			contextMaxCharacters: Math.min(50000, Math.max(1000, maxTokens * 4)),
+		},
+		signal,
+	);
+	return trimApproxTokens(text, maxTokens);
+}
+export async function executeCodeSearch(
+	_toolCallId: string,
+	params: { query: string; maxTokens?: number },
+	signal?: AbortSignal,
+): Promise<{
+	content: Array<{ type: "text"; text: string }>;
+	details: { query: string; maxTokens: number; error?: string; mode?: "code-context" | "web-search-fallback" };
+}> {
+	const query = params.query.trim();
+	if (!query) {
+		return {
+			content: [{ type: "text", text: "Error: No query provided." }],
+			details: { query: "", maxTokens: params.maxTokens ?? DEFAULT_MAX_TOKENS, error: "No query provided" },
+		};
+	}
+	const maxTokens = params.maxTokens ?? DEFAULT_MAX_TOKENS;
+	const activityId = activityMonitor.logStart({ type: "api", query });
+	try {
+		let mode: "code-context" | "web-search-fallback" = "web-search-fallback";
+		let text: string;
+		if (codeContextToolMissing) {
+			text = await executeFallbackSearch(query, maxTokens, signal);
+		} else {
+			try {
+				text = await callExaMcp(
+					CODE_CONTEXT_TOOL,
+					{
+						query,
+						tokensNum: maxTokens,
+					},
+					signal,
+				);
+				mode = "code-context";
+			} catch (err) {
+				const message = err instanceof Error ? err.message : String(err);
+				if (!isMissingMcpToolError(message)) throw err;
+				codeContextToolMissing = true;
+				text = await executeFallbackSearch(query, maxTokens, signal);
+			}
+		}
+		activityMonitor.logComplete(activityId, 200);
+		return {
+			content: [{ type: "text", text }],
+			details: { query, maxTokens, mode },
+		};
+	} catch (err) {
+		const message = err instanceof Error ? err.message : String(err);
+		if (message.toLowerCase().includes("abort")) {
+			activityMonitor.logComplete(activityId, 0);
+			throw err;
+		}
+		activityMonitor.logError(activityId, message);
+		return {
+			content: [{ type: "text", text: `Error: ${message}` }],
+			details: { query, maxTokens, error: message },
+		};
+	}
+}
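
The fallback path in `code-search.ts` above derives two budgets from `maxTokens`: a result count (one result per 1000 tokens, clamped to 5..20) and a character budget (roughly 4 characters per token, clamped to 1000..50000). The sketch below reproduces that arithmetic so the clamping is easy to check. `maxTokensToResultCount` is copied from the source; `contextMaxCharacters` and `fallbackBudget` are hypothetical names for expressions that appear inline in `executeFallbackSearch`.

```typescript
// Copied from the diff: one result per 1000 tokens, clamped to [5, 20].
function maxTokensToResultCount(maxTokens: number): number {
  return Math.min(20, Math.max(5, Math.ceil(maxTokens / 1000)));
}

// Hypothetical name for the inline expression passed as contextMaxCharacters:
// about 4 characters per token, clamped to [1000, 50000].
function contextMaxCharacters(maxTokens: number): number {
  return Math.min(50000, Math.max(1000, maxTokens * 4));
}

// Hypothetical helper that bundles both values for a given token budget.
function fallbackBudget(maxTokens: number): { numResults: number; contextChars: number } {
  return {
    numResults: maxTokensToResultCount(maxTokens),
    contextChars: contextMaxCharacters(maxTokens),
  };
}

const atDefault = fallbackBudget(5000); // { numResults: 5, contextChars: 20000 }
const atMax = fallbackBudget(50000);    // { numResults: 20, contextChars: 50000 }
const tiny = fallbackBudget(100);       // { numResults: 5, contextChars: 1000 }
```

Note that `trimApproxTokens` in the source applies its own limit of `Math.max(1000, maxTokens * 4)` characters with no upper cap, so for large `maxTokens` the 50000-character request limit is the tighter of the two.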