@apmantza/greedysearch-pi 1.6.5 → 1.7.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,19 @@
1
1
  # Changelog
2
2
 
3
+ ## v1.6.5 (2026-04-04)
4
+
5
+ ### Security
6
+ - **Private URL blocking** — Added validation to block requests to localhost, RFC1918 private addresses (10.x, 192.168.x), and .local/.internal domains. Prevents accidental exposure of internal services.
7
+
8
+ ### Features
9
+ - **GitHub URL rewriting** — GitHub blob URLs (`github.com/owner/repo/blob/...`) are automatically rewritten to `raw.githubusercontent.com` for faster, cleaner raw file access.
10
+ - **GitHub repo cloning** — Root and tree URLs now trigger `git clone --depth 1` for complete repo access. Agent can explore files locally instead of parsing rendered HTML. Includes README preview and directory tree listing.
11
+ - **Head+tail content trimming** — Large documents now use smart truncation: keeps 75% from the beginning (introduction) + 25% from the end (conclusions/examples) with `[...content trimmed...]` marker, instead of simple truncation.
12
+ - **Anubis bot detection** — Added detection for the new Anubis proof-of-work anti-bot system (`protected by anubis`, `anubis uses a proof-of-work`).
13
+
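The head+tail trimming entry above can be illustrated with a minimal sketch. The function name `trimHeadTail`, the character-based budget, and the blank lines around the marker are assumptions made for illustration; only the 75%/25% split and the `[...content trimmed...]` marker come from the changelog.

```javascript
// Hypothetical helper, not the package's actual code.
const TRIM_MARKER = "[...content trimmed...]";

function trimHeadTail(text, maxChars, headRatio = 0.75) {
  // Documents within the budget are returned unchanged.
  if (text.length <= maxChars) return text;

  const headLen = Math.floor(maxChars * headRatio); // 75% of the budget from the start
  const tailLen = maxChars - headLen; // remaining 25% from the end

  const head = text.slice(0, headLen);
  const tail = text.slice(text.length - tailLen);

  // The marker shows the reader (or agent) where content was dropped.
  return `${head}\n\n${TRIM_MARKER}\n\n${tail}`;
}
```

Compared with cutting at a fixed offset, this keeps the closing part of a page, which is often where examples and conclusions sit.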
14
+ ### Fixes
15
+ - **Perplexity clipboard retry** — Added single retry with 2s delay when clipboard extraction fails, improving reliability.
16
+
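A single retry with a fixed delay, as the fix above describes, might look roughly like this. `withSingleRetry` and its parameters are hypothetical names; the changelog does not show how the extractor actually implements the retry.

```javascript
// Hypothetical helper: run an async operation, and if it throws,
// wait delayMs (2s by default, per the changelog) and try exactly once more.
async function withSingleRetry(operation, delayMs = 2000) {
  try {
    return await operation();
  } catch (firstError) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    // A second failure is not caught here and propagates to the caller.
    return operation();
  }
}
```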
3
17
  ## v1.6.4 (2026-04-02)
4
18
 
5
19
  ### Fixes
package/README.md CHANGED
@@ -1,14 +1,12 @@
1
1
  # GreedySearch for Pi
2
2
 
3
- Pi extension that adds `greedy_search`, `deep_research`, and `coding_task` tools multi-engine AI search via browser automation. **NO API KEYS needed.**
3
+ Pi extension that adds `greedy_search`, `deep_research`, and `coding_task` tools -- multi-engine AI search via browser automation. **NO API KEYS needed.**
4
4
 
5
- Fans out queries to Perplexity, Bing Copilot, and Google AI simultaneously. Returns AI-synthesized answers with deduped sources. Streams progress as each engine completes.
5
+ Fans out queries to Perplexity, Bing Copilot, and Google AI simultaneously. Returns AI-synthesized answers with fetched source content. Streams progress as each engine completes.
6
6
 
7
- Forked from [GreedySearch-claude](https://github.com/apmantza/GreedySearch-claude).
8
-
9
- ## Quick Note
7
+ **New in v2.0:** HTTP-first source fetching with Mozilla Readability extraction (~3x faster), smart query-aware source ranking.
10
8
 
11
- **No API keys required** — this tool uses Chrome DevTools Protocol (CDP) to interact with search engines directly through a browser. It launches its own isolated Chrome instance, so it won't interfere with your main browser session.
9
+ Forked from [GreedySearch-claude](https://github.com/apmantza/GreedySearch-claude).
12
10
 
13
11
  ## Install
14
12
 
@@ -24,10 +22,14 @@ pi install git:github.com/apmantza/GreedySearch-pi
24
22
 
25
23
  ## Quick Start
26
24
 
27
- Once installed, Pi gains a `greedy_search` tool with three depth levels.
25
+ Once installed, Pi gains a `greedy_search` tool with two modes.
28
26
 
29
- ```
30
- greedy_search({ query: "What's new in React 19?", depth: "standard" })
27
+ ```javascript
28
+ // Default: multi-engine + source fetch + synthesis
29
+ greedy_search({ query: "What's new in React 19?" })
30
+
31
+ // Fast: single engine, no synthesis
32
+ greedy_search({ query: "What's new in React 19?", depth: "fast", engine: "perplexity" })
31
33
  ```
32
34
 
33
35
  ## Parameters
@@ -36,22 +38,25 @@ greedy_search({ query: "What's new in React 19?", depth: "standard" })
36
38
  |-----------|------|---------|-------------|
37
39
  | `query` | string | required | The search question |
38
40
  | `engine` | string | `"all"` | `all`, `perplexity`, `bing`, `google`, `gemini` |
39
- | `depth` | string | `"standard"` | `fast` (1 engine), `standard` (3 engines + synthesis), `deep` (3 + fetch + synthesis + confidence) |
41
+ | `depth` | string | `"standard"` | `fast` (1 engine, no fetch), `standard` (3 engines + fetch + synthesis) |
40
42
  | `fullAnswer` | boolean | `false` | Return complete answer (~3000+ chars) vs truncated preview (~300 chars) |
41
43
 
42
44
  ## Depth Levels
43
45
 
44
46
  | Depth | Engines | Synthesis | Source Fetch | Time | Best For |
45
47
  |-------|---------|-----------|--------------|------|----------|
46
- | `fast` | 1 | | | 15-30s | Quick lookup, single perspective |
47
- | `standard` | 3 | | | 30-90s | Default balanced speed/quality |
48
- | `deep` | 3 | ✅ | ✅ (top 5) | 60-180s | Research that matters — architecture decisions |
48
+ | `fast` | 1 | no | no | 10-30s | Quick lookup, single perspective |
49
+ | `standard` | 3 | yes | yes (top 5) | 15-30s | **Default** -- balanced, grounded answers |
49
50
 
50
- ## Engines (for fast mode)
51
+ **Standard mode** (default for `engine: "all"`): queries 3 engines, fetches content from the top 5 sources via HTTP (with Readability extraction), then synthesizes a grounded answer with citations.
52
+
53
+ **Fast mode**: Single engine, no source fetching or synthesis. Good for quick checks.
54
+
55
+ ## Engines
51
56
 
52
57
  | Engine | Alias | Best for |
53
58
  |--------|-------|----------|
54
- | `all` | | All 3 engines but for fast single-engine, pick one below |
59
+ | `all` | - | **Default** -- all 3 engines with synthesis + source fetch |
55
60
  | `perplexity` | `p` | Technical Q&A, code explanations, documentation |
56
61
  | `bing` | `b` | Recent news, Microsoft ecosystem |
57
62
  | `google` | `g` | Broad coverage, multiple perspectives |
@@ -62,66 +67,87 @@ greedy_search({ query: "What's new in React 19?", depth: "standard" })
62
67
  When using `engine: "all"`, the tool streams progress as each engine completes:
63
68
 
64
69
  ```
65
- **Searching...** perplexity · ⏳ bing · ⏳ google
66
- **Searching...** perplexity done · ⏳ bing · ⏳ google
67
- **Searching...** perplexity done · ✅ bing done · ⏳ google
68
- **Searching...** perplexity done · ✅ bing done · ✅ google done
70
+ **Searching...** pending: perplexity, bing, google
71
+ **Searching...** done: perplexity, pending: bing, google
72
+ **Searching...** done: perplexity, done: bing, pending: google
73
+ **Searching...** done: perplexity, done: bing, done: google
74
+ **Synthesizing...** with Gemini
69
75
  ```
70
76
 
71
- ## Deep Research Mode
77
+ ## Source Fetching (HTTP-First)
72
78
 
73
- For research that matters architecture decisions, library comparisons use `depth: "deep"`:
79
+ GreedySearch now uses **HTTP-first source fetching** with Mozilla Readability for content extraction:
74
80
 
75
- ```
76
- greedy_search({ query: "best auth patterns for SaaS in 2026", depth: "deep" })
77
- ```
81
+ - **HTTP**: Fast (~200-800ms), parallel, structured markdown output
82
+ - **Browser fallback**: Only when HTTP fails (bot protection, JS-heavy sites)
83
+ - **Typical success rate**: 90%+ of documentation sites work via HTTP
84
+ - **Speed improvement**: ~3x faster than browser-only fetching (15-30s vs 60-180s)
78
85
 
79
- Deep mode: 3 engines + source fetching (top 5) + synthesis + confidence scores. ~60-180s but returns grounded synthesis with fetched evidence.
86
+ The old regex-based HTML stripping has been replaced with Readability-based content extraction that preserves document structure, code blocks, and headings.
80
87
 
81
- **Standard vs Deep:**
82
- - `standard` (default): 3 engines + synthesis. Good for most research.
83
- - `deep`: Same + fetches source content for grounded answers. Use when the answer really matters.
88
+ ## Smart Source Ranking
84
89
 
85
- **Legacy:** `deep_research` tool still works aliases to `greedy_search` with `depth: "deep"`.
90
+ Sources are now ranked using query-aware domain boosting:
86
91
 
87
- ## Full vs Short Answers
92
+ - **Query keywords** boost official docs (e.g., "react" → react.dev +10 points)
93
+ - **Consensus**: Sources found by multiple engines rank higher
94
+ - **Source type**: Official docs > repos > blogs > community
95
+ - **URL patterns**: `/docs/`, `/api/`, `/reference/` get extra boost
88
96
 
89
- Default mode returns ~300 char summaries to save tokens. Use `fullAnswer: true` for complete responses:
97
+ 40+ tech stacks have preferred domain mappings including React, Node.js, Python, Rust, Go, Prisma, Supabase, and more.
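
To make the ranking signals above concrete, here is a rough scoring sketch. Only the `react` → `react.dev` +10 boost is taken from the README; every other weight, the `source` object shape (`url`, `engines`, `type`), and the function names are assumptions for illustration.

```javascript
// Hypothetical sketch of query-aware ranking; not the package's actual code.
const PREFERRED_DOMAINS = { react: "react.dev" }; // the README's example mapping
const TYPE_SCORE = { docs: 4, repo: 3, blog: 2, community: 1 }; // invented weights

function scoreSource(source, query) {
  const url = new URL(source.url);
  const host = url.hostname;
  const q = query.toLowerCase();
  let score = 0;

  // Query keywords boost the matching official docs domain.
  for (const [keyword, domain] of Object.entries(PREFERRED_DOMAINS)) {
    const onDomain = host === domain || host.endsWith(`.${domain}`);
    if (q.includes(keyword) && onDomain) score += 10;
  }

  // Consensus: each additional engine that returned the source adds points.
  score += 2 * (source.engines.length - 1);

  // Source type: official docs > repos > blogs > community.
  score += TYPE_SCORE[source.type] ?? 0;

  // URL patterns that usually indicate reference material.
  if (/\/(docs|api|reference)\//.test(url.pathname)) score += 2;

  return score;
}

function rankSources(sources, query) {
  // Copy before sorting so the caller's array is left untouched.
  return [...sources].sort((a, b) => scoreSource(b, query) - scoreSource(a, query));
}
```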
90
98
 
91
- ```
92
- greedy_search({ query: "explain the React compiler", engine: "perplexity", fullAnswer: true })
93
- ```
99
+ ## GitHub Content Extraction
100
+
101
+ GreedySearch handles GitHub URLs intelligently:
102
+
103
+ - **Blob URLs** (`/blob/`) — Automatically rewritten to `raw.githubusercontent.com` for instant raw file access
104
+ - **Tree/Root URLs** — Clones repo locally with `git clone --depth 1`, returns README preview + file tree for agent exploration
105
+ - **Benefits**: Real file contents (not rendered HTML), accurate line numbers, works with private repos via `gh` CLI auth
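
The blob URL rewrite described above can be sketched as follows. The function name is hypothetical and the example owner/repo is a placeholder; the mapping itself (drop the `/blob` segment and switch the host to `raw.githubusercontent.com`) is the standard relationship between the two URL forms.

```javascript
// Hypothetical helper: rewrite a GitHub blob URL to its raw-content URL.
// Any other URL (tree, root, non-GitHub) is returned unchanged.
function rewriteGithubBlobUrl(rawUrl) {
  const url = new URL(rawUrl);
  if (url.hostname !== "github.com") return rawUrl;

  // /owner/repo/blob/<ref>/<path> -> /owner/repo/<ref>/<path>
  const match = url.pathname.match(/^\/([^/]+)\/([^/]+)\/blob\/(.+)$/);
  if (!match) return rawUrl;

  const [, owner, repo, refAndPath] = match;
  return `https://raw.githubusercontent.com/${owner}/${repo}/${refAndPath}`;
}
```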
106
+
107
+ ## Security
108
+
109
+ - **Private URL blocking** — Requests to localhost, RFC1918 addresses (10.x, 192.168.x), and .local/.internal domains are automatically blocked
110
+ - **Cross-host redirect detection** — Detects redirects to authentication/login pages and falls back to browser extraction
111
+ - **File protocol blocking** — `file://` URLs are rejected
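
A minimal sketch of the private URL and `file://` checks listed above, under stated assumptions: the function name `isBlockedUrl` is hypothetical, allowing only `http:`/`https:` is stricter than the README's "`file://` URLs are rejected", and the loopback (127.x) and 172.16.0.0/12 checks are added here because they fall under the same categories, not because the README lists them.

```javascript
// Hypothetical validator; the package's real rule set may differ.
function isBlockedUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return true; // unparseable input is rejected
  }

  // Only plain web protocols are allowed, which also rejects file:// URLs.
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;
  if (host.endsWith(".local") || host.endsWith(".internal")) return true;

  // Literal IPv4 addresses: loopback and RFC 1918 private ranges.
  const ipv4 = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (ipv4) {
    const a = Number(ipv4[1]);
    const b = Number(ipv4[2]);
    if (a === 127) return true; // loopback
    if (a === 10) return true; // 10.0.0.0/8
    if (a === 192 && b === 168) return true; // 192.168.0.0/16
    if (a === 172 && b >= 16 && b <= 31) return true; // 172.16.0.0/12 (assumed covered)
  }
  return false;
}
```

A hostname check like this does not catch public names that resolve to private addresses; that would need a DNS-level check, which is outside this sketch.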
94
112
 
95
113
  ## Examples
96
114
 
97
- **Quick lookup (fast):**
115
+ **Default research (multi-engine + sources + synthesis):**
98
116
 
99
- ```
100
- greedy_search({ query: "How to use async await in Python", depth: "fast", engine: "perplexity" })
117
+ ```javascript
118
+ greedy_search({ query: "Best practices for monorepo structure" })
101
119
  ```
102
120
 
103
- **Compare tools (standard):**
121
+ **Quick lookup (fast):**
104
122
 
105
- ```
106
- greedy_search({ query: "Prisma vs Drizzle in 2026", depth: "standard" })
123
+ ```javascript
124
+ greedy_search({ query: "How to use async await in Python", depth: "fast", engine: "perplexity" })
107
125
  ```
108
126
 
109
- **Deep research (architecture decision):**
127
+ **Compare tools:**
110
128
 
111
- ```
112
- greedy_search({ query: "Best practices for monorepo structure", depth: "deep" })
129
+ ```javascript
130
+ greedy_search({ query: "Prisma vs Drizzle in 2026" })
113
131
  ```
114
132
 
115
133
  **Debug an error:**
116
134
 
135
+ ```javascript
136
+ greedy_search({ query: "Error: Cannot find module 'react-dom/client' Next.js 15" })
117
137
  ```
118
- greedy_search({ query: "Error: Cannot find module 'react-dom/client' Next.js 15", depth: "standard" })
138
+
139
+ ## Full vs Short Answers
140
+
141
+ Default mode returns ~300 char summaries to save tokens. Use `fullAnswer: true` for complete responses:
142
+
143
+ ```javascript
144
+ greedy_search({ query: "explain the React compiler", engine: "perplexity", fullAnswer: true })
119
145
  ```
120
146
 
121
147
  ## Requirements
122
148
 
123
- - **Chrome** must be installed. The extension auto-launches a dedicated Chrome instance on port 9222 with its own isolated profile and DevTools port file, separate from your main browser session.
124
- - **Node.js 22+** for built-in `fetch` and WebSocket support.
149
+ - **Chrome** -- must be installed. The extension auto-launches a dedicated Chrome instance on port 9222 with its own isolated profile and DevTools port file, separate from your main browser session.
150
+ - **Node.js 22+** -- for built-in `fetch` and WebSocket support.
125
151
 
126
152
  ## Setup (first time)
127
153
 
@@ -143,22 +169,6 @@ Check status:
143
169
  node ~/.pi/agent/git/GreedySearch-pi/launch.mjs --status
144
170
  ```
145
171
 
146
- ## Testing
147
-
148
- Run the test suite to verify everything works:
149
-
150
- ```bash
151
- ./test.sh # full suite (~3-4 min)
152
- ./test.sh quick # skip parallel tests (~1 min)
153
- ./test.sh parallel # parallel race condition tests only
154
- ```
155
-
156
- Tests verify:
157
- - Single engine mode (perplexity, bing, google)
158
- - Sequential "all" mode searches
159
- - Parallel "all" mode (5 concurrent searches) — detects tab race conditions
160
- - Synthesis mode with Gemini
161
-
162
172
  ## Troubleshooting
163
173
 
164
174
  ### "Chrome not found"
@@ -180,7 +190,7 @@ node ~/.pi/agent/git/GreedySearch-pi/launch.mjs
180
190
 
181
191
  ### Google / Bing "verify you're human"
182
192
 
183
- The extension auto-clicks verification buttons and Cloudflare Turnstile challenges using broad keyword matching resilient to variations like "Verify you are human" or localised button text. For hard CAPTCHAs (image puzzles), solve manually in the Chrome window that opens.
193
+ The extension auto-clicks verification buttons and Cloudflare Turnstile challenges using broad keyword matching -- resilient to variations like "Verify you are human" or localised button text. For hard CAPTCHAs (image puzzles), solve manually in the Chrome window that opens.
184
194
 
185
195
  ### Parallel searches failing
186
196
 
@@ -192,60 +202,60 @@ Chrome may be unresponsive. Restart it with `launch.mjs --kill` then `launch.mjs
192
202
 
193
203
  ### Sources are empty or junk links
194
204
 
195
- Sources are now extracted by regex-parsing Markdown links (`[title](url)`) from the clipboard text captured after each engine responds not from DOM selectors that break when the engine's UI updates. If sources are empty, the engine's clipboard copy didn't include formatted links (Bing Copilot currently falls into this category).
205
+ Sources are now extracted by regex-parsing Markdown links (`[title](url)`) from the clipboard text captured after each engine responds -- not from DOM selectors that break when the engine's UI updates. If sources are empty, the engine's clipboard copy didn't include formatted links (Bing Copilot currently falls into this category).
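
As an illustration of the regex-based extraction described above, a sketch along these lines pulls `[title](url)` pairs out of clipboard text. The function name, the exact pattern, and the de-duplication by URL are assumptions; the extractors' actual regex is not shown in this diff.

```javascript
// Hypothetical helper: collect Markdown links from captured clipboard text.
function extractMarkdownLinks(clipboardText) {
  // [title](http(s)://url) - the URL stops at whitespace or the closing parenthesis.
  const linkPattern = /\[([^\]]+)\]\((https?:\/\/[^\s)]+)\)/g;
  const seen = new Set();
  const sources = [];

  for (const [, title, url] of clipboardText.matchAll(linkPattern)) {
    if (seen.has(url)) continue; // keep the first occurrence of each URL
    seen.add(url);
    sources.push({ title, url });
  }
  return sources;
}
```

If the clipboard text contains no formatted links, the function returns an empty array, which matches the "sources are empty" symptom described above.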
196
206
 
197
207
  ## How It Works
198
208
 
199
- - `index.ts` Pi extension, registers `greedy_search` tool with streaming progress
200
- - `search.mjs` CLI runner, spawns extractors in parallel, emits `PROGRESS:` events to stderr
201
- - `launch.mjs` launches dedicated Chrome on port 9222 with isolated profile
202
- - `extractors/` per-engine CDP scrapers (Perplexity, Bing Copilot, Google AI, Gemini)
203
- - `cdp.mjs` Chrome DevTools Protocol CLI for browser automation
204
- - `skills/greedy-search/SKILL.md` skill file that guides the model on when/how to use greedy_search
209
+ - `index.ts` -- Pi extension, registers `greedy_search` tool with streaming progress
210
+ - `search.mjs` -- CLI runner, spawns extractors in parallel, emits `PROGRESS:` events to stderr
211
+ - `launch.mjs` -- launches dedicated Chrome on port 9222 with isolated profile
212
+ - `extractors/` -- per-engine CDP scrapers (Perplexity, Bing Copilot, Google AI, Gemini)
213
+ - `cdp.mjs` -- Chrome DevTools Protocol CLI for browser automation
214
+ - `skills/greedy-search/SKILL.md` -- skill file that guides the model on when/how to use greedy_search
205
215
 
206
216
  ## Changelog
207
217
 
208
218
  ### v1.6.1 (2026-03-31)
209
- - **Single-engine full answers by default** `engine: "google"` (or any single engine) now returns complete answers instead of truncated previews. Multi-engine (`all`) still truncates to save tokens during synthesis.
210
- - **Codebase refactored** extracted 438 lines from `index.ts` into modular formatters (`src/formatters/`) reducing cognitive complexity from 360 to ~60 and maintainability index from 11.2 to ~40+
211
- - **Removed codebase search confusion** clarified that `greedy_search` is WEB SEARCH ONLY (not for searching local code)
219
+ - **Single-engine full answers by default** -- `engine: "google"` (or any single engine) now returns complete answers instead of truncated previews. Multi-engine (`all`) still truncates to save tokens during synthesis.
220
+ - **Codebase refactored** -- extracted 438 lines from `index.ts` into modular formatters (`src/formatters/`) reducing cognitive complexity from 360 to ~60 and maintainability index from 11.2 to ~40+
221
+ - **Removed codebase search confusion** -- clarified that `greedy_search` is WEB SEARCH ONLY (not for searching local code)
212
222
 
213
223
  ### v1.6.0 (2026-03-29)
214
- - **Merged deep_research into greedy_search** new `depth` parameter: `fast` (1 engine), `standard` (3 engines + synthesis), `deep` (3 engines + fetch + synthesis + confidence)
215
- - **Simpler API** one tool with clear speed/quality tradeoffs instead of separate tools with overlapping flags
216
- - **Backward compatible** `deep_research` still works as alias, `--synthesize` and `--deep-research` flags still function
217
- - **Updated documentation** README and skill docs now use `depth` parameter throughout
224
+ - **Merged deep_research into greedy_search** -- new `depth` parameter: `fast` (1 engine), `standard` (3 engines + synthesis), `deep` (3 engines + fetch + synthesis + confidence)
225
+ - **Simpler API** -- one tool with clear speed/quality tradeoffs instead of separate tools with overlapping flags
226
+ - **Backward compatible** -- `deep_research` still works as alias, `--synthesize` and `--deep-research` flags still function
227
+ - **Updated documentation** -- README and skill docs now use `depth` parameter throughout
218
228
 
219
229
  ### v1.5.1 (2026-03-29)
220
- - Fixed npm package added `.pi-lens/` and test files to `.npmignore`
230
+ - Fixed npm package -- added `.pi-lens/` and test files to `.npmignore`
221
231
 
222
232
  ### v1.5.0 (2026-03-29)
223
233
 
224
- - **Code extraction fixed** `coding_task` now uses clipboard interception to preserve markdown code blocks (was losing them via DOM scraping)
225
- - **Chrome targeting hardened** all tools now consistently target the dedicated GreedySearch Chrome via `CDP_PROFILE_DIR`, preventing fallback to user's main Chrome session
226
- - **Shared utilities** extracted ~220 lines of duplicate code from extractors into `common.mjs` (cdp wrapper, tab management, clipboard interception)
227
- - **Documentation leaner** skill documentation reduced 61% (180 70 lines) while preserving all decision-making info
228
- - **NO API KEYS** updated messaging to emphasize this works via browser automation, no API keys needed
234
+ - **Code extraction fixed** -- `coding_task` now uses clipboard interception to preserve markdown code blocks (was losing them via DOM scraping)
235
+ - **Chrome targeting hardened** -- all tools now consistently target the dedicated GreedySearch Chrome via `CDP_PROFILE_DIR`, preventing fallback to user's main Chrome session
236
+ - **Shared utilities** -- extracted ~220 lines of duplicate code from extractors into `common.mjs` (cdp wrapper, tab management, clipboard interception)
237
+ - **Documentation leaner** -- skill documentation reduced 61% (180 -> 70 lines) while preserving all decision-making info
238
+ - **NO API KEYS** -- updated messaging to emphasize this works via browser automation, no API keys needed
229
239
 
230
240
  ### v1.4.2 (2026-03-25)
231
241
 
232
- - **Fresh isolated tabs** each search now always creates a new `about:blank` tab via `Target.createTarget` and refreshes the CDP page cache immediately after, preventing SPA navigation failures and stale DOM state from prior queries
233
- - **Regex-based citation extraction** all extractors (Perplexity, Bing, Gemini) now parse sources from clipboard Markdown links (`[title](url)`) instead of DOM selectors that break on UI updates
234
- - **Relaxed verification detection** `consent.mjs` now uses broad keyword matching (`includes('verify')`, `includes('human')`) instead of anchored regexes, correctly catching button text variants like "Verify you are human" across Cloudflare, Microsoft, and generic modals
242
+ - **Fresh isolated tabs** -- each search now always creates a new `about:blank` tab via `Target.createTarget` and refreshes the CDP page cache immediately after, preventing SPA navigation failures and stale DOM state from prior queries
243
+ - **Regex-based citation extraction** -- all extractors (Perplexity, Bing, Gemini) now parse sources from clipboard Markdown links (`[title](url)`) instead of DOM selectors that break on UI updates
244
+ - **Relaxed verification detection** -- `consent.mjs` now uses broad keyword matching (`includes('verify')`, `includes('human')`) instead of anchored regexes, correctly catching button text variants like "Verify you are human" across Cloudflare, Microsoft, and generic modals
235
245
 
236
246
  ---
237
247
 
238
248
  ### v1.4.1
239
249
 
240
- - **Fixed parallel synthesis** multiple `greedy_search` calls with `synthesize: true` now run safely in parallel. Each search creates a fresh Gemini tab that gets cleaned up after synthesis, preventing tab conflicts and "Uncaught" errors.
250
+ - **Fixed parallel synthesis** -- multiple `greedy_search` calls with `synthesize: true` now run safely in parallel. Each search creates a fresh Gemini tab that gets cleaned up after synthesis, preventing tab conflicts and "Uncaught" errors.
241
251
 
242
252
  ### v1.4.0
243
253
 
244
- - **Grounded synthesis** Gemini now receives a normalized source registry with stable source IDs, agreement summaries, caveats, and cited claims
245
- - **Real deep research** top sources are fetched before synthesis so deep research answers are grounded in fetched evidence, not just engine summaries
246
- - **Richer source metadata** source output now includes canonical URLs, domains, source types, per-engine attribution, and confidence metadata
247
- - **Cleaner tab lifecycle** temporary Perplexity, Bing, and Google tabs are closed after each fan-out search, and synthesis finishes on the Gemini tab
248
- - **Isolated Chrome targeting** GreedySearch now refuses to fall back to your normal Chrome session, preventing stray remote-debugging prompts
254
+ - **Grounded synthesis** -- Gemini now receives a normalized source registry with stable source IDs, agreement summaries, caveats, and cited claims
255
+ - **Real deep research** -- top sources are fetched before synthesis so deep research answers are grounded in fetched evidence, not just engine summaries
256
+ - **Richer source metadata** -- source output now includes canonical URLs, domains, source types, per-engine attribution, and confidence metadata
257
+ - **Cleaner tab lifecycle** -- temporary Perplexity, Bing, and Google tabs are closed after each fan-out search, and synthesis finishes on the Gemini tab
258
+ - **Isolated Chrome targeting** -- GreedySearch now refuses to fall back to your normal Chrome session, preventing stray remote-debugging prompts
249
259
 
250
260
  ## License
251
261