@apmantza/greedysearch-pi 1.6.5 → 1.7.0
- package/CHANGELOG.md +14 -0
- package/README.md +102 -92
- package/cdp.mjs +1004 -0
- package/coding-task.mjs +392 -0
- package/extractors/bing-copilot.mjs +167 -0
- package/extractors/common.mjs +237 -0
- package/extractors/consent.mjs +273 -0
- package/extractors/gemini.mjs +160 -0
- package/extractors/google-ai.mjs +156 -0
- package/extractors/perplexity.mjs +141 -0
- package/extractors/selectors.mjs +52 -0
- package/launch.mjs +288 -0
- package/package.json +17 -2
- package/search.mjs +1504 -0
- package/src/fetcher.mjs +589 -0
- package/src/formatters/synthesis.ts +0 -9
- package/src/github.mjs +323 -0
- package/src/utils/content.mjs +56 -0
package/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,19 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## v1.6.5 (2026-04-04)
|
|
4
|
+
|
|
5
|
+
### Security
|
|
6
|
+
- **Private URL blocking** — Added validation to block requests to localhost, RFC1918 private addresses (10.x, 192.168.x), and .local/.internal domains. Prevents accidental exposure of internal services.
|
|
7
|
+
|
|
8
|
+
### Features
|
|
9
|
+
- **GitHub URL rewriting** — GitHub blob URLs (`github.com/owner/repo/blob/...`) are automatically rewritten to `raw.githubusercontent.com` for faster, cleaner raw file access.
|
|
10
|
+
- **GitHub repo cloning** — Root and tree URLs now trigger `git clone --depth 1` for complete repo access. Agent can explore files locally instead of parsing rendered HTML. Includes README preview and directory tree listing.
|
|
11
|
+
- **Head+tail content trimming** — Large documents now use smart truncation: keeps 75% from the beginning (introduction) + 25% from the end (conclusions/examples) with `[...content trimmed...]` marker, instead of simple truncation.
|
|
12
|
+
- **Anubis bot detection** — Added detection for the new Anubis proof-of-work anti-bot system (`protected by anubis`, `anubis uses a proof-of-work`).
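
The head+tail trimming entry above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function name `trimHeadTail` and the `maxChars` parameter are hypothetical, and the sketch assumes the marker is added on top of the character budget rather than counted against it.

```javascript
// Hypothetical helper (not taken from the package): keeps 75% of the
// character budget from the start of the text and 25% from the end.
function trimHeadTail(text, maxChars, marker = "\n[...content trimmed...]\n") {
  // Short documents pass through untouched.
  if (text.length <= maxChars) return text;

  const headLen = Math.floor(maxChars * 0.75); // introduction
  const tailLen = maxChars - headLen;          // conclusions / examples

  return text.slice(0, headLen) + marker + text.slice(text.length - tailLen);
}
```

Compared with cutting at a fixed offset, this keeps the end of a document, where conclusions and examples often sit, at the cost of dropping the middle.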
|
|
13
|
+
|
|
14
|
+
### Fixes
|
|
15
|
+
- **Perplexity clipboard retry** — Added single retry with 2s delay when clipboard extraction fails, improving reliability.
|
|
16
|
+
|
|
3
17
|
## v1.6.4 (2026-04-02)
|
|
4
18
|
|
|
5
19
|
### Fixes
|
package/README.md
CHANGED
|
@@ -1,14 +1,12 @@
|
|
|
1
1
|
# GreedySearch for Pi
|
|
2
2
|
|
|
3
|
-
Pi extension that adds `greedy_search`, `deep_research`, and `coding_task` tools
|
|
3
|
+
Pi extension that adds `greedy_search`, `deep_research`, and `coding_task` tools -- multi-engine AI search via browser automation. **NO API KEYS needed.**
|
|
4
4
|
|
|
5
|
-
Fans out queries to Perplexity, Bing Copilot, and Google AI simultaneously. Returns AI-synthesized answers with
|
|
5
|
+
Fans out queries to Perplexity, Bing Copilot, and Google AI simultaneously. Returns AI-synthesized answers with fetched source content. Streams progress as each engine completes.
|
|
6
6
|
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
## Quick Note
|
|
7
|
+
**New in v1.7.0:** HTTP-first source fetching with Mozilla Readability extraction (~3x faster), smart query-aware source ranking.
|
|
10
8
|
|
|
11
|
-
|
|
9
|
+
Forked from [GreedySearch-claude](https://github.com/apmantza/GreedySearch-claude).
|
|
12
10
|
|
|
13
11
|
## Install
|
|
14
12
|
|
|
@@ -24,10 +22,14 @@ pi install git:github.com/apmantza/GreedySearch-pi
|
|
|
24
22
|
|
|
25
23
|
## Quick Start
|
|
26
24
|
|
|
27
|
-
Once installed, Pi gains a `greedy_search` tool with
|
|
25
|
+
Once installed, Pi gains a `greedy_search` tool with two modes.
|
|
28
26
|
|
|
29
|
-
```
|
|
30
|
-
|
|
27
|
+
```javascript
|
|
28
|
+
// Default: multi-engine + source fetch + synthesis
|
|
29
|
+
greedy_search({ query: "What's new in React 19?" })
|
|
30
|
+
|
|
31
|
+
// Fast: single engine, no synthesis
|
|
32
|
+
greedy_search({ query: "What's new in React 19?", depth: "fast", engine: "perplexity" })
|
|
31
33
|
```
|
|
32
34
|
|
|
33
35
|
## Parameters
|
|
@@ -36,22 +38,25 @@ greedy_search({ query: "What's new in React 19?", depth: "standard" })
|
|
|
36
38
|
|-----------|------|---------|-------------|
|
|
37
39
|
| `query` | string | required | The search question |
|
|
38
40
|
| `engine` | string | `"all"` | `all`, `perplexity`, `bing`, `google`, `gemini` |
|
|
39
|
-
| `depth` | string | `"standard"` | `fast` (1 engine), `standard` (3 engines +
|
|
41
|
+
| `depth` | string | `"standard"` | `fast` (1 engine, no fetch), `standard` (3 engines + fetch + synthesis) |
|
|
40
42
|
| `fullAnswer` | boolean | `false` | Return complete answer (~3000+ chars) vs truncated preview (~300 chars) |
|
|
41
43
|
|
|
42
44
|
## Depth Levels
|
|
43
45
|
|
|
44
46
|
| Depth | Engines | Synthesis | Source Fetch | Time | Best For |
|
|
45
47
|
|-------|---------|-----------|--------------|------|----------|
|
|
46
|
-
| `fast` | 1 |
|
|
47
|
-
| `standard` | 3 |
|
|
48
|
-
| `deep` | 3 | ✅ | ✅ (top 5) | 60-180s | Research that matters — architecture decisions |
|
|
48
|
+
| `fast` | 1 | no | no | 10-30s | Quick lookup, single perspective |
|
|
49
|
+
| `standard` | 3 | yes | yes (top 5) | 15-30s | **Default** -- balanced, grounded answers |
|
|
49
50
|
|
|
50
|
-
|
|
51
|
+
**Standard mode** (default for `engine: "all"`): Queries 3 engines, fetches content from top 5 sources via HTTP (with Readability extraction), synthesizes grounded answer with citations.
|
|
52
|
+
|
|
53
|
+
**Fast mode**: Single engine, no source fetching or synthesis. Good for quick checks.
|
|
54
|
+
|
|
55
|
+
## Engines
|
|
51
56
|
|
|
52
57
|
| Engine | Alias | Best for |
|
|
53
58
|
|--------|-------|----------|
|
|
54
|
-
| `all` |
|
|
59
|
+
| `all` | - | **Default** -- all 3 engines with synthesis + source fetch |
|
|
55
60
|
| `perplexity` | `p` | Technical Q&A, code explanations, documentation |
|
|
56
61
|
| `bing` | `b` | Recent news, Microsoft ecosystem |
|
|
57
62
|
| `google` | `g` | Broad coverage, multiple perspectives |
|
|
@@ -62,66 +67,87 @@ greedy_search({ query: "What's new in React 19?", depth: "standard" })
|
|
|
62
67
|
When using `engine: "all"`, the tool streams progress as each engine completes:
|
|
63
68
|
|
|
64
69
|
```
|
|
65
|
-
**Searching...**
|
|
66
|
-
**Searching...**
|
|
67
|
-
**Searching...**
|
|
68
|
-
**Searching...**
|
|
70
|
+
**Searching...** pending: perplexity, bing, google
|
|
71
|
+
**Searching...** done: perplexity, pending: bing, google
|
|
72
|
+
**Searching...** done: perplexity, done: bing, pending: google
|
|
73
|
+
**Searching...** done: perplexity, done: bing, done: google
|
|
74
|
+
**Synthesizing...** with Gemini
|
|
69
75
|
```
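
For illustration only, status lines in the shape shown above could be produced by a small formatter like the one below. The name `formatProgress` and its arguments are hypothetical. The sketch lists engines in their configured order, which may differ from how the extension orders them.

```javascript
// Hypothetical formatter (not taken from the package) that reproduces the
// "done: ..., pending: ..." status line format shown in the README.
function formatProgress(engines, done) {
  // One "done: <engine>" entry per finished engine, in configured order.
  const parts = engines.filter((e) => done.has(e)).map((e) => `done: ${e}`);

  // Remaining engines are grouped into a single "pending: a, b" entry.
  const pending = engines.filter((e) => !done.has(e));
  if (pending.length > 0) parts.push(`pending: ${pending.join(", ")}`);

  return `**Searching...** ${parts.join(", ")}`;
}
```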
|
|
70
76
|
|
|
71
|
-
##
|
|
77
|
+
## Source Fetching (HTTP-First)
|
|
72
78
|
|
|
73
|
-
|
|
79
|
+
GreedySearch now uses **HTTP-first source fetching** with Mozilla Readability for content extraction:
|
|
74
80
|
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
81
|
+
- **HTTP**: Fast (~200-800ms), parallel, structured markdown output
|
|
82
|
+
- **Browser fallback**: Only when HTTP fails (bot protection, JS-heavy sites)
|
|
83
|
+
- **Typical success rate**: 90%+ of documentation sites work via HTTP
|
|
84
|
+
- **Speed improvement**: ~3x faster than browser-only fetching (15-30s vs 60-180s)
|
|
78
85
|
|
|
79
|
-
|
|
86
|
+
The old regex-based HTML stripping has been replaced with professional-grade content extraction that preserves document structure, code blocks, and headings.
|
|
80
87
|
|
|
81
|
-
|
|
82
|
-
- `standard` (default): 3 engines + synthesis. Good for most research.
|
|
83
|
-
- `deep`: Same + fetches source content for grounded answers. Use when the answer really matters.
|
|
88
|
+
## Smart Source Ranking
|
|
84
89
|
|
|
85
|
-
|
|
90
|
+
Sources are now ranked using query-aware domain boosting:
|
|
86
91
|
|
|
87
|
-
|
|
92
|
+
- **Query keywords** boost official docs (e.g., "react" → react.dev +10 points)
|
|
93
|
+
- **Consensus**: Sources found by multiple engines rank higher
|
|
94
|
+
- **Source type**: Official docs > repos > blogs > community
|
|
95
|
+
- **URL patterns**: `/docs/`, `/api/`, `/reference/` get extra boost
|
|
88
96
|
|
|
89
|
-
|
|
97
|
+
40+ tech stacks have preferred domain mappings including React, Node.js, Python, Rust, Go, Prisma, Supabase, and more.
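
The ranking signals listed above could be combined along these lines. This is a sketch under stated assumptions, not the package's scoring code: only the `+10` keyword boost comes from the README, while the other weights, the `PREFERRED_DOMAINS` entries, and the shape of a source object (`url`, `engines`, `type`) are illustrative.

```javascript
// Illustrative mapping only; the README says the real list covers 40+ stacks.
const PREFERRED_DOMAINS = { react: "react.dev", python: "docs.python.org" };

// Assumed weights for "official docs > repos > blogs > community".
const TYPE_SCORES = { official: 4, repo: 3, blog: 2, community: 1 };

function scoreSource(source, query) {
  const { hostname, pathname } = new URL(source.url);
  const host = hostname.replace(/^www\./, "");
  let score = 0;

  // Query keywords boost the matching official domain (+10, per the README).
  for (const word of query.toLowerCase().split(/\W+/)) {
    if (PREFERRED_DOMAINS[word] === host) score += 10;
  }

  // Consensus: each additional engine that found the source adds weight.
  score += 2 * (source.engines.length - 1);

  // Source type ordering.
  score += TYPE_SCORES[source.type] ?? 0;

  // URL patterns that usually indicate reference material.
  if (/\/(docs|api|reference)\//.test(pathname)) score += 2;

  return score;
}

function rankSources(sources, query) {
  // Sort a copy so the caller's array is left untouched.
  return [...sources].sort((a, b) => scoreSource(b, query) - scoreSource(a, query));
}
```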
|
|
90
98
|
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
|
|
99
|
+
## GitHub Content Extraction
|
|
100
|
+
|
|
101
|
+
GreedySearch handles GitHub URLs intelligently:
|
|
102
|
+
|
|
103
|
+
- **Blob URLs** (`/blob/`) — Automatically rewritten to `raw.githubusercontent.com` for instant raw file access
|
|
104
|
+
- **Tree/Root URLs** — Clones repo locally with `git clone --depth 1`, returns README preview + file tree for agent exploration
|
|
105
|
+
- **Benefits**: Real file contents (not rendered HTML), accurate line numbers, works with private repos via `gh` CLI auth
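
A minimal sketch of the blob URL rewrite described above, assuming the standard `github.com/owner/repo/blob/<ref>/<path>` layout. The function name `rewriteGithubBlobUrl` is hypothetical and the package's own implementation may handle more cases.

```javascript
// Hypothetical helper (not taken from the package): maps a GitHub blob URL
// to the matching raw.githubusercontent.com URL, leaving other URLs as-is.
function rewriteGithubBlobUrl(input) {
  const url = new URL(input);
  if (url.hostname !== "github.com") return input;

  // Expected path shape: /owner/repo/blob/<ref>/<path...>
  const match = url.pathname.match(/^\/([^/]+)\/([^/]+)\/blob\/(.+)$/);
  if (!match) return input; // root and tree URLs are handled elsewhere

  const [, owner, repo, refAndPath] = match;
  return `https://raw.githubusercontent.com/${owner}/${repo}/${refAndPath}`;
}
```

Root and tree URLs fall through unchanged here, which matches the README's split between rewriting blob URLs and cloning for everything else.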
|
|
106
|
+
|
|
107
|
+
## Security
|
|
108
|
+
|
|
109
|
+
- **Private URL blocking** — Requests to localhost, RFC1918 addresses (10.x, 192.168.x), and .local/.internal domains are automatically blocked
|
|
110
|
+
- **Cross-host redirect detection** — Detects redirects to authentication/login pages and falls back to browser extraction
|
|
111
|
+
- **File protocol blocking** — `file://` URLs are rejected
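
The blocking rules above might look roughly like the check below. This is a sketch, not the package's validator: `isBlockedUrl` is a hypothetical name, and the `172.16.0.0/12` range and loopback addresses are included as an assumption because they are part of RFC 1918 and common practice, even though the README only names `10.x` and `192.168.x`. It inspects the literal hostname only and does not resolve DNS.

```javascript
// Hypothetical validator (not taken from the package). Returns true when a
// URL should be rejected before any request is made.
function isBlockedUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return true; // unparseable input is rejected
  }

  // Only plain web protocols are allowed; this also rejects file:// URLs.
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host === "[::1]") return true;
  if (host.endsWith(".local") || host.endsWith(".internal")) return true;

  // Literal IPv4 addresses: loopback and RFC 1918 private ranges.
  const m = host.match(/^(\d{1,3})\.(\d{1,3})\.\d{1,3}\.\d{1,3}$/);
  if (m) {
    const a = Number(m[1]);
    const b = Number(m[2]);
    if (a === 127 || a === 10) return true;
    if (a === 192 && b === 168) return true;
    if (a === 172 && b >= 16 && b <= 31) return true; // assumption, see above
  }
  return false;
}
```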
|
|
94
112
|
|
|
95
113
|
## Examples
|
|
96
114
|
|
|
97
|
-
**
|
|
115
|
+
**Default research (multi-engine + sources + synthesis):**
|
|
98
116
|
|
|
99
|
-
```
|
|
100
|
-
greedy_search({ query: "
|
|
117
|
+
```javascript
|
|
118
|
+
greedy_search({ query: "Best practices for monorepo structure" })
|
|
101
119
|
```
|
|
102
120
|
|
|
103
|
-
**
|
|
121
|
+
**Quick lookup (fast):**
|
|
104
122
|
|
|
105
|
-
```
|
|
106
|
-
greedy_search({ query: "
|
|
123
|
+
```javascript
|
|
124
|
+
greedy_search({ query: "How to use async await in Python", depth: "fast", engine: "perplexity" })
|
|
107
125
|
```
|
|
108
126
|
|
|
109
|
-
**
|
|
127
|
+
**Compare tools:**
|
|
110
128
|
|
|
111
|
-
```
|
|
112
|
-
greedy_search({ query: "
|
|
129
|
+
```javascript
|
|
130
|
+
greedy_search({ query: "Prisma vs Drizzle in 2026" })
|
|
113
131
|
```
|
|
114
132
|
|
|
115
133
|
**Debug an error:**
|
|
116
134
|
|
|
135
|
+
```javascript
|
|
136
|
+
greedy_search({ query: "Error: Cannot find module 'react-dom/client' Next.js 15" })
|
|
117
137
|
```
|
|
118
|
-
|
|
138
|
+
|
|
139
|
+
## Full vs Short Answers
|
|
140
|
+
|
|
141
|
+
Default mode returns ~300 char summaries to save tokens. Use `fullAnswer: true` for complete responses:
|
|
142
|
+
|
|
143
|
+
```javascript
|
|
144
|
+
greedy_search({ query: "explain the React compiler", engine: "perplexity", fullAnswer: true })
|
|
119
145
|
```
|
|
120
146
|
|
|
121
147
|
## Requirements
|
|
122
148
|
|
|
123
|
-
- **Chrome**
|
|
124
|
-
- **Node.js 22+**
|
|
149
|
+
- **Chrome** -- must be installed. The extension auto-launches a dedicated Chrome instance on port 9222 with its own isolated profile and DevTools port file, separate from your main browser session.
|
|
150
|
+
- **Node.js 22+** -- for built-in `fetch` and WebSocket support.
|
|
125
151
|
|
|
126
152
|
## Setup (first time)
|
|
127
153
|
|
|
@@ -143,22 +169,6 @@ Check status:
|
|
|
143
169
|
node ~/.pi/agent/git/GreedySearch-pi/launch.mjs --status
|
|
144
170
|
```
|
|
145
171
|
|
|
146
|
-
## Testing
|
|
147
|
-
|
|
148
|
-
Run the test suite to verify everything works:
|
|
149
|
-
|
|
150
|
-
```bash
|
|
151
|
-
./test.sh # full suite (~3-4 min)
|
|
152
|
-
./test.sh quick # skip parallel tests (~1 min)
|
|
153
|
-
./test.sh parallel # parallel race condition tests only
|
|
154
|
-
```
|
|
155
|
-
|
|
156
|
-
Tests verify:
|
|
157
|
-
- Single engine mode (perplexity, bing, google)
|
|
158
|
-
- Sequential "all" mode searches
|
|
159
|
-
- Parallel "all" mode (5 concurrent searches) — detects tab race conditions
|
|
160
|
-
- Synthesis mode with Gemini
|
|
161
|
-
|
|
162
172
|
## Troubleshooting
|
|
163
173
|
|
|
164
174
|
### "Chrome not found"
|
|
@@ -180,7 +190,7 @@ node ~/.pi/agent/git/GreedySearch-pi/launch.mjs
|
|
|
180
190
|
|
|
181
191
|
### Google / Bing "verify you're human"
|
|
182
192
|
|
|
183
|
-
The extension auto-clicks verification buttons and Cloudflare Turnstile challenges using broad keyword matching
|
|
193
|
+
The extension auto-clicks verification buttons and Cloudflare Turnstile challenges using broad keyword matching -- resilient to variations like "Verify you are human" or localised button text. For hard CAPTCHAs (image puzzles), solve manually in the Chrome window that opens.
|
|
184
194
|
|
|
185
195
|
### Parallel searches failing
|
|
186
196
|
|
|
@@ -192,60 +202,60 @@ Chrome may be unresponsive. Restart it with `launch.mjs --kill` then `launch.mjs
|
|
|
192
202
|
|
|
193
203
|
### Sources are empty or junk links
|
|
194
204
|
|
|
195
|
-
Sources are now extracted by regex-parsing Markdown links (`[title](url)`) from the clipboard text captured after each engine responds
|
|
205
|
+
Sources are now extracted by regex-parsing Markdown links (`[title](url)`) from the clipboard text captured after each engine responds -- not from DOM selectors that break when the engine's UI updates. If sources are empty, the engine's clipboard copy didn't include formatted links (Bing Copilot currently falls into this category).
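
As a rough sketch of the regex approach described above, the helper below pulls `[title](url)` pairs out of a block of text. The name `extractMarkdownLinks` and the exact pattern are assumptions. This simple pattern stops at the first `)`, so URLs that contain parentheses would be cut short.

```javascript
// Hypothetical helper (not taken from the package): extracts unique
// Markdown links of the form [title](http...) from clipboard text.
function extractMarkdownLinks(text) {
  const pattern = /\[([^\]]+)\]\((https?:\/\/[^\s)]+)\)/g;
  const seen = new Set();
  const sources = [];

  for (const [, title, url] of text.matchAll(pattern)) {
    if (seen.has(url)) continue; // keep the first title seen for each URL
    seen.add(url);
    sources.push({ title, url });
  }
  return sources;
}
```

An empty result from a helper like this corresponds to the case described above, where the engine's clipboard copy contains no formatted links.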
|
|
196
206
|
|
|
197
207
|
## How It Works
|
|
198
208
|
|
|
199
|
-
- `index.ts`
|
|
200
|
-
- `search.mjs`
|
|
201
|
-
- `launch.mjs`
|
|
202
|
-
- `extractors/`
|
|
203
|
-
- `cdp.mjs`
|
|
204
|
-
- `skills/greedy-search/SKILL.md`
|
|
209
|
+
- `index.ts` -- Pi extension, registers `greedy_search` tool with streaming progress
|
|
210
|
+
- `search.mjs` -- CLI runner, spawns extractors in parallel, emits `PROGRESS:` events to stderr
|
|
211
|
+
- `launch.mjs` -- launches dedicated Chrome on port 9222 with isolated profile
|
|
212
|
+
- `extractors/` -- per-engine CDP scrapers (Perplexity, Bing Copilot, Google AI, Gemini)
|
|
213
|
+
- `cdp.mjs` -- Chrome DevTools Protocol CLI for browser automation
|
|
214
|
+
- `skills/greedy-search/SKILL.md` -- skill file that guides the model on when/how to use greedy_search
|
|
205
215
|
|
|
206
216
|
## Changelog
|
|
207
217
|
|
|
208
218
|
### v1.6.1 (2026-03-31)
|
|
209
|
-
- **Single-engine full answers by default**
|
|
210
|
-
- **Codebase refactored**
|
|
211
|
-
- **Removed codebase search confusion**
|
|
219
|
+
- **Single-engine full answers by default** -- `engine: "google"` (or any single engine) now returns complete answers instead of truncated previews. Multi-engine (`all`) still truncates to save tokens during synthesis.
|
|
220
|
+
- **Codebase refactored** -- extracted 438 lines from `index.ts` into modular formatters (`src/formatters/`) reducing cognitive complexity from 360 to ~60 and maintainability index from 11.2 to ~40+
|
|
221
|
+
- **Removed codebase search confusion** -- clarified that `greedy_search` is WEB SEARCH ONLY (not for searching local code)
|
|
212
222
|
|
|
213
223
|
### v1.6.0 (2026-03-29)
|
|
214
|
-
- **Merged deep_research into greedy_search**
|
|
215
|
-
- **Simpler API**
|
|
216
|
-
- **Backward compatible**
|
|
217
|
-
- **Updated documentation**
|
|
224
|
+
- **Merged deep_research into greedy_search** -- new `depth` parameter: `fast` (1 engine), `standard` (3 engines + synthesis), `deep` (3 engines + fetch + synthesis + confidence)
|
|
225
|
+
- **Simpler API** -- one tool with clear speed/quality tradeoffs instead of separate tools with overlapping flags
|
|
226
|
+
- **Backward compatible** -- `deep_research` still works as alias, `--synthesize` and `--deep-research` flags still function
|
|
227
|
+
- **Updated documentation** -- README and skill docs now use `depth` parameter throughout
|
|
218
228
|
|
|
219
229
|
### v1.5.1 (2026-03-29)
|
|
220
|
-
- Fixed npm package
|
|
230
|
+
- Fixed npm package -- added `.pi-lens/` and test files to `.npmignore`
|
|
221
231
|
|
|
222
232
|
### v1.5.0 (2026-03-29)
|
|
223
233
|
|
|
224
|
-
- **Code extraction fixed**
|
|
225
|
-
- **Chrome targeting hardened**
|
|
226
|
-
- **Shared utilities**
|
|
227
|
-
- **Documentation leaner**
|
|
228
|
-
- **NO API KEYS**
|
|
234
|
+
- **Code extraction fixed** -- `coding_task` now uses clipboard interception to preserve markdown code blocks (was losing them via DOM scraping)
|
|
235
|
+
- **Chrome targeting hardened** -- all tools now consistently target the dedicated GreedySearch Chrome via `CDP_PROFILE_DIR`, preventing fallback to user's main Chrome session
|
|
236
|
+
- **Shared utilities** -- extracted ~220 lines of duplicate code from extractors into `common.mjs` (cdp wrapper, tab management, clipboard interception)
|
|
237
|
+
- **Documentation leaner** -- skill documentation reduced 61% (180 -> 70 lines) while preserving all decision-making info
|
|
238
|
+
- **NO API KEYS** -- updated messaging to emphasize this works via browser automation, no API keys needed
|
|
229
239
|
|
|
230
240
|
### v1.4.2 (2026-03-25)
|
|
231
241
|
|
|
232
|
-
- **Fresh isolated tabs**
|
|
233
|
-
- **Regex-based citation extraction**
|
|
234
|
-
- **Relaxed verification detection**
|
|
242
|
+
- **Fresh isolated tabs** -- each search now always creates a new `about:blank` tab via `Target.createTarget` and refreshes the CDP page cache immediately after, preventing SPA navigation failures and stale DOM state from prior queries
|
|
243
|
+
- **Regex-based citation extraction** -- all extractors (Perplexity, Bing, Gemini) now parse sources from clipboard Markdown links (`[title](url)`) instead of DOM selectors that break on UI updates
|
|
244
|
+
- **Relaxed verification detection** -- `consent.mjs` now uses broad keyword matching (`includes('verify')`, `includes('human')`) instead of anchored regexes, correctly catching button text variants like "Verify you are human" across Cloudflare, Microsoft, and generic modals
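
The broad keyword matching mentioned in the last entry can be illustrated with a small predicate. This is a sketch only: the function name is hypothetical, and it assumes that either keyword on its own is enough to count as a verification button, which the changelog does not state explicitly.

```javascript
// Hypothetical predicate (not taken from consent.mjs): treats a button as a
// verification prompt when its label mentions "verify" or "human".
function looksLikeVerificationButton(label) {
  const text = label.trim().toLowerCase();
  return text.includes("verify") || text.includes("human");
}
```

Substring checks like these tolerate wording changes that an anchored pattern would miss, at the cost of occasionally matching unrelated buttons.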
|
|
235
245
|
|
|
236
246
|
---
|
|
237
247
|
|
|
238
248
|
### v1.4.1
|
|
239
249
|
|
|
240
|
-
- **Fixed parallel synthesis**
|
|
250
|
+
- **Fixed parallel synthesis** -- multiple `greedy_search` calls with `synthesize: true` now run safely in parallel. Each search creates a fresh Gemini tab that gets cleaned up after synthesis, preventing tab conflicts and "Uncaught" errors.
|
|
241
251
|
|
|
242
252
|
### v1.4.0
|
|
243
253
|
|
|
244
|
-
- **Grounded synthesis**
|
|
245
|
-
- **Real deep research**
|
|
246
|
-
- **Richer source metadata**
|
|
247
|
-
- **Cleaner tab lifecycle**
|
|
248
|
-
- **Isolated Chrome targeting**
|
|
254
|
+
- **Grounded synthesis** -- Gemini now receives a normalized source registry with stable source IDs, agreement summaries, caveats, and cited claims
|
|
255
|
+
- **Real deep research** -- top sources are fetched before synthesis so deep research answers are grounded in fetched evidence, not just engine summaries
|
|
256
|
+
- **Richer source metadata** -- source output now includes canonical URLs, domains, source types, per-engine attribution, and confidence metadata
|
|
257
|
+
- **Cleaner tab lifecycle** -- temporary Perplexity, Bing, and Google tabs are closed after each fan-out search, and synthesis finishes on the Gemini tab
|
|
258
|
+
- **Isolated Chrome targeting** -- GreedySearch now refuses to fall back to your normal Chrome session, preventing stray remote-debugging prompts
|
|
249
259
|
|
|
250
260
|
## License
|
|
251
261
|
|