npm - @apmantza/greedysearch-pi - Versions diffs - 1.9.1 → 2.0.0 - Mend

@apmantza/greedysearch-pi 1.9.1 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (34) hide show

package/CHANGELOG.md +110 -14
package/README.md +86 -41
package/bin/cdp.mjs +1153 -1108
package/bin/launch.mjs +11 -0
package/bin/search.mjs +886 -674
package/extractors/bing-copilot.mjs +528 -374
package/extractors/chatgpt.mjs +436 -0
package/extractors/common.mjs +837 -645
package/extractors/consensus.mjs +655 -0
package/extractors/consent.mjs +421 -388
package/extractors/gemini.mjs +335 -217
package/extractors/logically.mjs +567 -0
package/extractors/selectors.mjs +3 -2
package/extractors/semantic-scholar.mjs +219 -0
package/index.ts +2 -1
package/package.json +14 -6
package/skills/greedy-search/skill.md +9 -12
package/src/fetcher.mjs +8 -1
package/src/formatters/results.ts +163 -128
package/src/search/browser-lifecycle.mjs +27 -5
package/src/search/chrome.mjs +653 -590
package/src/search/constants.mjs +150 -39
package/src/search/engines.mjs +114 -76
package/src/search/fetch-source.mjs +566 -451
package/src/search/pdf.mjs +68 -0
package/src/search/recovery.mjs +51 -45
package/src/search/research.mjs +2579 -0
package/src/search/sources.mjs +77 -25
package/src/search/synthesis-runner.mjs +142 -57
package/src/search/synthesis.mjs +286 -246
package/src/tools/greedy-search-handler.ts +189 -45
package/src/tools/shared.ts +187 -186
package/src/types.ts +110 -104
package/test.mjs +1342 -534

package/CHANGELOG.md CHANGED Viewed

@@ -1,6 +1,115 @@
 # Changelog
-## [Unreleased]
+## [2.0.0] — 2026-06-07
+Major release consolidating ~6 weeks of work since 1.9.2: two new research engines (Semantic Scholar, Logically), deep-research structured output, configurable `all`-mode engines, ChatGPT and Gemini extractor rewrites that cut solo times from 71s → 8s, and full release/CI automation.
+### Added
+- **Release workflow** (`.github/workflows/release.yml`) — automatic version → tag → GitHub release → npm publish, triggered by every push to master. Mirrors the pi-lens release pattern. Three jobs: `prepare` (detects new version via tag absence, verifies CHANGELOG entry, runs `npm publish --dry-run`), `release` (creates `vX.Y.Z` tag, pushes it, creates a GitHub release with auto-generated notes), `publish-npm` (runs `npm publish`, gated on `NPM_TOKEN` secret). The tag check is the sole release guard — no need to also require `package.json` in the diff, which breaks when the version bump and changelog land in separate commits.
+- **Dependabot** (`.github/dependabot.yml`) — weekly npm dependency updates, grouped patch/minor, capped at 5 open PRs, labeled `dependencies` + `automated`.
+- **Lint & lockfile CI gate** (`.github/workflows/ci.yml`, `scripts/lint.mjs`, `scripts/check-lockfile.mjs`) — new `lint-and-lockfile` job that runs `npm run check:lockfile` (package.json ↔ package-lock.json sync) and `npm run lint` (cross-platform `node --check` on all 46 .mjs files) before any install. Both scripts are Node-only (no `find`/`grep`) so they work identically on Windows/macOS/Linux runners.
+- **Tarball entry-point verification** (`.github/workflows/ci.yml`) — CI now verifies every entry in `pi.extensions`, `pi.skills`, and `files` exists in the published tarball (not just `package.json`). Catches typos in the whitelist that would cause "module not found" at runtime.
+- **Extension-load check** (`.github/workflows/ci.yml`) — `npx jiti ./index.ts` smoke test on the globally-installed tarball that catches missing dependencies. The `pi-coding-agent` peer-dep absence is expected and ignored.
+- **CONTRIBUTING.md** — new document with the extractor authoring guide (clipboard interception, single-eval stream wait, language-agnostic selectors, registration in two places, headless fast-fail, recovery engine list, docs to update), and recovery-policy notes. Links to AGENTS.md for architecture details.
+### Added
+- **Semantic Scholar extractor and PDF source fetching** (`extractors/semantic-scholar.mjs`, `src/search/pdf.mjs`, `src/search/fetch-source.mjs`, `src/search/sources.mjs`) — New no-API academic-paper discovery engine registered as `semantic-scholar` / `semanticscholar` / `s2`. It searches `semanticscholar.org`, extracts ranked paper cards, TLDRs, authors, venues, citation counts, Semantic Scholar paper URLs, and direct PDF/external links when available. GreedySearch source fetching now parses direct PDFs with lazy-loaded `pdf-parse` so deep research can feed actual paper text to Gemini instead of relying on the synthesizer to browse links itself. Academic sources are classified and counted as primary research evidence.
+- **Logically extractor** (`extractors/logically.mjs`, `src/search/constants.mjs`, `bin/search.mjs`, `README.md`, `skills/greedy-search/skill.md`, `src/tools/greedy-search-handler.ts`) — New research engine for `logically.app/research-assistant` that submits a question through the ProseMirror editor, waits for the answer to stabilize, captures the rendered answer HTML, and opens the full `Citations (N)` popover to parse all Academic/Web citations. Academic cards provide citation counts, title, authors, venue, dates, fields, and snippets; Web URL blocks provide every cited URL occurrence. Inline answer citation popovers are still captured separately as `inlineSources`. Login/sign-up quota walls set `blockedBy: "signin"` / `verificationResult: "needs-human"`, participate in visible recovery, and visible tabs are activated before CDP typing so Logically accepts input reliably. Registered as `logically` / `log` in `ENGINES` and `logically` in `ENGINE_DOMAINS`. **Not in default `ALL_ENGINES`** — opt in via `~/.pi/greedyconfig` `engines` array.
+- **Experimental Consensus extractor, unregistered/deprecated** (`extractors/consensus.mjs`, `extractors/common.mjs`, `bin/cdp.mjs`) — Prototype support for `consensus.app` remains in-tree for future fuller CSV/source-download work, but it is no longer registered as a supported engine because the free quota is too easy to exhaust. The prototype submits a research query, waits for prose, expands references, and intercepts the `Export → .CSV` menu's `/api/papers/details/` POST response for rich source metadata when available. All selectors are language-agnostic.
+- **Browser-level CDP command (`browse`/`browserraw`)** (`bin/cdp.mjs`) — New CLI/daemon command for browser-target CDP methods (no `sessionId`). Adds `browserRawStr()` and forwards to the existing daemon socket. Needed for `Browser.setDownloadBehavior` and any future browser-level CDP method (permissions, geolocation, etc.). The evalraw path still works; browse is just the clean way to call browser-only methods.
+- **Visible recovery diagnostics log** (`bin/search.mjs`, `src/search/constants.mjs`) — Headless → visible fallback attempts now append structured JSONL events to the temp-file log `greedysearch-visible-recovery.jsonl`. Entries include scope (`single`/`all`), phase (`start`/`success`/`needs-human`), affected engines, extractor envelopes, visible result mode, duration, and last stage, without logging the raw query text. ChatGPT now participates in the same visible-recovery policy on headless timeout/verification signals, and its wrapper budget is 80s so the extractor's 30s stream wait plus 35s fallback can finish cleanly.
+### Fixed
+- **ChatGPT extractor skipped the static homepage greeting card** (`extractors/chatgpt.mjs`) — chatgpt.com renders a pre-baked `[data-message-author-role="assistant"]` greeting ("Hello! How can I help you today?") with `data-turn-start-message="true"` on the homepage before any conversation happens. The old `waitForResponse` / `pollForResponseNodeSide` / `extractAnswerFromDom` all read `Array.from(...).at(-1)`, so when a typed query never produced a real response, the length check fired on the 32-char placeholder and the DOM fallback returned "Hello! How can I help you today?" as a successful answer. All three now find the assistant message that comes AFTER the last user message in DOM order, and `extractAnswer` throws a clear `blockedBy: "no-response"` error if no real response exists. Visible & headless smoke now return real ChatGPT answers (e.g. "Hello! 👋").
+- **ChatGPT copy button clicked the user's message** (`extractors/chatgpt.mjs`) — `extractAnswer` used `document.querySelectorAll('${COPY_SELECTOR}')[buttons.length - 1]`, which picked the absolute last copy button on the page. When the assistant response was still empty (0 chars, no copy button of its own), that was the USER message's copy button — so the clipboard interceptor captured the user's query text and the extractor returned it as a "successful" answer (e.g. "What is the capital of France?" instead of "Paris"). Now the click targets the copy button on the assistant message after the last user message, with the old behaviour as a last-resort fallback.
+- **ChatGPT stream wait was 65s for short answers** (`extractors/chatgpt.mjs`) — the two custom `waitForResponse` / `pollForResponseNodeSide` polls had a hardcoded 50-char `_minLen` (a safety margin for the static greeting card) and ran sequentially for 30s + 35s. A short answer like "Hello! 👋" (8 chars) never reached the threshold and burned the full 65s budget. Replaced with a single `waitForStreamComplete` call (the same shared helper Perplexity and Gemini use) with `minLength: 1` and a 20s timeout, plus a 15s node-side fallback for throttled `all`-mode tabs. Solo ChatGPT runs now complete in ~8s (down from 71s); warm `all`-mode runs in ~8s (down from 50s).
+- **Research engine configuration and Consensus deprecation** (`src/search/constants.mjs`, `src/search/research.mjs`, `README.md`, `skills/greedy-search/skill.md`, `src/tools/greedy-search-handler.ts`) — Deep research child searches now explicitly follow the normal GreedySearch configuration pattern by reusing `~/.pi/greedyconfig.engines`, while Gemini remains the research planner/final-report synthesizer. Consensus is no longer registered in the engine map because it hits the free quota too easily; the extractor file remains in-tree for future fuller CSV/source-download work, but it is not exposed as a supported engine.
+- **Logically full citations in headless mode** (`extractors/logically.mjs`) — The answer and inline citation popovers worked headless, but the full `Citations (N)` popover sometimes rendered without the visible-mode `All / Academic (N) / Web (N)` header and old inline citation popovers remained mounted. Full citation extraction now removes stale inline citation popovers before opening the full citations control, detects headerless card-list popovers, and falls back to the button's citation count when tab headers are absent. Headless smoke now captures the full citation list instead of falling back to inline sources only.
+- **Consensus stale Clerk session recovery retained in prototype** (`extractors/consensus.mjs`) — The now-unregistered Consensus prototype still detects the broken Clerk handshake state (`clerk.consensus.app`, `session-token-expired`, `refresh_request_origin_azp_mismatch`, or Chrome's `HTTP ERROR 405` error page), clears only `https://consensus.app` and `https://clerk.consensus.app` storage, then retries navigation once. If re-enabled later, this avoids wiping the entire GreedySearch profile while recovering from corrupted Consensus auth cookies.
+- **`getOrOpenTab` stealth race condition** (`extractors/common.mjs`) — The CDP daemon processes commands concurrently, not sequentially. `getOrOpenTab` was firing `injectHeadlessStealth` and immediately returning, racing the next `Page.navigate` from the extractor. The new document was created before the stealth `addScriptToEvaluateOnNewDocument` registration landed, so headless fingerprinting (webdriver, UA client hints, hardware concurrency) fired on the new page and consensus.app bounced the request to a sign-up wall. Now `getOrOpenTab` awaits the stealth injection before returning the tab id. Comment in the code documents the race so the next person doesn't undo it.
+- **`~/.pi/greedyconfig` unknown engines/synthesizers now warn** (`src/search/constants.mjs`) — When the user's config file referenced engine names that don't exist in `ENGINES` (typo, deprecated engine, or engine not yet shipped), the loader silently filtered them out. A user with `["perplexity", "bogus", "google"]` would get a 2-engine fan-out with no indication anything was wrong. Same for `synthesizer`: a typo would silently fall back to the default. Now both paths emit a stderr warning listing the unknown names, the available alternatives, and the resolved fallback when the config has no valid entries. The valid subset of a partially-bad config is still used as before; only the unknown names are dropped with a notice.
+### Fixed
+- **ChatGPT extraction under parallel load** (`extractors/chatgpt.mjs`, `bin/launch.mjs`, `src/search/engines.mjs`) — chatgpt-search was timing out at the orchestrator's 70s budget even when the response was actually rendered in the DOM. Three root causes, three fixes. (1) `waitForResponse` was waiting on the copy-button count as an indirect proxy — ChatGPT's React re-renders make that count fluctuate, so the "stable for 3 rounds" condition rarely fired. Rewritten to use the latest assistant message's `innerText.length` as the direct signal (text ≥50 chars stable for 3 rounds ≈ 2.4s of polling). (2) Chrome was clamping `setTimeout` to 1Hz in background tabs, so 4 parallel engines starved each other. Added `--disable-background-timer-throttling`, `--disable-renderer-backgrounding`, and `--disable-backgrounding-occluded-windows` to the GreedySearch Chrome launch flags. (3) The `stream-wait` budget was 60s — too long under any throttle and wasted the wrapper's timeout. Lowered to 30s with a 35s node-side fallback that releases the WebSocket between polls. Synth-mode tests now pass 83/83 with chatgpt-search consistently completing in ~12s.
+- **Engine timeout diagnostic shadowing bug** (`src/search/engines.mjs`) — the `setTimeout` callback in `runExtractor` was shadowing the outer `err` string with `const err = new Error(...)`, then calling `tailLines(err)` on the next line with the Error object. That crashed with `TypeError: s.split is not a function` and lost the partial stderr capture. Renamed the new error to `errObj` and added `String(s ?? "")` to `tailLines` so it never throws on non-strings.
+### Added
+- **Research mode promoted to structured dataroom-style output** (`src/search/research.mjs`, `bin/search.mjs`, `src/tools/greedy-search-handler.ts`) — `depth: "research"` now writes a bundle by default under `.pi/greedysearch-research/<timestamp>_<query>/` with `STATUS.md`, `OUTLINE.md`, `reports/SUMMARY.md`, `reports/CLAIMS.md`, `reports/GAPS.md`, fetched `sources/`, and machine-readable `data/manifest.json` / `rounds.json` / `sources.json`. Added `--research-out-dir`, `--no-research-bundle`, and matching tool parameters `researchOutDir` / `writeResearchBundle`.
+- **Research completion floor, question ledger, source evidence extraction, and citation audit metadata** (`src/search/research.mjs`, `src/formatters/results.ts`) — Research runs now maintain a STATUS-style open/closed question ledger, run goal-based evidence extraction over fetched sources, ask Gemini to mark answered questions and propose new ones, and compute deterministic floor checks around required/root question closure, fetched source count, primary-source coverage, quality score, structured claims, citations, and unfetched citations. Newly discovered follow-up questions remain visible handoff gaps instead of making every short run partial. The formatted tool result surfaces floor status, stop reason, evidence counts, question progress, and bundle path.
+### Fixed
+- **Bing Copilot login wall detection** (`extractors/bing-copilot.mjs`, `src/search/recovery.mjs`) — Copilot now gates the chat behind Microsoft/Apple/Google sign-in on fresh sessions. The extractor previously timed out waiting for `#userInput` and returned a confusing "input not found" error. Added `detectSignInWall()`: language-agnostic detection that checks whether the chat input is missing **and** the page contains links to known OAuth endpoints (`login.microsoftonline.com`, `appleid.apple.com`, `accounts.google.com`). Called before the 15s input wait and again on timeout fallback, so the error now reads "Copilot requires sign-in — please sign in with Microsoft, Apple, or Google in the visible browser window. Once signed in, cookies persist for future runs." Recovery patterns now include `sign.in|login required` so the runner surfaces `_needsHumanVerification` and keeps visible Chrome open for the user.
+- **Gemini Material icon names changed** (`extractors/selectors.mjs`) — Google updated the icon data attributes in Gemini's UI: `content_copy` → `copy` and `send` → `arrow_upward`. The `.send-button` class was also removed. Updated selectors to match the new attributes so copy-button detection and message submission work again in both headless and visible modes.
+- **Bing Copilot deprecated from default `all` fan-out** (`src/search/constants.mjs`, `src/tools/shared.ts`, `bin/search.mjs`, `src/tools/greedy-search-handler.ts`) — Copilot now requires Microsoft/Apple/Google sign-in on fresh sessions, causing frequent failures in multi-engine searches. Removed `"bing"` from `ALL_ENGINES`. The `engine: "all"` path now fans out to Perplexity and Google only. Bing Copilot remains available as an explicit single-engine option (`engine: "bing"`) for users who have signed in. Updated tool description, README, and skill docs to reflect the change.
+- **ChatGPT extractor added** (`extractors/chatgpt.mjs`, `extractors/common.mjs`, `bin/search.mjs`) — New engine that navigates `chatgpt.com`, types into the ProseMirror editor, submits via the send button, and extracts answers + inline source citations via clipboard interception. ChatGPT copies markdown with reference-style links (`[text][1]` + `[1]: url "title"`), so `parseSourcesFromMarkdownRefStyle()` was added to `common.mjs`. The extractor uses a single `Runtime.evaluate` with in-browser polling for stream completion (matching Perplexity/Bing), avoiding CDP contention when running in parallel with other engines. Clipboard interception now captures text before attempting native clipboard writes, swallows native clipboard failures that caused misleading in-page "failed to copy" toasts, and falls back to language-agnostic assistant DOM extraction when clipboard capture is genuinely empty. Works in both headless and visible modes without requiring login. Registered as `chatgpt` / `gpt` in `ENGINES`.
+- **Configurable `ALL_ENGINES` and synthesizer via `~/.pi/greedyconfig`** (`src/search/constants.mjs`, `src/search/synthesis-runner.mjs`) — Users can now customize which engines participate in the `"all"` fan-out by creating `~/.pi/greedyconfig` with `{ "engines": ["perplexity", "google", "chatgpt"], "synthesizer": "gemini" }`. The file is auto-created with the default set on first run. Optional all-search synthesis is now routed through an engine-agnostic synthesis runner with `gemini` as the default and `chatgpt` supported as an opt-in synthesizer. `src/tools/shared.ts` re-exports from constants.mjs instead of hardcoding, and the progress regex matches all known engines.
+- **Research Gemini prompts no longer hit Windows `ENAMETOOLONG`** (`bin/cdp.mjs`, `extractors/common.mjs`, `extractors/gemini.mjs`) — Long research planning/learning prompts are now passed from the Gemini extractor to `cdp.mjs type` through stdin instead of command-line arguments.
+- **Research starts Chrome in the intended mode before opening tabs** (`bin/search.mjs`) — `search.mjs` now establishes the headless/visible environment before `ensureChrome()`, preventing stale visible recovery browsers from making Gemini planning/synthesis appear visible on subsequent default-headless research runs.
+- **Research direct-URL fetch actions work in ESM** (`src/search/research.mjs`) — Replaced a CommonJS `require("./sources.mjs")` in the direct `fetchUrl` path with normal ESM imports, avoiding runtime failures when Gemini plans a direct primary-source fetch.
+- **Gemini synthesis no longer kills Chrome after opening its tab** (`src/search/browser-lifecycle.mjs`, `src/search/chrome.mjs`, `src/search/synthesis-runner.mjs`, `bin/search.mjs`) — Fixed two Chrome lifecycle regressions that produced `No target matching prefix` during multi-engine synthesis. Child-process stale-session cleanup now verifies GreedySearch Chrome command lines with path-separator-insensitive profile matching, so Windows backslash/forward-slash differences do not make a live Chrome look like a ghost process. Mode detection now prefers the live Chrome command line over the stale mode marker file, preventing visible-mode synthesis from killing visible Chrome immediately after opening the Gemini tab. Added unit coverage for both cases.
+- **Gemini copy-button targeted `model-response` specifically** (`extractors/gemini.mjs`) — the absolute last copy button on the page is not always the assistant's response copy button (the page has many Material icon copy buttons: copy link, copy code, etc.). When the `model-response` custom element was empty, `extractAnswer` was clicking the wrong copy button and the clipboard interceptor captured the user's query — the extractor then returned the query as the answer. Now the click targets the copy button inside the `model-response` element, with a DOM fallback that reads `model-response` innerText (stripping the locale-specific "Gemini said" / "Το Gemini είπε" label) when the clipboard contains the echoed query.
+- **SonarCloud dead `=== "true"` check dropped** (`extractors/gemini.mjs:152`) — the eval expression `(() => { ... })()` only ever returns a boolean (`false` when no `model-response` element, or `t.length > 20`), so `ready === "true"` was always false. Simplified to `if (ready === true)`. Resolves the SonarCloud MAJOR BUG flagged in the leak-period view.
+### Documentation
+- **AGENTS.md** updated to document the new release workflow (the `lint-and-lockfile` → `install-test` → `release` pipeline, NPM_TOKEN gating, manual `workflow_dispatch` override, versioning guidelines), the new engine notes for ChatGPT and Gemini (greeting-card skip, copy-button targeting, single-eval stream wait), the research-mode hardening details (ledger cap, academic fetch injection, social-source guardrail), and the updated extractor timeout budgets (ChatGPT 20s in-browser + 15s node-side).
+- **README.md** cleaned up (split parameters section, removed anti-detection/known-quirks sections), and the **CHANGELOG** reformatted to use the new `## [X.Y.Z] — YYYY-MM-DD` heading convention that the release workflow's `grep -q "^## \[$VERSION\]"` check expects.
+### Changed
+- **Search modes simplified around grounded sources** (`bin/search.mjs`, `src/tools/greedy-search-handler.ts`, `README.md`, `skills/greedy-search/skill.md`) — `engine: "all"` is now the main grounded search mode: it fans out to configured engines, builds a source registry, fetches top source content, and returns confidence metadata by default. Synthesis is now an explicit opt-in via `synthesize: true` / `--synthesize` and uses the configured `synthesizer` (`gemini` default, `chatgpt` supported). Deep research remains a separate workflow via `depth: "research"`. Legacy `depth: "fast" | "standard" | "deep"` values remain accepted for compatibility (`fast` skips source fetching; `standard`/`deep` request synthesis) but are no longer the primary API.
+- **Research-mode hardening** (`src/search/research.mjs`, `src/search/sources.mjs`) — three changes to improve deep-research output quality. (1) The open-question ledger is capped at `MAX_OPEN_FOLLOWUPS = 5` per round — overflow "Discovered gap/follow-up" questions auto-resolve with evidence rather than carrying forward, keeping the floor check (`computeResearchFloor.requiredQuestions`) meaningful. (2) `pickAcademicFetchTargets()` injects a `fetchUrl` action for the top 1-2 academic sources (`arxiv.org`, `semanticscholar.org`, `doi.org`) when `combinedSources` contains one and no round action is already a `fetchUrl` — forces PDF/academic text to actually be fetched and synthesized. (3) Social-source penalty increased from −12 to −20, plus a post-sort hard guardrail: sources are split into `nonSocial` and `socialSources`, each sorted independently, then concatenated `[...nonSocial, ...socialSources]` before the top-12 slice. A social source can no longer land as S1 even with high composite scores.
+- **ChatGPT joined visible recovery** (`src/search/recovery.mjs`, `bin/search.mjs`) — the typed query can succeed in headless while the assistant response never streams in. Added ChatGPT to `HEADLESS_RECOVERY_ENGINES` so the same headless → visible fallback that helps Bing/Perplexity also fires on ChatGPT timeout/verification signals. The wrapper budget for ChatGPT was bumped to 80s so the extractor's 20s in-browser wait plus 15s node-side fallback can finish cleanly under the visible retry.
+### Removed
+## [1.9.2] — 2026-05-25
+### Added
+- **Iterative research mode** (`bin/search.mjs`, `src/search/research.mjs`) — Added `--research` / `--depth research` and `greedy_search({ depth: "research" })`. The new mode plans focused follow-up queries, runs fast multi-engine searches, fetches and deduplicates sources, extracts compact learnings/gaps with Gemini, and writes a final cited report. Optional knobs: `breadth` (1-5), `iterations` (1-3), and `maxSources` (3-12). Research mode now fills under-planned breadth with deterministic fallback query angles so `breadth: 3` actually fans out even when Gemini is conservative.
+### Fixed
+- **Pi update dependency install is leaner** (`package.json`, `package-lock.json`) — Moved the direct `@sinclair/typebox` import into runtime dependencies and marked the Pi host peer as optional so npm does not auto-install a full nested `@earendil-works/pi-coding-agent` tree during git-package updates. This keeps `pi update` focused on GreedySearch runtime deps (`jsdom`, `@mozilla/readability`, `turndown`) and avoids partial installs that leave `jsdom/package.json` missing.
+- **Pi TUI peer import no longer required at load time** (`src/tools/greedy-search-handler.ts`) — Replaced the direct `@earendil-works/pi-tui` runtime import with a tiny local `Text` component implementation so Pi/jiti extension import works even when optional TUI peer packages are not installed locally.
+- **Research unit tests no longer require fetcher dependencies at import time** (`src/search/research.mjs`) — Research mode now lazy-loads source fetching/file-output helpers only during live research execution, keeping pure planning/normalization unit tests runnable in CI's tarball install simulation without local `node_modules`.
+- **Research query sanitizer avoids ReDoS hotspot** (`src/search/research.mjs`) — Replaced markdown-link cleanup regexes with bounded string scanning and manual whitespace collapse, resolving the SonarCloud super-linear regex hotspot while preserving `site:[label](url)` query cleanup.
+- **Research source quality cleanup** (`src/search/sources.mjs`, `src/search/research.mjs`) — Social/login-wall domains (`facebook.com`, `linkedin.com`, `x.com`, etc.) now receive a strong ranking penalty unless the query explicitly targets that platform. Research source dedupe now uses the same composite score as normal source ranking, per-round learning extraction errors are recorded in `_research.rounds[].learningError`, child-search stderr forwarding is filtered so noisy page CSS/HTML cannot flood research logs, and markdown links in Gemini-generated follow-up queries are sanitized before search.
+- **Bing headless stealth hardening** (`extractors/common.mjs`, `bin/launch.mjs`) — Adopted low-risk ideas from Obscura's stealth model: `navigator.webdriver` now resolves to `undefined` instead of `false`, navigator plugins/mimeTypes/mediaDevices/connection/pdfViewer/platform/vendor are made more Chrome-like, patched functions stringify as `[native code]`, canvas noise is stable per page instead of random on each call, and Chrome launches with `--lang=en-US` plus `--force-color-profile=srgb`. Live Bing headless smoke passed after the change without visible recovery.
+- **Research/Bing false recovery fixed** (`bin/search.mjs`, `extractors/bing-copilot.mjs`, `extractors/consent.mjs`) — Research child searches no longer mark Bing/Perplexity failed before visible recovery has a final status, Bing fast-mode keeps a bounded 40s parent budget, and Bing's short-mode stream wait caps at 25s so research can extract rendered partial answers before timing out. Bing verification detection now reuses the DOM-based `handleVerification` detector instead of scanning accessibility text for generic words like “Cloudflare” or “challenge”, preventing false visible-recovery trips when the user query/answer is about anti-bot systems. Added locale-agnostic DOM/accessibility fallback extraction that picks the assistant article without relying solely on English “Copilot said” labels.
 ## [1.9.1] — 2026-05-23
@@ -18,19 +127,6 @@
 - **Gemini tab no longer steals focus during synthesis** (`bin/search.mjs`) — Removed the `activateTab` call on the pre-navigated Gemini tab. `Target.activateTarget` was restoring the minimized Chrome window mid-search; CDP synthesis operates on the target ID directly and has no need for the tab to be Chrome's active tab.
-### Changed
-- **Result file auto-purge** (`src/search/output.mjs`) — On each search run, files older than 7 days are deleted from the results directory. The 10 most recent files are always kept regardless of age. Runs inside `resultsDir()` so it's transparent and zero-overhead.
-- **`greedy_search` tool: collapsed rendering** (`src/tools/greedy-search-handler.ts`) — Added `renderCall` and `renderResult` hooks. The call line shows the query (truncated to 60 chars) and engine. The result collapses to a one-line summary: synthesis path shows source count + consensus label; single-engine path shows source count; human-verification path shows a warning. Full output is available via expand (Ctrl+O). Also migrated peer deps from `@mariozechner/pi-coding-agent` to `@earendil-works/pi-coding-agent` and added `@earendil-works/pi-tui` for the `Text` primitive.
-- **Headless stealth hardening** (`bin/launch.mjs`, `extractors/common.mjs`) — Four fingerprinting gaps closed:
-  - **UA version auto-detected** — `getChromeVersion` reads the versioned sub-directory inside the Chrome Application folder (e.g. `148.0.7778.168/`) to extract the real major version, then injects it into the `--user-agent` flag. Eliminates the TLS/UA mismatch that was caused by the hardcoded `Chrome/131` string (actual binary was `Chrome/148`).
-  - **`navigator.userAgentData`** — Spoofed to match the detected UA version and remove any `HeadlessChrome` brand entry. `getHighEntropyValues()` returns consistent architecture, platform, and full version list.
-  - **`window.outerWidth/Height`** — Patched from `0` (headless default) to mirror `innerWidth/Height`. A zero outer dimension is a well-known one-signal bot detector.
-  - **`screen.colorDepth/pixelDepth`** — Ensured to report `24` when unset.
-  - **GPU rendering re-enabled in headless** — Removed `--disable-gpu` and `--disable-software-rasterizer`. With `--headless=new`, Chrome uses hardware GPU acceleration (ANGLE/Direct3D on Windows), producing canvas and WebGL output identical to visible mode. Cloudflare Turnstile passes automatically on Perplexity without triggering visible-mode retry.
 ## [1.9.0] — 2026-05-22
 ### Added

package/README.md CHANGED Viewed

@@ -5,8 +5,9 @@
 Multi-engine AI web search for Pi via browser automation.
 - No API keys
-- Real browser results (Perplexity, Bing Copilot, Google AI)
-- Optional Gemini synthesis with source grounding
+- Real browser results from configurable engines: Perplexity, Google AI, ChatGPT, Gemini, plus opt-in Semantic Scholar and Logically research engines
+- Research mode as the centerpiece: iterative planning, source fetching, citation audit, and structured bundles
+- Optional configurable synthesis with source grounding (Gemini by default)
 - Chrome runs headless by default — no window, purely background
 ## Install
@@ -21,35 +22,55 @@ Or from git:
 pi install git:github.com/apmantza/GreedySearch-pi
 ```
-## Tools
+## Tool
-- `greedy_search` — multi-engine AI web search
-- `websearch` — lightweight DuckDuckGo/Brave search (via pi-webaio)
-- `webfetch` / `webpull` — page fetching and site crawling (via pi-webaio)
+- `greedy_search` — multi-engine AI web search, source-grounded synthesis, and deep research
 ## Quick usage
 ```js
-greedy_search({ query: "React 19 changes" });
-greedy_search({ query: "Prisma vs Drizzle", engine: "all", depth: "fast" });
+greedy_search({ query: "React 19 changes" }); // all engines + fetched sources
+greedy_search({ query: "React 19 changes", synthesize: true }); // add configured synthesis
+greedy_search({ query: "Prisma vs Drizzle", engine: "perplexity" }); // individual engine
 greedy_search({
-  query: "Best auth architecture 2026",
-  engine: "all",
-  depth: "deep",
+  query: "Evaluate browser automation options for AI agents",
+  depth: "research",
+  breadth: 3,
+  iterations: 2,
+  maxSources: 8,
 });
+// Research mode writes a dataroom-style bundle under .pi/greedysearch-research/ by default.
 // Headless is the default — no window. To force visible Chrome:
-greedy_search({ query: "Bing captcha setup", engine: "bing", visible: true });
+greedy_search({ query: "Visible browser setup", engine: "perplexity", visible: true });
 ```
 ## Parameters (`greedy_search`)
+### Common
 - `query` (required)
-- `engine`: `all` (default), `perplexity`, `bing`, `google`, `gemini`
-- `depth`: `standard` (default), `fast`, `deep`
 - `fullAnswer`: return full single-engine output instead of preview
 - `headless`: set to `false` to show Chrome window (default: `true`)
 - `visible` / `alwaysVisible`: set to `true` to always use visible Chrome for this search
+### Normal search
+- `engine`: `all` (default web/search fan-out), `perplexity`, `google`, `chatgpt`, `gemini`; opt-in research engines: `semantic-scholar`, `logically`; `bing` still works for signed-in users
+- `synthesize`: for `engine: "all"`, synthesize fetched sources with the configured synthesizer (default false)
+- `synthesizer`: override the configured synthesis engine for this call (`gemini` default, `chatgpt` supported)
+- `depth`: legacy `fast`/`standard`/`deep` aliases are still accepted for compatibility; prefer `synthesize` for normal search
+### Deep research
+- `depth`: set to `"research"` to run the iterative research workflow
+- `breadth`: number of research actions per round, 1-5 (default 3)
+- `iterations`: research rounds, 1-3 (default 2)
+- `maxSources`: fetched source cap for the final report, 3-12
+- `researchOutDir`: optional directory for the research bundle
+- `writeResearchBundle`: write the research bundle to disk (default true for research mode)
+Deep research uses the configured `~/.pi/greedyconfig.engines` list for child searches and Gemini for planning/final synthesis.
 ## Environment variables
 | Variable                             | Default       | Description                                                   |
@@ -60,11 +81,52 @@ greedy_search({ query: "Bing captcha setup", engine: "bing", visible: true });
 | `GREEDY_SEARCH_LOCALE`               | `en`          | Default result language (en, de, fr, es, ja, etc.)            |
 | `CHROME_PATH`                        | auto-detected | Path to Chrome/Chromium executable                            |
-## Depth modes
+## Search modes
+- **Individual engine search/research** — `engine: "perplexity" | "google" | "chatgpt" | "gemini" | "semantic-scholar" | "logically" | "bing"`; returns that engine's answer and sources.
+- **Grounded multi-engine search** — default `engine: "all"`; fans out to configured engines, ranks sources, fetches top source content, and reports confidence metadata.
+- **All + synthesis** — add `synthesize: true` (or CLI `--synthesize`) to ask the configured synthesizer to combine engine answers and fetched source evidence.
+- **Deep research** — `depth: "research"`; iterative action planning, direct URL fetches, fast multi-engine searches, source fetching, learning extraction, deterministic floor checks, citation audit, a final cited report, and a structured on-disk bundle.
-- `fast` - quickest, no synthesis/source fetching
-- `standard` - balanced default for `engine: "all"` (synthesis + fetched sources)
-- `deep` - strongest grounding and confidence metadata
+Legacy `depth: "fast" | "standard" | "deep"` values remain accepted for compatibility: `fast` skips source fetching; `standard`/`deep` request synthesis.
+Configure all-engine fan-out and synthesis in `~/.pi/greedyconfig`:
+```json
+{
+  "engines": ["perplexity", "google", "chatgpt", "gemini", "semantic-scholar", "logically"],
+  "synthesizer": "gemini"
+}
+```
+Gemini is a normal search engine and can participate in `engine: "all"`. Semantic Scholar and Logically are opt-in research engines; include them in `~/.pi/greedyconfig` only when you want the all-engine fan-out to include academic paper discovery or research-assistant workflows. Deep research child searches reuse the same configured `engines` list and keep query text on stdin; Gemini remains the research planner/final-report synthesizer. If `synthesize: true` and `"synthesizer": "gemini"`, Gemini runs once as a search engine and again as the synthesizer; set `"synthesizer": "chatgpt"` to separate those roles for normal all-search synthesis.
+Research bundles are written by default to `.pi/greedysearch-research/<timestamp>_<query>/` and include:
+```text
+STATUS.md              # floor status, open/closed question ledger, and gaps
+OUTLINE.md             # bundle table of contents
+reports/SUMMARY.md     # final cited report
+reports/CLAIMS.md      # extracted claims mapped to source IDs
+reports/EVIDENCE.md    # goal-based evidence extracted from fetched sources
+reports/GAPS.md        # caveats and remaining uncertainties
+sources/               # fetched source markdown files
+data/manifest.json     # run metadata, stop reason, floor checks, citation audit
+data/rounds.json       # per-round actions/learnings/gaps
+data/sources.json      # ranked source registry
+data/questions.json    # STATUS-style question ledger with evidence/source IDs
+data/evidence.json     # structured rational/evidence/summary per useful source
+```
+CLI controls:
+```bash
+node bin/search.mjs all --inline --stdin --depth research --breadth 3 --iterations 2 --max-sources 8 <<'EOF'
+Evaluate browser automation options for AI agents
+EOF
+node bin/search.mjs all "topic" --depth research --research-out-dir ./research-topic
+node bin/search.mjs all "topic" --depth research --no-research-bundle
+```
 ## Runtime commands
@@ -112,32 +174,15 @@ Chrome is auto-cleaned after 5 min idle. Override with `GREEDY_SEARCH_IDLE_TIMEO
 - Chrome
 - Node.js 20.11.0+
-## Known engine quirks
-### Bing Copilot
-Bing Copilot detects headless Chrome and sandboxes all AI responses inside nested iframes (`copilot.microsoft.com` → `copilot.fun` → `blob:`). In this mode the copy button is hidden and the Cloudflare Turnstile challenge blocks content delivery. The clipboard-based extraction cannot work.
-**Auto-recovery:** When Bing or Perplexity fails with a headless-only extraction error (clipboard, verification, timeout, Cloudflare), GreedySearch automatically switches to **visible Chrome** and retries, even in `fast` mode. If manual verification is required, the visible browser is left open and the tool returns instructions to solve the challenge and rerun the same search.
-If you prefer to skip the auto-recovery delay, launch visible Chrome ahead of time with `/greedy-visible`, set `GREEDY_SEARCH_ALWAYS_VISIBLE=1`, or pass `visible: true` to `greedy_search`.
-## Anti-detection
-Headless Chrome auto-injects stealth patches before any page JavaScript runs:
-- `navigator.webdriver` hidden, plugins/languages faked, `window.chrome` shimmed
-- WebGL vendor spoofed (Intel Iris), realistic hardware concurrency / memory
-- CDP automation markers deleted, `requestAnimationFrame` kept alive
-- Human-like click simulation with coordinate jitter and variable delays
-This bypasses casual bot detection (basic `navigator.webdriver` checks) but does not defeat commercial anti-bot services (DataDome, PerimeterX, Kasada). **Bing Copilot specifically detects headless and sandboxes responses behind Cloudflare Turnstile** — see [Known engine quirks](#known-engine-quirks) for the auto-recovery mechanism.
+## Source fetching
-When using `depth: "standard"` or `depth: "deep"`, source content is fetched and synthesized:
+When using `engine: "all"`, top source content is fetched by default. Add `synthesize: true` to synthesize with the configured synthesizer:
-- **Reddit** — Uses Reddit's public `.json` API for posts and comments (no scraping)
+- **PDFs** — Direct PDF links are parsed to markdown text for source-grounded synthesis
+- **Semantic Scholar** — Discovers academic papers and prefers direct PDF/external paper links when available
+- **Reddit** — Uses Reddit's public `.json` API for posts and comments
 - **GitHub** — Uses GitHub REST API for repos, READMEs, and file trees
-- **General web** — Mozilla Readability extraction with browser fallback for bot-blocked pages
+- **General web** — Mozilla Readability extraction with browser fallback when needed
 - **Metadata** — title, author/byline, site name, publish date, language, excerpt
 ## Project layout