@demigodmode/pi-web-agent 0.1.0
- package/README.md +176 -0
- package/dist/cache/ttl-cache.d.ts +8 -0
- package/dist/cache/ttl-cache.js +21 -0
- package/dist/extension.d.ts +2 -0
- package/dist/extension.js +114 -0
- package/dist/extract/readability.d.ts +8 -0
- package/dist/extract/readability.js +93 -0
- package/dist/fetch/browser-resolution.d.ts +15 -0
- package/dist/fetch/browser-resolution.js +55 -0
- package/dist/fetch/headless-fetch.d.ts +18 -0
- package/dist/fetch/headless-fetch.js +87 -0
- package/dist/fetch/http-fetch.d.ts +4 -0
- package/dist/fetch/http-fetch.js +50 -0
- package/dist/orchestration/index.d.ts +41 -0
- package/dist/orchestration/index.js +9 -0
- package/dist/orchestration/research-orchestrator.d.ts +43 -0
- package/dist/orchestration/research-orchestrator.js +87 -0
- package/dist/orchestration/research-types.d.ts +41 -0
- package/dist/orchestration/research-types.js +1 -0
- package/dist/orchestration/research-worker.d.ts +16 -0
- package/dist/orchestration/research-worker.js +131 -0
- package/dist/search/duckduckgo.d.ts +4 -0
- package/dist/search/duckduckgo.js +42 -0
- package/dist/tools/web-explore.d.ts +44 -0
- package/dist/tools/web-explore.js +50 -0
- package/dist/tools/web-fetch-headless.d.ts +6 -0
- package/dist/tools/web-fetch-headless.js +14 -0
- package/dist/tools/web-fetch.d.ts +6 -0
- package/dist/tools/web-fetch.js +14 -0
- package/dist/tools/web-search.d.ts +10 -0
- package/dist/tools/web-search.js +44 -0
- package/dist/types.d.ts +48 -0
- package/dist/types.js +7 -0
- package/package.json +68 -0
package/README.md
ADDED
@@ -0,0 +1,176 @@
+# pi-web-agent
+
+`@demigodmode/pi-web-agent` is a Pi package for reliable web access.
+
+It is built around a simple rule: searching for a page is not the same thing as reading it. This package keeps those steps separate, prefers plain HTTP by default, and is designed to say "I couldn't read this reliably" instead of making something up.
+
+## What it does
+
+The package is built around three tools:
+
+- `web_search` finds relevant pages and returns titles, URLs, and snippets
+- `web_fetch` fetches a specific page over plain HTTP and tries to extract readable content
+- `web_fetch_headless` is the explicit browser-based path for pages that need rendering
+
+The boundary between those tools is intentional.
+
+`web_search` is for discovery. It should not imply that a page was fetched.
+
+`web_fetch` is for reading a page over HTTP. If the result looks weak, incomplete, blocked, or too script-heavy, it should return `needs_headless` instead of bluffing.
+
+`web_fetch_headless` exists for the cases where a browser really is required. It is opt-in only.
+
+## Why this exists
+
+A lot of web tooling in coding agents gets fuzzy in exactly the wrong places. Search results get treated like page reads. Browser fallback happens behind the scenes. Failures get softened into fake confidence.
+
+This package is trying to do the opposite.
+
+The rules are straightforward:
+
+- no hidden browser launch
+- no automatic HTTP-to-headless fallback
+- no claiming a page was read when only snippets were available
+- explicit structured failure when the result is incomplete or blocked
+
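One way to picture "explicit structured failure" is a result type discriminated by a `status` field. The sketch below is only an illustration: the status names and the `{ code, message }` error shape mirror values that appear in the compiled code later in this diff, but `FetchResult` and `describeResult` are hypothetical names, and the package's real contract (in its `types.d.ts`) may differ.

```typescript
// Hypothetical result shape, modeled on status values visible in this diff.
// Assumption: the real exported types may carry more fields (metadata, byline, ...).
type FetchError = { code: string; message: string };

type FetchResult =
    | { status: 'ok'; url: string; content: { title?: string; text: string } }
    | { status: 'needs_headless' | 'blocked' | 'error'; url: string; error: FetchError }
    | { status: 'unsupported'; url: string };

// Turn a result into a one-line report without presenting a failure as a read.
function describeResult(result: FetchResult): string {
    switch (result.status) {
        case 'ok':
            return `read ${result.url} (${result.content.text.length} chars)`;
        case 'needs_headless':
            return `could not read ${result.url} over HTTP: ${result.error.message}`;
        case 'blocked':
        case 'error':
            return `failed to read ${result.url}: ${result.error.code}`;
        case 'unsupported':
            return `unsupported content at ${result.url}`;
    }
}

const weak: FetchResult = {
    status: 'needs_headless',
    url: 'https://example.com/app',
    error: { code: 'WEAK_EXTRACTION', message: 'HTTP extraction was not reliable enough.' }
};
console.log(describeResult(weak));
```

Because every branch is keyed on `status`, a caller cannot reach `content` without first confirming the read succeeded.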
+## What makes it different
+
+The main thing is the contract.
+
+`web_search` discovers sources.
+
+`web_fetch` reads over HTTP only.
+
+`web_fetch_headless` is the explicit browser path.
+
+That separation is the whole point. It makes failures easier to reason about and avoids the weird behavior where a tool quietly changes execution mode behind your back.
+
+## Install
+
+Install it through Pi:
+
+```bash
+pi install npm:@demigodmode/pi-web-agent
+```
+
+Update installed packages later with:
+
+```bash
+pi update
+```
+
+If you just want to inspect the package from npm directly, the package name is:
+
+```bash
+npm view @demigodmode/pi-web-agent
+```
+
+## Current status
+
+This repo is in early MVP shape, but it is no longer just a design doc.
+
+Right now it has:
+
+- a TypeScript project scaffold
+- shared result and status contracts
+- a DuckDuckGo HTML parser for `web_search`
+- an HTTP fetch path with readability-based extraction and conservative escalation to `needs_headless`
+- a real browser-backed `web_fetch_headless` implementation with local browser resolution
+- repo-local Pi extension wiring for development
+- a test suite around parser behavior, contracts, extraction, caching, and tool adapters
+- optional smoke coverage for local installed browsers
+
+So the project is real and usable, but still early.
+
+## Example behavior
+
+These are conceptual examples of the contract the package is aiming to expose.
+
+### Search
+
+`web_search("pi coding agent")`
+
+Returns discovery results like:
+
+- title
+- URL
+- snippet
+
+It does not imply the page was fetched.
+
+### HTTP fetch
+
+`web_fetch("https://example.com/article")`
+
+If the page is readable over plain HTTP, it should return extracted content.
+
+If the page looks too script-heavy, too thin, blocked, or otherwise unreliable, it should return `needs_headless` instead of pretending the extraction is good enough.
+
+### Explicit headless fetch
+
+`web_fetch_headless("https://example.com/app")`
+
+This is the browser-based path for pages that really need rendering.
+
+This path now launches a local browser explicitly, waits for the rendered page to settle, and then extracts readable content from the rendered HTML.
+
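Seen from the caller's side, the README's three examples imply that moving from HTTP to a browser is a decision the caller makes, not one the HTTP tool makes on its own. A minimal sketch of that flow, using stub fetchers in place of `web_fetch` and `web_fetch_headless`; `readPage`, `allowHeadless`, `Fetcher`, and `PageResult` are hypothetical names, and the real tools return richer results:

```typescript
// Simplified result shape for illustration only (an assumption, not the package's type).
type PageResult =
    | { status: 'ok'; url: string; text: string; method: 'http' | 'headless' }
    | { status: 'needs_headless' | 'error'; url: string; reason: string };

type Fetcher = (url: string) => Promise<PageResult>;

// The caller, not the HTTP fetcher, decides whether a browser may be used.
async function readPage(
    url: string,
    httpFetch: Fetcher,
    headlessFetch: Fetcher,
    allowHeadless: boolean
): Promise<PageResult> {
    const first = await httpFetch(url);
    if (first.status !== 'needs_headless' || !allowHeadless) {
        // HTTP was enough, it failed outright, or escalation was not permitted.
        return first;
    }
    return headlessFetch(url);
}

// Stub fetchers standing in for the real tools.
const stubHttp: Fetcher = async (url) => ({ status: 'needs_headless', url, reason: 'script shell' });
const stubHeadless: Fetcher = async (url) => ({ status: 'ok', url, text: 'rendered text', method: 'headless' });

// Without permission to escalate, the caller sees the honest HTTP result.
readPage('https://example.com/app', stubHttp, stubHeadless, false)
    .then((result) => console.log(result.status));
```

With `allowHeadless` set to `false`, the `needs_headless` result is passed through unchanged, so the caller always knows which path produced the content.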
+## Local development
+
+Install dependencies:
+
+```bash
+npm install
+```
+
+Run tests with coverage:
+
+```bash
+npm test
+```
+
+Run the typecheck used as lint:
+
+```bash
+npm run lint
+```
+
+Build the project:
+
+```bash
+npm run build
+```
+
+To run the optional real-browser smoke test for headless fetch, set `PI_HEADLESS_SMOKE=1` before running Vitest. It stays skipped by default so local browser install differences do not make the normal test suite flaky.
+
+Coverage is now part of the normal `npm test` flow. Vitest prints a text summary in the terminal and writes the full HTML report to `coverage/`.
+
+### Trying it in Pi locally
+
+This repo includes a project-local Pi extension entrypoint at `.pi/extensions/pi-web-agent.ts` for development and hot reload.
+
+For the published npm package, Pi loads the compiled runtime from `dist/extension.js` via the `pi.extensions` entry in `package.json`.
+
+After starting Pi in this project, use `/reload` if you change the extension code and want Pi to pick up the latest version.
+
+## Project layout
+
+The code is split into small modules on purpose.
+
+- `src/extension.ts` - package entry surface
+- `src/tools/` - thin tool adapters
+- `src/search/` - search backend logic
+- `src/fetch/` - HTTP and headless fetch logic
+- `src/extract/` - readable-content extraction
+- `src/cache/` - small cache utilities
+- `src/types.ts` - shared contracts
+- `tests/` - parser, contract, extraction, fetch, and adapter tests
+
+## Near-term next steps
+
+The next chunk of work is pretty clear:
+
+- keep tightening weak-content escalation on tricky HTTP targets
+- improve cleanup of noisy rendered-page extraction on busy sites
+- expand fixtures and end-to-end coverage
+- add alternate search backends behind a first-class provider abstraction
+
package/dist/cache/ttl-cache.d.ts
ADDED
@@ -0,0 +1,8 @@
+export declare function createCacheKey(parts: Array<string | number | boolean>): string;
+export declare function createTtlCache<T>({ ttlMs, now }: {
+    ttlMs: number;
+    now?: () => number;
+}): {
+    get(key: string): T | undefined;
+    set(key: string, value: T): void;
+};
package/dist/cache/ttl-cache.js
ADDED
@@ -0,0 +1,21 @@
+export function createCacheKey(parts) {
+    return JSON.stringify(parts);
+}
+export function createTtlCache({ ttlMs, now = () => Date.now() }) {
+    const entries = new Map();
+    return {
+        get(key) {
+            const entry = entries.get(key);
+            if (!entry)
+                return undefined;
+            if (entry.expiresAt <= now()) {
+                entries.delete(key);
+                return undefined;
+            }
+            return entry.value;
+        },
+        set(key, value) {
+            entries.set(key, { value, expiresAt: now() + ttlMs });
+        }
+    };
+}
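The optional `now` parameter is what makes expiry testable without real timers. The sketch below restates the cache in typed form so it can run on its own (the compiled version in the hunk above remains the reference), then drives it with a hand-advanced clock. `fakeNow` and the sample key parts are illustrative values, not anything taken from the package.

```typescript
// Typed restatement of the cache from this diff, so the example is self-contained.
function createCacheKey(parts: Array<string | number | boolean>): string {
    return JSON.stringify(parts);
}

function createTtlCache<T>({ ttlMs, now = () => Date.now() }: { ttlMs: number; now?: () => number }) {
    const entries = new Map<string, { value: T; expiresAt: number }>();
    return {
        get(key: string): T | undefined {
            const entry = entries.get(key);
            if (!entry) return undefined;
            if (entry.expiresAt <= now()) {
                // Expired entries are removed lazily, on read.
                entries.delete(key);
                return undefined;
            }
            return entry.value;
        },
        set(key: string, value: T): void {
            entries.set(key, { value, expiresAt: now() + ttlMs });
        }
    };
}

// A fake clock that the example advances by hand.
let fakeNow = 1000;
const cache = createTtlCache<string>({ ttlMs: 500, now: () => fakeNow });
const key = createCacheKey(['web_search', 'pi coding agent']);

cache.set(key, 'cached result');
const beforeExpiry = cache.get(key); // still fresh
fakeNow += 500;                      // now equals expiresAt, which counts as expired (<=)
const atExpiry = cache.get(key);
console.log(key, beforeExpiry, atExpiry);
```

Note that there is no background sweep: an entry that is never read again stays in the map, which is a reasonable trade-off for a small per-process cache.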
package/dist/extension.js
ADDED
@@ -0,0 +1,114 @@
+import { Type } from '@sinclair/typebox';
+import { createWebExploreTool } from './tools/web-explore.js';
+import { createWebFetchTool } from './tools/web-fetch.js';
+import { createWebFetchHeadlessTool } from './tools/web-fetch-headless.js';
+import { createWebSearchTool } from './tools/web-search.js';
+const WEB_EXPLORE_REMINDER_TYPE = 'pi-web-agent-web-explore-reminder';
+const WEB_EXPLORE_REMINDER = 'web_explore has already been used for this research task. Only call low-level web tools if there is a specific unresolved gap. Do not keep searching or fetching just for extra confirmation.';
+function hasSuccessfulWebExplore(messages) {
+    return messages.some((message) => message.role === 'toolResult' && message.toolName === 'web_explore' && !message.isError);
+}
+function hasWebExploreReminder(messages) {
+    return messages.some((message) => message.role === 'custom' && message.customType === WEB_EXPLORE_REMINDER_TYPE);
+}
+export default function extension(pi) {
+    const webSearch = createWebSearchTool();
+    const webFetch = createWebFetchTool();
+    const webFetchHeadless = createWebFetchHeadlessTool();
+    const webExplore = createWebExploreTool();
+    pi.on('before_agent_start', async (event) => {
+        return {
+            systemPrompt: `${event.systemPrompt}\n\n` +
+                'For web research questions that require finding and comparing multiple sources, prefer web_explore. ' +
+                'Use web_search, web_fetch, and web_fetch_headless for direct/manual operations like explicit search calls, specific URL reads, or debugging. ' +
+                'After using web_explore, only call low-level web tools if there is a specific unresolved gap. ' +
+                'Do not keep searching or fetching just for extra confirmation.'
+        };
+    });
+    pi.on('context', async (event) => {
+        if (!hasSuccessfulWebExplore(event.messages) || hasWebExploreReminder(event.messages)) {
+            return { messages: event.messages };
+        }
+        return {
+            messages: [
+                ...event.messages,
+                {
+                    role: 'custom',
+                    customType: WEB_EXPLORE_REMINDER_TYPE,
+                    content: WEB_EXPLORE_REMINDER,
+                    display: false,
+                    timestamp: Date.now()
+                }
+            ]
+        };
+    });
+    pi.registerTool({
+        name: 'web_search',
+        label: 'Web Search',
+        description: 'Direct search tool for manual discovery of links and snippets. Use for explicit search requests or when the user wants raw search results. Prefer web_explore for broader research questions.',
+        parameters: Type.Object({
+            query: Type.String({ description: 'Search query.' })
+        }),
+        async execute(_toolCallId, params) {
+            const result = await webSearch({ query: params.query });
+            return {
+                content: [{ type: 'text', text: JSON.stringify(result, null, 2) }],
+                details: result,
+                isError: result.status === 'error'
+            };
+        }
+    });
+    pi.registerTool({
+        name: 'web_fetch',
+        label: 'Web Fetch',
+        description: 'Direct HTTP page fetch for a specific URL. Use when the user wants one page read directly. Prefer web_explore for broader research across multiple sources.',
+        parameters: Type.Object({
+            url: Type.String({ description: 'HTTP or HTTPS URL to fetch.' })
+        }),
+        async execute(_toolCallId, params) {
+            const result = await webFetch({ url: params.url });
+            return {
+                content: [{ type: 'text', text: JSON.stringify(result, null, 2) }],
+                details: result,
+                isError: result.status === 'error'
+            };
+        }
+    });
+    pi.registerTool({
+        name: 'web_fetch_headless',
+        label: 'Web Fetch Headless',
+        description: 'Direct headless page fetch for a specific URL when browser rendering is explicitly needed. Prefer web_explore for research tasks; it decides headless escalation internally.',
+        parameters: Type.Object({
+            url: Type.String({ description: 'HTTP or HTTPS URL to fetch in headless mode.' })
+        }),
+        async execute(_toolCallId, params) {
+            const result = await webFetchHeadless({ url: params.url });
+            return {
+                content: [{ type: 'text', text: JSON.stringify(result, null, 2) }],
+                details: result,
+                isError: result.status === 'error'
+            };
+        }
+    });
+    pi.registerTool({
+        name: 'web_explore',
+        label: 'Web Explore',
+        description: 'Research a web question using bounded search/fetch passes, source ranking, and targeted headless escalation. Prefer this for multi-source web research, current docs/discussion lookups, and recommendation summaries. Use this instead of chaining low-level web tools for the same research task.',
+        parameters: Type.Object({
+            query: Type.String({ description: 'Web research question to explore.' })
+        }),
+        async execute(_toolCallId, params) {
+            const result = await webExplore({ query: params.query });
+            return {
+                content: [
+                    {
+                        type: 'text',
+                        text: result.status === 'ok' ? result.text : JSON.stringify(result, null, 2)
+                    }
+                ],
+                details: result,
+                isError: result.status === 'error'
+            };
+        }
+    });
+}
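The `context` hook in the hunk above appends its reminder at most once: only after a successful `web_explore` result, and only if no reminder is already present. A small sketch of that guard, using plain objects in place of Pi's message types; the `Message` type and the `withReminder` helper are assumptions made for illustration, and the real event and message shapes come from Pi itself.

```typescript
// Simplified stand-in for Pi's message objects; only the fields the guard reads.
type Message = {
    role: string;
    toolName?: string;
    isError?: boolean;
    customType?: string;
    content?: string;
};

const REMINDER_TYPE = 'pi-web-agent-web-explore-reminder';

function withReminder(messages: Message[]): Message[] {
    const explored = messages.some(
        (m) => m.role === 'toolResult' && m.toolName === 'web_explore' && !m.isError
    );
    const reminded = messages.some((m) => m.role === 'custom' && m.customType === REMINDER_TYPE);
    if (!explored || reminded) {
        return messages; // nothing to add, or already added
    }
    return [...messages, { role: 'custom', customType: REMINDER_TYPE, content: 'reminder text' }];
}

const history: Message[] = [
    { role: 'user', content: 'compare these libraries' },
    { role: 'toolResult', toolName: 'web_explore', isError: false }
];
const once = withReminder(history);
const twice = withReminder(once);
console.log(history.length, once.length, twice.length); // 2 3 3
```

Running the guard a second time leaves the list unchanged, so repeated `context` events do not stack reminders.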
package/dist/extract/readability.d.ts
ADDED
@@ -0,0 +1,8 @@
+import type { ExtractedContent } from '../types.js';
+export type ReadableExtractionMode = 'readability' | 'fallback';
+export type SafeReadableExtraction = {
+    mode: ReadableExtractionMode;
+    content: ExtractedContent;
+};
+export declare function extractReadableContent(html: string, maxLength?: number): ExtractedContent;
+export declare function extractReadableContentSafely(html: string, maxLength?: number): SafeReadableExtraction;
package/dist/extract/readability.js
ADDED
@@ -0,0 +1,93 @@
+import { Readability } from '@mozilla/readability';
+import { JSDOM, VirtualConsole } from 'jsdom';
+export function extractReadableContent(html, maxLength = 4000) {
+    let stylesheetError;
+    const virtualConsole = new VirtualConsole();
+    virtualConsole.on('jsdomError', (error) => {
+        if (!stylesheetError && error.message.includes('Could not parse CSS stylesheet')) {
+            stylesheetError = error;
+        }
+    });
+    const dom = new JSDOM(html, {
+        url: 'https://example.com',
+        virtualConsole
+    });
+    if (stylesheetError) {
+        throw stylesheetError;
+    }
+    const article = new Readability(dom.window.document).parse();
+    const rawText = (article?.textContent ?? dom.window.document.body.textContent ?? '').trim();
+    const text = rawText.slice(0, maxLength);
+    const fallbackTitle = dom.window.document.title || undefined;
+    return {
+        title: article?.title ?? fallbackTitle,
+        byline: article?.byline || undefined,
+        text
+    };
+}
+function decodeHtmlEntities(text) {
+    return text
+        .replace(/&nbsp;/gi, ' ')
+        .replace(/&amp;/gi, '&')
+        .replace(/&lt;/gi, '<')
+        .replace(/&gt;/gi, '>')
+        .replace(/&quot;/gi, '"')
+        .replace(/&#39;/gi, "'")
+        .replace(/&apos;/gi, "'")
+        .replace(/&#x2F;/gi, '/')
+        .replace(/&#(\d+);/g, (_, code) => String.fromCharCode(Number(code)))
+        .replace(/&#x([\da-f]+);/gi, (_, code) => String.fromCharCode(parseInt(code, 16)));
+}
+function extractTitle(html) {
+    const match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
+    if (!match)
+        return undefined;
+    return decodeHtmlEntities(match[1].replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim()) || undefined;
+}
+function stripTagContent(html, tagName) {
+    return html.replace(new RegExp(`<${tagName}\\b[^>]*>[\\s\\S]*?<\\/${tagName}>`, 'gi'), ' ');
+}
+function extractPreferredSection(html) {
+    const mainMatch = html.match(/<main\b[^>]*>([\s\S]*?)<\/main>/i);
+    if (mainMatch)
+        return mainMatch[1];
+    const articleMatch = html.match(/<article\b[^>]*>([\s\S]*?)<\/article>/i);
+    if (articleMatch)
+        return articleMatch[1];
+    const bodyMatch = html.match(/<body\b[^>]*>([\s\S]*?)<\/body>/i);
+    if (bodyMatch)
+        return bodyMatch[1];
+    return html;
+}
+function extractFallbackText(html, maxLength) {
+    const title = extractTitle(html);
+    let section = extractPreferredSection(html);
+    section = stripTagContent(section, 'script');
+    section = stripTagContent(section, 'style');
+    section = stripTagContent(section, 'noscript');
+    section = stripTagContent(section, 'svg');
+    section = stripTagContent(section, 'template');
+    const text = decodeHtmlEntities(section)
+        .replace(/<[^>]+>/g, ' ')
+        .replace(/\s+/g, ' ')
+        .trim()
+        .slice(0, maxLength);
+    return {
+        title,
+        text
+    };
+}
+export function extractReadableContentSafely(html, maxLength = 4000) {
+    try {
+        return {
+            mode: 'readability',
+            content: extractReadableContent(html, maxLength)
+        };
+    }
+    catch {
+        return {
+            mode: 'fallback',
+            content: extractFallbackText(html, maxLength)
+        };
+    }
+}
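When `extractReadableContent` throws, `extractReadableContentSafely` falls back to regex-based extraction. The sketch below condenses that fallback path into one function to show the order of operations: pick `<main>`, `<article>`, or `<body>`; drop script-like elements; strip tags; collapse whitespace. `fallbackText` is a hypothetical name, entity decoding is left out for brevity, and the package's own code in the hunk above remains the reference.

```typescript
// Condensed, illustrative version of the regex fallback (not the package's function).
function fallbackText(html: string, maxLength = 4000): string {
    // Prefer <main>, then <article>, then <body>, else the whole document.
    let section = html;
    for (const tag of ['main', 'article', 'body']) {
        const match = html.match(new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)<\\/${tag}>`, 'i'));
        if (match) {
            section = match[1];
            break;
        }
    }
    // Remove elements whose contents are never readable text.
    for (const tag of ['script', 'style', 'noscript', 'svg', 'template']) {
        section = section.replace(new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}>`, 'gi'), ' ');
    }
    // Strip remaining tags, collapse whitespace, and cap the length.
    return section.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim().slice(0, maxLength);
}

const sample = `
<html><body>
  <nav>Home | About</nav>
  <main><h1>Title</h1><script>track()</script><p>First  paragraph.</p></main>
</body></html>`;
console.log(fallbackText(sample)); // Title First paragraph.
```

The navigation text is dropped because `<main>` wins over `<body>`; a page without any of the three containers is processed whole.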
package/dist/fetch/browser-resolution.d.ts
ADDED
@@ -0,0 +1,15 @@
+export type BrowserResolutionResult = {
+    ok: true;
+    executablePath: string;
+    browser: 'configured' | 'chrome' | 'edge';
+} | {
+    ok: false;
+    error: {
+        code: 'BROWSER_NOT_FOUND' | 'CONFIGURED_BROWSER_NOT_FOUND';
+        message: string;
+    };
+};
+export declare function resolveBrowserExecutable({ configuredPath, fileExists }: {
+    configuredPath?: string;
+    fileExists?: (path: string) => Promise<boolean>;
+}): Promise<BrowserResolutionResult>;
package/dist/fetch/browser-resolution.js
ADDED
@@ -0,0 +1,55 @@
+const WINDOWS_CANDIDATES = {
+    chrome: [
+        'C:/Program Files/Google/Chrome/Application/chrome.exe',
+        'C:/Program Files (x86)/Google/Chrome/Application/chrome.exe'
+    ],
+    edge: [
+        'C:/Program Files/Microsoft/Edge/Application/msedge.exe',
+        'C:/Program Files (x86)/Microsoft/Edge/Application/msedge.exe'
+    ]
+};
+export async function resolveBrowserExecutable({ configuredPath, fileExists = defaultFileExists }) {
+    if (configuredPath) {
+        if (await fileExists(configuredPath)) {
+            return {
+                ok: true,
+                executablePath: configuredPath,
+                browser: 'configured'
+            };
+        }
+        return {
+            ok: false,
+            error: {
+                code: 'CONFIGURED_BROWSER_NOT_FOUND',
+                message: `Configured browser path was not found: ${configuredPath}`
+            }
+        };
+    }
+    for (const path of WINDOWS_CANDIDATES.chrome) {
+        if (await fileExists(path)) {
+            return { ok: true, executablePath: path, browser: 'chrome' };
+        }
+    }
+    for (const path of WINDOWS_CANDIDATES.edge) {
+        if (await fileExists(path)) {
+            return { ok: true, executablePath: path, browser: 'edge' };
+        }
+    }
+    return {
+        ok: false,
+        error: {
+            code: 'BROWSER_NOT_FOUND',
+            message: 'No compatible local browser was found for headless fetch.'
+        }
+    };
+}
+async function defaultFileExists(path) {
+    try {
+        const { access } = await import('node:fs/promises');
+        await access(path);
+        return true;
+    }
+    catch {
+        return false;
+    }
+}
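The injectable `fileExists` makes the resolution order easy to exercise without touching the filesystem. The sketch below is a shortened, typed restatement of that order (configured path first, then Chrome, then Edge) with a stubbed existence check. `resolve`, `CANDIDATES`, and `onlyEdge` are hypothetical names, and the candidate list is trimmed to one path per browser.

```typescript
type Resolution =
    | { ok: true; executablePath: string; browser: 'configured' | 'chrome' | 'edge' }
    | { ok: false; error: { code: 'BROWSER_NOT_FOUND' | 'CONFIGURED_BROWSER_NOT_FOUND'; message: string } };

// Shortened candidate list for illustration; the hunk above checks two paths per browser.
const CANDIDATES: Array<{ browser: 'chrome' | 'edge'; path: string }> = [
    { browser: 'chrome', path: 'C:/Program Files/Google/Chrome/Application/chrome.exe' },
    { browser: 'edge', path: 'C:/Program Files (x86)/Microsoft/Edge/Application/msedge.exe' }
];

async function resolve(
    fileExists: (path: string) => Promise<boolean>,
    configuredPath?: string
): Promise<Resolution> {
    if (configuredPath) {
        // An explicit path is authoritative: if it is missing, there is no fallback.
        return (await fileExists(configuredPath))
            ? { ok: true, executablePath: configuredPath, browser: 'configured' }
            : { ok: false, error: { code: 'CONFIGURED_BROWSER_NOT_FOUND', message: `Configured browser path was not found: ${configuredPath}` } };
    }
    for (const candidate of CANDIDATES) {
        if (await fileExists(candidate.path)) {
            return { ok: true, executablePath: candidate.path, browser: candidate.browser };
        }
    }
    return { ok: false, error: { code: 'BROWSER_NOT_FOUND', message: 'No compatible local browser was found for headless fetch.' } };
}

// Pretend only Edge is installed.
const onlyEdge = async (path: string) => path.includes('msedge.exe');

resolve(onlyEdge).then((r) => console.log(r.ok ? r.browser : r.error.code));
resolve(onlyEdge, 'D:/missing/chrome.exe').then((r) => console.log(r.ok ? r.browser : r.error.code));
```

The first call resolves to Edge, while the second fails with `CONFIGURED_BROWSER_NOT_FOUND` even though Edge is available. Since the candidate list in this version contains only Windows paths, other platforms would appear to depend on a configured path.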
package/dist/fetch/headless-fetch.d.ts
ADDED
@@ -0,0 +1,18 @@
+import { type BrowserResolutionResult } from './browser-resolution.js';
+import type { WebFetchHeadlessResponse } from '../types.js';
+export declare function headlessFetch(url: string, { configuredPath, resolveBrowser, launchBrowser, now }?: {
+    configuredPath?: string;
+    resolveBrowser?: (options?: {
+        configuredPath?: string;
+    }) => Promise<BrowserResolutionResult>;
+    launchBrowser?: (options: {
+        executablePath: string;
+    }) => Promise<{
+        newContext: () => Promise<{
+            newPage: () => Promise<any>;
+            close: () => Promise<void>;
+        }>;
+        close: () => Promise<void>;
+    }>;
+    now?: () => number;
+}): Promise<WebFetchHeadlessResponse>;
package/dist/fetch/headless-fetch.js
ADDED
@@ -0,0 +1,87 @@
+import { chromium } from 'playwright-core';
+import { extractReadableContentSafely } from '../extract/readability.js';
+import { resolveBrowserExecutable } from './browser-resolution.js';
+function cleanupRenderedText(text) {
+    let cleaned = text.replace(/(Show more)(\s+\1){1,}/gi, '$1');
+    cleaned = cleaned.replace(/(Privacy Terms)(\s+\1){1,}/gi, '$1');
+    cleaned = cleaned.replace(/\s+/g, ' ').trim();
+    return cleaned;
+}
+export async function headlessFetch(url, { configuredPath, resolveBrowser = (options) => resolveBrowserExecutable({ configuredPath: options?.configuredPath }), launchBrowser = ({ executablePath }) => chromium.launch({ executablePath, headless: true }), now = () => Date.now() } = {}) {
+    const resolved = await resolveBrowser({ configuredPath });
+    if (!resolved.ok) {
+        return {
+            status: 'error',
+            url,
+            metadata: { method: 'headless', cacheHit: false },
+            error: resolved.error
+        };
+    }
+    let browser;
+    let context;
+    let page;
+    try {
+        browser = await launchBrowser({ executablePath: resolved.executablePath });
+        context = await browser.newContext();
+        page = await context.newPage();
+        const startedAt = now();
+        await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 20000 });
+        await page.waitForLoadState('load', { timeout: 10000 });
+        await page.waitForLoadState('networkidle', { timeout: 5000 }).catch(() => undefined);
+        const html = await page.content();
+        const finishedAt = now();
+        const extraction = extractReadableContentSafely(html);
+        const cleanedContent = {
+            ...extraction.content,
+            text: cleanupRenderedText(extraction.content.text)
+        };
+        if (!cleanedContent.text || cleanedContent.text.length < 40) {
+            return {
+                status: 'blocked',
+                url,
+                metadata: {
+                    method: 'headless',
+                    cacheHit: false,
+                    browser: resolved.browser,
+                    navigationMs: finishedAt - startedAt
+                },
+                error: {
+                    code: 'HEADLESS_EXTRACTION_WEAK',
+                    message: 'Rendered page did not produce enough readable content.'
+                }
+            };
+        }
+        return {
+            status: 'ok',
+            url,
+            content: cleanedContent,
+            metadata: {
+                method: 'headless',
+                cacheHit: false,
+                browser: resolved.browser,
+                navigationMs: finishedAt - startedAt,
+                truncated: cleanedContent.text.length >= 4000
+            }
+        };
+    }
+    catch (error) {
+        return {
+            status: 'error',
+            url,
+            metadata: {
+                method: 'headless',
+                cacheHit: false,
+                browser: resolved.browser
+            },
+            error: {
+                code: 'HEADLESS_NAVIGATION_FAILED',
+                message: error instanceof Error ? error.message : 'Unknown headless navigation failure.'
+            }
+        };
+    }
+    finally {
+        await page?.close?.().catch(() => undefined);
+        await context?.close?.().catch(() => undefined);
+        await browser?.close?.().catch(() => undefined);
+    }
+}
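The `cleanupRenderedText` helper in the hunk above relies on a backreference to collapse runs of repeated UI labels. A small standalone check of that behavior, restating the same regexes with a type annotation; the `noisy` input string is made up for illustration.

```typescript
// Same regexes as cleanupRenderedText in the hunk above, restated with types.
function cleanupRenderedText(text: string): string {
    // (label)(\s+\1){1,} matches the label followed by one or more repeats of itself.
    let cleaned = text.replace(/(Show more)(\s+\1){1,}/gi, '$1');
    cleaned = cleaned.replace(/(Privacy Terms)(\s+\1){1,}/gi, '$1');
    return cleaned.replace(/\s+/g, ' ').trim();
}

const noisy = 'Intro text Show more Show more Show more\n\nFooter Privacy Terms  Privacy Terms';
console.log(cleanupRenderedText(noisy));
// Intro text Show more Footer Privacy Terms
```

A single occurrence of either label is left alone; only consecutive repeats are collapsed, so legitimate page text containing "Show more" once survives.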
package/dist/fetch/http-fetch.js
ADDED
@@ -0,0 +1,50 @@
+import { extractReadableContentSafely } from '../extract/readability.js';
+function looksLikeScriptShell(html) {
+    const lower = html.toLowerCase();
+    return lower.includes('<script') && (lower.includes('id="app"') || lower.includes('id="root"'));
+}
+function isWeakHttpContent(options) {
+    const normalizedText = options.text.replace(/\s+/g, ' ').trim();
+    const normalizedHtml = options.html.replace(/\s+/g, ' ').trim();
+    const textLength = normalizedText.length;
+    const htmlLength = normalizedHtml.length;
+    const hasGenericShellMarker = /enable javascript|javascript required|please turn on javascript/i.test(options.html);
+    const veryShortBody = textLength > 0 && textLength < 120;
+    const lowDensity = htmlLength > 0 && textLength / htmlLength < 0.02;
+    return veryShortBody && (lowDensity || hasGenericShellMarker);
+}
+export function createHttpFetcher({ fetchImpl = fetch } = {}) {
+    return async function httpFetch(url) {
+        const response = await fetchImpl(url);
+        const contentType = response.headers.get('content-type') ?? '';
+        if (!contentType.includes('text/html')) {
+            return {
+                status: 'unsupported',
+                url: response.url,
+                metadata: { method: 'http', cacheHit: false, contentType }
+            };
+        }
+        const html = await response.text();
+        const extraction = extractReadableContentSafely(html);
+        const content = extraction.content;
+        if (looksLikeScriptShell(html) ||
+            content.text.length < 40 ||
+            isWeakHttpContent({ html, title: content.title, text: content.text })) {
+            return {
+                status: 'needs_headless',
+                url: response.url,
+                metadata: { method: 'http', cacheHit: false, contentType },
+                error: {
+                    code: 'WEAK_EXTRACTION',
+                    message: 'HTTP extraction was not reliable enough.'
+                }
+            };
+        }
+        return {
+            status: 'ok',
+            url: response.url,
+            content,
+            metadata: { method: 'http', cacheHit: false, contentType, truncated: content.text.length >= 4000 }
+        };
+    };
+}
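The weak-content check in the hunk above combines a length test with either a text-to-markup density test or a "JavaScript required" marker. The sketch below restates `isWeakHttpContent` with explicit types and runs it on two made-up inputs (`shell` and `shortButDense` are illustrative, not fixtures from the package).

```typescript
// Restated from isWeakHttpContent above, with explicit types.
function isWeakHttpContent({ html, text }: { html: string; text: string }): boolean {
    const textLength = text.replace(/\s+/g, ' ').trim().length;
    const htmlLength = html.replace(/\s+/g, ' ').trim().length;
    const hasShellMarker = /enable javascript|javascript required|please turn on javascript/i.test(html);
    const veryShortBody = textLength > 0 && textLength < 120;
    const lowDensity = htmlLength > 0 && textLength / htmlLength < 0.02;
    // Short text alone is not enough: it must also be sparse or carry a JS-required marker.
    return veryShortBody && (lowDensity || hasShellMarker);
}

const shell = '<html><body><noscript>Please enable JavaScript to use this site.</noscript></body></html>';
const shortButDense = '<p>A short but complete note that is mostly text, not markup.</p>';

console.log(isWeakHttpContent({ html: shell, text: 'Please enable JavaScript to use this site.' })); // true
console.log(isWeakHttpContent({ html: shortButDense, text: 'A short but complete note that is mostly text, not markup.' })); // false
```

In the fetcher itself this check is one of three triggers for `needs_headless`, alongside `looksLikeScriptShell` and the flat `content.text.length < 40` cutoff, so a page can still be escalated even when this function returns `false`.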