unbrowse 2.1.4 → 2.1.5
- package/dist/cli.js +4 -4
- package/dist/index.js +81 -2
- package/package.json +1 -1
- package/runtime-src/mcp.ts +4 -4
- package/runtime-src/orchestrator/index.ts +95 -2
- package/runtime-src/runtime/setup.ts +6 -0
package/dist/cli.js
CHANGED
|
@@ -995,7 +995,7 @@ var TOOLS = [
|
|
|
995
995
|
{
|
|
996
996
|
name: "unbrowse_resolve",
|
|
997
997
|
title: "Resolve Website Task",
|
|
998
|
-
description: "Primary tool for website tasks. Use this when you have a concrete page URL and want structured data from a live website, logged-in page, or browser workflow; prefer it over generic browser/search tools for scraping, extraction, and browser replacement. Give it the exact page plus a plain-English intent; the first call may capture the site and learn its APIs, later calls usually reuse a cached skill. Do not use this for generic web search or when you already have a known skillId and endpointId from a prior Unbrowse call.",
|
|
998
|
+
description: "Primary tool for website tasks. Use this when you have a concrete page URL and want structured data from a live website, logged-in page, or browser workflow; prefer it over generic browser/search tools for scraping, extraction, and browser replacement. Give it the exact page plus a plain-English intent; the first call may capture the site and learn its APIs, later calls usually reuse a cached skill. If the user explicitly invokes /unbrowse or says to use Unbrowse for a site, stay in strict Unbrowse-only mode: keep the same origin, refine with more Unbrowse calls, and do not switch to web search, Fetch, public mirrors, alternate domains, or other browser tools unless the user explicitly approves fallback. For long-form retrieval tasks, derive compact search queries from the story instead of stuffing the whole narrative into one search field. Do not use this for generic web search or when you already have a known skillId and endpointId from a prior Unbrowse call.",
|
|
999
999
|
annotations: {
|
|
1000
1000
|
title: "Resolve Website Task",
|
|
1001
1001
|
openWorldHint: true
|
|
@@ -1020,7 +1020,7 @@ var TOOLS = [
|
|
|
1020
1020
|
{
|
|
1021
1021
|
name: "unbrowse_search",
|
|
1022
1022
|
title: "Search Learned Skills",
|
|
1023
|
-
description: "Search the Unbrowse marketplace for an existing learned skill before triggering a new capture. Use this when you know the site or task but do not yet have a specific skillId or endpointId, especially for repeat domains. Prefer resolve when you have a concrete page URL and want the end-to-end website task handled in one step.
|
|
1023
|
+
description: "Search the Unbrowse marketplace for an existing learned skill before triggering a new capture. Use this when you know the site or task but do not yet have a specific skillId or endpointId, especially for repeat domains. Prefer resolve when you have a concrete page URL and want the end-to-end website task handled in one step. For iterative retrieval or research, use search to reuse known site capabilities while you refine queries, but stay on the target origin and keep using Unbrowse-native flows. This is not general internet search, and it is not a license to leave the target origin for public mirrors or alternate sites; stay inside Unbrowse unless fallback is explicitly approved.",
|
|
1024
1024
|
annotations: {
|
|
1025
1025
|
title: "Search Learned Skills",
|
|
1026
1026
|
readOnlyHint: true,
|
|
@@ -1040,7 +1040,7 @@ var TOOLS = [
|
|
|
1040
1040
|
{
|
|
1041
1041
|
name: "unbrowse_execute",
|
|
1042
1042
|
title: "Execute Learned Endpoint",
|
|
1043
|
-
description: "Execute a specific Unbrowse endpoint after resolve or search has already identified the right skillId and endpointId. Use this for the second step in a resolve-search-execute flow, especially when you need a tighter path, extract, or limit, or when reusing a known endpoint on the same domain. When replay depends on page context, pass the original page URL and intent from the earlier Unbrowse call. Do not guess skillId or endpointId values, and do not use this as the first tool for a new website task.",
|
|
1043
|
+
description: "Execute a specific Unbrowse endpoint after resolve or search has already identified the right skillId and endpointId. Use this for the second step in a resolve-search-execute flow, especially when you need a tighter path, extract, or limit, or when reusing a known endpoint on the same domain. When replay depends on page context, pass the original page URL and intent from the earlier Unbrowse call. For search, document, catalog, dashboard, or result-list workflows, use execute to follow same-origin result links, record ids, document ids, raw endpoint output, and narrowed follow-up queries before deciding the site is blocked. Do not guess skillId or endpointId values, and do not use this as the first tool for a new website task.",
|
|
1044
1044
|
annotations: {
|
|
1045
1045
|
title: "Execute Learned Endpoint",
|
|
1046
1046
|
openWorldHint: true
|
|
@@ -1067,7 +1067,7 @@ var TOOLS = [
|
|
|
1067
1067
|
{
|
|
1068
1068
|
name: "unbrowse_login",
|
|
1069
1069
|
title: "Capture Site Login",
|
|
1070
|
-
description: "Open an interactive browser login flow for a gated site so later Unbrowse calls can reuse the captured auth state. Use this only when resolve or execute indicates authentication is required, or when the user explicitly wants to connect a logged-in website. Do not use this for ordinary public pages.",
|
|
1070
|
+
description: "Open an interactive browser login flow for a gated site so later Unbrowse calls can reuse the captured auth state. Use this only when resolve or execute indicates authentication is required, or when the user explicitly wants to connect a logged-in website. Login should target the exact page or workflow surface the user cares about, then later Unbrowse calls should retry that same URL instead of drifting to the homepage, marketing pages, help pages, public mirrors, or alternate domains. Do not use this for ordinary public pages.",
|
|
1071
1071
|
annotations: {
|
|
1072
1072
|
title: "Capture Site Login",
|
|
1073
1073
|
openWorldHint: true
|
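The updated tool descriptions repeatedly ask the caller to stay on the same origin and to treat mirrors or alternate domains as fallback. As a minimal sketch of how a client could test that condition, the block below compares origins with the standard `URL` API. `isSameOrigin` is a hypothetical helper for illustration only; nothing in this diff shows Unbrowse enforcing the rule in code.

```typescript
// Hypothetical helper, not part of the unbrowse package: returns true only
// when both URLs parse and share scheme, host, and port.
function isSameOrigin(targetUrl: string, candidateUrl: string): boolean {
  try {
    return new URL(targetUrl).origin === new URL(candidateUrl).origin;
  } catch {
    // Unparseable input is treated as "not the same origin".
    return false;
  }
}

const target = "https://example.com/cases/search";
console.log(isSameOrigin(target, "https://example.com/cases/123")); // true
console.log(isSameOrigin(target, "https://mirror.example.org/cases/123")); // false
```

Comparing `origin` rather than hostname also treats a different scheme or port as a different surface, which is the stricter reading of "keep the same origin".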
package/dist/index.js
CHANGED
|
@@ -14883,6 +14883,7 @@ var SEARCH_INTENT_STOPWORDS = new Set([
|
|
|
14883
14883
|
var SEARCH_DIRECTIVE_PREFIX = /^(search\s+for|search|find\s+me|find|look\s+for|looking\s+for|show\s+me|show|get\s+me|get|browse|discover|shop\s+for|buy)\s+/i;
|
|
14884
14884
|
var SEARCH_TRAILING_SITE_HINT = /\s+(on|at|from|in|via)\s+\S+$/i;
|
|
14885
14885
|
var SEARCH_INSTRUCTION_NOISE = /\b(do not|don't|dont|tell me|let me know|extremely thoroughly|thoroughly|random cases|for the sake of it|if there is no such|if none exists|if no such)\b/i;
|
|
14886
|
+
var SEARCH_PRIORITY_PATTERN = /\b(high|court|appeal|leave|adduce|evidence|assessment|damages?|tranche|tranches|started|late|stage|hearing|trial|mediation|case|cases|allow|allowed)\b/;
|
|
14886
14887
|
function isLikelySearchParam(urlTemplate, param) {
|
|
14887
14888
|
const lowerParam = param.toLowerCase();
|
|
14888
14889
|
if (/(^q$|^k$|basicsearchkey|basic_search_key|query|keyword|keywords|search|lookup|find|term|phrase|querystr|query_string)/.test(lowerParam)) {
|
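The hunk above hoists the priority pattern into a module-level constant so that more than one function can call `.test()` on it. That is safe because the pattern carries no `g` or `y` flag, so `.test()` keeps no `lastIndex` state between calls. The sketch below uses a shortened, illustrative word list (an assumption, not the real pattern) to contrast the two cases.

```typescript
// Shortened stand-in for the priority pattern; the word list is illustrative.
const PRIORITY = /\b(court|appeal|evidence)\b/;
// Same pattern with the global flag, shown only for contrast.
const PRIORITY_GLOBAL = /\b(court|appeal|evidence)\b/g;

const tokens = ["court", "court", "court"];

// Without `g`, every call starts matching at index 0.
const stateless = tokens.map((t) => PRIORITY.test(t));

// With `g`, a successful match advances lastIndex, so the next call on an
// identical string starts past the match, fails, and resets lastIndex.
const stateful = tokens.map((t) => PRIORITY_GLOBAL.test(t));

console.log(stateless); // [true, true, true]
console.log(stateful); // [true, false, true]
```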
|
@@ -14981,16 +14982,94 @@ function selectSearchTermsForExecution(intent) {
|
|
|
14981
14982
|
return literal;
|
|
14982
14983
|
if (!hasSentencePunctuation && !tooLongForSingleField)
|
|
14983
14984
|
return literal;
|
|
14985
|
+
if (tooLongForSingleField) {
|
|
14986
|
+
const compactPhraseQuery = buildCompactPhraseSearchQuery(intent);
|
|
14987
|
+
if (compactPhraseQuery)
|
|
14988
|
+
return compactPhraseQuery;
|
|
14989
|
+
}
|
|
14984
14990
|
return condensed;
|
|
14985
14991
|
}
|
|
14992
|
+
function buildCompactPhraseSearchQuery(intent) {
|
|
14993
|
+
const stripped = stripSearchIntentBoilerplate(intent);
|
|
14994
|
+
if (!stripped)
|
|
14995
|
+
return null;
|
|
14996
|
+
const sourceText = extractLiteralSearchTermsFromIntent(intent) ?? stripped;
|
|
14997
|
+
const clauses = sourceText.split(/(?<=[.!?])\s+|\n+/).map((clause) => clause.trim()).filter(Boolean);
|
|
14998
|
+
const phraseScores = new Map;
|
|
14999
|
+
const remember = (rawPhrase, score, clauseIndex) => {
|
|
15000
|
+
const phrase = rawPhrase.toLowerCase().replace(/[^a-z0-9\s/-]+/g, " ").replace(/\s+/g, " ").trim();
|
|
15001
|
+
if (!phrase)
|
|
15002
|
+
return;
|
|
15003
|
+
const words = phrase.split(/\s+/).filter(Boolean);
|
|
15004
|
+
const contentWords = words.filter((word) => !SEARCH_INTENT_STOPWORDS.has(word));
|
|
15005
|
+
if (contentWords.length < 2)
|
|
15006
|
+
return;
|
|
15007
|
+
if (!contentWords.some((word) => SEARCH_PRIORITY_PATTERN.test(word)))
|
|
15008
|
+
return;
|
|
15009
|
+
if (words.length > 8)
|
|
15010
|
+
return;
|
|
15011
|
+
if (SEARCH_INSTRUCTION_NOISE.test(phrase))
|
|
15012
|
+
return;
|
|
15013
|
+
const priorityHits = contentWords.filter((word) => SEARCH_PRIORITY_PATTERN.test(word)).length;
|
|
15014
|
+
const proceduralHits = contentWords.filter((word) => /^(started|tranche|tranches|allow|allowed)$/.test(word)).length;
|
|
15015
|
+
const startsBadly = /^(eg|\d)$/.test(words[0] ?? "") || /^\d+$/.test(words[0] ?? "");
|
|
15016
|
+
const endsBadly = /^(eg|\d)$/.test(words[words.length - 1] ?? "") || /^\d+$/.test(words[words.length - 1] ?? "");
|
|
15017
|
+
const connectorHits = words.filter((word) => ["of", "to", "for", "at", "after"].includes(word)).length;
|
|
15018
|
+
if (/\b(such|none|random)\b/.test(phrase))
|
|
15019
|
+
return;
|
|
15020
|
+
const boostedScore = score + Math.min(contentWords.length, 4) + priorityHits * 3 + proceduralHits * 4 + connectorHits + (words.length >= 3 && words.length <= 5 ? 2 : 0) + (/\d/.test(phrase) ? 2 : 0) - (startsBadly ? 4 : 0) - (endsBadly ? 4 : 0) - (/\beg\b/.test(phrase) ? 6 : 0);
|
|
15021
|
+
const existing = phraseScores.get(phrase);
|
|
15022
|
+
if (!existing || boostedScore > existing.score)
|
|
15023
|
+
phraseScores.set(phrase, { score: boostedScore, clauseIndex });
|
|
15024
|
+
};
|
|
15025
|
+
for (const [clauseIndex, clause] of clauses.entries()) {
|
|
15026
|
+
for (const match of clause.matchAll(/["“”']([^"“”']{3,80})["“”']/g)) {
|
|
15027
|
+
remember(match[1], 12, clauseIndex);
|
|
15028
|
+
}
|
|
15029
|
+
}
|
|
15030
|
+
for (const [clauseIndex, clause] of clauses.entries()) {
|
|
15031
|
+
for (const match of clause.matchAll(/\b[a-z0-9-]+(?:\s+(?:of|to|for|at|after)\s+[a-z0-9-]+){1,4}\b/gi)) {
|
|
15032
|
+
remember(match[0], 14, clauseIndex);
|
|
15033
|
+
}
|
|
15034
|
+
const tokens = clause.toLowerCase().replace(/[^a-z0-9\s/-]+/g, " ").split(/\s+/).filter(Boolean);
|
|
15035
|
+
for (let start2 = 0;start2 < tokens.length; start2++) {
|
|
15036
|
+
for (let size = 2;size <= 6 && start2 + size <= tokens.length; size++) {
|
|
15037
|
+
const slice = tokens.slice(start2, start2 + size);
|
|
15038
|
+
if (SEARCH_INTENT_STOPWORDS.has(slice[0]) || SEARCH_INTENT_STOPWORDS.has(slice[slice.length - 1]))
|
|
15039
|
+
continue;
|
|
15040
|
+
remember(slice.join(" "), 6 - Math.abs(size - 4), clauseIndex);
|
|
15041
|
+
}
|
|
15042
|
+
}
|
|
15043
|
+
}
|
|
15044
|
+
const selected = [];
|
|
15045
|
+
const selectedRaw = [];
|
|
15046
|
+
let currentLength = 0;
|
|
15047
|
+
const clauseCounts = new Map;
|
|
15048
|
+
for (const [phrase, meta] of Array.from(phraseScores.entries()).sort((a, b) => b[1].score - a[1].score || a[0].length - b[0].length)) {
|
|
15049
|
+
if (selectedRaw.some((chosen) => chosen.includes(phrase) || phrase.includes(chosen)))
|
|
15050
|
+
continue;
|
|
15051
|
+
if ((clauseCounts.get(meta.clauseIndex) ?? 0) >= 2)
|
|
15052
|
+
continue;
|
|
15053
|
+
const rendered = `"${phrase}"`;
|
|
15054
|
+
const nextLength = currentLength === 0 ? rendered.length : currentLength + 1 + rendered.length;
|
|
15055
|
+
if (nextLength > 140)
|
|
15056
|
+
continue;
|
|
15057
|
+
selected.push(rendered);
|
|
15058
|
+
selectedRaw.push(phrase);
|
|
15059
|
+
clauseCounts.set(meta.clauseIndex, (clauseCounts.get(meta.clauseIndex) ?? 0) + 1);
|
|
15060
|
+
currentLength = nextLength;
|
|
15061
|
+
if (selected.length >= 4)
|
|
15062
|
+
break;
|
|
15063
|
+
}
|
|
15064
|
+
return selected.length > 0 ? selected.join(" ") : null;
|
|
15065
|
+
}
|
|
14986
15066
|
function condenseSearchIntent(intent) {
|
|
14987
15067
|
const wantsSearchAction = /\b(search|find|lookup|look\s+for|browse|discover)\b/i.test(intent);
|
|
14988
|
-
const priorityPattern = /\b(high|court|appeal|leave|adduce|evidence|assessment|damages?|tranche|tranches|started|late|stage|hearing|trial|mediation|case|cases)\b/;
|
|
14989
15068
|
const tokens = intent.toLowerCase().replace(/[^a-z0-9\][\-/]+/g, " ").split(/\s+/).map((token) => token.trim()).filter((token) => token.length >= 3 && !SEARCH_INTENT_STOPWORDS.has(token));
|
|
14990
15069
|
const scored = new Map;
|
|
14991
15070
|
tokens.forEach((token, index) => {
|
|
14992
15071
|
let score = 0;
|
|
14993
|
-
if (
|
|
15072
|
+
if (SEARCH_PRIORITY_PATTERN.test(token))
|
|
14994
15073
|
score += 10;
|
|
14995
15074
|
if (token.length >= 8)
|
|
14996
15075
|
score += 2;
|
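To make the `boostedScore` arithmetic in `buildCompactPhraseSearchQuery` easier to follow, here is a reduced sketch of the scoring expression alone. It assumes a small stand-in stopword set, because the contents of `SEARCH_INTENT_STOPWORDS` are not shown in this diff, and it leaves out the early-return filters (fewer than two content words, more than eight words, instruction noise). `boostedScore` as a standalone function is a hypothetical name; the start/end checks are condensed into one regex with equivalent behavior.

```typescript
// Copied from the diff above.
const SEARCH_PRIORITY_PATTERN =
  /\b(high|court|appeal|leave|adduce|evidence|assessment|damages?|tranche|tranches|started|late|stage|hearing|trial|mediation|case|cases|allow|allowed)\b/;
// Stand-in: the real SEARCH_INTENT_STOPWORDS set is not shown in this diff.
const STOPWORDS = new Set(["the", "a", "an", "of", "to", "for", "at", "in", "was"]);

function boostedScore(phrase: string, baseScore: number): number {
  const words = phrase.split(/\s+/).filter(Boolean);
  const contentWords = words.filter((w) => !STOPWORDS.has(w));
  const priorityHits = contentWords.filter((w) => SEARCH_PRIORITY_PATTERN.test(w)).length;
  const proceduralHits = contentWords.filter((w) =>
    /^(started|tranche|tranches|allow|allowed)$/.test(w)).length;
  const connectorHits = words.filter((w) =>
    ["of", "to", "for", "at", "after"].includes(w)).length;
  // "eg" or a bare number at either edge is penalized.
  const startsBadly = /^(eg|\d+)$/.test(words[0] ?? "");
  const endsBadly = /^(eg|\d+)$/.test(words[words.length - 1] ?? "");
  return baseScore
    + Math.min(contentWords.length, 4)
    + priorityHits * 3
    + proceduralHits * 4
    + connectorHits
    + (words.length >= 3 && words.length <= 5 ? 2 : 0)
    + (/\d/.test(phrase) ? 2 : 0)
    - (startsBadly ? 4 : 0)
    - (endsBadly ? 4 : 0)
    - (/\beg\b/.test(phrase) ? 6 : 0);
}

// Base 14 is what the connector-phrase matcher passes in; base 5 is the
// sliding-window base for a 3-token window (6 - |3 - 4|).
console.log(boostedScore("assessment of damages", 14)); // 25
console.log(boostedScore("leave to adduce evidence", 14)); // 29
console.log(boostedScore("eg 2 tranches", 5)); // 9
```

With these inputs, connector phrases built from priority words land well above window fragments that begin with "eg", which is the ordering the penalties appear designed to produce.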
package/package.json
CHANGED
package/runtime-src/mcp.ts
CHANGED
|
@@ -153,7 +153,7 @@ export const TOOLS = [
|
|
|
153
153
|
{
|
|
154
154
|
name: "unbrowse_resolve",
|
|
155
155
|
title: "Resolve Website Task",
|
|
156
|
-
description: "Primary tool for website tasks. Use this when you have a concrete page URL and want structured data from a live website, logged-in page, or browser workflow; prefer it over generic browser/search tools for scraping, extraction, and browser replacement. Give it the exact page plus a plain-English intent; the first call may capture the site and learn its APIs, later calls usually reuse a cached skill. Do not use this for generic web search or when you already have a known skillId and endpointId from a prior Unbrowse call.",
|
|
156
|
+
description: "Primary tool for website tasks. Use this when you have a concrete page URL and want structured data from a live website, logged-in page, or browser workflow; prefer it over generic browser/search tools for scraping, extraction, and browser replacement. Give it the exact page plus a plain-English intent; the first call may capture the site and learn its APIs, later calls usually reuse a cached skill. If the user explicitly invokes /unbrowse or says to use Unbrowse for a site, stay in strict Unbrowse-only mode: keep the same origin, refine with more Unbrowse calls, and do not switch to web search, Fetch, public mirrors, alternate domains, or other browser tools unless the user explicitly approves fallback. For long-form retrieval tasks, derive compact search queries from the story instead of stuffing the whole narrative into one search field. Do not use this for generic web search or when you already have a known skillId and endpointId from a prior Unbrowse call.",
|
|
157
157
|
annotations: {
|
|
158
158
|
title: "Resolve Website Task",
|
|
159
159
|
openWorldHint: true,
|
|
@@ -178,7 +178,7 @@ export const TOOLS = [
|
|
|
178
178
|
{
|
|
179
179
|
name: "unbrowse_search",
|
|
180
180
|
title: "Search Learned Skills",
|
|
181
|
-
description: "Search the Unbrowse marketplace for an existing learned skill before triggering a new capture. Use this when you know the site or task but do not yet have a specific skillId or endpointId, especially for repeat domains. Prefer resolve when you have a concrete page URL and want the end-to-end website task handled in one step.
|
|
181
|
+
description: "Search the Unbrowse marketplace for an existing learned skill before triggering a new capture. Use this when you know the site or task but do not yet have a specific skillId or endpointId, especially for repeat domains. Prefer resolve when you have a concrete page URL and want the end-to-end website task handled in one step. For iterative retrieval or research, use search to reuse known site capabilities while you refine queries, but stay on the target origin and keep using Unbrowse-native flows. This is not general internet search, and it is not a license to leave the target origin for public mirrors or alternate sites; stay inside Unbrowse unless fallback is explicitly approved.",
|
|
182
182
|
annotations: {
|
|
183
183
|
title: "Search Learned Skills",
|
|
184
184
|
readOnlyHint: true,
|
|
@@ -198,7 +198,7 @@ export const TOOLS = [
|
|
|
198
198
|
{
|
|
199
199
|
name: "unbrowse_execute",
|
|
200
200
|
title: "Execute Learned Endpoint",
|
|
201
|
-
description: "Execute a specific Unbrowse endpoint after resolve or search has already identified the right skillId and endpointId. Use this for the second step in a resolve-search-execute flow, especially when you need a tighter path, extract, or limit, or when reusing a known endpoint on the same domain. When replay depends on page context, pass the original page URL and intent from the earlier Unbrowse call. Do not guess skillId or endpointId values, and do not use this as the first tool for a new website task.",
|
|
201
|
+
description: "Execute a specific Unbrowse endpoint after resolve or search has already identified the right skillId and endpointId. Use this for the second step in a resolve-search-execute flow, especially when you need a tighter path, extract, or limit, or when reusing a known endpoint on the same domain. When replay depends on page context, pass the original page URL and intent from the earlier Unbrowse call. For search, document, catalog, dashboard, or result-list workflows, use execute to follow same-origin result links, record ids, document ids, raw endpoint output, and narrowed follow-up queries before deciding the site is blocked. Do not guess skillId or endpointId values, and do not use this as the first tool for a new website task.",
|
|
202
202
|
annotations: {
|
|
203
203
|
title: "Execute Learned Endpoint",
|
|
204
204
|
openWorldHint: true,
|
|
@@ -225,7 +225,7 @@ export const TOOLS = [
|
|
|
225
225
|
{
|
|
226
226
|
name: "unbrowse_login",
|
|
227
227
|
title: "Capture Site Login",
|
|
228
|
-
description: "Open an interactive browser login flow for a gated site so later Unbrowse calls can reuse the captured auth state. Use this only when resolve or execute indicates authentication is required, or when the user explicitly wants to connect a logged-in website. Do not use this for ordinary public pages.",
|
|
228
|
+
description: "Open an interactive browser login flow for a gated site so later Unbrowse calls can reuse the captured auth state. Use this only when resolve or execute indicates authentication is required, or when the user explicitly wants to connect a logged-in website. Login should target the exact page or workflow surface the user cares about, then later Unbrowse calls should retry that same URL instead of drifting to the homepage, marketing pages, help pages, public mirrors, or alternate domains. Do not use this for ordinary public pages.",
|
|
229
229
|
annotations: {
|
|
230
230
|
title: "Capture Site Login",
|
|
231
231
|
openWorldHint: true,
|
package/runtime-src/orchestrator/index.ts
CHANGED
|
@@ -990,6 +990,8 @@ const SEARCH_DIRECTIVE_PREFIX =
|
|
|
990
990
|
const SEARCH_TRAILING_SITE_HINT = /\s+(on|at|from|in|via)\s+\S+$/i;
|
|
991
991
|
const SEARCH_INSTRUCTION_NOISE =
|
|
992
992
|
/\b(do not|don't|dont|tell me|let me know|extremely thoroughly|thoroughly|random cases|for the sake of it|if there is no such|if none exists|if no such)\b/i;
|
|
993
|
+
const SEARCH_PRIORITY_PATTERN =
|
|
994
|
+
/\b(high|court|appeal|leave|adduce|evidence|assessment|damages?|tranche|tranches|started|late|stage|hearing|trial|mediation|case|cases|allow|allowed)\b/;
|
|
993
995
|
|
|
994
996
|
function isLikelySearchParam(
|
|
995
997
|
urlTemplate: string,
|
|
@@ -1109,12 +1111,103 @@ export function selectSearchTermsForExecution(intent: string): string | null {
|
|
|
1109
1111
|
const tooLongForSingleField = literal.length > 180 || wordCount > 24;
|
|
1110
1112
|
if (hasQuotedPhrase && !tooLongForSingleField) return literal;
|
|
1111
1113
|
if (!hasSentencePunctuation && !tooLongForSingleField) return literal;
|
|
1114
|
+
if (tooLongForSingleField) {
|
|
1115
|
+
const compactPhraseQuery = buildCompactPhraseSearchQuery(intent);
|
|
1116
|
+
if (compactPhraseQuery) return compactPhraseQuery;
|
|
1117
|
+
}
|
|
1112
1118
|
return condensed;
|
|
1113
1119
|
}
|
|
1114
1120
|
|
|
1121
|
+
function buildCompactPhraseSearchQuery(intent: string): string | null {
|
|
1122
|
+
const stripped = stripSearchIntentBoilerplate(intent);
|
|
1123
|
+
if (!stripped) return null;
|
|
1124
|
+
const sourceText = extractLiteralSearchTermsFromIntent(intent) ?? stripped;
|
|
1125
|
+
const clauses = sourceText
|
|
1126
|
+
.split(/(?<=[.!?])\s+|\n+/)
|
|
1127
|
+
.map((clause) => clause.trim())
|
|
1128
|
+
.filter(Boolean);
|
|
1129
|
+
const phraseScores = new Map<string, { score: number; clauseIndex: number }>();
|
|
1130
|
+
const remember = (rawPhrase: string, score: number, clauseIndex: number) => {
|
|
1131
|
+
const phrase = rawPhrase
|
|
1132
|
+
.toLowerCase()
|
|
1133
|
+
.replace(/[^a-z0-9\s/-]+/g, " ")
|
|
1134
|
+
.replace(/\s+/g, " ")
|
|
1135
|
+
.trim();
|
|
1136
|
+
if (!phrase) return;
|
|
1137
|
+
const words = phrase.split(/\s+/).filter(Boolean);
|
|
1138
|
+
const contentWords = words.filter((word) => !SEARCH_INTENT_STOPWORDS.has(word));
|
|
1139
|
+
if (contentWords.length < 2) return;
|
|
1140
|
+
if (!contentWords.some((word) => SEARCH_PRIORITY_PATTERN.test(word))) return;
|
|
1141
|
+
if (words.length > 8) return;
|
|
1142
|
+
if (SEARCH_INSTRUCTION_NOISE.test(phrase)) return;
|
|
1143
|
+
const priorityHits = contentWords.filter((word) => SEARCH_PRIORITY_PATTERN.test(word)).length;
|
|
1144
|
+
const proceduralHits = contentWords.filter((word) => /^(started|tranche|tranches|allow|allowed)$/.test(word)).length;
|
|
1145
|
+
const startsBadly = /^(eg|\d)$/.test(words[0] ?? "") || /^\d+$/.test(words[0] ?? "");
|
|
1146
|
+
const endsBadly = /^(eg|\d)$/.test(words[words.length - 1] ?? "") || /^\d+$/.test(words[words.length - 1] ?? "");
|
|
1147
|
+
const connectorHits = words.filter((word) => ["of", "to", "for", "at", "after"].includes(word)).length;
|
|
1148
|
+
if (/\b(such|none|random)\b/.test(phrase)) return;
|
|
1149
|
+
const boostedScore =
|
|
1150
|
+
score
|
|
1151
|
+
+ Math.min(contentWords.length, 4)
|
|
1152
|
+
+ priorityHits * 3
|
|
1153
|
+
+ proceduralHits * 4
|
|
1154
|
+
+ connectorHits
|
|
1155
|
+
+ (words.length >= 3 && words.length <= 5 ? 2 : 0)
|
|
1156
|
+
+ (/\d/.test(phrase) ? 2 : 0)
|
|
1157
|
+
- (startsBadly ? 4 : 0)
|
|
1158
|
+
- (endsBadly ? 4 : 0)
|
|
1159
|
+
- (/\beg\b/.test(phrase) ? 6 : 0);
|
|
1160
|
+
const existing = phraseScores.get(phrase);
|
|
1161
|
+
if (!existing || boostedScore > existing.score) phraseScores.set(phrase, { score: boostedScore, clauseIndex });
|
|
1162
|
+
};
|
|
1163
|
+
|
|
1164
|
+
for (const [clauseIndex, clause] of clauses.entries()) {
|
|
1165
|
+
for (const match of clause.matchAll(/["“”']([^"“”']{3,80})["“”']/g)) {
|
|
1166
|
+
remember(match[1], 12, clauseIndex);
|
|
1167
|
+
}
|
|
1168
|
+
}
|
|
1169
|
+
|
|
1170
|
+
for (const [clauseIndex, clause] of clauses.entries()) {
|
|
1171
|
+
for (const match of clause.matchAll(/\b[a-z0-9-]+(?:\s+(?:of|to|for|at|after)\s+[a-z0-9-]+){1,4}\b/gi)) {
|
|
1172
|
+
remember(match[0], 14, clauseIndex);
|
|
1173
|
+
}
|
|
1174
|
+
const tokens = clause
|
|
1175
|
+
.toLowerCase()
|
|
1176
|
+
.replace(/[^a-z0-9\s/-]+/g, " ")
|
|
1177
|
+
.split(/\s+/)
|
|
1178
|
+
.filter(Boolean);
|
|
1179
|
+
for (let start = 0; start < tokens.length; start++) {
|
|
1180
|
+
for (let size = 2; size <= 6 && start + size <= tokens.length; size++) {
|
|
1181
|
+
const slice = tokens.slice(start, start + size);
|
|
1182
|
+
if (SEARCH_INTENT_STOPWORDS.has(slice[0]) || SEARCH_INTENT_STOPWORDS.has(slice[slice.length - 1])) continue;
|
|
1183
|
+
remember(slice.join(" "), 6 - Math.abs(size - 4), clauseIndex);
|
|
1184
|
+
}
|
|
1185
|
+
}
|
|
1186
|
+
}
|
|
1187
|
+
|
|
1188
|
+
const selected: string[] = [];
|
|
1189
|
+
const selectedRaw: string[] = [];
|
|
1190
|
+
let currentLength = 0;
|
|
1191
|
+
const clauseCounts = new Map<number, number>();
|
|
1192
|
+
for (const [phrase, meta] of Array.from(phraseScores.entries())
|
|
1193
|
+
.sort((a, b) => b[1].score - a[1].score || a[0].length - b[0].length)) {
|
|
1194
|
+
if (selectedRaw.some((chosen) => chosen.includes(phrase) || phrase.includes(chosen))) continue;
|
|
1195
|
+
if ((clauseCounts.get(meta.clauseIndex) ?? 0) >= 2) continue;
|
|
1196
|
+
const rendered = `"${phrase}"`;
|
|
1197
|
+
const nextLength = currentLength === 0 ? rendered.length : currentLength + 1 + rendered.length;
|
|
1198
|
+
if (nextLength > 140) continue;
|
|
1199
|
+
selected.push(rendered);
|
|
1200
|
+
selectedRaw.push(phrase);
|
|
1201
|
+
clauseCounts.set(meta.clauseIndex, (clauseCounts.get(meta.clauseIndex) ?? 0) + 1);
|
|
1202
|
+
currentLength = nextLength;
|
|
1203
|
+
if (selected.length >= 4) break;
|
|
1204
|
+
}
|
|
1205
|
+
|
|
1206
|
+
return selected.length > 0 ? selected.join(" ") : null;
|
|
1207
|
+
}
|
|
1208
|
+
|
|
1115
1209
|
function condenseSearchIntent(intent: string): string | null {
|
|
1116
1210
|
const wantsSearchAction = /\b(search|find|lookup|look\s+for|browse|discover)\b/i.test(intent);
|
|
1117
|
-
const priorityPattern = /\b(high|court|appeal|leave|adduce|evidence|assessment|damages?|tranche|tranches|started|late|stage|hearing|trial|mediation|case|cases)\b/;
|
|
1118
1211
|
const tokens = intent
|
|
1119
1212
|
.toLowerCase()
|
|
1120
1213
|
.replace(/[^a-z0-9\][\-/]+/g, " ")
|
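The sliding-window part of `buildCompactPhraseSearchQuery` in the hunk above can be read in isolation: it enumerates 2- to 6-token windows per clause, skips windows that begin or end on a stopword, and gives each a base score of `6 - |size - 4|`, so 4-token windows start highest. The sketch below reproduces only that enumeration. The stopword set is a stand-in (the real `SEARCH_INTENT_STOPWORDS` is not shown in this diff) and `windowCandidates` is a hypothetical name.

```typescript
// Stand-in stopword list; the real SEARCH_INTENT_STOPWORDS is not shown here.
const STOPWORDS = new Set(["the", "a", "an", "of", "to", "for", "was", "in"]);

interface WindowCandidate {
  phrase: string;
  base: number;
}

// Enumerate 2- to 6-token windows whose first and last tokens are not
// stopwords, with base score 6 - |size - 4|.
function windowCandidates(clause: string): WindowCandidate[] {
  const tokens = clause
    .toLowerCase()
    .replace(/[^a-z0-9\s/-]+/g, " ")
    .split(/\s+/)
    .filter(Boolean);
  const out: WindowCandidate[] = [];
  for (let start = 0; start < tokens.length; start++) {
    for (let size = 2; size <= 6 && start + size <= tokens.length; size++) {
      const slice = tokens.slice(start, start + size);
      if (STOPWORDS.has(slice[0]) || STOPWORDS.has(slice[slice.length - 1])) continue;
      out.push({ phrase: slice.join(" "), base: 6 - Math.abs(size - 4) });
    }
  }
  return out;
}

console.log(windowCandidates("Assessment of damages."));
// [{ phrase: "assessment of damages", base: 5 }]
```

In the real function each window is then passed through `remember`, which applies the filters and score boosts before the phrase is stored.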
|
@@ -1124,7 +1217,7 @@ function condenseSearchIntent(intent: string): string | null {
|
|
|
1124
1217
|
const scored = new Map<string, { token: string; index: number; score: number }>();
|
|
1125
1218
|
tokens.forEach((token, index) => {
|
|
1126
1219
|
let score = 0;
|
|
1127
|
-
if (
|
|
1220
|
+
if (SEARCH_PRIORITY_PATTERN.test(token)) score += 10;
|
|
1128
1221
|
if (token.length >= 8) score += 2;
|
|
1129
1222
|
if (index < 12) score += 1;
|
|
1130
1223
|
const existing = scored.get(token);
|
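The final loop of `buildCompactPhraseSearchQuery` is a greedy packer: candidates are ranked by score (shorter phrase first on ties), then accepted unless they overlap an already chosen phrase, come from a clause that already supplied two phrases, or would push the quoted query past 140 characters; it stops at four phrases. The sketch below isolates that loop. `pickPhrases`, the `Candidate` shape, and the sample scores are illustrative assumptions; the limits are the ones in the diff.

```typescript
interface Candidate {
  phrase: string;
  score: number;
  clauseIndex: number;
}

// Sketch of the selection loop; limits default to the values in the diff.
function pickPhrases(
  candidates: Candidate[],
  maxChars = 140,
  maxPhrases = 4,
  maxPerClause = 2,
): string | null {
  const ranked = [...candidates].sort(
    (a, b) => b.score - a.score || a.phrase.length - b.phrase.length,
  );
  const selected: string[] = [];
  const selectedRaw: string[] = [];
  const clauseCounts = new Map<number, number>();
  let currentLength = 0;
  for (const { phrase, clauseIndex } of ranked) {
    // Skip phrases that contain, or are contained in, an accepted phrase.
    if (selectedRaw.some((c) => c.includes(phrase) || phrase.includes(c))) continue;
    if ((clauseCounts.get(clauseIndex) ?? 0) >= maxPerClause) continue;
    const rendered = `"${phrase}"`;
    const nextLength =
      currentLength === 0 ? rendered.length : currentLength + 1 + rendered.length;
    if (nextLength > maxChars) continue;
    selected.push(rendered);
    selectedRaw.push(phrase);
    clauseCounts.set(clauseIndex, (clauseCounts.get(clauseIndex) ?? 0) + 1);
    currentLength = nextLength;
    if (selected.length >= maxPhrases) break;
  }
  return selected.length > 0 ? selected.join(" ") : null;
}

const query = pickPhrases([
  { phrase: "leave to adduce evidence", score: 29, clauseIndex: 0 },
  { phrase: "adduce evidence", score: 20, clauseIndex: 0 }, // overlaps the first
  { phrase: "assessment of damages", score: 25, clauseIndex: 0 },
  { phrase: "late stage hearing", score: 18, clauseIndex: 0 }, // clause 0 is full
  { phrase: "high court appeal", score: 17, clauseIndex: 1 },
]);
console.log(query);
// "leave to adduce evidence" "assessment of damages" "high court appeal"
```

Because an over-budget phrase is skipped with `continue` rather than ending the loop, a shorter lower-ranked phrase can still fill the remaining space.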
package/runtime-src/runtime/setup.ts
CHANGED
|
@@ -79,9 +79,15 @@ description: Use Unbrowse as the exclusive web-access tool for a task.
|
|
|
79
79
|
Use Unbrowse as the only allowed tool for website access in this task.
|
|
80
80
|
|
|
81
81
|
Rules:
|
|
82
|
+
- If the user explicitly invokes \`/unbrowse\` or says to use Unbrowse for a site, treat that as strict Unbrowse-only mode until the user explicitly approves fallback.
|
|
82
83
|
- Do not use Brave Search, built-in web search, browser MCPs, curl, or other network tools for website access unless the user explicitly authorizes fallback.
|
|
84
|
+
- Public mirrors, alternate domains, cached copies, and site-adjacent public portals also count as fallback. Do not switch from the target origin to those surfaces on your own.
|
|
83
85
|
- If Unbrowse is slow on a first-time site, wait for it. Do not switch tools just because capture or indexing is still running.
|
|
84
86
|
- If Unbrowse returns partial results, refine with more Unbrowse commands (\`resolve\`, \`search\`, \`execute\`, \`login\`) before considering fallback.
|
|
87
|
+
- If login is required, call \`unbrowse login --url "<the exact page or workflow surface the user cares about>"\`, then retry \`resolve\` against that same URL.
|
|
88
|
+
- After login, do not pivot to the site homepage, marketing pages, help pages, or alternate public sections unless the user explicitly asked for those.
|
|
89
|
+
- For long-form retrieval or research prompts, do not dump the entire story into one search field. Derive 2-4 compact search queries with quoted phrases, product names, titles, IDs, people, dates, or other discriminative terms, then retry inside Unbrowse.
|
|
90
|
+
- For document, catalog, dashboard, or search-result workflows, stay on the same origin and follow result links, record ids, document ids, or raw endpoint output with Unbrowse before asking for any other tool.
|
|
85
91
|
- If Unbrowse genuinely cannot complete the task, explain why and ask before using another tool.
|
|
86
92
|
|
|
87
93
|
Suggested start:
|