ax-grep 0.0.0 → 0.1.1
- package/LICENSE +21 -0
- package/README.md +106 -1
- package/dist/browser.d.ts +11 -0
- package/dist/browser.js +12 -0
- package/dist/browser.js.map +1 -0
- package/dist/chunk-HPZ32BKV.js +612 -0
- package/dist/chunk-HPZ32BKV.js.map +1 -0
- package/dist/chunk-ZXTURCRT.js +925 -0
- package/dist/chunk-ZXTURCRT.js.map +1 -0
- package/dist/cli.d.ts +10 -0
- package/dist/cli.js +22364 -0
- package/dist/cli.js.map +1 -0
- package/dist/index.d.ts +18 -0
- package/dist/index.js +436 -0
- package/dist/index.js.map +1 -0
- package/dist/static.d.ts +6 -0
- package/dist/static.js +8 -0
- package/dist/static.js.map +1 -0
- package/dist/types-gwHWhYmw.d.ts +3660 -0
- package/docs/README.md +19 -0
- package/docs/agent-handoff.md +95 -0
- package/docs/agent-readiness.md +38 -0
- package/docs/assets/ax-grep-benchmark.png +0 -0
- package/docs/assets/ax-grep-og.png +0 -0
- package/docs/assets/ax-grep-search.png +0 -0
- package/docs/benchmarks.md +123 -0
- package/docs/cli-agent.md +194 -0
- package/docs/comparison-baseline.md +625 -0
- package/docs/features.md +28 -0
- package/docs/library-api.md +211 -0
- package/docs/progress.md +1306 -0
- package/package.json +92 -6
- package/skills/ax-grep-cli/SKILL.md +89 -0
- package/skills.sh +24 -0
- package/index.js +0 -1
|
@@ -0,0 +1,625 @@
# Comparison Baseline

Generated on 2026-06-05.

The comparison harness runs `ax-grep` through Puppeteer and compares named
role output against `agent-browser snapshot`.

Operational rule: run comparison suites sequentially. `agent-browser` and
Chromium can exhaust the host when several `compare:static*` runs overlap. Check
for existing browser processes before a run and confirm cleanup afterward.

The historical `named role overlap` score is intentionally strict: it is an
exact match over normalized `role:name` pairs, with no fuzzy or containment
matching. It is useful for tracking whether `ax-grep` is reproducing
`agent-browser snapshot` output, but it can understate agent usefulness when the
missing items are mostly static text.
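
The strictness described above can be illustrated with a minimal sketch. The `NamedNode` shape and the normalization steps (trim, collapse whitespace, lowercase) are assumptions made for illustration; the harness's actual normalization and its exact overlap formula are not shown in this document.

```typescript
// Hypothetical node shape; the real harness types are not shown here.
interface NamedNode {
  role: string;
  name: string;
}

// Assumed normalization: trim, collapse whitespace, lowercase.
function normalizePair(node: NamedNode): string {
  const clean = (value: string): string =>
    value.trim().replace(/\s+/g, " ").toLowerCase();
  return `${clean(node.role)}:${clean(node.name)}`;
}

// Exact set intersection: a pair counts only when it is identical
// after normalization. No fuzzy or containment matching is applied.
function exactMatches(candidate: NamedNode[], reference: NamedNode[]): string[] {
  const referenceKeys = new Set(reference.map(normalizePair));
  const candidateKeys = new Set(candidate.map(normalizePair));
  return Array.from(candidateKeys).filter((key) => referenceKeys.has(key));
}

const referenceNodes: NamedNode[] = [
  { role: "link", name: "Log in" },
  { role: "heading", name: "Welcome" },
];
const candidateNodes: NamedNode[] = [
  // Contains the reference name but is not identical, so it does not match.
  { role: "link", name: "Log in to your account" },
  // Identical to the reference heading once normalized.
  { role: "Heading", name: "  Welcome " },
];
const matchedPairs = exactMatches(candidateNodes, referenceNodes);
```

In this sketch only the heading pair survives, which shows how a near-identical link name still counts as a miss under exact matching.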

The harness also reports `agentReadiness` scores. These are still based on the
same exact normalized `role:name` matches, but split by agent-facing use:

- `referenceRecall`: how much of the `agent-browser` named output appears in
  `ax-grep`.
- `candidatePrecision`: how much of the `ax-grep` named output appears in
  `agent-browser`.
- `actionableRecall`: exact recall for links, buttons, fields, tabs, and other
  operation targets.
- `navigationRecall`: exact recall for links, headings, landmarks, and search.
- `contentRecall`: exact recall for headings, images, table/list structure, and
  static text.
- `structuralContentRecall`: exact recall for headings, images, table/list
  structure, and other non-StaticText content roles.
- `textRecall`: exact recall for StaticText roles only.
- `score`: weighted summary for agent parsing: actionable 40%, navigation 25%,
  structural content 20%, precision 15%. Strict `contentRecall` and
  `textRecall` remain visible because text-heavy pages can expose useful gaps,
  but raw StaticText volume should not dominate the agent-usefulness score.
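
The weighting in the `score` bullet can be written out as a short sketch. The weights come from the list above; the `ReadinessComponents` interface and the function name are hypothetical, and any rounding or handling of missing components in the real harness is not covered here.

```typescript
// Hypothetical container for the four weighted components named above.
interface ReadinessComponents {
  actionableRecall: number;
  navigationRecall: number;
  structuralContentRecall: number;
  candidatePrecision: number;
}

// Weighted summary: actionable 40%, navigation 25%,
// structural content 20%, precision 15%.
function weightedAgentScore(c: ReadinessComponents): number {
  return (
    0.4 * c.actionableRecall +
    0.25 * c.navigationRecall +
    0.2 * c.structuralContentRecall +
    0.15 * c.candidatePrecision
  );
}

// Illustrative values only, not taken from any run in this document.
const exampleScore = weightedAgentScore({
  actionableRecall: 0.9,
  navigationRecall: 0.8,
  structuralContentRecall: 0.5,
  candidatePrecision: 0.6,
});
// 0.36 + 0.20 + 0.10 + 0.09, i.e. about 0.75
```

Because `contentRecall` and `textRecall` do not appear in the formula, a page can have a very low strict content recall and still reach a high `score` when its actionable and navigation recall are strong.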

The static harness also emits `cliAgentSummary`, which scores the actual
agent-facing `--agent` compact JSON envelope rather than raw tree overlap. It uses
top-level `agent` routing status, `pageCheck`, structured content evidence, source links and source quality,
readability, requested verification status, follow-up `nextSteps`, `searchResults`, and `suggestedActions` to estimate how directly an agent can
decide whether to read, open, or retry a page. `averageCliAgentScore`
in `gateSummary` tracks that higher-level usefulness separately from
`agentReadiness`, which remains an exact `agent-browser snapshot` overlap
metric.
`minCliAgentScore` enforces the same readiness floor per included target, so a
weak search, page-check, or browser-retry case cannot be hidden by strong
average results.
When a per-target floor fails, `weakAgentTargets` lists the affected category,
URL, scores, status, and primary action for quick follow-up.
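
The difference between the average and the per-target floor can be sketched as follows. The `TargetScore` and `GateResult` shapes are simplified assumptions: the real `weakAgentTargets` entries also carry category, status, and primary action, and the actual gate logic is not shown in this document.

```typescript
// Simplified, hypothetical per-target record.
interface TargetScore {
  url: string;
  cliAgentScore: number;
}

interface GateResult {
  averageCliAgentScore: number;
  weakAgentTargets: TargetScore[];
  passed: boolean;
}

// The average alone can hide one weak target; the floor is checked per target.
function evaluateGate(targets: TargetScore[], minCliAgentScore: number): GateResult {
  const total = targets.reduce((sum, target) => sum + target.cliAgentScore, 0);
  const average = targets.length > 0 ? total / targets.length : 0;
  const weak = targets.filter((target) => target.cliAgentScore < minCliAgentScore);
  return {
    averageCliAgentScore: average,
    weakAgentTargets: weak,
    passed: weak.length === 0,
  };
}

// Illustrative data: the average is about 0.8, yet one target is below 0.6.
const gate = evaluateGate(
  [
    { url: "https://example.com/a", cliAgentScore: 1.0 },
    { url: "https://example.com/b", cliAgentScore: 1.0 },
    { url: "https://example.com/c", cliAgentScore: 0.4 },
  ],
  0.6,
);
```

Here the gate fails even though the average looks healthy, which is the situation the per-target floor is meant to surface.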

`averageAgentExecutorScore` is the executor-focused aggregate. It combines the
schema, routing, `next`, expected-outcome, signal, read-target, command,
browser-retry, continuation, response, diagnostic, and verification fields that
subagents need to run a search/page-check loop without reconstructing intent
from the raw tree.
`minAgentExecutorScore` applies the executor floor per included target.
The score includes action-schema completeness, so `run-command` actions need
both a human-readable command and raw `commandArgs`, `read-current` actions need
`readFrom`, and `interact-browser` actions need an explicit browser-interaction
signal.
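
The action-schema completeness rule above can be sketched as a small validator. The `execution`, `command`, `commandArgs`, and `readFrom` names appear in this document; the `browserInteraction` field is a hypothetical stand-in for the "explicit browser-interaction signal", and only the three execution classes named above are modeled.

```typescript
type Execution = "run-command" | "read-current" | "interact-browser";

interface AgentAction {
  execution: Execution;
  command?: string;
  commandArgs?: string[];
  readFrom?: string;
  // Hypothetical field name; the real signal's name is not given here.
  browserInteraction?: boolean;
}

// Returns the fields an action is missing for its execution class.
function missingSchemaFields(action: AgentAction): string[] {
  const missing: string[] = [];
  switch (action.execution) {
    case "run-command":
      if (!action.command) missing.push("command");
      if (!action.commandArgs || action.commandArgs.length === 0) {
        missing.push("commandArgs");
      }
      break;
    case "read-current":
      if (!action.readFrom) missing.push("readFrom");
      break;
    case "interact-browser":
      if (action.browserInteraction !== true) missing.push("browserInteraction");
      break;
  }
  return missing;
}

// A run-command action with only the human-readable half is incomplete.
const incomplete = missingSchemaFields({
  execution: "run-command",
  command: "illustrative command string",
});

// A read-current action with a read pointer is complete without any command.
const complete = missingSchemaFields({
  execution: "read-current",
  readFrom: "pageCheck.contentEvidence",
});
```

A score could then be derived from the share of actions with an empty result, though how the harness actually aggregates completeness is not specified in this document.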

`averageActionSchemaScore` tracks that schema completeness directly across the
gate-included targets.
`averageSearchResultActionScore` tracks whether compact search results include
rank-specific `openResult`, `command`, and raw `commandArgs`, so search agents
can open alternate results without reconstructing commands.
`averageAgentRoutingIntentScore` tracks whether `agent.routingIntent` correctly
summarizes the primary action as reading current payload, opening a URL,
searching, retrying with browser HTML, requiring browser interaction, or
stopping.
`averageAgentContinuationModeScore` tracks whether `agent.continuationMode`
maps that intent to the executor-facing mode: `read`, `command`, `browser`,
`capture-html`, `inspect`, or `stop`.
`averageAgentNextScore` tracks whether `agent.next` is a canonical executor
payload that agrees with `continuationMode` and mirrors the primary action's
command, read pointer, URL, browser interaction, and terminal fields.
`averageAgentNextShortcutScore` tracks whether top-level `agent.next*`
shortcuts mirror that canonical `agent.next` payload.
`averageAgentRunbookShortcutScore` tracks whether top-level `agent.runbook*`
shortcuts mirror the nested loop runbook contract.
`averageAgentExpectedOutcomeScore` tracks whether `agent.expectedOutcome`
describes the success condition for the next step, including read pointers when
the next step is evidence reading.
`averageAgentPlanShortcutScore` tracks whether top-level
`agent.expectedOutcome*` and `agent.executionPlan*` shortcuts mirror the nested
next-step contract.
`averageAgentSignalScore` tracks whether `agent.signals` exposes structured
content, verification, search result, source link, browser, response, and
diagnostic signals needed for fast agent routing.
`averageContentEvidenceMetadataScore` tracks whether `pageCheck.contentEvidence`
items include `source` and bounded `score` metadata, so agents can prioritize
semantic evidence over fallback excerpts.
`averageReadabilityReasonScore` tracks whether compact page checks preserve
concise readability reasons, so agents can understand why a page is readable,
thin, blocked, or worth retrying.
`averageAgentReadabilityReasonScore` tracks whether the compact top-level
`agent` summary repeats concise readability reasons, so agents can route from
the first object before drilling into `pageCheck`.
`averageAgentPageMetadataShortcutScore` tracks whether `agent.page*` mirrors
root page metadata such as canonical URL, language, author, dates, and
structured-data types.
`averageAgentSemanticSummaryScore` tracks whether `agent.semanticSummary` and
top-level `agent.semantic*` shortcuts preserve semantic tree counts, role-group
counts, heading/landmark outline flow with parent context, keyboard shortcut hints, in-page links, top role/heading/landmark/named role fields with direct paths,
interactive and focusable state, link URL, button description, image,
table/list structure, form-field input hints, description, value, resolved relation targets, choice, state, and unavailable-subtree shortcuts for quick page-shape routing.
State scoring includes parsed top-state fields so agents do not need to parse
ARIA state strings.
`averageAgentBarrierShortcutScore` tracks whether top-level `agent.topBarrier*`
shortcuts mirror the highest-priority page barrier.
`averageAgentStructuredShortcutScore` tracks whether top-level structured
content counts and `top*` shortcuts mirror the first table, FAQ, code block,
resource, media item, section paths/selectors, navigation/media structure,
provenance, offer, dataset, identity, timeline, contact point, and best structured
read-target shortcut.
`averageAgentReadTargetScore` tracks whether `agent.readTargets` points to
payload fields that actually exist and are worth reading, and whether
`read-current` actions mark the matching target as primary.
`averageAgentTopReadTargetShortcutScore` tracks whether `agent.topReadTarget*`
mirrors the first read-target entry for fast routing without scanning
`agent.readTargets`.
`averageAgentAlternativeActionShortcutScore` tracks whether top-level
`agent.alternativeAction*` shortcuts mirror the first non-primary action.
`averageAgentHandoffScore` and `averageAgentBriefExecutorStepScore` also cover
handoff detail preservation. Search handoffs must keep executable result/source
choices with snippets and command args; answer handoffs must keep selected
evidence text/reasons; read handoffs for forms and action targets must keep URL
templates, fields, selectors, methods, and encoding; diagnostic handoffs must
keep selected signals and quality gates. This prevents the compact handoff from
turning into an opaque "retry/open this" instruction.
`averageAgentResultChoiceScore`, `averageAgentSourceChoiceScore`, and
`averageAgentActionListScore` cover the same problem outside the handoff.
Search choices keep snippets, freshness dates, and sitelinks; source choices
keep source-link text, snippets, selectors, and executable commands; source-link
actions keep `sourceLinkRef` so agents can jump back to the exact
`pageCheck.sourceLinks[n]` item.
`averageAgentTopActionShortcutScore` tracks whether `agent.topAction*` mirrors
the first action candidate, including execution, priority, command/read target,
URL, and source-link reference.
`averageAgentResultCountScore` tracks whether `agent.resultCount` is zero for
non-search pages and at least the compact result count for search pages.
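
The `agent.resultCount` rule just described can be expressed as a one-line consistency check. This is a sketch only: the function name and the `isSearchPage` flag are hypothetical, and how the harness decides that a page is a search page is not shown here.

```typescript
// Consistency rule for agent.resultCount as described above:
// zero on non-search pages, at least the compact result count on search pages.
function resultCountConsistent(
  isSearchPage: boolean,
  agentResultCount: number,
  compactResultCount: number,
): boolean {
  return isSearchPage
    ? agentResultCount >= compactResultCount
    : agentResultCount === 0;
}

// Illustrative checks with made-up counts.
const searchOk = resultCountConsistent(true, 10, 5); // count covers the compact list
const searchBad = resultCountConsistent(true, 3, 5); // fewer than the compact list
const contentBad = resultCountConsistent(false, 2, 0); // non-search page must report zero
```

The "at least" comparison leaves room for a count that is larger than the compact list, which is consistent with compact search output being capped.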

`averageAgentChoiceCountScore` tracks whether executable choice-count shortcuts
match their result, form, action-target, and source-link source counts.
`averageAgentTopChoiceShortcutScore` tracks whether `agent.topChoiceKind`,
path, label, URL, and command arguments mirror the first executable result,
source, form, or action-target choice for fast subagent routing.
`averageAgentTopResultChoiceShortcutScore` tracks whether `agent.topResultChoice*`
mirrors the first search result choice, including URL, rank, open-result value,
command arguments, source quality, freshness/relevance, matches, sitelinks, and
selection reason.
`averageAgentTopSourceChoiceShortcutScore` tracks whether source-link specific
top-level shortcuts mirror the first executable source choice, including source
type, hints, score, and selection reason.
`averageAgentEvidenceCountShortcutScore` tracks citation, answer-evidence,
read-target, and action count shortcuts against their agent arrays.
`averageAgentTopCitationShortcutScore` tracks whether `agent.topCitation*`
mirrors the first citation item, including path, kind, confidence, reason, URL,
and score.
`averageAgentSignalCountShortcutScore` tracks signal severity and failing
quality-gate count shortcuts against `agent.signals` and `agent.qualityGates`.
`averageAgentTopQualityShortcutScore` tracks whether `agent.topSignal*` and
`agent.topQualityGate*` mirror the first signal and quality gate for fast
accept/block routing without scanning diagnostic arrays.
`averageAgentProblemShortcutScore` tracks whether `agent.problemSignalKind`,
severity, message, and `agent.failingQualityGate*` mirror the first
warning/error signal and first failing quality gate, including gate severity and
score, so agents can explain blocked pages without scanning diagnostic arrays.
`averageAgentSourceLinkCountScore` tracks whether `agent.sourceLinkCount` is
zero for search pages and matches compact `pageCheck.sourceLinks` for ordinary
content pages.
`averageAgentFormActionCountScore` tracks whether top-level `agent.formCount`
and `agent.actionTargetCount` match compact `pageCheck.forms` and
`pageCheck.actionTargets`, so agents can detect hidden forms and JSON-LD/OpenSearch
actions before scanning nested page-check arrays.
`averageAgentFormActionChoiceScore` tracks whether `agent.formChoices` and
`agent.actionTargetChoices` preserve the compact form/action target IDs, paths,
selectors, URL templates, query inputs, submit text, first-field hints, and
methods needed for subagent form execution and selection loops.
`averageAgentTopFormActionChoiceShortcutScore` tracks whether top-level
form/action-target shortcuts mirror the first executable form and action target.
`averagePageLinkCommandScore` tracks whether compact `pageCheck.primaryLinks`
and `pageCheck.sourceLinks` include direct `command` and `commandArgs`, so
agents can open page links without reconstructing fetch flags.
`averageAgentBrowserNeedScore` tracks whether `agent.needsBrowserHtml` agrees
with the primary action: browser HTML retry actions should require browser
HTML, while URL search recovery, alternate-result recovery, read-current, and
retry-later actions should not.
`averageAgentBrowserHtmlScore` tracks whether browser HTML fallback payloads
preserve capture file/script data plus nested reason, command, and command-args
fields across `agent.next`, execution plans, executor steps, and handoff steps.
`averageAgentPageKindScore` tracks whether `agent.pageKind` mirrors the root
payload `kind`, so agents can route from the top-level `agent` object without
re-reading `analysis.kind` or the envelope root.
`averageAgentAlternativeActionCountScore` tracks whether
`agent.alternativeActionCount` matches the deduplicated compact follow-up
actions left outside `agent.primaryAction`, so agents can know whether a page
has useful alternatives before scanning nested action arrays.
`averageAgentUsabilityScoreConsistency` tracks whether `agent.usabilityScore`
matches the documented compact quality heuristic derived from status,
readability, confidence, evidence, search results, source links, and
verification status.
`averageAgentEvidenceQualityScoreConsistency` and
`averageAgentSourceQualityScoreConsistency` track whether top-level evidence
and source quality scores match the compact evidence/source arrays, so agents
can compare payload quality before reading every candidate item.
`averageAgentBestReadTargetScore` tracks whether `agent.bestReadTarget` and its
count, score, primary flag, and reason match the primary or highest-scored
`agent.readTargets` entry, so agents can start reading the best compact field
without sorting candidates.
`averageAgentDiagnosticCountScore` tracks whether top-level diagnostic severity
counts and `agent.topDiagnostic*` match the compact diagnostics array, so agents
can distinguish warnings from hard errors before drilling into diagnostic
messages.
`averageAgentVerificationCountScore` tracks whether top-level verification
requested/found/missing counts match the compact verification object, so agents
can decide whether requested evidence is complete before reading details.
`averageAgentVerificationQueryScore` tracks whether
`agent.verificationFoundQueries` and `agent.verificationMissingQueries` preserve
the exact matched and missing `--find` query lists and whether the top matched
or missing query shortcuts mirror the first items; `agent.handoff` and
`agent.executor` carry the same lists for brief subagent loops.
`averageAgentResponseMetadataScore` tracks whether `agent.responseStatus`,
`agent.responseOk`, `agent.responseContentType`, and `agent.finalUrlChanged`
mirror the compact envelope response fields, so agents can judge fetch health
from the top-level `agent` object.
`averageAgentHiddenSignalScore` tracks whether hidden `pageCheck` groups such
as hydration, API endpoints, app config, app/mobile hints, provenance,
policies, JSON-LD facts, and resource metadata are present at valid payload
paths and discoverable through at least one read target when available. This is
the executor-focused counterweight to raw accessibility-tree overlap: these
signals are often useful to subagents but absent from browser accessibility
snapshots.
`averageAgentHiddenSignalCountScore` tracks whether top-level
`agent.hiddenSignalCount`, `agent.hiddenReadTargetCount`, and
`agent.bestHiddenReadTarget*` match those hidden groups and read-target
shortcuts.
`averageAgentTopHiddenSignalShortcutScore` tracks whether
`agent.topHiddenSignal*` mirrors the first hidden metadata, API, config, or
provenance signal.
`averageAgentHiddenCommandShortcutScore` tracks whether hidden hydration, safe
GET-like API, and app-hint URLs keep executable top-level follow-up commands.
`averageAgentBrowserAdvantageScore` tracks whether those hidden `pageCheck`
signals create a concrete agent-browser advantage when they exist, rather than
only matching visible accessibility-tree roles.
The higher-level CLI agent score also credits hidden `pageCheck` signal groups
and recoverable browser-HTML retry actions. A page with little visible text can
still be useful to a subagent when it exposes metadata read targets or a
runnable browser-capture handoff.
When `ax-grep` produces a ready, high-scoring agent payload with content
evidence, the static comparison treats the page as usable before applying raw
tree-size failure classes such as thin-reference challenge or over-collection.
This keeps agent-browser advantage cases visible in the gate even when the raw
static tree is larger than the browser snapshot.
`averageAgentCanContinueScore` tracks whether `agent.canContinue` agrees with
the primary action execution class, so recoverable errors with runnable actions
do not look terminal and usage/input errors without actions do not look
actionable.
`averageAgentPrimaryExecutionScore` tracks whether `agent.primaryExecution`
matches `agent.primaryAction.execution`, so agents can route from the shortcut
field without rereading the full action object.
`averageAgentPrimaryShortcutScore` tracks whether `agent.primaryActionName`,
reason, priority, command, URL, rank, read-from, source-link reference, and
browser shortcuts mirror `agent.primaryAction`, so agents can continue from
top-level routing fields.
`averageAgentExecutorShortcutScore` tracks whether `agent.executorActionName`,
decision, mode, operation, confidence, terminal/continue flags, command
arguments, read-from, URL, target, and expected-outcome shortcuts mirror
`agent.executor`, so subagents can route the next step without parsing the full
executor object.
`averageAgentHandoffShortcutScore` tracks whether `agent.handoffActionName`,
decision, mode, operation, answer status, confidence, terminal/continue flags,
priority, command arguments, read-from, URL, target, and expected-outcome shortcuts
mirror `agent.handoff`, so brief loops can run from top-level fields when they
do not need the full handoff object.
`averageAgentAnswerShortcutScore` tracks whether `agent.answerPlanStatus`,
confidence, reason, next action, gap count, citation IDs, first answer-evidence
metadata, command arguments, after-interaction command, read-from, and URL
shortcuts mirror `agent.answerPlan` and
`agent.answerEvidence`, so agents can decide whether to answer or continue
without parsing the full plan object.
`averageAgentSourceSearchProvenanceScore` tracks whether opened-result payloads
with `sourceSearch.selectedResult` or `sourceSearch.alternateResults` expose
matching `agent.readTargets`, so agents can inspect original SERP provenance
before trusting or recovering from an opened page.
`averageAgentSourceSearchShortcutScore` tracks whether top-level
`agent.sourceSearchQuery`, locale, verification-query count/top query,
engine/search URL, selected rank/title/URL, selected command, and first
alternate command mirror the source-search payload for quick SERP recovery
decisions.
The command shortcuts are exposed as `sourceSearchSelectedCommandArgs` and
`sourceSearchAlternateCommandArgs`.
`averageAgentRecommendedMetadataScore` tracks whether search pages with a
`recommendedResult` repeat its URL, title, rank, source, relevance,
official-source hint, selection reason, and command args on the top-level
`agent` object for quick routing.
`averageAgentSearchDecisionScore` and `averageAgentPageDecisionScore` also check
top-level `agent.searchDecision*` and `agent.pageDecision*` shortcuts, so agents
can route without reopening the nested decision objects.
Terminal actions such as `read-content` and `use-evidence` are treated as
usable without executable commands when `execution` is `read-current` and a
`readFrom` pointer is present, because the compact payload already contains the
evidence an agent should read. Browser-interaction actions are also valid
without commands when `execution` is `interact-browser`; in those cases another
static fetch would not advance the page state.

The `--agent` payload intentionally removes repeated routing data: top-level
diagnostics are represented as `agent.diagnosticCodes`, repeated primary actions
are omitted from `suggestedActions` and verification/page-check action slots,
page-level alternatives are suppressed when verification has selected
`use-evidence`,
search-page link/action follow-ups are omitted when `searchResults` already
carry the decision surface, and search output is capped to the first five
results plus any out-of-window recommended result. Opened-result payloads omit
engine attempts once `sourceSearch` records the selected result, but keep compact
selected/alternate SERP candidates with executable open commands for failure
recovery, preserving custom fetch options such as `--timeout` and
`--user-agent`. Page checks also
skip common global-navigation headings, links, and buttons, and omit extra
external primary links when source links are already present, so repository and
documentation pages route agents toward page content instead of site chrome.
Fetch failures that still have a target URL emit an executable browser-HTML
retry command, preserving `--find` checks for the next run. When browser HTML is
already supplied through `--html-file` or `--stdin`, the compact agent payload
does not ask for another browser retry. Parsed search result pages keep
`agent.canUseFetchedHtml` true even when their page-readability score is low,
because the result cards remain usable for open/refine routing.
Captured blocker pages still keep challenge/login/paywall diagnostics; the
follow-up action changes to browser-state inspection instead of another capture
loop. HTTP error actions are status-aware, so missing URLs and transient server
errors no longer all collapse into a browser-HTML retry; missing opened search
results and opened-result verification failures can route directly to an
alternate original SERP candidate when one matches the missing `--find` text.
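
The search-output cap mentioned in the paragraph above (the first five results plus any out-of-window recommended result) can be sketched like this. The `SearchResult` shape, the function name, and the use of the URL to detect whether the recommended result is already inside the window are assumptions for illustration.

```typescript
// Hypothetical compact result shape.
interface SearchResult {
  rank: number;
  url: string;
}

// Keep the first `windowSize` results and append the recommended result
// only when it falls outside that window. Input is assumed rank-ordered.
function compactSearchResults(
  results: SearchResult[],
  recommended?: SearchResult,
  windowSize: number = 5,
): SearchResult[] {
  const compact = results.slice(0, windowSize);
  if (recommended && !compact.some((result) => result.url === recommended.url)) {
    compact.push(recommended);
  }
  return compact;
}

// Eight illustrative results ranked 1..8.
const allResults: SearchResult[] = Array.from({ length: 8 }, (_, index) => ({
  rank: index + 1,
  url: `https://example.com/result-${index + 1}`,
}));

// Recommended result at rank 7 is outside the window, so it is appended.
const withOutOfWindow = compactSearchResults(allResults, allResults[6]);

// Recommended result at rank 2 is already inside the window.
const withInWindow = compactSearchResults(allResults, allResults[1]);
```

With these inputs the first call yields six entries and the second yields five, so the recommended result is never dropped and never duplicated.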

The default baseline does not force a viewport. A shared viewport can be tested
with `AX_LITE_COMPARE_VIEWPORT=WIDTHxHEIGHT`, but the default run is kept stable
to avoid changing the benchmark shape unexpectedly.
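
A value in the `WIDTHxHEIGHT` form described above could be parsed as in the following sketch. The function name and the choice to throw on malformed input are assumptions; how the harness itself validates `AX_LITE_COMPARE_VIEWPORT` is not shown in this document.

```typescript
interface Viewport {
  width: number;
  height: number;
}

// Parses "WIDTHxHEIGHT" (for example "1280x720").
// Returns undefined when no value is set, matching the unforced default.
function parseViewport(raw: string | undefined): Viewport | undefined {
  if (!raw) return undefined;
  const match = /^(\d+)x(\d+)$/.exec(raw.trim());
  if (!match) {
    throw new Error(`Expected WIDTHxHEIGHT, got "${raw}"`);
  }
  return { width: Number(match[1]), height: Number(match[2]) };
}

const parsedViewport = parseViewport("1280x720");
const defaultViewport = parseViewport(undefined);
```

In a Node script the input would typically come from `process.env.AX_LITE_COMPARE_VIEWPORT`, and the same parsed value would need to be applied to both Puppeteer and `agent-browser` so the two sides stay comparable.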

For state-sensitive pages, `AX_LITE_COMPARE_SETUP=path/to/setup.js` evaluates a
setup script in both Puppeteer and `agent-browser` before extraction. This keeps
exact-match scoring intact while making page state explicit.

## Sample Results

| URL | ax-grep nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `https://example.com` | 4 | 3 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| `https://www.wikipedia.org` | 140 | 105 | 0.57 | 0.97 | 1.00 | 0.04 | 0.80 |
| `https://developer.mozilla.org/en-US/docs/Web/Accessibility` | 315 | 286 | 0.56 | 0.74 | 0.89 | 0.15 | 0.68 |
| `https://news.ycombinator.com` | 710 | 501 | 0.75 | 0.82 | 0.82 | 0.63 | 0.78 |
| `https://github.com/features` | 764 | 538 | 0.90 | 0.88 | 0.95 | 0.93 | 0.92 |
| `https://libraries.io/npm/typescript` | 382 | 609 | 0.49 | 0.95 | 0.95 | 0.17 | 0.80 |
| `https://www.npmjs.com/package/typescript` | 16 | 15 | 0.50 | 0.67 | 0.80 | 0.50 | 0.72 |

## Korean Sample Results

Run with `pnpm compare:korea`.

| URL | ax-grep nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `https://ko.wikipedia.org/wiki/%EB%8C%80%ED%95%9C%EB%AF%BC%EA%B5%AD` | 5713 | 7088 | 0.44 | 0.82 | 0.85 | 0.14 | 0.69 |
| `https://www.hani.co.kr/` | 998 | 992 | 0.42 | 0.50 | 0.48 | 0.23 | 0.48 |
| `https://www.korea.kr/` | 569 | 494 | 0.47 | 0.66 | 0.69 | 0.24 | 0.59 |
| `https://www.yonhapnewstv.co.kr/` | 566 | 448 | 0.79 | 0.79 | 0.83 | 0.79 | 0.81 |

## Static SSR HTML Results

Run with `pnpm compare:static URL...`.
Run with `pnpm compare:static:agent` for the smaller executor-focused regression
set that exercises readable pages, listings, forum-style links, and a search
diagnostic while tracking `averageAgentExecutorScore`.
In the current run, the gate summary includes 8 targets and excludes 4
diagnostics; `averageAgentExecutorScore` is 1.00 and
`averageAgentHiddenSignalScore` is 1.00 for the included executor targets.
`averageAgentBrowserAdvantageScore` is also tracked so the hidden-metadata
fixture proves more than raw accessibility-tree overlap.
The set includes a synthetic hidden-metadata gate whose browser snapshot only
contains a visible heading while `pageCheck` exposes 13 hidden head, script,
policy, app-link, provenance, and JSON-LD signals.

This path fetches HTML and runs `extract(html)` from the static entry without
Chrome, jsdom, WebView, layout, or script execution. `agent-browser` is used only
as the reference snapshot for comparison.

| URL | fetched bytes | static nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `https://example.com` | 528 | 5 | 3 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| `https://www.wikipedia.org` | 120361 | 193 | 105 | 0.57 | 0.97 | 1.00 | 0.04 | 0.77 |
| `https://news.ycombinator.com` | 34665 | 700 | 498 | 0.74 | 0.81 | 0.81 | 0.64 | 0.77 |
| `https://www.yonhapnewstv.co.kr/` | 47910 | 630 | 440 | 0.51 | 0.75 | 0.78 | 0.75 | 0.72 |

## Diverse Static Results

Run with `pnpm compare:static:diverse`.

| Category | URL | class | fetched bytes | static nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score |
| --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| News index | `https://www.bbc.com/news` | usable | 317290 | 672 | 427 | 0.48 | 0.52 | 0.65 | 0.55 | 0.55 |
| News article | `https://www.npr.org/2025/03/11/nx-s1-5324543/ntsb-dca-mid-air-collision-american-black-hawk` | usable | 106490 | 481 | 201 | 0.33 | 0.83 | 0.89 | 0.48 | 0.70 |
| News portal stress | `https://www.theguardian.com/international` | over-collected | 1429586 | 3829 | 1225 | 0.30 | 0.90 | 0.64 | 0.27 | 0.68 |
| Government service | `https://www.gov.uk/foreign-travel-advice` | usable | 111369 | 714 | 698 | 0.53 | 0.97 | 0.99 | 0.49 | 0.81 |
| Accessibility guide | `https://www.nottinghamshire.gov.uk/global-content/how-to-create-accessible-content/how-to-make-web-pages-accessible/checklist-web-page` | usable | 31747 | 239 | 250 | 0.49 | 0.70 | 0.76 | 0.33 | 0.61 |
| Ecommerce fixture | `https://books.toscrape.com/` | usable | 51294 | 482 | 528 | 0.61 | 0.88 | 0.91 | 0.77 | 0.82 |
| Reddit legacy | `https://old.reddit.com/r/programming/` | challenge | 136514 | 1255 | 1 | 0.00 | 0.00 | 0.00 | 1.00 | 0.20 |
| Reddit modern | `https://www.reddit.com/r/programming/` | challenge | 8438 | 53 | 1 | 0.00 | 0.00 | 0.00 | 1.00 | 0.35 |
| X social challenge | `https://x.com/NASA` | needs-browser | 277862 | 38 | 35 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Instagram social challenge | `https://www.instagram.com/nasa/` | shell | 882680 | 3 | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
|
|
406
|
+
|
|
407
|
+
## Token Cost Results
|
|
408
|
+
|
|
409
|
+
Run with `pnpm compare:tokens URL...`.
|
|
410
|
+
|
|
411
|
+
This serializes both browser-injected and static SSR extraction into compact
|
|
412
|
+
agent prompt text and estimates token cost with `cl100k_base`. It also measures
|
|
413
|
+
the recommended `--agent` compact JSON payload, so the benchmark can compare raw
|
|
414
|
+
tree prompts with the actual CLI payload agents should use. The prompt text
|
|
415
|
+
includes role, name, state/value, and selectors for interactive nodes.
|
|
416
|
+
Token gate averages skip browser references that are only a tiny shell while
|
|
417
|
+
static or agent output contains substantially more inspectable payload. Those
|
|
418
|
+
thin browser snapshots are counted separately as `excludedThinBrowserReference`
|
|
419
|
+
instead of distorting static/browser and agent/browser ratios.
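
A minimal sketch of that exclusion, not the harness's actual code. The `TokenRow` shape, the `isThinBrowserReference` helper, and both thresholds are assumptions made for illustration; only the `excludedThinBrowserReference` counter name comes from this document. The sample rows reuse token counts from the tables in this file.

```typescript
// Hypothetical row shape; the real harness output may differ.
interface TokenRow {
  url: string;
  browserTokens: number;
  staticTokens: number;
}

// Assumed thresholds: a "thin" reference is tiny in absolute terms and
// at least ten times smaller than the static payload.
const THIN_MAX_BROWSER_TOKENS = 100;
const THIN_MIN_STATIC_MULTIPLE = 10;

function isThinBrowserReference(row: TokenRow): boolean {
  return (
    row.browserTokens < THIN_MAX_BROWSER_TOKENS &&
    row.staticTokens >= row.browserTokens * THIN_MIN_STATIC_MULTIPLE
  );
}

function summarizeTokenGate(rows: TokenRow[]) {
  // Thin references are counted separately instead of entering the average.
  const included = rows.filter((row) => !isThinBrowserReference(row));
  const ratios = included.map((row) => row.staticTokens / row.browserTokens);
  const averageStaticBrowserRatio =
    ratios.length === 0
      ? null
      : ratios.reduce((sum, ratio) => sum + ratio, 0) / ratios.length;
  return {
    averageStaticBrowserRatio,
    excludedThinBrowserReference: rows.length - included.length,
  };
}

const tokenGateSummary = summarizeTokenGate([
  { url: "https://news.ycombinator.com", browserTokens: 14503, staticTokens: 6356 },
  { url: "https://www.gov.uk/foreign-travel-advice", browserTokens: 19115, staticTokens: 6477 },
  { url: "https://old.reddit.com/r/programming/", browserTokens: 58, staticTokens: 9343 },
]);
console.log(tokenGateSummary);
```

With these assumed thresholds, the old Reddit row is counted as a thin reference and left out of the average, so its very large static/browser ratio does not dominate the summary.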
|
|
420
|
+
|
|
421
|
+
| URL | browser nodes | browser tokens | static nodes | static tokens | static delta | static/browser ratio |
|
|
422
|
+
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
|
|
423
|
+
| `https://example.com` | 4 | 37 | 5 | 29 | -8 | 0.78 |
|
|
424
|
+
| `https://www.wikipedia.org` | 140 | 1339 | 193 | 1292 | -47 | 0.97 |
|
|
425
|
+
| `https://news.ycombinator.com` | 704 | 14503 | 700 | 6356 | -8147 | 0.44 |
|
|
426
|
+
| `https://www.yonhapnewstv.co.kr/` | 568 | 14397 | 630 | 10877 | -3520 | 0.76 |
|
|
427
|
+
|
|
428
|
+
## Diverse Token Cost Results
|
|
429
|
+
|
|
430
|
+
Run with `pnpm compare:tokens:diverse`.
|
|
431
|
+
|
|
432
|
+
| Category | URL | browser nodes | browser tokens | static nodes | static tokens | static/browser ratio |
|
|
433
|
+
| --- | --- | ---: | ---: | ---: | ---: | ---: |
|
|
434
|
+
| News index | `https://www.bbc.com/news` | 554 | 9617 | 672 | 6606 | 0.69 |
|
|
435
|
+
| News article | `https://www.npr.org/2025/03/11/nx-s1-5324543/ntsb-dca-mid-air-collision-american-black-hawk` | 504 | 9122 | 481 | 4152 | 0.46 |
|
|
436
|
+
| Government service | `https://www.gov.uk/foreign-travel-advice` | 722 | 19115 | 714 | 6477 | 0.34 |
|
|
437
|
+
| Ecommerce fixture | `https://books.toscrape.com/` | 455 | 7014 | 482 | 3599 | 0.51 |
|
|
438
|
+
| Reddit legacy challenge | `https://old.reddit.com/r/programming/` | 6 | 58 | 1264 | 9343 | 161.09 |
|
|
439
|
+
| X social challenge | `https://x.com/NASA` | 314 | 8041 | 38 | 237 | 0.03 |
|
|
440
|
+
| Instagram social challenge | `https://www.instagram.com/nasa/` | 35 | 640 | 351 | 1883 | 2.94 |
|
|
441
|
+
|
|
442
|
+
## Korean/Social Static Benchmark
|
|
443
|
+
|
|
444
|
+
Run with `pnpm compare:static:korea-social` and
|
|
445
|
+
`pnpm compare:tokens:korea-social`.
|
|
446
|
+
|
|
447
|
+
This target set covers Clien, Ruliweb, DCInside, Google/Bing/Startpage Search,
|
|
448
|
+
X/Twitter, and Instagram. The static comparison benchmark first tries plain HTML
|
|
449
|
+
fetch. If the response looks like a bot challenge, login shell, or empty
|
|
450
|
+
client-rendered shell, it falls back to `agent-browser` rendered HTML through
|
|
451
|
+
`document.documentElement.outerHTML` before running the static extractor.
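
The fallback decision could look roughly like the sketch below. Everything here is hypothetical: the marker strings, the `minTextLength` cutoff, the `chooseHtmlSource` name, and the `"rendered"` label are illustrative assumptions, not the harness's real heuristics. Only the `fetch` value appears in the tables in this file.

```typescript
// "fetch" matches the HTML source column; "rendered" is an assumed label
// for the agent-browser outerHTML fallback.
type HtmlSource = "fetch" | "rendered";

// Hypothetical markers; a real harness would tune these per target.
const CHALLENGE_MARKERS = ["captcha", "verify you are human", "unusual traffic"];
const LOGIN_MARKERS = ["log in to continue", "sign in to continue"];

// Strip scripts, styles, and tags to approximate the text a plain fetch exposes.
function visibleText(html: string): string {
  return html
    .replace(/<(script|style|noscript)\b[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

function chooseHtmlSource(fetchedHtml: string, minTextLength = 200): HtmlSource {
  const text = visibleText(fetchedHtml).toLowerCase();
  const looksLikeChallenge = CHALLENGE_MARKERS.some((marker) => text.includes(marker));
  const looksLikeLoginShell = LOGIN_MARKERS.some((marker) => text.includes(marker));
  const looksLikeEmptyShell = text.length < minTextLength;
  return looksLikeChallenge || looksLikeLoginShell || looksLikeEmptyShell
    ? "rendered"
    : "fetch";
}

const shellHtml = '<html><body><div id="root"></div><script>boot()</script></body></html>';
const articleHtml = "<article><p>" + "Static article text. ".repeat(20) + "</p></article>";
console.log(chooseHtmlSource(shellHtml), chooseHtmlSource(articleHtml));
```

Substring markers like these would misclassify a legitimate article that merely mentions a captcha, which is one reason such targets are better kept as diagnostics than as gates.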
|
|
452
|
+
|
|
453
|
+
Search and social targets stay in the benchmark as diagnostics, but are not
|
|
454
|
+
included in the gate summary because their logged-out public views are
|
|
455
|
+
sensitive to anti-bot measures, hydration, and personalization. In this run, the gate
|
|
456
|
+
summary includes 4 targets and excludes 5 diagnostics; the average gate agent
|
|
457
|
+
score is 0.698 and the average static/browser token ratio is 0.395.
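
As a worked check, the gate summary can be recomputed from the rounded values in the table below. The `BenchmarkRow` shape and the `gateSummary` helper are illustrative assumptions rather than the harness's API, and the harness presumably averages unrounded values, so agreement is only expected to within rounding.

```typescript
// Hypothetical row shape mirroring four of the table's columns.
interface BenchmarkRow {
  category: string;
  gate: "gate" | "diagnostic";
  agentScore: number;
  tokenRatio: number | null; // null when the browser reference is unavailable
}

const koreaSocialRows: BenchmarkRow[] = [
  { category: "Clien home", gate: "gate", agentScore: 0.675, tokenRatio: 0.286 },
  { category: "Clien post", gate: "gate", agentScore: 0.483, tokenRatio: 0.303 },
  { category: "Ruliweb post", gate: "gate", agentScore: 0.892, tokenRatio: 0.61 },
  { category: "DCInside post", gate: "gate", agentScore: 0.74, tokenRatio: 0.381 },
  { category: "Google search", gate: "diagnostic", agentScore: 0.15, tokenRatio: 0.11 },
  { category: "Bing search", gate: "diagnostic", agentScore: 0.511, tokenRatio: 1.249 },
  { category: "Startpage search", gate: "diagnostic", agentScore: 0.878, tokenRatio: 0.341 },
  { category: "X social", gate: "diagnostic", agentScore: 0.75, tokenRatio: 0.193 },
  { category: "Instagram social", gate: "diagnostic", agentScore: 0.543, tokenRatio: 0.145 },
];

// Arithmetic mean; returns NaN for an empty list.
function mean(values: number[]): number {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

function gateSummary(rows: BenchmarkRow[]) {
  // Only rows marked "gate" contribute to the averages.
  const gateRows = rows.filter((row) => row.gate === "gate");
  const ratios = gateRows
    .map((row) => row.tokenRatio)
    .filter((ratio): ratio is number => ratio !== null);
  return {
    gateTargets: gateRows.length,
    diagnostics: rows.length - gateRows.length,
    averageAgentScore: mean(gateRows.map((row) => row.agentScore)),
    averageTokenRatio: mean(ratios),
  };
}

const koreaSocialSummary = gateSummary(koreaSocialRows);
console.log(koreaSocialSummary);
```

The four gate rows give an average agent score of about 0.6975 and an average token ratio of about 0.395, consistent with the reported 0.698 and 0.395.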
|
|
458
|
+
|
|
459
|
+
| Category | gate | HTML source | class | static nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score | static/browser token ratio |
|
|
460
|
+
| --- | --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
|
|
461
|
+
| Clien home | gate | fetch | usable | 469 | 657 | 0.522 | 0.799 | 0.806 | 0.053 | 0.675 | 0.286 |
|
|
462
|
+
| Clien post | gate | fetch | usable | 573 | 1167 | 0.234 | 0.553 | 0.542 | 0.008 | 0.483 | 0.303 |
|
|
463
|
+
| Ruliweb post | gate | fetch | usable | 396 | 297 | 0.620 | 0.974 | 0.978 | 0.821 | 0.892 | 0.610 |
|
|
464
|
+
| DCInside post | gate | fetch | usable | 1217 | 358 | 0.233 | 0.951 | 0.957 | 0.429 | 0.740 | 0.381 |
|
|
465
|
+
| Google search | diagnostic | fetch | reference-challenge | 1 | 5 | 0.000 | 0.000 | 0.000 | 0.000 | 0.150 | 0.110 |
|
|
466
|
+
| Bing search | diagnostic | fetch | volatile | 152 | 126 | 0.380 | 0.719 | 0.590 | 0.091 | 0.511 | 1.249 |
|
|
467
|
+
| Startpage search | diagnostic | fetch | reference-challenge | 85 | 61 | 0.861 | 0.963 | 0.957 | 0.625 | 0.878 | 0.341 |
|
|
468
|
+
| X social | diagnostic | fetch | usable | 156 | 36 | 0.169 | 1.000 | 0.900 | 0.500 | 0.750 | 0.193 |
|
|
469
|
+
| Instagram social | diagnostic | fetch | usable | 36 | 115 | 0.255 | 0.293 | 1.000 | 0.130 | 0.543 | 0.145 |
|
|
470
|
+
|
|
471
|
+
Notes:
|
|
472
|
+
|
|
473
|
+
- Ruliweb can require rendered HTML fallback in some runs, but this run fetched
|
|
474
|
+
useful static HTML directly.
|
|
475
|
+
- Clien matching improved after benchmark normalization started stripping icon
|
|
476
|
+
font private-use glyphs and leading menu bullets from comparable names.
|
|
477
|
+
- DCInside preserved action/navigation signals, moved from `over-collected` to
|
|
478
|
+
`usable`, and lowered static/browser token ratio below 0.50 after compact
|
|
479
|
+
static extraction started pruning unnamed leaf wrappers.
|
|
480
|
+
- Google Search returned a bot/interstitial shell in the browser reference path.
|
|
481
|
+
- Bing Search is volatile in this environment: fetch or rendered HTML can expose
|
|
482
|
+
useful search UI or unrelated image-search affordances, but the exact
|
|
483
|
+
reference comparison is not stable enough for a gate yet. Search diagnostics
|
|
484
|
+
can be classified as `volatile` instead of `usable`.
|
|
485
|
+
- Startpage can return useful fetch HTML, but this run hit a suspended-connection
|
|
486
|
+
captcha page in the browser-derived reference path. Embedded CSS-in-JS text is
|
|
487
|
+
now excluded from static names, but the target remains a `reference-challenge`
|
|
488
|
+
fixture.
|
|
489
|
+
- Instagram can alternate between login-only and fuller logged-out shells in
|
|
490
|
+
this environment; keep it diagnostic even when a run scores as `usable`.
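
The Clien note above describes stripping icon-font private-use glyphs and leading menu bullets from comparable names. A minimal sketch of such a normalizer follows; the function name and the exact bullet set are assumptions, since the harness's real list is not documented here.

```typescript
// Private Use Area code points (U+E000 to U+F8FF) are where icon fonts
// commonly map their glyphs.
const PRIVATE_USE_GLYPHS = /[\uE000-\uF8FF]/g;

// Assumed set of leading bullet characters: bullet, middle dot, small square,
// black circle, white bullet, hyphen, and asterisk, plus surrounding whitespace.
const LEADING_BULLETS = /^[\s\u2022\u00B7\u25AA\u25CF\u25E6\-*]+/;

// Hypothetical helper: produce the name used for static/reference comparison.
function normalizeComparableName(name: string): string {
  return name
    .replace(PRIVATE_USE_GLYPHS, "")
    .replace(LEADING_BULLETS, "")
    .replace(/\s+/g, " ")
    .trim();
}

console.log(normalizeComparableName("\uF0C9 Menu"));
console.log(normalizeComparableName("\u2022 \uE001 News  today"));
```

Stripping a leading hyphen is a trade-off in this sketch: it would also alter names that legitimately start with a minus sign, so this kind of rule fits comparison normalization better than extractor output.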
|
|
491
|
+
|
|
492
|
+
## China/Japan Static Benchmark
|
|
493
|
+
|
|
494
|
+
Run with `pnpm compare:static:china-japan` and
|
|
495
|
+
`pnpm compare:tokens:china-japan`.
|
|
496
|
+
|
|
497
|
+
This target set covers Chinese and Japanese encyclopedia, news, portal, forum,
|
|
498
|
+
developer, search, and video/social pages. Search, video/social, and pages whose
|
|
499
|
+
reference navigation fails in this environment stay in diagnostics. In this
|
|
500
|
+
run, the gate summary includes 7 targets and excludes 6 diagnostics; the
|
|
501
|
+
average gate agent score is 0.654 and the average static/browser token ratio is
|
|
502
|
+
0.544.
|
|
503
|
+
|
|
504
|
+
| Category | gate | HTML source | class | static nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score | static/browser token ratio |
|
|
505
|
+
| --- | --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
|
|
506
|
+
| China Wikipedia | gate | fetch | usable | 4907 | 8518 | 0.421 | 0.866 | 0.871 | 0.050 | 0.687 | 0.277 |
|
|
507
|
+
| People China portal | diagnostic | fetch | reference-missing | 912 | | 0.000 | 1.000 | 1.000 | 1.000 | 0.850 | 0.482 |
|
|
508
|
+
| Xinhua portal | gate | fetch | usable | 1041 | 1076 | 0.706 | 0.973 | 0.973 | 0.167 | 0.772 | 0.453 |
|
|
509
|
+
| Douban home | gate | fetch | usable | 700 | 952 | 0.757 | 0.930 | 0.893 | 0.095 | 0.738 | 0.307 |
|
|
510
|
+
| Baidu search | diagnostic | fetch | needs-browser | 250 | 7 | 0.000 | 0.000 | 0.000 | 1.000 | 0.200 | 0.677 |
|
|
511
|
+
| Bilibili home | diagnostic | fetch | usable | 231 | 217 | 0.523 | 0.754 | 0.770 | 0.436 | 0.660 | 0.371 |
|
|
512
|
+
| Japan Wikipedia | gate | fetch | usable | 5349 | 11774 | 0.311 | 0.599 | 0.618 | 0.072 | 0.491 | 0.649 |
|
|
513
|
+
| NHK News | diagnostic | fetch | reference-missing | 490 | | 0.000 | 1.000 | 1.000 | 1.000 | 0.850 | n/a |
|
|
514
|
+
| Qiita TypeScript tag | gate | fetch | usable | 674 | 893 | 0.645 | 0.719 | 0.741 | 0.508 | 0.693 | 0.308 |
|
|
515
|
+
| Hatena IT hotentry | gate | fetch | usable | 1675 | 1775 | 0.566 | 0.922 | 0.937 | 0.258 | 0.739 | 0.782 |
|
|
516
|
+
| 5ch board | gate | fetch | usable | 780 | 360 | 0.105 | 0.574 | 0.600 | 0.316 | 0.459 | 1.031 |
|
|
517
|
+
| Yahoo Japan search | diagnostic | fetch | needs-browser | 54 | 158 | 0.187 | 0.327 | 0.293 | 0.000 | 0.279 | 0.177 |
|
|
518
|
+
| Niconico home | diagnostic | fetch | needs-browser | 212 | 373 | 0.123 | 0.186 | 0.196 | 0.018 | 0.159 | 0.228 |
|
|
519
|
+
|
|
520
|
+
Notes:
|
|
521
|
+
|
|
522
|
+
- China Wikipedia became usable after the benchmark stopped treating Wikipedia
|
|
523
|
+
table-of-contents section numbers as part of comparable link names and static
|
|
524
|
+
extraction started auto-detecting wiki-like HTML to preserve more article
|
|
525
|
+
links by default.
|
|
526
|
+
- Xinhua and Douban are the strongest Chinese gate targets in this run.
|
|
527
|
+
- People China fetches usable HTML, but `agent-browser` navigation is blocked in
|
|
528
|
+
this environment, so the target is diagnostic until a stable reference path is
|
|
529
|
+
available.
|
|
530
|
+
- Baidu search is unstable across runs. It can collapse to a tiny feedback shell
|
|
531
|
+
or expose a larger fetched search page; keep it diagnostic.
|
|
532
|
+
- Japan Wikipedia is usable but still has low exact content recall on the large
|
|
533
|
+
article body.
|
|
534
|
+
- NHK fetches static HTML, but Puppeteer and `agent-browser` both hit HTTP/2
|
|
535
|
+
navigation failures in this environment. Token ratio is reported as `n/a`
|
|
536
|
+
when the browser reference is unavailable.
|
|
537
|
+
- Qiita and Hatena are useful Japanese gate targets; Hatena remains a token-cost
|
|
538
|
+
stress case.
|
|
539
|
+
- 5ch became usable after reference comparison hardening, forum thread metadata
|
|
540
|
+
normalization, auto-detected forum link-farm limits, and pruning redundant
|
|
541
|
+
listitem wrappers around links/buttons. It remains a token-cost stress case at
|
|
542
|
+
rough parity with browser injection.
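
The Wikipedia note above mentions dropping table-of-contents section numbers from comparable link names. One way to sketch that step is shown here; the helper name and the regular expression are assumptions, not the benchmark's actual rule.

```typescript
// Hypothetical helper: remove a leading dotted section number such as "2.3.1 "
// from a table-of-contents link name. Intended only for links known to sit
// inside a table of contents, because names like "10 Downing Street" would
// otherwise lose their leading number too.
function stripTocSectionNumber(linkName: string): string {
  return linkName.replace(/^\d+(?:\.\d+)*\s+(?=\S)/, "");
}

console.log(stripTocSectionNumber("1 History"));
console.log(stripTocSectionNumber("2.3.1 Early period"));
console.log(stripTocSectionNumber("2024"));
```

A name that consists only of a number is left unchanged, since the pattern requires further text after the number.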
|
|
543
|
+
|
|
544
|
+
## Observations
|
|
545
|
+
|
|
546
|
+
- Simple static pages line up well. `example.com` matched the important named roles exactly.
|
|
547
|
+
- Wikipedia exposes a large language `<select>`. `ax-grep` can still unroll options for agent operation, but the comparison harness now disables option unrolling to match `agent-browser snapshot` more closely.
|
|
548
|
+
- Wikipedia language links use both visible article-count text and descriptive `title` attributes. `ax-grep` now follows accessible-name priority more closely by using link contents before title fallback.
|
|
549
|
+
- MDN uses many custom elements. `ax-grep` now prunes simple custom-element wrappers, but host elements that expose state, ids, or shadow content still need deeper handling.
|
|
550
|
+
- MDN ad-like placements can be excluded in comparison mode with `excludeLikelyAds`. The general extractor keeps this off by default so callers do not silently lose content.
|
|
551
|
+
- A shared comparison viewport is available through `AX_LITE_COMPARE_VIEWPORT=WIDTHxHEIGHT`, but it is opt-in because responsive pages can change the benchmark shape significantly.
|
|
552
|
+
- Hacker News relies on layout tables. The comparison harness normalizes Chrome's `LayoutTableCell` role to `cell` and removes punctuation-adjacent whitespace, improving overlap from 0.64 to 0.75.
|
|
553
|
+
- The comparison harness normalizes common role vocabulary differences such as `image` vs `img`, `paragraph` vs `p`, and `StaticText` vs `text`.
|
|
554
|
+
- `libraries.io/npm/typescript` is the stable package-registry-like sample.
|
|
555
|
+
- The new agent-facing metrics show a different picture than raw overlap on
|
|
556
|
+
Wikipedia and Libraries.io: static-text recall is low, but actionable and
|
|
557
|
+
navigation targets are mostly preserved. That distinction better matches the
|
|
558
|
+
goal of making pages tractable for agents.
|
|
559
|
+
- Korean samples cover a large encyclopedia article, two news-like pages, and a
|
|
560
|
+
public portal. The Korean Wikipedia page is intentionally heavy and is kept in
|
|
561
|
+
`compare:korea` rather than the default sample script.
|
|
562
|
+
- `hani.co.kr` timed out waiting for Puppeteer network idle during the baseline
|
|
563
|
+
run and used the DOMContentLoaded state. Keep it as a news-site stress case,
|
|
564
|
+
but do not treat it as a tightly stable target yet.
|
|
565
|
+
- Korean live pages can shift by a few nodes or snapshot lines between runs as
|
|
566
|
+
headlines, ads, and embedded widgets update.
|
|
567
|
+
- `yonhapnewstv.co.kr` currently lines up best among the Korean samples across
|
|
568
|
+
exact overlap, content recall, and agent score.
|
|
569
|
+
- Static SSR extraction is viable for simple and server-rendered pages. It works
|
|
570
|
+
especially well on Hacker News and reasonably well on Yonhap News TV without any
|
|
571
|
+
browser runtime.
|
|
572
|
+
- Static SSR extraction can prune some non-exposed menu content from HTML
|
|
573
|
+
alone. The most important signal so far is a collapsed control with
|
|
574
|
+
`aria-expanded="false"` and `aria-controls`; pruning the controlled subtree
|
|
575
|
+
reduced Wikipedia static tokens from 11,183 to 1,292 and improved exact
|
|
576
|
+
overlap from 0.05 to 0.57.
|
|
577
|
+
- Static SSR extraction now skips non-semantic payload tags, summarizes large
|
|
578
|
+
child lists, and collapses repeated template-like subtrees. This keeps raw SSR
|
|
579
|
+
payloads from turning into unbounded prompt input, while preserving an
|
|
580
|
+
explicit `note` that nodes were omitted.
|
|
581
|
+
- Compact static extraction prunes unnamed leaf wrappers such as decorative
|
|
582
|
+
spans, emphasis tags, empty inputs, and line breaks. Ancestor accessible names
|
|
583
|
+
are computed before pruning, so useful link/button names are preserved while
|
|
584
|
+
prompt-only wrapper noise is removed.
|
|
585
|
+
- Static SSR extraction cannot account for computed CSS, responsive layout,
|
|
586
|
+
client-only rendering, open shadow roots, iframe documents, or post-load DOM
|
|
587
|
+
mutation. Treat it as a lightweight agent parsing fallback, not an AXTree
|
|
588
|
+
replacement.
|
|
589
|
+
- Static SSR extraction is not automatically cheaper in prompt tokens, but it
|
|
590
|
+
can be competitive when collapsed controlled regions are pruned. It is now
|
|
591
|
+
slightly cheaper than browser injection on Wikipedia and still cheaper on
|
|
592
|
+
Hacker News and Yonhap News TV.
|
|
593
|
+
- Token cost needs its own benchmark gate. Agent-readiness can be acceptable
|
|
594
|
+
while prompt cost is unacceptable, especially on SSR pages with large hidden
|
|
595
|
+
menus, language selectors, or template payloads.
|
|
596
|
+
- Diverse targets show why benchmark categories matter. Government, ecommerce,
|
|
597
|
+
and article pages preserve useful action/navigation signals; large news
|
|
598
|
+
portals are good stress tests; Reddit/X/Instagram are better treated as
|
|
599
|
+
social/challenge fixtures because public logged-out views often collapse to
|
|
600
|
+
shell, login, or bot-protection states.
|
|
601
|
+
- Diverse token results show static extraction is often cheaper on server
|
|
602
|
+
rendered news, government, and ecommerce pages. Social sites are inconsistent:
|
|
603
|
+
X's fetched shell is tiny compared with the browser view, old Reddit is the
|
|
604
|
+
opposite in this environment, and Instagram exposes enough SSR payload to make
|
|
605
|
+
static more expensive than the rendered shell.
|
|
606
|
+
- Shell/challenge classification is required because exact overlap and agent
|
|
607
|
+
score can look deceptively good when both static and reference snapshots are
|
|
608
|
+
nearly empty.
|
|
609
|
+
- AP News and Ars Technica were tested as additional candidates but omitted from
|
|
610
|
+
`compare:static:diverse` because the reference snapshot timed out in this
|
|
611
|
+
environment. Reuters returned HTTP 401 from plain fetch and is also omitted
|
|
612
|
+
from the automated diverse set.
|
|
613
|
+
- `npmjs.com` currently serves a Cloudflare challenge in the sample environment. The baseline is useful as a challenge-page fixture, not as a package-page content fixture.
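
The collapsed-control observation above (a control with `aria-expanded="false"` and `aria-controls`) can be illustrated on a simplified tree. This is a sketch under stated assumptions: the `StaticNode` shape and both helper functions are hypothetical and are not `ax-grep`'s internal types.

```typescript
// Hypothetical simplified node; the real extractor's node type is richer.
interface StaticNode {
  tag: string;
  attrs: Record<string, string>;
  children: StaticNode[];
}

// Collect ids referenced by aria-controls on controls that are collapsed.
// aria-controls is a space-separated list of element ids.
function collectCollapsedTargets(
  node: StaticNode,
  ids: Set<string> = new Set<string>(),
): Set<string> {
  const controls = node.attrs["aria-controls"];
  if (node.attrs["aria-expanded"] === "false" && controls) {
    for (const id of controls.split(/\s+/).filter(Boolean)) ids.add(id);
  }
  for (const child of node.children) collectCollapsedTargets(child, ids);
  return ids;
}

// Return a copy of the tree without subtrees whose id is controlled by a
// collapsed control. The control itself is kept so an agent can still open it.
function pruneCollapsed(node: StaticNode, collapsed: Set<string>): StaticNode {
  return {
    ...node,
    children: node.children
      .filter((child) => !collapsed.has(child.attrs["id"] ?? ""))
      .map((child) => pruneCollapsed(child, collapsed)),
  };
}

function countNodes(node: StaticNode): number {
  return 1 + node.children.reduce((sum, child) => sum + countNodes(child), 0);
}

const link = (): StaticNode => ({ tag: "a", attrs: { href: "#" }, children: [] });
const pageTree: StaticNode = {
  tag: "body",
  attrs: {},
  children: [
    { tag: "button", attrs: { "aria-expanded": "false", "aria-controls": "langs" }, children: [] },
    { tag: "div", attrs: { id: "langs" }, children: [link(), link(), link()] },
    { tag: "h1", attrs: {}, children: [] },
  ],
};

const prunedTree = pruneCollapsed(pageTree, collectCollapsedTargets(pageTree));
console.log(countNodes(pageTree), countNodes(prunedTree));
```

In this toy tree the controlled `langs` region and its three links are dropped while the button and heading remain. Since `aria-expanded="false"` is only a signal and computed CSS is unavailable to static extraction, a real implementation would likely want an opt-out for pages where the controlled region is actually visible.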
|
|
614
|
+
|
|
615
|
+
## Next Improvements
|
|
616
|
+
|
|
617
|
+
- Improve custom-element/shadow-host pruning without losing useful selector targets.
|
|
618
|
+
- Add explicit benchmark gates for actionable and navigation recall once a
|
|
619
|
+
stable target set is chosen.
|
|
620
|
+
- Compare browser and static extraction side-by-side on the same target set to
|
|
621
|
+
decide when the Worker-compatible path is good enough.
|
|
622
|
+
- Tune static pruning controls for hidden menus, select/options, and repeated
|
|
623
|
+
template regions against the diverse benchmark set.
|
|
624
|
+
- Support authenticated/cached sessions for `npmjs.com` if the real npm package page remains useful as a target.
|
|
625
|
+
- Add more real WebView smoke tests once Android/iOS host projects exist.
|
package/docs/features.md
ADDED
|
@@ -0,0 +1,28 @@
|
|
|
1
|
+
# Feature Overview
|
|
2
|
+
|
|
3
|
+
`ax-grep` keeps the root README short. Use this page for the fuller feature map.
|
|
4
|
+
|
|
5
|
+
## Semantic Tree
|
|
6
|
+
|
|
7
|
+
- Fetch a URL, read an HTML file, read stdin, or inspect browser-captured HTML.
|
|
8
|
+
- Print compact text by default, or return structured JSON.
|
|
9
|
+
- Summarize headings, links, forms, media, metadata, tables, code blocks, and page state.
|
|
10
|
+
|
|
11
|
+
## Agent Mode
|
|
12
|
+
|
|
13
|
+
- `--agent` returns the next useful step for an automation loop.
|
|
14
|
+
- `agent.executor` is the shortest machine-facing step.
|
|
15
|
+
- `agent.handoff` and `agent.next` keep compatibility with older integrations.
|
|
16
|
+
- `pageCheck` and `verification` summarize whether the page answers the request.
|
|
17
|
+
|
|
18
|
+
## Search and Verification
|
|
19
|
+
|
|
20
|
+
- `--search` can collect search results and optionally open a ranked result.
|
|
21
|
+
- `--find` verifies requested text on a page or search result.
|
|
22
|
+
- Locale flags such as `--lang` and `--region` make searches easier to reproduce.
|
|
23
|
+
|
|
24
|
+
## Browser Fallback
|
|
25
|
+
|
|
26
|
+
`ax-grep` does not bypass login, paywalls, bot checks, or JavaScript rendering.
|
|
27
|
+
When plain fetch is not enough, capture rendered HTML in a browser and pass it
|
|
28
|
+
back with `--html-file` or `--stdin`.
|