ax-grep 0.0.0 → 0.1.1

@@ -0,0 +1,625 @@
1
+ # Comparison Baseline
2
+
3
+ Generated on 2026-06-05.
4
+
5
+ The comparison harness runs `ax-grep` through Puppeteer and compares named
6
+ role output against `agent-browser snapshot`.
7
+
8
+ Operational rule: run comparison suites sequentially. `agent-browser` and
9
+ Chromium can exhaust the host when several `compare:static*` runs overlap. Check
10
+ for existing browser processes before a run and confirm cleanup afterward.
11
+
12
+ The historical `named role overlap` score is intentionally strict: it is an
13
+ exact match over normalized `role:name` pairs, with no fuzzy or containment
14
+ matching. It is useful for tracking whether `ax-grep` is reproducing
15
+ `agent-browser snapshot` output, but it can understate agent usefulness when the
16
+ missing items are mostly static text.
17
+
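The strict matching rule above can be sketched as follows. This is a minimal illustration, not the harness code: the normalization shown (lowercasing and collapsing whitespace) is an assumption, and `normalize_pair` and `exact_matches` are hypothetical names.

```python
import re


def normalize_pair(role: str, name: str) -> str:
    """Build a normalized `role:name` key (assumed: lowercase, collapsed whitespace)."""
    def clean(text: str) -> str:
        return re.sub(r"\s+", " ", text).strip().lower()
    return f"{clean(role)}:{clean(name)}"


def exact_matches(candidate, reference):
    """Return keys present in both outputs; no fuzzy or containment matching."""
    cand = {normalize_pair(role, name) for role, name in candidate}
    ref = {normalize_pair(role, name) for role, name in reference}
    return cand & ref


# Illustrative data only, not taken from a real run.
reference = [
    ("link", "Learn more"),
    ("heading", "Example Domain"),
    ("StaticText", "This domain is for use in examples."),
]
candidate = [
    ("link", "learn  more"),                 # matches after normalization
    ("heading", "Example Domain"),           # matches exactly
    ("StaticText", "This domain is for use"),  # contained text does not count
]
matched = exact_matches(candidate, reference)
```

In this example the truncated StaticText entry does not match, which is the case where the strict score can understate agent usefulness.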
18
+ The harness also reports `agentReadiness` scores. These are still based on the
19
+ same exact normalized `role:name` matches, but split by agent-facing use:
20
+
21
+ - `referenceRecall`: how much of the `agent-browser` named output appears in
22
+ `ax-grep`.
23
+ - `candidatePrecision`: how much of the `ax-grep` named output appears in
24
+ `agent-browser`.
25
+ - `actionableRecall`: exact recall for links, buttons, fields, tabs, and other
26
+ operation targets.
27
+ - `navigationRecall`: exact recall for links, headings, landmarks, and search.
28
+ - `contentRecall`: exact recall for headings, images, table/list structure, and
29
+ static text.
30
+ - `structuralContentRecall`: exact recall for headings, images, table/list
31
+ structure, and other non-StaticText content roles.
32
+ - `textRecall`: exact recall for StaticText roles only.
33
+ - `score`: weighted summary for agent parsing: actionable 40%, navigation 25%,
34
+ structural content 20%, precision 15%. Strict `contentRecall` and
35
+ `textRecall` remain visible because text-heavy pages can expose useful gaps,
36
+ but raw StaticText volume should not dominate the agent-usefulness score.
37
+
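The weighted `score` described in the list above can be written out as a small sketch. The weights come from the text; the function name, the rounding to two decimals, and the sample metric values are assumptions for illustration.

```python
# Weights as documented: actionable 40%, navigation 25%,
# structural content 20%, precision 15%.
WEIGHTS = {
    "actionableRecall": 0.40,
    "navigationRecall": 0.25,
    "structuralContentRecall": 0.20,
    "candidatePrecision": 0.15,
}


def agent_readiness_score(metrics: dict) -> float:
    """Weighted summary; contentRecall and textRecall are deliberately not included."""
    total = sum(weight * metrics[key] for key, weight in WEIGHTS.items())
    return round(total, 2)  # two-decimal rounding is an assumption


# Hypothetical metric values, not taken from a real run.
sample = {
    "actionableRecall": 0.97,
    "navigationRecall": 1.00,
    "structuralContentRecall": 0.50,
    "candidatePrecision": 0.60,
    "textRecall": 0.04,  # reported separately, ignored by the summary
}
sample_score = agent_readiness_score(sample)
```

With these values a very low `textRecall` leaves the summary at 0.83, which matches the stated intent that raw StaticText volume should not dominate the score.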
38
+ The static harness also emits `cliAgentSummary`, which scores the actual
39
+ agent-facing `--agent` compact JSON envelope rather than raw tree overlap. It uses
40
+ top-level `agent` routing status, `pageCheck`, structured content evidence, source links and source quality,
41
+ readability, requested verification status, follow-up `nextSteps`, `searchResults`, and `suggestedActions` to estimate how directly an agent can
42
+ decide whether to read, open, or retry a page. `averageCliAgentScore`
43
+ in `gateSummary` tracks that higher-level usefulness separately from
44
+ `agentReadiness`, which remains an exact `agent-browser snapshot` overlap
45
+ metric.
46
+ `minCliAgentScore` enforces the same readiness floor per included target, so a
47
+ weak search, page-check, or browser-retry case cannot be hidden by strong
48
+ average results.
49
+ When a per-target floor fails, `weakAgentTargets` lists the affected category,
50
+ URL, scores, status, and primary action for quick follow-up.
51
+ `averageAgentExecutorScore` is the executor-focused aggregate. It combines the
52
+ schema, routing, `next`, expected-outcome, signal, read-target, command,
53
+ browser-retry, continuation, response, diagnostic, and verification fields that
54
+ subagents need to run a search/page-check loop without reconstructing intent
55
+ from the raw tree.
56
+ `minAgentExecutorScore` applies the executor floor per included target.
57
+ The score includes action-schema completeness, so `run-command` actions need
58
+ both a human-readable command and raw `commandArgs`, `read-current` actions need
59
+ `readFrom`, and `interact-browser` actions need an explicit browser-interaction
60
+ signal.
61
+ `averageActionSchemaScore` tracks that schema completeness directly across the
62
+ gate-included targets.
63
+ `averageSearchResultActionScore` tracks whether compact search results include
64
+ rank-specific `openResult`, `command`, and raw `commandArgs`, so search agents
65
+ can open alternate results without reconstructing commands.
66
+ `averageAgentRoutingIntentScore` tracks whether `agent.routingIntent` correctly
67
+ summarizes the primary action as reading current payload, opening a URL,
68
+ searching, retrying with browser HTML, requiring browser interaction, or
69
+ stopping.
70
+ `averageAgentContinuationModeScore` tracks whether `agent.continuationMode`
71
+ maps that intent to the executor-facing mode: `read`, `command`, `browser`,
72
+ `capture-html`, `inspect`, or `stop`.
73
+ `averageAgentNextScore` tracks whether `agent.next` is a canonical executor
74
+ payload that agrees with `continuationMode` and mirrors the primary action's
75
+ command, read pointer, URL, browser interaction, and terminal fields.
76
+ `averageAgentNextShortcutScore` tracks whether top-level `agent.next*`
77
+ shortcuts mirror that canonical `agent.next` payload.
78
+ `averageAgentRunbookShortcutScore` tracks whether top-level `agent.runbook*`
79
+ shortcuts mirror the nested loop runbook contract.
80
+ `averageAgentExpectedOutcomeScore` tracks whether `agent.expectedOutcome`
81
+ describes the success condition for the next step, including read pointers when
82
+ the next step is evidence reading.
83
+ `averageAgentPlanShortcutScore` tracks whether top-level
84
+ `agent.expectedOutcome*` and `agent.executionPlan*` shortcuts mirror the nested
85
+ next-step contract.
86
+ `averageAgentSignalScore` tracks whether `agent.signals` exposes structured
87
+ content, verification, search result, source link, browser, response, and
88
+ diagnostic signals needed for fast agent routing.
89
+ `averageContentEvidenceMetadataScore` tracks whether `pageCheck.contentEvidence`
90
+ items include `source` and bounded `score` metadata, so agents can prioritize
91
+ semantic evidence over fallback excerpts.
92
+ `averageReadabilityReasonScore` tracks whether compact page checks preserve
93
+ concise readability reasons, so agents can understand why a page is readable,
94
+ thin, blocked, or worth retrying.
95
+ `averageAgentReadabilityReasonScore` tracks whether the compact top-level
96
+ `agent` summary repeats concise readability reasons, so agents can route from
97
+ the first object before drilling into `pageCheck`.
98
+ `averageAgentPageMetadataShortcutScore` tracks whether `agent.page*` mirrors
99
+ root page metadata such as canonical URL, language, author, dates, and
100
+ structured-data types.
101
+ `averageAgentSemanticSummaryScore` tracks whether `agent.semanticSummary` and
102
+ top-level `agent.semantic*` shortcuts preserve semantic tree counts, role-group
103
+ counts, heading/landmark outline flow with parent context, keyboard shortcut hints, in-page links, top role/heading/landmark/named role fields with direct paths,
104
+ interactive and focusable state, link URL, button description, image,
105
+ table/list structure, form-field input hints, description, value, resolved relation targets, choice, state, and unavailable-subtree shortcuts for quick page-shape routing.
106
+ State scoring includes parsed top-state fields so agents do not need to parse
107
+ ARIA state strings.
108
+ `averageAgentBarrierShortcutScore` tracks whether top-level `agent.topBarrier*`
109
+ shortcuts mirror the highest-priority page barrier.
110
+ `averageAgentStructuredShortcutScore` tracks whether top-level structured
111
+ content counts and `top*` shortcuts mirror the first table, FAQ, code block,
112
+ resource, media item, section paths/selectors, navigation/media structure,
113
+ provenance, offer, dataset, identity, timeline, contact point, and best structured
114
+ read-target shortcut.
115
+ `averageAgentReadTargetScore` tracks whether `agent.readTargets` points to
116
+ payload fields that actually exist and are worth reading, and whether
117
+ `read-current` actions mark the matching target as primary.
118
+ `averageAgentTopReadTargetShortcutScore` tracks whether `agent.topReadTarget*`
119
+ mirrors the first read-target entry for fast routing without scanning
120
+ `agent.readTargets`.
121
+ `averageAgentAlternativeActionShortcutScore` tracks whether top-level
122
+ `agent.alternativeAction*` shortcuts mirror the first non-primary action.
123
+ `averageAgentHandoffScore` and `averageAgentBriefExecutorStepScore` also cover
124
+ handoff detail preservation. Search handoffs must keep executable result/source
125
+ choices with snippets and command args; answer handoffs must keep selected
126
+ evidence text/reasons; read handoffs for forms and action targets must keep URL
127
+ templates, fields, selectors, methods, and encoding; diagnostic handoffs must
128
+ keep selected signals and quality gates. This prevents the compact handoff from
129
+ turning into an opaque "retry/open this" instruction.
130
+ `averageAgentResultChoiceScore`, `averageAgentSourceChoiceScore`, and
131
+ `averageAgentActionListScore` cover the same problem outside the handoff.
132
+ Search choices keep snippets, freshness dates, and sitelinks; source choices
133
+ keep source-link text, snippets, selectors, and executable commands; source-link
134
+ actions keep `sourceLinkRef` so agents can jump back to the exact
135
+ `pageCheck.sourceLinks[n]` item.
136
+ `averageAgentTopActionShortcutScore` tracks whether `agent.topAction*` mirrors
137
+ the first action candidate, including execution, priority, command/read target,
138
+ URL, and source-link reference.
139
+ `averageAgentResultCountScore` tracks whether `agent.resultCount` is zero for
140
+ non-search pages and at least the compact result count for search pages.
141
+ `averageAgentChoiceCountScore` tracks whether executable choice-count shortcuts
142
+ match their result, form, action-target, and source-link source counts.
143
+ `averageAgentTopChoiceShortcutScore` tracks whether `agent.topChoiceKind`,
144
+ path, label, URL, and command arguments mirror the first executable result,
145
+ source, form, or action-target choice for fast subagent routing.
146
+ `averageAgentTopResultChoiceShortcutScore` tracks whether `agent.topResultChoice*`
147
+ mirrors the first search result choice, including URL, rank, open-result value,
148
+ command arguments, source quality, freshness/relevance, matches, sitelinks, and
149
+ selection reason.
150
+ `averageAgentTopSourceChoiceShortcutScore` tracks whether source-link-specific
151
+ top-level shortcuts mirror the first executable source choice, including source
152
+ type, hints, score, and selection reason.
153
+ `averageAgentEvidenceCountShortcutScore` tracks citation, answer-evidence,
154
+ read-target, and action count shortcuts against their agent arrays.
155
+ `averageAgentTopCitationShortcutScore` tracks whether `agent.topCitation*`
156
+ mirrors the first citation item, including path, kind, confidence, reason, URL,
157
+ and score.
158
+ `averageAgentSignalCountShortcutScore` tracks signal severity and failing
159
+ quality-gate count shortcuts against `agent.signals` and `agent.qualityGates`.
160
+ `averageAgentTopQualityShortcutScore` tracks whether `agent.topSignal*` and
161
+ `agent.topQualityGate*` mirror the first signal and quality gate for fast
162
+ accept/block routing without scanning diagnostic arrays.
163
+ `averageAgentProblemShortcutScore` tracks whether `agent.problemSignalKind`,
164
+ severity, message, and `agent.failingQualityGate*` mirror the first
165
+ warning/error signal and first failing quality gate, including gate severity and
166
+ score, so agents can explain blocked pages without scanning diagnostic arrays.
167
+ `averageAgentSourceLinkCountScore` tracks whether `agent.sourceLinkCount` is
168
+ zero for search pages and matches compact `pageCheck.sourceLinks` for ordinary
169
+ content pages.
170
+ `averageAgentFormActionCountScore` tracks whether top-level `agent.formCount`
171
+ and `agent.actionTargetCount` match compact `pageCheck.forms` and
172
+ `pageCheck.actionTargets`, so agents can detect hidden forms and JSON-LD/OpenSearch
173
+ actions before scanning nested page-check arrays.
174
+ `averageAgentFormActionChoiceScore` tracks whether `agent.formChoices` and
175
+ `agent.actionTargetChoices` preserve the compact form/action target IDs, paths,
176
+ selectors, URL templates, query inputs, submit text, first-field hints, and
177
+ methods needed for subagent form execution and selection loops.
178
+ `averageAgentTopFormActionChoiceShortcutScore` tracks whether top-level
179
+ form/action-target shortcuts mirror the first executable form and action target.
180
+ `averagePageLinkCommandScore` tracks whether compact `pageCheck.primaryLinks`
181
+ and `pageCheck.sourceLinks` include direct `command` and `commandArgs`, so
182
+ agents can open page links without reconstructing fetch flags.
183
+ `averageAgentBrowserNeedScore` tracks whether `agent.needsBrowserHtml` agrees
184
+ with the primary action: browser HTML retry actions should require browser
185
+ HTML, while URL search recovery, alternate-result recovery, read-current, and
186
+ retry-later actions should not.
187
+ `averageAgentBrowserHtmlScore` tracks whether browser HTML fallback payloads
188
+ preserve capture file/script data plus nested reason, command, and command-args
189
+ fields across `agent.next`, execution plans, executor steps, and handoff steps.
190
+ `averageAgentPageKindScore` tracks whether `agent.pageKind` mirrors the root
191
+ payload `kind`, so agents can route from the top-level `agent` object without
192
+ re-reading `analysis.kind` or the envelope root.
193
+ `averageAgentAlternativeActionCountScore` tracks whether
194
+ `agent.alternativeActionCount` matches the deduplicated compact follow-up
195
+ actions left outside `agent.primaryAction`, so agents can know whether a page
196
+ has useful alternatives before scanning nested action arrays.
197
+ `averageAgentUsabilityScoreConsistency` tracks whether `agent.usabilityScore`
198
+ matches the documented compact quality heuristic derived from status,
199
+ readability, confidence, evidence, search results, source links, and
200
+ verification status.
201
+ `averageAgentEvidenceQualityScoreConsistency` and
202
+ `averageAgentSourceQualityScoreConsistency` track whether top-level evidence
203
+ and source quality scores match the compact evidence/source arrays, so agents
204
+ can compare payload quality before reading every candidate item.
205
+ `averageAgentBestReadTargetScore` tracks whether `agent.bestReadTarget` and its
206
+ count, score, primary flag, and reason match the primary or highest-scored
207
+ `agent.readTargets` entry, so agents can start reading the best compact field
208
+ without sorting candidates.
209
+ `averageAgentDiagnosticCountScore` tracks whether top-level diagnostic severity
210
+ counts and `agent.topDiagnostic*` match the compact diagnostics array, so agents
211
+ can distinguish warnings from hard errors before drilling into diagnostic
212
+ messages.
213
+ `averageAgentVerificationCountScore` tracks whether top-level verification
214
+ requested/found/missing counts match the compact verification object, so agents
215
+ can decide whether requested evidence is complete before reading details.
216
+ `averageAgentVerificationQueryScore` tracks whether
217
+ `agent.verificationFoundQueries` and `agent.verificationMissingQueries` preserve
218
+ the exact matched and missing `--find` query lists and whether the top matched
219
+ or missing query shortcuts mirror the first items; `agent.handoff` and
220
+ `agent.executor` carry the same lists for brief subagent loops.
221
+ `averageAgentResponseMetadataScore` tracks whether `agent.responseStatus`,
222
+ `agent.responseOk`, `agent.responseContentType`, and `agent.finalUrlChanged`
223
+ mirror the compact envelope response fields, so agents can judge fetch health
224
+ from the top-level `agent` object.
225
+ `averageAgentHiddenSignalScore` tracks whether hidden `pageCheck` groups such
226
+ as hydration, API endpoints, app config, app/mobile hints, provenance,
227
+ policies, JSON-LD facts, and resource metadata are present at valid payload
228
+ paths and discoverable through at least one read target when available. This is
229
+ the executor-focused counterweight to raw accessibility-tree overlap: these
230
+ signals are often useful to subagents but absent from browser accessibility
231
+ snapshots.
232
+ `averageAgentHiddenSignalCountScore` tracks whether top-level
233
+ `agent.hiddenSignalCount`, `agent.hiddenReadTargetCount`, and
234
+ `agent.bestHiddenReadTarget*` match those hidden groups and read-target
235
+ shortcuts.
236
+ `averageAgentTopHiddenSignalShortcutScore` tracks whether
237
+ `agent.topHiddenSignal*` mirrors the first hidden metadata, API, config, or
238
+ provenance signal.
239
+ `averageAgentHiddenCommandShortcutScore` tracks whether hidden hydration, safe
240
+ GET-like API, and app-hint URLs keep executable top-level follow-up commands.
241
+ `averageAgentBrowserAdvantageScore` tracks whether those hidden `pageCheck`
242
+ signals create a concrete agent-browser advantage when they exist, rather than
243
+ only matching visible accessibility-tree roles.
244
+ The higher-level CLI agent score also credits hidden `pageCheck` signal groups
245
+ and recoverable browser-HTML retry actions. A page with little visible text can
246
+ still be useful to a subagent when it exposes metadata read targets or a
247
+ runnable browser-capture handoff.
248
+ When `ax-grep` produces a ready, high-scoring agent payload with content
249
+ evidence, the static comparison treats the page as usable before applying raw
250
+ tree-size failure classes such as thin-reference challenge or over-collection.
251
+ This keeps agent-browser advantage cases visible in the gate even when the raw
252
+ static tree is larger than the browser snapshot.
253
+ `averageAgentCanContinueScore` tracks whether `agent.canContinue` agrees with
254
+ the primary action execution class, so recoverable errors with runnable actions
255
+ do not look terminal and usage/input errors without actions do not look
256
+ actionable.
257
+ `averageAgentPrimaryExecutionScore` tracks whether `agent.primaryExecution`
258
+ matches `agent.primaryAction.execution`, so agents can route from the shortcut
259
+ field without rereading the full action object.
260
+ `averageAgentPrimaryShortcutScore` tracks whether `agent.primaryActionName`,
261
+ reason, priority, command, URL, rank, read-from, source-link reference, and
262
+ browser shortcuts mirror `agent.primaryAction`, so agents can continue from
263
+ top-level routing fields.
264
+ `averageAgentExecutorShortcutScore` tracks whether `agent.executorActionName`,
265
+ decision, mode, operation, confidence, terminal/continue flags, command
266
+ arguments, read-from, URL, target, and expected-outcome shortcuts mirror
267
+ `agent.executor`, so subagents can route the next step without parsing the full
268
+ executor object.
269
+ `averageAgentHandoffShortcutScore` tracks whether `agent.handoffActionName`,
270
+ decision, mode, operation, answer status, confidence, terminal/continue flags,
271
+ priority, command arguments, read-from, URL, target, and expected-outcome shortcuts
272
+ mirror `agent.handoff`, so brief loops can run from top-level fields when they
273
+ do not need the full handoff object.
274
+ `averageAgentAnswerShortcutScore` tracks whether `agent.answerPlanStatus`,
275
+ confidence, reason, next action, gap count, citation IDs, first answer-evidence
276
+ metadata, command arguments, after-interaction command, read-from, and URL
277
+ shortcuts mirror `agent.answerPlan` and
278
+ `agent.answerEvidence`, so agents can decide whether to answer or continue
279
+ without parsing the full plan object.
280
+ `averageAgentSourceSearchProvenanceScore` tracks whether opened-result payloads
281
+ with `sourceSearch.selectedResult` or `sourceSearch.alternateResults` expose
282
+ matching `agent.readTargets`, so agents can inspect original SERP provenance
283
+ before trusting or recovering from an opened page.
284
+ `averageAgentSourceSearchShortcutScore` tracks whether top-level
285
+ `agent.sourceSearchQuery`, locale, verification-query count/top query,
286
+ engine/search URL, selected rank/title/URL, selected command, and first
287
+ alternate command mirror the source-search payload for quick SERP recovery
288
+ decisions.
289
+ The command shortcuts are exposed as `sourceSearchSelectedCommandArgs` and
290
+ `sourceSearchAlternateCommandArgs`.
291
+ `averageAgentRecommendedMetadataScore` tracks whether search pages with a
292
+ `recommendedResult` repeat its URL, title, rank, source, relevance,
293
+ official-source hint, selection reason, and command args on the top-level
294
+ `agent` object for quick routing.
295
+ `averageAgentSearchDecisionScore` and `averageAgentPageDecisionScore` also check
296
+ top-level `agent.searchDecision*` and `agent.pageDecision*` shortcuts, so agents
297
+ can route without reopening the nested decision objects.
298
+ Terminal actions such as `read-content` and `use-evidence` are treated as
299
+ usable without executable commands when `execution` is `read-current` and a
300
+ `readFrom` pointer is present, because the compact payload already contains the
301
+ evidence an agent should read. Browser-interaction actions are also valid
302
+ without commands when `execution` is `interact-browser`; in those cases another
303
+ static fetch would not advance the page state.
304
+
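The action-schema completeness rule described above might look like the sketch below. It assumes `run-command`, `read-current`, and `interact-browser` are values of an action's `execution` field; `browserInteraction` is a hypothetical name for the explicit browser-interaction signal, and the sample command string is illustrative only.

```python
def action_schema_complete(action: dict) -> bool:
    """Check the per-execution-class fields named in the text above."""
    execution = action.get("execution")
    if execution == "run-command":
        # Needs both the human-readable command and the raw argument list.
        return bool(action.get("command")) and bool(action.get("commandArgs"))
    if execution == "read-current":
        # Needs a pointer to the payload field the agent should read.
        return bool(action.get("readFrom"))
    if execution == "interact-browser":
        # Hypothetical field name for the explicit browser-interaction signal.
        return bool(action.get("browserInteraction"))
    # Other execution classes are outside this sketch (assumption: incomplete).
    return False


read_action = {
    "name": "read-content",
    "execution": "read-current",
    "readFrom": "pageCheck.contentEvidence",
}
command_without_args = {
    "name": "open-result",
    "execution": "run-command",
    "command": "ax-grep https://example.com --agent",  # illustrative string
}
```

Here `read_action` is usable without a command because it carries `readFrom`, while `command_without_args` would lose credit for the missing `commandArgs`.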
305
+ The `--agent` payload intentionally removes repeated routing data: top-level
306
+ diagnostics are represented as `agent.diagnosticCodes`, repeated primary actions
307
+ are omitted from `suggestedActions` and verification/page-check action slots,
308
+ page-level alternatives are suppressed when verification has selected
309
+ `use-evidence`,
310
+ search-page link/action follow-ups are omitted when `searchResults` already
311
+ carry the decision surface, and search output is capped to the first five
312
+ results plus any out-of-window recommended result. Opened-result payloads omit
313
+ engine attempts once `sourceSearch` records the selected result, but keep compact
314
+ selected/alternate SERP candidates with executable open commands for failure
315
+ recovery, preserving custom fetch options such as `--timeout` and
316
+ `--user-agent`. Page checks also
317
+ skip common global-navigation headings, links, and buttons, and omit extra
318
+ external primary links when source links are already present, so repository and
319
+ documentation pages route agents toward page content instead of site chrome.
320
+ Fetch failures that still have a target URL emit an executable browser-HTML
321
+ retry command, preserving `--find` checks for the next run. When browser HTML is
322
+ already supplied through `--html-file` or `--stdin`, the compact agent payload
323
+ does not ask for another browser retry. Parsed search result pages keep
324
+ `agent.canUseFetchedHtml` true even when their page-readability score is low,
325
+ because the result cards remain usable for open/refine routing.
326
+ Captured blocker pages still keep challenge/login/paywall diagnostics; the
327
+ follow-up action changes to browser-state inspection instead of another capture
328
+ loop. HTTP error actions are status-aware, so missing URLs and transient server
329
+ errors no longer all collapse into a browser-HTML retry; missing opened search
330
+ results and opened-result verification failures can route directly to an
331
+ alternate original SERP candidate when one matches the missing `--find` text.
332
+
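The search-output cap mentioned above (first five results plus any out-of-window recommended result) can be sketched like this. The function name, the `rank` key, and the way the recommended result is identified are assumptions, not the CLI's actual data model.

```python
def compact_results(results, recommended_rank=None, window=5):
    """Keep the first `window` results, then append the recommended one if it was cut."""
    compact = list(results[:window])
    if recommended_rank is not None:
        already_kept = any(item["rank"] == recommended_rank for item in compact)
        if not already_kept:
            compact.extend(item for item in results if item["rank"] == recommended_rank)
    return compact


# Eight hypothetical results ranked 1 through 8.
results = [{"rank": i, "url": f"https://example.com/{i}"} for i in range(1, 9)]

kept_ranks = [item["rank"] for item in compact_results(results, recommended_rank=7)]
in_window_ranks = [item["rank"] for item in compact_results(results, recommended_rank=3)]
```

A recommended result at rank 7 is appended after the first five, while a recommendation already inside the window adds nothing.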
333
+ The default baseline does not force a viewport. A shared viewport can be tested
334
+ with `AX_LITE_COMPARE_VIEWPORT=WIDTHxHEIGHT`, but the default run is kept stable
335
+ to avoid changing the benchmark shape unexpectedly.
336
+
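As an illustration of the `WIDTHxHEIGHT` format, a value such as `1280x720` could be parsed as shown below. This is a hedged sketch: the harness's real parsing is not shown in this document, and `parse_viewport` and `viewport_from_env` are hypothetical helpers.

```python
import os
import re


def parse_viewport(value):
    """Parse 'WIDTHxHEIGHT' into (width, height); return None when unset."""
    if not value:
        return None  # default baseline: no forced viewport
    match = re.fullmatch(r"(\d+)x(\d+)", value.strip())
    if match is None:
        raise ValueError(f"expected WIDTHxHEIGHT, got {value!r}")
    return int(match.group(1)), int(match.group(2))


def viewport_from_env(env=os.environ):
    """Read the optional shared viewport from AX_LITE_COMPARE_VIEWPORT."""
    return parse_viewport(env.get("AX_LITE_COMPARE_VIEWPORT"))
```

Returning `None` when the variable is unset mirrors the documented default of not forcing a viewport.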
337
+ For state-sensitive pages, `AX_LITE_COMPARE_SETUP=path/to/setup.js` evaluates a
338
+ setup script in both Puppeteer and `agent-browser` before extraction. This keeps
339
+ exact-match scoring intact while making page state explicit.
340
+
341
+ ## Sample Results
342
+
343
+ | URL | ax-grep nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score |
344
+ | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
345
+ | `https://example.com` | 4 | 3 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
346
+ | `https://www.wikipedia.org` | 140 | 105 | 0.57 | 0.97 | 1.00 | 0.04 | 0.80 |
347
+ | `https://developer.mozilla.org/en-US/docs/Web/Accessibility` | 315 | 286 | 0.56 | 0.74 | 0.89 | 0.15 | 0.68 |
348
+ | `https://news.ycombinator.com` | 710 | 501 | 0.75 | 0.82 | 0.82 | 0.63 | 0.78 |
349
+ | `https://github.com/features` | 764 | 538 | 0.90 | 0.88 | 0.95 | 0.93 | 0.92 |
350
+ | `https://libraries.io/npm/typescript` | 382 | 609 | 0.49 | 0.95 | 0.95 | 0.17 | 0.80 |
351
+ | `https://www.npmjs.com/package/typescript` | 16 | 15 | 0.50 | 0.67 | 0.80 | 0.50 | 0.72 |
352
+
353
+ ## Korean Sample Results
354
+
355
+ Run with `pnpm compare:korea`.
356
+
357
+ | URL | ax-grep nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score |
358
+ | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
359
+ | `https://ko.wikipedia.org/wiki/%EB%8C%80%ED%95%9C%EB%AF%BC%EA%B5%AD` | 5713 | 7088 | 0.44 | 0.82 | 0.85 | 0.14 | 0.69 |
360
+ | `https://www.hani.co.kr/` | 998 | 992 | 0.42 | 0.50 | 0.48 | 0.23 | 0.48 |
361
+ | `https://www.korea.kr/` | 569 | 494 | 0.47 | 0.66 | 0.69 | 0.24 | 0.59 |
362
+ | `https://www.yonhapnewstv.co.kr/` | 566 | 448 | 0.79 | 0.79 | 0.83 | 0.79 | 0.81 |
363
+
364
+ ## Static SSR HTML Results
365
+
366
+ Run with `pnpm compare:static URL...`.
367
+ Use `pnpm compare:static:agent` for the smaller executor-focused regression
368
+ set that exercises readable pages, listings, forum-style links, and a search
369
+ diagnostic while tracking `averageAgentExecutorScore`.
370
+ In the current run, the gate summary includes 8 targets and excludes 4
371
+ diagnostics; `averageAgentExecutorScore` is 1.00 and
372
+ `averageAgentHiddenSignalScore` is 1.00 for the included executor targets.
373
+ `averageAgentBrowserAdvantageScore` is also tracked so the hidden-metadata
374
+ fixture proves more than raw accessibility-tree overlap.
375
+ The set includes a synthetic hidden-metadata gate whose browser snapshot only
376
+ contains a visible heading while `pageCheck` exposes 13 hidden head, script,
377
+ policy, app-link, provenance, and JSON-LD signals.
378
+
379
+ This path fetches HTML and runs `extract(html)` from the static entry without
380
+ Chrome, jsdom, WebView, layout, or script execution. `agent-browser` is used only
381
+ as the reference snapshot for comparison.
382
+
383
+ | URL | fetched bytes | static nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score |
384
+ | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
385
+ | `https://example.com` | 528 | 5 | 3 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
386
+ | `https://www.wikipedia.org` | 120361 | 193 | 105 | 0.57 | 0.97 | 1.00 | 0.04 | 0.77 |
387
+ | `https://news.ycombinator.com` | 34665 | 700 | 498 | 0.74 | 0.81 | 0.81 | 0.64 | 0.77 |
388
+ | `https://www.yonhapnewstv.co.kr/` | 47910 | 630 | 440 | 0.51 | 0.75 | 0.78 | 0.75 | 0.72 |
389
+
390
+ ## Diverse Static Results
391
+
392
+ Run with `pnpm compare:static:diverse`.
393
+
394
+ | Category | URL | class | fetched bytes | static nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score |
395
+ | --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
396
+ | News index | `https://www.bbc.com/news` | usable | 317290 | 672 | 427 | 0.48 | 0.52 | 0.65 | 0.55 | 0.55 |
397
+ | News article | `https://www.npr.org/2025/03/11/nx-s1-5324543/ntsb-dca-mid-air-collision-american-black-hawk` | usable | 106490 | 481 | 201 | 0.33 | 0.83 | 0.89 | 0.48 | 0.70 |
398
+ | News portal stress | `https://www.theguardian.com/international` | over-collected | 1429586 | 3829 | 1225 | 0.30 | 0.90 | 0.64 | 0.27 | 0.68 |
399
+ | Government service | `https://www.gov.uk/foreign-travel-advice` | usable | 111369 | 714 | 698 | 0.53 | 0.97 | 0.99 | 0.49 | 0.81 |
400
+ | Accessibility guide | `https://www.nottinghamshire.gov.uk/global-content/how-to-create-accessible-content/how-to-make-web-pages-accessible/checklist-web-page` | usable | 31747 | 239 | 250 | 0.49 | 0.70 | 0.76 | 0.33 | 0.61 |
401
+ | Ecommerce fixture | `https://books.toscrape.com/` | usable | 51294 | 482 | 528 | 0.61 | 0.88 | 0.91 | 0.77 | 0.82 |
402
+ | Reddit legacy | `https://old.reddit.com/r/programming/` | challenge | 136514 | 1255 | 1 | 0.00 | 0.00 | 0.00 | 1.00 | 0.20 |
403
+ | Reddit modern | `https://www.reddit.com/r/programming/` | challenge | 8438 | 53 | 1 | 0.00 | 0.00 | 0.00 | 1.00 | 0.35 |
404
+ | X social challenge | `https://x.com/NASA` | needs-browser | 277862 | 38 | 35 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
405
+ | Instagram social challenge | `https://www.instagram.com/nasa/` | shell | 882680 | 3 | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
406
+
407
+ ## Token Cost Results
408
+
409
+ Run with `pnpm compare:tokens URL...`.
410
+
411
+ This serializes both browser-injected and static SSR extraction into compact
412
+ agent prompt text and estimates token cost with `cl100k_base`. It also measures
413
+ the recommended `--agent` compact JSON payload, so the benchmark can compare raw
414
+ tree prompts with the actual CLI payload agents should use. The prompt text
415
+ includes role, name, state/value, and selectors for interactive nodes.
416
+ Token gate averages skip browser references that are only a tiny shell while
417
+ static or agent output contains substantially more inspectable payload. Those
418
+ thin browser snapshots are counted separately as `excludedThinBrowserReference`
419
+ instead of distorting static/browser and agent/browser ratios.
420
+
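The `static delta` and `static/browser ratio` columns in the token tables are consistent with the arithmetic sketched here. The thin-reference check is only an illustration: the thresholds are hypothetical, since the actual cutoffs used for `excludedThinBrowserReference` are not stated in this document.

```python
def token_row(browser_tokens: int, static_tokens: int) -> dict:
    """Delta and ratio as they appear in the token cost tables."""
    return {
        "delta": static_tokens - browser_tokens,
        "ratio": round(static_tokens / browser_tokens, 2),
    }


def is_thin_browser_reference(browser_tokens, static_tokens,
                              max_browser_tokens=100, min_ratio=10.0):
    """Hypothetical thresholds: tiny browser prompt, far larger static prompt."""
    return (browser_tokens <= max_browser_tokens
            and static_tokens / browser_tokens >= min_ratio)


# Token counts copied from the tables in this document.
hacker_news = token_row(browser_tokens=14503, static_tokens=6356)
example_com = token_row(browser_tokens=37, static_tokens=29)
reddit_legacy_is_thin = is_thin_browser_reference(58, 9343)
```

Under these assumed thresholds the Reddit legacy row (58 browser tokens against 9343 static tokens) would be set aside as a thin reference rather than pulling the average ratio toward 161.09.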
421
+ | URL | browser nodes | browser tokens | static nodes | static tokens | static delta | static/browser ratio |
422
+ | --- | ---: | ---: | ---: | ---: | ---: | ---: |
423
+ | `https://example.com` | 4 | 37 | 5 | 29 | -8 | 0.78 |
424
+ | `https://www.wikipedia.org` | 140 | 1339 | 193 | 1292 | -47 | 0.97 |
425
+ | `https://news.ycombinator.com` | 704 | 14503 | 700 | 6356 | -8147 | 0.44 |
426
+ | `https://www.yonhapnewstv.co.kr/` | 568 | 14397 | 630 | 10877 | -3520 | 0.76 |
427
+
428
+ ## Diverse Token Cost Results
429
+
430
+ Run with `pnpm compare:tokens:diverse`.
431
+
432
+ | Category | URL | browser nodes | browser tokens | static nodes | static tokens | static/browser ratio |
433
+ | --- | --- | ---: | ---: | ---: | ---: | ---: |
434
+ | News index | `https://www.bbc.com/news` | 554 | 9617 | 672 | 6606 | 0.69 |
435
+ | News article | `https://www.npr.org/2025/03/11/nx-s1-5324543/ntsb-dca-mid-air-collision-american-black-hawk` | 504 | 9122 | 481 | 4152 | 0.46 |
436
+ | Government service | `https://www.gov.uk/foreign-travel-advice` | 722 | 19115 | 714 | 6477 | 0.34 |
437
+ | Ecommerce fixture | `https://books.toscrape.com/` | 455 | 7014 | 482 | 3599 | 0.51 |
438
+ | Reddit legacy challenge | `https://old.reddit.com/r/programming/` | 6 | 58 | 1264 | 9343 | 161.09 |
439
+ | X social challenge | `https://x.com/NASA` | 314 | 8041 | 38 | 237 | 0.03 |
440
+ | Instagram social challenge | `https://www.instagram.com/nasa/` | 35 | 640 | 351 | 1883 | 2.94 |
441
+
442
+ ## Korean/Social Static Benchmark
443
+
444
+ Run with `pnpm compare:static:korea-social` and
445
+ `pnpm compare:tokens:korea-social`.
446
+
447
+ This target set covers Clien, Ruliweb, DCInside, Google/Bing/Startpage search,
448
+ X/Twitter, and Instagram. The static comparison benchmark first tries plain HTML
449
+ fetch. If the response looks like a bot challenge, login shell, or empty
450
+ client-rendered shell, it falls back to `agent-browser` rendered HTML through
451
+ `document.documentElement.outerHTML` before running the static extractor.
452
+
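The fetch-first, render-on-shell fallback described above might look roughly like this. Everything here is a sketch under stated assumptions: `looksLikeShell`, `loadHtml`, the regular expressions, and the length cutoffs are hypothetical, and `renderHtml` stands in for whatever returns `document.documentElement.outerHTML` from `agent-browser`.

```typescript
// Hypothetical helper names and heuristics; the real benchmark may differ.
type HtmlSource = "fetch" | "rendered";

interface HtmlResult {
  source: HtmlSource;
  html: string;
}

// Strip scripts, styles, and tags to estimate how much visible text exists.
function visibleText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Assumed signals for a bot challenge, a login shell, or an empty
// client-rendered shell. The thresholds are illustrative guesses.
function looksLikeShell(html: string): boolean {
  const text = visibleText(html);
  const challengeHint =
    /captcha|verify you are human|unusual traffic|enable javascript/i.test(text);
  const loginShell = /\b(log in|sign in)\b/i.test(text) && text.length < 400;
  const emptyShell = text.length < 80;
  return challengeHint || loginShell || emptyShell;
}

// Try plain fetch first; fall back to rendered HTML only when needed.
// Both loaders are injected so the sketch has no network dependency.
async function loadHtml(
  url: string,
  fetchHtml: (url: string) => Promise<string>,
  renderHtml: (url: string) => Promise<string>,
): Promise<HtmlResult> {
  const fetched = await fetchHtml(url);
  if (!looksLikeShell(fetched)) {
    return { source: "fetch", html: fetched };
  }
  return { source: "rendered", html: await renderHtml(url) };
}
```

Recording which path produced the HTML corresponds to the `HTML source` column in the table below, where every target in this run reports `fetch`.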
453
+ Search and social targets stay in the benchmark as diagnostics, but are not
454
+ included in the gate summary because their logged-out public views are
455
+ sensitive to anti-bot checks, hydration, and personalization. In this run, the gate
456
+ summary includes 4 targets and excludes 5 diagnostics; the average gate agent
457
+ score is 0.698 and the average static/browser token ratio is 0.395.
458
+
459
+ | Category | gate | HTML source | class | static nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score | static/browser token ratio |
460
+ | --- | --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
461
+ | Clien home | gate | fetch | usable | 469 | 657 | 0.522 | 0.799 | 0.806 | 0.053 | 0.675 | 0.286 |
462
+ | Clien post | gate | fetch | usable | 573 | 1167 | 0.234 | 0.553 | 0.542 | 0.008 | 0.483 | 0.303 |
463
+ | Ruliweb post | gate | fetch | usable | 396 | 297 | 0.620 | 0.974 | 0.978 | 0.821 | 0.892 | 0.610 |
464
+ | DCInside post | gate | fetch | usable | 1217 | 358 | 0.233 | 0.951 | 0.957 | 0.429 | 0.740 | 0.381 |
465
+ | Google search | diagnostic | fetch | reference-challenge | 1 | 5 | 0.000 | 0.000 | 0.000 | 0.000 | 0.150 | 0.110 |
466
+ | Bing search | diagnostic | fetch | volatile | 152 | 126 | 0.380 | 0.719 | 0.590 | 0.091 | 0.511 | 1.249 |
467
+ | Startpage search | diagnostic | fetch | reference-challenge | 85 | 61 | 0.861 | 0.963 | 0.957 | 0.625 | 0.878 | 0.341 |
468
+ | X social | diagnostic | fetch | usable | 156 | 36 | 0.169 | 1.000 | 0.900 | 0.500 | 0.750 | 0.193 |
469
+ | Instagram social | diagnostic | fetch | usable | 36 | 115 | 0.255 | 0.293 | 1.000 | 0.130 | 0.543 | 0.145 |
470
+
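The gate summary figures quoted above the table can be reproduced from its rows with a plain arithmetic mean over the `gate` targets. The sketch below assumes that is how the harness aggregates; the `BenchmarkRow` type and function names are hypothetical, while the scores and ratios are copied from the table.

```typescript
// Hypothetical row shape; only the numbers come from the table above.
interface BenchmarkRow {
  category: string;
  gate: "gate" | "diagnostic";
  agentScore: number;
  tokenRatio: number | null; // null when the ratio is reported as n/a
}

function mean(values: number[]): number | null {
  if (values.length === 0) return null;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function summarizeGate(rows: BenchmarkRow[]) {
  const gateRows = rows.filter((row) => row.gate === "gate");
  const ratios = gateRows
    .map((row) => row.tokenRatio)
    .filter((ratio): ratio is number => ratio !== null);
  return {
    gateTargets: gateRows.length,
    diagnostics: rows.length - gateRows.length,
    averageAgentScore: mean(gateRows.map((row) => row.agentScore)),
    averageTokenRatio: mean(ratios),
  };
}

const koreaSocialRows: BenchmarkRow[] = [
  { category: "Clien home", gate: "gate", agentScore: 0.675, tokenRatio: 0.286 },
  { category: "Clien post", gate: "gate", agentScore: 0.483, tokenRatio: 0.303 },
  { category: "Ruliweb post", gate: "gate", agentScore: 0.892, tokenRatio: 0.61 },
  { category: "DCInside post", gate: "gate", agentScore: 0.74, tokenRatio: 0.381 },
  { category: "Google search", gate: "diagnostic", agentScore: 0.15, tokenRatio: 0.11 },
  { category: "Bing search", gate: "diagnostic", agentScore: 0.511, tokenRatio: 1.249 },
  { category: "Startpage search", gate: "diagnostic", agentScore: 0.878, tokenRatio: 0.341 },
  { category: "X social", gate: "diagnostic", agentScore: 0.75, tokenRatio: 0.193 },
  { category: "Instagram social", gate: "diagnostic", agentScore: 0.543, tokenRatio: 0.145 },
];

const koreaSocialGate = summarizeGate(koreaSocialRows);
console.log(koreaSocialGate);
```

The four gate rows average to an agent score of 0.6975 and a token ratio of 0.395, which matches the reported 0.698 and 0.395 after rounding to three decimals.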
471
+ Notes:
472
+
473
+ - Ruliweb can require rendered HTML fallback in some runs, but this run fetched
474
+ useful static HTML directly.
475
+ - Clien matching improved after benchmark normalization started stripping icon
476
+ font private-use glyphs and leading menu bullets from comparable names.
477
+ - DCInside preserved action/navigation signals, moved from `over-collected` to
478
+ `usable`, and lowered static/browser token ratio below 0.50 after compact
479
+ static extraction started pruning unnamed leaf wrappers.
480
+ - Google Search returned a bot/interstitial shell in the browser reference path.
481
+ - Bing Search is volatile in this environment: fetch or rendered HTML can expose
482
+ useful search UI or unrelated image-search affordances, but the exact
483
+ reference comparison is not stable enough for a gate yet. Search diagnostics
484
+ can be classified as `volatile` instead of `usable`.
485
+ - Startpage can return useful fetch HTML, but this run hit a suspended-connection
486
+ captcha page in the browser-derived reference path. Embedded CSS-in-JS text is
487
+ now excluded from static names, but the target remains a `reference-challenge`
488
+ fixture.
489
+ - Instagram can alternate between login-only and fuller logged-out shells in
490
+ this environment; keep it diagnostic even when a run scores as `usable`.
491
+
492
+ ## China/Japan Static Benchmark
493
+
494
+ Run with `pnpm compare:static:china-japan` and
495
+ `pnpm compare:tokens:china-japan`.
496
+
497
+ This target set covers Chinese and Japanese encyclopedia, news, portal, forum,
498
+ developer, search, and video/social pages. Search, video/social, and pages whose
499
+ reference navigation fails in this environment stay in diagnostics. In this
500
+ run, the gate summary includes 7 targets and excludes 6 diagnostics; the
501
+ average gate agent score is 0.654 and the average static/browser token ratio is
502
+ 0.544.
503
+
504
+ | Category | gate | HTML source | class | static nodes | agent-browser lines | named role overlap | action recall | nav recall | content recall | agent score | static/browser token ratio |
505
+ | --- | --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
506
+ | China Wikipedia | gate | fetch | usable | 4907 | 8518 | 0.421 | 0.866 | 0.871 | 0.050 | 0.687 | 0.277 |
507
+ | People China portal | diagnostic | fetch | reference-missing | 912 | | 0.000 | 1.000 | 1.000 | 1.000 | 0.850 | 0.482 |
508
+ | Xinhua portal | gate | fetch | usable | 1041 | 1076 | 0.706 | 0.973 | 0.973 | 0.167 | 0.772 | 0.453 |
509
+ | Douban home | gate | fetch | usable | 700 | 952 | 0.757 | 0.930 | 0.893 | 0.095 | 0.738 | 0.307 |
510
+ | Baidu search | diagnostic | fetch | needs-browser | 250 | 7 | 0.000 | 0.000 | 0.000 | 1.000 | 0.200 | 0.677 |
511
+ | Bilibili home | diagnostic | fetch | usable | 231 | 217 | 0.523 | 0.754 | 0.770 | 0.436 | 0.660 | 0.371 |
512
+ | Japan Wikipedia | gate | fetch | usable | 5349 | 11774 | 0.311 | 0.599 | 0.618 | 0.072 | 0.491 | 0.649 |
513
+ | NHK News | diagnostic | fetch | reference-missing | 490 | | 0.000 | 1.000 | 1.000 | 1.000 | 0.850 | n/a |
514
+ | Qiita TypeScript tag | gate | fetch | usable | 674 | 893 | 0.645 | 0.719 | 0.741 | 0.508 | 0.693 | 0.308 |
515
+ | Hatena IT hotentry | gate | fetch | usable | 1675 | 1775 | 0.566 | 0.922 | 0.937 | 0.258 | 0.739 | 0.782 |
516
+ | 5ch board | gate | fetch | usable | 780 | 360 | 0.105 | 0.574 | 0.600 | 0.316 | 0.459 | 1.031 |
517
+ | Yahoo Japan search | diagnostic | fetch | needs-browser | 54 | 158 | 0.187 | 0.327 | 0.293 | 0.000 | 0.279 | 0.177 |
518
+ | Niconico home | diagnostic | fetch | needs-browser | 212 | 373 | 0.123 | 0.186 | 0.196 | 0.018 | 0.159 | 0.228 |
519
+
520
+ Notes:
521
+
522
+ - China Wikipedia became usable after the benchmark stopped treating Wikipedia
523
+ table-of-contents section numbers as part of comparable link names and static
524
+ extraction started auto-detecting wiki-like HTML to preserve more article
525
+ links by default.
526
+ - Xinhua and Douban are the strongest Chinese gate targets in this run.
527
+ - People China fetches usable HTML, but `agent-browser` navigation is blocked in
528
+ this environment, so the target is diagnostic until a stable reference path is
529
+ available.
530
+ - Baidu search is unstable across runs. It can collapse to a tiny feedback shell
531
+ or expose a larger fetched search page; keep it diagnostic.
532
+ - Japan Wikipedia is usable but still has low exact content recall on the large
533
+ article body.
534
+ - NHK fetches static HTML, but Puppeteer and `agent-browser` both hit HTTP/2
535
+ navigation failures in this environment. Token ratio is reported as `n/a`
536
+ when the browser reference is unavailable.
537
+ - Qiita and Hatena are useful Japanese gate targets; Hatena remains a token-cost
538
+ stress case.
539
+ - 5ch became usable after reference comparison hardening, forum thread metadata
540
+ normalization, auto-detected forum link-farm limits, and pruning redundant
541
+ listitem wrappers around links/buttons. It remains a token-cost stress case at
542
+ rough parity with browser injection.
543
+
544
+ ## Observations
545
+
546
+ - Simple static pages line up well. `example.com` matched the important named roles exactly.
547
+ - Wikipedia exposes a large language `<select>`. `ax-grep` can still unroll options for agent operation, but the comparison harness now disables option unrolling to match `agent-browser snapshot` more closely.
548
+ - Wikipedia language links use both visible article-count text and descriptive `title` attributes. `ax-grep` now follows accessible-name priority more closely by using link contents before title fallback.
549
+ - MDN uses many custom elements. `ax-grep` now prunes simple custom-element wrappers, but host elements that expose state, ids, or shadow content still need deeper handling.
550
+ - MDN ad-like placements can be excluded in comparison mode with `excludeLikelyAds`. The general extractor keeps this off by default so callers do not silently lose content.
551
+ - A shared comparison viewport is available through `AX_LITE_COMPARE_VIEWPORT=WIDTHxHEIGHT`, but it is opt-in because responsive pages can change the benchmark shape significantly.
552
+ - Hacker News relies on layout tables. The comparison harness normalizes Chrome's `LayoutTableCell` role to `cell` and removes punctuation-adjacent whitespace, improving overlap from 0.64 to 0.75.
553
+ - The comparison harness normalizes common role vocabulary differences such as `image` vs `img`, `paragraph` vs `p`, and `StaticText` vs `text`.
554
+ - `libraries.io/npm/typescript` is the stable package-registry-like sample.
555
+ - The new agent-facing metrics show a different picture than raw overlap on
556
+ Wikipedia and Libraries.io: static-text recall is low, but actionable and
557
+ navigation targets are mostly preserved. That distinction better matches the
558
+ goal of making pages tractable for agents.
559
+ - Korean samples cover a large encyclopedia article, two news-like pages, and a
560
+ public portal. The Korean Wikipedia page is intentionally heavy and is kept in
561
+ `compare:korea` rather than the default sample script.
562
+ - `hani.co.kr` timed out waiting for Puppeteer network idle during the baseline
563
+ run and used the DOMContentLoaded state. Keep it as a news-site stress case,
564
+ but do not treat it as a stable target yet.
565
+ - Korean live pages can shift by a few nodes or snapshot lines between runs as
566
+ headlines, ads, and embedded widgets update.
567
+ - `yonhapnewstv.co.kr` currently lines up best among the Korean samples across
568
+ exact overlap, content recall, and agent score.
569
+ - Static SSR extraction is viable for simple and server-rendered pages. It works
570
+ especially well on Hacker News and reasonably on Yonhap News TV without any
571
+ browser runtime.
572
+ - Static SSR extraction can prune some non-exposed menu content from HTML
573
+ alone. The most important signal so far is a collapsed control with
574
+ `aria-expanded="false"` and `aria-controls`; pruning the controlled subtree
575
+ reduced Wikipedia static tokens from 11,183 to 1,292 and improved exact
576
+ overlap from 0.05 to 0.57.
577
+ - Static SSR extraction now skips non-semantic payload tags, summarizes large
578
+ child lists, and collapses repeated template-like subtrees. This keeps raw SSR
579
+ payloads from turning into unbounded prompt input, while preserving an
580
+ explicit `note` that nodes were omitted.
581
+ - Compact static extraction prunes unnamed leaf wrappers such as decorative
582
+ spans, emphasis tags, empty inputs, and line breaks. Ancestor accessible names
583
+ are computed before pruning, so useful link/button names are preserved while
584
+ prompt-only wrapper noise is removed.
585
+ - Static SSR extraction cannot account for computed CSS, responsive layout,
586
+ client-only rendering, open shadow roots, iframe documents, or post-load DOM
587
+ mutation. Treat it as a lightweight agent parsing fallback, not an AXTree
588
+ replacement.
589
+ - Static SSR extraction is not automatically cheaper in prompt tokens, but it
590
+ can be competitive when collapsed controlled regions are pruned. It is now
591
+ slightly cheaper than browser injection on Wikipedia and still cheaper on
592
+ Hacker News and Yonhap News TV.
593
+ - Token cost needs its own benchmark gate. Agent-readiness can be acceptable
594
+ while prompt cost is unacceptable, especially on SSR pages with large hidden
595
+ menus, language selectors, or template payloads.
596
+ - Diverse targets show why benchmark categories matter. Government, ecommerce,
597
+ and article pages preserve useful action/navigation signals; large news
598
+ portals are good stress tests; Reddit/X/Instagram are better treated as
599
+ social/challenge fixtures because public logged-out views often collapse to
600
+ shell, login, or bot-protection states.
601
+ - Diverse token results show static extraction is often cheaper on
601
+ server-rendered news, government, and ecommerce pages. Social sites are inconsistent:
603
+ X's fetched shell is tiny compared with the browser view, old Reddit is the
604
+ opposite in this environment, and Instagram exposes enough SSR payload to make
605
+ static more expensive than the rendered shell.
606
+ - Shell/challenge classification is required because exact overlap and agent
607
+ score can look deceptively good when both static and reference snapshots are
608
+ nearly empty.
609
+ - AP News and Ars Technica were tested as additional candidates but omitted from
610
+ `compare:static:diverse` because the reference snapshot timed out in this
611
+ environment. Reuters returned HTTP 401 from plain fetch and is also omitted
612
+ from the automated diverse set.
613
+ - `npmjs.com` currently serves a Cloudflare challenge in the sample environment. The baseline is useful as a challenge-page fixture, not as a package-page content fixture.
614
+
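The role and name normalization mentioned in the Hacker News and role-vocabulary observations can be illustrated with a small exact-match comparison. This is a sketch, not the harness implementation: the alias direction (for example `image` to `img`), the punctuation set, and the set-based scoring are assumptions, and all function names are hypothetical. Only the `LayoutTableCell` to `cell` mapping and the exact `role:name` matching rule are taken from this document.

```typescript
// Hypothetical normalization sketch; alias directions are assumed.
interface NamedNode {
  role: string;
  name: string;
}

const ROLE_ALIASES = new Map<string, string>([
  ["LayoutTableCell", "cell"], // stated in the observations above
  ["image", "img"], // canonical side is an assumption
  ["paragraph", "p"], // canonical side is an assumption
  ["StaticText", "text"], // canonical side is an assumption
]);

function normalizeRole(role: string): string {
  return ROLE_ALIASES.get(role) ?? role;
}

// Collapse whitespace, then drop whitespace next to common punctuation.
function normalizeName(name: string): string {
  return name
    .replace(/\s+/g, " ")
    .replace(/\s*([.,:;!?])\s*/g, "$1")
    .trim();
}

function toKeys(nodes: NamedNode[]): Set<string> {
  const keys = new Set<string>();
  nodes.forEach((node) => {
    const name = normalizeName(node.name);
    if (name !== "") keys.add(`${normalizeRole(node.role)}:${name}`);
  });
  return keys;
}

// Exact matching only: no fuzzy or containment comparison.
function exactOverlap(reference: NamedNode[], candidate: NamedNode[]) {
  const referenceKeys = toKeys(reference);
  const candidateKeys = toKeys(candidate);
  let shared = 0;
  referenceKeys.forEach((key) => {
    if (candidateKeys.has(key)) shared += 1;
  });
  return {
    referenceRecall: referenceKeys.size === 0 ? 0 : shared / referenceKeys.size,
    candidatePrecision: candidateKeys.size === 0 ? 0 : shared / candidateKeys.size,
  };
}
```

Because both sides pass through the same normalization, a reference `LayoutTableCell` named `1 .` and a candidate `cell` named `1.` produce the same key and count as a match.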
615
+ ## Next Improvements
616
+
617
+ - Improve custom-element/shadow-host pruning without losing useful selector targets.
618
+ - Add explicit benchmark gates for actionable and navigation recall once a
619
+ stable target set is chosen.
620
+ - Compare browser and static extraction side-by-side on the same target set to
621
+ decide when the Worker-compatible path is good enough.
622
+ - Tune static pruning controls for hidden menus, select/options, and repeated
623
+ template regions against the diverse benchmark set.
624
+ - Support authenticated/cached sessions for `npmjs.com` if the real npm package page remains useful as a target.
625
+ - Add more real WebView smoke tests once Android/iOS host projects exist.
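As a starting point for tuning the hidden-menu pruning listed above, the collapsed-control rule from the Observations section (`aria-expanded="false"` together with `aria-controls`) can be sketched on a simplified tree. The `StaticNode` type and function names are hypothetical and do not reflect `ax-grep` internals. The sketch assumes every id listed in `aria-controls` names a subtree that is safe to drop.

```typescript
// Hypothetical tree shape for illustration only.
interface StaticNode {
  tag: string;
  attrs: Record<string, string>;
  children: StaticNode[];
}

// Gather ids referenced by collapsed controls anywhere in the tree.
// aria-controls is a space-separated list of id references.
function collectCollapsedTargets(node: StaticNode, ids: Set<string>): void {
  const controls = node.attrs["aria-controls"];
  if (node.attrs["aria-expanded"] === "false" && controls) {
    controls
      .split(/\s+/)
      .filter((id) => id !== "")
      .forEach((id) => ids.add(id));
  }
  node.children.forEach((child) => collectCollapsedTargets(child, ids));
}

// Return a copy of the tree without the controlled subtrees.
// The input tree is left unmodified.
function pruneCollapsed(root: StaticNode): StaticNode {
  const ids = new Set<string>();
  collectCollapsedTargets(root, ids);
  const visit = (node: StaticNode): StaticNode => ({
    tag: node.tag,
    attrs: node.attrs,
    children: node.children
      .filter((child) => {
        const id = child.attrs["id"];
        return !(id !== undefined && ids.has(id));
      })
      .map(visit),
  });
  return visit(root);
}

function countNodes(node: StaticNode): number {
  return node.children.reduce((sum, child) => sum + countNodes(child), 1);
}
```

The control itself is kept, so an agent can still see and operate the button that would expand the region. Collecting ids in a first pass means the controlled subtree is pruned regardless of whether it appears before or after its control in document order.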
@@ -0,0 +1,28 @@
1
+ # Feature Overview
2
+
3
+ `ax-grep` keeps the root README short. Use this page for the fuller feature map.
4
+
5
+ ## Semantic Tree
6
+
7
+ - Fetch a URL, read an HTML file, read stdin, or inspect browser-captured HTML.
8
+ - Print compact text by default, or return structured JSON.
9
+ - Summarize headings, links, forms, media, metadata, tables, code blocks, and page state.
10
+
11
+ ## Agent Mode
12
+
13
+ - `--agent` returns the next useful step for an automation loop.
14
+ - `agent.executor` is the shortest machine-facing step.
15
+ - `agent.handoff` and `agent.next` keep compatibility with older integrations.
16
+ - `pageCheck` and `verification` summarize whether the page answers the request.
17
+
18
+ ## Search and Verification
19
+
20
+ - `--search` can collect search results and optionally open a ranked result.
21
+ - `--find` verifies requested text on a page or search result.
22
+ - Locale flags such as `--lang` and `--region` make searches easier to reproduce.
23
+
24
+ ## Browser Fallback
25
+
26
+ `ax-grep` does not bypass login, paywalls, bot checks, or JavaScript rendering.
27
+ When plain fetch is not enough, capture rendered HTML in a browser and pass it
28
+ back with `--html-file` or `--stdin`.