ax-audit 3.1.0 → 3.6.0

Files changed (58)
  1. package/CHANGELOG.md +60 -0
  2. package/README.md +61 -225
  3. package/dist/checks/agent-access.d.ts +16 -0
  4. package/dist/checks/agent-access.d.ts.map +1 -0
  5. package/dist/checks/agent-access.js +110 -0
  6. package/dist/checks/agent-access.js.map +1 -0
  7. package/dist/checks/crawl-efficiency.d.ts +4 -0
  8. package/dist/checks/crawl-efficiency.d.ts.map +1 -0
  9. package/dist/checks/crawl-efficiency.js +122 -0
  10. package/dist/checks/crawl-efficiency.js.map +1 -0
  11. package/dist/checks/index.d.ts.map +1 -1
  12. package/dist/checks/index.js +6 -0
  13. package/dist/checks/index.js.map +1 -1
  14. package/dist/checks/robots-txt.d.ts +20 -0
  15. package/dist/checks/robots-txt.d.ts.map +1 -1
  16. package/dist/checks/robots-txt.js +111 -3
  17. package/dist/checks/robots-txt.js.map +1 -1
  18. package/dist/checks/rsl.d.ts +6 -0
  19. package/dist/checks/rsl.d.ts.map +1 -0
  20. package/dist/checks/rsl.js +252 -0
  21. package/dist/checks/rsl.js.map +1 -0
  22. package/dist/cli.d.ts.map +1 -1
  23. package/dist/cli.js +20 -2
  24. package/dist/cli.js.map +1 -1
  25. package/dist/constants.d.ts +17 -0
  26. package/dist/constants.d.ts.map +1 -1
  27. package/dist/constants.js +39 -1
  28. package/dist/constants.js.map +1 -1
  29. package/dist/fetcher.d.ts +5 -1
  30. package/dist/fetcher.d.ts.map +1 -1
  31. package/dist/fetcher.js +32 -27
  32. package/dist/fetcher.js.map +1 -1
  33. package/dist/index.d.ts +2 -1
  34. package/dist/index.d.ts.map +1 -1
  35. package/dist/index.js +1 -0
  36. package/dist/index.js.map +1 -1
  37. package/dist/orchestrator.d.ts +2 -2
  38. package/dist/orchestrator.d.ts.map +1 -1
  39. package/dist/orchestrator.js +13 -6
  40. package/dist/orchestrator.js.map +1 -1
  41. package/dist/reporter/index.d.ts.map +1 -1
  42. package/dist/reporter/index.js +7 -0
  43. package/dist/reporter/index.js.map +1 -1
  44. package/dist/reporter/markdown.d.ts +8 -0
  45. package/dist/reporter/markdown.d.ts.map +1 -0
  46. package/dist/reporter/markdown.js +76 -0
  47. package/dist/reporter/markdown.js.map +1 -0
  48. package/dist/types.d.ts +7 -1
  49. package/dist/types.d.ts.map +1 -1
  50. package/docs/api.md +200 -0
  51. package/docs/architecture.md +88 -0
  52. package/docs/checks.md +322 -0
  53. package/docs/ci.md +89 -0
  54. package/docs/cli.md +67 -0
  55. package/docs/concepts.md +87 -0
  56. package/docs/faq.md +77 -0
  57. package/docs/getting-started.md +101 -0
  58. package/package.json +2 -1
package/CHANGELOG.md CHANGED
@@ -2,6 +2,66 @@
2
2
 
3
3
  All notable changes to ax-audit are documented here.
4
4
 
5
+ ## [3.6.0] - 2026-06-06
6
+
7
+ ### Added
8
+
9
+ - **Fetcher retries with exponential backoff**: transient failures (network errors, timeouts, and 408/425/429/500/502/503/504) are retried automatically. Configurable via `--retries <n>` (CLI, default 2) and `retries` (programmatic `AuditOptions`); backoff doubles from a 250ms base. Non-retryable responses (e.g. 404) short-circuit immediately. Previously a single transient timeout scored a check 0.
10
+ - **Parallel batch auditing**: `--concurrency <n>` (CLI) and `concurrency` on the new `BatchOptions` type run multiple URL audits in parallel via an order-preserving work queue. Default remains sequential (1).
11
+ - **Markdown reporter**: `--output markdown` emits a self-contained Markdown report (score, summary table, per-check findings with status emoji, baseline deltas) — ideal for CI logs and PR comments. Supported for single and batch audits. New exports: `renderMarkdown`, `renderBatchMarkdown`.
12
+ - **Crawler list refresh**: added Google's official signed AI-agent user-agent `Google-Agent` (identity `https://agent.bot.goog`) to the known-crawlers list.
13
+ - **CLI validation**: `--retries`, `--concurrency`, and `--output` now reject invalid values with a clear error.
14
+ - **17 new tests** (301 total): fetcher retry behavior (against a flaky local server), batch ordering/concurrency, and the Markdown reporter.
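
The "order-preserving work queue" behind `--concurrency` can be pictured with a minimal sketch. The helper name `mapWithConcurrency` and its signature are hypothetical (the package's internal implementation is not shown in this diff); the sketch only illustrates one common way to cap the number of in-flight tasks while keeping results in input order.

```typescript
// Hypothetical helper, not the package's actual code: runs `worker` over
// `items` with at most `limit` tasks in flight. Each result is written at
// its input index, so output order matches input order regardless of
// which task finishes first.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next index until the queue is empty.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => runLane()));
  return results;
}
```

With `limit` set to 1 this degenerates to sequential execution, which matches the documented default.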
15
+
16
+ ### Notes
17
+
18
+ - No scoring changes. Retries can raise scores on flaky endpoints that previously timed out, but the scoring model itself is unchanged.
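
The retry policy described for 3.6.0 can be sketched as two pure functions. The names `isRetryableStatus`, `backoffDelayMs`, and `retrySchedule` are illustrative assumptions, and the sketch assumes the first retry waits the full 250 ms base; the changelog only states that the backoff doubles from that base.

```typescript
// Status codes the changelog lists as transient. Anything else (for
// example 404) is treated as final and is not retried.
const RETRYABLE_STATUSES: ReadonlySet<number> = new Set([408, 425, 429, 500, 502, 503, 504]);

function isRetryableStatus(status: number): boolean {
  return RETRYABLE_STATUSES.has(status);
}

// Delay before retry number `attempt` (0-based), assuming the first retry
// waits exactly `baseMs` and each later retry waits twice as long.
function backoffDelayMs(attempt: number, baseMs = 250): number {
  return baseMs * 2 ** attempt;
}

// Full list of waits for a given retry budget, e.g. the CLI default of 2.
function retrySchedule(retries: number, baseMs = 250): number[] {
  return Array.from({ length: retries }, (_, i) => backoffDelayMs(i, baseMs));
}
```

Under these assumptions the default `--retries 2` adds at most 750 ms of waiting per request, on top of the request timeouts themselves.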
19
+
20
+ ## [3.5.0] - 2026-06-06
21
+
22
+ ### Added
23
+
24
+ - **crawl-efficiency check (informational)**: measures the cost of crawling your pages across three dimensions. Compression — rewards Brotli, accepts gzip/deflate/zstd (suggesting br), warns when uncompressed (−30). Conditional GET — checks for an `ETag` or `Last-Modified` validator, then issues a follow-up request with `If-None-Match` / `If-Modified-Since` and verifies the server returns `304 Not Modified` (−30 for no validator, −15 when 304 is not honored). Response size — warns on pages over 500 KB (−5) and 2 MB (−10) of decompressed HTML. The probe advertises `Accept-Encoding: br, gzip, deflate`; the conditional request reuses the per-request header support added in 3.1.0.
25
+ - **12 new tests** (284 total).
26
+
27
+ ### Scoring
28
+
29
+ - The new check carries **weight 0 in 3.x** (informational), consistent with 3.1.0–3.4.0.
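
The deductions listed for crawl-efficiency can be combined into a small worked sketch. The `CrawlProbe` shape and the function name are hypothetical, and three points are assumptions rather than documented behavior: the score starts at 100, the two size tiers are mutually exclusive (the larger replaces the smaller), and 1 KB is 1024 bytes.

```typescript
// Hypothetical summary of what the probe observed; not the package's types.
interface CrawlProbe {
  encoding: string | null; // Content-Encoding value, or null when uncompressed
  hasValidator: boolean;   // ETag or Last-Modified present
  honors304: boolean;      // conditional request answered with 304
  htmlBytes: number;       // decompressed HTML size in bytes
}

function crawlEfficiencyScore(p: CrawlProbe): number {
  let score = 100; // assumption: deductions apply to a starting score of 100
  if (p.encoding === null) score -= 30;            // uncompressed response
  if (!p.hasValidator) score -= 30;                // no ETag / Last-Modified
  else if (!p.honors304) score -= 15;              // validator present, 304 not honored
  if (p.htmlBytes > 2 * 1024 * 1024) score -= 10;  // over 2 MB
  else if (p.htmlBytes > 500 * 1024) score -= 5;   // over 500 KB
  return Math.max(0, score);
}
```

Because the check carries weight 0 in 3.x, a number like this appears in the report but does not move the overall score.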
30
+
31
+ ## [3.4.0] - 2026-06-06
32
+
33
+ ### Added
34
+
35
+ - **agent-access check (informational)**: cloaking and blocking detection. Probes the homepage with realistic user-agents for each of the 8 core AI crawlers (GPTBot, ClaudeBot, ChatGPT-User, Claude-SearchBot, Google-Extended, PerplexityBot, OAI-SearchBot, CCBot) and compares status and visible-text volume against the default-UA baseline. Flags the failure mode invisible to operators: robots.txt allows a crawler while the WAF returns 403 to its user-agent (Cloudflare's "Block AI Crawlers" toggle produces exactly this). Blocks consistent with an explicit robots.txt `Disallow` (or wildcard block) are reported as intentional and not penalized. Responses with under 50% of baseline visible text count as reduced content (half credit); content comparison is skipped for baselines under 200 chars to avoid SPA-shell noise. Hints note the verified-bots caveat: WAFs using Web Bot Auth / IP verification may pass the real crawler while rejecting this unverified probe.
36
+ - `parseUserAgents` and `BotEntry` are now exported from the robots-txt check for reuse.
37
+ - **12 new tests** (272 total).
38
+
39
+ ### Scoring
40
+
41
+ - Internal score is the credit ratio across the 8 probes; the check carries **weight 0 in 3.x** (informational), consistent with 3.1.0–3.3.0.
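
As a worked example of the credit ratio: full credit for `ok` and `blocked-consistent`, half credit for `reduced`, none for `blocked`. The sketch below mirrors the reduction in `dist/checks/agent-access.js`, but the standalone function `accessScore` and the `Outcome` type name are illustrative, and it divides by the number of outcomes passed in rather than by the crawler list length.

```typescript
type Outcome = 'ok' | 'blocked-consistent' | 'reduced' | 'blocked';

// Credit ratio across probes, rounded to an integer percentage.
function accessScore(outcomes: Outcome[]): number {
  const credit = outcomes.reduce((acc, o) => {
    if (o === 'ok' || o === 'blocked-consistent') return acc + 1; // full credit
    if (o === 'reduced') return acc + 0.5;                        // half credit
    return acc;                                                   // unexplained block
  }, 0);
  return Math.round((credit / outcomes.length) * 100);
}

// Six crawlers fine, one served reduced content, one blocked by the WAF:
// (6 + 0.5) / 8 = 0.8125, which rounds to 81.
const example = accessScore(['ok', 'ok', 'ok', 'ok', 'ok', 'ok', 'reduced', 'blocked']);
```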
42
+
43
+ ## [3.3.0] - 2026-06-06
44
+
45
+ ### Added
46
+
47
+ - **rsl check (informational)**: validates [Really Simple Licensing 1.0](https://rslstandard.org/rsl) — the machine-readable content-licensing standard endorsed by 1,500+ publishers (Reddit, Yahoo, Medium, O'Reilly) with infrastructure support from Cloudflare and Fastly. Discovery via all three spec mechanisms: robots.txt `License:` directive (absolute-URI enforcement per §4.4.1), HTTP `Link: rel="license"; type="application/rsl+xml"` header, and `<link rel="license" type="application/rsl+xml">` (plain CC-style license links without the RSL media type are ignored). Document validation: `application/rsl+xml` Content-Type (−5), `<rsl>` root + `https://rslstandard.org/rsl` namespace, required `url` attribute on every `<content>` (empty value allowed per §3.3), `<license>` presence, `permits`/`prohibits` type and token vocabulary (`usage`: all/ai-all/ai-train/ai-input/ai-index/search; `user`; `geo` as ISO 3166-1 alpha-2), and `payment` types.
48
+ - **21 new tests** (260 total) covering the three discovery mechanisms, vocabulary enforcement, namespace/root/structure validation, XML-comment stripping, and score caps.
49
+
50
+ ### Scoring
51
+
52
+ - The new check carries **weight 0 in 3.x** (informational), consistent with 3.1.0/3.2.0: no impact on existing scores or baselines until v4.0.
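
The first discovery mechanism, the robots.txt `License:` directive with absolute-URI enforcement, can be sketched as follows. This is not the package's parser: the function name is hypothetical, case-insensitive matching of the directive name is an assumption, and "absolute URI" is approximated by whether the WHATWG `URL` constructor accepts the value without a base.

```typescript
// Hypothetical sketch: collect License: values from robots.txt and split
// them into absolute URLs (accepted) and everything else (rejected).
function licenseUrlsFromRobots(robotsTxt: string): { valid: string[]; rejected: string[] } {
  const valid: string[] = [];
  const rejected: string[] = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    // Assumption: the directive name is matched case-insensitively.
    const match = /^license\s*:\s*(.+)$/i.exec(rawLine.trim());
    if (!match) continue;
    const value = (match[1] ?? '').trim();
    try {
      new URL(value); // throws for relative references such as "/license.xml"
      valid.push(value);
    } catch {
      rejected.push(value);
    }
  }
  return { valid, rejected };
}
```

A relative value ends up in `rejected`, which is where a check like this would attach its finding about absolute-URI enforcement.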
53
+
54
+ ## [3.2.0] - 2026-06-06
55
+
56
+ ### Added
57
+
58
+ - **Content Signals Policy support in robots-txt** ([contentsignals.org](https://contentsignals.org), CC0): the check now parses `Content-Signal:` directives — the machine-readable `search` / `ai-input` / `ai-train` preferences that Cloudflare serves by default on its 3.8M+ managed robots.txt domains. Declared signals are reported per User-agent group; malformed segments, unknown signal names, and directives placed outside a User-agent group produce warnings. Absence of the directive produces an informational nudge. The group parser now also treats `Content-Signal` as a group-closing directive, fixing potential User-agent group leakage.
59
+ - **10 new tests** (239 total) covering declaration reporting, malformed/unknown signals, shared User-agent groups, case-insensitivity, out-of-group placement, and score neutrality.
60
+
61
+ ### Scoring
62
+
63
+ - All Content Signals findings are **informational in 3.x**: they never alter the robots-txt score, so existing scores and baselines are unchanged.
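
A `Content-Signal:` value such as `search=yes, ai-train=no` can be parsed along these lines. The function name, the result shape, and the warning strings are hypothetical; the sketch assumes comma-separated `name=yes|no` segments and covers only the malformed-segment and unknown-name warnings, not group placement.

```typescript
// The three signal names mentioned in the changelog entry.
const KNOWN_SIGNALS: ReadonlySet<string> = new Set(['search', 'ai-input', 'ai-train']);

// Hypothetical result shape, not the package's types.
interface ParsedSignals {
  signals: Record<string, 'yes' | 'no'>;
  warnings: string[];
}

function parseContentSignal(value: string): ParsedSignals {
  const signals: Record<string, 'yes' | 'no'> = {};
  const warnings: string[] = [];
  for (const segment of value.split(',')) {
    const part = segment.trim();
    if (part === '') continue;
    const match = /^([a-z-]+)\s*=\s*(yes|no)$/i.exec(part);
    if (!match) {
      warnings.push(`malformed segment: ${part}`);
      continue;
    }
    const name = (match[1] ?? '').toLowerCase();
    const setting = (match[2] ?? '').toLowerCase() as 'yes' | 'no';
    if (!KNOWN_SIGNALS.has(name)) {
      warnings.push(`unknown signal: ${name}`);
      continue;
    }
    signals[name] = setting;
  }
  return { signals, warnings };
}
```

In keeping with the scoring note, warnings from a parser like this would be reported as findings without changing the robots-txt score.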
64
+
5
65
  ## [3.1.0] - 2026-06-06
6
66
 
7
67
  ### Added
package/README.md CHANGED
@@ -25,273 +25,109 @@ npx ax-audit https://your-site.com
25
25
  PASS /llms.txt exists
26
26
  PASS /llms.txt Content-Type OK (text/plain)
27
27
  PASS H1 heading: "Lucio Duran — Personal Portfolio"
28
- PASS /llms-full.txt also available (bonus)
29
28
 
30
29
  Robots.txt (100/100)
31
30
  PASS All 8 core AI crawlers explicitly configured
32
- PASS 32/47 known AI crawlers have explicit rules
33
-
34
- HTML Rendering (90/100)
35
- PASS Server-rendered content detected (473 words)
36
- PASS Semantic landmarks present (main, article, header, footer, nav)
37
- PASS Single <h1> heading
38
- PASS 3/3 <img> tags have alt attributes
39
-
40
- TLS / HTTPS (100/100)
41
- PASS Site is served over HTTPS
42
- PASS HTTP requests redirect to HTTPS
43
- PASS HSTS preload-eligible
31
+ PASS Content signals declared for User-agent: * — search=yes, ai-train=no
32
+
33
+ Content Negotiation (100/100)
34
+ PASS Homepage serves Markdown via content negotiation (Accept: text/markdown)
35
+ PASS Markdown is ~95% lighter than the HTML representation
44
36
  ...
45
37
  ```
46
38
 
47
39
  ## Why
48
40
 
49
- AI agents and LLMs are increasingly crawling, indexing, and interacting with websites. Just like Lighthouse audits web performance and axe-core audits accessibility, **ax-audit** tells you how ready your site is for the AI agent ecosystem.
41
+ AI agents and LLMs are increasingly crawling, indexing, and interacting with websites. Just like Lighthouse audits web performance and axe-core audits accessibility, **ax-audit** tells you how ready your site is for the AI agent ecosystem — discovery files, crawler policy, licensing, content negotiation, and the failure modes invisible to operators (like a WAF blocking crawlers your robots.txt allows).
50
42
 
51
43
  ## What it checks
52
44
 
53
- | Check | What it audits | Weight |
54
- |---|---|---|
55
- | **LLMs.txt** | `/llms.txt` presence, [llmstxt.org](https://llmstxt.org) spec, Content-Type | 11% |
56
- | **Robots.txt** | AI crawler configuration (40+ known crawlers), wildcard detection, partial path restrictions | 11% |
57
- | **HTML Rendering** | Server-rendered content, semantic landmarks, SPA-shell detection, alt coverage | 9% |
58
- | **Structured Data** | JSON-LD on homepage (schema.org, `@graph`, entity types) | 9% |
59
- | **HTTP Headers** | Security headers + AI discovery `Link` headers + CORS on `.well-known` | 9% |
60
- | **Agent Card** | `/.well-known/agent.json` [A2A protocol](https://a2a-protocol.org) + same-origin url + skill quality | 7% |
61
- | **MCP** | `/.well-known/mcp.json` [Model Context Protocol](https://modelcontextprotocol.io) server config | 7% |
62
- | **SEO Basics** | `<title>`, meta description, canonical, `<html lang>`, charset, viewport, hreflang | 7% |
63
- | **Security.txt** | `/.well-known/security.txt` [RFC 9116](https://www.rfc-editor.org/rfc/rfc9116) compliance | 6% |
64
- | **Meta Tags** | AI meta tags (`ai:*`), `rel="alternate"`, `rel="me"`, OpenGraph + Twitter Card completeness | 6% |
65
- | **OpenAPI** | `/.well-known/openapi.json` presence, schema validity, Content-Type | 6% |
66
- | **TLS / HTTPS** | HTTPS, HTTP→HTTPS redirect, HSTS with `preload` + `includeSubDomains` | 5% |
67
- | **Sitemap** | `sitemap.xml` (or `Sitemap:` from robots.txt) — XML validity, `<lastmod>` coverage, freshness, sitemap-index handling | 4% |
68
- | **AI Well-Known** | Emerging files: `/.well-known/ai.txt`, `genai.txt`, `ai-plugin.json`, `agents.json`, `nlweb.json` | 3% |
69
- | **Content Negotiation** | Markdown for agents — `Accept: text/markdown` negotiation, `Vary: Accept`, `rel="alternate"` fallback | 0%* |
70
-
71
- \* **Content Negotiation** is informational in 3.x: it runs and reports findings but does not affect the overall score. It will gain weight in v4.0.
72
-
73
- ## Install
45
+ 18 checks: 14 weighted, 4 informational. Full reference: **[docs/checks.md](docs/checks.md)**.
74
46
 
75
- ```bash
76
- npm install -g ax-audit
77
- ```
47
+ | Check | Weight | Check | Weight |
48
+ |---|---|---|---|
49
+ | LLMs.txt | 11% | Security.txt | 6% |
50
+ | Robots.txt + [Content Signals](https://contentsignals.org) | 11% | Meta Tags (OG / Twitter / AI) | 6% |
51
+ | HTML Rendering | 9% | OpenAPI | 6% |
52
+ | Structured Data (JSON-LD) | 9% | TLS / HTTPS | 5% |
53
+ | HTTP Headers | 9% | Sitemap | 4% |
54
+ | Agent Card ([A2A](https://a2a-protocol.org)) | 7% | AI Well-Known | 3% |
55
+ | MCP | 7% | Content Negotiation (Markdown for Agents) | 0%* |
56
+ | SEO Basics | 7% | [RSL License](https://rslstandard.org) · Agent Access (cloaking) · Crawl Efficiency | 0%* |
78
57
 
79
- Or run directly without installing:
58
+ \* Informational in 3.x: reported in full, no effect on the score. Weighted in v4.0.
80
59
 
81
- ```bash
82
- npx ax-audit https://your-site.com
83
- ```
60
+ Every finding links to a step-by-step **[remediation guide](https://lucioduran.com/projects/ax-audit/guides)**.
84
61
 
85
62
  ## Usage
86
63
 
87
64
  ```bash
88
- # Full audit with colored terminal output
89
- ax-audit https://example.com
90
-
91
- # Batch audit: audit multiple URLs in a single run
92
- ax-audit https://example.com https://other-site.com https://third.dev
93
-
94
- # HTML report — self-contained, dark mode, shareable
95
- ax-audit https://example.com --output html > report.html
96
-
97
- # JSON output for CI/CD pipelines
98
- ax-audit https://example.com --json
99
-
100
- # Run only specific checks (validates IDs, errors on unknown)
101
- ax-audit https://example.com --checks llms-txt,robots-txt,agent-json
102
-
103
- # Custom timeout per request (default: 10s)
104
- ax-audit https://example.com --timeout 15000
105
-
106
- # Verbose mode — see every HTTP request, cache hit, and check score
107
- ax-audit https://example.com --verbose
108
-
109
- # Only show failures and warnings (hide passing findings)
110
- ax-audit https://example.com --only-failures
111
-
112
- # Save a baseline for future comparison
113
- ax-audit https://example.com --save-baseline baseline.json
114
-
115
- # Compare against a baseline — shows per-check score deltas
116
- ax-audit https://example.com --baseline baseline.json
117
-
118
- # Fail CI if any check regresses by more than 5 points
119
- ax-audit https://example.com --baseline baseline.json --fail-on-regression 5
120
- ```
121
-
122
- ### Baseline Comparison
123
-
124
- Track score changes over time by saving a baseline and comparing against it in subsequent runs:
125
-
126
- ```bash
127
- # First run — save the baseline
128
- ax-audit https://example.com --save-baseline .ax-baseline.json
129
-
130
- # Later — compare against the baseline
131
- ax-audit https://example.com --baseline .ax-baseline.json
132
- ```
133
-
134
- ```
135
- AX Audit Report
136
- https://example.com
137
- Baseline: 2026-04-15T12:00:00.000Z
138
-
139
- ████████████████████████████████░░░░░░░░ 82/100 Good ▲7
140
-
141
- LLMs.txt (100/100) ▲20
142
- Robots.txt (70/100) ▼10
143
- ...
144
-
145
- Regressions
146
- Robots.txt: 80 → 70 (▼10)
147
-
148
- Improvements
149
- LLMs.txt: 80 → 100 (▲20)
150
- ```
151
-
152
- Works with all output formats (terminal, JSON, HTML). In JSON mode, a `baselineDiff` object is included with per-check deltas.
153
-
154
- Use `--fail-on-regression <points>` in CI to fail the build if any individual check drops by more than the specified threshold.
155
-
156
- ### Batch Mode
157
-
158
- Pass multiple URLs to audit them sequentially. Each gets its own full report, followed by a summary table:
159
-
160
- ```
161
- ═══ Batch Summary ═══
162
-
163
- URL Score Grade
164
- ────────────────────────────────────────────────────────────
165
- https://example.com 92/100 Excellent
166
- https://other-site.com 45/100 Poor
167
-
168
- 2 URLs audited: 1 passed, 1 failed
169
- ████████████████████████████░░░░░░░░░░░░ 69/100 avg Fair
170
- ```
171
-
172
- Exit code: `0` if all URLs score >= 70, `1` if any fails.
173
-
174
- ### HTML Report
175
-
176
- Generate a self-contained HTML report you can open in any browser or share with your team:
177
-
178
- ```bash
179
- ax-audit https://example.com --output html > report.html
65
+ ax-audit https://example.com # full audit, terminal output
66
+ ax-audit https://a.com https://b.com --concurrency 2 # batch, in parallel
67
+ ax-audit https://example.com --output markdown # also: json, html
68
+ ax-audit https://example.com --checks llms-txt,rsl # subset of checks
69
+ ax-audit https://example.com --only-failures # hide passing findings
70
+ ax-audit https://example.com --baseline .ax-baseline.json --fail-on-regression 5
180
71
  ```
181
72
 
182
- Features: circular score gauge, dark/light mode, collapsible check sections, responsive design. Works for both single and batch audits.
73
+ Exit codes gate CI: `0` for score ≥ 70, `1` below. Full flag reference: **[docs/cli.md](docs/cli.md)** · CI recipes (PR comments, regression gates, scheduled audits): **[docs/ci.md](docs/ci.md)**.
183
74
 
184
75
  ## Programmatic API
185
76
 
186
- Full TypeScript support with all types exported.
187
-
188
77
  ```typescript
189
78
  import { audit, batchAudit } from 'ax-audit';
190
- import type { AuditReport, BatchAuditReport } from 'ax-audit';
191
-
192
- // Single URL
193
- const report: AuditReport = await audit({ url: 'https://example.com' });
194
- console.log(report.overallScore); // 0-100
195
- console.log(report.grade.label); // 'Excellent' | 'Good' | 'Fair' | 'Poor'
196
- console.log(report.results); // Individual check results with findings
197
-
198
- // Batch audit
199
- const batch: BatchAuditReport = await batchAudit(
200
- ['https://example.com', 'https://other.com'],
201
- { timeout: 10000 }
202
- );
203
- console.log(batch.summary.averageScore); // Average across all URLs
204
- console.log(batch.summary.passed); // Number of URLs scoring >= 70
205
- ```
206
-
207
- Also exports `calculateOverallScore`, `getGrade`, `checks`, `saveBaseline`, `loadBaseline`, `diffBaseline`, and `toBaselineData` for advanced usage.
208
-
209
- ## Scoring
210
-
211
- Each check returns a score from 0 to 100. The overall score is a weighted average across all checks.
212
79
 
213
- | Grade | Score | Exit Code |
214
- |---|---|---|
215
- | Excellent | 90 - 100 | `0` |
216
- | Good | 70 - 89 | `0` |
217
- | Fair | 50 - 69 | `1` |
218
- | Poor | 0 - 49 | `1` |
219
-
220
- Exit codes make it easy to gate CI/CD deployments on AX readiness.
221
-
222
- ## CI Integration
223
-
224
- ### GitHub Actions
225
-
226
- ```yaml
227
- - name: AX Audit
228
- run: npx ax-audit https://your-site.com
229
- # Fails the step if score < 70
80
+ const report = await audit({ url: 'https://example.com' });
81
+ report.overallScore; // 0–100
82
+ report.results; // per-check findings
230
83
  ```
231
84
 
232
- Save the report as an artifact:
233
-
234
- ```yaml
235
- - name: AX Audit (JSON)
236
- run: npx ax-audit https://your-site.com --json > ax-report.json
85
+ Full API and types: **[docs/api.md](docs/api.md)**.
237
86
 
238
- - uses: actions/upload-artifact@v4
239
- with:
240
- name: ax-audit-report
241
- path: ax-report.json
242
- ```
87
+ ## Documentation
243
88
 
244
- Fail on regressions using a committed baseline:
89
+ Start here:
245
90
 
246
- ```yaml
247
- - name: AX Audit (regression gate)
248
- run: npx ax-audit https://your-site.com --baseline .ax-baseline.json --fail-on-regression 5
249
- ```
91
+ | Document | Contents |
92
+ |---|---|
93
+ | [docs/getting-started.md](docs/getting-started.md) | First audit, reading the report, fixing in impact order |
94
+ | [docs/concepts.md](docs/concepts.md) | The AX standards landscape — llms.txt, A2A, MCP, RSL, Content Signals, Web Bot Auth |
250
95
 
251
- ## Available Checks
96
+ Reference:
252
97
 
253
- | Check ID | Use with `--checks` |
98
+ | Document | Contents |
254
99
  |---|---|
255
- | `llms-txt` | LLMs.txt spec + Content-Type |
256
- | `robots-txt` | AI crawler configuration (40+ crawlers) |
257
- | `html-rendering` | SSR / SPA-shell detection + semantic HTML |
258
- | `structured-data` | JSON-LD structured data |
259
- | `http-headers` | Security + AI discovery headers |
260
- | `agent-json` | A2A Agent Card + same-origin validation |
261
- | `mcp` | MCP server configuration |
262
- | `seo-basics` | title / description / canonical / lang / hreflang |
263
- | `security-txt` | RFC 9116 Security.txt |
264
- | `meta-tags` | AI meta tags + OpenGraph + Twitter Card |
265
- | `openapi` | OpenAPI specification |
266
- | `tls-https` | HTTPS + HTTP→HTTPS redirect + HSTS preload |
267
- | `sitemap` | sitemap.xml validation + freshness |
268
- | `well-known-ai` | Emerging AI discovery files |
269
- | `content-negotiation` | Markdown via `Accept: text/markdown` (informational) |
270
-
271
- ## Testing
100
+ | [docs/checks.md](docs/checks.md) | All 18 checks with **exact scoring** per finding, weights, scoring model |
101
+ | [docs/cli.md](docs/cli.md) | Every flag, output formats, exit codes, baseline workflow |
102
+ | [docs/api.md](docs/api.md) | `audit`, `batchAudit`, baselines, reporters, types, API-stability policy |
103
+ | [docs/ci.md](docs/ci.md) | GitHub Actions recipes: gates, PR comments, scheduled drift detection |
104
+ | [docs/architecture.md](docs/architecture.md) | Pipeline design, check anatomy, how to add a check, scoring policy |
105
+ | [docs/faq.md](docs/faq.md) | Troubleshooting, false positives, the `agent-access` verified-bots caveat |
106
+ | [Remediation guides](https://lucioduran.com/projects/ax-audit/guides) | Step-by-step fixes for every finding |
272
107
 
273
- ```bash
274
- npm test
275
- ```
108
+ The same documentation is browsable at [lucioduran.com/projects/ax-audit/docs](https://lucioduran.com/projects/ax-audit/docs), rendered from these files. Contributors: see [CONTRIBUTING.md](CONTRIBUTING.md) and [SECURITY.md](SECURITY.md).
109
+
110
+ ## Scoring
276
111
 
277
- 229 tests covering all 15 checks, the scorer, the HTTP fetcher (against a real local server), baseline comparison, HTML parsing utilities, and edge cases. Uses Node.js built-in test runner (`node:test`), no extra test dependencies.
112
+ | Grade | Score | Exit Code |
113
+ |---|---|---|
114
+ | Excellent | 90–100 | `0` |
115
+ | Good | 70–89 | `0` |
116
+ | Fair | 50–69 | `1` |
117
+ | Poor | 0–49 | `1` |
278
118
 
279
- ## Tech Stack
119
+ ## Tech
280
120
 
281
- - **TypeScript** with strict mode
282
- - **2 runtime dependencies**: `chalk` + `commander`
283
- - **Node.js 18+** built-in `fetch` (zero HTTP libraries)
284
- - Parallel check execution via `Promise.allSettled`
285
- - In-memory request caching per audit run
121
+ TypeScript strict mode · 2 runtime dependencies (`chalk`, `commander`) · Node 18+ built-in `fetch` · parallel checks via `Promise.allSettled` · per-run request cache with `Vary`-aware keys · transient-failure retries with backoff · 301 tests on `node:test` with zero test dependencies.
286
122
 
287
123
  ## Contributing
288
124
 
289
- Contributions are welcome. To add a new check:
125
+ Contributions are welcome — see **[docs/architecture.md](docs/architecture.md)** for the pipeline design, check anatomy, and the steps (code, tests, docs, remediation guide) a new check requires.
126
+
127
+ ## Related
290
128
 
291
- 1. Create `src/checks/your-check.ts` exporting `default` (async check function) and `meta` (CheckMeta)
292
- 2. Use `buildResult(meta, score, findings, start)` from `./utils.js` to return results
293
- 3. Register it in `src/checks/index.ts`
294
- 4. Add its weight to `CHECK_WEIGHTS` in `src/constants.ts`
129
+ - **[ax-init](https://github.com/lucioduran/ax-init)**: generate the AX files this tool audits
130
+ - **[ax-cite](https://github.com/lucioduran/ax-cite)**: embed AI-extractable structured data in your pages
295
131
 
296
132
  ## License
297
133
 
@@ -0,0 +1,16 @@
1
+ import type { CheckContext, CheckResult, CheckMeta } from '../types.js';
2
+ import { type BotEntry } from './robots-txt.js';
3
+ export declare const meta: CheckMeta;
4
+ /**
5
+ * Build a realistic crawler User-Agent for a given bot token. WAF and
6
+ * bot-management rules match on the token substring, which is what we need
7
+ * to trigger the same code path the real crawler would hit.
8
+ */
9
+ export declare function crawlerUserAgent(token: string): string;
10
+ /**
11
+ * Whether robots.txt expresses the intent to block this crawler: an explicit
12
+ * full Disallow for it, or a full wildcard Disallow with no explicit entry.
13
+ */
14
+ export declare function intentBlocked(entries: BotEntry[], crawler: string): boolean;
15
+ export default function check(ctx: CheckContext): Promise<CheckResult>;
16
+ //# sourceMappingURL=agent-access.d.ts.map
@@ -0,0 +1 @@
1
+ {"version":3,"file":"agent-access.d.ts","sourceRoot":"","sources":["../../src/checks/agent-access.ts"],"names":[],"mappings":"AAEA,OAAO,KAAK,EAAE,YAAY,EAAE,WAAW,EAAE,SAAS,EAAW,MAAM,aAAa,CAAC;AAGjF,OAAO,EAAmB,KAAK,QAAQ,EAAE,MAAM,iBAAiB,CAAC;AAEjE,eAAO,MAAM,IAAI,EAAE,SAKlB,CAAC;AASF;;;;GAIG;AACH,wBAAgB,gBAAgB,CAAC,KAAK,EAAE,MAAM,GAAG,MAAM,CAEtD;AAED;;;GAGG;AACH,wBAAgB,aAAa,CAAC,OAAO,EAAE,QAAQ,EAAE,EAAE,OAAO,EAAE,MAAM,GAAG,OAAO,CAK3E;AAED,wBAA8B,KAAK,CAAC,GAAG,EAAE,YAAY,GAAG,OAAO,CAAC,WAAW,CAAC,CAiF3E"}
@@ -0,0 +1,110 @@
1
+ import { CORE_AI_CRAWLERS } from '../constants.js';
2
+ import { guideUrl } from '../guide-urls.js';
3
+ import { buildResult } from './utils.js';
4
+ import { extractVisibleText } from './html-utils.js';
5
+ import { parseUserAgents } from './robots-txt.js';
6
+ export const meta = {
7
+ id: 'agent-access',
8
+ name: 'Agent Access',
9
+ description: 'Checks that AI crawler user-agents are not blocked or served reduced content (cloaking)',
10
+ weight: 0, // Informational in 3.x — will gain weight in v4.0 (score-affecting changes are treated as breaking).
11
+ };
12
+ /** Content below this fraction of the baseline visible text counts as "reduced". */
13
+ const REDUCED_CONTENT_RATIO = 0.5;
14
+ /** Baselines with less visible text than this are too small for meaningful content comparison. */
15
+ const MIN_BASELINE_TEXT = 200;
16
+ /**
17
+ * Build a realistic crawler User-Agent for a given bot token. WAF and
18
+ * bot-management rules match on the token substring, which is what we need
19
+ * to trigger the same code path the real crawler would hit.
20
+ */
21
+ export function crawlerUserAgent(token) {
22
+ return `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ${token}/1.0)`;
23
+ }
24
+ /**
25
+ * Whether robots.txt expresses the intent to block this crawler: an explicit
26
+ * full Disallow for it, or a full wildcard Disallow with no explicit entry.
27
+ */
28
+ export function intentBlocked(entries, crawler) {
29
+ const explicit = entries.find((e) => e.name.toLowerCase() === crawler.toLowerCase());
30
+ if (explicit)
31
+ return explicit.disallowed;
32
+ const wildcard = entries.find((e) => e.name === '*');
33
+ return wildcard?.disallowed ?? false;
34
+ }
35
+ export default async function check(ctx) {
36
+ const start = performance.now();
37
+ const findings = [];
38
+ const baseline = await ctx.fetch(ctx.url);
39
+ if (!baseline.ok) {
40
+ findings.push({
41
+ status: 'fail',
42
+ message: 'Baseline homepage request failed — cannot compare crawler access',
43
+ detail: `HTTP ${baseline.status || 'network error'}`,
44
+ learnMoreUrl: guideUrl(meta.id, 'baseline-unavailable'),
45
+ });
46
+ return buildResult(meta, 0, findings, start);
47
+ }
48
+ const baselineText = extractVisibleText(baseline.body).length;
49
+ const robotsRes = await ctx.fetch(`${ctx.url}/robots.txt`);
50
+ const robotsEntries = robotsRes.ok ? parseUserAgents(robotsRes.body) : [];
51
+ const outcomes = new Map();
52
+ for (const crawler of CORE_AI_CRAWLERS) {
53
+ const res = await ctx.fetch(ctx.url, { headers: { 'User-Agent': crawlerUserAgent(crawler) } });
54
+ const blockedByRobots = intentBlocked(robotsEntries, crawler);
55
+ if (!res.ok) {
56
+ outcomes.set(crawler, blockedByRobots ? 'blocked-consistent' : 'blocked');
57
+ const detail = `HTTP ${res.status || 'network error'} for User-Agent containing "${crawler}"`;
58
+ if (blockedByRobots) {
59
+ findings.push({
60
+ status: 'pass',
61
+ message: `${crawler} blocked at the server — consistent with its robots.txt Disallow`,
62
+ detail,
63
+ });
64
+ }
65
+ else {
66
+ findings.push({
67
+ status: 'warn',
68
+ message: `${crawler} is ${robotsRes.ok ? 'allowed in robots.txt' : 'not restricted'} but its User-Agent is blocked`,
69
+ detail,
70
+ hint: 'Your WAF or bot management rejects this crawler token even though robots.txt permits it — the block is invisible to you but fatal for the agent. ' +
71
+ 'Check your firewall rules and AI-bot toggles (e.g., Cloudflare "Block AI Crawlers"). ' +
72
+ 'Note: if your WAF verifies bots cryptographically (Web Bot Auth / verified-bots lists), the real crawler may still pass while this unverified probe is rejected — verify against your WAF logs.',
73
+ learnMoreUrl: guideUrl(meta.id, 'blocked-crawler'),
74
+ });
75
+ }
76
+ continue;
77
+ }
78
+ const text = extractVisibleText(res.body).length;
79
+ if (baselineText >= MIN_BASELINE_TEXT && text < baselineText * REDUCED_CONTENT_RATIO) {
80
+ outcomes.set(crawler, 'reduced');
81
+ findings.push({
82
+ status: 'warn',
83
+ message: `${crawler} receives reduced content (${text} vs ${baselineText} chars of visible text)`,
84
+ hint: 'The server returns 200 but serves this crawler substantially less content than a regular client — ' +
85
+ 'often an interstitial, a challenge page, or conditional rendering. Agents index what they receive.',
86
+ learnMoreUrl: guideUrl(meta.id, 'reduced-content'),
87
+ });
88
+ }
89
+ else {
90
+ outcomes.set(crawler, 'ok');
91
+ }
92
+ }
93
+ const okCount = [...outcomes.values()].filter((o) => o === 'ok').length;
94
+ if (okCount === CORE_AI_CRAWLERS.length) {
95
+ findings.unshift({
96
+ status: 'pass',
97
+ message: `All ${CORE_AI_CRAWLERS.length} core AI crawler user-agents receive equivalent responses`,
98
+ });
99
+ }
100
+ const credit = [...outcomes.values()].reduce((acc, o) => {
101
+ if (o === 'ok' || o === 'blocked-consistent')
102
+ return acc + 1;
103
+ if (o === 'reduced')
104
+ return acc + 0.5;
105
+ return acc;
106
+ }, 0);
107
+ const score = Math.round((credit / CORE_AI_CRAWLERS.length) * 100);
108
+ return buildResult(meta, score, findings, start);
109
+ }
110
+ //# sourceMappingURL=agent-access.js.map
@@ -0,0 +1 @@
1
+ {"version":3,"file":"agent-access.js","sourceRoot":"","sources":["../../src/checks/agent-access.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,gBAAgB,EAAE,MAAM,iBAAiB,CAAC;AACnD,OAAO,EAAE,QAAQ,EAAE,MAAM,kBAAkB,CAAC;AAE5C,OAAO,EAAE,WAAW,EAAE,MAAM,YAAY,CAAC;AACzC,OAAO,EAAE,kBAAkB,EAAE,MAAM,iBAAiB,CAAC;AACrD,OAAO,EAAE,eAAe,EAAiB,MAAM,iBAAiB,CAAC;AAEjE,MAAM,CAAC,MAAM,IAAI,GAAc;IAC7B,EAAE,EAAE,cAAc;IAClB,IAAI,EAAE,cAAc;IACpB,WAAW,EAAE,yFAAyF;IACtG,MAAM,EAAE,CAAC,EAAE,qGAAqG;CACjH,CAAC;AAEF,oFAAoF;AACpF,MAAM,qBAAqB,GAAG,GAAG,CAAC;AAClC,kGAAkG;AAClG,MAAM,iBAAiB,GAAG,GAAG,CAAC;AAI9B;;;;GAIG;AACH,MAAM,UAAU,gBAAgB,CAAC,KAAa;IAC5C,OAAO,kEAAkE,KAAK,OAAO,CAAC;AACxF,CAAC;AAED;;;GAGG;AACH,MAAM,UAAU,aAAa,CAAC,OAAmB,EAAE,OAAe;IAChE,MAAM,QAAQ,GAAG,OAAO,CAAC,IAAI,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,IAAI,CAAC,WAAW,EAAE,KAAK,OAAO,CAAC,WAAW,EAAE,CAAC,CAAC;IACrF,IAAI,QAAQ;QAAE,OAAO,QAAQ,CAAC,UAAU,CAAC;IACzC,MAAM,QAAQ,GAAG,OAAO,CAAC,IAAI,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,IAAI,KAAK,GAAG,CAAC,CAAC;IACrD,OAAO,QAAQ,EAAE,UAAU,IAAI,KAAK,CAAC;AACvC,CAAC;AAED,MAAM,CAAC,OAAO,CAAC,KAAK,UAAU,KAAK,CAAC,GAAiB;IACnD,MAAM,KAAK,GAAG,WAAW,CAAC,GAAG,EAAE,CAAC;IAChC,MAAM,QAAQ,GAAc,EAAE,CAAC;IAE/B,MAAM,QAAQ,GAAG,MAAM,GAAG,CAAC,KAAK,CAAC,GAAG,CAAC,GAAG,CAAC,CAAC;IAC1C,IAAI,CAAC,QAAQ,CAAC,EAAE,EAAE,CAAC;QACjB,QAAQ,CAAC,IAAI,CAAC;YACZ,MAAM,EAAE,MAAM;YACd,OAAO,EAAE,kEAAkE;YAC3E,MAAM,EAAE,QAAQ,QAAQ,CAAC,MAAM,IAAI,eAAe,EAAE;YACpD,YAAY,EAAE,QAAQ,CAAC,IAAI,CAAC,EAAE,EAAE,sBAAsB,CAAC;SACxD,CAAC,CAAC;QACH,OAAO,WAAW,CAAC,IAAI,EAAE,CAAC,EAAE,QAAQ,EAAE,KAAK,CAAC,CAAC;IAC/C,CAAC;IACD,MAAM,YAAY,GAAG,kBAAkB,CAAC,QAAQ,CAAC,IAAI,CAAC,CAAC,MAAM,CAAC;IAE9D,MAAM,SAAS,GAAG,MAAM,GAAG,CAAC,KAAK,CAAC,GAAG,GAAG,CAAC,GAAG,aAAa,CAAC,CAAC;IAC3D,MAAM,aAAa,GAAG,SAAS,CAAC,EAAE,CAAC,CAAC,CAAC,eAAe,CAAC,SAAS,CAAC,IAAI,CAAC,CAAC,CAAC,CAAC,EAAE,CAAC;IAE1E,MAAM,QAAQ,GAAG,IAAI,GAAG,EAAmB,CAAC;IAE5C,KAAK,MAAM,OAAO,IAAI,gBAAgB,EAAE,CAAC;QACvC,MAAM,GAAG,GAAG,MAAM,GAAG,CAAC,KAAK,CAAC,GAAG,CAAC,GAAG,EAAE,EAAE,OAAO,EAAE,EAAE,YAAY,EAAE,gBAAgB,CA
AC,OAAO,CAAC,EAAE,EAAE,CAAC,CAAC;QAC/F,MAAM,eAAe,GAAG,aAAa,CAAC,aAAa,EAAE,OAAO,CAAC,CAAC;QAE9D,IAAI,CAAC,GAAG,CAAC,EAAE,EAAE,CAAC;YACZ,QAAQ,CAAC,GAAG,CAAC,OAAO,EAAE,eAAe,CAAC,CAAC,CAAC,oBAAoB,CAAC,CAAC,CAAC,SAAS,CAAC,CAAC;YAC1E,MAAM,MAAM,GAAG,QAAQ,GAAG,CAAC,MAAM,IAAI,eAAe,+BAA+B,OAAO,GAAG,CAAC;YAC9F,IAAI,eAAe,EAAE,CAAC;gBACpB,QAAQ,CAAC,IAAI,CAAC;oBACZ,MAAM,EAAE,MAAM;oBACd,OAAO,EAAE,GAAG,OAAO,kEAAkE;oBACrF,MAAM;iBACP,CAAC,CAAC;YACL,CAAC;iBAAM,CAAC;gBACN,QAAQ,CAAC,IAAI,CAAC;oBACZ,MAAM,EAAE,MAAM;oBACd,OAAO,EAAE,GAAG,OAAO,OAAO,SAAS,CAAC,EAAE,CAAC,CAAC,CAAC,uBAAuB,CAAC,CAAC,CAAC,gBAAgB,gCAAgC;oBACnH,MAAM;oBACN,IAAI,EACF,mJAAmJ;wBACnJ,uFAAuF;wBACvF,iMAAiM;oBACnM,YAAY,EAAE,QAAQ,CAAC,IAAI,CAAC,EAAE,EAAE,iBAAiB,CAAC;iBACnD,CAAC,CAAC;YACL,CAAC;YACD,SAAS;QACX,CAAC;QAED,MAAM,IAAI,GAAG,kBAAkB,CAAC,GAAG,CAAC,IAAI,CAAC,CAAC,MAAM,CAAC;QACjD,IAAI,YAAY,IAAI,iBAAiB,IAAI,IAAI,GAAG,YAAY,GAAG,qBAAqB,EAAE,CAAC;YACrF,QAAQ,CAAC,GAAG,CAAC,OAAO,EAAE,SAAS,CAAC,CAAC;YACjC,QAAQ,CAAC,IAAI,CAAC;gBACZ,MAAM,EAAE,MAAM;gBACd,OAAO,EAAE,GAAG,OAAO,8BAA8B,IAAI,OAAO,YAAY,yBAAyB;gBACjG,IAAI,EACF,oGAAoG;oBACpG,oGAAoG;gBACtG,YAAY,EAAE,QAAQ,CAAC,IAAI,CAAC,EAAE,EAAE,iBAAiB,CAAC;aACnD,CAAC,CAAC;QACL,CAAC;aAAM,CAAC;YACN,QAAQ,CAAC,GAAG,CAAC,OAAO,EAAE,IAAI,CAAC,CAAC;QAC9B,CAAC;IACH,CAAC;IAED,MAAM,OAAO,GAAG,CAAC,GAAG,QAAQ,CAAC,MAAM,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,KAAK,IAAI,CAAC,CAAC,MAAM,CAAC;IACxE,IAAI,OAAO,KAAK,gBAAgB,CAAC,MAAM,EAAE,CAAC;QACxC,QAAQ,CAAC,OAAO,CAAC;YACf,MAAM,EAAE,MAAM;YACd,OAAO,EAAE,OAAO,gBAAgB,CAAC,MAAM,2DAA2D;SACnG,CAAC,CAAC;IACL,CAAC;IAED,MAAM,MAAM,GAAG,CAAC,GAAG,QAAQ,CAAC,MAAM,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,GAAG,EAAE,CAAC,EAAE,EAAE;QACtD,IAAI,CAAC,KAAK,IAAI,IAAI,CAAC,KAAK,oBAAoB;YAAE,OAAO,GAAG,GAAG,CAAC,CAAC;QAC7D,IAAI,CAAC,KAAK,SAAS;YAAE,OAAO,GAAG,GAAG,GAAG,CAAC;QACtC,OAAO,GAAG,CAAC;IACb,CAAC,EAAE,CAAC,CAAC,CAAC;IACN,MAAM,KAAK,GAAG,IAAI,CAAC,KAAK,CAAC,CAAC,MAAM,GAAG,gBAAgB,CAAC,MAAM,CAAC,GAAG,GAAG,CAAC,CAAC;IAEnE,OAAO,WAAW,CAAC,IAAI,EAAE,KAAK,EAAE,QAAQ,EAAE,KAAK,CAAC,CAAC;
AACnD,CAAC"}
@@ -0,0 +1,4 @@
+ import type { CheckContext, CheckResult, CheckMeta } from '../types.js';
+ export declare const meta: CheckMeta;
+ export default function check(ctx: CheckContext): Promise<CheckResult>;
+ //# sourceMappingURL=crawl-efficiency.d.ts.map
@@ -0,0 +1 @@
+ {"version":3,"file":"crawl-efficiency.d.ts","sourceRoot":"","sources":["../../src/checks/crawl-efficiency.ts"],"names":[],"mappings":"AACA,OAAO,KAAK,EAAE,YAAY,EAAE,WAAW,EAAE,SAAS,EAAW,MAAM,aAAa,CAAC;AAGjF,eAAO,MAAM,IAAI,EAAE,SAKlB,CAAC;AAMF,wBAA8B,KAAK,CAAC,GAAG,EAAE,YAAY,GAAG,OAAO,CAAC,WAAW,CAAC,CA8G3E"}