npm - ax-audit - Versions diffs - 3.1.0 → 3.6.0 - Mend

ax-audit 3.1.0 → 3.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (58)

package/CHANGELOG.md +60 -0
package/README.md +61 -225
package/dist/checks/agent-access.d.ts +16 -0
package/dist/checks/agent-access.d.ts.map +1 -0
package/dist/checks/agent-access.js +110 -0
package/dist/checks/agent-access.js.map +1 -0
package/dist/checks/crawl-efficiency.d.ts +4 -0
package/dist/checks/crawl-efficiency.d.ts.map +1 -0
package/dist/checks/crawl-efficiency.js +122 -0
package/dist/checks/crawl-efficiency.js.map +1 -0
package/dist/checks/index.d.ts.map +1 -1
package/dist/checks/index.js +6 -0
package/dist/checks/index.js.map +1 -1
package/dist/checks/robots-txt.d.ts +20 -0
package/dist/checks/robots-txt.d.ts.map +1 -1
package/dist/checks/robots-txt.js +111 -3
package/dist/checks/robots-txt.js.map +1 -1
package/dist/checks/rsl.d.ts +6 -0
package/dist/checks/rsl.d.ts.map +1 -0
package/dist/checks/rsl.js +252 -0
package/dist/checks/rsl.js.map +1 -0
package/dist/cli.d.ts.map +1 -1
package/dist/cli.js +20 -2
package/dist/cli.js.map +1 -1
package/dist/constants.d.ts +17 -0
package/dist/constants.d.ts.map +1 -1
package/dist/constants.js +39 -1
package/dist/constants.js.map +1 -1
package/dist/fetcher.d.ts +5 -1
package/dist/fetcher.d.ts.map +1 -1
package/dist/fetcher.js +32 -27
package/dist/fetcher.js.map +1 -1
package/dist/index.d.ts +2 -1
package/dist/index.d.ts.map +1 -1
package/dist/index.js +1 -0
package/dist/index.js.map +1 -1
package/dist/orchestrator.d.ts +2 -2
package/dist/orchestrator.d.ts.map +1 -1
package/dist/orchestrator.js +13 -6
package/dist/orchestrator.js.map +1 -1
package/dist/reporter/index.d.ts.map +1 -1
package/dist/reporter/index.js +7 -0
package/dist/reporter/index.js.map +1 -1
package/dist/reporter/markdown.d.ts +8 -0
package/dist/reporter/markdown.d.ts.map +1 -0
package/dist/reporter/markdown.js +76 -0
package/dist/reporter/markdown.js.map +1 -0
package/dist/types.d.ts +7 -1
package/dist/types.d.ts.map +1 -1
package/docs/api.md +200 -0
package/docs/architecture.md +88 -0
package/docs/checks.md +322 -0
package/docs/ci.md +89 -0
package/docs/cli.md +67 -0
package/docs/concepts.md +87 -0
package/docs/faq.md +77 -0
package/docs/getting-started.md +101 -0
package/package.json +2 -1

package/CHANGELOG.md CHANGED

@@ -2,6 +2,66 @@
 All notable changes to ax-audit are documented here.
+## [3.6.0] - 2026-06-06
+### Added
+- **Fetcher retries with exponential backoff**: transient failures (network errors, timeouts, and 408/425/429/500/502/503/504) are retried automatically. Configurable via `--retries <n>` (CLI, default 2) and `retries` (programmatic `AuditOptions`); backoff doubles from a 250ms base. Non-retryable responses (e.g. 404) short-circuit immediately. Previously a single transient timeout scored a check 0.
+- **Parallel batch auditing**: `--concurrency <n>` (CLI) and `concurrency` on the new `BatchOptions` type run multiple URL audits in parallel via an order-preserving work queue. Default remains sequential (1).
+- **Markdown reporter**: `--output markdown` emits a self-contained Markdown report (score, summary table, per-check findings with status emoji, baseline deltas) — ideal for CI logs and PR comments. Supported for single and batch audits. New exports: `renderMarkdown`, `renderBatchMarkdown`.
+- **Crawler list refresh**: added Google's official signed AI-agent user-agent `Google-Agent` (identity `https://agent.bot.goog`) to the known-crawlers list.
+- **CLI validation**: `--retries`, `--concurrency`, and `--output` now reject invalid values with a clear error.
+- **17 new tests** (301 total): fetcher retry behavior (against a flaky local server), batch ordering/concurrency, and the Markdown reporter.
+### Notes
+- No scoring changes. Retries can raise scores on flaky endpoints that previously timed out, but the scoring model itself is unchanged.
+## [3.5.0] - 2026-06-06
+### Added
+- **crawl-efficiency check (informational)**: measures the cost of crawling your pages across three dimensions. Compression — rewards Brotli, accepts gzip/deflate/zstd (suggesting br), warns when uncompressed (−30). Conditional GET — checks for an `ETag` or `Last-Modified` validator, then issues a follow-up request with `If-None-Match` / `If-Modified-Since` and verifies the server returns `304 Not Modified` (−30 for no validator, −15 when 304 is not honored). Response size — warns on pages over 500 KB (−5) and 2 MB (−10) of decompressed HTML. The probe advertises `Accept-Encoding: br, gzip, deflate`; the conditional request reuses the per-request header support added in 3.1.0.
+- **12 new tests** (284 total).
+### Scoring
+- The new check carries **weight 0 in 3.x** (informational), consistent with 3.1.0–3.4.0.
+## [3.4.0] - 2026-06-06
+### Added
+- **agent-access check (informational)**: cloaking and blocking detection. Probes the homepage with realistic user-agents for each of the 8 core AI crawlers (GPTBot, ClaudeBot, ChatGPT-User, Claude-SearchBot, Google-Extended, PerplexityBot, OAI-SearchBot, CCBot) and compares status and visible-text volume against the default-UA baseline. Flags the failure mode invisible to operators: robots.txt allows a crawler while the WAF returns 403 to its user-agent (Cloudflare's "Block AI Crawlers" toggle produces exactly this). Blocks consistent with an explicit robots.txt `Disallow` (or wildcard block) are reported as intentional and not penalized. Responses with under 50% of baseline visible text count as reduced content (half credit); content comparison is skipped for baselines under 200 chars to avoid SPA-shell noise. Hints note the verified-bots caveat: WAFs using Web Bot Auth / IP verification may pass the real crawler while rejecting this unverified probe.
+- `parseUserAgents` and `BotEntry` are now exported from the robots-txt check for reuse.
+- **12 new tests** (272 total).
+### Scoring
+- Internal score is the credit ratio across the 8 probes; the check carries **weight 0 in 3.x** (informational), consistent with 3.1.0–3.3.0.
+## [3.3.0] - 2026-06-06
+### Added
+- **rsl check (informational)**: validates [Really Simple Licensing 1.0](https://rslstandard.org/rsl) — the machine-readable content-licensing standard endorsed by 1,500+ publishers (Reddit, Yahoo, Medium, O'Reilly) with infrastructure support from Cloudflare and Fastly. Discovery via all three spec mechanisms: robots.txt `License:` directive (absolute-URI enforcement per §4.4.1), HTTP `Link: rel="license"; type="application/rsl+xml"` header, and `<link rel="license" type="application/rsl+xml">` (plain CC-style license links without the RSL media type are ignored). Document validation: `application/rsl+xml` Content-Type (−5), `<rsl>` root + `https://rslstandard.org/rsl` namespace, required `url` attribute on every `<content>` (empty value allowed per §3.3), `<license>` presence, `permits`/`prohibits` type and token vocabulary (`usage`: all/ai-all/ai-train/ai-input/ai-index/search; `user`; `geo` as ISO 3166-1 alpha-2), and `payment` types.
+- **21 new tests** (260 total) covering the three discovery mechanisms, vocabulary enforcement, namespace/root/structure validation, XML-comment stripping, and score caps.
+### Scoring
+- The new check carries **weight 0 in 3.x** (informational), consistent with 3.1.0/3.2.0: no impact on existing scores or baselines until v4.0.
+## [3.2.0] - 2026-06-06
+### Added
+- **Content Signals Policy support in robots-txt** ([contentsignals.org](https://contentsignals.org), CC0): the check now parses `Content-Signal:` directives — the machine-readable `search` / `ai-input` / `ai-train` preferences that Cloudflare serves by default on its 3.8M+ managed robots.txt domains. Declared signals are reported per User-agent group; malformed segments, unknown signal names, and directives placed outside a User-agent group produce warnings. Absence of the directive produces an informational nudge. The group parser now also treats `Content-Signal` as a group-closing directive, fixing potential User-agent group leakage.
+- **10 new tests** (239 total) covering declaration reporting, malformed/unknown signals, shared User-agent groups, case-insensitivity, out-of-group placement, and score neutrality.
+### Scoring
+- All Content Signals findings are **informational in 3.x**: they never alter the robots-txt score, so existing scores and baselines are unchanged.
 ## [3.1.0] - 2026-06-06
 ### Added
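
The 3.6.0 entry above describes fetcher retries with exponential backoff: a 250ms base that doubles, two retries by default, and an immediate short-circuit on non-retryable responses. The compiled fetcher is not shown in this part of the diff, so the following is only a minimal sketch of that behavior. The names `withRetries`, `backoffDelay`, and `RETRYABLE_STATUS` are hypothetical and are not the package's API.

```typescript
// Hypothetical sketch of the retry policy described in the changelog;
// not the package's actual fetcher code.
const RETRYABLE_STATUS = new Set([408, 425, 429, 500, 502, 503, 504]);

/** Delay before retry number `attempt` (0-based): 250, 500, 1000, ... ms. */
function backoffDelay(attempt: number, baseMs = 250): number {
  return baseMs * 2 ** attempt;
}

type SimpleResponse = { status: number };

async function withRetries(
  request: () => Promise<SimpleResponse>,
  retries = 2,
  // Injectable so callers (and tests) can avoid real waiting.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<SimpleResponse> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await request();
      // Successes and non-retryable responses (e.g. 404) return immediately;
      // so does the last attempt, whatever its status.
      if (!RETRYABLE_STATUS.has(res.status) || attempt === retries) return res;
    } catch (err) {
      // Network errors and timeouts are treated as transient.
      lastError = err;
      if (attempt === retries) throw err;
    }
    await sleep(backoffDelay(attempt));
  }
  throw lastError; // Unreachable; keeps the return type total.
}
```

With the default of two retries, a request that keeps failing is attempted three times in total, waiting 250ms and then 500ms between attempts.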

package/README.md CHANGED

@@ -25,273 +25,109 @@ npx ax-audit https://your-site.com
     PASS  /llms.txt exists
     PASS  /llms.txt Content-Type OK (text/plain)
     PASS  H1 heading: "Lucio Duran — Personal Portfolio"
-    PASS  /llms-full.txt also available (bonus)
   Robots.txt (100/100)
     PASS  All 8 core AI crawlers explicitly configured
-    PASS  32/47 known AI crawlers have explicit rules
-  HTML Rendering (90/100)
-    PASS  Server-rendered content detected (473 words)
-    PASS  Semantic landmarks present (main, article, header, footer, nav)
-    PASS  Single <h1> heading
-    PASS  3/3 <img> tags have alt attributes
-  TLS / HTTPS (100/100)
-    PASS  Site is served over HTTPS
-    PASS  HTTP requests redirect to HTTPS
-    PASS  HSTS preload-eligible
+    PASS  Content signals declared for User-agent: * — search=yes, ai-train=no
+  Content Negotiation (100/100)
+    PASS  Homepage serves Markdown via content negotiation (Accept: text/markdown)
+    PASS  Markdown is ~95% lighter than the HTML representation
   ...
 ```
 ## Why
-AI agents and LLMs are increasingly crawling, indexing, and interacting with websites. Just like Lighthouse audits web performance and axe-core audits accessibility, **ax-audit** tells you how ready your site is for the AI agent ecosystem.
+AI agents and LLMs are increasingly crawling, indexing, and interacting with websites. Just like Lighthouse audits web performance and axe-core audits accessibility, **ax-audit** tells you how ready your site is for the AI agent ecosystem — discovery files, crawler policy, licensing, content negotiation, and the failure modes invisible to operators (like a WAF blocking crawlers your robots.txt allows).
 ## What it checks
-| Check | What it audits | Weight |
-|---|---|---|
-| **LLMs.txt** | `/llms.txt` presence, [llmstxt.org](https://llmstxt.org) spec, Content-Type | 11% |
-| **Robots.txt** | AI crawler configuration (40+ known crawlers), wildcard detection, partial path restrictions | 11% |
-| **HTML Rendering** | Server-rendered content, semantic landmarks, SPA-shell detection, alt coverage | 9% |
-| **Structured Data** | JSON-LD on homepage (schema.org, `@graph`, entity types) | 9% |
-| **HTTP Headers** | Security headers + AI discovery `Link` headers + CORS on `.well-known` | 9% |
-| **Agent Card** | `/.well-known/agent.json` [A2A protocol](https://a2a-protocol.org) + same-origin url + skill quality | 7% |
-| **MCP** | `/.well-known/mcp.json` [Model Context Protocol](https://modelcontextprotocol.io) server config | 7% |
-| **SEO Basics** | `<title>`, meta description, canonical, `<html lang>`, charset, viewport, hreflang | 7% |
-| **Security.txt** | `/.well-known/security.txt` [RFC 9116](https://www.rfc-editor.org/rfc/rfc9116) compliance | 6% |
-| **Meta Tags** | AI meta tags (`ai:*`), `rel="alternate"`, `rel="me"`, OpenGraph + Twitter Card completeness | 6% |
-| **OpenAPI** | `/.well-known/openapi.json` presence, schema validity, Content-Type | 6% |
-| **TLS / HTTPS** | HTTPS, HTTP→HTTPS redirect, HSTS with `preload` + `includeSubDomains` | 5% |
-| **Sitemap** | `sitemap.xml` (or `Sitemap:` from robots.txt) — XML validity, `<lastmod>` coverage, freshness, sitemap-index handling | 4% |
-| **AI Well-Known** | Emerging files: `/.well-known/ai.txt`, `genai.txt`, `ai-plugin.json`, `agents.json`, `nlweb.json` | 3% |
-| **Content Negotiation** | Markdown for agents — `Accept: text/markdown` negotiation, `Vary: Accept`, `rel="alternate"` fallback | 0%* |
-\* **Content Negotiation** is informational in 3.x: it runs and reports findings but does not affect the overall score. It will gain weight in v4.0.
-## Install
+18 checks — 14 weighted, 4 informational. Full reference: **[docs/checks.md](docs/checks.md)**.
-```bash
-npm install -g ax-audit
-```
+| Check | Weight | Check | Weight |
+|---|---|---|---|
+| LLMs.txt | 11% | Security.txt | 6% |
+| Robots.txt + [Content Signals](https://contentsignals.org) | 11% | Meta Tags (OG / Twitter / AI) | 6% |
+| HTML Rendering | 9% | OpenAPI | 6% |
+| Structured Data (JSON-LD) | 9% | TLS / HTTPS | 5% |
+| HTTP Headers | 9% | Sitemap | 4% |
+| Agent Card ([A2A](https://a2a-protocol.org)) | 7% | AI Well-Known | 3% |
+| MCP | 7% | Content Negotiation (Markdown for Agents) | 0%* |
+| SEO Basics | 7% | [RSL License](https://rslstandard.org) · Agent Access (cloaking) · Crawl Efficiency | 0%* |
-Or run directly without installing:
+\* Informational in 3.x: reported in full, no effect on the score. Weighted in v4.0.
-```bash
-npx ax-audit https://your-site.com
-```
+Every finding links to a step-by-step **[remediation guide](https://lucioduran.com/projects/ax-audit/guides)**.
 ## Usage
 ```bash
-# Full audit with colored terminal output
-ax-audit https://example.com
-# Batch audit — audit multiple URLs in a single run
-ax-audit https://example.com https://other-site.com https://third.dev
-# HTML report — self-contained, dark mode, shareable
-ax-audit https://example.com --output html > report.html
-# JSON output for CI/CD pipelines
-ax-audit https://example.com --json
-# Run only specific checks (validates IDs, errors on unknown)
-ax-audit https://example.com --checks llms-txt,robots-txt,agent-json
-# Custom timeout per request (default: 10s)
-ax-audit https://example.com --timeout 15000
-# Verbose mode — see every HTTP request, cache hit, and check score
-ax-audit https://example.com --verbose
-# Only show failures and warnings (hide passing findings)
-ax-audit https://example.com --only-failures
-# Save a baseline for future comparison
-ax-audit https://example.com --save-baseline baseline.json
-# Compare against a baseline — shows per-check score deltas
-ax-audit https://example.com --baseline baseline.json
-# Fail CI if any check regresses by more than 5 points
-ax-audit https://example.com --baseline baseline.json --fail-on-regression 5
-```
-### Baseline Comparison
-Track score changes over time by saving a baseline and comparing against it in subsequent runs:
-```bash
-# First run — save the baseline
-ax-audit https://example.com --save-baseline .ax-baseline.json
-# Later — compare against the baseline
-ax-audit https://example.com --baseline .ax-baseline.json
-```
-```
-  AX Audit Report
-  https://example.com
-  Baseline: 2026-04-15T12:00:00.000Z
-  ████████████████████████████████░░░░░░░░  82/100  Good  ▲7
-  LLMs.txt (100/100)  ▲20
-  Robots.txt (70/100)  ▼10
-  ...
-  Regressions
-    Robots.txt: 80 → 70 (▼10)
-  Improvements
-    LLMs.txt: 80 → 100 (▲20)
-```
-Works with all output formats (terminal, JSON, HTML). In JSON mode, a `baselineDiff` object is included with per-check deltas.
-Use `--fail-on-regression <points>` in CI to fail the build if any individual check drops by more than the specified threshold.
-### Batch Mode
-Pass multiple URLs to audit them sequentially. Each gets its own full report, followed by a summary table:
-```
-  ═══ Batch Summary ═══
-  URL                                     Score       Grade
-  ────────────────────────────────────────────────────────────
-  https://example.com                     92/100    Excellent
-  https://other-site.com                  45/100         Poor
-  2 URLs audited: 1 passed, 1 failed
-  ████████████████████████████░░░░░░░░░░░░  69/100 avg  Fair
-```
-Exit code: `0` if all URLs score >= 70, `1` if any fails.
-### HTML Report
-Generate a self-contained HTML report you can open in any browser or share with your team:
-```bash
-ax-audit https://example.com --output html > report.html
+ax-audit https://example.com                          # full audit, terminal output
+ax-audit https://a.com https://b.com --concurrency 2  # batch, in parallel
+ax-audit https://example.com --output markdown        # also: json, html
+ax-audit https://example.com --checks llms-txt,rsl    # subset of checks
+ax-audit https://example.com --only-failures          # hide passing findings
+ax-audit https://example.com --baseline .ax-baseline.json --fail-on-regression 5
 ```
-Features: circular score gauge, dark/light mode, collapsible check sections, responsive design. Works for both single and batch audits.
+Exit codes gate CI: `0` for score ≥ 70, `1` below. Full flag reference: **[docs/cli.md](docs/cli.md)** · CI recipes (PR comments, regression gates, scheduled audits): **[docs/ci.md](docs/ci.md)**.
 ## Programmatic API
-Full TypeScript support with all types exported.
 ```typescript
 import { audit, batchAudit } from 'ax-audit';
-import type { AuditReport, BatchAuditReport } from 'ax-audit';
-// Single URL
-const report: AuditReport = await audit({ url: 'https://example.com' });
-console.log(report.overallScore); // 0-100
-console.log(report.grade.label);  // 'Excellent' | 'Good' | 'Fair' | 'Poor'
-console.log(report.results);      // Individual check results with findings
-// Batch audit
-const batch: BatchAuditReport = await batchAudit(
-  ['https://example.com', 'https://other.com'],
-  { timeout: 10000 }
-);
-console.log(batch.summary.averageScore); // Average across all URLs
-console.log(batch.summary.passed);       // Number of URLs scoring >= 70
-```
-Also exports `calculateOverallScore`, `getGrade`, `checks`, `saveBaseline`, `loadBaseline`, `diffBaseline`, and `toBaselineData` for advanced usage.
-## Scoring
-Each check returns a score from 0 to 100. The overall score is a weighted average across all checks.
-| Grade | Score | Exit Code |
-|---|---|---|
-| Excellent | 90 - 100 | `0` |
-| Good | 70 - 89 | `0` |
-| Fair | 50 - 69 | `1` |
-| Poor | 0 - 49 | `1` |
-Exit codes make it easy to gate CI/CD deployments on AX readiness.
-## CI Integration
-### GitHub Actions
-```yaml
-- name: AX Audit
-  run: npx ax-audit https://your-site.com
-  # Fails the step if score < 70
+const report = await audit({ url: 'https://example.com' });
+report.overallScore; // 0–100
+report.results;      // per-check findings
 ```
-Save the report as an artifact:
-```yaml
-- name: AX Audit (JSON)
-  run: npx ax-audit https://your-site.com --json > ax-report.json
+Full API and types: **[docs/api.md](docs/api.md)**.
-- uses: actions/upload-artifact@v4
-  with:
-    name: ax-audit-report
-    path: ax-report.json
-```
+## Documentation
-Fail on regressions using a committed baseline:
+Start here:
-```yaml
-- name: AX Audit (regression gate)
-  run: npx ax-audit https://your-site.com --baseline .ax-baseline.json --fail-on-regression 5
-```
+| Document | Contents |
+|---|---|
+| [docs/getting-started.md](docs/getting-started.md) | First audit, reading the report, fixing in impact order |
+| [docs/concepts.md](docs/concepts.md) | The AX standards landscape — llms.txt, A2A, MCP, RSL, Content Signals, Web Bot Auth |
-## Available Checks
+Reference:
-| Check ID | Use with `--checks` |
+| Document | Contents |
 |---|---|
-| `llms-txt` | LLMs.txt spec + Content-Type |
-| `robots-txt` | AI crawler configuration (40+ crawlers) |
-| `html-rendering` | SSR / SPA-shell detection + semantic HTML |
-| `structured-data` | JSON-LD structured data |
-| `http-headers` | Security + AI discovery headers |
-| `agent-json` | A2A Agent Card + same-origin validation |
-| `mcp` | MCP server configuration |
-| `seo-basics` | title / description / canonical / lang / hreflang |
-| `security-txt` | RFC 9116 Security.txt |
-| `meta-tags` | AI meta tags + OpenGraph + Twitter Card |
-| `openapi` | OpenAPI specification |
-| `tls-https` | HTTPS + HTTP→HTTPS redirect + HSTS preload |
-| `sitemap` | sitemap.xml validation + freshness |
-| `well-known-ai` | Emerging AI discovery files |
-| `content-negotiation` | Markdown via `Accept: text/markdown` (informational) |
-## Testing
+| [docs/checks.md](docs/checks.md) | All 18 checks with **exact scoring** per finding, weights, scoring model |
+| [docs/cli.md](docs/cli.md) | Every flag, output formats, exit codes, baseline workflow |
+| [docs/api.md](docs/api.md) | `audit`, `batchAudit`, baselines, reporters, types, API-stability policy |
+| [docs/ci.md](docs/ci.md) | GitHub Actions recipes: gates, PR comments, scheduled drift detection |
+| [docs/architecture.md](docs/architecture.md) | Pipeline design, check anatomy, how to add a check, scoring policy |
+| [docs/faq.md](docs/faq.md) | Troubleshooting, false positives, the `agent-access` verified-bots caveat |
+| [Remediation guides](https://lucioduran.com/projects/ax-audit/guides) | Step-by-step fixes for every finding |
-```bash
-npm test
-```
+The same documentation is browsable at [lucioduran.com/projects/ax-audit/docs](https://lucioduran.com/projects/ax-audit/docs), rendered from these files. Contributors: see [CONTRIBUTING.md](CONTRIBUTING.md) and [SECURITY.md](SECURITY.md).
+## Scoring
-229 tests covering all 15 checks, the scorer, the HTTP fetcher (against a real local server), baseline comparison, HTML parsing utilities, and edge cases. Uses Node.js built-in test runner (`node:test`), no extra test dependencies.
+| Grade | Score | Exit Code |
+|---|---|---|
+| Excellent | 90–100 | `0` |
+| Good | 70–89 | `0` |
+| Fair | 50–69 | `1` |
+| Poor | 0–49 | `1` |
-## Tech Stack
+## Tech
-- **TypeScript** with strict mode
-- **2 runtime dependencies**: `chalk` + `commander`
-- **Node.js 18+** built-in `fetch` (zero HTTP libraries)
-- Parallel check execution via `Promise.allSettled`
-- In-memory request caching per audit run
+TypeScript strict mode · 2 runtime dependencies (`chalk`, `commander`) · Node 18+ built-in `fetch` · parallel checks via `Promise.allSettled` · per-run request cache with `Vary`-aware keys · transient-failure retries with backoff · 301 tests on `node:test` with zero test dependencies.
 ## Contributing
-Contributions are welcome. To add a new check:
+Contributions are welcome — see **[docs/architecture.md](docs/architecture.md)** for the pipeline design, check anatomy, and the steps (code, tests, docs, remediation guide) a new check requires.
+## Related
-1. Create `src/checks/your-check.ts` exporting `default` (async check function) and `meta` (CheckMeta)
-2. Use `buildResult(meta, score, findings, start)` from `./utils.js` to return results
-3. Register it in `src/checks/index.ts`
-4. Add its weight to `CHECK_WEIGHTS` in `src/constants.ts`
+- **[ax-init](https://github.com/lucioduran/ax-init)** — generate the AX files this tool audits
+- **[ax-cite](https://github.com/lucioduran/ax-cite)** — embed AI-extractable structured data in your pages
 ## License
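
The `--concurrency` flag in the README usage above is described in the changelog as an "order-preserving work queue". The orchestrator source is not part of this excerpt, so this is a sketch of one common way to build such a queue, under the assumption that results are written back by input index. The helper name `mapWithConcurrency` is hypothetical.

```typescript
// Hypothetical sketch of an order-preserving work queue; the package's
// orchestrator may be implemented differently.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  concurrency: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // Shared cursor: each lane claims the next unprocessed index.

  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      // Writing by index keeps output order equal to input order,
      // regardless of which task finishes first.
      results[index] = await worker(items[index] as T, index);
    }
  }

  const lanes = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}
```

A concurrency of 1 degenerates to sequential processing, which matches the changelog's stated default.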

package/dist/checks/agent-access.d.ts ADDED

@@ -0,0 +1,16 @@
+import type { CheckContext, CheckResult, CheckMeta } from '../types.js';
+import { type BotEntry } from './robots-txt.js';
+export declare const meta: CheckMeta;
+/**
+ * Build a realistic crawler User-Agent for a given bot token. WAF and
+ * bot-management rules match on the token substring, which is what we need
+ * to trigger the same code path the real crawler would hit.
+ */
+export declare function crawlerUserAgent(token: string): string;
+/**
+ * Whether robots.txt expresses the intent to block this crawler: an explicit
+ * full Disallow for it, or a full wildcard Disallow with no explicit entry.
+ */
+export declare function intentBlocked(entries: BotEntry[], crawler: string): boolean;
+export default function check(ctx: CheckContext): Promise<CheckResult>;
+//# sourceMappingURL=agent-access.d.ts.map
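
To illustrate the `intentBlocked` declaration above, the sketch below re-creates its documented precedence: an explicit entry for the crawler wins, otherwise the wildcard entry applies. The logic mirrors the compiled `agent-access.js` in this diff. The `BotEntryLike` interface is an assumption: the real `BotEntry` type is not shown here, and only the `name` and `disallowed` fields are inferred from how the compiled code uses it.

```typescript
// Assumed minimal shape; the real BotEntry may carry more fields.
interface BotEntryLike {
  name: string;
  disallowed: boolean;
}

// Same precedence as the compiled agent-access.js: explicit entry first
// (case-insensitive), then the wildcard group, else "not blocked".
function intentBlocked(entries: BotEntryLike[], crawler: string): boolean {
  const explicit = entries.find((e) => e.name.toLowerCase() === crawler.toLowerCase());
  if (explicit) return explicit.disallowed;
  const wildcard = entries.find((e) => e.name === '*');
  return wildcard?.disallowed ?? false;
}

const entries: BotEntryLike[] = [
  { name: '*', disallowed: true },       // wildcard block
  { name: 'GPTBot', disallowed: false }, // explicit allow overrides it
];

const gptBotBlocked = intentBlocked(entries, 'gptbot'); // explicit entry, matched case-insensitively
const ccBotBlocked = intentBlocked(entries, 'CCBot');   // no explicit entry, falls back to '*'
const noRulesBlocked = intentBlocked([], 'ClaudeBot');  // empty robots.txt: nothing expresses a block
```

Here `gptBotBlocked` is `false`, `ccBotBlocked` is `true`, and `noRulesBlocked` is `false`.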

package/dist/checks/agent-access.d.ts.map ADDED

@@ -0,0 +1 @@

+ {"version":3,"file":"agent-access.d.ts","sourceRoot":"","sources":["../../src/checks/agent-access.ts"],"names":[],"mappings":"AAEA,OAAO,KAAK,EAAE,YAAY,EAAE,WAAW,EAAE,SAAS,EAAW,MAAM,aAAa,CAAC;AAGjF,OAAO,EAAmB,KAAK,QAAQ,EAAE,MAAM,iBAAiB,CAAC;AAEjE,eAAO,MAAM,IAAI,EAAE,SAKlB,CAAC;AASF;;;;GAIG;AACH,wBAAgB,gBAAgB,CAAC,KAAK,EAAE,MAAM,GAAG,MAAM,CAEtD;AAED;;;GAGG;AACH,wBAAgB,aAAa,CAAC,OAAO,EAAE,QAAQ,EAAE,EAAE,OAAO,EAAE,MAAM,GAAG,OAAO,CAK3E;AAED,wBAA8B,KAAK,CAAC,GAAG,EAAE,YAAY,GAAG,OAAO,CAAC,WAAW,CAAC,CAiF3E"}

package/dist/checks/agent-access.js ADDED

@@ -0,0 +1,110 @@
+import { CORE_AI_CRAWLERS } from '../constants.js';
+import { guideUrl } from '../guide-urls.js';
+import { buildResult } from './utils.js';
+import { extractVisibleText } from './html-utils.js';
+import { parseUserAgents } from './robots-txt.js';
+export const meta = {
+    id: 'agent-access',
+    name: 'Agent Access',
+    description: 'Checks that AI crawler user-agents are not blocked or served reduced content (cloaking)',
+    weight: 0, // Informational in 3.x — will gain weight in v4.0 (score-affecting changes are treated as breaking).
+};
+/** Content below this fraction of the baseline visible text counts as "reduced". */
+const REDUCED_CONTENT_RATIO = 0.5;
+/** Baselines with less visible text than this are too small for meaningful content comparison. */
+const MIN_BASELINE_TEXT = 200;
+/**
+ * Build a realistic crawler User-Agent for a given bot token. WAF and
+ * bot-management rules match on the token substring, which is what we need
+ * to trigger the same code path the real crawler would hit.
+ */
+export function crawlerUserAgent(token) {
+    return `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ${token}/1.0)`;
+}
+/**
+ * Whether robots.txt expresses the intent to block this crawler: an explicit
+ * full Disallow for it, or a full wildcard Disallow with no explicit entry.
+ */
+export function intentBlocked(entries, crawler) {
+    const explicit = entries.find((e) => e.name.toLowerCase() === crawler.toLowerCase());
+    if (explicit)
+        return explicit.disallowed;
+    const wildcard = entries.find((e) => e.name === '*');
+    return wildcard?.disallowed ?? false;
+}
+export default async function check(ctx) {
+    const start = performance.now();
+    const findings = [];
+    const baseline = await ctx.fetch(ctx.url);
+    if (!baseline.ok) {
+        findings.push({
+            status: 'fail',
+            message: 'Baseline homepage request failed — cannot compare crawler access',
+            detail: `HTTP ${baseline.status || 'network error'}`,
+            learnMoreUrl: guideUrl(meta.id, 'baseline-unavailable'),
+        });
+        return buildResult(meta, 0, findings, start);
+    }
+    const baselineText = extractVisibleText(baseline.body).length;
+    const robotsRes = await ctx.fetch(`${ctx.url}/robots.txt`);
+    const robotsEntries = robotsRes.ok ? parseUserAgents(robotsRes.body) : [];
+    const outcomes = new Map();
+    for (const crawler of CORE_AI_CRAWLERS) {
+        const res = await ctx.fetch(ctx.url, { headers: { 'User-Agent': crawlerUserAgent(crawler) } });
+        const blockedByRobots = intentBlocked(robotsEntries, crawler);
+        if (!res.ok) {
+            outcomes.set(crawler, blockedByRobots ? 'blocked-consistent' : 'blocked');
+            const detail = `HTTP ${res.status || 'network error'} for User-Agent containing "${crawler}"`;
+            if (blockedByRobots) {
+                findings.push({
+                    status: 'pass',
+                    message: `${crawler} blocked at the server — consistent with its robots.txt Disallow`,
+                    detail,
+                });
+            }
+            else {
+                findings.push({
+                    status: 'warn',
+                    message: `${crawler} is ${robotsRes.ok ? 'allowed in robots.txt' : 'not restricted'} but its User-Agent is blocked`,
+                    detail,
+                    hint: 'Your WAF or bot management rejects this crawler token even though robots.txt permits it — the block is invisible to you but fatal for the agent. ' +
+                        'Check your firewall rules and AI-bot toggles (e.g., Cloudflare "Block AI Crawlers"). ' +
+                        'Note: if your WAF verifies bots cryptographically (Web Bot Auth / verified-bots lists), the real crawler may still pass while this unverified probe is rejected — verify against your WAF logs.',
+                    learnMoreUrl: guideUrl(meta.id, 'blocked-crawler'),
+                });
+            }
+            continue;
+        }
+        const text = extractVisibleText(res.body).length;
+        if (baselineText >= MIN_BASELINE_TEXT && text < baselineText * REDUCED_CONTENT_RATIO) {
+            outcomes.set(crawler, 'reduced');
+            findings.push({
+                status: 'warn',
+                message: `${crawler} receives reduced content (${text} vs ${baselineText} chars of visible text)`,
+                hint: 'The server returns 200 but serves this crawler substantially less content than a regular client — ' +
+                    'often an interstitial, a challenge page, or conditional rendering. Agents index what they receive.',
+                learnMoreUrl: guideUrl(meta.id, 'reduced-content'),
+            });
+        }
+        else {
+            outcomes.set(crawler, 'ok');
+        }
+    }
+    const okCount = [...outcomes.values()].filter((o) => o === 'ok').length;
+    if (okCount === CORE_AI_CRAWLERS.length) {
+        findings.unshift({
+            status: 'pass',
+            message: `All ${CORE_AI_CRAWLERS.length} core AI crawler user-agents receive equivalent responses`,
+        });
+    }
+    const credit = [...outcomes.values()].reduce((acc, o) => {
+        if (o === 'ok' || o === 'blocked-consistent')
+            return acc + 1;
+        if (o === 'reduced')
+            return acc + 0.5;
+        return acc;
+    }, 0);
+    const score = Math.round((credit / CORE_AI_CRAWLERS.length) * 100);
+    return buildResult(meta, score, findings, start);
+}
+//# sourceMappingURL=agent-access.js.map
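
As a worked example of the credit-ratio scoring at the end of the file above, the sketch below isolates the reduction into a standalone function. The name `accessScore` and the `Outcome` type alias are hypothetical. One difference from the compiled code is that this version divides by the number of outcomes passed in rather than by `CORE_AI_CRAWLERS.length`; the two agree when every crawler has exactly one outcome.

```typescript
type Outcome = 'ok' | 'blocked-consistent' | 'reduced' | 'blocked';

// Full credit for 'ok' and for blocks that match robots.txt intent,
// half credit for reduced content, none for an unexplained block.
function accessScore(outcomes: Outcome[]): number {
  if (outcomes.length === 0) return 0; // guard added for this sketch only
  const credit = outcomes.reduce((acc, o) => {
    if (o === 'ok' || o === 'blocked-consistent') return acc + 1;
    if (o === 'reduced') return acc + 0.5;
    return acc;
  }, 0);
  return Math.round((credit / outcomes.length) * 100);
}

// 6 ok + 1 reduced + 1 blocked = 6.5 credit over 8 probes = 81.25, rounded to 81.
const mixed: Outcome[] = ['ok', 'ok', 'ok', 'ok', 'ok', 'ok', 'reduced', 'blocked'];
const mixedScore = accessScore(mixed);
```

A site that intentionally blocks one crawler in both robots.txt and its WAF still scores 100, because `blocked-consistent` earns full credit.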

package/dist/checks/agent-access.js.map ADDED

@@ -0,0 +1 @@

+ {"version":3,"file":"agent-access.js","sourceRoot":"","sources":["../../src/checks/agent-access.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,gBAAgB,EAAE,MAAM,iBAAiB,CAAC;AACnD,OAAO,EAAE,QAAQ,EAAE,MAAM,kBAAkB,CAAC;AAE5C,OAAO,EAAE,WAAW,EAAE,MAAM,YAAY,CAAC;AACzC,OAAO,EAAE,kBAAkB,EAAE,MAAM,iBAAiB,CAAC;AACrD,OAAO,EAAE,eAAe,EAAiB,MAAM,iBAAiB,CAAC;AAEjE,MAAM,CAAC,MAAM,IAAI,GAAc;IAC7B,EAAE,EAAE,cAAc;IAClB,IAAI,EAAE,cAAc;IACpB,WAAW,EAAE,yFAAyF;IACtG,MAAM,EAAE,CAAC,EAAE,qGAAqG;CACjH,CAAC;AAEF,oFAAoF;AACpF,MAAM,qBAAqB,GAAG,GAAG,CAAC;AAClC,kGAAkG;AAClG,MAAM,iBAAiB,GAAG,GAAG,CAAC;AAI9B;;;;GAIG;AACH,MAAM,UAAU,gBAAgB,CAAC,KAAa;IAC5C,OAAO,kEAAkE,KAAK,OAAO,CAAC;AACxF,CAAC;AAED;;;GAGG;AACH,MAAM,UAAU,aAAa,CAAC,OAAmB,EAAE,OAAe;IAChE,MAAM,QAAQ,GAAG,OAAO,CAAC,IAAI,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,IAAI,CAAC,WAAW,EAAE,KAAK,OAAO,CAAC,WAAW,EAAE,CAAC,CAAC;IACrF,IAAI,QAAQ;QAAE,OAAO,QAAQ,CAAC,UAAU,CAAC;IACzC,MAAM,QAAQ,GAAG,OAAO,CAAC,IAAI,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,IAAI,KAAK,GAAG,CAAC,CAAC;IACrD,OAAO,QAAQ,EAAE,UAAU,IAAI,KAAK,CAAC;AACvC,CAAC;AAED,MAAM,CAAC,OAAO,CAAC,KAAK,UAAU,KAAK,CAAC,GAAiB;IACnD,MAAM,KAAK,GAAG,WAAW,CAAC,GAAG,EAAE,CAAC;IAChC,MAAM,QAAQ,GAAc,EAAE,CAAC;IAE/B,MAAM,QAAQ,GAAG,MAAM,GAAG,CAAC,KAAK,CAAC,GAAG,CAAC,GAAG,CAAC,CAAC;IAC1C,IAAI,CAAC,QAAQ,CAAC,EAAE,EAAE,CAAC;QACjB,QAAQ,CAAC,IAAI,CAAC;YACZ,MAAM,EAAE,MAAM;YACd,OAAO,EAAE,kEAAkE;YAC3E,MAAM,EAAE,QAAQ,QAAQ,CAAC,MAAM,IAAI,eAAe,EAAE;YACpD,YAAY,EAAE,QAAQ,CAAC,IAAI,CAAC,EAAE,EAAE,sBAAsB,CAAC;SACxD,CAAC,CAAC;QACH,OAAO,WAAW,CAAC,IAAI,EAAE,CAAC,EAAE,QAAQ,EAAE,KAAK,CAAC,CAAC;IAC/C,CAAC;IACD,MAAM,YAAY,GAAG,kBAAkB,CAAC,QAAQ,CAAC,IAAI,CAAC,CAAC,MAAM,CAAC;IAE9D,MAAM,SAAS,GAAG,MAAM,GAAG,CAAC,KAAK,CAAC,GAAG,GAAG,CAAC,GAAG,aAAa,CAAC,CAAC;IAC3D,MAAM,aAAa,GAAG,SAAS,CAAC,EAAE,CAAC,CAAC,CAAC,eAAe,CAAC,SAAS,CAAC,IAAI,CAAC,CAAC,CAAC,CAAC,EAAE,CAAC;IAE1E,MAAM,QAAQ,GAAG,IAAI,GAAG,EAAmB,CAAC;IAE5C,KAAK,MAAM,OAAO,IAAI,gBAAgB,EAAE,CAAC;QACvC,MAAM,GAAG,GAAG,MAAM,GAAG,CAAC,KAAK,CAAC,GAAG,CAAC,GAAG,EAAE,EAAE,OAAO,EAAE,EAAE,YAAY,EAAE,gBAAgB,CA
AC,OAAO,CAAC,EAAE,EAAE,CAAC,CAAC;QAC/F,MAAM,eAAe,GAAG,aAAa,CAAC,aAAa,EAAE,OAAO,CAAC,CAAC;QAE9D,IAAI,CAAC,GAAG,CAAC,EAAE,EAAE,CAAC;YACZ,QAAQ,CAAC,GAAG,CAAC,OAAO,EAAE,eAAe,CAAC,CAAC,CAAC,oBAAoB,CAAC,CAAC,CAAC,SAAS,CAAC,CAAC;YAC1E,MAAM,MAAM,GAAG,QAAQ,GAAG,CAAC,MAAM,IAAI,eAAe,+BAA+B,OAAO,GAAG,CAAC;YAC9F,IAAI,eAAe,EAAE,CAAC;gBACpB,QAAQ,CAAC,IAAI,CAAC;oBACZ,MAAM,EAAE,MAAM;oBACd,OAAO,EAAE,GAAG,OAAO,kEAAkE;oBACrF,MAAM;iBACP,CAAC,CAAC;YACL,CAAC;iBAAM,CAAC;gBACN,QAAQ,CAAC,IAAI,CAAC;oBACZ,MAAM,EAAE,MAAM;oBACd,OAAO,EAAE,GAAG,OAAO,OAAO,SAAS,CAAC,EAAE,CAAC,CAAC,CAAC,uBAAuB,CAAC,CAAC,CAAC,gBAAgB,gCAAgC;oBACnH,MAAM;oBACN,IAAI,EACF,mJAAmJ;wBACnJ,uFAAuF;wBACvF,iMAAiM;oBACnM,YAAY,EAAE,QAAQ,CAAC,IAAI,CAAC,EAAE,EAAE,iBAAiB,CAAC;iBACnD,CAAC,CAAC;YACL,CAAC;YACD,SAAS;QACX,CAAC;QAED,MAAM,IAAI,GAAG,kBAAkB,CAAC,GAAG,CAAC,IAAI,CAAC,CAAC,MAAM,CAAC;QACjD,IAAI,YAAY,IAAI,iBAAiB,IAAI,IAAI,GAAG,YAAY,GAAG,qBAAqB,EAAE,CAAC;YACrF,QAAQ,CAAC,GAAG,CAAC,OAAO,EAAE,SAAS,CAAC,CAAC;YACjC,QAAQ,CAAC,IAAI,CAAC;gBACZ,MAAM,EAAE,MAAM;gBACd,OAAO,EAAE,GAAG,OAAO,8BAA8B,IAAI,OAAO,YAAY,yBAAyB;gBACjG,IAAI,EACF,oGAAoG;oBACpG,oGAAoG;gBACtG,YAAY,EAAE,QAAQ,CAAC,IAAI,CAAC,EAAE,EAAE,iBAAiB,CAAC;aACnD,CAAC,CAAC;QACL,CAAC;aAAM,CAAC;YACN,QAAQ,CAAC,GAAG,CAAC,OAAO,EAAE,IAAI,CAAC,CAAC;QAC9B,CAAC;IACH,CAAC;IAED,MAAM,OAAO,GAAG,CAAC,GAAG,QAAQ,CAAC,MAAM,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,KAAK,IAAI,CAAC,CAAC,MAAM,CAAC;IACxE,IAAI,OAAO,KAAK,gBAAgB,CAAC,MAAM,EAAE,CAAC;QACxC,QAAQ,CAAC,OAAO,CAAC;YACf,MAAM,EAAE,MAAM;YACd,OAAO,EAAE,OAAO,gBAAgB,CAAC,MAAM,2DAA2D;SACnG,CAAC,CAAC;IACL,CAAC;IAED,MAAM,MAAM,GAAG,CAAC,GAAG,QAAQ,CAAC,MAAM,EAAE,CAAC,CAAC,MAAM,CAAC,CAAC,GAAG,EAAE,CAAC,EAAE,EAAE;QACtD,IAAI,CAAC,KAAK,IAAI,IAAI,CAAC,KAAK,oBAAoB;YAAE,OAAO,GAAG,GAAG,CAAC,CAAC;QAC7D,IAAI,CAAC,KAAK,SAAS;YAAE,OAAO,GAAG,GAAG,GAAG,CAAC;QACtC,OAAO,GAAG,CAAC;IACb,CAAC,EAAE,CAAC,CAAC,CAAC;IACN,MAAM,KAAK,GAAG,IAAI,CAAC,KAAK,CAAC,CAAC,MAAM,GAAG,gBAAgB,CAAC,MAAM,CAAC,GAAG,GAAG,CAAC,CAAC;IAEnE,OAAO,WAAW,CAAC,IAAI,EAAE,KAAK,EAAE,QAAQ,EAAE,KAAK,CAAC,CAAC;
AACnD,CAAC"}

package/dist/checks/crawl-efficiency.d.ts ADDED

@@ -0,0 +1,4 @@
+import type { CheckContext, CheckResult, CheckMeta } from '../types.js';
+export declare const meta: CheckMeta;
+export default function check(ctx: CheckContext): Promise<CheckResult>;
+//# sourceMappingURL=crawl-efficiency.d.ts.map
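
The declaration above exposes only the check entry point; the changelog describes its Conditional GET dimension as: find an `ETag` or `Last-Modified` validator, replay the request with `If-None-Match` / `If-Modified-Since`, and expect `304 Not Modified` (−30 for no validator, −15 when 304 is not honored). The sketch below models just that decision logic as pure functions. The names `conditionalHeaders` and `conditionalGetPenalty` are hypothetical, and sending both headers when both validators exist is an assumption, since the implementation is not shown in this excerpt.

```typescript
// Validators as read from the first response's headers (null/undefined when absent).
interface Validators {
  etag?: string | null;
  lastModified?: string | null;
}

// Build the headers for the follow-up request, or null if the server
// offered no validator to revalidate against.
function conditionalHeaders(v: Validators): Record<string, string> | null {
  const headers: Record<string, string> = {};
  if (v.etag) headers['If-None-Match'] = v.etag;
  if (v.lastModified) headers['If-Modified-Since'] = v.lastModified;
  return Object.keys(headers).length > 0 ? headers : null;
}

// Penalty points per the changelog: 30 with no validator,
// 15 when the revalidation request does not come back as 304.
function conditionalGetPenalty(v: Validators, revalidationStatus?: number): number {
  if (conditionalHeaders(v) === null) return 30;
  return revalidationStatus === 304 ? 0 : 15;
}

const replayHeaders = conditionalHeaders({ etag: '"abc123"' });
```

In a real probe the headers returned by `conditionalHeaders` would be attached to a second request for the same URL, and that response's status passed to `conditionalGetPenalty`.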

package/dist/checks/crawl-efficiency.d.ts.map ADDED

@@ -0,0 +1 @@

+ {"version":3,"file":"crawl-efficiency.d.ts","sourceRoot":"","sources":["../../src/checks/crawl-efficiency.ts"],"names":[],"mappings":"AACA,OAAO,KAAK,EAAE,YAAY,EAAE,WAAW,EAAE,SAAS,EAAW,MAAM,aAAa,CAAC;AAGjF,eAAO,MAAM,IAAI,EAAE,SAKlB,CAAC;AAMF,wBAA8B,KAAK,CAAC,GAAG,EAAE,YAAY,GAAG,OAAO,CAAC,WAAW,CAAC,CA8G3E"}