@braedenbuilds/crawl-sim 1.3.0 → 1.4.0

@@ -9,7 +9,7 @@
  "name": "crawl-sim",
  "source": "./",
  "description": "Multi-bot web crawler simulator — audit how Googlebot, GPTBot, ClaudeBot, and PerplexityBot see your site",
- "version": "1.3.0"
+ "version": "1.4.0"
  }
  ]
  }
@@ -1,6 +1,6 @@
  {
  "name": "crawl-sim",
- "version": "1.3.0",
+ "version": "1.4.0",
  "description": "Multi-bot web crawler simulator — audit how Googlebot, GPTBot, ClaudeBot, and PerplexityBot see your site",
  "author": {
  "name": "BraedenBDev",
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@braedenbuilds/crawl-sim",
- "version": "1.3.0",
+ "version": "1.4.0",
  "description": "Agent-native multi-bot web crawler simulator. See your site through the eyes of Googlebot, GPTBot, ClaudeBot, and PerplexityBot.",
  "bin": {
  "crawl-sim": "bin/install.js"
@@ -31,6 +31,9 @@ Keep status lines short, active, and specific to this URL. Never use the same se
  /crawl-sim <url> --bot gptbot # single bot
  /crawl-sim <url> --category structured-data # category deep dive
  /crawl-sim <url> --json # JSON output only (for CI)
+ /crawl-sim <url> --pdf # audit + PDF report to Desktop
+ /crawl-sim <url> --compare <url2> # side-by-side comparison of two sites
+ /crawl-sim <url> --compare <url2> --pdf # comparison + PDF report
  ```

  ## Prerequisites — check once at the start
@@ -214,11 +217,17 @@ Then produce **prioritized findings** ranked by total point impact across bots:
  - **Framework detection.** Scan the HTML body for signals: `<meta name="next-head-count">` or `_next/static` → Next.js (Pages Router or App Router respectively), `<div id="__nuxt">` → Nuxt, `<div id="app">` with thin content → SPA (Vue/React CSR), `<!--$-->` placeholder tags → React 18 Suspense. Use these to tailor fix recommendations.
  - **No speculation beyond the data.** If server HTML has 0 `<a>` tags inside a component, say "component not present in server HTML" — not "JavaScript hydration failed" unless the diff-render data proves it.
  - **Known extractor limitations.** The bash meta extractor sometimes reports `h1Text: null` even when `h1.count: 1` — that happens when the H1 contains nested tags (`<br>`, `<span>`, `<svg>`). The count is still correct. Don't flag this as a site bug — it's tracked in GitHub issue #4.
+ - **robots.txt enforceability.** Each bot in the score output carries `robotsTxtEnforceability` — one of `enforced`, `advisory_only`, or `stealth_risk`. When robots.txt blocks a bot:
+   - `enforced`: The block works. State it directly: *"GPTBot is blocked by robots.txt."*
+   - `advisory_only`: The block is unenforceable via robots.txt alone. Flag it: *"robots.txt blocks ChatGPT-User, but OpenAI has stated user-initiated fetches may not respect robots.txt. Network-level enforcement (e.g., Cloudflare WAF rules) is needed to actually block this bot."*
+   - `stealth_risk`: The bot claims compliance but has been caught bypassing it. Note: *"PerplexityBot is blocked by robots.txt, but Cloudflare has documented instances of Perplexity using undeclared crawlers with generic user-agent strings to access blocked sites."*
+ - **Cloudflare context.** Since July 2025, Cloudflare blocks all AI training crawlers (GPTBot, ClaudeBot, CCBot, etc.) **by default** for new domains (~20% of the web). If a site uses Cloudflare, robots.txt may be redundant for training bots — the CDN blocks them at the network level before they reach the origin. The score output's `cloudflareCategory` field (`ai_crawler`, `ai_search`, `ai_assistant`) indicates which tier each bot falls into.
  - **Per-bot quirks to surface:**
-   - Googlebot: renders JS. If `diff-render.sh` was skipped, note that comparison was unavailable and recommend installing Playwright.
-   - GPTBot / ClaudeBot / PerplexityBot: `rendersJavaScript: false` at observed confidence — flag any server-vs-rendered delta as invisible-to-AI content.
-   - `chatgpt-user` / `perplexity-user`: officially ignore robots.txt for user-initiated fetches. Blocking these via robots.txt has no effect.
-   - PerplexityBot: third-party reports of stealth/undeclared crawling. Mention if relevant, don't assert.
+   - `chatgpt-user` / `perplexity-user`: `robotsTxtEnforceability: advisory_only`. Blocking these via robots.txt alone has no effect — always flag this in findings.
+   - `claude-user`: Anthropic is notably stricter — it commits to respecting robots.txt even for user-initiated fetches (`robotsTxtEnforceability: enforced`).
+   - PerplexityBot: `robotsTxtEnforceability: stealth_risk` — third-party and Cloudflare reports of stealth/undeclared crawling. Mention if relevant, don't assert.

  After findings, write a **Summary** paragraph: what's working well, biggest wins, confidence caveats. Keep it short — two to three sentences.

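As a minimal sketch of how an agent or wrapper script might turn the enforceability tiers above into findings language — the `enforceability_note` helper and its exact wording are illustrative, not part of the package:

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a robotsTxtEnforceability tier (as emitted in the
# score output) to a one-line finding. Tier names mirror the score output;
# the finding text is illustrative only.
enforceability_note() {
  local bot="$1" tier="$2"
  case "$tier" in
    enforced)      printf '%s is blocked by robots.txt.\n' "$bot" ;;
    advisory_only) printf '%s: robots.txt is advisory only; add network-level enforcement.\n' "$bot" ;;
    stealth_risk)  printf '%s: blocked on paper, but stealth crawling has been reported.\n' "$bot" ;;
    *)             printf '%s: enforceability unknown.\n' "$bot" ;;
  esac
}
```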
@@ -233,6 +242,37 @@ After findings, write a **Summary** paragraph: what's working well, biggest wins
  - If `jq` or `curl` is missing, exit with install instructions.
  - If `diff-render.sh` skips, the narrative must note that per-bot differentiation is reduced.

+ ## PDF Report (`--pdf`)
+
+ When the user passes `--pdf`, after the narrative output, generate a PDF report:
+
+ ```bash
+ "$SKILL_DIR/scripts/generate-report-html.sh" ./crawl-sim-report.json "$RUN_DIR/report.html"
+ "$SKILL_DIR/scripts/html-to-pdf.sh" "$RUN_DIR/report.html" "$HOME/Desktop/crawl-sim-audit.pdf"
+ ```
+
+ Tell the user where the PDF was saved. If `html-to-pdf.sh` fails (no Chrome or Playwright), the HTML file is still available — tell the user and suggest installing a renderer.
+
+ ## Comparative Audit (`--compare <url2>`)
+
+ When the user passes `--compare <url2>`, run two full audits and produce a side-by-side report:
+
+ 1. Run the complete 5-stage pipeline for `<url>` — save the report as `./crawl-sim-report-a.json`
+ 2. Run the complete 5-stage pipeline for `<url2>` — save the report as `./crawl-sim-report-b.json`
+ 3. Generate the comparison:
+
+ ```bash
+ "$SKILL_DIR/scripts/generate-compare-html.sh" ./crawl-sim-report-a.json ./crawl-sim-report-b.json "$RUN_DIR/compare.html"
+ ```
+
+ 4. If `--pdf` was also passed:
+
+ ```bash
+ "$SKILL_DIR/scripts/html-to-pdf.sh" "$RUN_DIR/compare.html" "$HOME/Desktop/crawl-sim-compare.pdf"
+ ```
+
+ The narrative for a comparison should lead with: which site wins overall, by how many points, and in which categories. Then highlight the biggest deltas — what Site A does better, what Site B does better, and what both share.
+
  ## Cleanup

  `$RUN_DIR` is small and informative — leave it in place and print the path. The user may want to inspect the raw JSON for any of the 23+ intermediate files.
@@ -4,7 +4,7 @@
  "vendor": "OpenAI",
  "userAgent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot",
  "robotsTxtToken": "ChatGPT-User",
- "purpose": "user-initiated",
+ "purpose": "user_retrieval",
  "rendersJavaScript": "unknown",
  "respectsRobotsTxt": "partial",
  "crawlDelaySupported": "unknown",
@@ -23,6 +23,11 @@
  }
  },
  "lastVerified": "2026-04-11",
- "relatedBots": ["gptbot", "oai-searchbot"],
- "notes": "Not used for automatic crawling. Not used to determine search appearance. User-initiated fetches in ChatGPT and Custom GPTs."
+ "relatedBots": [
+ "gptbot",
+ "oai-searchbot"
+ ],
+ "notes": "Not used for automatic crawling. Not used to determine search appearance. User-initiated fetches in ChatGPT and Custom GPTs.",
+ "cloudflareCategory": "ai_assistant",
+ "robotsTxtEnforceability": "advisory_only"
  }
@@ -23,6 +23,11 @@
  }
  },
  "lastVerified": "2026-04-11",
- "relatedBots": ["claudebot", "claude-user"],
- "notes": "Navigates the web to improve search result quality. Focused on search indexing, not training."
+ "relatedBots": [
+ "claudebot",
+ "claude-user"
+ ],
+ "notes": "Navigates the web to improve search result quality. Focused on search indexing, not training.",
+ "cloudflareCategory": "ai_search",
+ "robotsTxtEnforceability": "enforced"
  }
@@ -4,7 +4,7 @@
  "vendor": "Anthropic",
  "userAgent": "Claude-User",
  "robotsTxtToken": "Claude-User",
- "purpose": "user-initiated",
+ "purpose": "user_retrieval",
  "rendersJavaScript": "unknown",
  "respectsRobotsTxt": true,
  "crawlDelaySupported": "unknown",
@@ -23,6 +23,11 @@
  }
  },
  "lastVerified": "2026-04-11",
- "relatedBots": ["claudebot", "claude-searchbot"],
- "notes": "When individuals ask questions to Claude, it may access websites. Blocking prevents Claude from retrieving content in response to user queries."
+ "relatedBots": [
+ "claudebot",
+ "claude-searchbot"
+ ],
+ "notes": "When individuals ask questions to Claude, it may access websites. Blocking prevents Claude from retrieving content in response to user queries.",
+ "cloudflareCategory": "ai_assistant",
+ "robotsTxtEnforceability": "enforced"
  }
@@ -23,6 +23,11 @@
  }
  },
  "lastVerified": "2026-04-11",
- "relatedBots": ["claude-user", "claude-searchbot"],
- "notes": "Collects web content that could potentially contribute to AI model training. Crawl-delay explicitly supported (non-standard). Blocking IP addresses will not reliably work."
+ "relatedBots": [
+ "claude-user",
+ "claude-searchbot"
+ ],
+ "notes": "Collects web content that could potentially contribute to AI model training. Crawl-delay explicitly supported (non-standard). Blocking IP addresses will not reliably work.",
+ "cloudflareCategory": "ai_crawler",
+ "robotsTxtEnforceability": "enforced"
  }
@@ -4,7 +4,7 @@
  "vendor": "Google",
  "userAgent": "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  "robotsTxtToken": "Googlebot",
- "purpose": "search-indexing",
+ "purpose": "search",
  "rendersJavaScript": true,
  "respectsRobotsTxt": true,
  "crawlDelaySupported": false,
@@ -24,5 +24,7 @@
  },
  "lastVerified": "2026-04-11",
  "relatedBots": [],
- "notes": "Two-phase: initial fetch (HTML) then queued render (headless Chrome via WRS). Evergreen Chromium. Stateless sessions. ~5s default timeout. Mobile-first indexing."
+ "notes": "Two-phase: initial fetch (HTML) then queued render (headless Chrome via WRS). Evergreen Chromium. Stateless sessions. ~5s default timeout. Mobile-first indexing.",
+ "cloudflareCategory": "search_engine",
+ "robotsTxtEnforceability": "enforced"
  }
@@ -23,6 +23,11 @@
  }
  },
  "lastVerified": "2026-04-11",
- "relatedBots": ["oai-searchbot", "chatgpt-user"],
- "notes": "Disallowing GPTBot indicates a site's content should not be used in training generative AI foundation models."
+ "relatedBots": [
+ "oai-searchbot",
+ "chatgpt-user"
+ ],
+ "notes": "Disallowing GPTBot indicates a site's content should not be used in training generative AI foundation models.",
+ "cloudflareCategory": "ai_crawler",
+ "robotsTxtEnforceability": "enforced"
  }
@@ -23,6 +23,11 @@
  }
  },
  "lastVerified": "2026-04-11",
- "relatedBots": ["gptbot", "chatgpt-user"],
- "notes": "Sites opted out of OAI-SearchBot will not be shown in ChatGPT search answers, though can still appear as navigational links."
+ "relatedBots": [
+ "gptbot",
+ "chatgpt-user"
+ ],
+ "notes": "Sites opted out of OAI-SearchBot will not be shown in ChatGPT search answers, though can still appear as navigational links.",
+ "cloudflareCategory": "ai_search",
+ "robotsTxtEnforceability": "enforced"
  }
@@ -4,7 +4,7 @@
  "vendor": "Perplexity",
  "userAgent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)",
  "robotsTxtToken": "Perplexity-User",
- "purpose": "user-initiated",
+ "purpose": "user_retrieval",
  "rendersJavaScript": "unknown",
  "respectsRobotsTxt": false,
  "crawlDelaySupported": "unknown",
@@ -23,6 +23,10 @@
  }
  },
  "lastVerified": "2026-04-11",
- "relatedBots": ["perplexitybot"],
- "notes": "Supports user actions within Perplexity. Not used for web crawling or AI training. Generally ignores robots.txt since fetches are user-initiated."
+ "relatedBots": [
+ "perplexitybot"
+ ],
+ "notes": "Supports user actions within Perplexity. Not used for web crawling or AI training. Generally ignores robots.txt since fetches are user-initiated.",
+ "cloudflareCategory": "ai_assistant",
+ "robotsTxtEnforceability": "advisory_only"
  }
@@ -4,7 +4,7 @@
  "vendor": "Perplexity",
  "userAgent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)",
  "robotsTxtToken": "PerplexityBot",
- "purpose": "search-indexing",
+ "purpose": "search",
  "rendersJavaScript": false,
  "respectsRobotsTxt": true,
  "crawlDelaySupported": "unknown",
@@ -23,6 +23,10 @@
  }
  },
  "lastVerified": "2026-04-11",
- "relatedBots": ["perplexity-user"],
- "notes": "Designed to surface and link websites in search results on Perplexity. NOT used to crawl content for AI foundation models. Changes may take up to 24 hours to reflect."
+ "relatedBots": [
+ "perplexity-user"
+ ],
+ "notes": "Designed to surface and link websites in search results on Perplexity. NOT used to crawl content for AI foundation models. Changes may take up to 24 hours to reflect.",
+ "cloudflareCategory": "ai_search",
+ "robotsTxtEnforceability": "stealth_risk"
  }
@@ -73,9 +73,14 @@ page_type_for_url() {

  # Fetch a URL to a local file and return the HTTP status code on stdout.
  # Usage: status=$(fetch_to_file <url> <output-file> [timeout-seconds])
+ # Retries once on transient failure (same SSL/DNS flake that caused #11).
  fetch_to_file() {
  local url="$1"
  local out="$2"
  local timeout="${3:-15}"
- curl -sS -L -o "$out" -w '%{http_code}' --max-time "$timeout" "$url" 2>/dev/null || echo "000"
+ local status
+ status=$(curl -sS -L -o "$out" -w '%{http_code}' --max-time "$timeout" "$url" 2>/dev/null) && echo "$status" && return
+ # Retry once on transient failure
+ status=$(curl -sS -L -o "$out" -w '%{http_code}' --max-time "$timeout" "$url" 2>/dev/null) && echo "$status" && return
+ echo "000"
  }
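The retry-once shape added above can be isolated from curl. A minimal sketch, under the assumption that the wrapped command prints its result on success (`run_twice` is a hypothetical name, not in the package):

```bash
#!/usr/bin/env bash
# Hypothetical illustration of the retry-once pattern used by fetch_to_file:
# try the command, retry a single time on failure, fall back to "000".
run_twice() {
  local out
  out=$("$@") && printf '%s\n' "$out" && return
  out=$("$@") && printf '%s\n' "$out" && return
  echo "000"
}
```

A command that fails once but succeeds on the second attempt still yields its real output; only a double failure produces the sentinel.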
@@ -312,14 +312,16 @@ for bot_id in $BOTS; do
  continue
  fi

- # Batch-read fields from fetch file (1 jq call instead of 4)
- read -r STATUS TOTAL_TIME SERVER_WORD_COUNT RENDERS_JS <<< \
+ # Batch-read fields from fetch file
+ read -r STATUS TOTAL_TIME SERVER_WORD_COUNT RENDERS_JS PURPOSE_TIER ROBOTS_ENFORCE <<< \
  "$(jq -r '[
  (.status // 0),
  (.timing.total // 0),
  (.wordCount // 0),
- (.bot.rendersJavaScript | if . == null then "unknown" else tostring end)
- ] | @tsv' "$FETCH" 2>/dev/null || echo "0 0 0 unknown")"
+ (.bot.rendersJavaScript | if . == null then "unknown" else tostring end),
+ (.bot.purpose // "unknown"),
+ (.bot.robotsTxtEnforceability // "unknown")
+ ] | @tsv' "$FETCH" 2>/dev/null || echo "0 0 0 unknown unknown unknown")"

  ROBOTS_ALLOWED=$(jq -r '.allowed // false | tostring' "$ROBOTS" 2>/dev/null || echo "false")

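The `read -r … <<< "$(jq … || echo …)"` shape above only works if the fallback string supplies exactly one whitespace-separated field per variable. A sketch of that contract with the jq call stubbed out — `read_fields` and its field values are illustrative, not part of the package:

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the batch-read pattern: a producer that either
# emits tab-separated fields or fails, with || echo supplying defaults so
# read -r always populates every variable.
read_fields() {
  local mode="$1" STATUS TIME WORDS RENDERS
  read -r STATUS TIME WORDS RENDERS <<< \
    "$( [ "$mode" = ok ] && printf '200\t0.41\t523\ttrue' || echo "0 0 0 unknown" )"
  printf '%s %s %s %s\n' "$STATUS" "$TIME" "$WORDS" "$RENDERS"
}
```

Default IFS splits on both tabs (from `@tsv`) and spaces (from the fallback), so the same `read` handles either path.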
@@ -547,13 +549,13 @@

  # --- Category 5: AI Readiness (0-100) ---
  AI=0
- # Batch-read llmstxt fields (1 jq call instead of 4)
+ # Batch-read llmstxt fields — use top-level exists (M1) which covers both variants
  read -r LLMS_EXISTS LLMS_HAS_TITLE LLMS_HAS_DESC LLMS_URLS <<< \
  "$(jq -r '[
- (.llmsTxt.exists // false | tostring),
- (.llmsTxt.hasTitle // false | tostring),
- (.llmsTxt.hasDescription // false | tostring),
- (.llmsTxt.urlCount // 0)
+ (.exists // (.llmsTxt.exists or .llmsFullTxt.exists) | tostring),
+ ((.llmsTxt.hasTitle // .llmsFullTxt.hasTitle // false) | tostring),
+ ((.llmsTxt.hasDescription // .llmsFullTxt.hasDescription // false) | tostring),
+ ((.llmsTxt.urlCount // 0) + (.llmsFullTxt.urlCount // 0))
  ] | @tsv' "$LLMSTXT_FILE" 2>/dev/null || echo "false false false 0")"

  if [ "$LLMS_EXISTS" = "true" ]; then
@@ -586,6 +588,8 @@
  --arg id "$bot_id" \
  --arg name "$BOT_NAME" \
  --arg rendersJs "$RENDERS_JS" \
+ --arg purpose "$PURPOSE_TIER" \
+ --arg robotsEnforce "$ROBOTS_ENFORCE" \
  --argjson score "$BOT_SCORE" \
  --arg grade "$BOT_GRADE" \
  --argjson acc "$ACC" \
@@ -605,6 +609,8 @@
  id: $id,
  name: $name,
  rendersJavaScript: (if $rendersJs == "true" then true elif $rendersJs == "false" then false else $rendersJs end),
+ purpose: $purpose,
+ robotsTxtEnforceability: $robotsEnforce,
  score: $score,
  grade: $grade,
  visibility: {
@@ -15,6 +15,8 @@ BOT_ID=$(jq -r '.id' "$PROFILE")
  BOT_NAME=$(jq -r '.name' "$PROFILE")
  UA=$(jq -r '.userAgent' "$PROFILE")
  RENDERS_JS=$(jq -r '.rendersJavaScript' "$PROFILE")
+ PURPOSE=$(jq -r '.purpose // "unknown"' "$PROFILE")
+ ROBOTS_ENFORCE=$(jq -r '.robotsTxtEnforceability // "unknown"' "$PROFILE")

  TMPDIR="${TMPDIR:-/tmp}"
  HEADERS_FILE=$(mktemp "$TMPDIR/crawlsim-headers.XXXXXX")
@@ -48,6 +50,8 @@ if [ "$CURL_EXIT" -ne 0 ]; then
  --arg botName "$BOT_NAME" \
  --arg ua "$UA" \
  --arg rendersJs "$RENDERS_JS" \
+ --arg purpose "$PURPOSE" \
+ --arg robotsEnforce "$ROBOTS_ENFORCE" \
  --arg error "$CURL_ERR" \
  --argjson exitCode "$CURL_EXIT" \
  '{
@@ -56,7 +60,9 @@
  id: $botId,
  name: $botName,
  userAgent: $ua,
- rendersJavaScript: (if $rendersJs == "true" then true elif $rendersJs == "false" then false else $rendersJs end)
+ rendersJavaScript: (if $rendersJs == "true" then true elif $rendersJs == "false" then false else $rendersJs end),
+ purpose: $purpose,
+ robotsTxtEnforceability: $robotsEnforce
  },
  fetchFailed: true,
  error: $error,
@@ -121,6 +127,8 @@ jq -n \
  --arg botName "$BOT_NAME" \
  --arg ua "$UA" \
  --arg rendersJs "$RENDERS_JS" \
+ --arg purpose "$PURPOSE" \
+ --arg robotsEnforce "$ROBOTS_ENFORCE" \
  --argjson status "$STATUS" \
  --argjson totalTime "$TOTAL_TIME" \
  --argjson ttfb "$TTFB" \
@@ -137,7 +145,9 @@
  id: $botId,
  name: $botName,
  userAgent: $ua,
- rendersJavaScript: (if $rendersJs == "true" then true elif $rendersJs == "false" then false else $rendersJs end)
+ rendersJavaScript: (if $rendersJs == "true" then true elif $rendersJs == "false" then false else $rendersJs end),
+ purpose: $purpose,
+ robotsTxtEnforceability: $robotsEnforce
  },
  status: $status,
  timing: { total: $totalTime, ttfb: $ttfb },
@@ -0,0 +1,158 @@
+ #!/usr/bin/env bash
+ set -eu
+
+ # generate-compare-html.sh — Generate a side-by-side comparison HTML from two crawl-sim reports
+ # Usage: generate-compare-html.sh <report-a.json> <report-b.json> [output.html]
+
+ REPORT_A="${1:?Usage: generate-compare-html.sh <report-a.json> <report-b.json> [output.html]}"
+ REPORT_B="${2:?Usage: generate-compare-html.sh <report-a.json> <report-b.json> [output.html]}"
+ OUTPUT="${3:-}"
+
+ for f in "$REPORT_A" "$REPORT_B"; do
+ [ -f "$f" ] || { echo "Error: report not found: $f" >&2; exit 1; }
+ done
+
+ # Extract key data from both reports
+ URL_A=$(jq -r '.url' "$REPORT_A")
+ URL_B=$(jq -r '.url' "$REPORT_B")
+ SCORE_A=$(jq -r '.overall.score' "$REPORT_A")
+ SCORE_B=$(jq -r '.overall.score' "$REPORT_B")
+ GRADE_A=$(jq -r '.overall.grade' "$REPORT_A")
+ GRADE_B=$(jq -r '.overall.grade' "$REPORT_B")
+ PARITY_A=$(jq -r '.parity.score' "$REPORT_A")
+ PARITY_B=$(jq -r '.parity.score' "$REPORT_B")
+
+ # Build category comparison rows
+ CAT_COMPARE=$(jq -r --slurpfile b "$REPORT_B" '
+ .categories | to_entries[] |
+ . as $cat |
+ ($b[0].categories[$cat.key]) as $bcat |
+ (if $cat.value.score > $bcat.score then "winner-a"
+ elif $cat.value.score < $bcat.score then "winner-b"
+ else "tie" end) as $cls |
+ ($cat.value.score - $bcat.score) as $delta |
+ "<tr class=\"\($cls)\"><td>\($cat.key)</td>" +
+ "<td>\($cat.value.score) (\($cat.value.grade))</td>" +
+ "<td>\($bcat.score) (\($bcat.grade))</td>" +
+ "<td>\(if $delta > 0 then "+\($delta)" elif $delta < 0 then "\($delta)" else "=" end)</td></tr>"
+ ' "$REPORT_A")
+
+ # Build per-bot comparison (using the 4 main bots)
+ BOT_COMPARE=$(jq -r --slurpfile b "$REPORT_B" '
+ . as $a | ["googlebot", "gptbot", "claudebot", "perplexitybot"] | .[] |
+ . as $id |
+ ($a.bots[$id] // {score: 0, grade: "N/A"}) as $ba |
+ ($b[0].bots[$id] // {score: 0, grade: "N/A"}) as $bb |
+ ($ba.score - $bb.score) as $delta |
+ "<tr><td>\($id)</td>" +
+ "<td>\($ba.score) (\($ba.grade))</td>" +
+ "<td>\($bb.score) (\($bb.grade))</td>" +
+ "<td>\(if $delta > 0 then "+\($delta)" elif $delta < 0 then "\($delta)" else "=" end)</td></tr>"
+ ' "$REPORT_A")
+
+ # Determine overall winner
+ if [ "$SCORE_A" -gt "$SCORE_B" ]; then
+ WINNER="Site A leads by $((SCORE_A - SCORE_B)) points"
+ elif [ "$SCORE_B" -gt "$SCORE_A" ]; then
+ WINNER="Site B leads by $((SCORE_B - SCORE_A)) points"
+ else
+ WINNER="Both sites tied at ${SCORE_A}/100"
+ fi
+
+ # Count category wins
+ WINS_A=$(jq --slurpfile b "$REPORT_B" '
+ [.categories | to_entries[] | select(.value.score > ($b[0].categories[.key].score))] | length
+ ' "$REPORT_A")
+ WINS_B=$(jq --slurpfile b "$REPORT_B" '
+ [.categories | to_entries[] | select(.value.score < ($b[0].categories[.key].score))] | length
+ ' "$REPORT_A")
+
+ HTML=$(cat <<HTMLEOF
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+ <meta charset="utf-8">
+ <title>crawl-sim Comparison</title>
+ <style>
+ @page { size: A4 landscape; margin: 15mm; }
+ * { box-sizing: border-box; margin: 0; padding: 0; }
+ body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; color: #1a1a1a; line-height: 1.5; padding: 40px; max-width: 1100px; margin: 0 auto; }
+ h1 { font-size: 24px; margin-bottom: 4px; }
+ .subtitle { color: #666; font-size: 13px; margin-bottom: 24px; }
+ .vs-hero { display: grid; grid-template-columns: 1fr auto 1fr; gap: 20px; align-items: center; margin-bottom: 32px; }
+ .site-card { background: #f8f9fa; border-radius: 12px; padding: 24px; text-align: center; }
+ .site-card.winner { background: #e8f5e9; border: 2px solid #27ae60; }
+ .site-score { font-size: 56px; font-weight: 800; line-height: 1; }
+ .site-grade { font-size: 36px; font-weight: 700; color: #2d7d46; }
+ .site-url { font-size: 12px; color: #666; word-break: break-all; margin-top: 8px; }
+ .vs { font-size: 32px; font-weight: 800; color: #999; }
+ .verdict { text-align: center; font-size: 16px; font-weight: 600; margin-bottom: 24px; padding: 12px; background: #f0f0f0; border-radius: 8px; }
+ table { width: 100%; border-collapse: collapse; margin-bottom: 24px; font-size: 13px; }
+ th { background: #1a1a1a; color: white; padding: 8px 12px; text-align: left; }
+ td { padding: 8px 12px; border-bottom: 1px solid #e0e0e0; }
+ tr:nth-child(even) { background: #f8f9fa; }
+ .winner-a td:nth-child(2) { color: #27ae60; font-weight: 600; }
+ .winner-b td:nth-child(3) { color: #27ae60; font-weight: 600; }
+ .winner-a td:last-child { color: #27ae60; }
+ .winner-b td:last-child { color: #c0392b; }
+ h2 { font-size: 18px; margin: 24px 0 12px; border-bottom: 2px solid #1a1a1a; padding-bottom: 4px; }
+ .footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #e0e0e0; font-size: 11px; color: #999; }
+ @media print { body { padding: 0; } }
+ </style>
+ </head>
+ <body>
+
+ <h1>crawl-sim — Comparative Audit</h1>
+ <div class="subtitle">Generated $(date -u +"%Y-%m-%d %H:%M UTC")</div>
+
+ <div class="vs-hero">
+ <div class="site-card$([ "$SCORE_A" -ge "$SCORE_B" ] && echo ' winner' || echo '')">
+ <div style="font-size:12px;font-weight:600;color:#666;margin-bottom:8px">SITE A</div>
+ <div class="site-score">${SCORE_A}</div>
+ <div class="site-grade">${GRADE_A}</div>
+ <div class="site-url">${URL_A}</div>
+ </div>
+ <div class="vs">VS</div>
+ <div class="site-card$([ "$SCORE_B" -gt "$SCORE_A" ] && echo ' winner' || echo '')">
+ <div style="font-size:12px;font-weight:600;color:#666;margin-bottom:8px">SITE B</div>
+ <div class="site-score">${SCORE_B}</div>
+ <div class="site-grade">${GRADE_B}</div>
+ <div class="site-url">${URL_B}</div>
+ </div>
+ </div>
+
+ <div class="verdict">${WINNER} &middot; Site A wins ${WINS_A} categories, Site B wins ${WINS_B}</div>
+
+ <h2>Category Breakdown</h2>
+ <table>
+ <tr><th>Category</th><th>Site A</th><th>Site B</th><th>Delta</th></tr>
+ ${CAT_COMPARE}
+ <tr style="font-weight:600;border-top:2px solid #1a1a1a">
+ <td>Content Parity</td>
+ <td>${PARITY_A}</td>
+ <td>${PARITY_B}</td>
+ <td>$([ "$PARITY_A" -gt "$PARITY_B" ] 2>/dev/null && echo "+$((PARITY_A - PARITY_B))" || ([ "$PARITY_B" -gt "$PARITY_A" ] 2>/dev/null && echo "-$((PARITY_B - PARITY_A))" || echo "="))</td>
+ </tr>
+ </table>
+
+ <h2>Per-Bot Scores</h2>
+ <table>
+ <tr><th>Bot</th><th>Site A</th><th>Site B</th><th>Delta</th></tr>
+ ${BOT_COMPARE}
+ </table>
+
+ <div class="footer">
+ Generated by crawl-sim v1.4.0 &middot; <a href="https://github.com/BraedenBDev/crawl-sim">github.com/BraedenBDev/crawl-sim</a>
+ </div>
+
+ </body>
+ </html>
+ HTMLEOF
+ )
+
+ if [ -n "$OUTPUT" ]; then
+ printf '%s' "$HTML" > "$OUTPUT"
+ printf '[generate-compare-html] wrote %s\n' "$OUTPUT" >&2
+ else
+ printf '%s' "$HTML"
+ fi
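The winner-determination branch in this script reduces to an integer comparison of the two overall scores; extracted as a standalone sketch (the `verdict` function name is illustrative):

```bash
#!/usr/bin/env bash
# Mirrors the overall-winner logic of generate-compare-html.sh for two
# integer scores out of 100.
verdict() {
  local a="$1" b="$2"
  if [ "$a" -gt "$b" ]; then
    echo "Site A leads by $((a - b)) points"
  elif [ "$b" -gt "$a" ]; then
    echo "Site B leads by $((b - a)) points"
  else
    echo "Both sites tied at ${a}/100"
  fi
}
```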
@@ -0,0 +1,148 @@
1
+ #!/usr/bin/env bash
2
+ set -eu
3
+
4
+ # generate-report-html.sh — Generate a styled HTML audit report from crawl-sim-report.json
5
+ # Usage: generate-report-html.sh <report.json> [output.html]
6
+ # Output: HTML to stdout (or file if second arg given)
7
+
8
+ REPORT="${1:?Usage: generate-report-html.sh <report.json> [output.html]}"
9
+ OUTPUT="${2:-}"
10
+
11
+ if [ ! -f "$REPORT" ]; then
12
+ echo "Error: report not found: $REPORT" >&2
13
+ exit 1
14
+ fi
15
+
16
+ # Extract key data
17
+ URL=$(jq -r '.url' "$REPORT")
18
+ TIMESTAMP=$(jq -r '.timestamp' "$REPORT")
19
+ PAGE_TYPE=$(jq -r '.pageType' "$REPORT")
20
+ OVERALL_SCORE=$(jq -r '.overall.score' "$REPORT")
21
+ OVERALL_GRADE=$(jq -r '.overall.grade' "$REPORT")
22
+ PARITY_SCORE=$(jq -r '.parity.score' "$REPORT")
23
+ PARITY_GRADE=$(jq -r '.parity.grade' "$REPORT")
24
+ PARITY_INTERP=$(jq -r '.parity.interpretation' "$REPORT")
25
+
26
+ # Build per-bot table rows
27
+ BOT_ROWS=$(jq -r '
28
+ .bots | to_entries[] |
29
+ "<tr><td>\(.value.name)</td><td>\(.value.score)</td><td>\(.value.grade)</td>" +
30
+ "<td>\(.value.categories.accessibility.score)</td>" +
31
+ "<td>\(.value.categories.contentVisibility.score)</td>" +
32
+ "<td>\(.value.categories.structuredData.score)</td>" +
33
+ "<td>\(.value.categories.technicalSignals.score)</td>" +
34
+ "<td>\(.value.categories.aiReadiness.score)</td>" +
35
+ "<td>\(.value.purpose // "-")</td>" +
36
+ "<td class=\"enforce-\(.value.robotsTxtEnforceability // "unknown")\">\(.value.robotsTxtEnforceability // "-")</td></tr>"
37
+ ' "$REPORT")
38
+
39
+ # Build category averages
40
+ CAT_ROWS=$(jq -r '
41
+ .categories | to_entries[] |
42
+ "<tr><td>\(.key)</td><td>\(.value.score)</td><td>\(.value.grade)</td></tr>"
43
+ ' "$REPORT")
44
+
45
+ # Build warnings
46
+ WARNINGS_HTML=$(jq -r '
47
+ if (.warnings | length) > 0 then
48
+ (.warnings[] | "<div class=\"warning\"><strong>⚠ \(.code)</strong>: \(.message)</div>")
49
+ else
50
+ "<div class=\"ok\">No warnings.</div>"
51
+ end
52
+ ' "$REPORT")
53
+
54
+ # Build structured data details for first bot
55
+ SD_DETAILS=$(jq -r '
56
+ .bots | to_entries[0].value.categories.structuredData |
57
+ "<p><strong>Page type:</strong> \(.pageType)</p>" +
58
+ "<p><strong>Present:</strong> \(.present | join(", "))</p>" +
59
+ "<p><strong>Missing:</strong> \(if (.missing | length) > 0 then (.missing | join(", ")) else "none" end)</p>" +
60
+ "<p><strong>Violations:</strong> \(if (.violations | length) > 0 then (.violations | map("\(.kind): \(.schema // .field // "")") | join(", ")) else "none" end)</p>" +
61
+ "<p><strong>Notes:</strong> \(.notes)</p>"
62
+ ' "$REPORT")
+
+ # Generate HTML
+ HTML=$(cat <<HTMLEOF
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+ <meta charset="utf-8">
+ <title>crawl-sim Audit — ${URL}</title>
+ <style>
+ @page { size: A4; margin: 20mm; }
+ * { box-sizing: border-box; margin: 0; padding: 0; }
+ body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; color: #1a1a1a; line-height: 1.5; padding: 40px; max-width: 900px; margin: 0 auto; }
+ h1 { font-size: 28px; margin-bottom: 4px; }
+ .subtitle { color: #666; font-size: 14px; margin-bottom: 24px; }
+ .score-hero { display: flex; align-items: center; gap: 24px; background: #f8f9fa; border-radius: 12px; padding: 24px; margin-bottom: 24px; }
+ .score-big { font-size: 64px; font-weight: 800; line-height: 1; }
+ .grade-big { font-size: 48px; font-weight: 700; color: #2d7d46; }
+ .score-meta { font-size: 14px; color: #666; }
+ table { width: 100%; border-collapse: collapse; margin-bottom: 24px; font-size: 13px; }
+ th { background: #1a1a1a; color: white; padding: 8px 12px; text-align: left; font-weight: 600; }
+ td { padding: 8px 12px; border-bottom: 1px solid #e0e0e0; }
+ tr:nth-child(even) { background: #f8f9fa; }
+ .enforce-advisory_only { color: #c0392b; font-weight: 600; }
+ .enforce-stealth_risk { color: #e67e22; font-weight: 600; }
+ .enforce-enforced { color: #27ae60; }
+ h2 { font-size: 18px; margin: 32px 0 12px; border-bottom: 2px solid #1a1a1a; padding-bottom: 4px; }
+ .warning { background: #fff3cd; border-left: 4px solid #ffc107; padding: 12px 16px; margin-bottom: 8px; border-radius: 4px; font-size: 13px; }
+ .ok { color: #27ae60; font-size: 13px; }
+ .parity { display: flex; gap: 16px; align-items: center; background: #e8f5e9; border-radius: 8px; padding: 16px; margin-bottom: 24px; }
+ .parity.low { background: #ffebee; }
+ .footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #e0e0e0; font-size: 11px; color: #999; }
+ @media print { body { padding: 0; } .score-hero { break-inside: avoid; } table { break-inside: avoid; } }
+ </style>
+ </head>
+ <body>
+
+ <h1>crawl-sim — Bot Visibility Audit</h1>
+ <div class="subtitle">${URL} &middot; ${TIMESTAMP} &middot; Page type: ${PAGE_TYPE}</div>
+
+ <div class="score-hero">
+ <div>
+ <span class="score-big">${OVERALL_SCORE}</span><span style="font-size:24px;color:#666">/100</span>
+ </div>
+ <div>
+ <div class="grade-big">${OVERALL_GRADE}</div>
+ <div class="score-meta">Overall Score</div>
+ </div>
+ </div>
+
+ <div class="parity${PARITY_SCORE:+ }$([ "$PARITY_SCORE" -lt 50 ] 2>/dev/null && echo 'low' || echo '')">
+ <div><strong>Content Parity:</strong> ${PARITY_SCORE}/100 (${PARITY_GRADE})</div>
+ <div>${PARITY_INTERP}</div>
+ </div>
+
+ ${WARNINGS_HTML}
+
+ <h2>Per-Bot Scores</h2>
+ <table>
+ <tr><th>Bot</th><th>Score</th><th>Grade</th><th>Access</th><th>Content</th><th>Schema</th><th>Technical</th><th>AI</th><th>Purpose</th><th>robots.txt</th></tr>
+ ${BOT_ROWS}
+ </table>
+
+ <h2>Category Averages</h2>
+ <table>
+ <tr><th>Category</th><th>Score</th><th>Grade</th></tr>
+ ${CAT_ROWS}
+ </table>
+
+ <h2>Structured Data Details</h2>
+ ${SD_DETAILS}
+
+ <div class="footer">
+ Generated by crawl-sim v1.4.0 &middot; <a href="https://github.com/BraedenBDev/crawl-sim">github.com/BraedenBDev/crawl-sim</a>
+ </div>
+
+ </body>
+ </html>
+ HTMLEOF
+ )
+
+ if [ -n "$OUTPUT" ]; then
+ printf '%s' "$HTML" > "$OUTPUT"
+ printf '[generate-report-html] wrote %s\n' "$OUTPUT" >&2
+ else
+ printf '%s' "$HTML"
+ fi
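The report generator above builds the whole document in one command-substituted heredoc and only then decides whether to write it to a file or to stdout. A minimal standalone sketch of that pattern (the `URL` and `SCORE` values here are hypothetical, not package code):

```shell
#!/bin/sh
# Build an HTML fragment in a command-substituted heredoc, then emit it
# with printf. Variables inside the heredoc are expanded at capture time.
URL="https://example.com"   # hypothetical sample values
SCORE=87

HTML=$(cat <<HTMLEOF
<p>${URL} scored ${SCORE}/100</p>
HTMLEOF
)

# printf '%s' avoids echo's portability quirks with backslashes/flags.
printf '%s\n' "$HTML"
```

Capturing into a variable first is what lets the real script branch on `-n "$OUTPUT"` afterward without generating the document twice.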
@@ -0,0 +1,85 @@
+ #!/usr/bin/env bash
+ set -eu
+
+ # html-to-pdf.sh — Convert an HTML file to PDF using the best available renderer.
+ # Usage: html-to-pdf.sh <input.html> <output.pdf>
+ #
+ # Detection order:
+ # 1. Chrome/Chromium at known system paths
+ # 2. Playwright's bundled Chromium (npx playwright pdf)
+ # 3. Neither → exit 1 with instructions
+ #
+ # This script is intentionally renderer-agnostic. Callers don't need to know
+ # which engine is available — they just pass HTML in and get PDF out.
+
+ INPUT="${1:?Usage: html-to-pdf.sh <input.html> <output.pdf>}"
+ OUTPUT="${2:?Usage: html-to-pdf.sh <input.html> <output.pdf>}"
+
+ if [ ! -f "$INPUT" ]; then
+ echo "Error: input file not found: $INPUT" >&2
+ exit 1
+ fi
+
+ # Convert to a file:// URL for Chrome (it needs an absolute path)
+ case "$INPUT" in
+ /*) INPUT_URL="file://$INPUT" ;;
+ *) INPUT_URL="file://$(pwd)/$INPUT" ;;
+ esac
+
+ # --- Strategy 1: System Chrome/Chromium ---
+
+ find_chrome() {
+ # macOS
+ for path in \
+ "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
+ "/Applications/Chromium.app/Contents/MacOS/Chromium" \
+ "/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary" \
+ "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser"; do
+ [ -x "$path" ] && echo "$path" && return 0
+ done
+ # Linux / WSL: command -v prints the resolved path on success
+ for cmd in google-chrome chromium-browser chromium google-chrome-stable; do
+ command -v "$cmd" 2>/dev/null && return 0
+ done
+ return 1
+ }
+
+ if CHROME=$(find_chrome); then
+ printf '[html-to-pdf] using Chrome: %s\n' "$CHROME" >&2
+ "$CHROME" \
+ --headless \
+ --disable-gpu \
+ --no-sandbox \
+ --print-to-pdf="$OUTPUT" \
+ --no-margins \
+ "$INPUT_URL" 2>/dev/null || true  # set -e: fall through to Playwright on failure
+ if [ -s "$OUTPUT" ]; then
+ printf '[html-to-pdf] wrote %s (%s bytes)\n' "$OUTPUT" "$(wc -c < "$OUTPUT" | tr -d ' ')" >&2
+ exit 0
+ fi
+ printf '[html-to-pdf] Chrome produced empty output, trying Playwright fallback\n' >&2
+ fi
+
+ # --- Strategy 2: Playwright's bundled Chromium ---
+
+ if command -v npx >/dev/null 2>&1; then
+ # Check if playwright is installed (don't auto-install)
+ if npx playwright --version >/dev/null 2>&1; then
+ printf '[html-to-pdf] using Playwright bundled Chromium\n' >&2
+ npx playwright pdf "$INPUT_URL" "$OUTPUT" 2>/dev/null || true  # set -e: fall through on failure
+ if [ -s "$OUTPUT" ]; then
+ printf '[html-to-pdf] wrote %s (%s bytes)\n' "$OUTPUT" "$(wc -c < "$OUTPUT" | tr -d ' ')" >&2
+ exit 0
+ fi
+ printf '[html-to-pdf] Playwright produced empty output\n' >&2
+ fi
+ fi
+
+ # --- No renderer available ---
+
+ echo "Error: no PDF renderer found." >&2
+ echo " Install one of:" >&2
+ echo " - Google Chrome (recommended — already handles print CSS)" >&2
+ echo " - Playwright: npx playwright install chromium" >&2
+ echo " The HTML report is still available at: $INPUT" >&2
+ exit 1
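The detection order in html-to-pdf.sh boils down to one reusable idea: walk a preference-ordered list of commands and take the first one that exists on `PATH`. A minimal sketch of that chain (`pick_first_available` and `no-such-renderer` are illustrative names, not part of the package):

```shell
#!/bin/sh
# Try each candidate command in order; print and use the first that exists.
pick_first_available() {
  for cmd in "$@"; do
    # command -v is silent-checked here; echo the name only on a hit.
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      return 0
    fi
  done
  return 1
}

# "no-such-renderer" is deliberately missing; "sh" exists on any POSIX system,
# so the chain should settle on it.
RENDERER=$(pick_first_available no-such-renderer sh) || RENDERER=none
echo "renderer: $RENDERER"
```

The `|| RENDERER=none` guard mirrors the script's final branch: when every strategy fails, fall through to an explicit "nothing available" path instead of dying mid-chain under `set -e`.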