@braedenbuilds/crawl-sim 1.3.1 → 1.4.0
- package/.claude-plugin/marketplace.json +1 -1
- package/.claude-plugin/plugin.json +1 -1
- package/package.json +1 -1
- package/skills/crawl-sim/SKILL.md +42 -2
- package/skills/crawl-sim/profiles/chatgpt-user.json +8 -3
- package/skills/crawl-sim/profiles/claude-searchbot.json +7 -2
- package/skills/crawl-sim/profiles/claude-user.json +8 -3
- package/skills/crawl-sim/profiles/claudebot.json +7 -2
- package/skills/crawl-sim/profiles/googlebot.json +4 -2
- package/skills/crawl-sim/profiles/gptbot.json +7 -2
- package/skills/crawl-sim/profiles/oai-searchbot.json +7 -2
- package/skills/crawl-sim/profiles/perplexity-user.json +7 -3
- package/skills/crawl-sim/profiles/perplexitybot.json +7 -3
- package/skills/crawl-sim/scripts/compute-score.sh +10 -4
- package/skills/crawl-sim/scripts/fetch-as-bot.sh +12 -2
- package/skills/crawl-sim/scripts/generate-compare-html.sh +158 -0
- package/skills/crawl-sim/scripts/generate-report-html.sh +148 -0
- package/skills/crawl-sim/scripts/html-to-pdf.sh +85 -0
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@braedenbuilds/crawl-sim",
-  "version": "1.3.1",
+  "version": "1.4.0",
   "description": "Agent-native multi-bot web crawler simulator. See your site through the eyes of Googlebot, GPTBot, ClaudeBot, and PerplexityBot.",
   "bin": {
     "crawl-sim": "bin/install.js"
package/skills/crawl-sim/SKILL.md CHANGED
@@ -31,6 +31,9 @@ Keep status lines short, active, and specific to this URL. Never use the same se
 /crawl-sim <url> --bot gptbot                   # single bot
 /crawl-sim <url> --category structured-data     # category deep dive
 /crawl-sim <url> --json                         # JSON output only (for CI)
+/crawl-sim <url> --pdf                          # audit + PDF report to Desktop
+/crawl-sim <url> --compare <url2>               # side-by-side comparison of two sites
+/crawl-sim <url> --compare <url2> --pdf         # comparison + PDF report
 ```
 
 ## Prerequisites — check once at the start
@@ -214,11 +217,17 @@ Then produce **prioritized findings** ranked by total point impact across bots:
 - **Framework detection.** Scan the HTML body for signals: `<meta name="next-head-count">` or `_next/static` → Next.js (Pages Router or App Router respectively), `<div id="__nuxt">` → Nuxt, `<div id="app">` with thin content → SPA (Vue/React CSR), `<!--$-->` placeholder tags → React 18 Suspense. Use these to tailor fix recommendations.
 - **No speculation beyond the data.** If server HTML has 0 `<a>` tags inside a component, say "component not present in server HTML" — not "JavaScript hydration failed" unless the diff-render data proves it.
 - **Known extractor limitations.** The bash meta extractor sometimes reports `h1Text: null` even when `h1.count: 1` — that happens when the H1 contains nested tags (`<br>`, `<span>`, `<svg>`). The count is still correct. Don't flag this as a site bug — it's tracked in GitHub issue #4.
+- **robots.txt enforceability.** Each bot in the score output carries `robotsTxtEnforceability` — one of `enforced`, `advisory_only`, or `stealth_risk`. When robots blocks a bot:
+  - `enforced`: The block works. State it directly: *"GPTBot is blocked by robots.txt."*
+  - `advisory_only`: The block is unenforceable via robots.txt alone. Flag it: *"robots.txt blocks ChatGPT-User, but OpenAI has stated user-initiated fetches may not respect robots.txt. Network-level enforcement (e.g., Cloudflare WAF rules) is needed to actually block this bot."*
+  - `stealth_risk`: The bot claims compliance but has been caught bypassing. Note: *"PerplexityBot is blocked by robots.txt, but Cloudflare has documented instances of Perplexity using undeclared crawlers with generic user-agent strings to access blocked sites."*
+- **Cloudflare context.** Since July 2025, Cloudflare blocks all AI training crawlers (GPTBot, ClaudeBot, CCBot, etc.) **by default** for new domains (~20% of the web). If a site uses Cloudflare, robots.txt may be redundant for training bots — the CDN blocks them at the network level before they reach the origin. The score output's `cloudflareCategory` field (`ai_crawler`, `ai_search`, `ai_assistant`) indicates which tier each bot falls into.
 - **Per-bot quirks to surface:**
   - Googlebot: renders JS. If `diff-render.sh` was skipped, note that comparison was unavailable and recommend installing Playwright.
   - GPTBot / ClaudeBot / PerplexityBot: `rendersJavaScript: false` at observed confidence — flag any server-vs-rendered delta as invisible-to-AI content.
-  - `chatgpt-user` / `perplexity-user`:
-
+  - `chatgpt-user` / `perplexity-user`: `robotsTxtEnforceability: advisory_only`. Blocking these via robots.txt alone has no effect — always flag this in findings.
+  - `claude-user`: Anthropic is notably stricter — commits to respecting robots.txt even for user-initiated fetches (`robotsTxtEnforceability: enforced`).
+  - PerplexityBot: `robotsTxtEnforceability: stealth_risk` — third-party and Cloudflare reports of stealth/undeclared crawling. Mention if relevant, don't assert.
 
 After findings, write a **Summary** paragraph: what's working well, biggest wins, confidence caveats. Keep it short — two to three sentences.
 
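The enforceability tiers described above can be turned into narrative framing mechanically. A minimal sketch, assuming only the `robotsTxtEnforceability` field name from the diff; the stub file and its shape are illustrative, not the real score-output schema:

```shell
# Hypothetical score-output stub carrying only the field used here.
cat > /tmp/score-sample.json <<'EOF'
{"bots":{"gptbot":{"robotsTxtEnforceability":"enforced"},"chatgpt-user":{"robotsTxtEnforceability":"advisory_only"},"perplexitybot":{"robotsTxtEnforceability":"stealth_risk"}}}
EOF

# Map each bot's tier to the framing the SKILL.md narrative calls for.
FRAMING=$(jq -r '.bots | to_entries[] | .value.robotsTxtEnforceability as $e |
  "\(.key): " + (if $e == "enforced" then "robots.txt block is effective"
    elif $e == "advisory_only" then "robots.txt alone is unenforceable; needs network-level blocking"
    elif $e == "stealth_risk" then "claims compliance but has been observed bypassing blocks"
    else "enforceability unknown" end)' /tmp/score-sample.json)
echo "$FRAMING"
```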
@@ -233,6 +242,37 @@ After findings, write a **Summary** paragraph: what's working well, biggest wins
 - If `jq` or `curl` is missing, exit with install instructions.
 - If `diff-render.sh` skips, the narrative must note that per-bot differentiation is reduced.
 
+## PDF Report (`--pdf`)
+
+When the user passes `--pdf`, after the narrative output, generate a PDF report:
+
+```bash
+"$SKILL_DIR/scripts/generate-report-html.sh" ./crawl-sim-report.json "$RUN_DIR/report.html"
+"$SKILL_DIR/scripts/html-to-pdf.sh" "$RUN_DIR/report.html" "$HOME/Desktop/crawl-sim-audit.pdf"
+```
+
+Tell the user where the PDF was saved. If `html-to-pdf.sh` fails (no Chrome or Playwright), the HTML file is still available — tell the user and suggest installing a renderer.
+
+## Comparative Audit (`--compare <url2>`)
+
+When the user passes `--compare <url2>`, run two full audits and produce a side-by-side report:
+
+1. Run the complete 5-stage pipeline for `<url>` — save report as `./crawl-sim-report-a.json`
+2. Run the complete 5-stage pipeline for `<url2>` — save report as `./crawl-sim-report-b.json`
+3. Generate the comparison:
+
+```bash
+"$SKILL_DIR/scripts/generate-compare-html.sh" ./crawl-sim-report-a.json ./crawl-sim-report-b.json "$RUN_DIR/compare.html"
+```
+
+4. If `--pdf` was also passed:
+
+```bash
+"$SKILL_DIR/scripts/html-to-pdf.sh" "$RUN_DIR/compare.html" "$HOME/Desktop/crawl-sim-compare.pdf"
+```
+
+The narrative for a comparison should lead with: which site wins overall, by how many points, and in which categories. Then highlight the biggest deltas — what Site A does better, what Site B does better, and what both share.
+
 ## Cleanup
 
 `$RUN_DIR` is small and informative — leave it in place and print the path. The user may want to inspect the raw JSON for any of the 23+ intermediate files.
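The "biggest deltas" the comparison narrative leads with can be computed directly from the two report files. A sketch under stated assumptions: the stub reports below carry only the `url`, `overall.score`, and per-category `score` fields referenced in this SKILL.md; the real reports have many more:

```shell
# Two hypothetical report stubs, Site A and Site B.
cat > /tmp/report-a.json <<'EOF'
{"url":"https://a.example","overall":{"score":82},"categories":{"structuredData":{"score":90},"aiReadiness":{"score":70}}}
EOF
cat > /tmp/report-b.json <<'EOF'
{"url":"https://b.example","overall":{"score":75},"categories":{"structuredData":{"score":60},"aiReadiness":{"score":80}}}
EOF

# Overall delta plus per-category deltas (Site A minus Site B).
SUMMARY=$(jq -r --slurpfile b /tmp/report-b.json '
  "overall: \(.overall.score - $b[0].overall.score)",
  (.categories | to_entries[] | "\(.key): \(.value.score - $b[0].categories[.key].score)")
' /tmp/report-a.json)
echo "$SUMMARY"
```

Positive deltas are categories Site A wins, negative ones Site B wins; this is the same `--slurpfile` pattern `generate-compare-html.sh` uses.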
package/skills/crawl-sim/profiles/chatgpt-user.json CHANGED
@@ -4,7 +4,7 @@
   "vendor": "OpenAI",
   "userAgent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot",
   "robotsTxtToken": "ChatGPT-User",
-  "purpose": "
+  "purpose": "user_retrieval",
   "rendersJavaScript": "unknown",
   "respectsRobotsTxt": "partial",
   "crawlDelaySupported": "unknown",
@@ -23,6 +23,11 @@
     }
   },
   "lastVerified": "2026-04-11",
-  "relatedBots": [
-
+  "relatedBots": [
+    "gptbot",
+    "oai-searchbot"
+  ],
+  "notes": "Not used for automatic crawling. Not used to determine search appearance. User-initiated fetches in ChatGPT and Custom GPTs.",
+  "cloudflareCategory": "ai_assistant",
+  "robotsTxtEnforceability": "advisory_only"
 }
package/skills/crawl-sim/profiles/claude-searchbot.json CHANGED
@@ -23,6 +23,11 @@
     }
   },
   "lastVerified": "2026-04-11",
-  "relatedBots": [
-
+  "relatedBots": [
+    "claudebot",
+    "claude-user"
+  ],
+  "notes": "Navigates the web to improve search result quality. Focused on search indexing, not training.",
+  "cloudflareCategory": "ai_search",
+  "robotsTxtEnforceability": "enforced"
 }
package/skills/crawl-sim/profiles/claude-user.json CHANGED
@@ -4,7 +4,7 @@
   "vendor": "Anthropic",
   "userAgent": "Claude-User",
   "robotsTxtToken": "Claude-User",
-  "purpose": "
+  "purpose": "user_retrieval",
   "rendersJavaScript": "unknown",
   "respectsRobotsTxt": true,
   "crawlDelaySupported": "unknown",
@@ -23,6 +23,11 @@
     }
   },
   "lastVerified": "2026-04-11",
-  "relatedBots": [
-
+  "relatedBots": [
+    "claudebot",
+    "claude-searchbot"
+  ],
+  "notes": "When individuals ask questions to Claude, it may access websites. Blocking prevents Claude from retrieving content in response to user queries.",
+  "cloudflareCategory": "ai_assistant",
+  "robotsTxtEnforceability": "enforced"
 }
package/skills/crawl-sim/profiles/claudebot.json CHANGED
@@ -23,6 +23,11 @@
     }
   },
   "lastVerified": "2026-04-11",
-  "relatedBots": [
-
+  "relatedBots": [
+    "claude-user",
+    "claude-searchbot"
+  ],
+  "notes": "Collects web content that could potentially contribute to AI model training. Crawl-delay explicitly supported (non-standard). Blocking IP addresses will not reliably work.",
+  "cloudflareCategory": "ai_crawler",
+  "robotsTxtEnforceability": "enforced"
 }
package/skills/crawl-sim/profiles/googlebot.json CHANGED
@@ -4,7 +4,7 @@
   "vendor": "Google",
   "userAgent": "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
   "robotsTxtToken": "Googlebot",
-  "purpose": "search
+  "purpose": "search",
   "rendersJavaScript": true,
   "respectsRobotsTxt": true,
   "crawlDelaySupported": false,
@@ -24,5 +24,7 @@
   },
   "lastVerified": "2026-04-11",
   "relatedBots": [],
-  "notes": "Two-phase: initial fetch (HTML) then queued render (headless Chrome via WRS). Evergreen Chromium. Stateless sessions. ~5s default timeout. Mobile-first indexing."
+  "notes": "Two-phase: initial fetch (HTML) then queued render (headless Chrome via WRS). Evergreen Chromium. Stateless sessions. ~5s default timeout. Mobile-first indexing.",
+  "cloudflareCategory": "search_engine",
+  "robotsTxtEnforceability": "enforced"
 }
package/skills/crawl-sim/profiles/gptbot.json CHANGED
@@ -23,6 +23,11 @@
     }
   },
   "lastVerified": "2026-04-11",
-  "relatedBots": [
-
+  "relatedBots": [
+    "oai-searchbot",
+    "chatgpt-user"
+  ],
+  "notes": "Disallowing GPTBot indicates a site's content should not be used in training generative AI foundation models.",
+  "cloudflareCategory": "ai_crawler",
+  "robotsTxtEnforceability": "enforced"
 }
package/skills/crawl-sim/profiles/oai-searchbot.json CHANGED
@@ -23,6 +23,11 @@
     }
   },
   "lastVerified": "2026-04-11",
-  "relatedBots": [
-
+  "relatedBots": [
+    "gptbot",
+    "chatgpt-user"
+  ],
+  "notes": "Sites opted out of OAI-SearchBot will not be shown in ChatGPT search answers, though can still appear as navigational links.",
+  "cloudflareCategory": "ai_search",
+  "robotsTxtEnforceability": "enforced"
 }
package/skills/crawl-sim/profiles/perplexity-user.json CHANGED
@@ -4,7 +4,7 @@
   "vendor": "Perplexity",
   "userAgent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)",
   "robotsTxtToken": "Perplexity-User",
-  "purpose": "
+  "purpose": "user_retrieval",
   "rendersJavaScript": "unknown",
   "respectsRobotsTxt": false,
   "crawlDelaySupported": "unknown",
@@ -23,6 +23,10 @@
     }
   },
   "lastVerified": "2026-04-11",
-  "relatedBots": [
-
+  "relatedBots": [
+    "perplexitybot"
+  ],
+  "notes": "Supports user actions within Perplexity. Not used for web crawling or AI training. Generally ignores robots.txt since fetches are user-initiated.",
+  "cloudflareCategory": "ai_assistant",
+  "robotsTxtEnforceability": "advisory_only"
 }
package/skills/crawl-sim/profiles/perplexitybot.json CHANGED
@@ -4,7 +4,7 @@
   "vendor": "Perplexity",
   "userAgent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)",
   "robotsTxtToken": "PerplexityBot",
-  "purpose": "search
+  "purpose": "search",
   "rendersJavaScript": false,
   "respectsRobotsTxt": true,
   "crawlDelaySupported": "unknown",
@@ -23,6 +23,10 @@
     }
   },
   "lastVerified": "2026-04-11",
-  "relatedBots": [
-
+  "relatedBots": [
+    "perplexity-user"
+  ],
+  "notes": "Designed to surface and link websites in search results on Perplexity. NOT used to crawl content for AI foundation models. Changes may take up to 24 hours to reflect.",
+  "cloudflareCategory": "ai_search",
+  "robotsTxtEnforceability": "stealth_risk"
 }
package/skills/crawl-sim/scripts/compute-score.sh CHANGED
@@ -312,14 +312,16 @@ for bot_id in $BOTS; do
     continue
   fi
 
-  # Batch-read fields from fetch file
-  read -r STATUS TOTAL_TIME SERVER_WORD_COUNT RENDERS_JS <<< \
+  # Batch-read fields from fetch file
+  read -r STATUS TOTAL_TIME SERVER_WORD_COUNT RENDERS_JS PURPOSE_TIER ROBOTS_ENFORCE <<< \
     "$(jq -r '[
       (.status // 0),
       (.timing.total // 0),
      (.wordCount // 0),
-      (.bot.rendersJavaScript | if . == null then "unknown" else tostring end)
-
+      (.bot.rendersJavaScript | if . == null then "unknown" else tostring end),
+      (.bot.purpose // "unknown"),
+      (.bot.robotsTxtEnforceability // "unknown")
+    ] | @tsv' "$FETCH" 2>/dev/null || echo "0 0 0 unknown unknown unknown")"
 
   ROBOTS_ALLOWED=$(jq -r '.allowed // false | tostring' "$ROBOTS" 2>/dev/null || echo "false")
 
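The batch-read pattern this hunk extends (one `jq` call emitting a `@tsv` row, split by `read -r`) can be exercised standalone. A sketch with a hypothetical fetch-file stub; the field names follow the jq filter in the diff:

```shell
# Hypothetical fetch-output stub with only the fields the filter reads.
cat > /tmp/fetch-sample.json <<'EOF'
{"status":200,"timing":{"total":0.42},"wordCount":1234,"bot":{"rendersJavaScript":false,"purpose":"training","robotsTxtEnforceability":"enforced"}}
EOF

# @tsv emits one tab-separated row; read splits it into six variables.
read -r STATUS TOTAL_TIME WORDS RENDERS_JS PURPOSE ENFORCE <<< \
  "$(jq -r '[
    (.status // 0),
    (.timing.total // 0),
    (.wordCount // 0),
    (.bot.rendersJavaScript | if . == null then "unknown" else tostring end),
    (.bot.purpose // "unknown"),
    (.bot.robotsTxtEnforceability // "unknown")
  ] | @tsv' /tmp/fetch-sample.json 2>/dev/null || echo "0 0 0 unknown unknown unknown")"

echo "$STATUS $PURPOSE $ENFORCE"
```

Note the `tostring` guard: `@tsv` needs string or number elements, so the boolean `rendersJavaScript` is stringified first, and `// "unknown"` keeps the row six columns wide even when a profile lacks the new fields.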
@@ -586,6 +588,8 @@ for bot_id in $BOTS; do
     --arg id "$bot_id" \
     --arg name "$BOT_NAME" \
     --arg rendersJs "$RENDERS_JS" \
+    --arg purpose "$PURPOSE_TIER" \
+    --arg robotsEnforce "$ROBOTS_ENFORCE" \
     --argjson score "$BOT_SCORE" \
     --arg grade "$BOT_GRADE" \
     --argjson acc "$ACC" \
@@ -605,6 +609,8 @@ for bot_id in $BOTS; do
       id: $id,
       name: $name,
       rendersJavaScript: (if $rendersJs == "true" then true elif $rendersJs == "false" then false else $rendersJs end),
+      purpose: $purpose,
+      robotsTxtEnforceability: $robotsEnforce,
       score: $score,
       grade: $grade,
       visibility: {
package/skills/crawl-sim/scripts/fetch-as-bot.sh CHANGED
@@ -15,6 +15,8 @@ BOT_ID=$(jq -r '.id' "$PROFILE")
 BOT_NAME=$(jq -r '.name' "$PROFILE")
 UA=$(jq -r '.userAgent' "$PROFILE")
 RENDERS_JS=$(jq -r '.rendersJavaScript' "$PROFILE")
+PURPOSE=$(jq -r '.purpose // "unknown"' "$PROFILE")
+ROBOTS_ENFORCE=$(jq -r '.robotsTxtEnforceability // "unknown"' "$PROFILE")
 
 TMPDIR="${TMPDIR:-/tmp}"
 HEADERS_FILE=$(mktemp "$TMPDIR/crawlsim-headers.XXXXXX")
@@ -48,6 +50,8 @@ if [ "$CURL_EXIT" -ne 0 ]; then
     --arg botName "$BOT_NAME" \
     --arg ua "$UA" \
     --arg rendersJs "$RENDERS_JS" \
+    --arg purpose "$PURPOSE" \
+    --arg robotsEnforce "$ROBOTS_ENFORCE" \
     --arg error "$CURL_ERR" \
     --argjson exitCode "$CURL_EXIT" \
     '{
@@ -56,7 +60,9 @@ if [ "$CURL_EXIT" -ne 0 ]; then
       id: $botId,
       name: $botName,
       userAgent: $ua,
-      rendersJavaScript: (if $rendersJs == "true" then true elif $rendersJs == "false" then false else $rendersJs end)
+      rendersJavaScript: (if $rendersJs == "true" then true elif $rendersJs == "false" then false else $rendersJs end),
+      purpose: $purpose,
+      robotsTxtEnforceability: $robotsEnforce
     },
     fetchFailed: true,
     error: $error,
@@ -121,6 +127,8 @@ jq -n \
   --arg botName "$BOT_NAME" \
   --arg ua "$UA" \
   --arg rendersJs "$RENDERS_JS" \
+  --arg purpose "$PURPOSE" \
+  --arg robotsEnforce "$ROBOTS_ENFORCE" \
   --argjson status "$STATUS" \
   --argjson totalTime "$TOTAL_TIME" \
   --argjson ttfb "$TTFB" \
@@ -137,7 +145,9 @@ jq -n \
     id: $botId,
     name: $botName,
     userAgent: $ua,
-    rendersJavaScript: (if $rendersJs == "true" then true elif $rendersJs == "false" then false else $rendersJs end)
+    rendersJavaScript: (if $rendersJs == "true" then true elif $rendersJs == "false" then false else $rendersJs end),
+    purpose: $purpose,
+    robotsTxtEnforceability: $robotsEnforce
   },
   status: $status,
   timing: { total: $totalTime, ttfb: $ttfb },
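The fetch-as-bot.sh hunks thread the two new profile fields through `jq -n --arg` into the bot object. The same pattern in isolation, with hypothetical stand-in values (the variable contents here are examples, not read from a real profile):

```shell
# Stand-in values; in fetch-as-bot.sh these come from the bot profile JSON.
PURPOSE="user_retrieval"
ROBOTS_ENFORCE="advisory_only"
RENDERS_JS="unknown"

# --arg always binds a string, so the tri-state rendersJavaScript field is
# converted back to a boolean where possible and passed through otherwise.
BOT_JSON=$(jq -n \
  --arg purpose "$PURPOSE" \
  --arg robotsEnforce "$ROBOTS_ENFORCE" \
  --arg rendersJs "$RENDERS_JS" \
  '{
    rendersJavaScript: (if $rendersJs == "true" then true elif $rendersJs == "false" then false else $rendersJs end),
    purpose: $purpose,
    robotsTxtEnforceability: $robotsEnforce
  }')
echo "$BOT_JSON"
```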
package/skills/crawl-sim/scripts/generate-compare-html.sh ADDED
@@ -0,0 +1,158 @@
+#!/usr/bin/env bash
+set -eu
+
+# generate-compare-html.sh — Generate a side-by-side comparison HTML from two crawl-sim reports
+# Usage: generate-compare-html.sh <report-a.json> <report-b.json> [output.html]
+
+REPORT_A="${1:?Usage: generate-compare-html.sh <report-a.json> <report-b.json> [output.html]}"
+REPORT_B="${2:?Usage: generate-compare-html.sh <report-a.json> <report-b.json> [output.html]}"
+OUTPUT="${3:-}"
+
+for f in "$REPORT_A" "$REPORT_B"; do
+  [ -f "$f" ] || { echo "Error: report not found: $f" >&2; exit 1; }
+done
+
+# Extract key data from both reports
+URL_A=$(jq -r '.url' "$REPORT_A")
+URL_B=$(jq -r '.url' "$REPORT_B")
+SCORE_A=$(jq -r '.overall.score' "$REPORT_A")
+SCORE_B=$(jq -r '.overall.score' "$REPORT_B")
+GRADE_A=$(jq -r '.overall.grade' "$REPORT_A")
+GRADE_B=$(jq -r '.overall.grade' "$REPORT_B")
+PARITY_A=$(jq -r '.parity.score' "$REPORT_A")
+PARITY_B=$(jq -r '.parity.score' "$REPORT_B")
+
+# Build category comparison rows
+CAT_COMPARE=$(jq -r --slurpfile b "$REPORT_B" '
+  .categories | to_entries[] |
+  . as $cat |
+  ($b[0].categories[$cat.key]) as $bcat |
+  (if $cat.value.score > $bcat.score then "winner-a"
+   elif $cat.value.score < $bcat.score then "winner-b"
+   else "tie" end) as $cls |
+  ($cat.value.score - $bcat.score) as $delta |
+  "<tr class=\"\($cls)\"><td>\($cat.key)</td>" +
+  "<td>\($cat.value.score) (\($cat.value.grade))</td>" +
+  "<td>\($bcat.score) (\($bcat.grade))</td>" +
+  "<td>\(if $delta > 0 then "+\($delta)" elif $delta < 0 then "\($delta)" else "=" end)</td></tr>"
+' "$REPORT_A")
+
+# Build per-bot comparison (using the 4 main bots)
+BOT_COMPARE=$(jq -r --slurpfile b "$REPORT_B" '
+  . as $a | ["googlebot", "gptbot", "claudebot", "perplexitybot"] | .[] |
+  . as $id |
+  ($a.bots[$id] // {score: 0, grade: "N/A"}) as $ba |
+  ($b[0].bots[$id] // {score: 0, grade: "N/A"}) as $bb |
+  ($ba.score - $bb.score) as $delta |
+  "<tr><td>\($id)</td>" +
+  "<td>\($ba.score) (\($ba.grade))</td>" +
+  "<td>\($bb.score) (\($bb.grade))</td>" +
+  "<td>\(if $delta > 0 then "+\($delta)" elif $delta < 0 then "\($delta)" else "=" end)</td></tr>"
+' "$REPORT_A")
+
+# Determine overall winner
+if [ "$SCORE_A" -gt "$SCORE_B" ]; then
+  WINNER="Site A leads by $((SCORE_A - SCORE_B)) points"
+elif [ "$SCORE_B" -gt "$SCORE_A" ]; then
+  WINNER="Site B leads by $((SCORE_B - SCORE_A)) points"
+else
+  WINNER="Both sites tied at ${SCORE_A}/100"
+fi
+
+# Count category wins
+WINS_A=$(jq --slurpfile b "$REPORT_B" '
+  [.categories | to_entries[] | select(.value.score > ($b[0].categories[.key].score))] | length
+' "$REPORT_A")
+WINS_B=$(jq --slurpfile b "$REPORT_B" '
+  [.categories | to_entries[] | select(.value.score < ($b[0].categories[.key].score))] | length
+' "$REPORT_A")
+
+HTML=$(cat <<HTMLEOF
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<title>crawl-sim Comparison</title>
+<style>
+@page { size: A4 landscape; margin: 15mm; }
+* { box-sizing: border-box; margin: 0; padding: 0; }
+body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; color: #1a1a1a; line-height: 1.5; padding: 40px; max-width: 1100px; margin: 0 auto; }
+h1 { font-size: 24px; margin-bottom: 4px; }
+.subtitle { color: #666; font-size: 13px; margin-bottom: 24px; }
+.vs-hero { display: grid; grid-template-columns: 1fr auto 1fr; gap: 20px; align-items: center; margin-bottom: 32px; }
+.site-card { background: #f8f9fa; border-radius: 12px; padding: 24px; text-align: center; }
+.site-card.winner { background: #e8f5e9; border: 2px solid #27ae60; }
+.site-score { font-size: 56px; font-weight: 800; line-height: 1; }
+.site-grade { font-size: 36px; font-weight: 700; color: #2d7d46; }
+.site-url { font-size: 12px; color: #666; word-break: break-all; margin-top: 8px; }
+.vs { font-size: 32px; font-weight: 800; color: #999; }
+.verdict { text-align: center; font-size: 16px; font-weight: 600; margin-bottom: 24px; padding: 12px; background: #f0f0f0; border-radius: 8px; }
+table { width: 100%; border-collapse: collapse; margin-bottom: 24px; font-size: 13px; }
+th { background: #1a1a1a; color: white; padding: 8px 12px; text-align: left; }
+td { padding: 8px 12px; border-bottom: 1px solid #e0e0e0; }
+tr:nth-child(even) { background: #f8f9fa; }
+.winner-a td:nth-child(2) { color: #27ae60; font-weight: 600; }
+.winner-b td:nth-child(3) { color: #27ae60; font-weight: 600; }
+.winner-a td:last-child { color: #27ae60; }
+.winner-b td:last-child { color: #c0392b; }
+h2 { font-size: 18px; margin: 24px 0 12px; border-bottom: 2px solid #1a1a1a; padding-bottom: 4px; }
+.footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #e0e0e0; font-size: 11px; color: #999; }
+@media print { body { padding: 0; } }
+</style>
+</head>
+<body>
+
+<h1>crawl-sim — Comparative Audit</h1>
+<div class="subtitle">Generated $(date -u +"%Y-%m-%d %H:%M UTC")</div>
+
+<div class="vs-hero">
+  <div class="site-card$([ "$SCORE_A" -ge "$SCORE_B" ] && echo ' winner' || echo '')">
+    <div style="font-size:12px;font-weight:600;color:#666;margin-bottom:8px">SITE A</div>
+    <div class="site-score">${SCORE_A}</div>
+    <div class="site-grade">${GRADE_A}</div>
+    <div class="site-url">${URL_A}</div>
+  </div>
+  <div class="vs">VS</div>
+  <div class="site-card$([ "$SCORE_B" -gt "$SCORE_A" ] && echo ' winner' || echo '')">
+    <div style="font-size:12px;font-weight:600;color:#666;margin-bottom:8px">SITE B</div>
+    <div class="site-score">${SCORE_B}</div>
+    <div class="site-grade">${GRADE_B}</div>
+    <div class="site-url">${URL_B}</div>
+  </div>
+</div>
+
+<div class="verdict">${WINNER} · Site A wins ${WINS_A} categories, Site B wins ${WINS_B}</div>
+
+<h2>Category Breakdown</h2>
+<table>
+<tr><th>Category</th><th>Site A</th><th>Site B</th><th>Delta</th></tr>
+${CAT_COMPARE}
+<tr style="font-weight:600;border-top:2px solid #1a1a1a">
+  <td>Content Parity</td>
+  <td>${PARITY_A}</td>
+  <td>${PARITY_B}</td>
+  <td>$([ "$PARITY_A" -gt "$PARITY_B" ] 2>/dev/null && echo "+$((PARITY_A - PARITY_B))" || ([ "$PARITY_B" -gt "$PARITY_A" ] 2>/dev/null && echo "-$((PARITY_B - PARITY_A))" || echo "="))</td>
+</tr>
+</table>
+
+<h2>Per-Bot Scores</h2>
+<table>
+<tr><th>Bot</th><th>Site A</th><th>Site B</th><th>Delta</th></tr>
+${BOT_COMPARE}
+</table>
+
+<div class="footer">
+Generated by crawl-sim v1.4.0 · <a href="https://github.com/BraedenBDev/crawl-sim">github.com/BraedenBDev/crawl-sim</a>
+</div>
+
+</body>
+</html>
+HTMLEOF
+)
+
+if [ -n "$OUTPUT" ]; then
+  printf '%s' "$HTML" > "$OUTPUT"
+  printf '[generate-compare-html] wrote %s\n' "$OUTPUT" >&2
+else
+  printf '%s' "$HTML"
+fi
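The category-win counting in this script is worth isolating: the same filter runs twice with the comparison flipped, and ties fall into neither bucket. A sketch with hypothetical report stubs (three categories: one A win, one B win, one tie):

```shell
# Hypothetical report stubs with three categories.
cat > /tmp/ra.json <<'EOF'
{"categories":{"x":{"score":90},"y":{"score":50},"z":{"score":70}}}
EOF
cat > /tmp/rb.json <<'EOF'
{"categories":{"x":{"score":60},"y":{"score":80},"z":{"score":70}}}
EOF

# Count categories where A strictly beats B, then where B strictly beats A.
WINS_A=$(jq --slurpfile b /tmp/rb.json \
  '[.categories | to_entries[] | select(.value.score > ($b[0].categories[.key].score))] | length' /tmp/ra.json)
WINS_B=$(jq --slurpfile b /tmp/rb.json \
  '[.categories | to_entries[] | select(.value.score < ($b[0].categories[.key].score))] | length' /tmp/ra.json)
echo "A wins $WINS_A, B wins $WINS_B"
```

With the stubs above, category `z` is a tie, so the two counts need not sum to the number of categories.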
package/skills/crawl-sim/scripts/generate-report-html.sh ADDED
@@ -0,0 +1,148 @@
+#!/usr/bin/env bash
+set -eu
+
+# generate-report-html.sh — Generate a styled HTML audit report from crawl-sim-report.json
+# Usage: generate-report-html.sh <report.json> [output.html]
+# Output: HTML to stdout (or file if second arg given)
+
+REPORT="${1:?Usage: generate-report-html.sh <report.json> [output.html]}"
+OUTPUT="${2:-}"
+
+if [ ! -f "$REPORT" ]; then
+  echo "Error: report not found: $REPORT" >&2
+  exit 1
+fi
+
+# Extract key data
+URL=$(jq -r '.url' "$REPORT")
+TIMESTAMP=$(jq -r '.timestamp' "$REPORT")
+PAGE_TYPE=$(jq -r '.pageType' "$REPORT")
+OVERALL_SCORE=$(jq -r '.overall.score' "$REPORT")
+OVERALL_GRADE=$(jq -r '.overall.grade' "$REPORT")
+PARITY_SCORE=$(jq -r '.parity.score' "$REPORT")
+PARITY_GRADE=$(jq -r '.parity.grade' "$REPORT")
+PARITY_INTERP=$(jq -r '.parity.interpretation' "$REPORT")
+
+# Build per-bot table rows
+BOT_ROWS=$(jq -r '
+  .bots | to_entries[] |
+  "<tr><td>\(.value.name)</td><td>\(.value.score)</td><td>\(.value.grade)</td>" +
+  "<td>\(.value.categories.accessibility.score)</td>" +
+  "<td>\(.value.categories.contentVisibility.score)</td>" +
+  "<td>\(.value.categories.structuredData.score)</td>" +
+  "<td>\(.value.categories.technicalSignals.score)</td>" +
+  "<td>\(.value.categories.aiReadiness.score)</td>" +
+  "<td>\(.value.purpose // "-")</td>" +
+  "<td class=\"enforce-\(.value.robotsTxtEnforceability // "unknown")\">\(.value.robotsTxtEnforceability // "-")</td></tr>"
+' "$REPORT")
+
+# Build category averages
+CAT_ROWS=$(jq -r '
+  .categories | to_entries[] |
+  "<tr><td>\(.key)</td><td>\(.value.score)</td><td>\(.value.grade)</td></tr>"
+' "$REPORT")
+
+# Build warnings
+WARNINGS_HTML=$(jq -r '
+  if (.warnings | length) > 0 then
+    (.warnings[] | "<div class=\"warning\"><strong>⚠ \(.code)</strong>: \(.message)</div>")
+  else
+    "<div class=\"ok\">No warnings.</div>"
+  end
+' "$REPORT")
+
+# Build structured data details for first bot
+SD_DETAILS=$(jq -r '
+  .bots | to_entries[0].value.categories.structuredData |
+  "<p><strong>Page type:</strong> \(.pageType)</p>" +
+  "<p><strong>Present:</strong> \(.present | join(", "))</p>" +
+  "<p><strong>Missing:</strong> \(if (.missing | length) > 0 then (.missing | join(", ")) else "none" end)</p>" +
+  "<p><strong>Violations:</strong> \(if (.violations | length) > 0 then (.violations | map("\(.kind): \(.schema // .field // "")") | join(", ")) else "none" end)</p>" +
+  "<p><strong>Notes:</strong> \(.notes)</p>"
+' "$REPORT")
+
+# Generate HTML
+HTML=$(cat <<HTMLEOF
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<title>crawl-sim Audit — ${URL}</title>
+<style>
+@page { size: A4; margin: 20mm; }
+* { box-sizing: border-box; margin: 0; padding: 0; }
+body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; color: #1a1a1a; line-height: 1.5; padding: 40px; max-width: 900px; margin: 0 auto; }
+h1 { font-size: 28px; margin-bottom: 4px; }
+.subtitle { color: #666; font-size: 14px; margin-bottom: 24px; }
+.score-hero { display: flex; align-items: center; gap: 24px; background: #f8f9fa; border-radius: 12px; padding: 24px; margin-bottom: 24px; }
+.score-big { font-size: 64px; font-weight: 800; line-height: 1; }
+.grade-big { font-size: 48px; font-weight: 700; color: #2d7d46; }
+.score-meta { font-size: 14px; color: #666; }
+table { width: 100%; border-collapse: collapse; margin-bottom: 24px; font-size: 13px; }
+th { background: #1a1a1a; color: white; padding: 8px 12px; text-align: left; font-weight: 600; }
+td { padding: 8px 12px; border-bottom: 1px solid #e0e0e0; }
+tr:nth-child(even) { background: #f8f9fa; }
+.enforce-advisory_only { color: #c0392b; font-weight: 600; }
+.enforce-stealth_risk { color: #e67e22; font-weight: 600; }
+.enforce-enforced { color: #27ae60; }
+h2 { font-size: 18px; margin: 32px 0 12px; border-bottom: 2px solid #1a1a1a; padding-bottom: 4px; }
+.warning { background: #fff3cd; border-left: 4px solid #ffc107; padding: 12px 16px; margin-bottom: 8px; border-radius: 4px; font-size: 13px; }
+.ok { color: #27ae60; font-size: 13px; }
+.parity { display: flex; gap: 16px; align-items: center; background: #e8f5e9; border-radius: 8px; padding: 16px; margin-bottom: 24px; }
+.parity.low { background: #ffebee; }
+.footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #e0e0e0; font-size: 11px; color: #999; }
+@media print { body { padding: 0; } .score-hero { break-inside: avoid; } table { break-inside: avoid; } }
+</style>
+</head>
+<body>
+
+<h1>crawl-sim — Bot Visibility Audit</h1>
+<div class="subtitle">${URL} · ${TIMESTAMP} · Page type: ${PAGE_TYPE}</div>
+
+<div class="score-hero">
+  <div>
+    <span class="score-big">${OVERALL_SCORE}</span><span style="font-size:24px;color:#666">/100</span>
+  </div>
+  <div>
+    <div class="grade-big">${OVERALL_GRADE}</div>
+    <div class="score-meta">Overall Score</div>
+  </div>
+</div>
+
+<div class="parity${PARITY_SCORE:+ }$([ "$PARITY_SCORE" -lt 50 ] 2>/dev/null && echo 'low' || echo '')">
+  <div><strong>Content Parity:</strong> ${PARITY_SCORE}/100 (${PARITY_GRADE})</div>
+  <div>${PARITY_INTERP}</div>
+</div>
+
+${WARNINGS_HTML}
+
+<h2>Per-Bot Scores</h2>
+<table>
+<tr><th>Bot</th><th>Score</th><th>Grade</th><th>Access</th><th>Content</th><th>Schema</th><th>Technical</th><th>AI</th><th>Purpose</th><th>robots.txt</th></tr>
+${BOT_ROWS}
+</table>
+
+<h2>Category Averages</h2>
+<table>
+<tr><th>Category</th><th>Score</th><th>Grade</th></tr>
+${CAT_ROWS}
+</table>
+
+<h2>Structured Data Details</h2>
+${SD_DETAILS}
+
+<div class="footer">
+Generated by crawl-sim v1.4.0 · <a href="https://github.com/BraedenBDev/crawl-sim">github.com/BraedenBDev/crawl-sim</a>
+</div>
+
+</body>
+</html>
+HTMLEOF
+)
+
+if [ -n "$OUTPUT" ]; then
+  printf '%s' "$HTML" > "$OUTPUT"
+  printf '[generate-report-html] wrote %s\n' "$OUTPUT" >&2
|
|
146
|
+
else
|
|
147
|
+
printf '%s' "$HTML"
|
|
148
|
+
fi
|
|
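The tail of generate-report-html.sh follows a common contract: write to `$OUTPUT` when it is set, otherwise stream the HTML to stdout so callers can pipe it onward. A minimal sketch of that pattern, with illustrative names (`emit_report` is not part of the package):

```shell
#!/usr/bin/env sh
# Sketch of the script's tail: write to a file when a path is given,
# otherwise stream to stdout.
emit_report() {
  html="$1"
  output="$2"   # may be empty
  if [ -n "$output" ]; then
    printf '%s' "$html" > "$output"
    printf '[emit] wrote %s\n' "$output" >&2   # log on stderr, never stdout
  else
    printf '%s' "$html"                        # stdout stays clean for piping
  fi
}
```

Keeping the log on stderr is the important design choice here: piping one script into the next only works if stdout carries nothing but the document itself.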
@@ -0,0 +1,85 @@
+#!/usr/bin/env bash
+set -eu
+
+# html-to-pdf.sh — Convert an HTML file to PDF using the best available renderer.
+# Usage: html-to-pdf.sh <input.html> <output.pdf>
+#
+# Detection order:
+#   1. Chrome/Chromium at known system paths
+#   2. Playwright's bundled Chromium (npx playwright pdf)
+#   3. Neither → exit 1 with instructions
+#
+# This script is intentionally renderer-agnostic. Callers don't need to know
+# which engine is available — they just pass HTML in and get PDF out.
+
+INPUT="${1:?Usage: html-to-pdf.sh <input.html> <output.pdf>}"
+OUTPUT="${2:?Usage: html-to-pdf.sh <input.html> <output.pdf>}"
+
+if [ ! -f "$INPUT" ]; then
+  echo "Error: input file not found: $INPUT" >&2
+  exit 1
+fi
+
+# Convert to file:// URL for Chrome (needs absolute path)
+case "$INPUT" in
+  /*) INPUT_URL="file://$INPUT" ;;
+  *)  INPUT_URL="file://$(pwd)/$INPUT" ;;
+esac
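Headless Chrome needs an absolute `file://` URL, so the `case` above roots relative paths at the current working directory. The same logic as a standalone function (the name `to_file_url` is illustrative, not part of the script):

```shell
#!/usr/bin/env sh
# Absolute paths pass through unchanged; relative paths are rooted at $PWD.
to_file_url() {
  case "$1" in
    /*) printf 'file://%s\n' "$1" ;;
    *)  printf 'file://%s/%s\n' "$(pwd)" "$1" ;;
  esac
}
```

Note this is only a sketch: it does not percent-encode spaces or other URL-special characters, which a hardened version would need to handle.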
+
+# --- Strategy 1: System Chrome/Chromium ---
+
+find_chrome() {
+  # macOS
+  for path in \
+    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
+    "/Applications/Chromium.app/Contents/MacOS/Chromium" \
+    "/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary" \
+    "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser"; do
+    [ -x "$path" ] && echo "$path" && return 0
+  done
+  # Linux / WSL
+  for cmd in google-chrome chromium-browser chromium google-chrome-stable; do
+    command -v "$cmd" >/dev/null 2>&1 && command -v "$cmd" && return 0
+  done
+  return 1
+}
+
+if CHROME=$(find_chrome); then
+  printf '[html-to-pdf] using Chrome: %s\n' "$CHROME" >&2
+  "$CHROME" \
+    --headless \
+    --disable-gpu \
+    --no-sandbox \
+    --print-to-pdf="$OUTPUT" \
+    --no-margins \
+    "$INPUT_URL" 2>/dev/null || true  # don't let set -e abort before the fallback check
+  if [ -s "$OUTPUT" ]; then
+    printf '[html-to-pdf] wrote %s (%s bytes)\n' "$OUTPUT" "$(wc -c < "$OUTPUT" | tr -d ' ')" >&2
+    exit 0
+  fi
+  printf '[html-to-pdf] Chrome produced empty output, trying Playwright fallback\n' >&2
+fi
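`find_chrome` is one instance of a generic "first available binary" probe. The same idea, reduced to its essentials (the function name is illustrative):

```shell
#!/usr/bin/env sh
# Print the resolved path of the first candidate found on PATH; fail if none.
first_available() {
  for cand in "$@"; do
    if path=$(command -v "$cand" 2>/dev/null); then
      printf '%s\n' "$path"
      return 0
    fi
  done
  return 1
}
```

One caveat: `command -v` also reports shell builtins and functions, so a probe like this assumes the candidate names only ever resolve to external binaries.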
+
+# --- Strategy 2: Playwright's bundled Chromium ---
+
+if command -v npx >/dev/null 2>&1; then
+  # Check if playwright is installed (don't auto-install)
+  if npx playwright --version >/dev/null 2>&1; then
+    printf '[html-to-pdf] using Playwright bundled Chromium\n' >&2
+    npx playwright pdf "$INPUT_URL" "$OUTPUT" 2>/dev/null || true  # tolerate failure; fall through to the error below
+    if [ -s "$OUTPUT" ]; then
+      printf '[html-to-pdf] wrote %s (%s bytes)\n' "$OUTPUT" "$(wc -c < "$OUTPUT" | tr -d ' ')" >&2
+      exit 0
+    fi
+    printf '[html-to-pdf] Playwright produced empty output\n' >&2
+  fi
+fi
+
+# --- No renderer available ---
+
+echo "Error: no PDF renderer found." >&2
+echo "  Install one of:" >&2
+echo "    - Google Chrome (recommended — already handles print CSS)" >&2
+echo "    - Playwright: npx playwright install chromium" >&2
+echo "  The HTML report is still available at: $INPUT" >&2
+exit 1
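The overall control flow of html-to-pdf.sh (try each renderer in order, accept only a non-empty PDF, otherwise fail while pointing the caller back at the HTML) can be condensed into one loop. `chrome_render` and `playwright_render` below are hypothetical stand-ins for the two strategies above, not real commands:

```shell
#!/usr/bin/env sh
# Sketch of the fallback chain: the first renderer that produces a
# non-empty file wins; if none does, fail but note the HTML still exists.
render_pdf() {
  input="$1"; output="$2"
  for renderer in chrome_render playwright_render; do
    command -v "$renderer" >/dev/null 2>&1 || continue
    if "$renderer" "$input" "$output" 2>/dev/null && [ -s "$output" ]; then
      return 0
    fi
  done
  echo "Error: no PDF renderer found. HTML still at: $input" >&2
  return 1
}
```

The `[ -s "$output" ]` check mirrors the real script: a renderer that exits 0 but writes zero bytes is still treated as a failure, which is what makes the chain robust against silently broken engines.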