@braedenbuilds/crawl-sim 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
MIT License

Copyright (c) 2026 BraedenBDev

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md ADDED
# crawl-sim

**See your site through the eyes of Googlebot, GPTBot, ClaudeBot, and PerplexityBot.**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](./LICENSE)
[![npm version](https://img.shields.io/npm/v/@braedenbuilds/crawl-sim.svg)](https://www.npmjs.com/package/@braedenbuilds/crawl-sim)
[![Built for Claude Code](https://img.shields.io/badge/built%20for-Claude%20Code-D97757.svg)](https://claude.com/claude-code)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](./CONTRIBUTING.md)

`crawl-sim` is the first open-source, **agent-native multi-bot web crawler simulator**. It audits a URL from the perspective of each major crawler — Google's search bot, OpenAI's GPTBot, Anthropic's ClaudeBot, Perplexity's crawler, and more — then produces a quantified score card, prioritized findings, and structured JSON output.

It ships as a [Claude Code skill](https://docs.claude.com/en/docs/claude-code/skills) backed by standalone shell scripts, so the intelligence lives in the agent and the plumbing stays debuggable.

---

## Why this exists

The crawler-simulation market has a gap. Most tools pick one lane:

| Category | Examples | What they miss |
|----------|----------|----------------|
| **Rendering tools** | Screaming Frog, TametheBot | Googlebot only — no AI crawlers |
| **Monitoring SaaS** | Otterly, ZipTie, Peec | Track citations but don't simulate crawls |
| **Frameworks** | Crawlee, Playwright | Raw building blocks with no bot intelligence |

No existing tool combines **multi-bot simulation + LLM-powered interpretation + quantified scoring** in an agent-native format. `crawl-sim` does.

The concept was validated manually: a curl-as-GPTBot + Claude analysis caught a real SSR bug (`ssr: false` on a dynamic import) that was silently hiding article cards from AI crawlers on a production site.

---

## Table of contents

- [Quick start](#quick-start)
- [Features](#features)
- [Usage](#usage)
- [Scoring system](#scoring-system)
- [Supported bots](#supported-bots)
- [Architecture](#architecture)
- [Contributing](#contributing)
- [License](#license)

---

## Quick start

### In Claude Code (recommended)

```bash
npx @braedenbuilds/crawl-sim install            # installs to ~/.claude/skills/crawl-sim/
npx @braedenbuilds/crawl-sim install --project  # installs to ./.claude/skills/crawl-sim/
```

Then in Claude Code:

```
/crawl-sim https://yoursite.com
```

Claude runs the full pipeline, interprets the results, and returns a score card plus prioritized findings.

> The installed `crawl-sim` bin command is available after `npm install -g @braedenbuilds/crawl-sim` if you prefer a persistent install over `npx`.

### As a standalone CLI

```bash
git clone https://github.com/BraedenBDev/crawl-sim.git
cd crawl-sim
./scripts/fetch-as-bot.sh https://yoursite.com profiles/gptbot.json | jq .
```

### Prerequisites

- **`curl`** — pre-installed on macOS/Linux
- **`jq`** — `brew install jq` (macOS) or `apt install jq` (Linux)
- **`playwright`** (optional) — for Googlebot JS-render comparison: `npx playwright install chromium`

---

## Features

- **Multi-bot simulation.** Nine verified bot profiles covering Google, OpenAI, Anthropic, and Perplexity — including the bot-vs-user-agent distinction (e.g., `ChatGPT-User` officially ignores robots.txt; `claude-user` respects it).
- **Quantified scoring.** Each bot is graded 0–100 across five categories with letter grades A through F, plus a weighted composite score.
- **Agent-native interpretation.** The Claude Code skill reads raw data, identifies root causes (framework signals, hydration boundaries, soft-404s), and recommends specific fixes.
- **Three-layer output.** Terminal score card, prose narrative, and structured JSON — so humans and CI both get what they need.
- **Confidence transparency.** Every claim is tagged `official`, `observed`, or `inferred`. The skill notes when recommendations depend on observed-but-undocumented behavior.
- **Shell-native core.** All checks use only `curl` + `jq`. No Node, no Python, no Docker. Each script is independently invokable.
- **Extensible.** Drop a new profile JSON into `profiles/` and it's auto-discovered.

---

## Usage

### Claude Code skill

```
/crawl-sim https://yoursite.com                             # full audit
/crawl-sim https://yoursite.com --bot gptbot                # single bot
/crawl-sim https://yoursite.com --category structured-data  # category deep-dive
/crawl-sim https://yoursite.com --json                      # JSON only (for CI)
```

Output is a three-layer report:

1. **Score card** — ASCII overview with per-bot and per-category scores.
2. **Narrative audit** — prose findings ranked by point impact, with fix recommendations.
3. **JSON report** — saved to `crawl-sim-report.json` for diffing and automation.

### Direct script invocation

Every script is standalone and outputs JSON to stdout:

```bash
./scripts/fetch-as-bot.sh https://yoursite.com profiles/gptbot.json
./scripts/extract-meta.sh < response.html
./scripts/extract-jsonld.sh < response.html
./scripts/extract-links.sh https://yoursite.com < response.html
./scripts/check-robots.sh https://yoursite.com GPTBot
./scripts/check-llmstxt.sh https://yoursite.com
./scripts/check-sitemap.sh https://yoursite.com
./scripts/compute-score.sh /tmp/audit-data/
```

### CI/CD

```yaml
- name: crawl-sim audit
  run: |
    ./scripts/fetch-as-bot.sh "$DEPLOY_URL" profiles/gptbot.json > /tmp/gptbot.json
    ./scripts/compute-score.sh /tmp/ > /tmp/score.json
    jq -e '.overall.score >= 70' /tmp/score.json
```

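The final line works as a pass/fail gate because `jq -e` sets its exit status from the filter's result. A minimal demonstration with inline JSON (illustrative only — in CI the input comes from `compute-score.sh`):

```shell
# jq -e exits 0 when the result is truthy, and 1 when it is false or null,
# so a score below the threshold fails the CI step.
echo '{"overall":{"score":72}}' | jq -e '.overall.score >= 70' >/dev/null && echo "gate passed"
echo '{"overall":{"score":65}}' | jq -e '.overall.score >= 70' >/dev/null || echo "gate failed"
```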
134
+ ---
135
+
136
+ ## Scoring system
137
+
138
+ Each bot is scored 0–100 across five weighted categories:
139
+
140
+ | Category | Weight | Measures |
141
+ |----------|:------:|----------|
142
+ | **Accessibility** | 25 | robots.txt allows, HTTP 200, response time |
143
+ | **Content Visibility** | 30 | server HTML word count, heading structure, internal links, image alt text |
144
+ | **Structured Data** | 20 | JSON-LD presence, validity, page-appropriate `@type` |
145
+ | **Technical Signals** | 15 | title / description / canonical / OG meta, sitemap inclusion |
146
+ | **AI Readiness** | 10 | `llms.txt` structure, content citability |
147
+
148
+ **Overall composite** weighs bots by reach:
149
+
150
+ - Googlebot **40%** — still the primary search driver
151
+ - GPTBot, ClaudeBot, PerplexityBot — **20% each** — the AI visibility tier
152
+
153
+ **Grade thresholds**
154
+
155
+ | Score | Grade | Meaning |
156
+ |-------|:-----:|---------|
157
+ | 93–100 | A | Fully visible, well-structured, citable |
158
+ | 90–92 | A- | Near-perfect with minor gaps |
159
+ | 80–89 | B / B+ / B- | Visible but missing optimization opportunities |
160
+ | 70–79 | C+ / C / C- | Partially visible, significant gaps |
161
+ | 60–69 | D+ / D / D- | Major issues — limited discoverability |
162
+ | 0–59 | F | Invisible or broken for this bot |
163
+
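The weighting and grading above are simple enough to sketch in shell. This is illustrative only — the real logic lives in `compute-score.sh`, and this sketch collapses the +/- sub-grades:

```shell
# Hypothetical sketch of the scoring math (not the actual compute-score.sh).
# grade: map a 0-100 score to a letter grade per the thresholds table.
grade() {
  if   [ "$1" -ge 93 ]; then echo "A"
  elif [ "$1" -ge 90 ]; then echo "A-"
  elif [ "$1" -ge 80 ]; then echo "B"
  elif [ "$1" -ge 70 ]; then echo "C"
  elif [ "$1" -ge 60 ]; then echo "D"
  else echo "F"
  fi
}

# composite: Googlebot carries 40%; GPTBot, ClaudeBot, PerplexityBot 20% each.
composite() {
  echo $(( ($1 * 40 + $2 * 20 + $3 * 20 + $4 * 20) / 100 ))
}

composite 92 78 78 78   # -> 83, grade B
```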
**The key differentiator:** bots with `rendersJavaScript: false` (GPTBot, ClaudeBot, PerplexityBot) are scored against **server HTML only**. Googlebot can be scored against the rendered DOM via the optional `diff-render.sh`. This surfaces CSR hydration issues that hide content from AI crawlers — exactly the kind of bug SEO tools don't catch because they're built around Googlebot's headless-Chrome behavior.

---

## Supported bots

| Profile | Vendor | Purpose | JS Render | Respects robots.txt |
|---------|--------|---------|:---------:|:-------------------:|
| `googlebot` | Google | Search indexing | **yes** (official) | yes |
| `gptbot` | OpenAI | Model training | no (observed) | yes |
| `oai-searchbot` | OpenAI | ChatGPT search | unknown (inferred) | yes |
| `chatgpt-user` | OpenAI | User fetches | unknown | partial (*) |
| `claudebot` | Anthropic | Model training | no (observed) | yes |
| `claude-user` | Anthropic | User fetches | unknown | yes |
| `claude-searchbot` | Anthropic | Search quality | unknown | yes |
| `perplexitybot` | Perplexity | Search indexing | no (observed) | yes |
| `perplexity-user` | Perplexity | User fetches | unknown | no (*) |

\* Officially documented as ignoring robots.txt for user-initiated fetches.

Every profile is backed by official vendor documentation where possible. See [`research/bot-profiles-verified.md`](./research/bot-profiles-verified.md) for sources and confidence levels. When a claim is `observed` or `inferred` rather than `official`, the skill output notes this transparently.

### Adding a custom bot

Drop a JSON file in `profiles/`. The skill auto-discovers all `*.json` files.

```json
{
  "id": "mybot",
  "name": "MyBot",
  "vendor": "Example Corp",
  "userAgent": "Mozilla/5.0 ... MyBot/1.0",
  "robotsTxtToken": "MyBot",
  "purpose": "search",
  "rendersJavaScript": false,
  "respectsRobotsTxt": true,
  "lastVerified": "2026-04-11"
}
```

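A new profile can be sanity-checked with `jq` before dropping it in. This is a sketch: the key list below is inferred from the example profile rather than an official schema, and `check_profile` is a hypothetical helper:

```shell
# Verify a profile JSON has the fields the pipeline reads (inferred list).
check_profile() {
  jq -e '
    has("id") and has("userAgent") and has("robotsTxtToken")
    and (.rendersJavaScript | type == "boolean")
    and (.respectsRobotsTxt | type == "boolean")
  ' "$1" >/dev/null 2>&1 && echo "profile OK: $1" || echo "profile INVALID: $1"
}

check_profile profiles/mybot.json
```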
---

## Architecture

```
crawl-sim/
├── SKILL.md              # Claude Code orchestrator skill
├── bin/install.js        # npm installer
├── profiles/             # 9 verified bot profiles (JSON)
├── scripts/
│   ├── fetch-as-bot.sh   # curl with bot UA → JSON (status/headers/body/timing)
│   ├── extract-meta.sh   # title, description, OG, headings, images
│   ├── extract-jsonld.sh # JSON-LD @type detection
│   ├── extract-links.sh  # internal/external link classification
│   ├── check-robots.sh   # robots.txt parsing per UA token
│   ├── check-llmstxt.sh  # llms.txt presence and structure
│   ├── check-sitemap.sh  # sitemap.xml URL inclusion
│   ├── diff-render.sh    # optional Playwright server-vs-rendered comparison
│   └── compute-score.sh  # aggregates all checks → per-bot + per-category scores
├── research/             # Verified bot data sources
└── docs/specs/           # Design docs
```

The shell scripts are the plumbing. The Claude Code skill is the intelligence — it reads the raw JSON, understands framework context (Next.js, Nuxt, SPAs), identifies root causes, and writes actionable recommendations.

---

## Contributing

Contributions are welcome! See [CONTRIBUTING.md](./CONTRIBUTING.md) for details on:

- Reporting bugs and requesting features
- Adding or updating bot profiles when vendor docs change
- Writing new check scripts (must be `curl` + `jq` only, must output JSON)
- Running the integration test suite
- Coding standards and commit conventions

Quick principles:

- **Keep the core dependency-free** — `curl` + `jq` only. `diff-render.sh` is the single Playwright exception.
- **Every script outputs valid JSON to stdout** and is testable against a live URL.
- **Cite sources** when adding or updating bot profiles — every behavioral claim needs a vendor doc link or a reproducible observation.

---

## Acknowledgments

- **Bot documentation** from [OpenAI](https://developers.openai.com/api/docs/bots), [Anthropic](https://privacy.claude.com), [Perplexity](https://docs.perplexity.ai/docs/resources/perplexity-crawlers), and [Google Search Central](https://developers.google.com/search/docs).
- **Prior art** in the space: [Dark Visitors](https://darkvisitors.com), [CrawlerCheck](https://crawlercheck.com), [Cloudflare Radar](https://radar.cloudflare.com).
- Built with [Claude Code](https://claude.com/claude-code).

---

## License

[MIT](./LICENSE) © 2026 BraedenBDev

Free for personal and commercial use. If `crawl-sim` helps your project, a GitHub star or a mention is always appreciated.
package/SKILL.md ADDED
---
name: crawl-sim
description: Audit a URL through the eyes of Googlebot, GPTBot, ClaudeBot, and PerplexityBot. Fetches the page as each bot, runs structural checks, compares server HTML vs JS-rendered DOM to differentiate rendering-capable bots from non-rendering ones, then scores and returns a score card + narrative audit + JSON report. Trigger when the user asks to audit a site for AI/search visibility, test how bots see a page, check if content is visible to GPTBot/ClaudeBot/Perplexity, analyze llms.txt / robots.txt / structured data, or says "/crawl-sim".
allowed-tools: Bash, Read, Write
---

# crawl-sim — Multi-Bot Visibility Audit

You are running a per-URL audit that simulates how different web crawlers see a site. You orchestrate shell scripts, interpret the raw data, and produce a three-layer output: (1) a terminal score card, (2) a prose narrative with prioritized findings, (3) a structured JSON report.

## Experience principle

**This tool should feel alive.** Before each stage of the pipeline, emit a one-sentence status line to the user in plain text (not inside a code block). The user should know what's happening without expanding tool-call details. Example cadence:

> Fetching the page as 4 bots in parallel...
>
> Extracting meta, JSON-LD, and links from each response...
>
> Checking robots.txt per bot, plus llms.txt and sitemap...
>
> Comparing server HTML vs Playwright-rendered DOM (this is what differentiates bots)...
>
> Computing scores and finalizing...

Keep status lines short, active, and specific to this URL. Never use the same sentence twice in one run.

## Usage

```
/crawl-sim <url>                             # full audit (default)
/crawl-sim <url> --bot gptbot                # single bot
/crawl-sim <url> --category structured-data  # category deep dive
/crawl-sim <url> --json                      # JSON output only (for CI)
```

## Prerequisites — check once at the start

```bash
command -v curl >/dev/null 2>&1 || { echo "ERROR: curl is required"; exit 1; }
command -v jq >/dev/null 2>&1 || { echo "ERROR: jq is required (brew install jq)"; exit 1; }
```

Locate the skill directory: typically `~/.claude/skills/crawl-sim/` or `.claude/skills/crawl-sim/`. Use `$CLAUDE_PLUGIN_ROOT` if set, otherwise find the directory containing this `SKILL.md`.

## Orchestration — five narrated stages

Split the work into **five Bash invocations**, each with a clear `description` field, and emit a plain-text status line *before* each one. Do not run the whole pipeline in one giant bash block — that makes the tool feel silent.

### Stage 1 — Fetch

Tell the user: "Fetching as Googlebot, GPTBot, ClaudeBot, and PerplexityBot..."

```bash
SKILL_DIR="$HOME/.claude/skills/crawl-sim"   # or wherever this SKILL.md lives
RUN_DIR=$(mktemp -d -t crawl-sim.XXXXXX)
URL="<user-provided-url>"
for bot in googlebot gptbot claudebot perplexitybot; do
  "$SKILL_DIR/scripts/fetch-as-bot.sh" "$URL" "$SKILL_DIR/profiles/${bot}.json" > "$RUN_DIR/fetch-${bot}.json"
done
```

Each fetch emits a `[fetch-as-bot] BotName <- URL` line to stderr that surfaces in the Bash call's output.

If `--bot <id>` was passed, use only that bot. Also optionally include the secondary profiles (`oai-searchbot`, `chatgpt-user`, `claude-user`, `claude-searchbot`, `perplexity-user`) when the user passes `--all`.

### Stage 2 — Extract HTML structure

Tell the user: "Extracting meta, JSON-LD, and links from each bot's view..."

```bash
for bot in googlebot gptbot claudebot perplexitybot; do
  jq -r '.bodyBase64' "$RUN_DIR/fetch-${bot}.json" | base64 -d > "$RUN_DIR/body-${bot}.html"
  "$SKILL_DIR/scripts/extract-meta.sh" "$RUN_DIR/body-${bot}.html" > "$RUN_DIR/meta-${bot}.json"
  "$SKILL_DIR/scripts/extract-jsonld.sh" "$RUN_DIR/body-${bot}.html" > "$RUN_DIR/jsonld-${bot}.json"
  "$SKILL_DIR/scripts/extract-links.sh" "$URL" "$RUN_DIR/body-${bot}.html" > "$RUN_DIR/links-${bot}.json"
done
```

### Stage 3 — Crawler policy checks

Tell the user: "Checking robots.txt for each bot's UA token, plus llms.txt and sitemap.xml..."

```bash
for bot in googlebot gptbot claudebot perplexitybot; do
  token=$(jq -r '.robotsTxtToken' "$SKILL_DIR/profiles/${bot}.json")
  "$SKILL_DIR/scripts/check-robots.sh" "$URL" "$token" > "$RUN_DIR/robots-${bot}.json"
done
"$SKILL_DIR/scripts/check-llmstxt.sh" "$URL" > "$RUN_DIR/llmstxt.json"
"$SKILL_DIR/scripts/check-sitemap.sh" "$URL" > "$RUN_DIR/sitemap.json"
```

### Stage 4 — Render comparison (this is the differentiator)

Tell the user something like: "Comparing server HTML vs Playwright-rendered DOM — this is how Googlebot and the AI bots get scored differently..."

```bash
if [ -x "$SKILL_DIR/scripts/diff-render.sh" ]; then
  "$SKILL_DIR/scripts/diff-render.sh" "$URL" > "$RUN_DIR/diff-render.json" \
    || echo '{"skipped":true,"reason":"diff-render failed"}' > "$RUN_DIR/diff-render.json"
fi
```

Never redirect `diff-render.sh` stderr into the output file — the narration line would corrupt the JSON.

**Why this stage matters:** the score depends on it. `compute-score.sh` uses the rendered word count for bots with `rendersJavaScript: true` (Googlebot) and applies a hydration penalty to bots with `rendersJavaScript: false` (GPTBot, ClaudeBot, PerplexityBot) proportional to how much content is invisible to them. On a site with significant client-side hydration, this is where the bot scores actually diverge. Without this stage, all non-blocked bots would score identically.

If Playwright isn't installed, `diff-render.sh` writes `{"skipped": true, "reason": "..."}` and the scoring falls back to server-HTML-only for all bots — the narrative must acknowledge this: "Per-bot differentiation was limited because JS render comparison was unavailable."

### Stage 5 — Score and aggregate

Tell the user: "Computing per-bot scores and finalizing the report..."

```bash
"$SKILL_DIR/scripts/compute-score.sh" "$RUN_DIR" > "$RUN_DIR/score.json"
cp "$RUN_DIR/score.json" ./crawl-sim-report.json
```

## Output Layer 1 — Score Card (ASCII)

Print a boxed score card to the terminal:

```
╔══════════════════════════════════════════════╗
║  crawl-sim — Bot Visibility Audit            ║
║  <URL>                                       ║
╠══════════════════════════════════════════════╣
║  Overall: <score>/100 (<grade>)              ║
╠══════════════════════════════════════════════╣
║  Googlebot      <s>  <g>  <bar>              ║
║  GPTBot         <s>  <g>  <bar>              ║
║  ClaudeBot      <s>  <g>  <bar>              ║
║  PerplexityBot  <s>  <g>  <bar>              ║
╠══════════════════════════════════════════════╣
║  By Category:                                ║
║  Accessibility       <s>  <g>                ║
║  Content Visibility  <s>  <g>                ║
║  Structured Data     <s>  <g>                ║
║  Technical Signals   <s>  <g>                ║
║  AI Readiness        <s>  <g>                ║
╚══════════════════════════════════════════════╝
```

Progress bars are 20 chars wide using `█` and `░` (each char = 5%).

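The bar maps directly from the 0–100 score. A sketch in plain POSIX shell (`bar` is a hypothetical helper name, not part of the shipped scripts):

```shell
# Render a 20-character progress bar; each █ covers 5 points of score.
bar() {
  filled=$(( $1 / 5 ))
  out=""
  i=0
  while [ "$i" -lt 20 ]; do
    if [ "$i" -lt "$filled" ]; then out="${out}█"; else out="${out}░"; fi
    i=$(( i + 1 ))
  done
  printf '%s\n' "$out"
}

bar 85   # 17 filled blocks, 3 empty
```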
## Output Layer 2 — Narrative Audit

Lead with a **Bot differentiation summary** — state up front whether the bots scored the same or differently, and why. If they scored the same, explicitly say so:

> *"All four bots scored 94/A because the server HTML is complete (2,166 words), robots.txt allows every UA token, and there was no meaningful JS hydration gap (delta 11%, below the 20% penalty threshold). On a clean site like this, crawl-sim's multi-bot angle isn't the headline finding — the category gaps are."*

If they scored differently (the interesting case):

> *"Googlebot scored 92/A because it renders JS and sees the full 3,400-word page. GPTBot, ClaudeBot, and PerplexityBot scored 78/C+ because they only see the 2,100-word server HTML — 1,300 words (38%) of content is invisible to AI crawlers, including the testimonials section and half the product cards."*

Then produce **prioritized findings** ranked by total point impact across bots:

```markdown
### 1. <Title> (−<total> pts across <bot count> bots)

**Affected:** <bot list>
**Category:** <category>
**Observed:** <what the data shows — cite counts, tag names, paths>
**Likely cause:** <if inferable from HTML/framework signals>
**Fix:** <actionable, file-path-specific if possible>
**Impact if fixed:** +<N> points on affected bot scores
```

### Interpretation rules

- **Cross-bot deltas are the headline.** Compare `visibility.effectiveWords` across bots — if Googlebot has significantly more than the AI bots, that's finding #1. The raw delta is in `visibility.missedWordsVsRendered`.
- **Confidence transparency.** If a claim depends on a bot profile's `rendersJavaScript: false` at `observed` confidence (not `official`), note it: *"Based on observed behavior, not official documentation."*
- **Framework detection.** Scan the HTML body for signals: `<meta name="next-head-count">` or `_next/static` → Next.js (Pages Router or App Router respectively), `<div id="__nuxt">` → Nuxt, `<div id="app">` with thin content → SPA (Vue/React CSR), `<!--$-->` placeholder tags → React 18 Suspense. Use these to tailor fix recommendations.
- **No speculation beyond the data.** If server HTML has 0 `<a>` tags inside a component, say "component not present in server HTML" — not "JavaScript hydration failed" unless the diff-render data proves it.
- **Known extractor limitations.** The bash meta extractor sometimes reports `h1Text: null` even when `h1.count: 1` — that happens when the H1 contains nested tags (`<br>`, `<span>`, `<svg>`). The count is still correct. Don't flag this as a site bug — it's tracked in GitHub issue #4.
- **Per-bot quirks to surface:**
  - Googlebot: renders JS. If `diff-render.sh` was skipped, note that comparison was unavailable and recommend installing Playwright.
  - GPTBot / ClaudeBot / PerplexityBot: `rendersJavaScript: false` at observed confidence — flag any server-vs-rendered delta as invisible-to-AI content.
  - `chatgpt-user` / `perplexity-user`: officially ignore robots.txt for user-initiated fetches. Blocking these via robots.txt has no effect.
  - PerplexityBot: third-party reports of stealth/undeclared crawling. Mention if relevant, don't assert.

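The framework signals in the rules above reduce to a few `grep` checks. A sketch (`detect_framework` is a hypothetical helper; real pages often need more patterns, and the `id="app"` match in particular should be paired with a thin-content test before calling the page a SPA):

```shell
# Rough framework fingerprinting from a saved server-HTML body (sketch).
detect_framework() {
  if grep -q 'name="next-head-count"' "$1"; then echo "nextjs-pages"
  elif grep -q '_next/static' "$1"; then echo "nextjs"
  elif grep -q 'id="__nuxt"' "$1"; then echo "nuxt"
  elif grep -q '<!--\$-->' "$1"; then echo "react-suspense"
  elif grep -q 'id="app"' "$1"; then echo "spa"
  else echo "unknown"
  fi
}
```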
After findings, write a **Summary** paragraph: what's working well, biggest wins, confidence caveats. Keep it short — two to three sentences.

## Output Layer 3 — JSON Report

`./crawl-sim-report.json` is written in Stage 5. The schema is stable for diffing across runs. Tell the user the report path at the end and also print the `RUN_DIR` so they can inspect intermediate JSON.

## Error Handling

- If any script fails, include the failure in the narrative — don't silently skip.
- If the target URL returns non-200, report immediately and still run robots.txt / sitemap / llms.txt checks (they don't require the page to load).
- If `jq` or `curl` is missing, exit with install instructions.
- If `diff-render.sh` skips, the narrative must note that per-bot differentiation is reduced.

## Cleanup

`$RUN_DIR` is small and informative — leave it in place and print the path. The user may want to inspect the raw JSON for any of the 23+ intermediate files.
package/bin/install.js ADDED
#!/usr/bin/env node

// crawl-sim installer
// Usage:
//   npx @braedenbuilds/crawl-sim install              → ~/.claude/skills/crawl-sim/
//   npx @braedenbuilds/crawl-sim install --project    → ./.claude/skills/crawl-sim/
//   npx @braedenbuilds/crawl-sim install --dir <path> → <path>/crawl-sim/

'use strict';

const fs = require('fs');
const path = require('path');
const os = require('os');
const { execFileSync } = require('child_process');

const SOURCE_DIR = path.resolve(__dirname, '..');
const SKILL_FILES = ['SKILL.md'];
const SKILL_DIRS = ['profiles', 'scripts'];

function parseArgs(argv) {
  const args = { command: null, project: false, dir: null, help: false };
  for (let i = 0; i < argv.length; i++) {
    const a = argv[i];
    if (a === 'install' || a === 'uninstall') args.command = a;
    else if (a === '--project') args.project = true;
    else if (a === '--dir') args.dir = argv[++i];
    else if (a === '-h' || a === '--help') args.help = true;
  }
  return args;
}

function printHelp() {
  console.log(`
crawl-sim — Multi-bot visibility audit for Claude Code

Usage:
  npx @braedenbuilds/crawl-sim install              Install to ~/.claude/skills/crawl-sim/
  npx @braedenbuilds/crawl-sim install --project    Install to ./.claude/skills/crawl-sim/
  npx @braedenbuilds/crawl-sim install --dir <path> Install to <path>/crawl-sim/

After installing, invoke in Claude Code with: /crawl-sim <url>
`);
}

function resolveTarget(args) {
  if (args.dir) return path.resolve(args.dir, 'crawl-sim');
  if (args.project) return path.resolve(process.cwd(), '.claude', 'skills', 'crawl-sim');
  return path.resolve(os.homedir(), '.claude', 'skills', 'crawl-sim');
}

function checkPrereq(cmd, installHint) {
  try {
    execFileSync(cmd, ['--version'], { stdio: 'ignore' });
    return true;
  } catch {
    console.warn(`  ! ${cmd} not found — ${installHint}`);
    return false;
  }
}

function copyRecursive(src, dest) {
  const stat = fs.statSync(src);
  if (stat.isDirectory()) {
    fs.mkdirSync(dest, { recursive: true });
    for (const entry of fs.readdirSync(src)) {
      copyRecursive(path.join(src, entry), path.join(dest, entry));
    }
  } else {
    fs.copyFileSync(src, dest);
  }
}

function install(target) {
  console.log(`Installing crawl-sim to: ${target}`);

  fs.mkdirSync(target, { recursive: true });

  for (const file of SKILL_FILES) {
    const src = path.join(SOURCE_DIR, file);
    const dest = path.join(target, file);
    if (fs.existsSync(src)) {
      fs.copyFileSync(src, dest);
      console.log(`  ✓ ${file}`);
    } else {
      console.error(`  ✗ missing source: ${file}`);
      process.exit(1);
    }
  }

  for (const dir of SKILL_DIRS) {
    const src = path.join(SOURCE_DIR, dir);
    const dest = path.join(target, dir);
    if (fs.existsSync(src)) {
      if (fs.existsSync(dest)) {
        fs.rmSync(dest, { recursive: true, force: true });
      }
      copyRecursive(src, dest);
      console.log(`  ✓ ${dir}/`);
    } else {
      console.error(`  ✗ missing source: ${dir}/`);
      process.exit(1);
    }
  }

  // Make scripts executable
  const scriptsDir = path.join(target, 'scripts');
  for (const script of fs.readdirSync(scriptsDir)) {
    if (script.endsWith('.sh')) {
      fs.chmodSync(path.join(scriptsDir, script), 0o755);
    }
  }
  console.log(`  ✓ scripts chmod +x`);

  // Prerequisite check
  console.log('\nPrerequisites:');
  const hasCurl = checkPrereq('curl', 'pre-installed on macOS/Linux');
  const hasJq = checkPrereq('jq', 'install with: brew install jq (or: apt install jq)');
  let hasPlaywright = false;
  try {
    execFileSync('npx', ['playwright', '--version'], { stdio: 'ignore' });
    hasPlaywright = true;
  } catch {
    console.warn(`  ! playwright not found — optional, needed only for diff-render.sh`);
    console.warn(`    install with: npx playwright install chromium`);
  }

  if (!hasCurl || !hasJq) {
    console.error('\nMissing required prerequisites. Install them and re-run.');
    process.exit(1);
  }

  console.log(`\n✓ crawl-sim installed to: ${target}`);
  console.log('\nUsage:');
  console.log('  In Claude Code:  /crawl-sim https://yoursite.com');
  console.log('  Direct shell:    ' + path.join(target, 'scripts', 'fetch-as-bot.sh') + ' <url> ' + path.join(target, 'profiles', 'gptbot.json'));
}

function main() {
  const args = parseArgs(process.argv.slice(2));

  if (args.help || !args.command) {
    printHelp();
    process.exit(args.help ? 0 : 1);
  }

  if (args.command === 'install') {
    install(resolveTarget(args));
  } else if (args.command === 'uninstall') {
    const target = resolveTarget(args);
    if (fs.existsSync(target)) {
      fs.rmSync(target, { recursive: true, force: true });
      console.log(`✓ removed ${target}`);
    } else {
      console.log(`Not installed at ${target}`);
    }
  }
}

main();