@braedenbuilds/crawl-sim 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
MIT License

Copyright (c) 2026 BraedenBDev

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md ADDED
# crawl-sim

**See your site through the eyes of Googlebot, GPTBot, ClaudeBot, and PerplexityBot.**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](./LICENSE)
[![npm version](https://img.shields.io/npm/v/@braedenbuilds/crawl-sim.svg)](https://www.npmjs.com/package/@braedenbuilds/crawl-sim)
[![Built for Claude Code](https://img.shields.io/badge/built%20for-Claude%20Code-D97757.svg)](https://claude.com/claude-code)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](./CONTRIBUTING.md)

`crawl-sim` is the first open-source, **agent-native multi-bot web crawler simulator**. It audits a URL from the perspective of each major crawler — Google's search bot, OpenAI's GPTBot, Anthropic's ClaudeBot, Perplexity's crawler, and more — then produces a quantified score card, prioritized findings, and structured JSON output.

It ships as a [Claude Code skill](https://docs.claude.com/en/docs/claude-code/skills) backed by standalone shell scripts, so the intelligence lives in the agent and the plumbing stays debuggable.

---

## Why this exists

The crawler-simulation market has a gap. Most tools pick one lane:

| Category | Examples | What they miss |
|----------|----------|----------------|
| **Rendering tools** | Screaming Frog, TametheBot | Googlebot only — no AI crawlers |
| **Monitoring SaaS** | Otterly, ZipTie, Peec | Track citations but don't simulate crawls |
| **Frameworks** | Crawlee, Playwright | Raw building blocks with no bot intelligence |

No existing tool combines **multi-bot simulation + LLM-powered interpretation + quantified scoring** in an agent-native format. `crawl-sim` does.

The concept was validated manually: a curl-as-GPTBot + Claude analysis caught a real SSR bug (`ssr: false` on a dynamic import) that was silently hiding article cards from AI crawlers on a production site.

---

## Table of contents

- [Quick start](#quick-start)
- [Features](#features)
- [Usage](#usage)
- [Scoring system](#scoring-system)
- [Supported bots](#supported-bots)
- [Architecture](#architecture)
- [Contributing](#contributing)
- [License](#license)

---

## Quick start

### In Claude Code (recommended)

```bash
npx @braedenbuilds/crawl-sim install            # installs to ~/.claude/skills/crawl-sim/
npx @braedenbuilds/crawl-sim install --project  # installs to ./.claude/skills/crawl-sim/
```

Then in Claude Code:

```
/crawl-sim https://yoursite.com
```

Claude runs the full pipeline, interprets the results, and returns a score card plus prioritized findings.

> The installed `crawl-sim` bin command is available after `npm install -g @braedenbuilds/crawl-sim` if you prefer a persistent install over `npx`.

### As a standalone CLI

```bash
git clone https://github.com/BraedenBDev/crawl-sim.git
cd crawl-sim
./scripts/fetch-as-bot.sh https://yoursite.com profiles/gptbot.json | jq .
```

### Prerequisites

- **`curl`** — pre-installed on macOS/Linux
- **`jq`** — `brew install jq` (macOS) or `apt install jq` (Linux)
- **`playwright`** (optional) — for Googlebot JS-render comparison: `npx playwright install chromium`

---

## Features

- **Multi-bot simulation.** Nine verified bot profiles covering Google, OpenAI, Anthropic, and Perplexity — including the bot-vs-user-agent distinction (e.g., `ChatGPT-User` officially ignores robots.txt; `claude-user` respects it).
- **Quantified scoring.** Each bot is graded 0–100 across five categories with letter grades A through F, plus a weighted composite score.
- **Agent-native interpretation.** The Claude Code skill reads raw data, identifies root causes (framework signals, hydration boundaries, soft-404s), and recommends specific fixes.
- **Three-layer output.** Terminal score card, prose narrative, and structured JSON — so humans and CI both get what they need.
- **Confidence transparency.** Every claim is tagged `official`, `observed`, or `inferred`. The skill notes when recommendations depend on observed-but-undocumented behavior.
- **Shell-native core.** All checks use only `curl` + `jq`. No Node, no Python, no Docker. Each script is independently invokable.
- **Extensible.** Drop a new profile JSON into `profiles/` and it's auto-discovered.

---

## Usage

### Claude Code skill

```
/crawl-sim https://yoursite.com                             # full audit
/crawl-sim https://yoursite.com --bot gptbot                # single bot
/crawl-sim https://yoursite.com --category structured-data  # category deep-dive
/crawl-sim https://yoursite.com --json                      # JSON only (for CI)
```

Output is a three-layer report:

1. **Score card** — ASCII overview with per-bot and per-category scores.
2. **Narrative audit** — prose findings ranked by point impact, with fix recommendations.
3. **JSON report** — saved to `crawl-sim-report.json` for diffing and automation.

### Direct script invocation

Every script is standalone and outputs JSON to stdout:

```bash
./scripts/fetch-as-bot.sh https://yoursite.com profiles/gptbot.json
./scripts/extract-meta.sh < response.html
./scripts/extract-jsonld.sh < response.html
./scripts/extract-links.sh https://yoursite.com < response.html
./scripts/check-robots.sh https://yoursite.com GPTBot
./scripts/check-llmstxt.sh https://yoursite.com
./scripts/check-sitemap.sh https://yoursite.com
./scripts/compute-score.sh /tmp/audit-data/
```

### CI/CD

```yaml
- name: crawl-sim audit
  run: |
    ./scripts/fetch-as-bot.sh "$DEPLOY_URL" profiles/gptbot.json > /tmp/gptbot.json
    ./scripts/compute-score.sh /tmp/ > /tmp/score.json
    jq -e '.overall.score >= 70' /tmp/score.json
```

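The final line works as a pass/fail gate because `jq -e` sets its exit status from the filter's result. A minimal demonstration with inline JSON (illustrative only — in CI the input comes from `compute-score.sh`):

```shell
# jq -e exits 0 when the result is truthy, and 1 when it is false or null,
# so a score below the threshold fails the CI step.
echo '{"overall":{"score":72}}' | jq -e '.overall.score >= 70' >/dev/null && echo "gate passed"
echo '{"overall":{"score":65}}' | jq -e '.overall.score >= 70' >/dev/null || echo "gate failed"
```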
134
+ ---
135
+
136
+ ## Scoring system
137
+
138
+ Each bot is scored 0–100 across five weighted categories:
139
+
140
+ | Category | Weight | Measures |
141
+ |----------|:------:|----------|
142
+ | **Accessibility** | 25 | robots.txt allows, HTTP 200, response time |
143
+ | **Content Visibility** | 30 | server HTML word count, heading structure, internal links, image alt text |
144
+ | **Structured Data** | 20 | JSON-LD presence, validity, page-appropriate `@type` |
145
+ | **Technical Signals** | 15 | title / description / canonical / OG meta, sitemap inclusion |
146
+ | **AI Readiness** | 10 | `llms.txt` structure, content citability |
147
+
148
+ **Overall composite** weighs bots by reach:
149
+
150
+ - Googlebot **40%** — still the primary search driver
151
+ - GPTBot, ClaudeBot, PerplexityBot — **20% each** — the AI visibility tier
152
+
153
+ **Grade thresholds**
154
+
155
+ | Score | Grade | Meaning |
156
+ |-------|:-----:|---------|
157
+ | 93–100 | A | Fully visible, well-structured, citable |
158
+ | 90–92 | A- | Near-perfect with minor gaps |
159
+ | 80–89 | B / B+ / B- | Visible but missing optimization opportunities |
160
+ | 70–79 | C+ / C / C- | Partially visible, significant gaps |
161
+ | 60–69 | D+ / D / D- | Major issues — limited discoverability |
162
+ | 0–59 | F | Invisible or broken for this bot |
163
+
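The weighting and grading above are simple enough to sketch in shell. This is illustrative only — the real logic lives in `compute-score.sh`, and this sketch collapses the +/- sub-grades:

```shell
# Hypothetical sketch of the scoring math (not the actual compute-score.sh).
# grade: map a 0-100 score to a letter grade per the thresholds table.
grade() {
  if   [ "$1" -ge 93 ]; then echo "A"
  elif [ "$1" -ge 90 ]; then echo "A-"
  elif [ "$1" -ge 80 ]; then echo "B"
  elif [ "$1" -ge 70 ]; then echo "C"
  elif [ "$1" -ge 60 ]; then echo "D"
  else echo "F"
  fi
}

# composite: Googlebot carries 40%; GPTBot, ClaudeBot, PerplexityBot 20% each.
composite() {
  echo $(( ($1 * 40 + $2 * 20 + $3 * 20 + $4 * 20) / 100 ))
}

composite 92 78 78 78   # -> 83, grade B
```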
**The key differentiator:** bots with `rendersJavaScript: false` (GPTBot, ClaudeBot, PerplexityBot) are scored against **server HTML only**. Googlebot can be scored against the rendered DOM via the optional `diff-render.sh`. This surfaces CSR hydration issues that hide content from AI crawlers — exactly the kind of bug SEO tools don't catch because they're built around Googlebot's headless-Chrome behavior.

---

## Supported bots

| Profile | Vendor | Purpose | JS Render | Respects robots.txt |
|---------|--------|---------|:---------:|:-------------------:|
| `googlebot` | Google | Search indexing | **yes** (official) | yes |
| `gptbot` | OpenAI | Model training | no (observed) | yes |
| `oai-searchbot` | OpenAI | ChatGPT search | unknown (inferred) | yes |
| `chatgpt-user` | OpenAI | User fetches | unknown | partial (*) |
| `claudebot` | Anthropic | Model training | no (observed) | yes |
| `claude-user` | Anthropic | User fetches | unknown | yes |
| `claude-searchbot` | Anthropic | Search quality | unknown | yes |
| `perplexitybot` | Perplexity | Search indexing | no (observed) | yes |
| `perplexity-user` | Perplexity | User fetches | unknown | no (*) |

\* Officially documented as ignoring robots.txt for user-initiated fetches.

Every profile is backed by official vendor documentation where possible. See [`research/bot-profiles-verified.md`](./research/bot-profiles-verified.md) for sources and confidence levels. When a claim is `observed` or `inferred` rather than `official`, the skill output notes this transparently.

### Adding a custom bot

Drop a JSON file in `profiles/`. The skill auto-discovers all `*.json` files.

```json
{
  "id": "mybot",
  "name": "MyBot",
  "vendor": "Example Corp",
  "userAgent": "Mozilla/5.0 ... MyBot/1.0",
  "robotsTxtToken": "MyBot",
  "purpose": "search",
  "rendersJavaScript": false,
  "respectsRobotsTxt": true,
  "lastVerified": "2026-04-11"
}
```

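A new profile can be sanity-checked with `jq` before dropping it in. This is a sketch: the key list below is inferred from the example profile rather than an official schema, and `check_profile` is a hypothetical helper:

```shell
# Verify a profile JSON has the fields the pipeline reads (inferred list).
check_profile() {
  jq -e '
    has("id") and has("userAgent") and has("robotsTxtToken")
    and (.rendersJavaScript | type == "boolean")
    and (.respectsRobotsTxt | type == "boolean")
  ' "$1" >/dev/null 2>&1 && echo "profile OK: $1" || echo "profile INVALID: $1"
}

check_profile profiles/mybot.json
```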
---

## Architecture

```
crawl-sim/
├── SKILL.md              # Claude Code orchestrator skill
├── bin/install.js        # npm installer
├── profiles/             # 9 verified bot profiles (JSON)
├── scripts/
│   ├── fetch-as-bot.sh   # curl with bot UA → JSON (status/headers/body/timing)
│   ├── extract-meta.sh   # title, description, OG, headings, images
│   ├── extract-jsonld.sh # JSON-LD @type detection
│   ├── extract-links.sh  # internal/external link classification
│   ├── check-robots.sh   # robots.txt parsing per UA token
│   ├── check-llmstxt.sh  # llms.txt presence and structure
│   ├── check-sitemap.sh  # sitemap.xml URL inclusion
│   ├── diff-render.sh    # optional Playwright server-vs-rendered comparison
│   └── compute-score.sh  # aggregates all checks → per-bot + per-category scores
├── research/             # Verified bot data sources
└── docs/specs/           # Design docs
```

The shell scripts are the plumbing. The Claude Code skill is the intelligence — it reads the raw JSON, understands framework context (Next.js, Nuxt, SPAs), identifies root causes, and writes actionable recommendations.

---

## Contributing

Contributions are welcome! See [CONTRIBUTING.md](./CONTRIBUTING.md) for details on:

- Reporting bugs and requesting features
- Adding or updating bot profiles when vendor docs change
- Writing new check scripts (must be `curl` + `jq` only, must output JSON)
- Running the integration test suite
- Coding standards and commit conventions

Quick principles:

- **Keep the core dependency-free** — `curl` + `jq` only. `diff-render.sh` is the single Playwright exception.
- **Every script outputs valid JSON to stdout** and is testable against a live URL.
- **Cite sources** when adding or updating bot profiles — every behavioral claim needs a vendor doc link or a reproducible observation.

---

## Acknowledgments

- **Bot documentation** from [OpenAI](https://developers.openai.com/api/docs/bots), [Anthropic](https://privacy.claude.com), [Perplexity](https://docs.perplexity.ai/docs/resources/perplexity-crawlers), and [Google Search Central](https://developers.google.com/search/docs).
- **Prior art** in the space: [Dark Visitors](https://darkvisitors.com), [CrawlerCheck](https://crawlercheck.com), [Cloudflare Radar](https://radar.cloudflare.com).
- Built with [Claude Code](https://claude.com/claude-code).

---

## License

[MIT](./LICENSE) © 2026 BraedenBDev

Free for personal and commercial use. If `crawl-sim` helps your project, a GitHub star or a mention is always appreciated.
package/SKILL.md ADDED
---
name: crawl-sim
description: Audit a URL through the eyes of Googlebot, GPTBot, ClaudeBot, and PerplexityBot. Fetches the page as each bot, runs structural checks, compares server HTML vs JS-rendered DOM to differentiate rendering-capable bots from non-rendering ones, then scores and returns a score card + narrative audit + JSON report. Trigger when the user asks to audit a site for AI/search visibility, test how bots see a page, check if content is visible to GPTBot/ClaudeBot/Perplexity, analyze llms.txt / robots.txt / structured data, or says "/crawl-sim".
allowed-tools: Bash, Read, Write
---

# crawl-sim — Multi-Bot Visibility Audit

You are running a per-URL audit that simulates how different web crawlers see a site. You orchestrate shell scripts, interpret the raw data, and produce a three-layer output: (1) a terminal score card, (2) a prose narrative with prioritized findings, (3) a structured JSON report.

## Experience principle

**This tool should feel alive.** Before each stage of the pipeline, emit a one-sentence status line to the user in plain text (not inside a code block). The user should know what's happening without expanding tool-call details. Example cadence:

> Fetching the page as 4 bots in parallel...
>
> Extracting meta, JSON-LD, and links from each response...
>
> Checking robots.txt per bot, plus llms.txt and sitemap...
>
> Comparing server HTML vs Playwright-rendered DOM (this is what differentiates bots)...
>
> Computing scores and finalizing...

Keep status lines short, active, and specific to this URL. Never use the same sentence twice in one run.

## Usage

```
/crawl-sim <url>                             # full audit (default)
/crawl-sim <url> --bot gptbot                # single bot
/crawl-sim <url> --category structured-data  # category deep dive
/crawl-sim <url> --json                      # JSON output only (for CI)
```

## Prerequisites — check once at the start

```bash
command -v curl >/dev/null 2>&1 || { echo "ERROR: curl is required"; exit 1; }
command -v jq >/dev/null 2>&1 || { echo "ERROR: jq is required (brew install jq)"; exit 1; }
```

Locate the skill directory: typically `~/.claude/skills/crawl-sim/` or `.claude/skills/crawl-sim/`. Use `$CLAUDE_PLUGIN_ROOT` if set, otherwise find the directory containing this `SKILL.md`.

## Orchestration — five narrated stages

Split the work into **five Bash invocations**, each with a clear `description` field, and emit a plain-text status line *before* each one. Do not run the whole pipeline in one giant bash block — that makes the tool feel silent.

### Stage 1 — Fetch

Tell the user: "Fetching as Googlebot, GPTBot, ClaudeBot, and PerplexityBot..."

```bash
SKILL_DIR="$HOME/.claude/skills/crawl-sim"   # or wherever this SKILL.md lives
RUN_DIR=$(mktemp -d -t crawl-sim.XXXXXX)
URL="<user-provided-url>"
for bot in googlebot gptbot claudebot perplexitybot; do
  "$SKILL_DIR/scripts/fetch-as-bot.sh" "$URL" "$SKILL_DIR/profiles/${bot}.json" > "$RUN_DIR/fetch-${bot}.json"
done
```

Each fetch emits a `[fetch-as-bot] BotName <- URL` line to stderr that surfaces in the Bash call's output.

If `--bot <id>` was passed, use only that bot. Also optionally include the secondary profiles (`oai-searchbot`, `chatgpt-user`, `claude-user`, `claude-searchbot`, `perplexity-user`) when the user passes `--all`.

### Stage 2 — Extract HTML structure

Tell the user: "Extracting meta, JSON-LD, and links from each bot's view..."

```bash
for bot in googlebot gptbot claudebot perplexitybot; do
  jq -r '.bodyBase64' "$RUN_DIR/fetch-${bot}.json" | base64 -d > "$RUN_DIR/body-${bot}.html"
  "$SKILL_DIR/scripts/extract-meta.sh" "$RUN_DIR/body-${bot}.html" > "$RUN_DIR/meta-${bot}.json"
  "$SKILL_DIR/scripts/extract-jsonld.sh" "$RUN_DIR/body-${bot}.html" > "$RUN_DIR/jsonld-${bot}.json"
  "$SKILL_DIR/scripts/extract-links.sh" "$URL" "$RUN_DIR/body-${bot}.html" > "$RUN_DIR/links-${bot}.json"
done
```

### Stage 3 — Crawler policy checks

Tell the user: "Checking robots.txt for each bot's UA token, plus llms.txt and sitemap.xml..."

```bash
for bot in googlebot gptbot claudebot perplexitybot; do
  token=$(jq -r '.robotsTxtToken' "$SKILL_DIR/profiles/${bot}.json")
  "$SKILL_DIR/scripts/check-robots.sh" "$URL" "$token" > "$RUN_DIR/robots-${bot}.json"
done
"$SKILL_DIR/scripts/check-llmstxt.sh" "$URL" > "$RUN_DIR/llmstxt.json"
"$SKILL_DIR/scripts/check-sitemap.sh" "$URL" > "$RUN_DIR/sitemap.json"
```

### Stage 4 — Render comparison (this is the differentiator)

Tell the user something like: "Comparing server HTML vs Playwright-rendered DOM — this is how Googlebot and the AI bots get scored differently..."

```bash
if [ -x "$SKILL_DIR/scripts/diff-render.sh" ]; then
  "$SKILL_DIR/scripts/diff-render.sh" "$URL" > "$RUN_DIR/diff-render.json" \
    || echo '{"skipped":true,"reason":"diff-render failed"}' > "$RUN_DIR/diff-render.json"
fi
```

Never redirect `diff-render.sh` stderr into the output file — the narration line would corrupt the JSON.

**Why this stage matters:** the score depends on it. `compute-score.sh` uses the rendered word count for bots with `rendersJavaScript: true` (Googlebot) and applies a hydration penalty to bots with `rendersJavaScript: false` (GPTBot, ClaudeBot, PerplexityBot) proportional to how much content is invisible to them. On a site with significant client-side hydration, this is where the bot scores actually diverge. Without this stage, all non-blocked bots would score identically.

If Playwright isn't installed, `diff-render.sh` writes `{"skipped": true, "reason": "..."}` and the scoring falls back to server-HTML-only for all bots — the narrative must acknowledge this: "Per-bot differentiation was limited because JS render comparison was unavailable."

### Stage 5 — Score and aggregate

Tell the user: "Computing per-bot scores and finalizing the report..."

```bash
"$SKILL_DIR/scripts/compute-score.sh" "$RUN_DIR" > "$RUN_DIR/score.json"
cp "$RUN_DIR/score.json" ./crawl-sim-report.json
```

## Output Layer 1 — Score Card (ASCII)

Print a boxed score card to the terminal:

```
╔══════════════════════════════════════════════╗
║  crawl-sim — Bot Visibility Audit            ║
║  <URL>                                       ║
╠══════════════════════════════════════════════╣
║  Overall: <score>/100 (<grade>)              ║
╠══════════════════════════════════════════════╣
║  Googlebot      <s>  <g>  <bar>              ║
║  GPTBot         <s>  <g>  <bar>              ║
║  ClaudeBot      <s>  <g>  <bar>              ║
║  PerplexityBot  <s>  <g>  <bar>              ║
╠══════════════════════════════════════════════╣
║  By Category:                                ║
║  Accessibility       <s>  <g>                ║
║  Content Visibility  <s>  <g>                ║
║  Structured Data     <s>  <g>                ║
║  Technical Signals   <s>  <g>                ║
║  AI Readiness        <s>  <g>                ║
╚══════════════════════════════════════════════╝
```

Progress bars are 20 chars wide using `█` and `░` (each char = 5%).

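The bar maps directly from the 0–100 score. A sketch in plain POSIX shell (`bar` is a hypothetical helper name, not part of the shipped scripts):

```shell
# Render a 20-character progress bar; each █ covers 5 points of score.
bar() {
  filled=$(( $1 / 5 ))
  out=""
  i=0
  while [ "$i" -lt 20 ]; do
    if [ "$i" -lt "$filled" ]; then out="${out}█"; else out="${out}░"; fi
    i=$(( i + 1 ))
  done
  printf '%s\n' "$out"
}

bar 85   # 17 filled blocks, 3 empty
```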
## Output Layer 2 — Narrative Audit

Lead with a **Bot differentiation summary** — state up front whether the bots scored the same or differently, and why. If they scored the same, explicitly say so:

> *"All four bots scored 94/A because the server HTML is complete (2,166 words), robots.txt allows every UA token, and there was no meaningful JS hydration gap (delta 11%, below the 20% penalty threshold). On a clean site like this, crawl-sim's multi-bot angle isn't the headline finding — the category gaps are."*

If they scored differently (the interesting case):

> *"Googlebot scored 92/A because it renders JS and sees the full 3,400-word page. GPTBot, ClaudeBot, and PerplexityBot scored 78/C+ because they only see the 2,100-word server HTML — 1,300 words (38%) of content is invisible to AI crawlers, including the testimonials section and half the product cards."*

Then produce **prioritized findings** ranked by total point impact across bots:

```markdown
### 1. <Title> (−<total> pts across <bot count> bots)

**Affected:** <bot list>
**Category:** <category>
**Observed:** <what the data shows — cite counts, tag names, paths>
**Likely cause:** <if inferable from HTML/framework signals>
**Fix:** <actionable, file-path-specific if possible>
**Impact if fixed:** +<N> points on affected bot scores
```

### Interpretation rules

- **Cross-bot deltas are the headline.** Compare `visibility.effectiveWords` across bots — if Googlebot has significantly more than the AI bots, that's finding #1. The raw delta is in `visibility.missedWordsVsRendered`.
- **Confidence transparency.** If a claim depends on a bot profile's `rendersJavaScript: false` at `observed` confidence (not `official`), note it: *"Based on observed behavior, not official documentation."*
- **Framework detection.** Scan the HTML body for signals: `<meta name="next-head-count">` or `_next/static` → Next.js (Pages Router or App Router respectively), `<div id="__nuxt">` → Nuxt, `<div id="app">` with thin content → SPA (Vue/React CSR), `<!--$-->` placeholder tags → React 18 Suspense. Use these to tailor fix recommendations.
- **No speculation beyond the data.** If server HTML has 0 `<a>` tags inside a component, say "component not present in server HTML" — not "JavaScript hydration failed" unless the diff-render data proves it.
- **Known extractor limitations.** The bash meta extractor sometimes reports `h1Text: null` even when `h1.count: 1` — that happens when the H1 contains nested tags (`<br>`, `<span>`, `<svg>`). The count is still correct. Don't flag this as a site bug — it's tracked in GitHub issue #4.
- **Per-bot quirks to surface:**
  - Googlebot: renders JS. If `diff-render.sh` was skipped, note that comparison was unavailable and recommend installing Playwright.
  - GPTBot / ClaudeBot / PerplexityBot: `rendersJavaScript: false` at observed confidence — flag any server-vs-rendered delta as invisible-to-AI content.
  - `chatgpt-user` / `perplexity-user`: officially ignore robots.txt for user-initiated fetches. Blocking these via robots.txt has no effect.
  - PerplexityBot: third-party reports of stealth/undeclared crawling. Mention if relevant, don't assert.

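The framework signals in the rules above reduce to a few `grep` checks. A sketch (`detect_framework` is a hypothetical helper; real pages often need more patterns, and the `id="app"` match in particular should be paired with a thin-content test before calling the page a SPA):

```shell
# Rough framework fingerprinting from a saved server-HTML body (sketch).
detect_framework() {
  if grep -q 'name="next-head-count"' "$1"; then echo "nextjs-pages"
  elif grep -q '_next/static' "$1"; then echo "nextjs"
  elif grep -q 'id="__nuxt"' "$1"; then echo "nuxt"
  elif grep -q '<!--\$-->' "$1"; then echo "react-suspense"
  elif grep -q 'id="app"' "$1"; then echo "spa"
  else echo "unknown"
  fi
}
```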
After findings, write a **Summary** paragraph: what's working well, biggest wins, confidence caveats. Keep it short — two to three sentences.

## Output Layer 3 — JSON Report

`./crawl-sim-report.json` is written in Stage 5. The schema is stable for diffing across runs. Tell the user the report path at the end and also print the `RUN_DIR` so they can inspect intermediate JSON.

## Error Handling

- If any script fails, include the failure in the narrative — don't silently skip.
- If the target URL returns non-200, report immediately and still run robots.txt / sitemap / llms.txt checks (they don't require the page to load).
- If `jq` or `curl` is missing, exit with install instructions.
- If `diff-render.sh` skips, the narrative must note that per-bot differentiation is reduced.

## Cleanup

`$RUN_DIR` is small and informative — leave it in place and print the path. The user may want to inspect the raw JSON for any of the 23+ intermediate files.
package/bin/install.js ADDED
#!/usr/bin/env node

// crawl-sim installer
// Usage:
//   npx @braedenbuilds/crawl-sim install              → ~/.claude/skills/crawl-sim/
//   npx @braedenbuilds/crawl-sim install --project    → ./.claude/skills/crawl-sim/
//   npx @braedenbuilds/crawl-sim install --dir <path> → <path>/crawl-sim/

'use strict';

const fs = require('fs');
const path = require('path');
const os = require('os');
const { execFileSync } = require('child_process');

const SOURCE_DIR = path.resolve(__dirname, '..');
const SKILL_FILES = ['SKILL.md'];
const SKILL_DIRS = ['profiles', 'scripts'];

function parseArgs(argv) {
  const args = { command: null, project: false, dir: null, help: false };
  for (let i = 0; i < argv.length; i++) {
    const a = argv[i];
    if (a === 'install' || a === 'uninstall') args.command = a;
    else if (a === '--project') args.project = true;
    else if (a === '--dir') args.dir = argv[++i];
    else if (a === '-h' || a === '--help') args.help = true;
  }
  return args;
}

function printHelp() {
  console.log(`
crawl-sim — Multi-bot visibility audit for Claude Code

Usage:
  npx @braedenbuilds/crawl-sim install              Install to ~/.claude/skills/crawl-sim/
  npx @braedenbuilds/crawl-sim install --project    Install to ./.claude/skills/crawl-sim/
  npx @braedenbuilds/crawl-sim install --dir <path> Install to <path>/crawl-sim/

After installing, invoke in Claude Code with: /crawl-sim <url>
`);
}

function resolveTarget(args) {
  if (args.dir) return path.resolve(args.dir, 'crawl-sim');
  if (args.project) return path.resolve(process.cwd(), '.claude', 'skills', 'crawl-sim');
  return path.resolve(os.homedir(), '.claude', 'skills', 'crawl-sim');
}

function checkPrereq(cmd, installHint) {
  try {
    execFileSync(cmd, ['--version'], { stdio: 'ignore' });
    return true;
  } catch {
    console.warn(`  ! ${cmd} not found — ${installHint}`);
    return false;
  }
}

function copyRecursive(src, dest) {
  const stat = fs.statSync(src);
  if (stat.isDirectory()) {
    fs.mkdirSync(dest, { recursive: true });
    for (const entry of fs.readdirSync(src)) {
      copyRecursive(path.join(src, entry), path.join(dest, entry));
    }
  } else {
    fs.copyFileSync(src, dest);
  }
}

function install(target) {
  console.log(`Installing crawl-sim to: ${target}`);

  fs.mkdirSync(target, { recursive: true });

  for (const file of SKILL_FILES) {
    const src = path.join(SOURCE_DIR, file);
    const dest = path.join(target, file);
    if (fs.existsSync(src)) {
      fs.copyFileSync(src, dest);
      console.log(`  ✓ ${file}`);
    } else {
      console.error(`  ✗ missing source: ${file}`);
      process.exit(1);
    }
  }

  for (const dir of SKILL_DIRS) {
    const src = path.join(SOURCE_DIR, dir);
    const dest = path.join(target, dir);
    if (fs.existsSync(src)) {
      if (fs.existsSync(dest)) {
        fs.rmSync(dest, { recursive: true, force: true });
      }
      copyRecursive(src, dest);
      console.log(`  ✓ ${dir}/`);
    } else {
      console.error(`  ✗ missing source: ${dir}/`);
      process.exit(1);
    }
  }

  // Make scripts executable
  const scriptsDir = path.join(target, 'scripts');
  for (const script of fs.readdirSync(scriptsDir)) {
    if (script.endsWith('.sh')) {
      fs.chmodSync(path.join(scriptsDir, script), 0o755);
    }
  }
  console.log(`  ✓ scripts chmod +x`);

  // Prerequisite check
  console.log('\nPrerequisites:');
  const hasCurl = checkPrereq('curl', 'pre-installed on macOS/Linux');
  const hasJq = checkPrereq('jq', 'install with: brew install jq (or: apt install jq)');
  let hasPlaywright = false;
  try {
    execFileSync('npx', ['playwright', '--version'], { stdio: 'ignore' });
    hasPlaywright = true;
  } catch {
    console.warn(`  ! playwright not found — optional, needed only for diff-render.sh`);
    console.warn(`    install with: npx playwright install chromium`);
  }

  if (!hasCurl || !hasJq) {
    console.error('\nMissing required prerequisites. Install them and re-run.');
    process.exit(1);
  }

  console.log(`\n✓ crawl-sim installed to: ${target}`);
  console.log('\nUsage:');
  console.log('  In Claude Code:  /crawl-sim https://yoursite.com');
  console.log('  Direct shell:    ' + path.join(target, 'scripts', 'fetch-as-bot.sh') + ' <url> ' + path.join(target, 'profiles', 'gptbot.json'));
}

function main() {
  const args = parseArgs(process.argv.slice(2));

  if (args.help || !args.command) {
    printHelp();
    process.exit(args.help ? 0 : 1);
  }

  if (args.command === 'install') {
    install(resolveTarget(args));
  } else if (args.command === 'uninstall') {
    const target = resolveTarget(args);
    if (fs.existsSync(target)) {
      fs.rmSync(target, { recursive: true, force: true });
      console.log(`✓ removed ${target}`);
    } else {
      console.log(`Not installed at ${target}`);
    }
  }
}

main();