@braedenbuilds/crawl-sim 1.0.3 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -60,7 +60,7 @@ Then in Claude Code:
 
 Claude runs the full pipeline, interprets the results, and returns a score card plus prioritized findings.
 
-> **Why `npm install -g` instead of `npx`?** Recent versions of npx have a [known issue](https://github.com/npm/cli/issues) linking bins for scoped single-bin packages in ephemeral installs. A persistent global install avoids the problem entirely. If you want a one-shot install without touching global state, the tarball URL form also works: `npx https://registry.npmjs.org/@braedenbuilds/crawl-sim/-/crawl-sim-1.0.3.tgz install`.
+> **Why `npm install -g` instead of `npx`?** Recent versions of npx have a known issue linking bins for scoped single-bin packages in ephemeral installs. A persistent global install avoids the problem entirely. The git clone path below is the zero-npm fallback.
 
 ### As a standalone CLI
 
@@ -88,10 +88,13 @@ git clone https://github.com/BraedenBDev/crawl-sim.git ~/.claude/skills/crawl-si
 
 - **Multi-bot simulation.** Nine verified bot profiles covering Google, OpenAI, Anthropic, and Perplexity — including the bot-vs-user-agent distinction (e.g., `ChatGPT-User` officially ignores robots.txt; `claude-user` respects it).
 - **Quantified scoring.** Each bot is graded 0–100 across five categories with letter grades A through F, plus a weighted composite score.
+- **Page-type-aware rubric.** The structured-data category derives the page type from the URL (`root` / `detail` / `archive` / `faq` / `about` / `contact` / `generic`) and applies a per-type schema rubric. A homepage shipping `Organization` + `WebSite` scores 100 without being penalized for not having `BreadcrumbList` or `FAQPage`. Override the detection with `--page-type <type>` when the URL heuristic picks wrong.
+- **Self-explaining scores.** Every `structuredData` block in the JSON report ships `pageType`, `expected`, `optional`, `forbidden`, `present`, `missing`, `extras`, `violations`, `calculation`, and `notes` — so the narrative layer reads the scorer's reasoning directly instead of guessing what was penalized.
 - **Agent-native interpretation.** The Claude Code skill reads raw data, identifies root causes (framework signals, hydration boundaries, soft-404s), and recommends specific fixes.
 - **Three-layer output.** Terminal score card, prose narrative, and structured JSON — so humans and CI both get what they need.
 - **Confidence transparency.** Every claim is tagged `official`, `observed`, or `inferred`. The skill notes when recommendations depend on observed-but-undocumented behavior.
 - **Shell-native core.** All checks use only `curl` + `jq`. No Node, no Python, no Docker. Each script is independently invokable.
+- **Regression-tested.** `npm test` runs a 37-assertion scoring suite against synthetic fixtures, covering URL→page-type detection, per-type rubrics, missing/forbidden schema flagging, and golden non-structured output.
 - **Extensible.** Drop a new profile JSON into `profiles/` and it's auto-discovered.
 
 ---
@@ -107,6 +110,8 @@ git clone https://github.com/BraedenBDev/crawl-sim.git ~/.claude/skills/crawl-si
 /crawl-sim https://yoursite.com --json   # JSON only (for CI)
 ```
 
+The skill auto-detects page type from the URL. Pass `--page-type root|detail|archive|faq|about|contact|generic` to the underlying `compute-score.sh` when the URL heuristic picks the wrong type (e.g., a homepage at `/en/` that URL-parses as `generic`).
+
 Output is a three-layer report:
 
 1. **Score card** — ASCII overview with per-bot and per-category scores.
@@ -126,6 +131,7 @@ Every script is standalone and outputs JSON to stdout:
 ./scripts/check-llmstxt.sh https://yoursite.com
 ./scripts/check-sitemap.sh https://yoursite.com
 ./scripts/compute-score.sh /tmp/audit-data/
+./scripts/compute-score.sh --page-type root /tmp/audit-data/   # override URL heuristic
 ```
 
 ### CI/CD
@@ -148,7 +154,7 @@ Each bot is scored 0–100 across five weighted categories:
 |----------|:------:|----------|
 | **Accessibility** | 25 | robots.txt allows, HTTP 200, response time |
 | **Content Visibility** | 30 | server HTML word count, heading structure, internal links, image alt text |
-| **Structured Data** | 20 | JSON-LD presence, validity, page-appropriate `@type` |
+| **Structured Data** | 20 | JSON-LD presence, validity, page-type-aware `@type` rubric (root / detail / archive / faq / about / contact / generic) |
 | **Technical Signals** | 15 | title / description / canonical / OG meta, sitemap inclusion |
 | **AI Readiness** | 10 | `llms.txt` structure, content citability |
 
@@ -218,6 +224,7 @@ crawl-sim/
 ├── bin/install.js           # npm installer
 ├── profiles/                # 9 verified bot profiles (JSON)
 ├── scripts/
+│   ├── _lib.sh              # shared helpers (URL parsing, page-type detection)
 │   ├── fetch-as-bot.sh      # curl with bot UA → JSON (status/headers/body/timing)
 │   ├── extract-meta.sh      # title, description, OG, headings, images
 │   ├── extract-jsonld.sh    # JSON-LD @type detection
@@ -227,8 +234,11 @@ crawl-sim/
 │   ├── check-sitemap.sh     # sitemap.xml URL inclusion
 │   ├── diff-render.sh       # optional Playwright server-vs-rendered comparison
 │   └── compute-score.sh     # aggregates all checks → per-bot + per-category scores
+├── test/
+│   ├── run-scoring-tests.sh # 37-assertion bash harness (run with `npm test`)
+│   └── fixtures/            # synthetic RUN_DIR fixtures for regression tests
 ├── research/                # Verified bot data sources
-└── docs/specs/              # Design docs
+└── docs/                    # Design docs, issues, accuracy handoffs
 ```
 
 The shell scripts are the plumbing. The Claude Code skill is the intelligence — it reads the raw JSON, understands framework context (Next.js, Nuxt, SPAs), identifies root causes, and writes actionable recommendations.
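Since the report is stable JSON, the new self-explaining `structuredData` block can gate CI directly. A minimal sketch of that consumption pattern — the report fragment below is synthetic (the field names match this diff; the bot id and values are made up):

```shell
# Fail a CI step when the scorer reports missing expected schemas, and echo
# the scorer's own notes instead of re-deriving the reasoning.
# The report fragment is a hypothetical stand-in for compute-score.sh output.
cat > /tmp/crawl-sim-report.json <<'EOF'
{
  "pageType": "detail",
  "bots": {
    "gptbot": {
      "categories": {
        "structuredData": {
          "score": 50,
          "pageType": "detail",
          "missing": ["BreadcrumbList"],
          "notes": "Missing expected schemas for pageType=detail: BreadcrumbList. Add these to raise the score."
        }
      }
    }
  }
}
EOF
missing=$(jq -r '.bots.gptbot.categories.structuredData.missing | join(" ")' /tmp/crawl-sim-report.json)
if [ -n "$missing" ]; then
  echo "structured-data gap for gptbot: $missing"
  jq -r '.bots.gptbot.categories.structuredData.notes' /tmp/crawl-sim-report.json
fi
```

Reading `missing` and `notes` straight from the report keeps CI in sync with whatever rubric the scorer actually applied.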
package/SKILL.md CHANGED
@@ -115,6 +115,14 @@ Tell the user: "Computing per-bot scores and finalizing the report..."
 cp "$RUN_DIR/score.json" ./crawl-sim-report.json
 ```
 
+**Page-type awareness.** `compute-score.sh` derives a page type from the target URL (`root` / `detail` / `archive` / `faq` / `about` / `contact` / `generic`) and picks a schema rubric accordingly. Root pages are expected to ship `Organization` + `WebSite` — penalizing them for missing `BreadcrumbList` or `FAQPage` would be wrong, so the scorer doesn't. If the URL heuristic picks the wrong type (e.g., a homepage at `/en/` that URL-parses as generic), pass `--page-type <type>`:
+
+```bash
+"$SKILL_DIR/scripts/compute-score.sh" --page-type root "$RUN_DIR" > "$RUN_DIR/score.json"
+```
+
+Valid values: `root`, `detail`, `archive`, `faq`, `about`, `contact`, `generic`. The detected (or overridden) page type is exposed on `score.pageType`, and `score.pageTypeOverridden` flips `true` when `--page-type` was used.
+
 ## Output Layer 1 — Score Card (ASCII)
 
 Print a boxed score card to the terminal:
@@ -168,6 +176,7 @@ Then produce **prioritized findings** ranked by total point impact across bots:
 ### Interpretation rules
 
 - **Cross-bot deltas are the headline.** Compare `visibility.effectiveWords` across bots — if Googlebot has significantly more than the AI bots, that's finding #1. The raw delta is in `visibility.missedWordsVsRendered`.
+- **Trust the structuredData rubric.** Every `bots.<bot>.categories.structuredData` block now carries `pageType`, `expected`, `optional`, `forbidden`, `present`, `missing`, `extras`, `violations`, `calculation`, and `notes`. Read `missing` and `violations` directly — never guess what the scorer was penalizing for. If `notes` says the page scores 100 with no action needed, that IS the finding; don't invent fixes. If the rubric looks wrong for this specific page (e.g., a homepage detected as `generic` because the URL ends in `/en/`), rerun with `--page-type <correct-type>` instead of arguing with the score. Never recommend adding a schema that already appears in `present` or `extras`.
 - **Confidence transparency.** If a claim depends on a bot profile's `rendersJavaScript: false` at `observed` confidence (not `official`), note it: *"Based on observed behavior, not official documentation."*
 - **Framework detection.** Scan the HTML body for signals: `<meta name="next-head-count">` or `_next/static` → Next.js (Pages Router or App Router respectively), `<div id="__nuxt">` → Nuxt, `<div id="app">` with thin content → SPA (Vue/React CSR), `<!--$-->` placeholder tags → React 18 Suspense. Use these to tailor fix recommendations.
 - **No speculation beyond the data.** If server HTML has 0 `<a>` tags inside a component, say "component not present in server HTML" — not "JavaScript hydration failed" unless the diff-render data proves it.
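The `calculation` string follows a fixed formula. As a worked instance (values hypothetical): a `detail` page shipping `Article` but missing `BreadcrumbList`, with one invalid JSON-LD block, scores base 1/2 → 50, minus a 5-point validity penalty → 45. The arithmetic from `compute-score.sh` reproduces as:

```shell
# Structured-data formula for one hypothetical detail page:
# 1 of 2 expected schemas present, no optional bonus, no forbidden
# schemas, one invalid JSON-LD block (5-point validity penalty).
present_expected=1
expected_total=2
bonus=0
forbid_penalty=0
valid_penalty=5
# base = round(present/expected * 100), as in the awk expression in the diff
base=$(awk -v h="$present_expected" -v t="$expected_total" \
  'BEGIN { if (t == 0) print 0; else printf "%d", (h / t) * 100 + 0.5 }')
structured=$((base + bonus - forbid_penalty - valid_penalty))
[ "$structured" -gt 100 ] && structured=100
[ "$structured" -lt 0 ] && structured=0
printf 'base: %d/%d expected present = %d; -%d validity penalty; clamp [0,100] = %d\n' \
  "$present_expected" "$expected_total" "$base" "$valid_penalty" "$structured"
# → base: 1/2 expected present = 50; -5 validity penalty; clamp [0,100] = 45
```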
package/package.json CHANGED
@@ -1,10 +1,13 @@
 {
   "name": "@braedenbuilds/crawl-sim",
-  "version": "1.0.3",
+  "version": "1.1.0",
   "description": "Agent-native multi-bot web crawler simulator. See your site through the eyes of Googlebot, GPTBot, ClaudeBot, and PerplexityBot.",
   "bin": {
     "crawl-sim": "bin/install.js"
   },
+  "scripts": {
+    "test": "./test/run-scoring-tests.sh"
+  },
   "keywords": [
     "seo",
     "crawler",
package/scripts/_lib.sh CHANGED
@@ -41,6 +41,36 @@ count_words() {
   sed 's/<[^>]*>//g' "$1" | tr -s '[:space:]' '\n' | grep -c '[a-zA-Z0-9]' || true
 }
 
+# Detect the structural page type of a URL based on its path.
+# Returns one of: root, detail, archive, faq, about, contact, generic.
+#
+# Used by compute-score.sh to pick a schema rubric, but also exposed here
+# so other tooling (narrative layer, planned multi-URL mode) can classify
+# URLs consistently without re-implementing the heuristic.
+page_type_for_url() {
+  local url="$1"
+  local path
+  path=$(path_from_url "$url" | sed 's#[?#].*##')
+  if [ "$path" = "/" ]; then
+    echo "root"
+    return
+  fi
+  local trimmed lower
+  trimmed=$(printf '%s' "$path" | sed 's#^/##' | sed 's#/$##')
+  lower=$(printf '%s' "$trimmed" | tr '[:upper:]' '[:lower:]')
+  case "$lower" in
+    "") echo "root" ;;
+    work|journal|blog|articles|news|careers|projects|case-studies|cases)
+      echo "archive" ;;
+    work/*|articles/*|journal/*|blog/*|news/*|case-studies/*|cases/*|case/*|careers/*|projects/*)
+      echo "detail" ;;
+    *faq*) echo "faq" ;;
+    *about*|*team*|*purpose*|*who-we-are*) echo "about" ;;
+    *contact*) echo "contact" ;;
+    *) echo "generic" ;;
+  esac
+}
+
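The heuristic added above can be exercised standalone. A minimal sketch — `path_from_url` is not shown in this diff, so the stub below is an assumption about its behavior (strip scheme and host, keep the path); the function body itself is verbatim from this version:

```shell
# Stub for _lib.sh's path_from_url (an assumption; the real helper is
# elsewhere in _lib.sh): drop "scheme://host", keep everything after.
path_from_url() { printf '%s' "$1" | sed -E 's#^[a-zA-Z]+://[^/]*##'; }

page_type_for_url() {
  local url="$1"
  local path
  path=$(path_from_url "$url" | sed 's#[?#].*##')
  if [ "$path" = "/" ]; then
    echo "root"
    return
  fi
  local trimmed lower
  trimmed=$(printf '%s' "$path" | sed 's#^/##' | sed 's#/$##')
  lower=$(printf '%s' "$trimmed" | tr '[:upper:]' '[:lower:]')
  case "$lower" in
    "") echo "root" ;;
    work|journal|blog|articles|news|careers|projects|case-studies|cases)
      echo "archive" ;;
    work/*|articles/*|journal/*|blog/*|news/*|case-studies/*|cases/*|case/*|careers/*|projects/*)
      echo "detail" ;;
    *faq*) echo "faq" ;;
    *about*|*team*|*purpose*|*who-we-are*) echo "about" ;;
    *contact*) echo "contact" ;;
    *) echo "generic" ;;
  esac
}

page_type_for_url "https://example.com/"              # → root
page_type_for_url "https://example.com/blog"          # → archive
page_type_for_url "https://example.com/blog/my-post"  # → detail
page_type_for_url "https://example.com/en/"           # → generic (the case --page-type exists for)
```

The `/en/` result is the misclassification the README and SKILL call out as the motivating case for the `--page-type` override.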
 # Fetch a URL to a local file and return the HTTP status code on stdout.
 # Usage: status=$(fetch_to_file <url> <output-file> [timeout-seconds])
 fetch_to_file() {
package/scripts/compute-score.sh CHANGED
@@ -2,7 +2,7 @@
 set -eu
 
 # compute-score.sh — Aggregate check outputs into per-bot + per-category scores
-# Usage: compute-score.sh <results-dir>
+# Usage: compute-score.sh [--page-type <type>] <results-dir>
 # Output: JSON to stdout
 #
 # Expected filenames in <results-dir>:
@@ -14,8 +14,56 @@ set -eu
 # llmstxt.json — check-llmstxt.sh output (bot-independent)
 # sitemap.json — check-sitemap.sh output (bot-independent)
 # diff-render.json — diff-render.sh output (optional, Googlebot only)
+#
+# The --page-type flag overrides URL-based page-type detection. Valid values:
+# root, detail, archive, faq, about, contact, generic.
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=_lib.sh
+. "$SCRIPT_DIR/_lib.sh"
+
+PAGE_TYPE_OVERRIDE=""
+while [ $# -gt 0 ]; do
+  case "$1" in
+    --page-type)
+      [ $# -ge 2 ] || { echo "--page-type requires a value" >&2; exit 2; }
+      PAGE_TYPE_OVERRIDE="$2"
+      shift 2
+      ;;
+    --page-type=*)
+      PAGE_TYPE_OVERRIDE="${1#--page-type=}"
+      shift
+      ;;
+    -h|--help)
+      echo "Usage: compute-score.sh [--page-type <type>] <results-dir>"
+      exit 0
+      ;;
+    --)
+      shift
+      break
+      ;;
+    -*)
+      echo "Unknown flag: $1" >&2
+      exit 2
+      ;;
+    *)
+      break
+      ;;
+  esac
+done
+
+RESULTS_DIR="${1:?Usage: compute-score.sh [--page-type <type>] <results-dir>}"
+
+if [ -n "$PAGE_TYPE_OVERRIDE" ]; then
+  case "$PAGE_TYPE_OVERRIDE" in
+    root|detail|archive|faq|about|contact|generic) ;;
+    *)
+      echo "Error: invalid --page-type '$PAGE_TYPE_OVERRIDE' (valid: root, detail, archive, faq, about, contact, generic)" >&2
+      exit 2
+      ;;
+  esac
+fi
 
-RESULTS_DIR="${1:?Usage: compute-score.sh <results-dir>}"
 printf '[compute-score] aggregating %s\n' "$RESULTS_DIR" >&2
 
 if [ ! -d "$RESULTS_DIR" ]; then
@@ -31,7 +79,6 @@ W_TECHNICAL=15
 W_AI=10
 
 # Overall composite weights (per bot)
-# Default: Googlebot 40, GPTBot 20, ClaudeBot 20, PerplexityBot 20
 overall_weight() {
   case "$1" in
     googlebot) echo 40 ;;
@@ -42,7 +89,6 @@ overall_weight() {
   esac
 }
 
-# Grade from score (0-100)
 grade_for() {
   local s=$1
   if [ "$s" -ge 93 ]; then echo "A"
@@ -60,7 +106,85 @@ grade_for() {
   fi
 }
 
-# Read a jq value from a file with a default fallback
+# Rubric: expected schema types per page type.
+rubric_expected() {
+  case "$1" in
+    root)    echo "Organization WebSite" ;;
+    detail)  echo "Article BreadcrumbList" ;;
+    archive) echo "CollectionPage ItemList BreadcrumbList" ;;
+    faq)     echo "FAQPage BreadcrumbList" ;;
+    about)   echo "AboutPage BreadcrumbList Organization" ;;
+    contact) echo "ContactPage BreadcrumbList" ;;
+    *)       echo "WebPage BreadcrumbList" ;;
+  esac
+}
+
+rubric_optional() {
+  case "$1" in
+    root)    echo "ProfessionalService LocalBusiness" ;;
+    detail)  echo "NewsArticle ImageObject Person" ;;
+    archive) echo "" ;;
+    faq)     echo "WebPage" ;;
+    about)   echo "Person" ;;
+    contact) echo "PostalAddress" ;;
+    *)       echo "" ;;
+  esac
+}
+
+rubric_forbidden() {
+  case "$1" in
+    root)    echo "BreadcrumbList Article FAQPage" ;;
+    detail)  echo "CollectionPage ItemList" ;;
+    archive) echo "Article Product" ;;
+    faq)     echo "Article CollectionPage" ;;
+    about)   echo "Article Product" ;;
+    contact) echo "Article Product" ;;
+    *)       echo "" ;;
+  esac
+}
+
+list_contains() {
+  local needle="$1"
+  shift
+  local item
+  for item in "$@"; do
+    [ "$item" = "$needle" ] && return 0
+  done
+  return 1
+}
+
+list_count() {
+  # shellcheck disable=SC2086
+  set -- $1
+  echo "$#"
+}
+
+list_intersect() {
+  local a="$1" b="$2"
+  local out="" item
+  # shellcheck disable=SC2086
+  for item in $a; do
+    # shellcheck disable=SC2086
+    if list_contains "$item" $b; then
+      out="$out $item"
+    fi
+  done
+  printf '%s' "${out# }"
+}
+
+list_diff() {
+  local a="$1" b="$2"
+  local out="" item
+  # shellcheck disable=SC2086
+  for item in $a; do
+    # shellcheck disable=SC2086
+    if ! list_contains "$item" $b; then
+      out="$out $item"
+    fi
+  done
+  printf '%s' "${out# }"
+}
+
 jget() {
   local file="$1"
   local query="$2"
@@ -75,7 +199,6 @@ jget() {
 jget_num() {
   local v
   v=$(jget "$1" "$2" "0")
-  # Replace "null" or non-numeric with 0
   if ! printf '%s' "$v" | grep -qE '^-?[0-9]+(\.[0-9]+)?$'; then
     echo "0"
   else
@@ -90,8 +213,10 @@ jget_bool() {
 }
 
 BOTS=""
+FIRST_FETCH=""
 for f in "$RESULTS_DIR"/fetch-*.json; do
   [ -f "$f" ] || continue
+  [ -z "$FIRST_FETCH" ] && FIRST_FETCH="$f"
   bot_id=$(basename "$f" .json | sed 's/^fetch-//')
   BOTS="$BOTS $bot_id"
 done
@@ -105,14 +230,10 @@ LLMSTXT_FILE="$RESULTS_DIR/llmstxt.json"
 SITEMAP_FILE="$RESULTS_DIR/sitemap.json"
 DIFF_RENDER_FILE="$RESULTS_DIR/diff-render.json"
 
-# Load Playwright render-delta data once (used to differentiate JS-rendering
-# bots from non-rendering ones). If the comparison was skipped or missing,
-# all bots score against server HTML only.
 DIFF_AVAILABLE=false
 DIFF_RENDERED_WORDS=0
 DIFF_DELTA_PCT=0
 if [ -f "$DIFF_RENDER_FILE" ]; then
-  # Explicit null check — `.skipped // true` would treat real false as null.
   DIFF_SKIPPED=$(jq -r '.skipped | if . == null then "true" else tostring end' "$DIFF_RENDER_FILE" 2>/dev/null || echo "true")
   if [ "$DIFF_SKIPPED" = "false" ]; then
     DIFF_AVAILABLE=true
@@ -121,9 +242,22 @@ if [ -f "$DIFF_RENDER_FILE" ]; then
   fi
 fi
 
+# Resolve page type once from the first fetch file's URL, unless overridden.
+TARGET_URL=$(jget "$FIRST_FETCH" '.url' "")
+if [ -n "$PAGE_TYPE_OVERRIDE" ]; then
+  PAGE_TYPE="$PAGE_TYPE_OVERRIDE"
+else
+  PAGE_TYPE=$(page_type_for_url "$TARGET_URL")
+fi
+printf '[compute-score] page type: %s (url: %s)\n' "$PAGE_TYPE" "$TARGET_URL" >&2
+
+RUBRIC_EXPECTED="$(rubric_expected "$PAGE_TYPE")"
+RUBRIC_OPTIONAL="$(rubric_optional "$PAGE_TYPE")"
+RUBRIC_FORBIDDEN="$(rubric_forbidden "$PAGE_TYPE")"
+EXPECTED_COUNT=$(list_count "$RUBRIC_EXPECTED")
+
 BOTS_JSON="{}"
 
-# Accumulators for per-category averages (across bots)
 CAT_ACCESSIBILITY_SUM=0
 CAT_CONTENT_SUM=0
 CAT_STRUCTURED_SUM=0
@@ -131,7 +265,6 @@ CAT_TECHNICAL_SUM=0
 CAT_AI_SUM=0
 CAT_N=0
 
-# Accumulators for overall weighted composite
 OVERALL_WEIGHTED_SUM=0
 OVERALL_WEIGHT_TOTAL=0
 
@@ -146,17 +279,10 @@ for bot_id in $BOTS; do
   STATUS=$(jget_num "$FETCH" '.status')
   TOTAL_TIME=$(jget_num "$FETCH" '.timing.total')
   SERVER_WORD_COUNT=$(jget_num "$FETCH" '.wordCount')
-  # Read with explicit null fallback — jq's `//` is unsafe here because it
-  # treats boolean false as falsy, which is exactly the value we need to see.
   RENDERS_JS=$(jq -r '.bot.rendersJavaScript | if . == null then "unknown" else tostring end' "$FETCH" 2>/dev/null || echo "unknown")
 
   ROBOTS_ALLOWED=$(jget_bool "$ROBOTS" '.allowed')
 
-  # Effective word count depends on JS rendering capability:
-  #   - true (e.g. Googlebot) + diff-render data → rendered DOM word count
-  #   - false (AI training/search bots, observed) → server HTML only, with
-  #     penalty proportional to the rendering delta
-  #   - unknown → conservative: server HTML (same as false but no penalty)
   EFFECTIVE_WORD_COUNT=$SERVER_WORD_COUNT
   HYDRATION_PENALTY=0
   MISSED_WORDS=0
@@ -164,11 +290,8 @@ for bot_id in $BOTS; do
   if [ "$RENDERS_JS" = "true" ]; then
     EFFECTIVE_WORD_COUNT=$DIFF_RENDERED_WORDS
   elif [ "$RENDERS_JS" = "false" ]; then
-    # Absolute-value delta: if rendered DOM has materially more than server,
-    # AI bots are missing that content.
     ABS_DELTA=$(awk -v d="$DIFF_DELTA_PCT" 'BEGIN { printf "%d", (d < 0 ? -d : d) + 0.5 }')
     if [ "$ABS_DELTA" -gt 5 ]; then
-      # Scale penalty: 5% delta = 0, 10% = 5, 20%+ = 15 (cap)
       HYDRATION_PENALTY=$(awk -v d="$ABS_DELTA" 'BEGIN {
         p = (d - 5)
         if (p > 15) p = 15
@@ -182,11 +305,8 @@ for bot_id in $BOTS; do
 
   # --- Category 1: Accessibility (0-100) ---
   ACC=0
-  # robots.txt allows: 40
   [ "$ROBOTS_ALLOWED" = "true" ] && ACC=$((ACC + 40))
-  # HTTP 200: 40
   [ "$STATUS" = "200" ] && ACC=$((ACC + 40))
-  # Response time: <2s = 20, <5s = 10, else 0
   TIME_SCORE=$(awk -v t="$TOTAL_TIME" 'BEGIN { if (t < 2) print 20; else if (t < 5) print 10; else print 0 }')
   ACC=$((ACC + TIME_SCORE))
 
@@ -216,34 +336,106 @@ for bot_id in $BOTS; do
     CONTENT=$((CONTENT + ALT_SCORE))
   fi
 
-  # Apply hydration penalty for non-rendering bots that are missing content
   CONTENT=$((CONTENT - HYDRATION_PENALTY))
   [ $CONTENT -lt 0 ] && CONTENT=0
 
   # --- Category 3: Structured Data (0-100) ---
-  STRUCTURED=0
   JSONLD_COUNT=$(jget_num "$JSONLD" '.blockCount')
   JSONLD_VALID=$(jget_num "$JSONLD" '.validCount')
   JSONLD_INVALID=$(jget_num "$JSONLD" '.invalidCount')
-  HAS_ORG=$(jget_bool "$JSONLD" '.flags.hasOrganization')
-  HAS_WEBSITE=$(jget_bool "$JSONLD" '.flags.hasWebSite')
-  HAS_BREADCRUMB=$(jget_bool "$JSONLD" '.flags.hasBreadcrumbList')
-  HAS_ARTICLE=$(jget_bool "$JSONLD" '.flags.hasArticle')
-  HAS_PRODUCT=$(jget_bool "$JSONLD" '.flags.hasProduct')
-  HAS_FAQ=$(jget_bool "$JSONLD" '.flags.hasFAQPage')
-
-  [ "$JSONLD_COUNT" -ge 1 ] && STRUCTURED=$((STRUCTURED + 30))
-  if [ "$JSONLD_COUNT" -ge 1 ] && [ "$JSONLD_INVALID" -eq 0 ]; then
-    STRUCTURED=$((STRUCTURED + 20))
+
+  if [ -f "$JSONLD" ]; then
+    PRESENT_TYPES=$(jq -r '.types[]? // empty' "$JSONLD" 2>/dev/null | awk 'NF && !seen[$0]++' | tr '\n' ' ')
+    PRESENT_TYPES=${PRESENT_TYPES% }
+  else
+    PRESENT_TYPES=""
   fi
-  if [ "$HAS_ORG" = "true" ] || [ "$HAS_WEBSITE" = "true" ]; then
-    STRUCTURED=$((STRUCTURED + 20))
+
+  PRESENT_EXPECTED=$(list_intersect "$RUBRIC_EXPECTED" "$PRESENT_TYPES")
+  PRESENT_OPTIONAL=$(list_intersect "$RUBRIC_OPTIONAL" "$PRESENT_TYPES")
+  PRESENT_FORBIDDEN=$(list_intersect "$RUBRIC_FORBIDDEN" "$PRESENT_TYPES")
+  MISSING_EXPECTED=$(list_diff "$RUBRIC_EXPECTED" "$PRESENT_TYPES")
+  RUBRIC_KNOWN="$RUBRIC_EXPECTED $RUBRIC_OPTIONAL $RUBRIC_FORBIDDEN"
+  EXTRAS=$(list_diff "$PRESENT_TYPES" "$RUBRIC_KNOWN")
+
+  PRESENT_EXPECTED_COUNT=$(list_count "$PRESENT_EXPECTED")
+  PRESENT_OPTIONAL_COUNT=$(list_count "$PRESENT_OPTIONAL")
+  PRESENT_FORBIDDEN_COUNT=$(list_count "$PRESENT_FORBIDDEN")
+
+  BASE=$(awk -v h="$PRESENT_EXPECTED_COUNT" -v t="$EXPECTED_COUNT" \
+    'BEGIN { if (t == 0) print 0; else printf "%d", (h / t) * 100 + 0.5 }')
+
+  BONUS=$((PRESENT_OPTIONAL_COUNT * 10))
+  [ $BONUS -gt 20 ] && BONUS=20
+
+  FORBID_PENALTY=$((PRESENT_FORBIDDEN_COUNT * 10))
+
+  VALID_PENALTY=0
+  if [ "$JSONLD_COUNT" -gt 0 ] && [ "$JSONLD_INVALID" -gt 0 ]; then
+    VALID_PENALTY=$((JSONLD_INVALID * 5))
+    [ $VALID_PENALTY -gt 20 ] && VALID_PENALTY=20
   fi
-  [ "$HAS_BREADCRUMB" = "true" ] && STRUCTURED=$((STRUCTURED + 15))
-  if [ "$HAS_ARTICLE" = "true" ] || [ "$HAS_PRODUCT" = "true" ] || [ "$HAS_FAQ" = "true" ]; then
-    STRUCTURED=$((STRUCTURED + 15))
+
+  STRUCTURED=$((BASE + BONUS - FORBID_PENALTY - VALID_PENALTY))
+  [ $STRUCTURED -gt 100 ] && STRUCTURED=100
+  [ $STRUCTURED -lt 0 ] && STRUCTURED=0
+
+  CALCULATION=$(printf 'base: %d/%d expected present = %d; +%d optional bonus; -%d forbidden penalty; -%d validity penalty; clamp [0,100] = %d' \
+    "$PRESENT_EXPECTED_COUNT" "$EXPECTED_COUNT" "$BASE" \
+    "$BONUS" "$FORBID_PENALTY" "$VALID_PENALTY" "$STRUCTURED")
+
+  if [ "$STRUCTURED" -ge 100 ] && [ -z "$PRESENT_FORBIDDEN" ] && [ "$VALID_PENALTY" -eq 0 ]; then
+    NOTES="All expected schemas for pageType=$PAGE_TYPE are present. No structured-data action needed."
+  elif [ -n "$MISSING_EXPECTED" ] && [ -z "$PRESENT_FORBIDDEN" ]; then
+    NOTES="Missing expected schemas for pageType=$PAGE_TYPE: $MISSING_EXPECTED. Add these to raise the score."
+  elif [ -n "$PRESENT_FORBIDDEN" ] && [ -z "$MISSING_EXPECTED" ]; then
+    NOTES="Forbidden schemas present for pageType=$PAGE_TYPE: $PRESENT_FORBIDDEN. Remove these (or re-classify the page type with --page-type)."
+  elif [ -n "$PRESENT_FORBIDDEN" ] && [ -n "$MISSING_EXPECTED" ]; then
+    NOTES="Mixed: missing $MISSING_EXPECTED and forbidden present $PRESENT_FORBIDDEN for pageType=$PAGE_TYPE."
+  else
+    NOTES="Score reduced by $VALID_PENALTY pts due to invalid JSON-LD blocks."
   fi
 
+  STRUCTURED_GRADE=$(grade_for "$STRUCTURED")
+  STRUCTURED_OBJ=$(jq -n \
+    --argjson score "$STRUCTURED" \
+    --arg grade "$STRUCTURED_GRADE" \
+    --arg pageType "$PAGE_TYPE" \
+    --arg expectedList "$RUBRIC_EXPECTED" \
+    --arg optionalList "$RUBRIC_OPTIONAL" \
+    --arg forbiddenList "$RUBRIC_FORBIDDEN" \
+    --arg presentList "$PRESENT_TYPES" \
+    --arg missingList "$MISSING_EXPECTED" \
+    --arg extrasList "$EXTRAS" \
+    --arg forbiddenPresent "$PRESENT_FORBIDDEN" \
+    --argjson invalidCount "$JSONLD_INVALID" \
+    --argjson validPenalty "$VALID_PENALTY" \
+    --arg calculation "$CALCULATION" \
+    --arg notes "$NOTES" \
+    '
+    def to_arr: split(" ") | map(select(length > 0));
+    {
+      score: $score,
+      grade: $grade,
+      pageType: $pageType,
+      expected: ($expectedList | to_arr),
+      optional: ($optionalList | to_arr),
+      forbidden: ($forbiddenList | to_arr),
+      present: ($presentList | to_arr),
+      missing: ($missingList | to_arr),
+      extras: ($extrasList | to_arr),
+      violations: (
+        ($forbiddenPresent | to_arr | map({kind: "forbidden_schema", schema: ., impact: -10}))
+        + (if $validPenalty > 0
+           then [{kind: "invalid_jsonld", count: $invalidCount, impact: (0 - $validPenalty)}]
+           else []
+           end)
+      ),
+      calculation: $calculation,
+      notes: $notes
+    }
+    ')
+
   # --- Category 4: Technical Signals (0-100) ---
   TECHNICAL=0
   TITLE=$(jget "$META" '.title' "")
@@ -279,21 +471,16 @@ for bot_id in $BOTS; do
     [ "$LLMS_HAS_DESC" = "true" ] && AI=$((AI + 7))
     [ "$LLMS_URLS" -ge 1 ] && AI=$((AI + 6))
   fi
-  # Content citable (>= 200 words, effective for this bot)
   [ "$EFFECTIVE_WORD_COUNT" -ge 200 ] && AI=$((AI + 20))
-  # Semantic clarity: has H1 + description
   if [ "$H1_COUNT" -ge 1 ] && [ -n "$DESCRIPTION" ] && [ "$DESCRIPTION" != "null" ]; then
     AI=$((AI + 20))
   fi
 
-  # Cap categories at 100
   [ $ACC -gt 100 ] && ACC=100
   [ $CONTENT -gt 100 ] && CONTENT=100
-  [ $STRUCTURED -gt 100 ] && STRUCTURED=100
   [ $TECHNICAL -gt 100 ] && TECHNICAL=100
   [ $AI -gt 100 ] && AI=100
 
-  # Per-bot composite score (weighted average of 5 categories)
   BOT_SCORE=$(awk -v a=$ACC -v c=$CONTENT -v s=$STRUCTURED -v t=$TECHNICAL -v ai=$AI \
     -v wa=$W_ACCESSIBILITY -v wc=$W_CONTENT -v ws=$W_STRUCTURED -v wt=$W_TECHNICAL -v wai=$W_AI \
     'BEGIN { printf "%d", (a*wa + c*wc + s*ws + t*wt + ai*wai) / (wa+wc+ws+wt+wai) + 0.5 }')
@@ -301,7 +488,6 @@ for bot_id in $BOTS; do
   BOT_GRADE=$(grade_for "$BOT_SCORE")
   ACC_GRADE=$(grade_for "$ACC")
   CONTENT_GRADE=$(grade_for "$CONTENT")
-  STRUCTURED_GRADE=$(grade_for "$STRUCTURED")
   TECHNICAL_GRADE=$(grade_for "$TECHNICAL")
   AI_GRADE=$(grade_for "$AI")
 
@@ -315,8 +501,7 @@ for bot_id in $BOTS; do
     --arg accGrade "$ACC_GRADE" \
     --argjson content "$CONTENT" \
     --arg contentGrade "$CONTENT_GRADE" \
-    --argjson structured "$STRUCTURED" \
-    --arg structuredGrade "$STRUCTURED_GRADE" \
+    --argjson structured "$STRUCTURED_OBJ" \
     --argjson technical "$TECHNICAL" \
     --arg technicalGrade "$TECHNICAL_GRADE" \
     --argjson ai "$AI" \
@@ -338,17 +523,16 @@ for bot_id in $BOTS; do
         hydrationPenaltyPts: $hydrationPenalty
       },
       categories: {
-        accessibility: { score: $acc, grade: $accGrade },
-        contentVisibility: { score: $content, grade: $contentGrade },
-        structuredData: { score: $structured, grade: $structuredGrade },
-        technicalSignals: { score: $technical, grade: $technicalGrade },
-        aiReadiness: { score: $ai, grade: $aiGrade }
+        accessibility: { score: $acc, grade: $accGrade },
+        contentVisibility: { score: $content, grade: $contentGrade },
+        structuredData: $structured,
+        technicalSignals: { score: $technical, grade: $technicalGrade },
+        aiReadiness: { score: $ai, grade: $aiGrade }
       }
     }')
 
   BOTS_JSON=$(printf '%s' "$BOTS_JSON" | jq --argjson bot "$BOT_OBJ" --arg id "$bot_id" '.[$id] = $bot')
 
-  # Accumulate category averages
   CAT_ACCESSIBILITY_SUM=$((CAT_ACCESSIBILITY_SUM + ACC))
   CAT_CONTENT_SUM=$((CAT_CONTENT_SUM + CONTENT))
   CAT_STRUCTURED_SUM=$((CAT_STRUCTURED_SUM + STRUCTURED))
@@ -356,7 +540,6 @@ for bot_id in $BOTS; do
   CAT_AI_SUM=$((CAT_AI_SUM + AI))
   CAT_N=$((CAT_N + 1))
 
-  # Accumulate weighted overall
   W=$(overall_weight "$bot_id")
   if [ "$W" -gt 0 ]; then
     OVERALL_WEIGHTED_SUM=$((OVERALL_WEIGHTED_SUM + BOT_SCORE * W))
@@ -364,18 +547,15 @@ for bot_id in $BOTS; do
   fi
 done
 
-# Per-category averages (across all bots)
 CAT_ACC_AVG=$((CAT_ACCESSIBILITY_SUM / CAT_N))
 CAT_CONTENT_AVG=$((CAT_CONTENT_SUM / CAT_N))
 CAT_STRUCTURED_AVG=$((CAT_STRUCTURED_SUM / CAT_N))
 CAT_TECHNICAL_AVG=$((CAT_TECHNICAL_SUM / CAT_N))
 CAT_AI_AVG=$((CAT_AI_SUM / CAT_N))
 
-# Overall composite
 if [ "$OVERALL_WEIGHT_TOTAL" -gt 0 ]; then
   OVERALL_SCORE=$((OVERALL_WEIGHTED_SUM / OVERALL_WEIGHT_TOTAL))
 else
-  # Fall back to simple average if none of the 4 standard bots are present
   OVERALL_SCORE=$(((CAT_ACC_AVG + CAT_CONTENT_AVG + CAT_STRUCTURED_AVG + CAT_TECHNICAL_AVG + CAT_AI_AVG) / 5))
 fi
 
@@ -386,15 +566,14 @@ CAT_STRUCTURED_GRADE=$(grade_for "$CAT_STRUCTURED_AVG")
 CAT_TECHNICAL_GRADE=$(grade_for "$CAT_TECHNICAL_AVG")
 CAT_AI_GRADE=$(grade_for "$CAT_AI_AVG")
 
-# Get the URL from the first fetch file
-FIRST_FETCH=$(ls "$RESULTS_DIR"/fetch-*.json | head -1)
-TARGET_URL=$(jget "$FIRST_FETCH" '.url' "")
 TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
 
 jq -n \
   --arg url "$TARGET_URL" \
   --arg timestamp "$TIMESTAMP" \
-  --arg version "0.1.0" \
+  --arg version "0.2.0" \
+  --arg pageType "$PAGE_TYPE" \
+  --arg pageTypeOverride "$PAGE_TYPE_OVERRIDE" \
   --argjson overallScore "$OVERALL_SCORE" \
   --arg overallGrade "$OVERALL_GRADE" \
   --argjson bots "$BOTS_JSON" \
@@ -412,6 +591,8 @@ jq -n \
     url: $url,
     timestamp: $timestamp,
     version: $version,
+    pageType: $pageType,
+    pageTypeOverridden: ($pageTypeOverride | length > 0),
     overall: { score: $overallScore, grade: $overallGrade },
     bots: $bots,
     categories: {
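The new `pageTypeOverridden` field is just a length test on the override string. A minimal jq reproduction of that exact expression:

```shell
# pageTypeOverridden is true exactly when a --page-type value was supplied
# (non-empty PAGE_TYPE_OVERRIDE); empty string means URL-based detection.
jq -n --arg pageTypeOverride ""     '($pageTypeOverride | length > 0)'   # → false
jq -n --arg pageTypeOverride "root" '($pageTypeOverride | length > 0)'   # → true
```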