@braedenbuilds/crawl-sim 1.0.5 → 1.1.0
- package/README.md +13 -3
- package/SKILL.md +9 -0
- package/package.json +4 -1
- package/scripts/_lib.sh +30 -0
- package/scripts/compute-score.sh +245 -64
package/README.md
CHANGED

@@ -60,7 +60,7 @@ Then in Claude Code:
 
 Claude runs the full pipeline, interprets the results, and returns a score card plus prioritized findings.
 
-> **Why `npm install -g` instead of `npx`?** Recent versions of npx have a
+> **Why `npm install -g` instead of `npx`?** Recent versions of npx have a known issue linking bins for scoped single-bin packages in ephemeral installs. A persistent global install avoids the problem entirely. The git clone path below is the zero-npm fallback.
 
 ### As a standalone CLI
 
@@ -88,10 +88,13 @@ git clone https://github.com/BraedenBDev/crawl-sim.git ~/.claude/skills/crawl-si
 
 - **Multi-bot simulation.** Nine verified bot profiles covering Google, OpenAI, Anthropic, and Perplexity — including the bot-vs-user-agent distinction (e.g., `ChatGPT-User` officially ignores robots.txt; `claude-user` respects it).
 - **Quantified scoring.** Each bot is graded 0–100 across five categories with letter grades A through F, plus a weighted composite score.
+- **Page-type-aware rubric.** The structured-data category derives the page type from the URL (`root` / `detail` / `archive` / `faq` / `about` / `contact` / `generic`) and applies a per-type schema rubric. A homepage shipping `Organization` + `WebSite` scores 100 without being penalized for not having `BreadcrumbList` or `FAQPage`. Override the detection with `--page-type <type>` when the URL heuristic picks wrong.
+- **Self-explaining scores.** Every `structuredData` block in the JSON report ships `pageType`, `expected`, `optional`, `forbidden`, `present`, `missing`, `extras`, `violations`, `calculation`, and `notes` — so the narrative layer reads the scorer's reasoning directly instead of guessing what was penalized.
 - **Agent-native interpretation.** The Claude Code skill reads raw data, identifies root causes (framework signals, hydration boundaries, soft-404s), and recommends specific fixes.
 - **Three-layer output.** Terminal score card, prose narrative, and structured JSON — so humans and CI both get what they need.
 - **Confidence transparency.** Every claim is tagged `official`, `observed`, or `inferred`. The skill notes when recommendations depend on observed-but-undocumented behavior.
 - **Shell-native core.** All checks use only `curl` + `jq`. No Node, no Python, no Docker. Each script is independently invokable.
+- **Regression-tested.** `npm test` runs a 37-assertion scoring suite against synthetic fixtures, covering URL→page-type detection, per-type rubrics, missing/forbidden schema flagging, and golden non-structured output.
 - **Extensible.** Drop a new profile JSON into `profiles/` and it's auto-discovered.
 
 ---
@@ -107,6 +110,8 @@ git clone https://github.com/BraedenBDev/crawl-sim.git ~/.claude/skills/crawl-si
 /crawl-sim https://yoursite.com --json   # JSON only (for CI)
 ```
 
+The skill auto-detects page type from the URL. Pass `--page-type root|detail|archive|faq|about|contact|generic` to the underlying `compute-score.sh` when the URL heuristic picks the wrong type (e.g., a homepage at `/en/` that URL-parses as `generic`).
+
 Output is a three-layer report:
 
 1. **Score card** — ASCII overview with per-bot and per-category scores.
@@ -126,6 +131,7 @@ Every script is standalone and outputs JSON to stdout:
 ./scripts/check-llmstxt.sh https://yoursite.com
 ./scripts/check-sitemap.sh https://yoursite.com
 ./scripts/compute-score.sh /tmp/audit-data/
+./scripts/compute-score.sh --page-type root /tmp/audit-data/   # override URL heuristic
 ```
 
 ### CI/CD
@@ -148,7 +154,7 @@ Each bot is scored 0–100 across five weighted categories:
 |----------|:------:|----------|
 | **Accessibility** | 25 | robots.txt allows, HTTP 200, response time |
 | **Content Visibility** | 30 | server HTML word count, heading structure, internal links, image alt text |
-| **Structured Data** | 20 | JSON-LD presence, validity, page-
+| **Structured Data** | 20 | JSON-LD presence, validity, page-type-aware `@type` rubric (root / detail / archive / faq / about / contact / generic) |
 | **Technical Signals** | 15 | title / description / canonical / OG meta, sitemap inclusion |
 | **AI Readiness** | 10 | `llms.txt` structure, content citability |
 
@@ -218,6 +224,7 @@ crawl-sim/
 ├── bin/install.js           # npm installer
 ├── profiles/                # 9 verified bot profiles (JSON)
 ├── scripts/
+│   ├── _lib.sh              # shared helpers (URL parsing, page-type detection)
 │   ├── fetch-as-bot.sh      # curl with bot UA → JSON (status/headers/body/timing)
 │   ├── extract-meta.sh      # title, description, OG, headings, images
 │   ├── extract-jsonld.sh    # JSON-LD @type detection
@@ -227,8 +234,11 @@ crawl-sim/
 │   ├── check-sitemap.sh     # sitemap.xml URL inclusion
 │   ├── diff-render.sh       # optional Playwright server-vs-rendered comparison
 │   └── compute-score.sh     # aggregates all checks → per-bot + per-category scores
+├── test/
+│   ├── run-scoring-tests.sh # 37-assertion bash harness (run with `npm test`)
+│   └── fixtures/            # synthetic RUN_DIR fixtures for regression tests
 ├── research/                # Verified bot data sources
-└── docs/
+└── docs/                    # Design docs, issues, accuracy handoffs
 ```
 
 The shell scripts are the plumbing. The Claude Code skill is the intelligence — it reads the raw JSON, understands framework context (Next.js, Nuxt, SPAs), identifies root causes, and writes actionable recommendations.
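The `--json` output mentioned for CI above can be turned into a build gate; a minimal sketch, assuming the report shape shown in this diff (`overall.score` in the JSON report) and a hypothetical threshold of 80 — neither is part of the package:

```shell
# Hypothetical CI gate: fail the build when the weighted composite
# score drops below a chosen threshold. Report path and threshold
# are illustrative assumptions, not part of crawl-sim itself.
gate_on_score() {
  report="$1"; threshold="$2"
  score=$(jq -r '.overall.score // 0' "$report")
  if [ "$score" -lt "$threshold" ]; then
    echo "crawl-sim score $score is below threshold $threshold" >&2
    return 1
  fi
  echo "crawl-sim score $score meets threshold $threshold"
}

# Demo with a stub report; in CI this would be the --json output.
printf '%s' '{"overall":{"score":92,"grade":"A"}}' > crawl-sim-report.json
gate_on_score crawl-sim-report.json 80
```

In a pipeline the nonzero return from `gate_on_score` is what fails the job.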
package/SKILL.md
CHANGED

@@ -115,6 +115,14 @@ Tell the user: "Computing per-bot scores and finalizing the report..."
 cp "$RUN_DIR/score.json" ./crawl-sim-report.json
 ```
 
+**Page-type awareness.** `compute-score.sh` derives a page type from the target URL (`root` / `detail` / `archive` / `faq` / `about` / `contact` / `generic`) and picks a schema rubric accordingly. Root pages are expected to ship `Organization` + `WebSite` — penalizing them for missing `BreadcrumbList` or `FAQPage` would be wrong, so the scorer doesn't. If the URL heuristic picks the wrong type (e.g., a homepage at `/en/` that URL-parses as generic), pass `--page-type <type>`:
+
+```bash
+"$SKILL_DIR/scripts/compute-score.sh" --page-type root "$RUN_DIR" > "$RUN_DIR/score.json"
+```
+
+Valid values: `root`, `detail`, `archive`, `faq`, `about`, `contact`, `generic`. The detected (or overridden) page type is exposed on `score.pageType`, and `score.pageTypeOverridden` flips `true` when `--page-type` was used.
+
 ## Output Layer 1 — Score Card (ASCII)
 
 Print a boxed score card to the terminal:
@@ -168,6 +176,7 @@ Then produce **prioritized findings** ranked by total point impact across bots:
 ### Interpretation rules
 
 - **Cross-bot deltas are the headline.** Compare `visibility.effectiveWords` across bots — if Googlebot has significantly more than the AI bots, that's finding #1. The raw delta is in `visibility.missedWordsVsRendered`.
+- **Trust the structuredData rubric.** Every `bots.<bot>.categories.structuredData` block now carries `pageType`, `expected`, `optional`, `forbidden`, `present`, `missing`, `extras`, `violations`, `calculation`, and `notes`. Read `missing` and `violations` directly — never guess what the scorer was penalizing for. If `notes` says the page scores 100 with no action needed, that IS the finding; don't invent fixes. If the rubric looks wrong for this specific page (e.g., a homepage detected as `generic` because the URL ends in `/en/`), rerun with `--page-type <correct-type>` instead of arguing with the score. Never recommend adding a schema that already appears in `present` or `extras`.
 - **Confidence transparency.** If a claim depends on a bot profile's `rendersJavaScript: false` at `observed` confidence (not `official`), note it: *"Based on observed behavior, not official documentation."*
 - **Framework detection.** Scan the HTML body for signals: `<meta name="next-head-count">` or `_next/static` → Next.js (Pages Router or App Router respectively), `<div id="__nuxt">` → Nuxt, `<div id="app">` with thin content → SPA (Vue/React CSR), `<!--$-->` placeholder tags → React 18 Suspense. Use these to tailor fix recommendations.
 - **No speculation beyond the data.** If server HTML has 0 `<a>` tags inside a component, say "component not present in server HTML" — not "JavaScript hydration failed" unless the diff-render data proves it.
package/package.json
CHANGED

@@ -1,10 +1,13 @@
 {
   "name": "@braedenbuilds/crawl-sim",
-  "version": "1.0
+  "version": "1.1.0",
   "description": "Agent-native multi-bot web crawler simulator. See your site through the eyes of Googlebot, GPTBot, ClaudeBot, and PerplexityBot.",
   "bin": {
     "crawl-sim": "bin/install.js"
   },
+  "scripts": {
+    "test": "./test/run-scoring-tests.sh"
+  },
   "keywords": [
     "seo",
     "crawler",
package/scripts/_lib.sh
CHANGED

@@ -41,6 +41,36 @@ count_words() {
   sed 's/<[^>]*>//g' "$1" | tr -s '[:space:]' '\n' | grep -c '[a-zA-Z0-9]' || true
 }
 
+# Detect the structural page type of a URL based on its path.
+# Returns one of: root, detail, archive, faq, about, contact, generic.
+#
+# Used by compute-score.sh to pick a schema rubric, but also exposed here
+# so other tooling (narrative layer, planned multi-URL mode) can classify
+# URLs consistently without re-implementing the heuristic.
+page_type_for_url() {
+  local url="$1"
+  local path
+  path=$(path_from_url "$url" | sed 's#[?#].*##')
+  if [ "$path" = "/" ]; then
+    echo "root"
+    return
+  fi
+  local trimmed lower
+  trimmed=$(printf '%s' "$path" | sed 's#^/##' | sed 's#/$##')
+  lower=$(printf '%s' "$trimmed" | tr '[:upper:]' '[:lower:]')
+  case "$lower" in
+    "") echo "root" ;;
+    work|journal|blog|articles|news|careers|projects|case-studies|cases)
+      echo "archive" ;;
+    work/*|articles/*|journal/*|blog/*|news/*|case-studies/*|cases/*|case/*|careers/*|projects/*)
+      echo "detail" ;;
+    *faq*) echo "faq" ;;
+    *about*|*team*|*purpose*|*who-we-are*) echo "about" ;;
+    *contact*) echo "contact" ;;
+    *) echo "generic" ;;
+  esac
+}
+
 # Fetch a URL to a local file and return the HTTP status code on stdout.
 # Usage: status=$(fetch_to_file <url> <output-file> [timeout-seconds])
 fetch_to_file() {
package/scripts/compute-score.sh
CHANGED

@@ -2,7 +2,7 @@
 set -eu
 
 # compute-score.sh — Aggregate check outputs into per-bot + per-category scores
-# Usage: compute-score.sh <results-dir>
+# Usage: compute-score.sh [--page-type <type>] <results-dir>
 # Output: JSON to stdout
 #
 # Expected filenames in <results-dir>:
@@ -14,8 +14,56 @@ set -eu
 # llmstxt.json     — check-llmstxt.sh output (bot-independent)
 # sitemap.json     — check-sitemap.sh output (bot-independent)
 # diff-render.json — diff-render.sh output (optional, Googlebot only)
+#
+# The --page-type flag overrides URL-based page-type detection. Valid values:
+# root, detail, archive, faq, about, contact, generic.
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=_lib.sh
+. "$SCRIPT_DIR/_lib.sh"
+
+PAGE_TYPE_OVERRIDE=""
+while [ $# -gt 0 ]; do
+  case "$1" in
+    --page-type)
+      [ $# -ge 2 ] || { echo "--page-type requires a value" >&2; exit 2; }
+      PAGE_TYPE_OVERRIDE="$2"
+      shift 2
+      ;;
+    --page-type=*)
+      PAGE_TYPE_OVERRIDE="${1#--page-type=}"
+      shift
+      ;;
+    -h|--help)
+      echo "Usage: compute-score.sh [--page-type <type>] <results-dir>"
+      exit 0
+      ;;
+    --)
+      shift
+      break
+      ;;
+    -*)
+      echo "Unknown flag: $1" >&2
+      exit 2
+      ;;
+    *)
+      break
+      ;;
+  esac
+done
+
+RESULTS_DIR="${1:?Usage: compute-score.sh [--page-type <type>] <results-dir>}"
+
+if [ -n "$PAGE_TYPE_OVERRIDE" ]; then
+  case "$PAGE_TYPE_OVERRIDE" in
+    root|detail|archive|faq|about|contact|generic) ;;
+    *)
+      echo "Error: invalid --page-type '$PAGE_TYPE_OVERRIDE' (valid: root, detail, archive, faq, about, contact, generic)" >&2
+      exit 2
+      ;;
+  esac
+fi
 
-RESULTS_DIR="${1:?Usage: compute-score.sh <results-dir>}"
 printf '[compute-score] aggregating %s\n' "$RESULTS_DIR" >&2
 
 if [ ! -d "$RESULTS_DIR" ]; then
@@ -31,7 +79,6 @@ W_TECHNICAL=15
 W_AI=10
 
 # Overall composite weights (per bot)
-# Default: Googlebot 40, GPTBot 20, ClaudeBot 20, PerplexityBot 20
 overall_weight() {
   case "$1" in
     googlebot) echo 40 ;;
@@ -42,7 +89,6 @@ overall_weight() {
   esac
 }
 
-# Grade from score (0-100)
 grade_for() {
   local s=$1
   if [ "$s" -ge 93 ]; then echo "A"
@@ -60,7 +106,85 @@ grade_for() {
   fi
 }
 
-#
+# Rubric: expected schema types per page type.
+rubric_expected() {
+  case "$1" in
+    root) echo "Organization WebSite" ;;
+    detail) echo "Article BreadcrumbList" ;;
+    archive) echo "CollectionPage ItemList BreadcrumbList" ;;
+    faq) echo "FAQPage BreadcrumbList" ;;
+    about) echo "AboutPage BreadcrumbList Organization" ;;
+    contact) echo "ContactPage BreadcrumbList" ;;
+    *) echo "WebPage BreadcrumbList" ;;
+  esac
+}
+
+rubric_optional() {
+  case "$1" in
+    root) echo "ProfessionalService LocalBusiness" ;;
+    detail) echo "NewsArticle ImageObject Person" ;;
+    archive) echo "" ;;
+    faq) echo "WebPage" ;;
+    about) echo "Person" ;;
+    contact) echo "PostalAddress" ;;
+    *) echo "" ;;
+  esac
+}
+
+rubric_forbidden() {
+  case "$1" in
+    root) echo "BreadcrumbList Article FAQPage" ;;
+    detail) echo "CollectionPage ItemList" ;;
+    archive) echo "Article Product" ;;
+    faq) echo "Article CollectionPage" ;;
+    about) echo "Article Product" ;;
+    contact) echo "Article Product" ;;
+    *) echo "" ;;
+  esac
+}
+
+list_contains() {
+  local needle="$1"
+  shift
+  local item
+  for item in "$@"; do
+    [ "$item" = "$needle" ] && return 0
+  done
+  return 1
+}
+
+list_count() {
+  # shellcheck disable=SC2086
+  set -- $1
+  echo "$#"
+}
+
+list_intersect() {
+  local a="$1" b="$2"
+  local out="" item
+  # shellcheck disable=SC2086
+  for item in $a; do
+    # shellcheck disable=SC2086
+    if list_contains "$item" $b; then
+      out="$out $item"
+    fi
+  done
+  printf '%s' "${out# }"
+}
+
+list_diff() {
+  local a="$1" b="$2"
+  local out="" item
+  # shellcheck disable=SC2086
+  for item in $a; do
+    # shellcheck disable=SC2086
+    if ! list_contains "$item" $b; then
+      out="$out $item"
+    fi
+  done
+  printf '%s' "${out# }"
+}
+
 jget() {
   local file="$1"
   local query="$2"
@@ -75,7 +199,6 @@ jget() {
 jget_num() {
   local v
   v=$(jget "$1" "$2" "0")
-  # Replace "null" or non-numeric with 0
   if ! printf '%s' "$v" | grep -qE '^-?[0-9]+(\.[0-9]+)?$'; then
     echo "0"
   else
@@ -90,8 +213,10 @@ jget_bool() {
 }
 
 BOTS=""
+FIRST_FETCH=""
 for f in "$RESULTS_DIR"/fetch-*.json; do
   [ -f "$f" ] || continue
+  [ -z "$FIRST_FETCH" ] && FIRST_FETCH="$f"
   bot_id=$(basename "$f" .json | sed 's/^fetch-//')
   BOTS="$BOTS $bot_id"
 done
@@ -105,14 +230,10 @@ LLMSTXT_FILE="$RESULTS_DIR/llmstxt.json"
 SITEMAP_FILE="$RESULTS_DIR/sitemap.json"
 DIFF_RENDER_FILE="$RESULTS_DIR/diff-render.json"
 
-# Load Playwright render-delta data once (used to differentiate JS-rendering
-# bots from non-rendering ones). If the comparison was skipped or missing,
-# all bots score against server HTML only.
 DIFF_AVAILABLE=false
 DIFF_RENDERED_WORDS=0
 DIFF_DELTA_PCT=0
 if [ -f "$DIFF_RENDER_FILE" ]; then
-  # Explicit null check — `.skipped // true` would treat real false as null.
   DIFF_SKIPPED=$(jq -r '.skipped | if . == null then "true" else tostring end' "$DIFF_RENDER_FILE" 2>/dev/null || echo "true")
   if [ "$DIFF_SKIPPED" = "false" ]; then
     DIFF_AVAILABLE=true
@@ -121,9 +242,22 @@ if [ -f "$DIFF_RENDER_FILE" ]; then
   fi
 fi
 
+# Resolve page type once from the first fetch file's URL, unless overridden.
+TARGET_URL=$(jget "$FIRST_FETCH" '.url' "")
+if [ -n "$PAGE_TYPE_OVERRIDE" ]; then
+  PAGE_TYPE="$PAGE_TYPE_OVERRIDE"
+else
+  PAGE_TYPE=$(page_type_for_url "$TARGET_URL")
+fi
+printf '[compute-score] page type: %s (url: %s)\n' "$PAGE_TYPE" "$TARGET_URL" >&2
+
+RUBRIC_EXPECTED="$(rubric_expected "$PAGE_TYPE")"
+RUBRIC_OPTIONAL="$(rubric_optional "$PAGE_TYPE")"
+RUBRIC_FORBIDDEN="$(rubric_forbidden "$PAGE_TYPE")"
+EXPECTED_COUNT=$(list_count "$RUBRIC_EXPECTED")
+
 BOTS_JSON="{}"
 
-# Accumulators for per-category averages (across bots)
 CAT_ACCESSIBILITY_SUM=0
 CAT_CONTENT_SUM=0
 CAT_STRUCTURED_SUM=0
@@ -131,7 +265,6 @@ CAT_TECHNICAL_SUM=0
 CAT_AI_SUM=0
 CAT_N=0
 
-# Accumulators for overall weighted composite
 OVERALL_WEIGHTED_SUM=0
 OVERALL_WEIGHT_TOTAL=0
 
@@ -146,17 +279,10 @@ for bot_id in $BOTS; do
   STATUS=$(jget_num "$FETCH" '.status')
   TOTAL_TIME=$(jget_num "$FETCH" '.timing.total')
   SERVER_WORD_COUNT=$(jget_num "$FETCH" '.wordCount')
-  # Read with explicit null fallback — jq's `//` is unsafe here because it
-  # treats boolean false as falsy, which is exactly the value we need to see.
   RENDERS_JS=$(jq -r '.bot.rendersJavaScript | if . == null then "unknown" else tostring end' "$FETCH" 2>/dev/null || echo "unknown")
 
   ROBOTS_ALLOWED=$(jget_bool "$ROBOTS" '.allowed')
 
-  # Effective word count depends on JS rendering capability:
-  # - true (e.g. Googlebot) + diff-render data → rendered DOM word count
-  # - false (AI training/search bots, observed) → server HTML only, with
-  #   penalty proportional to the rendering delta
-  # - unknown → conservative: server HTML (same as false but no penalty)
   EFFECTIVE_WORD_COUNT=$SERVER_WORD_COUNT
   HYDRATION_PENALTY=0
   MISSED_WORDS=0
@@ -164,11 +290,8 @@ for bot_id in $BOTS; do
   if [ "$RENDERS_JS" = "true" ]; then
     EFFECTIVE_WORD_COUNT=$DIFF_RENDERED_WORDS
   elif [ "$RENDERS_JS" = "false" ]; then
-    # Absolute-value delta: if rendered DOM has materially more than server,
-    # AI bots are missing that content.
    ABS_DELTA=$(awk -v d="$DIFF_DELTA_PCT" 'BEGIN { printf "%d", (d < 0 ? -d : d) + 0.5 }')
     if [ "$ABS_DELTA" -gt 5 ]; then
-      # Scale penalty: 5% delta = 0, 10% = 5, 20%+ = 15 (cap)
       HYDRATION_PENALTY=$(awk -v d="$ABS_DELTA" 'BEGIN {
         p = (d - 5)
         if (p > 15) p = 15
@@ -182,11 +305,8 @@ for bot_id in $BOTS; do
 
   # --- Category 1: Accessibility (0-100) ---
   ACC=0
-  # robots.txt allows: 40
   [ "$ROBOTS_ALLOWED" = "true" ] && ACC=$((ACC + 40))
-  # HTTP 200: 40
   [ "$STATUS" = "200" ] && ACC=$((ACC + 40))
-  # Response time: <2s = 20, <5s = 10, else 0
   TIME_SCORE=$(awk -v t="$TOTAL_TIME" 'BEGIN { if (t < 2) print 20; else if (t < 5) print 10; else print 0 }')
   ACC=$((ACC + TIME_SCORE))
 
@@ -216,34 +336,106 @@ for bot_id in $BOTS; do
     CONTENT=$((CONTENT + ALT_SCORE))
   fi
 
-  # Apply hydration penalty for non-rendering bots that are missing content
   CONTENT=$((CONTENT - HYDRATION_PENALTY))
   [ $CONTENT -lt 0 ] && CONTENT=0
 
   # --- Category 3: Structured Data (0-100) ---
-  STRUCTURED=0
   JSONLD_COUNT=$(jget_num "$JSONLD" '.blockCount')
   JSONLD_VALID=$(jget_num "$JSONLD" '.validCount')
   JSONLD_INVALID=$(jget_num "$JSONLD" '.invalidCount')
-
-  [ "$JSONLD_COUNT" -ge 1 ] && STRUCTURED=$((STRUCTURED + 30))
-  if [ "$JSONLD_COUNT" -ge 1 ] && [ "$JSONLD_INVALID" -eq 0 ]; then
-    STRUCTURED=$((STRUCTURED + 20))
+
+  if [ -f "$JSONLD" ]; then
+    PRESENT_TYPES=$(jq -r '.types[]? // empty' "$JSONLD" 2>/dev/null | awk 'NF && !seen[$0]++' | tr '\n' ' ')
+    PRESENT_TYPES=${PRESENT_TYPES% }
+  else
+    PRESENT_TYPES=""
   fi
-
+
+  PRESENT_EXPECTED=$(list_intersect "$RUBRIC_EXPECTED" "$PRESENT_TYPES")
+  PRESENT_OPTIONAL=$(list_intersect "$RUBRIC_OPTIONAL" "$PRESENT_TYPES")
+  PRESENT_FORBIDDEN=$(list_intersect "$RUBRIC_FORBIDDEN" "$PRESENT_TYPES")
+  MISSING_EXPECTED=$(list_diff "$RUBRIC_EXPECTED" "$PRESENT_TYPES")
+  RUBRIC_KNOWN="$RUBRIC_EXPECTED $RUBRIC_OPTIONAL $RUBRIC_FORBIDDEN"
+  EXTRAS=$(list_diff "$PRESENT_TYPES" "$RUBRIC_KNOWN")
+
+  PRESENT_EXPECTED_COUNT=$(list_count "$PRESENT_EXPECTED")
+  PRESENT_OPTIONAL_COUNT=$(list_count "$PRESENT_OPTIONAL")
+  PRESENT_FORBIDDEN_COUNT=$(list_count "$PRESENT_FORBIDDEN")
+
+  BASE=$(awk -v h="$PRESENT_EXPECTED_COUNT" -v t="$EXPECTED_COUNT" \
+    'BEGIN { if (t == 0) print 0; else printf "%d", (h / t) * 100 + 0.5 }')
+
+  BONUS=$((PRESENT_OPTIONAL_COUNT * 10))
+  [ $BONUS -gt 20 ] && BONUS=20
+
+  FORBID_PENALTY=$((PRESENT_FORBIDDEN_COUNT * 10))
+
+  VALID_PENALTY=0
+  if [ "$JSONLD_COUNT" -gt 0 ] && [ "$JSONLD_INVALID" -gt 0 ]; then
+    VALID_PENALTY=$((JSONLD_INVALID * 5))
+    [ $VALID_PENALTY -gt 20 ] && VALID_PENALTY=20
   fi
-
+
+  STRUCTURED=$((BASE + BONUS - FORBID_PENALTY - VALID_PENALTY))
+  [ $STRUCTURED -gt 100 ] && STRUCTURED=100
+  [ $STRUCTURED -lt 0 ] && STRUCTURED=0
+
+  CALCULATION=$(printf 'base: %d/%d expected present = %d; +%d optional bonus; -%d forbidden penalty; -%d validity penalty; clamp [0,100] = %d' \
+    "$PRESENT_EXPECTED_COUNT" "$EXPECTED_COUNT" "$BASE" \
+    "$BONUS" "$FORBID_PENALTY" "$VALID_PENALTY" "$STRUCTURED")
+
+  if [ "$STRUCTURED" -ge 100 ] && [ -z "$PRESENT_FORBIDDEN" ] && [ "$VALID_PENALTY" -eq 0 ]; then
+    NOTES="All expected schemas for pageType=$PAGE_TYPE are present. No structured-data action needed."
+  elif [ -n "$MISSING_EXPECTED" ] && [ -z "$PRESENT_FORBIDDEN" ]; then
+    NOTES="Missing expected schemas for pageType=$PAGE_TYPE: $MISSING_EXPECTED. Add these to raise the score."
+  elif [ -n "$PRESENT_FORBIDDEN" ] && [ -z "$MISSING_EXPECTED" ]; then
+    NOTES="Forbidden schemas present for pageType=$PAGE_TYPE: $PRESENT_FORBIDDEN. Remove these (or re-classify the page type with --page-type)."
+  elif [ -n "$PRESENT_FORBIDDEN" ] && [ -n "$MISSING_EXPECTED" ]; then
+    NOTES="Mixed: missing $MISSING_EXPECTED and forbidden present $PRESENT_FORBIDDEN for pageType=$PAGE_TYPE."
+  else
+    NOTES="Score reduced by $VALID_PENALTY pts due to invalid JSON-LD blocks."
+  fi
+
+  STRUCTURED_GRADE=$(grade_for "$STRUCTURED")
+  STRUCTURED_OBJ=$(jq -n \
+    --argjson score "$STRUCTURED" \
+    --arg grade "$STRUCTURED_GRADE" \
+    --arg pageType "$PAGE_TYPE" \
+    --arg expectedList "$RUBRIC_EXPECTED" \
+    --arg optionalList "$RUBRIC_OPTIONAL" \
+    --arg forbiddenList "$RUBRIC_FORBIDDEN" \
+    --arg presentList "$PRESENT_TYPES" \
+    --arg missingList "$MISSING_EXPECTED" \
+    --arg extrasList "$EXTRAS" \
+    --arg forbiddenPresent "$PRESENT_FORBIDDEN" \
+    --argjson invalidCount "$JSONLD_INVALID" \
+    --argjson validPenalty "$VALID_PENALTY" \
+    --arg calculation "$CALCULATION" \
+    --arg notes "$NOTES" \
+    '
+    def to_arr: split(" ") | map(select(length > 0));
+    {
+      score: $score,
+      grade: $grade,
+      pageType: $pageType,
+      expected: ($expectedList | to_arr),
+      optional: ($optionalList | to_arr),
+      forbidden: ($forbiddenList | to_arr),
+      present: ($presentList | to_arr),
+      missing: ($missingList | to_arr),
+      extras: ($extrasList | to_arr),
+      violations: (
+        ($forbiddenPresent | to_arr | map({kind: "forbidden_schema", schema: ., impact: -10}))
+        + (if $validPenalty > 0
+           then [{kind: "invalid_jsonld", count: $invalidCount, impact: (0 - $validPenalty)}]
+           else []
+           end)
+      ),
+      calculation: $calculation,
+      notes: $notes
+    }
+    ')
+
   # --- Category 4: Technical Signals (0-100) ---
   TECHNICAL=0
   TITLE=$(jget "$META" '.title' "")
@@ -279,21 +471,16 @@ for bot_id in $BOTS; do
     [ "$LLMS_HAS_DESC" = "true" ] && AI=$((AI + 7))
     [ "$LLMS_URLS" -ge 1 ] && AI=$((AI + 6))
   fi
-  # Content citable (>= 200 words, effective for this bot)
   [ "$EFFECTIVE_WORD_COUNT" -ge 200 ] && AI=$((AI + 20))
-  # Semantic clarity: has H1 + description
   if [ "$H1_COUNT" -ge 1 ] && [ -n "$DESCRIPTION" ] && [ "$DESCRIPTION" != "null" ]; then
     AI=$((AI + 20))
   fi
 
-  # Cap categories at 100
   [ $ACC -gt 100 ] && ACC=100
   [ $CONTENT -gt 100 ] && CONTENT=100
-  [ $STRUCTURED -gt 100 ] && STRUCTURED=100
   [ $TECHNICAL -gt 100 ] && TECHNICAL=100
   [ $AI -gt 100 ] && AI=100
 
-  # Per-bot composite score (weighted average of 5 categories)
   BOT_SCORE=$(awk -v a=$ACC -v c=$CONTENT -v s=$STRUCTURED -v t=$TECHNICAL -v ai=$AI \
     -v wa=$W_ACCESSIBILITY -v wc=$W_CONTENT -v ws=$W_STRUCTURED -v wt=$W_TECHNICAL -v wai=$W_AI \
     'BEGIN { printf "%d", (a*wa + c*wc + s*ws + t*wt + ai*wai) / (wa+wc+ws+wt+wai) + 0.5 }')
@@ -301,7 +488,6 @@ for bot_id in $BOTS; do
   BOT_GRADE=$(grade_for "$BOT_SCORE")
   ACC_GRADE=$(grade_for "$ACC")
   CONTENT_GRADE=$(grade_for "$CONTENT")
-  STRUCTURED_GRADE=$(grade_for "$STRUCTURED")
   TECHNICAL_GRADE=$(grade_for "$TECHNICAL")
   AI_GRADE=$(grade_for "$AI")
 
@@ -315,8 +501,7 @@ for bot_id in $BOTS; do
     --arg accGrade "$ACC_GRADE" \
     --argjson content "$CONTENT" \
     --arg contentGrade "$CONTENT_GRADE" \
-    --argjson structured "$
-    --arg structuredGrade "$STRUCTURED_GRADE" \
+    --argjson structured "$STRUCTURED_OBJ" \
     --argjson technical "$TECHNICAL" \
     --arg technicalGrade "$TECHNICAL_GRADE" \
     --argjson ai "$AI" \
@@ -338,17 +523,16 @@ for bot_id in $BOTS; do
       hydrationPenaltyPts: $hydrationPenalty
     },
     categories: {
-      accessibility: { score: $acc,
-      contentVisibility: { score: $content,
-      structuredData:
-      technicalSignals: { score: $technical,
-      aiReadiness: { score: $ai,
+      accessibility: { score: $acc, grade: $accGrade },
+      contentVisibility: { score: $content, grade: $contentGrade },
+      structuredData: $structured,
+      technicalSignals: { score: $technical, grade: $technicalGrade },
+      aiReadiness: { score: $ai, grade: $aiGrade }
     }
   }')
 
   BOTS_JSON=$(printf '%s' "$BOTS_JSON" | jq --argjson bot "$BOT_OBJ" --arg id "$bot_id" '.[$id] = $bot')
 
-  # Accumulate category averages
   CAT_ACCESSIBILITY_SUM=$((CAT_ACCESSIBILITY_SUM + ACC))
   CAT_CONTENT_SUM=$((CAT_CONTENT_SUM + CONTENT))
   CAT_STRUCTURED_SUM=$((CAT_STRUCTURED_SUM + STRUCTURED))
@@ -356,7 +540,6 @@ for bot_id in $BOTS; do
   CAT_AI_SUM=$((CAT_AI_SUM + AI))
   CAT_N=$((CAT_N + 1))
 
-  # Accumulate weighted overall
   W=$(overall_weight "$bot_id")
   if [ "$W" -gt 0 ]; then
     OVERALL_WEIGHTED_SUM=$((OVERALL_WEIGHTED_SUM + BOT_SCORE * W))
@@ -364,18 +547,15 @@ for bot_id in $BOTS; do
   fi
 done
 
-# Per-category averages (across all bots)
 CAT_ACC_AVG=$((CAT_ACCESSIBILITY_SUM / CAT_N))
 CAT_CONTENT_AVG=$((CAT_CONTENT_SUM / CAT_N))
 CAT_STRUCTURED_AVG=$((CAT_STRUCTURED_SUM / CAT_N))
 CAT_TECHNICAL_AVG=$((CAT_TECHNICAL_SUM / CAT_N))
 CAT_AI_AVG=$((CAT_AI_SUM / CAT_N))
 
-# Overall composite
 if [ "$OVERALL_WEIGHT_TOTAL" -gt 0 ]; then
   OVERALL_SCORE=$((OVERALL_WEIGHTED_SUM / OVERALL_WEIGHT_TOTAL))
 else
-  # Fall back to simple average if none of the 4 standard bots are present
   OVERALL_SCORE=$(((CAT_ACC_AVG + CAT_CONTENT_AVG + CAT_STRUCTURED_AVG + CAT_TECHNICAL_AVG + CAT_AI_AVG) / 5))
 fi
 
@@ -386,15 +566,14 @@ CAT_STRUCTURED_GRADE=$(grade_for "$CAT_STRUCTURED_AVG")
 CAT_TECHNICAL_GRADE=$(grade_for "$CAT_TECHNICAL_AVG")
 CAT_AI_GRADE=$(grade_for "$CAT_AI_AVG")
 
-# Get the URL from the first fetch file
-FIRST_FETCH=$(ls "$RESULTS_DIR"/fetch-*.json | head -1)
-TARGET_URL=$(jget "$FIRST_FETCH" '.url' "")
 TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
 
 jq -n \
   --arg url "$TARGET_URL" \
   --arg timestamp "$TIMESTAMP" \
-  --arg version "0.
+  --arg version "0.2.0" \
+  --arg pageType "$PAGE_TYPE" \
+  --arg pageTypeOverride "$PAGE_TYPE_OVERRIDE" \
  --argjson overallScore "$OVERALL_SCORE" \
   --arg overallGrade "$OVERALL_GRADE" \
   --argjson bots "$BOTS_JSON" \
@@ -412,6 +591,8 @@ jq -n \
     url: $url,
     timestamp: $timestamp,
     version: $version,
+    pageType: $pageType,
+    pageTypeOverridden: ($pageTypeOverride | length > 0),
    overall: { score: $overallScore, grade: $overallGrade },
     bots: $bots,
     categories: {