@braedenbuilds/crawl-sim 1.0.5 → 1.1.0
- package/README.md +13 -3
- package/SKILL.md +9 -0
- package/package.json +4 -1
- package/scripts/_lib.sh +30 -0
- package/scripts/compute-score.sh +245 -64
package/README.md
CHANGED

@@ -60,7 +60,7 @@ Then in Claude Code:
 
 Claude runs the full pipeline, interprets the results, and returns a score card plus prioritized findings.
 
-> **Why `npm install -g` instead of `npx`?** Recent versions of npx have a
+> **Why `npm install -g` instead of `npx`?** Recent versions of npx have a known issue linking bins for scoped single-bin packages in ephemeral installs. A persistent global install avoids the problem entirely. The git clone path below is the zero-npm fallback.
 
 ### As a standalone CLI
 
@@ -88,10 +88,13 @@ git clone https://github.com/BraedenBDev/crawl-sim.git ~/.claude/skills/crawl-si
 
 - **Multi-bot simulation.** Nine verified bot profiles covering Google, OpenAI, Anthropic, and Perplexity — including the bot-vs-user-agent distinction (e.g., `ChatGPT-User` officially ignores robots.txt; `claude-user` respects it).
 - **Quantified scoring.** Each bot is graded 0–100 across five categories with letter grades A through F, plus a weighted composite score.
+- **Page-type-aware rubric.** The structured-data category derives the page type from the URL (`root` / `detail` / `archive` / `faq` / `about` / `contact` / `generic`) and applies a per-type schema rubric. A homepage shipping `Organization` + `WebSite` scores 100 without being penalized for not having `BreadcrumbList` or `FAQPage`. Override the detection with `--page-type <type>` when the URL heuristic picks wrong.
+- **Self-explaining scores.** Every `structuredData` block in the JSON report ships `pageType`, `expected`, `optional`, `forbidden`, `present`, `missing`, `extras`, `violations`, `calculation`, and `notes` — so the narrative layer reads the scorer's reasoning directly instead of guessing what was penalized.
 - **Agent-native interpretation.** The Claude Code skill reads raw data, identifies root causes (framework signals, hydration boundaries, soft-404s), and recommends specific fixes.
 - **Three-layer output.** Terminal score card, prose narrative, and structured JSON — so humans and CI both get what they need.
 - **Confidence transparency.** Every claim is tagged `official`, `observed`, or `inferred`. The skill notes when recommendations depend on observed-but-undocumented behavior.
 - **Shell-native core.** All checks use only `curl` + `jq`. No Node, no Python, no Docker. Each script is independently invokable.
+- **Regression-tested.** `npm test` runs a 37-assertion scoring suite against synthetic fixtures, covering URL→page-type detection, per-type rubrics, missing/forbidden schema flagging, and golden non-structured output.
 - **Extensible.** Drop a new profile JSON into `profiles/` and it's auto-discovered.
 
 ---
@@ -107,6 +110,8 @@ git clone https://github.com/BraedenBDev/crawl-sim.git ~/.claude/skills/crawl-si
 /crawl-sim https://yoursite.com --json   # JSON only (for CI)
 ```
 
+The skill auto-detects page type from the URL. Pass `--page-type root|detail|archive|faq|about|contact|generic` to the underlying `compute-score.sh` when the URL heuristic picks the wrong type (e.g., a homepage at `/en/` that URL-parses as `generic`).
+
 Output is a three-layer report:
 
 1. **Score card** — ASCII overview with per-bot and per-category scores.
@@ -126,6 +131,7 @@ Every script is standalone and outputs JSON to stdout:
 ./scripts/check-llmstxt.sh https://yoursite.com
 ./scripts/check-sitemap.sh https://yoursite.com
 ./scripts/compute-score.sh /tmp/audit-data/
+./scripts/compute-score.sh --page-type root /tmp/audit-data/   # override URL heuristic
 ```
 
 ### CI/CD
@@ -148,7 +154,7 @@ Each bot is scored 0–100 across five weighted categories:
 |----------|:------:|----------|
 | **Accessibility** | 25 | robots.txt allows, HTTP 200, response time |
 | **Content Visibility** | 30 | server HTML word count, heading structure, internal links, image alt text |
-| **Structured Data** | 20 | JSON-LD presence, validity, page-
+| **Structured Data** | 20 | JSON-LD presence, validity, page-type-aware `@type` rubric (root / detail / archive / faq / about / contact / generic) |
 | **Technical Signals** | 15 | title / description / canonical / OG meta, sitemap inclusion |
 | **AI Readiness** | 10 | `llms.txt` structure, content citability |
 
@@ -218,6 +224,7 @@ crawl-sim/
 ├── bin/install.js           # npm installer
 ├── profiles/                # 9 verified bot profiles (JSON)
 ├── scripts/
+│   ├── _lib.sh              # shared helpers (URL parsing, page-type detection)
 │   ├── fetch-as-bot.sh      # curl with bot UA → JSON (status/headers/body/timing)
 │   ├── extract-meta.sh      # title, description, OG, headings, images
 │   ├── extract-jsonld.sh    # JSON-LD @type detection
@@ -227,8 +234,11 @@ crawl-sim/
 │   ├── check-sitemap.sh     # sitemap.xml URL inclusion
 │   ├── diff-render.sh       # optional Playwright server-vs-rendered comparison
 │   └── compute-score.sh     # aggregates all checks → per-bot + per-category scores
+├── test/
+│   ├── run-scoring-tests.sh # 37-assertion bash harness (run with `npm test`)
+│   └── fixtures/            # synthetic RUN_DIR fixtures for regression tests
 ├── research/                # Verified bot data sources
-└── docs/
+└── docs/                    # Design docs, issues, accuracy handoffs
 ```
 
 The shell scripts are the plumbing. The Claude Code skill is the intelligence — it reads the raw JSON, understands framework context (Next.js, Nuxt, SPAs), identifies root causes, and writes actionable recommendations.
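The `--json` output mentioned for CI above can be turned into a build gate; a minimal sketch, assuming the report shape shown in this diff (`overall.score` in the JSON report) and a hypothetical threshold of 80 — neither is part of the package:

```shell
# Hypothetical CI gate: fail the build when the weighted composite
# score drops below a chosen threshold. Report path and threshold
# are illustrative assumptions, not part of crawl-sim itself.
gate_on_score() {
  report="$1"; threshold="$2"
  score=$(jq -r '.overall.score // 0' "$report")
  if [ "$score" -lt "$threshold" ]; then
    echo "crawl-sim score $score is below threshold $threshold" >&2
    return 1
  fi
  echo "crawl-sim score $score meets threshold $threshold"
}

# Demo with a stub report; in CI this would be the --json output.
printf '%s' '{"overall":{"score":92,"grade":"A"}}' > crawl-sim-report.json
gate_on_score crawl-sim-report.json 80
```

In a pipeline the nonzero return from `gate_on_score` is what fails the job.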
package/SKILL.md
CHANGED

@@ -115,6 +115,14 @@ Tell the user: "Computing per-bot scores and finalizing the report..."
 cp "$RUN_DIR/score.json" ./crawl-sim-report.json
 ```
 
+**Page-type awareness.** `compute-score.sh` derives a page type from the target URL (`root` / `detail` / `archive` / `faq` / `about` / `contact` / `generic`) and picks a schema rubric accordingly. Root pages are expected to ship `Organization` + `WebSite` — penalizing them for missing `BreadcrumbList` or `FAQPage` would be wrong, so the scorer doesn't. If the URL heuristic picks the wrong type (e.g., a homepage at `/en/` that URL-parses as generic), pass `--page-type <type>`:
+
+```bash
+"$SKILL_DIR/scripts/compute-score.sh" --page-type root "$RUN_DIR" > "$RUN_DIR/score.json"
+```
+
+Valid values: `root`, `detail`, `archive`, `faq`, `about`, `contact`, `generic`. The detected (or overridden) page type is exposed on `score.pageType`, and `score.pageTypeOverridden` flips `true` when `--page-type` was used.
+
 ## Output Layer 1 — Score Card (ASCII)
 
 Print a boxed score card to the terminal:
@@ -168,6 +176,7 @@ Then produce **prioritized findings** ranked by total point impact across bots:
 ### Interpretation rules
 
 - **Cross-bot deltas are the headline.** Compare `visibility.effectiveWords` across bots — if Googlebot has significantly more than the AI bots, that's finding #1. The raw delta is in `visibility.missedWordsVsRendered`.
+- **Trust the structuredData rubric.** Every `bots.<bot>.categories.structuredData` block now carries `pageType`, `expected`, `optional`, `forbidden`, `present`, `missing`, `extras`, `violations`, `calculation`, and `notes`. Read `missing` and `violations` directly — never guess what the scorer was penalizing for. If `notes` says the page scores 100 with no action needed, that IS the finding; don't invent fixes. If the rubric looks wrong for this specific page (e.g., a homepage detected as `generic` because the URL ends in `/en/`), rerun with `--page-type <correct-type>` instead of arguing with the score. Never recommend adding a schema that already appears in `present` or `extras`.
 - **Confidence transparency.** If a claim depends on a bot profile's `rendersJavaScript: false` at `observed` confidence (not `official`), note it: *"Based on observed behavior, not official documentation."*
 - **Framework detection.** Scan the HTML body for signals: `<meta name="next-head-count">` or `_next/static` → Next.js (Pages Router or App Router respectively), `<div id="__nuxt">` → Nuxt, `<div id="app">` with thin content → SPA (Vue/React CSR), `<!--$-->` placeholder tags → React 18 Suspense. Use these to tailor fix recommendations.
 - **No speculation beyond the data.** If server HTML has 0 `<a>` tags inside a component, say "component not present in server HTML" — not "JavaScript hydration failed" unless the diff-render data proves it.
package/package.json
CHANGED

@@ -1,10 +1,13 @@
 {
   "name": "@braedenbuilds/crawl-sim",
-  "version": "1.0
+  "version": "1.1.0",
   "description": "Agent-native multi-bot web crawler simulator. See your site through the eyes of Googlebot, GPTBot, ClaudeBot, and PerplexityBot.",
   "bin": {
     "crawl-sim": "bin/install.js"
   },
+  "scripts": {
+    "test": "./test/run-scoring-tests.sh"
+  },
   "keywords": [
     "seo",
     "crawler",
package/scripts/_lib.sh
CHANGED

@@ -41,6 +41,36 @@ count_words() {
   sed 's/<[^>]*>//g' "$1" | tr -s '[:space:]' '\n' | grep -c '[a-zA-Z0-9]' || true
 }
 
+# Detect the structural page type of a URL based on its path.
+# Returns one of: root, detail, archive, faq, about, contact, generic.
+#
+# Used by compute-score.sh to pick a schema rubric, but also exposed here
+# so other tooling (narrative layer, planned multi-URL mode) can classify
+# URLs consistently without re-implementing the heuristic.
+page_type_for_url() {
+  local url="$1"
+  local path
+  path=$(path_from_url "$url" | sed 's#[?#].*##')
+  if [ "$path" = "/" ]; then
+    echo "root"
+    return
+  fi
+  local trimmed lower
+  trimmed=$(printf '%s' "$path" | sed 's#^/##' | sed 's#/$##')
+  lower=$(printf '%s' "$trimmed" | tr '[:upper:]' '[:lower:]')
+  case "$lower" in
+    "") echo "root" ;;
+    work|journal|blog|articles|news|careers|projects|case-studies|cases)
+      echo "archive" ;;
+    work/*|articles/*|journal/*|blog/*|news/*|case-studies/*|cases/*|case/*|careers/*|projects/*)
+      echo "detail" ;;
+    *faq*) echo "faq" ;;
+    *about*|*team*|*purpose*|*who-we-are*) echo "about" ;;
+    *contact*) echo "contact" ;;
+    *) echo "generic" ;;
+  esac
+}
+
 # Fetch a URL to a local file and return the HTTP status code on stdout.
 # Usage: status=$(fetch_to_file <url> <output-file> [timeout-seconds])
 fetch_to_file() {
package/scripts/compute-score.sh
CHANGED

@@ -2,7 +2,7 @@
 set -eu
 
 # compute-score.sh — Aggregate check outputs into per-bot + per-category scores
-# Usage: compute-score.sh <results-dir>
+# Usage: compute-score.sh [--page-type <type>] <results-dir>
 # Output: JSON to stdout
 #
 # Expected filenames in <results-dir>:
@@ -14,8 +14,56 @@ set -eu
 # llmstxt.json     — check-llmstxt.sh output (bot-independent)
 # sitemap.json     — check-sitemap.sh output (bot-independent)
 # diff-render.json — diff-render.sh output (optional, Googlebot only)
+#
+# The --page-type flag overrides URL-based page-type detection. Valid values:
+# root, detail, archive, faq, about, contact, generic.
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=_lib.sh
+. "$SCRIPT_DIR/_lib.sh"
+
+PAGE_TYPE_OVERRIDE=""
+while [ $# -gt 0 ]; do
+  case "$1" in
+    --page-type)
+      [ $# -ge 2 ] || { echo "--page-type requires a value" >&2; exit 2; }
+      PAGE_TYPE_OVERRIDE="$2"
+      shift 2
+      ;;
+    --page-type=*)
+      PAGE_TYPE_OVERRIDE="${1#--page-type=}"
+      shift
+      ;;
+    -h|--help)
+      echo "Usage: compute-score.sh [--page-type <type>] <results-dir>"
+      exit 0
+      ;;
+    --)
+      shift
+      break
+      ;;
+    -*)
+      echo "Unknown flag: $1" >&2
+      exit 2
+      ;;
+    *)
+      break
+      ;;
+  esac
+done
+
+RESULTS_DIR="${1:?Usage: compute-score.sh [--page-type <type>] <results-dir>}"
+
+if [ -n "$PAGE_TYPE_OVERRIDE" ]; then
+  case "$PAGE_TYPE_OVERRIDE" in
+    root|detail|archive|faq|about|contact|generic) ;;
+    *)
+      echo "Error: invalid --page-type '$PAGE_TYPE_OVERRIDE' (valid: root, detail, archive, faq, about, contact, generic)" >&2
+      exit 2
+      ;;
+  esac
+fi
 
-RESULTS_DIR="${1:?Usage: compute-score.sh <results-dir>}"
 printf '[compute-score] aggregating %s\n' "$RESULTS_DIR" >&2
 
 if [ ! -d "$RESULTS_DIR" ]; then
@@ -31,7 +79,6 @@ W_TECHNICAL=15
 W_AI=10
 
 # Overall composite weights (per bot)
-# Default: Googlebot 40, GPTBot 20, ClaudeBot 20, PerplexityBot 20
 overall_weight() {
   case "$1" in
     googlebot) echo 40 ;;
@@ -42,7 +89,6 @@ overall_weight() {
   esac
 }
 
-# Grade from score (0-100)
 grade_for() {
   local s=$1
   if [ "$s" -ge 93 ]; then echo "A"
@@ -60,7 +106,85 @@ grade_for() {
   fi
 }
 
-#
+# Rubric: expected schema types per page type.
+rubric_expected() {
+  case "$1" in
+    root) echo "Organization WebSite" ;;
+    detail) echo "Article BreadcrumbList" ;;
+    archive) echo "CollectionPage ItemList BreadcrumbList" ;;
+    faq) echo "FAQPage BreadcrumbList" ;;
+    about) echo "AboutPage BreadcrumbList Organization" ;;
+    contact) echo "ContactPage BreadcrumbList" ;;
+    *) echo "WebPage BreadcrumbList" ;;
+  esac
+}
+
+rubric_optional() {
+  case "$1" in
+    root) echo "ProfessionalService LocalBusiness" ;;
+    detail) echo "NewsArticle ImageObject Person" ;;
+    archive) echo "" ;;
+    faq) echo "WebPage" ;;
+    about) echo "Person" ;;
+    contact) echo "PostalAddress" ;;
+    *) echo "" ;;
+  esac
+}
+
+rubric_forbidden() {
+  case "$1" in
+    root) echo "BreadcrumbList Article FAQPage" ;;
+    detail) echo "CollectionPage ItemList" ;;
+    archive) echo "Article Product" ;;
+    faq) echo "Article CollectionPage" ;;
+    about) echo "Article Product" ;;
+    contact) echo "Article Product" ;;
+    *) echo "" ;;
+  esac
+}
+
+list_contains() {
+  local needle="$1"
+  shift
+  local item
+  for item in "$@"; do
+    [ "$item" = "$needle" ] && return 0
+  done
+  return 1
+}
+
+list_count() {
+  # shellcheck disable=SC2086
+  set -- $1
+  echo "$#"
+}
+
+list_intersect() {
+  local a="$1" b="$2"
+  local out="" item
+  # shellcheck disable=SC2086
+  for item in $a; do
+    # shellcheck disable=SC2086
+    if list_contains "$item" $b; then
+      out="$out $item"
+    fi
+  done
+  printf '%s' "${out# }"
+}
+
+list_diff() {
+  local a="$1" b="$2"
+  local out="" item
+  # shellcheck disable=SC2086
+  for item in $a; do
+    # shellcheck disable=SC2086
+    if ! list_contains "$item" $b; then
+      out="$out $item"
+    fi
+  done
+  printf '%s' "${out# }"
+}
+
 jget() {
   local file="$1"
   local query="$2"
@@ -75,7 +199,6 @@ jget() {
 jget_num() {
   local v
   v=$(jget "$1" "$2" "0")
-  # Replace "null" or non-numeric with 0
   if ! printf '%s' "$v" | grep -qE '^-?[0-9]+(\.[0-9]+)?$'; then
     echo "0"
   else
@@ -90,8 +213,10 @@ jget_bool() {
 }
 
 BOTS=""
+FIRST_FETCH=""
 for f in "$RESULTS_DIR"/fetch-*.json; do
   [ -f "$f" ] || continue
+  [ -z "$FIRST_FETCH" ] && FIRST_FETCH="$f"
   bot_id=$(basename "$f" .json | sed 's/^fetch-//')
   BOTS="$BOTS $bot_id"
 done
@@ -105,14 +230,10 @@ LLMSTXT_FILE="$RESULTS_DIR/llmstxt.json"
 SITEMAP_FILE="$RESULTS_DIR/sitemap.json"
 DIFF_RENDER_FILE="$RESULTS_DIR/diff-render.json"
 
-# Load Playwright render-delta data once (used to differentiate JS-rendering
-# bots from non-rendering ones). If the comparison was skipped or missing,
-# all bots score against server HTML only.
 DIFF_AVAILABLE=false
 DIFF_RENDERED_WORDS=0
 DIFF_DELTA_PCT=0
 if [ -f "$DIFF_RENDER_FILE" ]; then
-  # Explicit null check — `.skipped // true` would treat real false as null.
   DIFF_SKIPPED=$(jq -r '.skipped | if . == null then "true" else tostring end' "$DIFF_RENDER_FILE" 2>/dev/null || echo "true")
   if [ "$DIFF_SKIPPED" = "false" ]; then
     DIFF_AVAILABLE=true
@@ -121,9 +242,22 @@ if [ -f "$DIFF_RENDER_FILE" ]; then
   fi
 fi
 
+# Resolve page type once from the first fetch file's URL, unless overridden.
+TARGET_URL=$(jget "$FIRST_FETCH" '.url' "")
+if [ -n "$PAGE_TYPE_OVERRIDE" ]; then
+  PAGE_TYPE="$PAGE_TYPE_OVERRIDE"
+else
+  PAGE_TYPE=$(page_type_for_url "$TARGET_URL")
+fi
+printf '[compute-score] page type: %s (url: %s)\n' "$PAGE_TYPE" "$TARGET_URL" >&2
+
+RUBRIC_EXPECTED="$(rubric_expected "$PAGE_TYPE")"
+RUBRIC_OPTIONAL="$(rubric_optional "$PAGE_TYPE")"
+RUBRIC_FORBIDDEN="$(rubric_forbidden "$PAGE_TYPE")"
+EXPECTED_COUNT=$(list_count "$RUBRIC_EXPECTED")
+
 BOTS_JSON="{}"
 
-# Accumulators for per-category averages (across bots)
 CAT_ACCESSIBILITY_SUM=0
 CAT_CONTENT_SUM=0
 CAT_STRUCTURED_SUM=0
@@ -131,7 +265,6 @@ CAT_TECHNICAL_SUM=0
 CAT_AI_SUM=0
 CAT_N=0
 
-# Accumulators for overall weighted composite
 OVERALL_WEIGHTED_SUM=0
 OVERALL_WEIGHT_TOTAL=0
 
@@ -146,17 +279,10 @@ for bot_id in $BOTS; do
   STATUS=$(jget_num "$FETCH" '.status')
   TOTAL_TIME=$(jget_num "$FETCH" '.timing.total')
   SERVER_WORD_COUNT=$(jget_num "$FETCH" '.wordCount')
-  # Read with explicit null fallback — jq's `//` is unsafe here because it
-  # treats boolean false as falsy, which is exactly the value we need to see.
   RENDERS_JS=$(jq -r '.bot.rendersJavaScript | if . == null then "unknown" else tostring end' "$FETCH" 2>/dev/null || echo "unknown")
 
   ROBOTS_ALLOWED=$(jget_bool "$ROBOTS" '.allowed')
 
-  # Effective word count depends on JS rendering capability:
-  # - true (e.g. Googlebot) + diff-render data → rendered DOM word count
-  # - false (AI training/search bots, observed) → server HTML only, with
-  #   penalty proportional to the rendering delta
-  # - unknown → conservative: server HTML (same as false but no penalty)
   EFFECTIVE_WORD_COUNT=$SERVER_WORD_COUNT
   HYDRATION_PENALTY=0
   MISSED_WORDS=0
@@ -164,11 +290,8 @@ for bot_id in $BOTS; do
   if [ "$RENDERS_JS" = "true" ]; then
     EFFECTIVE_WORD_COUNT=$DIFF_RENDERED_WORDS
   elif [ "$RENDERS_JS" = "false" ]; then
-    # Absolute-value delta: if rendered DOM has materially more than server,
-    # AI bots are missing that content.
    ABS_DELTA=$(awk -v d="$DIFF_DELTA_PCT" 'BEGIN { printf "%d", (d < 0 ? -d : d) + 0.5 }')
     if [ "$ABS_DELTA" -gt 5 ]; then
-      # Scale penalty: 5% delta = 0, 10% = 5, 20%+ = 15 (cap)
       HYDRATION_PENALTY=$(awk -v d="$ABS_DELTA" 'BEGIN {
         p = (d - 5)
         if (p > 15) p = 15
@@ -182,11 +305,8 @@ for bot_id in $BOTS; do
 
   # --- Category 1: Accessibility (0-100) ---
   ACC=0
-  # robots.txt allows: 40
   [ "$ROBOTS_ALLOWED" = "true" ] && ACC=$((ACC + 40))
-  # HTTP 200: 40
   [ "$STATUS" = "200" ] && ACC=$((ACC + 40))
-  # Response time: <2s = 20, <5s = 10, else 0
   TIME_SCORE=$(awk -v t="$TOTAL_TIME" 'BEGIN { if (t < 2) print 20; else if (t < 5) print 10; else print 0 }')
   ACC=$((ACC + TIME_SCORE))
 
@@ -216,34 +336,106 @@ for bot_id in $BOTS; do
     CONTENT=$((CONTENT + ALT_SCORE))
   fi
 
-  # Apply hydration penalty for non-rendering bots that are missing content
   CONTENT=$((CONTENT - HYDRATION_PENALTY))
   [ $CONTENT -lt 0 ] && CONTENT=0
 
   # --- Category 3: Structured Data (0-100) ---
-  STRUCTURED=0
   JSONLD_COUNT=$(jget_num "$JSONLD" '.blockCount')
   JSONLD_VALID=$(jget_num "$JSONLD" '.validCount')
   JSONLD_INVALID=$(jget_num "$JSONLD" '.invalidCount')
-
-  [ "$JSONLD_COUNT" -ge 1 ] && STRUCTURED=$((STRUCTURED + 30))
-  if [ "$JSONLD_COUNT" -ge 1 ] && [ "$JSONLD_INVALID" -eq 0 ]; then
-    STRUCTURED=$((STRUCTURED + 20))
+
+  if [ -f "$JSONLD" ]; then
+    PRESENT_TYPES=$(jq -r '.types[]? // empty' "$JSONLD" 2>/dev/null | awk 'NF && !seen[$0]++' | tr '\n' ' ')
+    PRESENT_TYPES=${PRESENT_TYPES% }
+  else
+    PRESENT_TYPES=""
   fi
-
+
+  PRESENT_EXPECTED=$(list_intersect "$RUBRIC_EXPECTED" "$PRESENT_TYPES")
+  PRESENT_OPTIONAL=$(list_intersect "$RUBRIC_OPTIONAL" "$PRESENT_TYPES")
+  PRESENT_FORBIDDEN=$(list_intersect "$RUBRIC_FORBIDDEN" "$PRESENT_TYPES")
+  MISSING_EXPECTED=$(list_diff "$RUBRIC_EXPECTED" "$PRESENT_TYPES")
+  RUBRIC_KNOWN="$RUBRIC_EXPECTED $RUBRIC_OPTIONAL $RUBRIC_FORBIDDEN"
+  EXTRAS=$(list_diff "$PRESENT_TYPES" "$RUBRIC_KNOWN")
+
+  PRESENT_EXPECTED_COUNT=$(list_count "$PRESENT_EXPECTED")
+  PRESENT_OPTIONAL_COUNT=$(list_count "$PRESENT_OPTIONAL")
+  PRESENT_FORBIDDEN_COUNT=$(list_count "$PRESENT_FORBIDDEN")
+
+  BASE=$(awk -v h="$PRESENT_EXPECTED_COUNT" -v t="$EXPECTED_COUNT" \
+    'BEGIN { if (t == 0) print 0; else printf "%d", (h / t) * 100 + 0.5 }')
+
+  BONUS=$((PRESENT_OPTIONAL_COUNT * 10))
+  [ $BONUS -gt 20 ] && BONUS=20
+
+  FORBID_PENALTY=$((PRESENT_FORBIDDEN_COUNT * 10))
+
+  VALID_PENALTY=0
+  if [ "$JSONLD_COUNT" -gt 0 ] && [ "$JSONLD_INVALID" -gt 0 ]; then
+    VALID_PENALTY=$((JSONLD_INVALID * 5))
+    [ $VALID_PENALTY -gt 20 ] && VALID_PENALTY=20
   fi
-
+
+  STRUCTURED=$((BASE + BONUS - FORBID_PENALTY - VALID_PENALTY))
+  [ $STRUCTURED -gt 100 ] && STRUCTURED=100
+  [ $STRUCTURED -lt 0 ] && STRUCTURED=0
+
+  CALCULATION=$(printf 'base: %d/%d expected present = %d; +%d optional bonus; -%d forbidden penalty; -%d validity penalty; clamp [0,100] = %d' \
+    "$PRESENT_EXPECTED_COUNT" "$EXPECTED_COUNT" "$BASE" \
+    "$BONUS" "$FORBID_PENALTY" "$VALID_PENALTY" "$STRUCTURED")
+
+  if [ "$STRUCTURED" -ge 100 ] && [ -z "$PRESENT_FORBIDDEN" ] && [ "$VALID_PENALTY" -eq 0 ]; then
+    NOTES="All expected schemas for pageType=$PAGE_TYPE are present. No structured-data action needed."
+  elif [ -n "$MISSING_EXPECTED" ] && [ -z "$PRESENT_FORBIDDEN" ]; then
+    NOTES="Missing expected schemas for pageType=$PAGE_TYPE: $MISSING_EXPECTED. Add these to raise the score."
+  elif [ -n "$PRESENT_FORBIDDEN" ] && [ -z "$MISSING_EXPECTED" ]; then
+    NOTES="Forbidden schemas present for pageType=$PAGE_TYPE: $PRESENT_FORBIDDEN. Remove these (or re-classify the page type with --page-type)."
+  elif [ -n "$PRESENT_FORBIDDEN" ] && [ -n "$MISSING_EXPECTED" ]; then
+    NOTES="Mixed: missing $MISSING_EXPECTED and forbidden present $PRESENT_FORBIDDEN for pageType=$PAGE_TYPE."
+  else
+    NOTES="Score reduced by $VALID_PENALTY pts due to invalid JSON-LD blocks."
+  fi
+
+  STRUCTURED_GRADE=$(grade_for "$STRUCTURED")
+  STRUCTURED_OBJ=$(jq -n \
+    --argjson score "$STRUCTURED" \
+    --arg grade "$STRUCTURED_GRADE" \
+    --arg pageType "$PAGE_TYPE" \
+    --arg expectedList "$RUBRIC_EXPECTED" \
+    --arg optionalList "$RUBRIC_OPTIONAL" \
+    --arg forbiddenList "$RUBRIC_FORBIDDEN" \
+    --arg presentList "$PRESENT_TYPES" \
+    --arg missingList "$MISSING_EXPECTED" \
+    --arg extrasList "$EXTRAS" \
+    --arg forbiddenPresent "$PRESENT_FORBIDDEN" \
+    --argjson invalidCount "$JSONLD_INVALID" \
+    --argjson validPenalty "$VALID_PENALTY" \
+    --arg calculation "$CALCULATION" \
+    --arg notes "$NOTES" \
+    '
+    def to_arr: split(" ") | map(select(length > 0));
+    {
+      score: $score,
+      grade: $grade,
+      pageType: $pageType,
+      expected: ($expectedList | to_arr),
+      optional: ($optionalList | to_arr),
+      forbidden: ($forbiddenList | to_arr),
+      present: ($presentList | to_arr),
+      missing: ($missingList | to_arr),
+      extras: ($extrasList | to_arr),
+      violations: (
+        ($forbiddenPresent | to_arr | map({kind: "forbidden_schema", schema: ., impact: -10}))
+        + (if $validPenalty > 0
+           then [{kind: "invalid_jsonld", count: $invalidCount, impact: (0 - $validPenalty)}]
+           else []
+           end)
+      ),
+      calculation: $calculation,
+      notes: $notes
+    }
+    ')
+
   # --- Category 4: Technical Signals (0-100) ---
   TECHNICAL=0
   TITLE=$(jget "$META" '.title' "")
@@ -279,21 +471,16 @@ for bot_id in $BOTS; do
     [ "$LLMS_HAS_DESC" = "true" ] && AI=$((AI + 7))
     [ "$LLMS_URLS" -ge 1 ] && AI=$((AI + 6))
   fi
-  # Content citable (>= 200 words, effective for this bot)
   [ "$EFFECTIVE_WORD_COUNT" -ge 200 ] && AI=$((AI + 20))
-  # Semantic clarity: has H1 + description
   if [ "$H1_COUNT" -ge 1 ] && [ -n "$DESCRIPTION" ] && [ "$DESCRIPTION" != "null" ]; then
     AI=$((AI + 20))
   fi
 
-  # Cap categories at 100
   [ $ACC -gt 100 ] && ACC=100
   [ $CONTENT -gt 100 ] && CONTENT=100
-  [ $STRUCTURED -gt 100 ] && STRUCTURED=100
   [ $TECHNICAL -gt 100 ] && TECHNICAL=100
   [ $AI -gt 100 ] && AI=100
 
-  # Per-bot composite score (weighted average of 5 categories)
   BOT_SCORE=$(awk -v a=$ACC -v c=$CONTENT -v s=$STRUCTURED -v t=$TECHNICAL -v ai=$AI \
     -v wa=$W_ACCESSIBILITY -v wc=$W_CONTENT -v ws=$W_STRUCTURED -v wt=$W_TECHNICAL -v wai=$W_AI \
     'BEGIN { printf "%d", (a*wa + c*wc + s*ws + t*wt + ai*wai) / (wa+wc+ws+wt+wai) + 0.5 }')
@@ -301,7 +488,6 @@ for bot_id in $BOTS; do
   BOT_GRADE=$(grade_for "$BOT_SCORE")
   ACC_GRADE=$(grade_for "$ACC")
   CONTENT_GRADE=$(grade_for "$CONTENT")
-  STRUCTURED_GRADE=$(grade_for "$STRUCTURED")
   TECHNICAL_GRADE=$(grade_for "$TECHNICAL")
   AI_GRADE=$(grade_for "$AI")
 
@@ -315,8 +501,7 @@ for bot_id in $BOTS; do
     --arg accGrade "$ACC_GRADE" \
     --argjson content "$CONTENT" \
     --arg contentGrade "$CONTENT_GRADE" \
-    --argjson structured "$
-    --arg structuredGrade "$STRUCTURED_GRADE" \
+    --argjson structured "$STRUCTURED_OBJ" \
     --argjson technical "$TECHNICAL" \
     --arg technicalGrade "$TECHNICAL_GRADE" \
     --argjson ai "$AI" \
@@ -338,17 +523,16 @@ for bot_id in $BOTS; do
       hydrationPenaltyPts: $hydrationPenalty
     },
     categories: {
-      accessibility: { score: $acc,
-      contentVisibility: { score: $content,
-      structuredData:
-      technicalSignals: { score: $technical,
-      aiReadiness: { score: $ai,
+      accessibility: { score: $acc, grade: $accGrade },
+      contentVisibility: { score: $content, grade: $contentGrade },
+      structuredData: $structured,
+      technicalSignals: { score: $technical, grade: $technicalGrade },
+      aiReadiness: { score: $ai, grade: $aiGrade }
     }
   }')
 
   BOTS_JSON=$(printf '%s' "$BOTS_JSON" | jq --argjson bot "$BOT_OBJ" --arg id "$bot_id" '.[$id] = $bot')
 
-  # Accumulate category averages
   CAT_ACCESSIBILITY_SUM=$((CAT_ACCESSIBILITY_SUM + ACC))
   CAT_CONTENT_SUM=$((CAT_CONTENT_SUM + CONTENT))
   CAT_STRUCTURED_SUM=$((CAT_STRUCTURED_SUM + STRUCTURED))
@@ -356,7 +540,6 @@ for bot_id in $BOTS; do
   CAT_AI_SUM=$((CAT_AI_SUM + AI))
   CAT_N=$((CAT_N + 1))
 
-  # Accumulate weighted overall
   W=$(overall_weight "$bot_id")
   if [ "$W" -gt 0 ]; then
     OVERALL_WEIGHTED_SUM=$((OVERALL_WEIGHTED_SUM + BOT_SCORE * W))
@@ -364,18 +547,15 @@ for bot_id in $BOTS; do
   fi
 done
 
-# Per-category averages (across all bots)
 CAT_ACC_AVG=$((CAT_ACCESSIBILITY_SUM / CAT_N))
 CAT_CONTENT_AVG=$((CAT_CONTENT_SUM / CAT_N))
 CAT_STRUCTURED_AVG=$((CAT_STRUCTURED_SUM / CAT_N))
 CAT_TECHNICAL_AVG=$((CAT_TECHNICAL_SUM / CAT_N))
 CAT_AI_AVG=$((CAT_AI_SUM / CAT_N))
 
-# Overall composite
 if [ "$OVERALL_WEIGHT_TOTAL" -gt 0 ]; then
   OVERALL_SCORE=$((OVERALL_WEIGHTED_SUM / OVERALL_WEIGHT_TOTAL))
 else
-  # Fall back to simple average if none of the 4 standard bots are present
   OVERALL_SCORE=$(((CAT_ACC_AVG + CAT_CONTENT_AVG + CAT_STRUCTURED_AVG + CAT_TECHNICAL_AVG + CAT_AI_AVG) / 5))
 fi
 
@@ -386,15 +566,14 @@ CAT_STRUCTURED_GRADE=$(grade_for "$CAT_STRUCTURED_AVG")
 CAT_TECHNICAL_GRADE=$(grade_for "$CAT_TECHNICAL_AVG")
 CAT_AI_GRADE=$(grade_for "$CAT_AI_AVG")
 
-# Get the URL from the first fetch file
-FIRST_FETCH=$(ls "$RESULTS_DIR"/fetch-*.json | head -1)
-TARGET_URL=$(jget "$FIRST_FETCH" '.url' "")
 TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
 
 jq -n \
   --arg url "$TARGET_URL" \
   --arg timestamp "$TIMESTAMP" \
-  --arg version "0.
+  --arg version "0.2.0" \
+  --arg pageType "$PAGE_TYPE" \
+  --arg pageTypeOverride "$PAGE_TYPE_OVERRIDE" \
  --argjson overallScore "$OVERALL_SCORE" \
   --arg overallGrade "$OVERALL_GRADE" \
   --argjson bots "$BOTS_JSON" \
@@ -412,6 +591,8 @@ jq -n \
     url: $url,
     timestamp: $timestamp,
     version: $version,
+    pageType: $pageType,
+    pageTypeOverridden: ($pageTypeOverride | length > 0),
    overall: { score: $overallScore, grade: $overallGrade },
     bots: $bots,
     categories: {