seo-intel 1.5.45 → 1.5.50
- package/CHANGELOG.md +60 -0
- package/analyses/aeo/ai-access.js +210 -0
- package/analyses/aeo/index.js +52 -9
- package/analyses/aeo/scorer.js +36 -13
- package/cli.js +175 -18
- package/lib/license.js +26 -15
- package/lib/updater.js +17 -6
- package/mcp/server.js +250 -6
- package/package.json +1 -1
- package/seo-intel.png +0 -0
- package/server.js +47 -2
- package/setup/engine.js +3 -0
- package/setup/models.js +90 -2
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,65 @@
 # Changelog
 
+## 1.5.50 (2026-06-11)
+
+### New MCP tool: `setup_project` — zero → configured → audited, entirely from chat
+The last setup gap in chat-native coverage: projects could previously only be created via the CLI or web wizard. An AI agent can now take a user from nothing to a configured, crawled, audited project without leaving the conversation.
+
+- **`setup_project(project_name, target_url, …)`** writes the same project config the wizard produces — target domain, competitors, owned domains, analysis context (industry / audience / goal), crawl budget, and extraction model. Pairs with `suggest_models` for picking the local model first.
+- Refuses to overwrite an existing project unless `overwrite=true`.
+- MCP server now exposes **21 tools**; the full lifecycle (set up → crawl → extract → audit → problems → fix → draft → re-audit) is reachable from any MCP host.
+- Model catalog: cloud analysis entry refreshed to **Claude Opus 4.8**.
+
+## 1.5.49 (2026-06-08)
+
+### New skill: `seo-autofix` — autonomous audit → fix → verify loop
+SEO Intel already reports each problem with a concrete fix **and** a verification recipe. This skill turns that into a closed loop an AI code agent runs against a repo it has checked out — with the human in exactly one place: merging and deploying the branch.
+
+- **The loop:** `run_crawl` → `list_problems` → for each problem, map the affected URL to its source file, apply the `fix_template`, **verify against a local preview before deploying** (`crawl_site` against `localhost`), keep it only if the problem signal clears, then collect verified fixes on one branch and `mark_problem_status(fixed)`.
+- **Autonomy gate:** only `fix_difficulty ≤ 2` (deterministic structural fixes — missing meta/title, missing JSON-LD, orphan links, noindex conflicts) are applied autonomously. Judgment-heavy problems (positioning, content rewrites) are summarized for the human, never auto-applied.
+- **Hard rules:** verify every fix against a real crawl (a `fix_template` is guidance, the crawl is proof — unverified edits get reverted); one branch, no push to `main`, no deploy, no publishing. The blast radius is a branch the human reviews.
+- Lives in `skills/seo-autofix/` — distributed via the repo / skill directories.
+
+### Fixed
+- **CLI starts in ~100ms instead of loading the browser engine up front.** `cli.js` statically imported the crawl engine (Playwright + the HTML→markdown chain) at startup, so every command — even `seo-intel --version` — paid that import, which could stall for minutes on a cold module cache. The crawler now loads on first use (`crawl` / `run` / `scan`); all other commands skip it entirely. Measured: `--version` went from 143s (worst case observed) to ~110ms.
+
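As a rough illustration of the autonomy gate described in the 1.5.49 entry, the sketch below splits a problem list into fixes an agent may apply on its own and items left for a human. Only the `fix_difficulty` field and the threshold of 2 come from the changelog. The `partitionProblems` helper, the `id` field, and the sample problems are hypothetical and not part of seo-intel.

```javascript
// Hypothetical helper, not part of seo-intel. Only `fix_difficulty` and the
// "<= 2" threshold are taken from the changelog text.
const AUTONOMY_MAX_DIFFICULTY = 2;

function partitionProblems(problems, maxDifficulty = AUTONOMY_MAX_DIFFICULTY) {
  const auto = [];
  const human = [];
  for (const p of problems) {
    // A missing or non-numeric difficulty is treated as judgment-heavy.
    const ok = Number.isFinite(p.fix_difficulty) && p.fix_difficulty <= maxDifficulty;
    (ok ? auto : human).push(p);
  }
  return { auto, human };
}

// Sample data (invented for illustration).
const sample = [
  { id: 'missing-title', fix_difficulty: 1 },
  { id: 'orphan-link', fix_difficulty: 2 },
  { id: 'weak-positioning', fix_difficulty: 4 },
];

const { auto, human } = partitionProblems(sample);
console.log(auto.map((p) => p.id));  // [ 'missing-title', 'orphan-link' ]
console.log(human.map((p) => p.id)); // [ 'weak-positioning' ]
```

Defaulting unknown difficulty to the human bucket is a design choice of this sketch: it keeps the agent conservative when the problem record is incomplete.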
+## 1.5.48 (2026-06-07)
+
+### Local-model suggester — `seo-intel models` + `suggest_models` (MCP)
+Extraction runs a small AI model once per crawled page. This makes it easy to pick the right **local** one — and is emphatic that local is the way to do it.
+
+- **`seo-intel models`** — detects your GPU/VRAM and which models are already in Ollama, then recommends from a curated local set: **Gemma 4 E2B / E4B / 12B** and **Qwen 3.5 4B / 9B** (smallest → largest, with VRAM/speed/quality and the `ollama pull` command for each). `--format json` for machine output.
+- **`suggest_models` (MCP)** — the same recommendation from any chat/agent, so an assistant can suggest a model for the user's hardware.
+- **Both always carry a disclaimer: extraction should be done with a LOCAL model.** Cloud is a fallback, not the default — it sends every page's content off-machine, costs money at scale, and rate-limits, all for a task a 4–8B local model handles well, offline, with data never leaving the machine.
+- Added **Gemma 4 12B** to the extraction model catalog (a quality step up from E4B that still fits ~10 GB cards).
+
+### Fixed
+- **MCP server boot is no longer blocked by the crawler dependency chain.** The `crawl_site` tool's crawler (which pulls in `turndown`) is now loaded on first use instead of at startup. Previously, if importing `turndown` was slow on a given machine, the entire MCP server could fail to start — no tools, no banner, no handshake. Boot now completes in well under a second regardless of crawler import speed.
+- **Commands no longer hang when the license or update servers are slow or unreachable.** Two startup network paths had no effective cap: the license phone-home was awaited without a hard timeout (blocking commands like `status` for up to ~10s), and background update-check fetches kept the process from exiting until the OS connect timeout. The license check is now capped at 2.5s and degrades to cached/offline behavior, and one-shot commands exit as soon as their work is done instead of waiting on lingering background requests. Activation still works normally when the server is reachable.
+
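The capped license check described in the 1.5.48 entry can be pictured as a race between the network call and a timer. This is a minimal sketch under stated assumptions, not the package's code: `withTimeout`, the fallback object, and the simulated slow check are hypothetical names chosen for illustration, and the delays are shortened so the example runs quickly.

```javascript
// Hypothetical sketch of a hard timeout with a fallback value.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Whichever settles first wins. The timer is always cleared so that it
  // cannot keep the process alive once the result is known.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Simulated slow license server (assumption: stands in for a real request).
function slowLicenseCheck(delayMs) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ source: 'server' }), delayMs);
  });
}

// The 50 ms cap fires before the 200 ms "server" answers, so the cached
// value is used instead.
withTimeout(slowLicenseCheck(200), 50, { source: 'cache' })
  .then((result) => console.log(result.source)); // logs "cache"
```

Note that the race only stops the caller from waiting. The underlying request in this sketch still runs to completion, so a real implementation would presumably also abort it or otherwise let the process exit, as the changelog describes for one-shot commands.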
+## 1.5.47 (2026-06-07)
+
+### AI-crawler access is now part of AI citability — and the audit runs from your agent
+A page can be perfectly structured and still be impossible for an AI assistant to cite — because `robots.txt` blocks the crawler. AEO now checks for exactly that.
+
+- **New citability signal: AI-crawler access.** `seo-intel aeo <project>` fetches each target domain's `robots.txt` and detects whether answer-engine crawlers (ClaudeBot, GPTBot, OAI-SearchBot, PerplexityBot, Google-Extended, Amazonbot, DuckAssistBot, and training crawlers like CCBot / Applebot-Extended) are allowed, plus the Cloudflare `Content-Signal: ai-train=no` directive. When the assistants developers actually use are locked out, the affected pages are **capped at 30/100** — on-page quality can't help a page the AI can't read. A new "AI Crawler Access" section appears in the audit, and a high-priority `citability_gap` is written to the Intelligence Ledger per blocked domain.
+- The check is the only network call AEO makes (one `robots.txt` per target domain), it's best-effort, and a missing/unreachable `robots.txt` is treated as open. The pure scorer stays offline — robots verdicts are fetched separately and passed in.
+- **New MCP tool `tech_audit`** — run the technical SEO audit (titles, meta, noindex/robots conflicts, redirects, canonicals, sitemap diff) straight from any MCP host, no shelling out to the CLI.
+- **New MCP tool `scan_site`** (Solo) — one-shot full audit of any domain (crawl → extract → analyze → export) as a detached background job, mirroring `seo-intel scan`.
+- **`run_citability_audit` (MCP)** now performs the same AI-crawler-access check and returns an `ai_access` verdict per domain.
+- **Fix: `aeo` and `gap-intel` now emit clean JSON in `--format json`.** Both commands previously regenerated the dashboard after printing the JSON, and the dashboard step's progress logs (e.g. "Topic clusters loaded…") leaked onto stdout — breaking `JSON.parse` for agents and scripts. Dashboard regeneration is now skipped in JSON mode, so stdout contains only the JSON object.
+
+## 1.5.46 (2026-05-29)
+
+### Security — the local dashboard now accepts requests from localhost only
+Hardened `seo-intel serve` against a class of browser-based attack that affects local web servers in general (cross-site request forgery and DNS rebinding). While the dashboard was running, a web page open in the same browser could send requests to `localhost` — and the command-stream endpoint additionally sent a wildcard `Access-Control-Allow-Origin`, which would have let such a page read its output.
+
+- **Loopback-only gate:** every request is now checked at the door — the `Host` must be a loopback name (defeats DNS rebinding) and any `Origin` must be loopback too (blocks cross-origin / CSRF). Non-local requests get `403`.
+- **Removed the wildcard `Access-Control-Allow-Origin: *`** from the terminal SSE stream — the dashboard is same-origin and never needed CORS.
+- **Standard headers added:** `X-Frame-Options: DENY`, `Content-Security-Policy: frame-ancestors 'none'` (anti-clickjacking), `X-Content-Type-Options: nosniff`.
+
+The server already bound `127.0.0.1` only; this adds the missing in-app checks. Same-origin dashboard use is unchanged. **Recommended update for anyone who runs `seo-intel serve`.**
+
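The loopback-only gate in the 1.5.46 entry amounts to two header checks. The sketch below is one way such a gate might look. It is not taken from `server.js` (whose diff is not shown here): `isLoopbackRequest`, `hostnameOf`, and the set of accepted names are assumptions made for illustration.

```javascript
// Hypothetical sketch of a loopback-only request gate, not the package's code.
const LOOPBACK_HOSTS = new Set(['localhost', '127.0.0.1', '[::1]']);

// Strip an optional ":port" suffix from a Host header value.
function hostnameOf(hostHeader) {
  if (!hostHeader) return '';
  const h = hostHeader.trim().toLowerCase();
  if (h.startsWith('[')) {
    // Bracketed IPv6 literal, e.g. "[::1]:3000"
    const end = h.indexOf(']');
    return end === -1 ? '' : h.slice(0, end + 1);
  }
  return h.split(':')[0];
}

function isLoopbackRequest(headers) {
  // Host must name a loopback address (defeats DNS rebinding).
  if (!LOOPBACK_HOSTS.has(hostnameOf(headers.host))) return false;
  // Origin is optional, but when present it must be loopback too (blocks CSRF).
  if (headers.origin) {
    try {
      const originHost = new URL(headers.origin).hostname.toLowerCase();
      if (!LOOPBACK_HOSTS.has(originHost)) return false;
    } catch {
      return false; // malformed Origin, including the literal "null"
    }
  }
  return true;
}

console.log(isLoopbackRequest({ host: 'localhost:3000' }));                                 // true
console.log(isLoopbackRequest({ host: 'evil.example:3000' }));                              // false
console.log(isLoopbackRequest({ host: 'localhost:3000', origin: 'https://evil.example' })); // false
```

A server using a check like this would answer `403` whenever it returns `false`, matching the behavior the changelog describes for non-local requests.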
 ## 1.5.45 (2026-05-29)
 
 ### The content loop in one command — `seo-intel loop` + `run_content_loop` (MCP)
package/analyses/aeo/ai-access.js
ADDED
@@ -0,0 +1,210 @@
+/**
+ * AI Crawler Access — robots.txt analysis for AEO citability.
+ *
+ * The single biggest AEO failure mode is invisible to on-page scoring: a page
+ * can be perfectly structured and still be uncitable because robots.txt blocks
+ * the AI assistants' crawlers. This module parses robots.txt and reports which
+ * answer-engine / AI crawlers are allowed to read the site.
+ *
+ * `analyzeAiAccess` is a pure function (robots text → verdict). `fetchAiAccess`
+ * is the only thing that touches the network — callers that want to keep the
+ * AEO core network-free can fetch robots.txt themselves and pass the text in.
+ */
+
+// Known AI / answer-engine crawlers. `tier: citation` = used for live grounding
+// and citation in assistant answers (blocking these directly kills citability);
+// `tier: training` = primarily corpus/training crawlers (blocking hurts long-term
+// model familiarity but not live citation as hard).
+export const AI_BOTS = [
+  { ua: 'ClaudeBot', vendor: 'Anthropic — Claude', tier: 'citation' },
+  { ua: 'Claude-Web', vendor: 'Anthropic — Claude', tier: 'citation' },
+  { ua: 'anthropic-ai', vendor: 'Anthropic — Claude', tier: 'citation' },
+  { ua: 'GPTBot', vendor: 'OpenAI — ChatGPT', tier: 'citation' },
+  { ua: 'OAI-SearchBot', vendor: 'OpenAI — ChatGPT Search', tier: 'citation' },
+  { ua: 'ChatGPT-User', vendor: 'OpenAI — ChatGPT (browse)', tier: 'citation' },
+  { ua: 'PerplexityBot', vendor: 'Perplexity', tier: 'citation' },
+  { ua: 'Perplexity-User', vendor: 'Perplexity (browse)', tier: 'citation' },
+  { ua: 'Google-Extended', vendor: 'Google — Gemini / AI Overviews', tier: 'citation' },
+  { ua: 'Amazonbot', vendor: 'Amazon — Alexa / Rufus', tier: 'citation' },
+  { ua: 'DuckAssistBot', vendor: 'DuckDuckGo — DuckAssist', tier: 'citation' },
+  { ua: 'Applebot-Extended', vendor: 'Apple Intelligence', tier: 'training' },
+  { ua: 'CCBot', vendor: 'Common Crawl (feeds many LLMs)', tier: 'training' },
+  { ua: 'Bytespider', vendor: 'ByteDance — Doubao', tier: 'training' },
+  { ua: 'Meta-ExternalAgent', vendor: 'Meta AI', tier: 'training' },
+  { ua: 'cohere-ai', vendor: 'Cohere', tier: 'training' },
+];
+
+// ── robots.txt parsing ──────────────────────────────────────────────────────
+
+/**
+ * Parse robots.txt into user-agent groups. A group is one-or-more consecutive
+ * `User-agent` lines followed by the rules that apply to them (per RFC 9309: a
+ * new User-agent after a rule line starts a fresh group).
+ */
+function parseRobots(txt) {
+  const groups = [];
+  let current = null;
+  let lastWasRule = false;
+
+  for (const raw of txt.split(/\r?\n/)) {
+    const line = raw.replace(/#.*$/, '').trim();
+    if (!line) continue;
+    const idx = line.indexOf(':');
+    if (idx === -1) continue;
+    const field = line.slice(0, idx).trim().toLowerCase();
+    const value = line.slice(idx + 1).trim();
+
+    if (field === 'user-agent') {
+      if (!current || lastWasRule) {
+        current = { agents: [], rules: [] };
+        groups.push(current);
+        lastWasRule = false;
+      }
+      current.agents.push(value.toLowerCase());
+    } else if (field === 'allow' || field === 'disallow') {
+      if (!current) { current = { agents: ['*'], rules: [] }; groups.push(current); }
+      current.rules.push({ type: field, path: value });
+      lastWasRule = true;
+    }
+  }
+  return groups;
+}
+
+/** Pick the group that governs `ua`: exact match wins, else the `*` group. */
+function groupFor(groups, ua) {
+  const lc = ua.toLowerCase();
+  return groups.find(g => g.agents.includes(lc))
+    || groups.find(g => g.agents.includes('*'))
+    || null;
+}
+
+/** Does this group block the site root (`/`)? `Disallow: /` blocks all; an
+ *  explicit `Allow: /` overrides; empty `Disallow:` means allow-all. */
+function blocksRoot(group) {
+  if (!group) return false;
+  let blocked = false;
+  for (const r of group.rules) {
+    if (r.type === 'disallow' && (r.path === '/' || r.path === '/*')) blocked = true;
+    else if (r.type === 'allow' && r.path === '/') blocked = false;
+  }
+  return blocked;
+}
+
+// ── Verdict ─────────────────────────────────────────────────────────────────
+
+/**
+ * Analyze robots.txt for AI-crawler access. Pure function.
+ *
+ * @param {string} robotsTxt - raw robots.txt body ('' if none/unavailable)
+ * @param {object} [opts] - { fetched: boolean } — false when robots couldn't be read
+ * @returns {object} {
+ *   score, blocked, verdict, blockedBots[], allowedCount, citationBlocked[],
+ *   aiTrainSignal, fetched, detail
+ * }
+ */
+export function analyzeAiAccess(robotsTxt, opts = {}) {
+  const fetched = opts.fetched ?? true;
+
+  // No robots.txt (or unreadable) → crawlers default-allow. Open, but flagged.
+  if (!fetched || !robotsTxt || !robotsTxt.trim()) {
+    return {
+      score: 100, blocked: false, verdict: 'open',
+      blockedBots: [], citationBlocked: [], allowedCount: AI_BOTS.length,
+      aiTrainSignal: null, fetched,
+      detail: fetched
+        ? 'No robots.txt rules — all AI crawlers default-allowed.'
+        : 'robots.txt unavailable — assuming open (crawlers default-allow when absent).',
+    };
+  }
+
+  const groups = parseRobots(robotsTxt);
+  const aiTrainSignal = /content-signal\s*:[^\n]*\bai-train\s*=\s*no\b/i.test(robotsTxt)
+    ? 'ai-train=no' : null;
+
+  const blockedBots = [];
+  for (const bot of AI_BOTS) {
+    if (blocksRoot(groupFor(groups, bot.ua))) blockedBots.push(bot);
+  }
+  const citationBlocked = blockedBots.filter(b => b.tier === 'citation');
+
+  let penalty = 0;
+  for (const b of blockedBots) penalty += b.tier === 'citation' ? 18 : 7;
+  if (aiTrainSignal) penalty += 8;
+  const score = Math.max(0, 100 - penalty);
+
+  // `blocked` = the hard-reality gate: any live citation crawler is locked out.
+  const blocked = citationBlocked.length > 0;
+  let verdict;
+  if (citationBlocked.length >= 3 || score < 40) verdict = 'blocked';
+  else if (blockedBots.length > 0 || aiTrainSignal) verdict = 'partial';
+  else verdict = 'open';
+
+  const names = citationBlocked.map(b => b.ua);
+  const detail = verdict === 'blocked'
+    ? `robots.txt blocks ${citationBlocked.length} answer-engine crawler(s) (${names.slice(0, 6).join(', ')}${names.length > 6 ? '…' : ''}) — these pages cannot be cited by the assistants developers actually use.`
+    : verdict === 'partial'
+      ? `Some AI crawlers blocked${aiTrainSignal ? ' and Content-Signal ai-train=no set' : ''}; live citation still possible but reduced.`
+      : 'All major AI crawlers allowed.';
+
+  return {
+    score, blocked, verdict,
+    blockedBots: blockedBots.map(b => ({ ua: b.ua, vendor: b.vendor, tier: b.tier })),
+    citationBlocked: citationBlocked.map(b => b.ua),
+    allowedCount: AI_BOTS.length - blockedBots.length,
+    aiTrainSignal, fetched, detail,
+  };
+}
+
+// ── Network fetch (the only I/O in this module) ─────────────────────────────
+
+/**
+ * Fetch + analyze robots.txt for a site. Best-effort: any failure degrades to
+ * an "assume open" verdict rather than throwing.
+ *
+ * @param {string} siteUrl - origin or any URL on the site
+ * @param {object} [opts] - { timeoutMs }
+ */
+export async function fetchAiAccess(siteUrl, opts = {}) {
+  const timeoutMs = opts.timeoutMs ?? 5000;
+  let origin;
+  try {
+    origin = new URL(/^https?:\/\//.test(siteUrl) ? siteUrl : `https://${siteUrl}`).origin;
+  } catch {
+    return { ...analyzeAiAccess('', { fetched: false }), origin: null };
+  }
+
+  try {
+    const controller = new AbortController();
+    const timer = setTimeout(() => controller.abort(), timeoutMs);
+    const res = await fetch(`${origin}/robots.txt`, {
+      signal: controller.signal,
+      redirect: 'follow',
+      headers: { 'user-agent': 'seo-intel-aeo (+https://ukkometa.fi/seo-intel)' },
+    });
+    clearTimeout(timer);
+    if (!res.ok) {
+      return { ...analyzeAiAccess('', { fetched: true }), origin, httpStatus: res.status };
+    }
+    const txt = await res.text();
+    return { ...analyzeAiAccess(txt), origin, httpStatus: res.status };
+  } catch (e) {
+    return { ...analyzeAiAccess('', { fetched: false }), origin, error: e.message };
+  }
+}
+
+/**
+ * Fetch AI access for many domains in parallel. Returns Map<domain, verdict>.
+ * Domains can be bare ("docs.carbium.sh") or full URLs.
+ */
+export async function fetchAiAccessForDomains(domains, opts = {}) {
+  const map = new Map();
+  await Promise.all([...new Set(domains)].map(async (d) => {
+    const verdict = await fetchAiAccess(d, opts);
+    // key by bare host so it matches the `domains.domain` column
+    let host = d;
+    try { host = new URL(/^https?:\/\//.test(d) ? d : `https://${d}`).hostname; } catch { /* keep */ }
+    map.set(host, verdict);
+    map.set(host.replace(/^www\./, ''), verdict);
+  }));
+  return map;
+}
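To make the penalty arithmetic in `analyzeAiAccess` concrete, here is a small worked example. It is a simplified mirror rather than the module itself: `scoreAccess` and its input shape (a plain list of blocked tiers) are hypothetical, while the constants 18, 7, and 8 and the verdict thresholds are copied from the diff above.

```javascript
// Simplified mirror of the scoring step in analyzeAiAccess.
// `scoreAccess` is a hypothetical helper; constants come from the diff.
function scoreAccess(blockedTiers, aiTrainSignal = false) {
  let penalty = 0;
  for (const tier of blockedTiers) penalty += tier === 'citation' ? 18 : 7;
  if (aiTrainSignal) penalty += 8;
  const score = Math.max(0, 100 - penalty);

  const citationBlocked = blockedTiers.filter((t) => t === 'citation').length;
  let verdict;
  if (citationBlocked >= 3 || score < 40) verdict = 'blocked';
  else if (blockedTiers.length > 0 || aiTrainSignal) verdict = 'partial';
  else verdict = 'open';

  return { score, verdict, blocked: citationBlocked > 0 };
}

// One citation crawler and one training crawler blocked: 100 - 18 - 7 = 75.
console.log(scoreAccess(['citation', 'training']));
// { score: 75, verdict: 'partial', blocked: true }

// Three citation crawlers blocked: 100 - 54 = 46, and the count alone
// is enough for the 'blocked' verdict.
console.log(scoreAccess(['citation', 'citation', 'citation']));
// { score: 46, verdict: 'blocked', blocked: true }
```

One consequence worth noticing: `blocked` becomes `true` as soon as a single citation-tier crawler is locked out, even while `verdict` is still `partial`. Since the scorer's 30-point cap keys off `blocked`, one blocked citation crawler is enough to gate a domain's pages.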
package/analyses/aeo/index.js
CHANGED
@@ -12,12 +12,16 @@ import { scorePage } from './scorer.js';
  *
  * @param {import('node:sqlite').DatabaseSync} db
  * @param {string} project
- * @param {object} opts - { includeCompetitors
+ * @param {object} opts - { includeCompetitors, log, aiAccessByDomain }
+ *   aiAccessByDomain: optional Map<domain, verdict> from ai-access.js. Pure —
+ *   this function never touches the network; callers fetch robots.txt and pass
+ *   the verdicts in (preserves the "AEO runs on existing crawl data" contract).
  * @returns {object} { target: PageScore[], competitors: Map<domain, PageScore[]>, summary }
  */
 export function runAeoAnalysis(db, project, opts = {}) {
   const log = opts.log || console.log;
   const includeCompetitors = opts.includeCompetitors ?? true;
+  const aiAccessByDomain = opts.aiAccessByDomain || null;
 
   // ── Gather pages with body_text ─────────────────────────────────────────
   const roleFilter = includeCompetitors
@@ -75,8 +79,12 @@ export function runAeoAnalysis(db, project, opts = {}) {
       entities = JSON.parse(page.primary_entities || '[]');
     } catch { /* ignore */ }
 
+    const aiAccess = aiAccessByDomain
+      ? (aiAccessByDomain.get(page.domain) || aiAccessByDomain.get(page.domain.replace(/^www\./, '')) || null)
+      : null;
+
     const result = scorePage(
-      page, headings, entities, schemaTypes, pageSchemas, page.search_intent
+      page, headings, entities, schemaTypes, pageSchemas, page.search_intent, aiAccess
     );
 
     const pageScore = {
@@ -117,6 +125,20 @@ export function runAeoAnalysis(db, project, opts = {}) {
   const tierCounts = { excellent: 0, good: 0, needs_work: 0, poor: 0 };
   for (const r of targetResults) tierCounts[r.tier]++;
 
+  // Domain-level AI-access rollup (one verdict per target/owned domain).
+  const aiAccess = [];
+  if (aiAccessByDomain) {
+    const seen = new Set();
+    for (const r of targetResults) {
+      const key = r.domain.replace(/^www\./, '');
+      if (seen.has(key)) continue;
+      seen.add(key);
+      const v = aiAccessByDomain.get(r.domain) || aiAccessByDomain.get(key);
+      if (v) aiAccess.push({ domain: key, verdict: v.verdict, score: v.score, blocked: !!v.blocked, blockedBots: v.citationBlocked || [], detail: v.detail });
+    }
+  }
+  const gatedPages = targetResults.filter(r => r.aiAccessGated).length;
+
   const summary = {
     totalScored: scored,
     targetPages: targetResults.length,
@@ -126,6 +148,8 @@ export function runAeoAnalysis(db, project, opts = {}) {
     scoreDelta: avgTarget - avgComp,
     tierCounts,
     weakestSignals: getWeakestSignals(targetResults),
+    aiAccess,
+    gatedPages,
   };
 
   log(` Scored ${scored} pages (${targetResults.length} target, ${compScores.length} competitor)`);
@@ -171,7 +195,7 @@ export function persistAeoScores(db, results) {
 /**
  * Feed low-scoring pages into Intelligence Ledger as citability_gap insights
  */
-export function upsertCitabilityInsights(db, project, targetResults) {
+export function upsertCitabilityInsights(db, project, targetResults, aiAccess = null) {
   const upsertStmt = db.prepare(`
     INSERT INTO insights (project, type, status, fingerprint, first_seen, last_seen, source_analysis_id, data)
     VALUES (?, 'citability_gap', 'active', ?, ?, ?, NULL, ?)
@@ -183,6 +207,27 @@ export function upsertCitabilityInsights(db, project, targetResults) {
   const ts = Date.now();
   db.exec('BEGIN');
   try {
+    // Domain-level AI-access blocks — the highest-severity citability gap: the
+    // page can't be cited at all because robots.txt locks out the crawlers.
+    if (Array.isArray(aiAccess)) {
+      for (const a of aiAccess) {
+        if (a.verdict === 'open') continue;
+        const fp = `ai-access::${a.domain}`;
+        const data = {
+          domain: a.domain,
+          score: a.score,
+          tier: a.blocked ? 'poor' : 'needs_work',
+          verdict: a.verdict,
+          blocked_crawlers: a.blockedBots,
+          weakest_signals: ['ai access'],
+          recommendation: a.blocked
+            ? `robots.txt blocks AI answer-engine crawlers (${(a.blockedBots || []).slice(0, 5).join(', ')}). Allow ClaudeBot / GPTBot / PerplexityBot / Google-Extended so the assistants developers use can read and cite ${a.domain}.`
+            : `${a.detail} Review robots.txt AI-crawler rules on ${a.domain}.`,
+        };
+        upsertStmt.run(project, fp, ts, ts, JSON.stringify(data));
+      }
+    }
+
     for (const r of targetResults) {
       if (r.score >= 60) continue; // only flag pages that need work
 
@@ -216,14 +261,12 @@ export function upsertCitabilityInsights(db, project, targetResults) {
 function getWeakestSignals(targetResults) {
   if (!targetResults.length) return [];
 
-
-
-
-  };
-
+  // Key-agnostic: ai_access only appears in the breakdown when robots data was
+  // supplied, so build the accumulator from whatever signals are present.
+  const signalTotals = {};
   for (const r of targetResults) {
     for (const [k, v] of Object.entries(r.breakdown)) {
-      signalTotals[k]
+      signalTotals[k] = (signalTotals[k] || 0) + v;
     }
   }
 
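The key-agnostic accumulator added to `getWeakestSignals` can be tried in isolation. The sketch below assumes only what the hunk shows, namely that each result carries a `breakdown` object of numeric signals. The `sumSignals` name and the sample data are hypothetical. It demonstrates why the `(totals[k] || 0) + v` form matters once a signal such as `ai_access` appears on some pages and not others.

```javascript
// Hypothetical stand-alone version of the accumulator pattern in the hunk above.
function sumSignals(results) {
  const totals = {};
  for (const r of results) {
    for (const [k, v] of Object.entries(r.breakdown)) {
      // Start from 0 the first time a key is seen, so keys that were not
      // declared up front (such as ai_access) never produce NaN.
      totals[k] = (totals[k] || 0) + v;
    }
  }
  return totals;
}

// Sample data (invented): only the second page was scored with robots data.
const pages = [
  { breakdown: { freshness: 40, schema_coverage: 60 } },
  { breakdown: { freshness: 80, schema_coverage: 20, ai_access: 100 } },
];

console.log(sumSignals(pages));
// { freshness: 120, schema_coverage: 80, ai_access: 100 }
```

With an accumulator whose keys are fixed in advance, an unexpected key would evaluate `undefined + 100`, which is `NaN` in JavaScript. Building the object from whatever keys are present avoids that.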
package/analyses/aeo/scorer.js
CHANGED
@@ -265,9 +265,12 @@ export function richResultProbability(headings, bodyText, schemaTypes, wordCount
  * @param {string[]} schemaTypes - schema type strings present on page
  * @param {object[]} schemas - full page_schemas rows
  * @param {string} searchIntent - from extraction
- * @
+ * @param {object} [aiAccess] - domain-level robots.txt verdict from ai-access.js
+ *   ({ score, blocked, verdict }). When provided, adds a 7th signal and applies
+ *   a hard-reality gate: pages whose AI crawlers are blocked can't be cited.
+ * @returns {object} { score, breakdown, aiIntents, tier, richResult, aiAccess, aiAccessGated }
  */
-export function scorePage(page, headings, entities, schemaTypes, schemas, searchIntent) {
+export function scorePage(page, headings, entities, schemaTypes, schemas, searchIntent, aiAccess = null) {
   const bodyText = page.body_text || '';
   const wordCount = page.word_count || bodyText.split(/\s+/).length;
 
@@ -280,20 +283,37 @@ export function scorePage(page, headings, entities, schemaTypes, schemas, search
     schema_coverage: schemaCoverageScore(schemaTypes),
   };
 
-  // Weighted composite — entity authority and structured claims matter most for AI
-
-
-
-
-
-
-
-
+  // Weighted composite — entity authority and structured claims matter most for AI.
+  // When a domain-level AI-access verdict is supplied we fold in a 7th signal and
+  // rebalance to sum to 1.0; otherwise the original 6-signal model is preserved
+  // exactly (callers without robots data are unaffected).
+  const hasAiAccess = aiAccess && typeof aiAccess.score === 'number';
+  let weights;
+  if (hasAiAccess) {
+    breakdown.ai_access = aiAccess.score;
+    weights = {
+      entity_authority: 0.22, structured_claims: 0.18, answer_density: 0.18,
+      qa_proximity: 0.14, freshness: 0.08, schema_coverage: 0.10, ai_access: 0.10,
+    };
+  } else {
+    weights = {
+      entity_authority: 0.25, structured_claims: 0.20, answer_density: 0.20,
+      qa_proximity: 0.15, freshness: 0.10, schema_coverage: 0.10,
+    };
+  }
 
-
+  let score = Math.round(
     Object.entries(weights).reduce((sum, [k, w]) => sum + breakdown[k] * w, 0)
   );
 
+  // Hard-reality gate: if AI assistants are blocked from fetching the page,
+  // on-page quality is moot — they cannot cite what they cannot read.
+  let aiAccessGated = false;
+  if (aiAccess && aiAccess.blocked) {
+    score = Math.min(score, 30);
+    aiAccessGated = true;
+  }
+
   const aiIntents = classifyAiIntent(headings, bodyText, searchIntent);
   const richResult = richResultProbability(headings, bodyText, schemaTypes, wordCount);
 
@@ -304,5 +324,8 @@ export function scorePage(page, headings, entities, schemaTypes, schemas, search
   else if (score >= 35) tier = 'needs_work';
   else tier = 'poor';
 
-  return {
+  return {
+    score, breakdown, aiIntents, tier, richResult,
+    ...(hasAiAccess ? { aiAccess: { score: aiAccess.score, verdict: aiAccess.verdict, blocked: !!aiAccess.blocked }, aiAccessGated } : {}),
+  };
 }
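A short worked example may help show how the seven-signal weighting and the 30-point cap in `scorePage` interact. The weights below are copied from the diff. The `compositeScore` helper and the sample breakdown values are hypothetical and stand in for the real per-signal scorers, which are not part of this hunk.

```javascript
// Weights copied from the 7-signal branch of scorePage; they sum to 1.0.
const WEIGHTS_WITH_AI_ACCESS = {
  entity_authority: 0.22, structured_claims: 0.18, answer_density: 0.18,
  qa_proximity: 0.14, freshness: 0.08, schema_coverage: 0.10, ai_access: 0.10,
};

// Hypothetical helper: weighted composite followed by the hard gate.
function compositeScore(breakdown, aiAccess) {
  let score = Math.round(
    Object.entries(WEIGHTS_WITH_AI_ACCESS)
      .reduce((sum, [k, w]) => sum + breakdown[k] * w, 0)
  );
  const gated = Boolean(aiAccess && aiAccess.blocked);
  if (gated) score = Math.min(score, 30); // blocked pages are capped at 30
  return { score, gated };
}

// Invented on-page signals: a strong page scoring 80 on every signal.
const onPage = {
  entity_authority: 80, structured_claims: 80, answer_density: 80,
  qa_proximity: 80, freshness: 80, schema_coverage: 80,
};

// Crawlers allowed: 80 * 0.90 + 100 * 0.10 = 82.
console.log(compositeScore({ ...onPage, ai_access: 100 }, { blocked: false }));
// { score: 82, gated: false }

// Citation crawlers blocked: the weighted sum is about 77, but the gate caps it.
console.log(compositeScore({ ...onPage, ai_access: 46 }, { blocked: true }));
// { score: 30, gated: true }
```

The second call shows why the cap exists alongside the weight: at a weight of 0.10, a poor `ai_access` score alone would move a strong page only a few points, so the gate is what actually pulls the page down to 30.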