npm - seo-intel - Versions diffs - 1.5.40 → 1.5.45 - Mend

seo-intel 1.5.40 → 1.5.45

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12) hide show

package/CHANGELOG.md +64 -0
package/analyses/blog-draft/prescorer.js +17 -0
package/analyses/loop/orchestrator.js +179 -0
package/cli.js +162 -6
package/crawler/html-extract.js +127 -0
package/crawler/light.js +169 -0
package/db/db.js +66 -0
package/lib/gate.js +33 -1
package/lib/intel.js +9 -3
package/mcp/server.js +172 -17
package/package.json +1 -1
package/reports/generate-html.js +42 -404

package/CHANGELOG.md CHANGED Viewed

@@ -1,5 +1,69 @@
 # Changelog
+## 1.5.45 (2026-05-29)
+### The content loop in one command — `seo-intel loop` + `run_content_loop` (MCP)
+Closes the loop from the agent's side: instead of running gap-finding, drafting, and scoring as separate steps, one invocation walks the whole content half — **rank the open gaps → draft the highest-leverage one → AEO-prescore → record it → queue for publish.**
+- **CLI `seo-intel loop <project>`** (free): picks the top gap from the Intelligence Ledger (ranked by priority × source × AI-intent), drafts it with your chosen model, pre-scores citability, optionally auto-revises (`--revise k`) until it clears `--min-score`, marks the gap in-progress (the v1.5.42 write-back), and writes the approved draft to `reports/ready/<project>/`. Flags: `--topic`, `--count`, `--lang`, `--type`, `--model`, `--min-score`, `--revise`, `--no-queue`, `--dry-run`, `--format json`. `--dry-run` shows which gap it would target with no model call.
+- **MCP `run_content_loop`** (free): the same ranking + selection, returned as a seeded draft prompt for the agent's own LLM (hand-back mode) — write the draft, then call `prescore_draft(project, topic)` to score and close the loop. `dry_run` to just see the target.
+- New module `analyses/loop/orchestrator.js` (`runContentLoop`) backs both surfaces; the model writer is injectable so the CLI drives your cloud model while MCP hands the prompt to the agent.
+- The queued draft carries front-matter (`status: ready`, score, tier, source gap) as a publish handoff — it does not auto-deploy.
+## 1.5.44 (2026-05-29)
+### New standalone skill: `ai-citability` — score AI citability with zero install, zero account
+A drop-in Agent Skill (Claude Code / Cursor / Codex) that scores any page or draft 0–100 for how easily an AI assistant (ChatGPT, Claude, Perplexity, AI Overviews, Bing Copilot) can cite it — across the same six signals as the full AEO audit.
+- **Truly self-contained:** pure Node, no `npm install`, no account, no API key, no network. Nothing is saved or sent. Drop the folder into your agent's skills directory and it works.
+- **Markdown or HTML** input, from a file or stdin: `node scripts/score.mjs <file>` (add `--json` for machine output). The agent fetches the content however it likes — a local file, WebFetch, or the `crawl_site` MCP tool.
+- Reports the overall score + tier, all six signal bars, the two weakest signals with concrete fixes, and funnels to the full `seo-intel aeo` audit for whole-site, entity-aware, historical scoring.
+- Ships the exact scoring engine as the product (vendored from `analyses/aeo/scorer.js`), with a smoke-test drift guard so the standalone score never diverges from `seo-intel aeo`.
+- Lives in `skills/ai-citability/` — distributed via the repo / skill directories, not the npm package.
+## 1.5.43 (2026-05-29)
+### New MCP tool: `crawl_site` — crawl any URL, no setup, no account, nothing saved
+A zero-config crawl for any AI agent. Point it at a URL and it returns structured SEO/AEO data — no project to configure, no account, no API key, and nothing is persisted or sent anywhere.
+- **Lightweight by design:** plain HTTP fetch (no browser, no Playwright download), same-origin BFS, honours `robots.txt` + crawl-delay, small page budget (default 10, hard cap 50). www/non-www and http/https are treated as the same site.
+- **Returns per page:** title, meta description, canonical, indexability, headings, internal/external link counts, JSON-LD schema types, word count, and published/modified dates — plus a deduped list of discovered internal URLs.
+- **Optional AI-citability (AEO) score** per page via `include_citability` (approximate in light mode — no entity extraction; run `seo-intel aeo` for the full score).
+- **Knows its limits:** JavaScript-rendered / SPA pages under-report content (the response says so) — use `seo-intel crawl` (Playwright) for those, and install seo-intel for persistent history, the Intelligence Ledger, and competitor analysis.
+- New modules: `crawler/light.js` (fetch crawler) + `crawler/html-extract.js` (pure regex HTML extraction, zero new dependencies).
+## 1.5.42 (2026-05-29)
+### The content loop now remembers its own work
+Drafting a post used to leave the Intelligence Ledger untouched — so the same gap kept getting suggested even after you'd written about it. Now drafting closes that memory gap.
+- **`blog-draft` writes back to the Ledger.** After a draft is generated and AEO-scored, SEO Intel records a `draft_created` insight and flips the matching gap(s) to `in_progress` — so they stop resurfacing until a re-audit re-scores the published page. Matching is precise (keyword/topic), and the write-back is best-effort: a Ledger hiccup never fails your draft.
+- **MCP `prescore_draft` can close the loop too.** It gains optional `project` and `topic` arguments — pass them and the scored draft is recorded + matching gaps marked `in_progress`, identical to the CLI. Omit them for a pure, stateless score (unchanged default).
+- When no `--topic` is given, the targeted gap is recovered from the draft's own frontmatter title or first H1.
+### Fix — free-tier `blog-draft` no longer claims "no data" when you have citability gaps
+`blog-draft`'s empty-state check only counted competitor-analysis gaps, so a free user whose gaps came from `aeo` (AI citability) or only from `keywords` could be wrongly told "No intelligence data found." It now counts citability and content gaps too, and points free users at `aeo` + `keywords` first (competitor gaps via `analyze` are Solo).
+### Fix — MCP startup banner now reflects the real free/paid split
+The `seo-intel-mcp` startup line (shown in your MCP host's logs) still listed `run_citability_audit`, `prescore_draft`, `draft_blog_prompt`, and `get_intel(audit/blog)` as paid. They've been free since 1.5.41 — the banner now says so, listing only competitor synthesis (`get_competitor_positioning`, `get_intel(competitor)`, the `analyses` export table) as Solo.
+## 1.5.41 (2026-05-28)
+### Your own site is now fully free — across CLI, MCP, and the dashboard
+Everything SEO Intel can tell you about **your own** site is now available on the free tier: full crawl, AI Citability (AEO) scoring, keyword intelligence, programmatic template detection, orphan detection, JS-rendering delta, Search Console insights, blog-draft generation, and the complete dashboard. The daily problem-notification cron stays free too.
+Solo (€19.99/mo) now focuses on the things a tool has to do that an AI agent can't do for itself:
+- **Competitor synthesis** — gap analysis, keyword battleground, positioning, competitive landscape sections, and the competitor export/digest.
+- **Automation** — the scheduled-crawl scheduler.
+- **History & trends** — the crawl change brief ("what changed") and publishing velocity.
+**CLI:** own-site commands (`aeo`, `keywords`, `templates`, `orphans`, `js-delta`, `blog-draft`, `gsc-insights`, `scan`, `extract`, `html`) no longer prompt to upgrade. `intel <project> --for=audit|blog` is free; `--for=competitor` is Solo.
+**MCP:** `run_citability_audit`, `prescore_draft`, and `draft_blog_prompt` are now free tools. `export_intel` returns all own-site tables (pages, keywords, headings, links, technical, schemas, extractions, citability scores, and the Intelligence Ledger) for free; only the competitor gap-analysis history requires Solo. `get_competitor_positioning` remains Solo.
+**Dashboard:** collapsed to a single template that gates sections individually — own-site analysis (citability, keyword inventor, top keywords, GSC insights, internal links, exports, drafts) renders on every tier; competitor and strategy sections render with Solo. The old separate free-tier dashboard codepath is gone.
 ## 1.5.40 (2026-05-28)
 ### Setup wizard — daily notifications now self-install

package/analyses/blog-draft/prescorer.js CHANGED Viewed

@@ -10,6 +10,23 @@
 import { scorePage } from '../aeo/scorer.js';
+/**
+ * Recover a draft's subject from its own output when no explicit topic was
+ * given — so the agentic-loop write-back (F1) can still match Ledger gaps.
+ * Prefers YAML frontmatter `title:`, falls back to the first markdown H1.
+ * Shared by CLI `blog-draft` and the MCP `prescore_draft` tool.
+ * @param {string} draft
+ * @returns {string|null}
+ */
+export function extractDraftTopic(draft) {
+  if (!draft) return null;
+  const fm = draft.match(/^\s*---\s*[\s\S]*?\btitle\s*:\s*["']?(.+?)["']?\s*$/im);
+  if (fm && fm[1]) return fm[1].trim();
+  const h1 = draft.match(/^\s{0,3}#\s+(.+?)\s*$/m);
+  if (h1 && h1[1]) return h1[1].replace(/[#*_`]/g, '').trim();
+  return null;
+}
 export function prescore(markdownText) {
   // Strip YAML frontmatter
   const bodyMatch = markdownText.match(/^---[\s\S]*?---\n([\s\S]*)$/);

package/analyses/loop/orchestrator.js ADDED Viewed

@@ -0,0 +1,179 @@
+/**
+ * Content-loop orchestrator (F2) — the "hands, not eyes" engine.
+ *
+ * One invocation walks the content half of the agentic loop:
+ *   gather gaps → rank → draft → prescore → (revise) → write-back → queue
+ *
+ * Library-first: backs both `seo-intel loop` (CLI) and `run_content_loop` (MCP).
+ * The `generate` fn is injected so the CLI can drive the user's cloud model while
+ * MCP can hand the prompt back to the agent's own LLM (generate = null).
+ *
+ * Builds on F1 (v1.5.42): a finished draft records a `draft_created` insight and
+ * flips matching gaps to in_progress, so the loop remembers its own work.
+ * F3 (re-audit → flip in_progress→done once the live page clears 60) is NOT here.
+ */
+import { writeFileSync, mkdirSync } from 'node:fs';
+import { join, dirname, relative } from 'node:path';
+import { fileURLToPath } from 'node:url';
+import { gatherBlogDraftContext, buildBlogDraftPrompt } from '../blog-draft/index.js';
+import { prescore, extractDraftTopic } from '../blog-draft/prescorer.js';
+import { recordDraftCreated, markGapsInProgress } from '../../db/db.js';
+const __dirname = dirname(fileURLToPath(import.meta.url));
+const REPO_ROOT = join(__dirname, '../..');
+const REPORTS_DIR = join(REPO_ROOT, 'reports');
+const PRIORITY_W = { high: 3, medium: 2, low: 1 };
+const SOURCE_W = { citability_gap: 1.3, content_gap: 1.3, keyword_gap: 1.1, long_tail: 1.0, keyword_inventor: 1.0 };
+const HOT_INTENT = /decision|comparison|implementation|compare|\bvs\b|best|should/i;
+function slugify(s) {
+  return (s || 'draft').toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '').slice(0, 60) || 'draft';
+}
+/** Build a unified, leverage-ranked list of candidate gaps from gathered context. */
+export function rankGaps(ctx) {
+  const cands = [];
+  const add = (source, topicRaw, priority, intent, extra = {}) => {
+    const topic = (topicRaw || '').toString().trim();
+    if (!topic) return;
+    const pw = PRIORITY_W[(priority || 'medium').toLowerCase()] || 2;
+    const sw = SOURCE_W[source] || 1;
+    const bonus = intent && HOT_INTENT.test(intent) ? 0.25 : 0;
+    cands.push({ source, topic, priority: (priority || 'medium'), intent: intent || null, leverage: +(pw * sw * (1 + bonus)).toFixed(3), ...extra });
+  };
+  for (const kg of ctx.keywordGaps || []) add('keyword_gap', kg.keyword, kg.priority, kg.intent);
+  for (const lt of ctx.longTails || []) add('long_tail', lt.phrase, lt.priority, lt.intent);
+  for (const kw of ctx.kwInventor || []) add('keyword_inventor', kw.phrase, kw.priority, kw.intent);
+  for (const cg of ctx.contentGaps || []) add('content_gap', typeof cg === 'string' ? cg : (cg.topic || cg.suggested_title || cg.gap), cg.priority || 'high', null);
+  for (const cgap of ctx.citabilityGaps || []) add('citability_gap', cgap.title || cgap.url, (cgap.score ?? 50) < 35 ? 'high' : 'medium', (cgap.ai_intents || [])[0], { url: cgap.url, current_score: cgap.score });
+  // Dedupe by lowercased topic, keep highest leverage.
+  const best = new Map();
+  for (const c of cands) {
+    const k = c.topic.toLowerCase();
+    if (!best.has(k) || best.get(k).leverage < c.leverage) best.set(k, c);
+  }
+  return [...best.values()].sort((a, b) => b.leverage - a.leverage);
+}
+/**
+ * @param {import('node:sqlite').DatabaseSync} db
+ * @param {string} project
+ * @param {object} opts
+ * @param {object} opts.config            project config (for the prompt builder)
+ * @param {string} [opts.topic]           focus topic (else auto-pick top gap)
+ * @param {number} [opts.count=1]         draft the top N gaps
+ * @param {string} [opts.lang='en']
+ * @param {string} [opts.contentType='blog']
+ * @param {number} [opts.minScore=60]
+ * @param {number} [opts.revise=0]        auto-revise up to k times if below minScore
+ * @param {boolean} [opts.queue=true]     write approved drafts to reports/ready/<project>/
+ * @param {string} [opts.queueDir]
+ * @param {boolean} [opts.dryRun=false]   select + plan only, no model call
+ * @param {(prompt:string)=>Promise<string|null>} [opts.generate] null ⇒ hand-back mode
+ * @param {(msg:string)=>void} [opts.onProgress]
+ */
+export async function runContentLoop(db, project, opts = {}) {
+  const {
+    config = { project }, topic = null, count = 1, lang = 'en', contentType = 'blog',
+    minScore = 60, revise = 0, queue = true, queueDir, dryRun = false,
+    generate = null, onProgress = () => {},
+  } = opts;
+  const ctx = gatherBlogDraftContext(db, project, topic);
+  const ranked = rankGaps(ctx);
+  if (!ranked.length) {
+    return {
+      project, mode: 'no-gaps', drafts: [], skipped: [],
+      next_action: 'No active gaps to draft. Run `seo-intel aeo` + `seo-intel keywords` (own-site, free) or `seo-intel analyze` (competitor, Solo) to populate the Ledger.',
+    };
+  }
+  const targets = ranked.slice(0, Math.max(1, count));
+  if (dryRun) {
+    return {
+      project, mode: 'dry-run',
+      planned: targets.map(t => ({ topic: t.topic, source: t.source, priority: t.priority, leverage: t.leverage })),
+      next_action: 'Dry run — no draft generated. Re-run without dryRun to draft these.',
+    };
+  }
+  const drafts = [];
+  const skipped = [];
+  for (const target of targets) {
+    const tTopic = target.topic;
+    const prompt = buildBlogDraftPrompt(ctx, { config, lang, topic: tTopic, contentType });
+    // Hand-back mode (MCP default): no model wired — return the prompt so the
+    // agent's own LLM writes the draft, then calls prescore_draft to close.
+    if (!generate) {
+      drafts.push({
+        gap: { source: target.source, topic: tTopic, leverage: target.leverage },
+        mode: 'handback', prompt,
+        next: `Write the draft from this prompt with your own LLM, then call prescore_draft(project="${project}", topic="${tTopic}") to AEO-score it and close the loop.`,
+      });
+      continue;
+    }
+    onProgress(`drafting "${tTopic}" (${target.source} · leverage ${target.leverage})`);
+    let draft = await generate(prompt);
+    if (!draft) { skipped.push({ topic: tTopic, reason: 'generation_failed' }); continue; }
+    let score = prescore(draft);
+    let revisions = 0;
+    while (score.score < minScore && revisions < revise) {
+      const weak = Object.entries(score.breakdown).sort((a, b) => a[1] - b[1]).slice(0, 2).map(([k]) => k.replace(/_/g, ' '));
+      onProgress(`revising ${revisions + 1}/${revise} (was ${score.score}; weak: ${weak.join(', ')})`);
+      const revised = await generate(prompt + `\n\n## REVISION\nThe previous draft scored ${score.score}/100. Strengthen the weakest signals: ${weak.join(', ')}. Add Q&A structure (H2 question → immediate answer), concrete numbers/dates, and named entities/sources. Return the full improved draft.`);
+      revisions++;
+      if (revised) {
+        const rs = prescore(revised);
+        if (rs.score >= score.score) { draft = revised; score = rs; }
+      }
+    }
+    const effectiveTopic = tTopic || extractDraftTopic(draft);
+    // Queue for publish (handoff, not auto-deploy).
+    let queuedPath = null;
+    if (queue) {
+      const dir = queueDir || join(REPORTS_DIR, 'ready', project);
+      mkdirSync(dir, { recursive: true });
+      queuedPath = join(dir, `${slugify(effectiveTopic)}.md`);
+      const fm = `---\nstatus: ready\nscore: ${score.score}\ntier: ${score.tier}\ntopic: ${JSON.stringify(effectiveTopic)}\nsource_gap: ${target.source}\nlang: ${lang}\ntype: ${contentType}\ncreated_at: ${new Date().toISOString()}\n---\n\n`;
+      writeFileSync(queuedPath, /^\s*---/.test(draft) ? draft : fm + draft, 'utf8');
+    }
+    // Write-back (F1) — never let a Ledger hiccup fail the draft.
+    let marked = 0;
+    try {
+      recordDraftCreated(db, project, { topic: effectiveTopic, score: score.score, tier: score.tier, wordCount: score.wordCount, lang, contentType, savedPath: queuedPath });
+      marked = markGapsInProgress(db, project, effectiveTopic);
+    } catch { /* best-effort */ }
+    drafts.push({
+      gap: { source: target.source, topic: tTopic, leverage: target.leverage, ...(target.url ? { url: target.url, previous_score: target.current_score } : {}) },
+      topic: effectiveTopic, score: score.score, tier: score.tier, revisions,
+      word_count: score.wordCount,
+      queued_path: queuedPath ? relative(REPO_ROOT, queuedPath) : null,
+      ledger: { draft_recorded: true, gaps_marked_in_progress: marked },
+    });
+  }
+  const handback = !generate;
+  return {
+    project,
+    mode: handback ? 'handback' : 'generated',
+    drafts, skipped,
+    next_action: handback
+      ? `No generation model wired — write each draft from its prompt, then call prescore_draft(project, topic) to score + close the loop. (CLI: \`seo-intel loop ${project}\` drives a model directly.)`
+      : queue
+        ? `Review reports/ready/${project}/, publish, then re-crawl + \`seo-intel aeo\` to verify the gap closed.`
+        : 'Drafts generated (not queued).',
+  };
+}

package/cli.js CHANGED Viewed

@@ -40,6 +40,7 @@ import {
   getPageHash, getSchemasByProject,
   upsertInsightsFromAnalysis, upsertInsightsFromKeywords,
   upsertSitemapUrls,
+  recordDraftCreated, markGapsInProgress,
 } from './db/db.js';
 import { generateMultiDashboard } from './reports/generate-html.js';
 import { buildTechnicalActions } from './exports/technical.js';
@@ -4709,7 +4710,7 @@ program
     printAttackHeader('✍️  AEO Blog Draft Generator', project);
     const { gatherBlogDraftContext, buildBlogDraftPrompt } = await getBlogDraftModule();
-    const { prescore } = await getPrescorerModule();
+    const { prescore, extractDraftTopic } = await getPrescorerModule();
     // ── Gather intelligence ──
     console.log(chalk.gray('  Gathering intelligence from Ledger...'));
@@ -4728,9 +4729,13 @@ program
     console.log(chalk.gray(`    Keyword gaps: ${stats.keywordGaps}  Long-tails: ${stats.longTails}  Citability gaps: ${stats.citabilityGaps}`));
     console.log(chalk.gray(`    Keyword inventor: ${stats.kwInventor}  Content gaps: ${stats.contentGaps}  Entities: ${stats.entities}`));
-    if (stats.keywordGaps + stats.longTails + stats.kwInventor === 0) {
+    // F5 (v1.5.42): count the free-tier gap sources too — a free user's gaps
+    // come from `aeo` (citability) and `keywords` (kw-inventor), not just the
+    // Solo competitor analysis. Don't tell them "no data" when AEO gaps exist.
+    if (stats.keywordGaps + stats.longTails + stats.kwInventor + stats.citabilityGaps + stats.contentGaps === 0) {
       console.log(chalk.yellow('\n  ⚠️  No intelligence data found in the Ledger.'));
-      console.log(chalk.gray('   Run: seo-intel analyze ' + project + '  and  seo-intel keywords ' + project + '\n'));
+      console.log(chalk.gray('   Free sources: seo-intel aeo ' + project + '  and  seo-intel keywords ' + project));
+      console.log(chalk.gray('   Competitor gaps (Solo): seo-intel analyze ' + project + '\n'));
       return;
     }
@@ -4786,14 +4791,15 @@ program
     }
     // ── Output ──
+    let savedPath = null;
     if (opts.save) {
       const slug = opts.topic
         ? opts.topic.toLowerCase().replace(/[^a-z0-9]+/g, '-').slice(0, 50)
         : 'auto';
       const filename = `${project}-blog-draft-${slug}-${new Date().toISOString().slice(0, 10)}.md`;
-      const outPath = join(__dirname, 'reports', filename);
-      writeFileSync(outPath, draft, 'utf8');
-      console.log(chalk.bold.green(`  ✅ Draft saved: ${outPath}`));
+      savedPath = join(__dirname, 'reports', filename);
+      writeFileSync(savedPath, draft, 'utf8');
+      console.log(chalk.bold.green(`  ✅ Draft saved: ${savedPath}`));
       console.log('');
     } else {
       console.log(chalk.bold('  📝 Generated Draft'));
@@ -4804,6 +4810,32 @@ program
       console.log(chalk.gray('  Tip: add --save to write the draft to reports/'));
       console.log('');
     }
+    // ── F1 (v1.5.42): close the loop's memory gap ──
+    // Record the draft in the Ledger and flip matching gaps to in_progress so
+    // they stop resurfacing next pass. Best-effort — never break the command.
+    try {
+      const effectiveTopic = opts.topic || extractDraftTopic(draft);
+      recordDraftCreated(db, project, {
+        topic: effectiveTopic,
+        score: scoreResult.score,
+        tier: scoreResult.tier,
+        wordCount: scoreResult.wordCount,
+        lang: opts.lang,
+        contentType: opts.type || 'blog',
+        savedPath,
+      });
+      const marked = markGapsInProgress(db, project, effectiveTopic);
+      if (marked > 0) {
+        console.log(chalk.gray(`  📌 Ledger: ${marked} gap(s) marked in-progress — they'll stop resurfacing until re-audited.`));
+      } else {
+        console.log(chalk.gray('  📌 Ledger: draft recorded.'));
+      }
+      console.log('');
+    } catch (e) {
+      // loop write-back is best-effort; a failure must not fail the draft
+      console.log(chalk.dim(`  (Ledger write-back skipped: ${e.message})`));
+    }
   });
 // ── GUIDE (Coach-style chapter map) ──────────────────────────────────────
@@ -5359,6 +5391,130 @@ if (process.argv.length <= 2) {
   process.exit(0);
 }
+// ── LOOP — content-loop orchestrator: gap → draft → prescore → queue ─────
+program
+  .command('loop <project>')
+  .description('Run the content loop once: pick the top gap → draft → prescore → queue for publish')
+  .option('--topic <t>', 'Focus on a specific topic instead of auto-picking')
+  .option('--count <n>', 'Draft the top N gaps', '1')
+  .option('--lang <code>', 'Language: en or fi', 'en')
+  .option('--type <type>', 'Content type: blog, docs, or social', 'blog')
+  .option('--model <name>', 'Generation model (gemini, claude, gpt, deepseek)', 'gemini')
+  .option('--min-score <n>', 'Target citability before publishing', '60')
+  .option('--revise <k>', 'Auto-revise up to k times if below min-score', '0')
+  .option('--no-queue', 'Do not write drafts to reports/ready/')
+  .option('--dry-run', 'Pick the gap and show the plan — no model call')
+  .option('--format <type>', 'Output format: brief or json', 'brief')
+  .action(async (project, opts) => {
+    if (!requirePro('loop')) return;
+    const db = getDb();
+    const config = loadConfig(project);
+    const isJson = opts.format === 'json';
+    const { runContentLoop } = await import('./analyses/loop/orchestrator.js');
+    if (!isJson) printAttackHeader('🔁  Content Loop', project);
+    const result = await runContentLoop(db, project, {
+      config,
+      topic: opts.topic || null,
+      count: parseInt(opts.count, 10) || 1,
+      lang: opts.lang,
+      contentType: opts.type || 'blog',
+      minScore: parseInt(opts.minScore, 10) || 60,
+      revise: parseInt(opts.revise, 10) || 0,
+      queue: opts.queue !== false,
+      dryRun: !!opts.dryRun,
+      generate: opts.dryRun ? null : (p) => callAnalysisModel(p, opts.model),
+      onProgress: isJson ? () => {} : (m) => console.log(chalk.gray('  · ' + m)),
+    });
+    if (isJson) { console.log(JSON.stringify(result, null, 2)); return; }
+    console.log('');
+    if (result.mode === 'no-gaps') {
+      console.log(chalk.yellow('  No active gaps to draft.'));
+      console.log(chalk.gray('  ' + result.next_action));
+      console.log('');
+      return;
+    }
+    if (result.mode === 'dry-run') {
+      console.log(chalk.bold('  Planned drafts (highest leverage first):'));
+      for (const p of result.planned) {
+        console.log(`    ${chalk.cyan(p.topic.slice(0, 60))} ${chalk.gray(`[${p.source} · ${p.priority} · leverage ${p.leverage}]`)}`);
+      }
+      console.log('');
+      console.log(chalk.gray('  Re-run without --dry-run to draft these.'));
+      console.log('');
+      return;
+    }
+    for (const d of result.drafts) {
+      const tier = d.score >= 60 ? chalk.green : d.score >= 35 ? chalk.yellow : chalk.red;
+      console.log(`  ${chalk.bold('→')} ${chalk.cyan(`"${d.topic.slice(0, 56)}"`)}  ${tier(`${d.score}/100 (${d.tier})`)}`);
+      console.log(chalk.gray(`    ${d.word_count}w · ${d.revisions} revision(s) · gap: ${d.gap.source} · ${d.ledger.gaps_marked_in_progress} gap(s) marked in-progress`));
+      if (d.queued_path) console.log(chalk.gray(`    queued: ${d.queued_path}`));
+    }
+    for (const s of result.skipped) console.log(chalk.gray(`  ✗ skipped "${s.topic.slice(0, 40)}": ${s.reason}`));
+    console.log('');
+    console.log(chalk.gray('  ' + result.next_action));
+    console.log('');
+  });
+// ── CRAWL-URL — ad-hoc lightweight crawl of any URL (free, no project) ────
+program
+  .command('crawl-url <url>')
+  .description('Ad-hoc lightweight crawl of any URL — no project, no browser, nothing saved')
+  .option('--max-pages <n>', 'Pages to fetch (default 10, hard cap 50)', '10')
+  .option('--citability', 'Include per-page AI citability (AEO) score')
+  .option('--all-origins', 'Follow links across origins (default: same site only)')
+  .option('--format <type>', 'Output format: brief or json', 'brief')
+  .action(async (url, opts) => {
+    const isJson = opts.format === 'json';
+    const { lightCrawl } = await import('./crawler/light.js');
+    try {
+      const r = await lightCrawl(url, {
+        maxPages: parseInt(opts.maxPages, 10) || 10,
+        includeCitability: !!opts.citability,
+        sameOrigin: !opts.allOrigins,
+        onProgress: isJson ? undefined : (m) => console.log(chalk.gray('  ' + m)),
+      });
+      if (isJson) {
+        const pages = r.pages.map(p => ({
+          url: p.url, status_code: p.status_code, title: p.title, meta_desc: p.meta_desc,
+          canonical: p.canonical || null, is_indexable: p.is_indexable, word_count: p.word_count,
+          headings: p.headings, schema_types: p.schema_types,
+          published_date: p.published_date, modified_date: p.modified_date,
+          internal_links: p.links.filter(l => l.internal).length,
+          external_links: p.links.filter(l => !l.internal).length,
+          ...(p.citability ? { citability: p.citability } : {}),
+        }));
+        console.log(JSON.stringify({ start: r.start, origin: r.origin, stats: r.stats, pages, skipped: r.skipped }, null, 2));
+        return;
+      }
+      console.log('');
+      console.log(chalk.bold(`  🕸  Light crawl — ${r.origin}`));
+      console.log(chalk.gray(`  ${r.stats.crawled} pages · ${r.stats.with_schema} with schema · ${r.stats.missing_title} missing title · ${r.stats.missing_meta_desc} missing meta · ${r.stats.elapsed_ms}ms`));
+      console.log('');
+      for (const p of r.pages) {
+        const c = p.citability;
+        const citeTxt = c ? (c.score >= 60 ? chalk.green : c.score >= 35 ? chalk.yellow : chalk.red)(` · cite:${c.score}`) : '';
+        console.log(`  ${chalk.cyan(p.url)}`);
+        console.log(chalk.gray(`    [${p.status_code}] ${p.word_count}w · h:${p.headings.length} · schema:[${p.schema_types.join(',') || '—'}]`) + citeTxt);
+        if (p.title) console.log(chalk.gray(`    “${p.title.slice(0, 80)}”`));
+      }
+      if (r.skipped.length) console.log(chalk.gray(`\n  skipped ${r.skipped.length}: ${r.skipped.slice(0, 5).map(s => s.reason).join(', ')}`));
+      console.log('');
+      console.log(chalk.gray('  Ephemeral + local — nothing saved. JS-rendered pages under-report (use `seo-intel crawl`).'));
+      console.log(chalk.gray('  For persistent history, the Ledger & competitors: `seo-intel setup` → `crawl`.'));
+      console.log('');
+    } catch (err) {
+      if (isJson) { console.log(JSON.stringify({ error: err.message })); process.exit(1); }
+      console.error(chalk.red(`\n  ✗ ${err.message}\n`));
+      process.exit(1);
+    }
+  });
 // Global error handler — ensures uncaught errors in async actions exit non-zero (BUG-004)
 program.parseAsync().catch(err => {
   console.error(chalk.red(`\n✗ ${err.message}\n`));

package/crawler/html-extract.js ADDED Viewed

@@ -0,0 +1,127 @@
+/**
+ * Lightweight HTML extractor — pure string/regex parsing. No DOM, no browser.
+ *
+ * Powers the fetch-based light crawler (crawler/light.js) so ANY Claude user can
+ * crawl + analyze a site with zero browser environment installed. Consistent
+ * with schema-parser.js's regex approach ("no DOM parser needed").
+ *
+ * Trade-off: not as bulletproof as a full DOM parse on adversarial markup, but
+ * more than good enough for SEO/AEO metadata (title, meta, headings, links,
+ * JSON-LD, dates). The full Playwright crawler stays the heavyweight option.
+ */
+import { stripHtml } from './sanitize.js';
+import { parseJsonLd } from './schema-parser.js';
+function decodeEntities(s) {
+  return (s || '')
+    .replace(/&amp;/g, '&').replace(/&lt;/g, '<').replace(/&gt;/g, '>')
+    .replace(/&quot;/g, '"').replace(/&#0?39;/g, "'").replace(/&apos;/g, "'")
+    .replace(/&nbsp;/g, ' ')
+    .replace(/&#(\d+);/g, (_, n) => { try { return String.fromCodePoint(+n); } catch { return ' '; } })
+    .replace(/&#x([0-9a-f]+);/gi, (_, h) => { try { return String.fromCodePoint(parseInt(h, 16)); } catch { return ' '; } })
+    .trim();
+}
+const collapse = (s) => decodeEntities(stripHtml(s || '').replace(/\s+/g, ' '));
+export function extractTitle(html) {
+  const m = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
+  return m ? decodeEntities(m[1].replace(/\s+/g, ' ')) : '';
+}
+// Find a <meta> tag by attribute (name|property) = value, then read its content.
+function metaContent(html, attr, value) {
+  const re = new RegExp(`<meta\\b[^>]*\\b${attr}\\s*=\\s*["']${value}["'][^>]*>`, 'i');
+  const tag = html.match(re);
+  if (!tag) return '';
+  const c = tag[0].match(/\bcontent\s*=\s*["']([\s\S]*?)["']/i);
+  return c ? decodeEntities(c[1]) : '';
+}
+export function extractMetaDescription(html) {
+  return metaContent(html, 'name', 'description') || metaContent(html, 'property', 'og:description');
+}
+export function extractMetaRobots(html) {
+  return metaContent(html, 'name', 'robots').toLowerCase();
+}
+export function extractCanonical(html, baseUrl) {
+  const tag = html.match(/<link\b[^>]*\brel\s*=\s*["']canonical["'][^>]*>/i);
+  if (!tag) return '';
+  const h = tag[0].match(/\bhref\s*=\s*["']([^"']+)["']/i);
+  if (!h) return '';
+  try { return new URL(h[1], baseUrl).toString(); } catch { return h[1]; }
+}
+export function extractHeadings(html) {
+  const out = [];
+  const re = /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi;
+  let m;
+  while ((m = re.exec(html)) !== null) {
+    const text = collapse(m[2]);
+    if (text) out.push({ level: Number(m[1]), text: text.slice(0, 300) });
+    if (out.length >= 300) break;
+  }
+  return out;
+}
+export function extractLinks(html, baseUrl) {
+  const out = [];
+  const seen = new Set();
+  let base; try { base = new URL(baseUrl); } catch { base = null; }
+  const re = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["'][^>]*>([\s\S]*?)<\/a>/gi;
+  let m;
+  while ((m = re.exec(html)) !== null) {
+    let href = m[1].trim();
+    if (!href) continue;
+    if (/^(#|mailto:|tel:|javascript:|data:)/i.test(href)) continue;
+    let abs;
+    try { abs = base ? new URL(href, base).toString() : href; } catch { continue; }
+    abs = abs.split('#')[0];
+    if (seen.has(abs)) continue;
+    seen.add(abs);
+    let internal = false;
+    try { internal = !!base && new URL(abs).hostname === base.hostname; } catch { /* keep false */ }
+    out.push({ href: abs, text: collapse(m[2]).slice(0, 120), internal });
+    if (out.length >= 1000) break;
+  }
+  return out;
+}
+/**
+ * Parse one fetched HTML document into the structured shape the rest of
+ * SEO Intel speaks (mirrors the Playwright crawler's per-page object).
+ * @param {string} html
+ * @param {string} url - the (final) URL this HTML was fetched from
+ */
+export function extractPageData(html, url) {
+  const schemas = parseJsonLd(html) || [];
+  const schemaTypes = [...new Set(schemas.map(s => s.type).filter(Boolean))];
+  let published = null, modified = null;
+  for (const s of schemas) {
+    if (!published && s.datePublished) published = s.datePublished;
+    if (!modified && s.dateModified) modified = s.dateModified;
+  }
+  const bodyText = stripHtml(html);
+  const wordCount = bodyText ? bodyText.split(/\s+/).filter(Boolean).length : 0;
+  const robots = extractMetaRobots(html);
+  return {
+    url,
+    title: extractTitle(html),
+    meta_desc: extractMetaDescription(html),
+    canonical: extractCanonical(html, url),
+    robots,
+    is_indexable: !/\bnoindex\b/.test(robots),
+    headings: extractHeadings(html),
+    links: extractLinks(html, url),
+    schema_types: schemaTypes,
+    schemas,
+    word_count: wordCount,
+    body_text: bodyText.slice(0, 20000),
+    published_date: published,
+    modified_date: modified,
+  };
+}