similarbuild 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +110 -0
- package/LICENSE +21 -0
- package/README.md +301 -0
- package/bin/install.js +256 -0
- package/lib/copy-templates.mjs +52 -0
- package/lib/install-deps.mjs +62 -0
- package/lib/prompt-config.mjs +83 -0
- package/lib/verify-env.mjs +19 -0
- package/package.json +63 -0
- package/scripts/sync-templates.mjs +71 -0
- package/templates/commands/build-page.md +490 -0
- package/templates/commands/build-site.md +548 -0
- package/templates/commands/clip-section.md +519 -0
- package/templates/memory/anti-patterns.md +212 -0
- package/templates/memory/design-knowledge.md +225 -0
- package/templates/memory/fixes.md +163 -0
- package/templates/memory/patterns.md +681 -0
- package/templates/presets/shopify-section.yaml +51 -0
- package/templates/presets/wp-elementor.yaml +49 -0
- package/templates/reports/fixtures/mock-run-1.json +115 -0
- package/templates/reports/fixtures/mock-run-2.json +72 -0
- package/templates/reports/report-renderer.mjs +218 -0
- package/templates/reports/report-template.html +571 -0
- package/templates/skills/sb-build-shopify/SKILL.md +104 -0
- package/templates/skills/sb-build-shopify/references/shopify-build-rules.md +563 -0
- package/templates/skills/sb-build-shopify/scripts/build-shopify.mjs +637 -0
- package/templates/skills/sb-build-shopify/scripts/tests/test-build-shopify.mjs +424 -0
- package/templates/skills/sb-build-wp/SKILL.md +83 -0
- package/templates/skills/sb-build-wp/references/wp-build-rules.md +376 -0
- package/templates/skills/sb-build-wp/scripts/build-wp.mjs +327 -0
- package/templates/skills/sb-build-wp/scripts/tests/test-build-wp.mjs +224 -0
- package/templates/skills/sb-compare-visual/SKILL.md +121 -0
- package/templates/skills/sb-compare-visual/scripts/compare-visual.mjs +387 -0
- package/templates/skills/sb-compare-visual/scripts/lib/compare-tokens.mjs +273 -0
- package/templates/skills/sb-compare-visual/scripts/tests/test-compare-tokens.mjs +350 -0
- package/templates/skills/sb-compare-visual/scripts/tests/test-compare-visual.mjs +626 -0
- package/templates/skills/sb-crawl-and-list/SKILL.md +99 -0
- package/templates/skills/sb-crawl-and-list/scripts/crawl-and-list.mjs +437 -0
- package/templates/skills/sb-crawl-and-list/scripts/lib/blocklist-filter.mjs +176 -0
- package/templates/skills/sb-crawl-and-list/scripts/lib/fallback-crawler.mjs +107 -0
- package/templates/skills/sb-crawl-and-list/scripts/lib/page-classifier.mjs +89 -0
- package/templates/skills/sb-crawl-and-list/scripts/lib/sitemap-parser.mjs +118 -0
- package/templates/skills/sb-crawl-and-list/scripts/tests/test-blocklist-filter.mjs +204 -0
- package/templates/skills/sb-crawl-and-list/scripts/tests/test-crawl-and-list.mjs +276 -0
- package/templates/skills/sb-crawl-and-list/scripts/tests/test-fallback-crawler.mjs +243 -0
- package/templates/skills/sb-crawl-and-list/scripts/tests/test-page-classifier.mjs +120 -0
- package/templates/skills/sb-crawl-and-list/scripts/tests/test-sitemap-parser.mjs +157 -0
- package/templates/skills/sb-extract-assets/SKILL.md +112 -0
- package/templates/skills/sb-extract-assets/scripts/extract-assets.mjs +484 -0
- package/templates/skills/sb-extract-assets/scripts/tests/test-extract-assets.mjs +112 -0
- package/templates/skills/sb-inspect-live/SKILL.md +105 -0
- package/templates/skills/sb-inspect-live/scripts/inspect-live.mjs +693 -0
- package/templates/skills/sb-inspect-live/scripts/tests/test-inspect-live.mjs +181 -0
- package/templates/skills/sb-review-checks/SKILL.md +113 -0
- package/templates/skills/sb-review-checks/references/review-rules.md +195 -0
- package/templates/skills/sb-review-checks/scripts/lib/anti-patterns.mjs +379 -0
- package/templates/skills/sb-review-checks/scripts/lib/cross-reference.mjs +115 -0
- package/templates/skills/sb-review-checks/scripts/lib/design-quality.mjs +541 -0
- package/templates/skills/sb-review-checks/scripts/review-checks.mjs +250 -0
- package/templates/skills/sb-review-checks/scripts/tests/test-anti-patterns.mjs +343 -0
- package/templates/skills/sb-review-checks/scripts/tests/test-cross-reference.mjs +170 -0
- package/templates/skills/sb-review-checks/scripts/tests/test-design-quality.mjs +493 -0
- package/templates/skills/sb-review-checks/scripts/tests/test-review-checks.mjs +267 -0
- package/templates/skills/sb-tweak/SKILL.md +130 -0
- package/templates/skills/sb-tweak/references/tweak-patterns.md +157 -0
- package/templates/skills/sb-tweak/scripts/lib/diff-summarizer.mjs +140 -0
- package/templates/skills/sb-tweak/scripts/lib/element-locator.mjs +507 -0
- package/templates/skills/sb-tweak/scripts/lib/intent-parser.mjs +324 -0
- package/templates/skills/sb-tweak/scripts/tests/test-diff-summarizer.mjs +248 -0
- package/templates/skills/sb-tweak/scripts/tests/test-element-locator.mjs +418 -0
- package/templates/skills/sb-tweak/scripts/tests/test-intent-parser.mjs +496 -0
- package/templates/skills/sb-tweak/scripts/tests/test-tweak.mjs +407 -0
- package/templates/skills/sb-tweak/scripts/tweak.mjs +656 -0
- package/templates/skills/sb-validate-render/SKILL.md +120 -0
- package/templates/skills/sb-validate-render/scripts/tests/test-validate-render.mjs +304 -0
- package/templates/skills/sb-validate-render/scripts/validate-render.mjs +645 -0
|
@@ -0,0 +1,99 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: sb-crawl-and-list
|
|
3
|
+
description: Discovers a site's pages by trying `sitemap.xml` first then falling back to a depth-bounded HTML crawl with cheerio, applies a default + custom blocklist, classifies each URL by path heuristic, and emits a dense JSON page list ready for human confirmation. Use when the SimilarBuild orchestrator (`/build-site`) requests page discovery before batching, or when the user asks to 'crawl and list' a site's pages.
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# sb-crawl-and-list
|
|
7
|
+
|
|
8
|
+
## Overview
|
|
9
|
+
|
|
10
|
+
Takes a `--root-url` and produces a normalized page list — `{url, type, title?, depth, source}[]` — that downstream batch orchestration (`/build-site`) renders as a confirmation table before launching N parallel `/build-page` runs. Sitemap.xml covers the modern web; cheerio handles static HTML when sitemaps are absent or hostile; SPAs that hide all routes behind JS get a clear warning rather than a fabricated list.
|
|
11
|
+
|
|
12
|
+
Deliberately **chromium-free**. Crawling is URL discovery, not visual inspection — `sb-inspect-live` and `sb-validate-render` already pay the chromium tax. A second headless browser here would balloon install time, memory, and crawl latency for a job that `fetch` + `cheerio` does in seconds. The trade-off is explicit: pure-SPA sites require a manual sitemap. The skill tells the user that, doesn't try to fake it.
|
|
13
|
+
|
|
14
|
+
Act as a discovery gate: the orchestrator commits to building everything in this list, so the list must be **complete enough to confirm** (no missing key pages) and **clean enough to ship** (no admin paths, tracking URLs, or duplicate-by-querystring noise). Determinism is non-negotiable — same root, same sitemap, same list across reruns.
|
|
15
|
+
|
|
16
|
+
## Inputs
|
|
17
|
+
|
|
18
|
+
| Argument | Required | Default | Notes |
|
|
19
|
+
| ------------------------ | -------- | ------- | ------------------------------------------------------------------------------------------------------ |
|
|
20
|
+
| `root-url` | yes | — | Origin to crawl. Used as base for sitemap probe and as scope for the fallback crawler (same-origin). |
|
|
21
|
+
| `output-dir` | yes | — | Directory for `pages-list.json` (and `sitemap-raw.xml` if a sitemap was downloaded). |
|
|
22
|
+
| `max-depth` | no | `3` | Only used by the fallback crawl — sitemap is one level by definition. |
|
|
23
|
+
| `max-pages` | no | `200` | Hard cap on the emitted list. Excess URLs are dropped and a warning is added. |
|
|
24
|
+
| `respect-robots-txt` | no | `true` | If `false`, `robots.txt` `Disallow:` rules for the same origin are ignored. |
|
|
25
|
+
| `blocklist` | no | (none) | Comma-separated extra path prefixes (e.g. `/internal,/preview`). Adds to the default blocklist. |
|
|
26
|
+
| `sitemap-path` | no | (none) | If set, reads sitemap XML from this local path instead of probing `{root-url}/sitemap.xml`. |
|
|
27
|
+
| `timeout` | no | `10000` | Per-request fetch timeout (ms). Applies to sitemap probe, robots fetch, and each crawl page. |
|
|
28
|
+
|
|
29
|
+
## Output
|
|
30
|
+
|
|
31
|
+
A single JSON object printed to stdout AND saved to `{output-dir}/pages-list.json`:
|
|
32
|
+
|
|
33
|
+
```json
|
|
34
|
+
{
|
|
35
|
+
"rootUrl": "https://example.com",
|
|
36
|
+
"source": "sitemap",
|
|
37
|
+
"pageCount": 27,
|
|
38
|
+
"blocked": 12,
|
|
39
|
+
"duplicates": 3,
|
|
40
|
+
"warnings": [],
|
|
41
|
+
"pages": [
|
|
42
|
+
{ "url": "https://example.com/", "type": "home", "title": null, "depth": 0, "source": "sitemap" },
|
|
43
|
+
{ "url": "https://example.com/products/foo", "type": "pdp", "title": null, "depth": 1, "source": "sitemap" }
|
|
44
|
+
]
|
|
45
|
+
}
|
|
46
|
+
```
|
|
47
|
+
|
|
48
|
+
`source` is `"sitemap"` when any URL was harvested from XML, `"crawl"` when the fallback was used, `"sitemap-path"` when `--sitemap-path` was provided. `pageCount` reflects what's in `pages[]` after blocklist + dedupe + cap. `blocked` and `duplicates` are diagnostics so the orchestrator can show the user how aggressive the filtering was. `warnings[]` is where SPA detection, cap-truncation, robots-blocked-root, and oversized-sitemap messages surface.
|
|
49
|
+
|
|
50
|
+
`title` is reserved — populated as `null` here. The orchestrator can hydrate it later from individual `sb-inspect-live` runs without re-crawling.
|
|
51
|
+
|
|
52
|
+
## On Activation
|
|
53
|
+
|
|
54
|
+
1. **Resolve inputs.** Collect `root-url` and `output-dir`. Apply defaults above. Normalize `root-url` (strip trailing slash on the path component).
|
|
55
|
+
|
|
56
|
+
2. **Ensure `output-dir` exists.** `mkdir -p` it.
|
|
57
|
+
|
|
58
|
+
3. **Run the script.** From the project root:
|
|
59
|
+
|
|
60
|
+
```bash
|
|
61
|
+
node {skill-root}/scripts/crawl-and-list.mjs \
|
|
62
|
+
--root-url "{root_url}" \
|
|
63
|
+
--output-dir "{output_dir}" \
|
|
64
|
+
[--max-depth N] [--max-pages N] [--respect-robots-txt false] \
|
|
65
|
+
[--blocklist "/preview,/internal"] \
|
|
66
|
+
[--sitemap-path "{path}"] \
|
|
67
|
+
[--timeout 10000]
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
The script: optionally honors `robots.txt`, tries `--sitemap-path` → `{root}/sitemap.xml` → `{root}/sitemap_index.xml` (recursing one level), falls back to a same-origin BFS via `fetch` + `cheerio` honoring `--max-depth`, then runs blocklist → dedupe → classifier → cap. See `scripts/crawl-and-list.mjs --help` for the full flag list.
|
|
71
|
+
|
|
72
|
+
4. **Validate the result.** Parse stdout as JSON. If exit is non-zero, surface stderr to the orchestrator and stop — don't fabricate a partial list.
|
|
73
|
+
|
|
74
|
+
5. **Forward `warnings[]` unchanged.** Cap-exceeded, SPA-suspected, robots-blocked-root, and parse-failure warnings are user-facing — the orchestrator displays them alongside the page table.
|
|
75
|
+
|
|
76
|
+
6. **Return the JSON unchanged** to the caller. `/build-site` consumes `pages[]`, `pageCount`, and `warnings[]` directly to render the confirmation table.
|
|
77
|
+
|
|
78
|
+
## Failure modes
|
|
79
|
+
|
|
80
|
+
| Symptom | Likely cause | What to surface |
|
|
81
|
+
| ------------------------------------------------ | ----------------------------------------------------- | ------------------------------------------------------------------------ |
|
|
82
|
+
| Script exits non-zero (1) | Network failure to root, write failure, or a malformed argument surfacing as a runtime error | Pass stderr verbatim. If the message mentions `cheerio`, suggest `npm i cheerio`. |
|
|
83
|
+
| `pageCount: 0` + `warnings: ["spa-suspected"]` | Pure SPA — root HTML has 0–1 internal `<a href>` | Tell the user: site appears to be a SPA; rerun with `--sitemap-path` pointing at a manually-supplied sitemap. |
|
|
84
|
+
| `warnings: ["sitemap-truncated:N"]` | Source sitemap had > 1000 entries | Ask the user if they want to filter (e.g. only `/products/*`) or proceed with the cap. |
|
|
85
|
+
| `warnings: ["robots-disallow-root"]` | `robots.txt` disallows `/` for `*` | Forward unchanged. Either rerun with `--respect-robots-txt false` (with permission) or stop. |
|
|
86
|
+
| `warnings: ["max-pages-cap:N-of-M"]` | Emitted list hit the `--max-pages` cap | Surface both numbers. Suggest raising the cap or filtering before continuing. |
|
|
87
|
+
| HTTP 403 / 401 / 429 to root | Origin blocks node's UA or rate-limits | Error is clear in stderr. (Custom UA support is a future flag — escalate to user.) |
|
|
88
|
+
|
|
89
|
+
## Conventions
|
|
90
|
+
|
|
91
|
+
- Bare paths (e.g. `scripts/crawl-and-list.mjs`) resolve from the skill root.
|
|
92
|
+
- `{skill-root}` resolves to this skill's installed directory.
|
|
93
|
+
- `{project-root}` resolves to the project working directory.
|
|
94
|
+
- The script never invents URLs. If discovery yields nothing, the list is empty and a warning explains why.
|
|
95
|
+
- Stdout is **only** the result JSON. All progress goes to stderr with the `[sb-crawl-and-list]` prefix so the orchestrator can pipe stdout straight into `JSON.parse`.
|
|
96
|
+
|
|
97
|
+
## Dependencies
|
|
98
|
+
|
|
99
|
+
The host project must have `cheerio` installed (already a dep of `sb-review-checks`). Node ≥ 20 (uses native `fetch`, `URL`, `AbortController`). XML parsing is done with the bundled `lib/sitemap-parser.mjs` — no `xml2js` required. The SimilarBuild installer handles `cheerio`.
|
|
@@ -0,0 +1,437 @@
|
|
|
1
|
+
#!/usr/bin/env node
|
|
2
|
+
// crawl-and-list.mjs — Discover a site's pages and emit a normalized JSON
|
|
3
|
+
// list ready for human confirmation in /build-site.
|
|
4
|
+
//
|
|
5
|
+
// Strategy: try sitemap first (cheap, authoritative), fall back to a
|
|
6
|
+
// same-origin BFS via fetch + cheerio. No chromium. SPA-only sites get a
|
|
7
|
+
// warning, not a fabricated list.
|
|
8
|
+
//
|
|
9
|
+
// Output: JSON to stdout AND pages-list.json in --output-dir. Logs to stderr
|
|
10
|
+
// with the [sb-crawl-and-list] prefix. Exit codes: 0=ok, 1=script error,
|
|
11
|
+
// 2=invalid args.
|
|
12
|
+
//
|
|
13
|
+
// cheerio is imported lazily inside main() so --help and arg validation
|
|
14
|
+
// work without the dep installed.
|
|
15
|
+
|
|
16
|
+
import { parseArgs } from 'node:util'
|
|
17
|
+
import { mkdir, writeFile, readFile, access } from 'node:fs/promises'
|
|
18
|
+
import { join, resolve } from 'node:path'
|
|
19
|
+
import { parseSitemap, fetchAndParseSitemap } from './lib/sitemap-parser.mjs'
|
|
20
|
+
import { buildBlocklist, filterUrl } from './lib/blocklist-filter.mjs'
|
|
21
|
+
import { classifyUrl } from './lib/page-classifier.mjs'
|
|
22
|
+
import { fallbackCrawl } from './lib/fallback-crawler.mjs'
|
|
23
|
+
|
|
24
|
+
const HELP = `
|
|
25
|
+
crawl-and-list.mjs — Discover a site's pages for SimilarBuild batch builds.
|
|
26
|
+
|
|
27
|
+
Required:
|
|
28
|
+
--root-url <url> Origin to crawl (e.g. https://example.com).
|
|
29
|
+
--output-dir <dir> Directory for pages-list.json (and sitemap-raw.xml
|
|
30
|
+
if a sitemap was downloaded).
|
|
31
|
+
|
|
32
|
+
Optional:
|
|
33
|
+
--max-depth <n> Default 3. Used only by the fallback crawler.
|
|
34
|
+
--max-pages <n> Default 200. Hard cap on the emitted list.
|
|
35
|
+
--respect-robots-txt Default true. Pass --respect-robots-txt false to ignore.
|
|
36
|
+
--blocklist <csv> Extra path prefixes, comma-separated. Adds to defaults.
|
|
37
|
+
--sitemap-path <file> Read sitemap from this local path instead of probing.
|
|
38
|
+
--timeout <ms> Default 10000. Per-request fetch timeout.
|
|
39
|
+
--help Show this message.
|
|
40
|
+
|
|
41
|
+
Exit codes: 0=ok, 1=script error, 2=invalid args.
|
|
42
|
+
`
|
|
43
|
+
|
|
44
|
+
const STDERR_PREFIX = '[sb-crawl-and-list]'
|
|
45
|
+
|
|
46
|
+
function fail(msg, code = 2) {
|
|
47
|
+
process.stderr.write(`${STDERR_PREFIX} ${msg}\n`)
|
|
48
|
+
process.exit(code)
|
|
49
|
+
}
|
|
50
|
+
|
|
51
|
+
function log(msg) {
|
|
52
|
+
process.stderr.write(`${STDERR_PREFIX} ${msg}\n`)
|
|
53
|
+
}
|
|
54
|
+
|
|
55
|
+
const { values } = parseArgs({
|
|
56
|
+
options: {
|
|
57
|
+
'root-url': { type: 'string' },
|
|
58
|
+
'output-dir': { type: 'string' },
|
|
59
|
+
'max-depth': { type: 'string', default: '3' },
|
|
60
|
+
'max-pages': { type: 'string', default: '200' },
|
|
61
|
+
'respect-robots-txt': { type: 'string', default: 'true' },
|
|
62
|
+
blocklist: { type: 'string' },
|
|
63
|
+
'sitemap-path': { type: 'string' },
|
|
64
|
+
timeout: { type: 'string', default: '10000' },
|
|
65
|
+
help: { type: 'boolean', default: false },
|
|
66
|
+
},
|
|
67
|
+
strict: false,
|
|
68
|
+
})
|
|
69
|
+
|
|
70
|
+
if (values.help) {
|
|
71
|
+
process.stdout.write(HELP)
|
|
72
|
+
process.exit(0)
|
|
73
|
+
}
|
|
74
|
+
|
|
75
|
+
if (!values['root-url']) fail('missing --root-url')
|
|
76
|
+
if (!values['output-dir']) fail('missing --output-dir')
|
|
77
|
+
|
|
78
|
+
let ROOT_URL
|
|
79
|
+
try {
|
|
80
|
+
const u = new URL(values['root-url'])
|
|
81
|
+
if (!/^https?:$/.test(u.protocol)) fail(`--root-url must be http(s): got "${values['root-url']}"`)
|
|
82
|
+
// Normalize: strip query/hash, keep trailing slash off the pathname
|
|
83
|
+
u.search = ''
|
|
84
|
+
u.hash = ''
|
|
85
|
+
if (u.pathname.length > 1 && u.pathname.endsWith('/')) u.pathname = u.pathname.slice(0, -1)
|
|
86
|
+
ROOT_URL = u.toString()
|
|
87
|
+
} catch (err) {
|
|
88
|
+
if (err?.message?.startsWith('--root-url')) throw err
|
|
89
|
+
fail(`--root-url is not a valid URL: ${err.message}`)
|
|
90
|
+
}
|
|
91
|
+
|
|
92
|
+
const OUTPUT_DIR = resolve(values['output-dir'])
|
|
93
|
+
const MAX_DEPTH = parseInt(values['max-depth'], 10)
|
|
94
|
+
const MAX_PAGES = parseInt(values['max-pages'], 10)
|
|
95
|
+
const TIMEOUT = parseInt(values.timeout, 10)
|
|
96
|
+
const RESPECT_ROBOTS = values['respect-robots-txt'] !== 'false'
|
|
97
|
+
const BLOCKLIST = buildBlocklist(values.blocklist)
|
|
98
|
+
const SITEMAP_PATH = values['sitemap-path'] ? resolve(values['sitemap-path']) : null
|
|
99
|
+
|
|
100
|
+
if (!Number.isFinite(MAX_DEPTH) || MAX_DEPTH < 0) fail('--max-depth must be a non-negative integer')
|
|
101
|
+
if (!Number.isFinite(MAX_PAGES) || MAX_PAGES < 1) fail('--max-pages must be a positive integer')
|
|
102
|
+
if (!Number.isFinite(TIMEOUT) || TIMEOUT < 1) fail('--timeout must be a positive integer (ms)')
|
|
103
|
+
|
|
104
|
+
// ---------- HTTP helpers ----------
|
|
105
|
+
|
|
106
|
+
const USER_AGENT =
|
|
107
|
+
'Mozilla/5.0 (compatible; SimilarBuild/1.0; +https://similarbuild.local) Chrome/120 Safari/537.36'
|
|
108
|
+
|
|
109
|
+
async function fetchText(url) {
|
|
110
|
+
const ac = new AbortController()
|
|
111
|
+
const t = setTimeout(() => ac.abort(), TIMEOUT)
|
|
112
|
+
try {
|
|
113
|
+
const res = await fetch(url, {
|
|
114
|
+
signal: ac.signal,
|
|
115
|
+
redirect: 'follow',
|
|
116
|
+
headers: {
|
|
117
|
+
'user-agent': USER_AGENT,
|
|
118
|
+
accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
|
|
119
|
+
},
|
|
120
|
+
})
|
|
121
|
+
if (!res.ok) {
|
|
122
|
+
return { ok: false, status: res.status, reason: res.statusText || `http-${res.status}` }
|
|
123
|
+
}
|
|
124
|
+
const text = await res.text()
|
|
125
|
+
return { ok: true, status: res.status, text }
|
|
126
|
+
} catch (err) {
|
|
127
|
+
return {
|
|
128
|
+
ok: false,
|
|
129
|
+
status: 0,
|
|
130
|
+
reason: err?.name === 'AbortError' ? 'timeout' : err?.message || String(err),
|
|
131
|
+
}
|
|
132
|
+
} finally {
|
|
133
|
+
clearTimeout(t)
|
|
134
|
+
}
|
|
135
|
+
}
|
|
136
|
+
|
|
137
|
+
// ---------- robots.txt ----------
|
|
138
|
+
|
|
139
|
+
// Minimal robots.txt parser — enough to decide "is `/` disallowed for `*`?".
|
|
140
|
+
// We don't try to honor crawl-delay or user-agent-specific allowlists; if the
|
|
141
|
+
// user wants finer control they pass --respect-robots-txt false and own the
|
|
142
|
+
// decision. The check is "does any User-agent: * block disallow exactly /?".
|
|
143
|
+
function isRootDisallowed(robotsTxt) {
|
|
144
|
+
const lines = robotsTxt.split(/\r?\n/)
|
|
145
|
+
let inStarBlock = false
|
|
146
|
+
for (const raw of lines) {
|
|
147
|
+
const line = raw.replace(/#.*$/, '').trim()
|
|
148
|
+
if (!line) {
|
|
149
|
+
// Blank line ends a block.
|
|
150
|
+
inStarBlock = false
|
|
151
|
+
continue
|
|
152
|
+
}
|
|
153
|
+
const m = /^([A-Za-z-]+)\s*:\s*(.*)$/.exec(line)
|
|
154
|
+
if (!m) continue
|
|
155
|
+
const key = m[1].toLowerCase()
|
|
156
|
+
const val = m[2].trim()
|
|
157
|
+
if (key === 'user-agent') {
|
|
158
|
+
inStarBlock = val === '*'
|
|
159
|
+
continue
|
|
160
|
+
}
|
|
161
|
+
if (inStarBlock && key === 'disallow') {
|
|
162
|
+
// Exactly "/" or empty val (= disallow nothing). We only care about
|
|
163
|
+
// total bans on the root.
|
|
164
|
+
if (val === '/') return true
|
|
165
|
+
}
|
|
166
|
+
}
|
|
167
|
+
return false
|
|
168
|
+
}
|
|
169
|
+
|
|
170
|
+
async function checkRobots() {
|
|
171
|
+
if (!RESPECT_ROBOTS) return { allowed: true, warnings: [] }
|
|
172
|
+
const robotsUrl = new URL('/robots.txt', ROOT_URL).toString()
|
|
173
|
+
const r = await fetchText(robotsUrl)
|
|
174
|
+
if (!r.ok) {
|
|
175
|
+
// Missing robots.txt (404) is permission-by-default, not an error.
|
|
176
|
+
return { allowed: true, warnings: [] }
|
|
177
|
+
}
|
|
178
|
+
if (isRootDisallowed(r.text)) {
|
|
179
|
+
return {
|
|
180
|
+
allowed: false,
|
|
181
|
+
warnings: ['robots-disallow-root'],
|
|
182
|
+
}
|
|
183
|
+
}
|
|
184
|
+
return { allowed: true, warnings: [] }
|
|
185
|
+
}
|
|
186
|
+
|
|
187
|
+
// ---------- main ----------
|
|
188
|
+
|
|
189
|
+
async function readLocalSitemap(path) {
|
|
190
|
+
try {
|
|
191
|
+
await access(path)
|
|
192
|
+
} catch {
|
|
193
|
+
fail(`--sitemap-path not found: ${path}`)
|
|
194
|
+
}
|
|
195
|
+
try {
|
|
196
|
+
return await readFile(path, 'utf8')
|
|
197
|
+
} catch (err) {
|
|
198
|
+
fail(`cannot read --sitemap-path: ${err.message}`)
|
|
199
|
+
}
|
|
200
|
+
}
|
|
201
|
+
|
|
202
|
+
async function trySitemapFetch() {
|
|
203
|
+
const candidates = [
|
|
204
|
+
new URL('/sitemap.xml', ROOT_URL).toString(),
|
|
205
|
+
new URL('/sitemap_index.xml', ROOT_URL).toString(),
|
|
206
|
+
]
|
|
207
|
+
for (const url of candidates) {
|
|
208
|
+
log(`probing ${url}`)
|
|
209
|
+
const r = await fetchText(url)
|
|
210
|
+
if (!r.ok) {
|
|
211
|
+
log(` → ${r.status} ${r.reason}`)
|
|
212
|
+
continue
|
|
213
|
+
}
|
|
214
|
+
return { ok: true, url, text: r.text }
|
|
215
|
+
}
|
|
216
|
+
return { ok: false }
|
|
217
|
+
}
|
|
218
|
+
|
|
219
|
+
async function main() {
|
|
220
|
+
await mkdir(OUTPUT_DIR, { recursive: true })
|
|
221
|
+
|
|
222
|
+
const warnings = []
|
|
223
|
+
let blockedCount = 0
|
|
224
|
+
let duplicateCount = 0
|
|
225
|
+
|
|
226
|
+
// 1. robots.txt gate
|
|
227
|
+
const robots = await checkRobots()
|
|
228
|
+
warnings.push(...robots.warnings)
|
|
229
|
+
if (!robots.allowed) {
|
|
230
|
+
log('robots.txt disallows root for *; emitting empty list')
|
|
231
|
+
return emit({
|
|
232
|
+
rootUrl: ROOT_URL,
|
|
233
|
+
source: 'sitemap',
|
|
234
|
+
pageCount: 0,
|
|
235
|
+
blocked: 0,
|
|
236
|
+
duplicates: 0,
|
|
237
|
+
warnings,
|
|
238
|
+
pages: [],
|
|
239
|
+
})
|
|
240
|
+
}
|
|
241
|
+
|
|
242
|
+
// 2. Sitemap path: explicit --sitemap-path wins, then probe origin
|
|
243
|
+
let rawSitemapXml = null
|
|
244
|
+
let sitemapUrl = null
|
|
245
|
+
let sitemapWarnings = []
|
|
246
|
+
let urls = []
|
|
247
|
+
let source = 'crawl' // updated below if sitemap yields anything
|
|
248
|
+
|
|
249
|
+
if (SITEMAP_PATH) {
|
|
250
|
+
rawSitemapXml = await readLocalSitemap(SITEMAP_PATH)
|
|
251
|
+
log(`reading sitemap from ${SITEMAP_PATH} (${rawSitemapXml.length} bytes)`)
|
|
252
|
+
const parsed = parseSitemap(rawSitemapXml)
|
|
253
|
+
if (parsed.malformed) {
|
|
254
|
+
sitemapWarnings.push('sitemap-malformed-local')
|
|
255
|
+
} else if (parsed.kind === 'index') {
|
|
256
|
+
sitemapWarnings.push('sitemap-path-is-index-not-fetched')
|
|
257
|
+
// Don't recurse into a local sitemapindex — we'd have to fetch its
|
|
258
|
+
// children which defeats the purpose of "use this local file."
|
|
259
|
+
} else {
|
|
260
|
+
urls = parsed.urlSetUrls
|
|
261
|
+
source = 'sitemap-path'
|
|
262
|
+
}
|
|
263
|
+
} else {
|
|
264
|
+
const probed = await trySitemapFetch()
|
|
265
|
+
if (probed.ok) {
|
|
266
|
+
sitemapUrl = probed.url
|
|
267
|
+
rawSitemapXml = probed.text
|
|
268
|
+
log(`got sitemap from ${probed.url} (${probed.text.length} bytes)`)
|
|
269
|
+
// Re-parse via fetchAndParseSitemap so sitemapindex recursion works.
|
|
270
|
+
const result = await fetchAndParseSitemap(probed.url, fetchText, { maxChildren: 50 })
|
|
271
|
+
sitemapWarnings.push(...result.warnings)
|
|
272
|
+
if (result.urls.length > 0) {
|
|
273
|
+
urls = result.urls
|
|
274
|
+
source = 'sitemap'
|
|
275
|
+
}
|
|
276
|
+
}
|
|
277
|
+
}
|
|
278
|
+
warnings.push(...sitemapWarnings)
|
|
279
|
+
|
|
280
|
+
if (urls.length === 0 && !SITEMAP_PATH) {
|
|
281
|
+
log('no sitemap usable, falling back to fetch + cheerio crawl')
|
|
282
|
+
let cheerio
|
|
283
|
+
try {
|
|
284
|
+
cheerio = await import('cheerio')
|
|
285
|
+
} catch (err) {
|
|
286
|
+
process.stderr.write(
|
|
287
|
+
`${STDERR_PREFIX} missing dependency 'cheerio': ${err?.message || err}\n` +
|
|
288
|
+
`Install with: npm i cheerio\n`,
|
|
289
|
+
)
|
|
290
|
+
process.exit(1)
|
|
291
|
+
}
|
|
292
|
+
const loadHtml = (html) => cheerio.load(html, { decodeEntities: false })
|
|
293
|
+
|
|
294
|
+
const crawl = await fallbackCrawl({
|
|
295
|
+
rootUrl: ROOT_URL,
|
|
296
|
+
fetcher: fetchText,
|
|
297
|
+
loadHtml,
|
|
298
|
+
maxDepth: MAX_DEPTH,
|
|
299
|
+
maxPages: MAX_PAGES,
|
|
300
|
+
blocklistPrefixes: BLOCKLIST,
|
|
301
|
+
sitemapUrl,
|
|
302
|
+
})
|
|
303
|
+
warnings.push(...crawl.warnings)
|
|
304
|
+
// Crawl results are { url, depth }; convert to a flat list of URLs and
|
|
305
|
+
// we'll re-classify + record depth below.
|
|
306
|
+
return emit(
|
|
307
|
+
buildResult({
|
|
308
|
+
crawlPages: crawl.results,
|
|
309
|
+
urlsFromSitemap: null,
|
|
310
|
+
sitemapUrl,
|
|
311
|
+
rawSitemapXml,
|
|
312
|
+
source: 'crawl',
|
|
313
|
+
warnings,
|
|
314
|
+
}),
|
|
315
|
+
)
|
|
316
|
+
}
|
|
317
|
+
|
|
318
|
+
// Sitemap yielded URLs — apply blocklist + dedupe + classify + cap.
|
|
319
|
+
if (urls.length > 1000) {
|
|
320
|
+
warnings.push(`sitemap-truncated:${urls.length}`)
|
|
321
|
+
}
|
|
322
|
+
|
|
323
|
+
return emit(
|
|
324
|
+
buildResult({
|
|
325
|
+
crawlPages: null,
|
|
326
|
+
urlsFromSitemap: urls,
|
|
327
|
+
sitemapUrl,
|
|
328
|
+
rawSitemapXml,
|
|
329
|
+
source,
|
|
330
|
+
warnings,
|
|
331
|
+
}),
|
|
332
|
+
)
|
|
333
|
+
|
|
334
|
+
// ---- helpers closed over top-level state ----
|
|
335
|
+
|
|
336
|
+
function buildResult({ crawlPages, urlsFromSitemap, sitemapUrl, rawSitemapXml, source, warnings }) {
|
|
337
|
+
const originHost = new URL(ROOT_URL).hostname.toLowerCase()
|
|
338
|
+
const seen = new Map() // normalized URL → { depth, source }
|
|
339
|
+
let localBlocked = 0
|
|
340
|
+
let localDup = 0
|
|
341
|
+
|
|
342
|
+
function consider(url, depth, perEntrySource) {
|
|
343
|
+
const filtered = filterUrl(url, {
|
|
344
|
+
originHost,
|
|
345
|
+
blocklistPrefixes: BLOCKLIST,
|
|
346
|
+
sitemapUrl,
|
|
347
|
+
})
|
|
348
|
+
if (!filtered.keep) {
|
|
349
|
+
if (filtered.reason === 'blocklisted' || filtered.reason === 'xml-or-feed') {
|
|
350
|
+
localBlocked += 1
|
|
351
|
+
}
|
|
352
|
+
return
|
|
353
|
+
}
|
|
354
|
+
if (seen.has(filtered.url)) {
|
|
355
|
+
localDup += 1
|
|
356
|
+
// Keep the lowest depth seen (closer to root = better signal)
|
|
357
|
+
const existing = seen.get(filtered.url)
|
|
358
|
+
if (depth < existing.depth) existing.depth = depth
|
|
359
|
+
return
|
|
360
|
+
}
|
|
361
|
+
seen.set(filtered.url, { depth, source: perEntrySource })
|
|
362
|
+
}
|
|
363
|
+
|
|
364
|
+
if (urlsFromSitemap) {
|
|
365
|
+
for (const u of urlsFromSitemap) {
|
|
366
|
+
// Sitemap doesn't give us depth; treat root as depth 0 and others as 1.
|
|
367
|
+
const isRoot = (() => {
|
|
368
|
+
try {
|
|
369
|
+
const p = new URL(u).pathname.replace(/\/+$/, '')
|
|
370
|
+
return p === '' || p === '/'
|
|
371
|
+
} catch {
|
|
372
|
+
return false
|
|
373
|
+
}
|
|
374
|
+
})()
|
|
375
|
+
consider(u, isRoot ? 0 : 1, 'sitemap')
|
|
376
|
+
}
|
|
377
|
+
} else if (crawlPages) {
|
|
378
|
+
for (const { url, depth } of crawlPages) {
|
|
379
|
+
consider(url, depth, 'crawl')
|
|
380
|
+
}
|
|
381
|
+
}
|
|
382
|
+
|
|
383
|
+
// Cap
|
|
384
|
+
let pages = [...seen.entries()].map(([url, meta]) => ({
|
|
385
|
+
url,
|
|
386
|
+
type: classifyUrl(url),
|
|
387
|
+
title: null,
|
|
388
|
+
depth: meta.depth,
|
|
389
|
+
source: meta.source,
|
|
390
|
+
}))
|
|
391
|
+
|
|
392
|
+
// Stable order: home first (depth 0), then by depth, then alphabetical.
|
|
393
|
+
pages.sort((a, b) => {
|
|
394
|
+
if (a.depth !== b.depth) return a.depth - b.depth
|
|
395
|
+
return a.url.localeCompare(b.url)
|
|
396
|
+
})
|
|
397
|
+
|
|
398
|
+
if (pages.length > MAX_PAGES) {
|
|
399
|
+
const total = pages.length
|
|
400
|
+
pages = pages.slice(0, MAX_PAGES)
|
|
401
|
+
warnings.push(`max-pages-cap:${MAX_PAGES}-of-${total}`)
|
|
402
|
+
}
|
|
403
|
+
|
|
404
|
+
return {
|
|
405
|
+
rootUrl: ROOT_URL,
|
|
406
|
+
source,
|
|
407
|
+
pageCount: pages.length,
|
|
408
|
+
blocked: localBlocked,
|
|
409
|
+
duplicates: localDup,
|
|
410
|
+
warnings,
|
|
411
|
+
pages,
|
|
412
|
+
_rawSitemapXml: rawSitemapXml, // not part of output; consumed by emit()
|
|
413
|
+
}
|
|
414
|
+
}
|
|
415
|
+
|
|
416
|
+
async function emit(result) {
|
|
417
|
+
const rawSitemapXml = result._rawSitemapXml
|
|
418
|
+
delete result._rawSitemapXml
|
|
419
|
+
|
|
420
|
+
const listPath = join(OUTPUT_DIR, 'pages-list.json')
|
|
421
|
+
await writeFile(listPath, JSON.stringify(result, null, 2), 'utf8')
|
|
422
|
+
log(`wrote ${listPath}`)
|
|
423
|
+
if (rawSitemapXml) {
|
|
424
|
+
const sitemapPath = join(OUTPUT_DIR, 'sitemap-raw.xml')
|
|
425
|
+
await writeFile(sitemapPath, rawSitemapXml, 'utf8')
|
|
426
|
+
log(`wrote ${sitemapPath}`)
|
|
427
|
+
}
|
|
428
|
+
process.stdout.write(JSON.stringify(result))
|
|
429
|
+
process.stdout.write('\n')
|
|
430
|
+
process.exit(0)
|
|
431
|
+
}
|
|
432
|
+
}
|
|
433
|
+
|
|
434
|
+
main().catch((err) => {
|
|
435
|
+
process.stderr.write(`${STDERR_PREFIX} fatal: ${err?.stack || err}\n`)
|
|
436
|
+
process.exit(1)
|
|
437
|
+
})
|