npm - @conduction/docusaurus-preset - Versions diffs - 3.4.0 → 3.6.0 - Mend

@conduction/docusaurus-preset 3.4.0 → 3.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

package/README.md +79 -0
package/bin/validate-ai-baseline.mjs +193 -0
package/package.json +5 -1
package/src/index.js +85 -12

package/README.md CHANGED

@@ -21,6 +21,7 @@ A few non-negotiables encoded by the package CSS and worth knowing about:
 - **Brand-default navbar** — locale-dropdown + GitHub link. Sites override `items[]` for site-specific navigation.
 - **Brand-default footer** — three-column link grid + Conduction-tells (KvK, BTW, address). Per-property override: pass `footer: { links: [...] }` to swap columns and inherit the brand copyright unchanged. Spread `baseFooterLinks()` to keep one or two brand columns alongside site-specific ones.
 - **Sensible defaults** — `trailingSlash`, `onBrokenLinks: 'warn'`, `respectPrefersColorScheme`, dark-mode brand mapping.
+- **AI-crawler baseline** — Organization + WebSite JSON-LD on every page, `SoftwareApplication` JSON-LD from `<DetailHero>`, `FAQPage` JSON-LD from `<FAQ>`, default `og:image` + Twitter card meta, sitemap options, and a `postBuild` plugin that emits `robots.txt` when the site does not ship its own. See the AI baseline section below for the validator and content requirements.
 ## Usage
@@ -163,6 +164,84 @@ import '@conduction/docusaurus-preset/diagrams';
 This is how product sites such as `mydash.conduction.nl/docs/...` adopt the brand without copying CSS or theme code, and stay in sync as the design-system evolves.
+## AI-crawler baseline
+Every site that consumes this preset inherits a contract that AI crawlers (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google AI Overviews) expect. The schemas, meta tags, and `robots.txt` ship automatically; sites only have to opt in to the content that surfaces them.
+**What the preset ships**
+| Surface | Source | How a site uses it |
+| --- | --- | --- |
+| Organization + WebSite JSON-LD | `headTags` injected by `createConfig` | Automatic on every page |
+| `og:image`, `twitter:site`, `twitter:card`, `og:type` | `themeConfig.image` + `themeConfig.metadata` defaults | Override per site by passing `themeConfig.image: 'img/og-my-app.png'` |
+| Default `robots.txt` | `conduction-ai-crawling` postBuild plugin | Drop `static/robots.txt` to override |
+| `SoftwareApplication` JSON-LD | `<DetailHero appId="my-app" .../>` | Pages that should advertise the app must render `<DetailHero>` with an `appId` that resolves in `src/data/apps-registry.js`. No DetailHero means no schema. |
+| `FAQPage` JSON-LD | `<FAQ>` with `<FAQItem question=...>` children | Drop a `<FAQ>` block onto a page; the schema is auto-emitted from the children |
+| Sitemap options | Default `sitemap` config on the classic preset | Sites that override `presets` must include their own `sitemap` block |
+**Validating a site**
+The preset ships a generic 9-check validator as a `bin`. Wire it into the site's build:
+```jsonc
+// docs/package.json
+{
+  "scripts": {
+    "build": "docusaurus build",
+    "postbuild": "validate-ai-baseline",
+    "validate:ai-baseline": "validate-ai-baseline"
+  }
+}
+```
+`npm run build` now exits non-zero if any of these regress: `robots.txt` exists, names at least one AI search bot in a `User-agent:` line, and has a `Sitemap:` line; `sitemap.xml` has at least one URL and carries `<lastmod>` on at least half of them; the homepage emits at least two valid JSON-LD blocks including Organization + WebSite, plus `og:image` / `og:type` / `twitter:site` / `twitter:card`; and the `og:image` URL resolves to a real file. Sites can extend the validator with extra checks (per-app SoftwareApplication, FAQPage on specific pages, etc.) by adding their own `scripts/validate-ai-baseline-site.mjs` and chaining it.
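A site-level script of the kind mentioned above might look roughly like the following. This is a minimal sketch, not code from the package: `findMissingTypes` and the inline sample page are hypothetical, and only the `<script type="application/ld+json">` regex is borrowed from the shipped validator. It reads the top-level `@type` of each JSON-LD block and reports which wanted types are absent.

```javascript
// Hypothetical body for scripts/validate-ai-baseline-site.mjs (not shipped by the preset).

// Returns the body of every <script type="application/ld+json"> tag in the HTML.
function extractJsonLdBlocks(html) {
  const re = /<script\b[^>]*\btype="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/g;
  return [...html.matchAll(re)].map(m => m[1]);
}

// Lists which of the wanted schema.org types do not appear as a top-level @type.
function findMissingTypes(html, wanted) {
  const present = new Set();
  for (const block of extractJsonLdBlocks(html)) {
    try {
      const type = JSON.parse(block)['@type'];
      // @type may be a string or an array; normalise both to a list.
      for (const t of [].concat(type ?? [])) present.add(t);
    } catch {
      // Unparseable blocks are skipped here; the generic validator
      // already reports invalid JSON-LD on the homepage.
    }
  }
  return wanted.filter(t => !present.has(t));
}

// Inline sample page; a real script would read HTML files from build/.
const page =
  '<script type="application/ld+json">{"@type":"SoftwareApplication","name":"my-app"}</script>';
console.log(findMissingTypes(page, ['SoftwareApplication', 'FAQPage']));
```

One way to chain it, assuming the script lives at the path the README suggests, is `"postbuild": "validate-ai-baseline && node scripts/validate-ai-baseline-site.mjs"`.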
+**Per-app docs site checklist**
+For a per-app docs site to satisfy the full schema contract, the landing page must render `<DetailHero appId="my-app" .../>` with an `appId` that exists in `src/data/apps-registry.js`. That single render emits the `SoftwareApplication` JSON-LD with category mapping (Data and Processes -> BusinessApplication, Connectors -> DeveloperApplication, etc.), `operatingSystem: 'Nextcloud'`, and the EUPL-1.2 license URL. Sites that build a custom landing without `<DetailHero>` get only Organization + WebSite, not the per-app schema.
+**Opting out**
+```js
+createConfig({
+  title: '...',
+  url: '...',
+  baseUrl: '/',
+  aiCrawling: { disable: true },          // skip the whole postBuild plugin
+  // or, finer-grained (use instead of the line above, not alongside it):
+  // aiCrawling: { disable: { robotsTxt: true } },  // ship our own static/robots.txt
+});
+```
+## Traditional SEO baseline
+The same `createConfig` call also wires the traditional-search baseline that pairs with the AI-crawler one. Google, Bing, DuckDuckGo and the AI surfaces those engines feed (Copilot, ChatGPT Search, Perplexity) all benefit.
+**What's shipped automatically**
+- **Sitemap with `lastmod`** from file mtime; `priority` and `changefreq` are dropped because Google ignores them. `/page/N/` pagination and `/academy/tags/` thin pages are excluded sitewide so they don't dilute crawl budget.
+- **Footer legal links default to absolute URLs on `www.conduction.nl`** (`/privacy`, `/terms`, `/iso`). Earlier defaults used relative routes that 404'd on every per-app subdomain — the SEO audit found ~645 sitewide broken internal links across the fleet from this single mistake. Marketing sites that self-host these pages pass `legalLinks: { privacy: '/privacy', ... }` to opt back into relative routing.
+- **Search Console / Bing Webmaster / Yandex / Facebook / Pinterest verification meta tags** via `opts.searchConsoleVerification`. Each present token becomes a `<meta>` tag in the global head, which lets a non-DNS-admin teammate verify the property via the console UI:
+```js
+createConfig({
+  // ...
+  searchConsoleVerification: {
+    google: 'abc123...',     // -> <meta name="google-site-verification">
+    bing:   'xyz...',        // -> <meta name="msvalidate.01">
+    yandex: '...',           // -> <meta name="yandex-verification">
+    facebook: '...',         // -> <meta name="facebook-domain-verification">
+    pinterest: '...',        // -> <meta name="p:domain_verify">
+  },
+});
+```
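The token-to-tag mapping above can also be read as a lookup table. The sketch below is illustrative only: `VERIFICATION_META_NAMES` and `buildVerificationTags` are hypothetical names, and the package itself uses one `if` block per provider. The meta names and the `{tagName, attributes}` tag shape are taken from this diff.

```javascript
// Meta names as listed in the README example; the table-driven shape is an assumption.
const VERIFICATION_META_NAMES = {
  google: 'google-site-verification',
  bing: 'msvalidate.01',
  yandex: 'yandex-verification',
  facebook: 'facebook-domain-verification',
  pinterest: 'p:domain_verify',
};

// Builds one head-tag descriptor per provider that has a non-empty token.
function buildVerificationTags(verification = {}) {
  return Object.entries(VERIFICATION_META_NAMES)
    .filter(([key]) => Boolean(verification[key]))
    .map(([key, name]) => ({
      tagName: 'meta',
      attributes: {name, content: verification[key]},
    }));
}

// Empty tokens are skipped, so only the Google tag is produced here.
console.log(buildVerificationTags({google: 'abc123', bing: ''}));
```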
+**Known follow-ups (not yet automatic)**
+- `BreadcrumbList` JSON-LD on every page. The DocBreadcrumbs DOM already renders; the schema needs a theme swizzle. Tracked as a 3.7+ candidate.
+- `TechArticle` JSON-LD on docs pages with `dateModified` from git mtime. Same swizzle scope.
+- Per-page title format. Docusaurus defaults to `{Page} | {Site}` which produces `OpenRegister | OpenRegister` on per-app homepages. Override per page via frontmatter `title:` for now; a `titleFormat` option may land later.
 ## Releasing
 Releases auto-publish on push to `main`, driven by [semantic-release](https://semantic-release.gitbook.io/) reading [conventional-commit](https://www.conventionalcommits.org/) messages. The [.github/workflows/publish-packages.yml](../.github/workflows/publish-packages.yml) workflow walks every commit since the last `@conduction/docusaurus-preset-v*` tag and decides what to ship:

package/bin/validate-ai-baseline.mjs ADDED

@@ -0,0 +1,193 @@
+#!/usr/bin/env node
+/**
+ * bin/validate-ai-baseline.mjs
+ *
+ * Generic AI-crawler baseline validator. Runs as a postbuild step on
+ * every Conduction Docusaurus site that consumes
+ * @conduction/docusaurus-preset >= 3.4.0. Asserts the SSG output
+ * carries the contract AI crawlers (GPTBot, ClaudeBot, PerplexityBot,
+ * OAI-SearchBot, Claude-SearchBot, Google AI Overviews) expect.
+ *
+ * Universal checks only - no site-specific routes. Sites that want
+ * additional gates (per-app SoftwareApplication, FAQPage on specific
+ * pages, etc.) chain their own script after this one, as the README
+ * describes. See conduction-website's version for an example of
+ * additional checks.
+ *
+ * Exit codes:
+ *   0   all checks passed
+ *   1   one or more checks failed (CI should block)
+ *   2   build directory not found (script invoked before build)
+ */
+import {readFileSync, existsSync, statSync} from 'node:fs';
+import {join, resolve} from 'node:path';
+const buildDir = resolve(process.argv[2] || 'build');
+if (!existsSync(buildDir)) {
+  console.error(`✗ build directory not found: ${buildDir}`);
+  console.error(`  Run \`npx docusaurus build\` first.`);
+  process.exit(2);
+}
+const results = [];
+function check(name, fn) {
+  try {
+    const r = fn();
+    results.push({name, ok: r.ok, msg: r.msg});
+  } catch (e) {
+    results.push({name, ok: false, msg: `threw: ${e.message}`});
+  }
+}
+function readBuild(p) {
+  return readFileSync(join(buildDir, p), 'utf8');
+}
+/* robots.txt - shipped by the preset's ai-crawling plugin (or the
+   site's own static/robots.txt). Either way, the file must exist
+   and name at least one AI search bot so a `grep` audit can confirm
+   the posture at a glance. */
+check('robots.txt exists and is non-empty', () => {
+  const path = join(buildDir, 'robots.txt');
+  if (!existsSync(path)) return {ok: false, msg: 'missing'};
+  const size = statSync(path).size;
+  if (size < 50) return {ok: false, msg: `too small (${size} bytes)`};
+  return {ok: true, msg: `${size} bytes`};
+});
+check('robots.txt names at least one AI search bot', () => {
+  const body = readBuild('robots.txt');
+  const candidates = ['OAI-SearchBot', 'Claude-SearchBot', 'PerplexityBot', 'ChatGPT-User', 'Claude-User'];
+  const found = candidates.filter(ua => body.includes(`User-agent: ${ua}`));
+  if (found.length === 0) {
+    return {ok: false, msg: `none of [${candidates.join(', ')}] referenced`};
+  }
+  return {ok: true, msg: `${found.length} bot(s): ${found.join(', ')}`};
+});
+check('robots.txt has a Sitemap line', () => {
+  const body = readBuild('robots.txt');
+  const matches = body.match(/^Sitemap:\s+https?:\/\//gm) || [];
+  if (matches.length === 0) return {ok: false, msg: 'no Sitemap: line'};
+  return {ok: true, msg: `${matches.length} sitemap line(s)`};
+});
+/* sitemap.xml - emitted by @docusaurus/plugin-sitemap (loaded via
+   the classic preset). Locale-specific sitemaps (e.g. /nl/sitemap.xml)
+   are present for i18n builds; we only check the canonical one
+   because some sites are single-locale. */
+check('sitemap.xml exists and has at least 1 URL', () => {
+  const path = join(buildDir, 'sitemap.xml');
+  if (!existsSync(path)) return {ok: false, msg: 'missing'};
+  const body = readBuild('sitemap.xml');
+  const n = (body.match(/<loc>/g) || []).length;
+  if (n < 1) return {ok: false, msg: 'no <loc> entries'};
+  return {ok: true, msg: `${n} URLs`};
+});
+/* sitemap.xml should ship <lastmod> on every URL. Google treats lastmod
+   as the only sitemap-level signal that actually informs recrawl
+   priority, and only when it's trustworthy. Sites that ship priority +
+   changefreq without lastmod (the Docusaurus default before preset
+   3.6.0) get treated as having no freshness signal. */
+check('sitemap.xml emits <lastmod> on URLs', () => {
+  const body = readBuild('sitemap.xml');
+  const locCount = (body.match(/<loc>/g) || []).length;
+  const lastmodCount = (body.match(/<lastmod>/g) || []).length;
+  if (locCount === 0) return {ok: false, msg: 'no <loc> entries to compare against'};
+  if (lastmodCount === 0) return {ok: false, msg: `0 / ${locCount} URLs have <lastmod> — enable sitemap.lastmod in docusaurus.config`};
+  const ratio = lastmodCount / locCount;
+  if (ratio < 0.5) return {ok: false, msg: `only ${lastmodCount} / ${locCount} URLs have <lastmod>`};
+  return {ok: true, msg: `${lastmodCount} / ${locCount} URLs (${Math.round(ratio * 100)}%)`};
+});
+/* Helper for the JSON-LD checks below. Docusaurus emits ld+json
+   tags via two paths with different attribute ordering: top-level
+   headTags renders <script type="..."> first, while Helmet (used
+   by <Head> from inside React components like <DetailHero>, <FAQ>)
+   prefixes data-rh="true". The regex matches either ordering. */
+function extractJsonLdBlocks(html) {
+  const out = [];
+  const re = /<script\b[^>]*\btype="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/g;
+  let m;
+  while ((m = re.exec(html)) !== null) {
+    out.push(m[1]);
+  }
+  return out;
+}
+check('homepage emits >= 2 JSON-LD blocks, all valid JSON', () => {
+  if (!existsSync(join(buildDir, 'index.html'))) return {ok: false, msg: 'no index.html'};
+  const html = readBuild('index.html');
+  const blocks = extractJsonLdBlocks(html);
+  if (blocks.length < 2) return {ok: false, msg: `only ${blocks.length} block(s)`};
+  for (const [i, b] of blocks.entries()) {
+    try {JSON.parse(b);} catch (e) {
+      return {ok: false, msg: `block ${i} invalid JSON: ${e.message}`};
+    }
+  }
+  return {ok: true, msg: `${blocks.length} blocks, all valid`};
+});
+check('homepage JSON-LD includes Organization and WebSite', () => {
+  const html = readBuild('index.html');
+  const types = extractJsonLdBlocks(html).map(b => {
+    try {return JSON.parse(b)['@type'];} catch {return null;}
+  });
+  const want = ['Organization', 'WebSite'];
+  const missing = want.filter(t => !types.includes(t));
+  if (missing.length) return {ok: false, msg: `missing @type: ${missing.join(', ')}`};
+  return {ok: true, msg: types.filter(Boolean).join(' + ')};
+});
+/* Social-card meta. og:image is the one that breaks LinkedIn /
+   Slack / AI previews when it 404s, so we also resolve the URL to
+   a local file in the build output. */
+function metaTag(html, key) {
+  const re = new RegExp(`<meta[^>]+(?:name|property)="${key}"[^>]+content="([^"]+)"`, 'i');
+  const m = html.match(re);
+  return m ? m[1] : null;
+}
+check('homepage has og:image, og:type, twitter:site, twitter:card', () => {
+  const html = readBuild('index.html');
+  const checks = {
+    'og:image': metaTag(html, 'og:image'),
+    'og:type': metaTag(html, 'og:type'),
+    'twitter:site': metaTag(html, 'twitter:site'),
+    'twitter:card': metaTag(html, 'twitter:card'),
+  };
+  const missing = Object.entries(checks).filter(([, v]) => !v).map(([k]) => k);
+  if (missing.length) return {ok: false, msg: `missing: ${missing.join(', ')}`};
+  return {ok: true, msg: 'all four present'};
+});
+check('og:image URL resolves to a file in the build', () => {
+  const html = readBuild('index.html');
+  const url = metaTag(html, 'og:image');
+  if (!url) return {ok: false, msg: 'no og:image meta'};
+  const path = url.replace(/^https?:\/\/[^/]+\//, '');
+  const local = join(buildDir, path);
+  if (!existsSync(local)) return {ok: false, msg: `og:image refers to ${url}, not found at ${local}`};
+  const size = statSync(local).size;
+  if (size < 1024) return {ok: false, msg: `og:image file suspiciously small (${size} bytes)`};
+  return {ok: true, msg: `${path} (${size} bytes)`};
+});
+/* Report */
+let failed = 0;
+for (const {name, ok, msg} of results) {
+  const icon = ok ? '✓' : '✗';
+  console.log(`${icon} ${name} - ${msg}`);
+  if (!ok) failed++;
+}
+console.log('');
+if (failed) {
+  console.error(`${failed} of ${results.length} checks failed.`);
+  console.error('AI-crawler baseline regressed. Fix the failures above before merging.');
+  process.exit(1);
+} else {
+  console.log(`All ${results.length} AI-baseline checks passed.`);
+}
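The `og:image` check strips only the origin from the URL before looking in the build directory. For a site served under a non-root `baseUrl`, the remaining path would still carry that prefix. The sketch below shows one possible way to handle that, assuming the build output does not nest files under `baseUrl`. `localPathForUrl` is a hypothetical helper and is not part of the validator.

```javascript
// Hypothetical helper: maps an og:image URL (absolute or relative) to a
// path relative to the build directory, dropping an optional baseUrl prefix.
function localPathForUrl(url, baseUrl = '/') {
  // The placeholder origin only lets URL() parse relative values; it is never fetched.
  const {pathname} = new URL(url, 'https://placeholder.invalid');
  const base = baseUrl.endsWith('/') ? baseUrl : `${baseUrl}/`;
  const relative = pathname.startsWith(base)
    ? pathname.slice(base.length)
    : pathname.replace(/^\/+/, '');
  // Query strings and fragments are already excluded by using pathname.
  return decodeURIComponent(relative);
}

console.log(localPathForUrl('https://docs.example.org/my-app/img/og.png?v=2', '/my-app/'));
console.log(localPathForUrl('/img/og.png'));
```

The result would then be passed to `join(buildDir, ...)` in place of the regex-based `path` value.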

package/package.json CHANGED

@@ -1,9 +1,12 @@
 {
   "name": "@conduction/docusaurus-preset",
-  "version": "3.4.0",
+  "version": "3.6.0",
   "scripts": {
     "prepack": "node scripts/prepack-bundle-css.js"
   },
+  "bin": {
+    "validate-ai-baseline": "./bin/validate-ai-baseline.mjs"
+  },
   "description": "Conduction brand preset for Docusaurus 3. Tokens, theme, navbar, footer, i18n config for nl/en/de/fr, and the React component library that powers conduction.nl and the Conduction product sites.",
   "main": "src/index.js",
   "exports": {
@@ -28,6 +31,7 @@
   "files": [
     "src/",
     "static/",
+    "bin/",
     "README.md",
     "MISSING_COMPONENTS.md"
   ],
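The new `bin` entry exposes the validator as the `validate-ai-baseline` command, and the script's header documents three exit codes. A CI wrapper could translate them along these lines. This is a sketch with hypothetical helper names (`describeExit`, `shouldBlockCi`); only the meaning of each code comes from the validator's header comment.

```javascript
// Exit codes as documented in the validator's header comment.
const EXIT_MEANINGS = {
  0: 'all checks passed',
  1: 'one or more checks failed',
  2: 'build directory not found',
};

// Turns a process exit status into a human-readable message.
function describeExit(status) {
  return EXIT_MEANINGS[status] ?? `unexpected exit status ${status}`;
}

// Any non-zero status, including "build directory not found", should stop the pipeline.
function shouldBlockCi(status) {
  return status !== 0;
}

console.log(describeExit(2), '| block CI:', shouldBlockCi(2));
```

In a Node wrapper the status could come from the `status` field returned by `spawnSync` in `node:child_process`, which is `null` when the process was killed by a signal. `describeExit` reports that case as unexpected.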

package/src/index.js CHANGED

@@ -160,7 +160,7 @@ function buildWebsiteJsonLd(opts) {
  * the site's tags after its own defaults.
  */
 function buildAiHeadTags(opts) {
-  return [
+  const tags = [
     {
       tagName: 'script',
       attributes: {type: 'application/ld+json'},
@@ -172,6 +172,45 @@ function buildAiHeadTags(opts) {
       innerHTML: JSON.stringify(buildWebsiteJsonLd(opts)),
     },
   ];
+  /* Search Console verification meta tags. Sites pass tokens via
+     opts.searchConsoleVerification = { google: '...', bing: '...',
+     yandex: '...', facebook: '...', pinterest: '...' }; each present
+     token becomes a meta tag. Verifying via meta (vs DNS TXT) lets a
+     non-DNS-admin teammate access Search Console / Bing Webmaster
+     Tools. */
+  const verification = opts.searchConsoleVerification || {};
+  if (verification.google) {
+    tags.push({
+      tagName: 'meta',
+      attributes: {name: 'google-site-verification', content: verification.google},
+    });
+  }
+  if (verification.bing) {
+    tags.push({
+      tagName: 'meta',
+      attributes: {name: 'msvalidate.01', content: verification.bing},
+    });
+  }
+  if (verification.yandex) {
+    tags.push({
+      tagName: 'meta',
+      attributes: {name: 'yandex-verification', content: verification.yandex},
+    });
+  }
+  if (verification.facebook) {
+    tags.push({
+      tagName: 'meta',
+      attributes: {name: 'facebook-domain-verification', content: verification.facebook},
+    });
+  }
+  if (verification.pinterest) {
+    tags.push({
+      tagName: 'meta',
+      attributes: {name: 'p:domain_verify', content: verification.pinterest},
+    });
+  }
+  return tags;
 }
 /**
@@ -184,15 +223,34 @@ function buildAiHeadTags(opts) {
  * Sites passing their own classic preset config can override by
  * including a `sitemap` key alongside `docs`/`blog`/`theme`.
  */
+/**
+ * Sitemap defaults. Google ignores `changefreq` and `priority` (and has
+ * for years; the @docusaurus/plugin-sitemap defaults are wrong on this
+ * point). `lastmod` is the only signal Google actually uses, and only
+ * if the dates are accurate, so we ship lastmod from file mtime. Bing
+ * still reads all three, but omitting the other two is harmless.
+ *
+ * Sites with locale-specific tag pages and pagination should keep the
+ * exclude list in sync. Pagination (`/page/N/`) and tag pages
+ * (`/tags/**`) are documented Docusaurus duplicate-content traps;
+ * we exclude them by default so they neither dilute crawl budget nor
+ * confuse AI summarisers.
+ */
 const DEFAULT_SITEMAP_OPTIONS = {
-  changefreq: 'weekly',
-  priority: 0.5,
+  changefreq: null,
+  priority: null,
+  lastmod: 'date',
   ignorePatterns: [
     '/academy/tags/**',
     '/nl/academy/tags/**',
     '/en/academy/tags/**',
     '/de/academy/tags/**',
     '/fr/academy/tags/**',
+    '/page/**',
+    '/nl/page/**',
+    '/en/page/**',
+    '/de/page/**',
+    '/fr/page/**',
   ],
   filename: 'sitemap.xml',
 };
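The `ignorePatterns` list repeats each excluded path once per locale, which is what the comment means by keeping the list in sync. A small generator would produce the same entries from two inputs. This is a sketch only: `buildIgnorePatterns` is a hypothetical helper, and the package hard-codes the list instead.

```javascript
// Hypothetical helper: expands each excluded path into an unprefixed glob
// plus one glob per locale prefix.
function buildIgnorePatterns(paths, locales) {
  const patterns = [];
  for (const path of paths) {
    patterns.push(`${path}/**`);
    for (const locale of locales) {
      patterns.push(`/${locale}${path}/**`);
    }
  }
  return patterns;
}

// Locales from the package description (nl/en/de/fr); paths from the list above.
console.log(buildIgnorePatterns(['/academy/tags', '/page'], ['nl', 'en', 'de', 'fr']));
```

Adding a locale or another excluded path then becomes a one-word change rather than several new lines.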
@@ -489,9 +547,15 @@ function createConfig(opts) {
         footerBrand: opts.footerBrand || null,
         /* Legal-bar links (Privacy / Terms / ISO) plus the two ISO
            9001 + 27001 certification badges on the right side of the
-           canal-footer. Default keeps prior behaviour (pages live at
-           /privacy, /terms, /iso on docs.conduction.nl + www.conduction.nl).
-           Consumer sites that don't ship those pages can opt out per
+           canal-footer.
+           Defaults point at the canonical Conduction pages on
+           www.conduction.nl rather than relative routes. Earlier
+           defaults used /privacy, /terms, /iso which 404'd on every
+           per-app subdomain (openregister.conduction.nl/privacy etc.)
+           because those routes only exist on the marketing site. The
+           SEO audit found ~645 sitewide broken internal links across
+           the fleet from this single mistake. Sites can override per
            slot to silence broken-link warnings:
              legalLinks: {
@@ -499,12 +563,21 @@ function createConfig(opts) {
                terms:   false,     // hide the Terms link
                iso:     false,     // hide the ISO link AND the cert badges
                                    // (badges follow iso link by default)
-               // any slot can also take a string for an external URL:
-               privacy: 'https://docs.conduction.nl/privacy',
-               // certs default-follow iso, override here:
-               isoCertifications: true | false,
-             } */
-        legalLinks: opts.legalLinks || {},
+               privacy: '/privacy', // self-host: pass a relative route
+               certifications: true | false,
+             }
+           The marketing site at conduction-website passes legalLinks
+           explicitly with relative routes so its self-hosted Privacy /
+           Terms / ISO pages keep working as before. */
+        legalLinks: Object.assign(
+          {
+            privacy: 'https://www.conduction.nl/privacy',
+            terms: 'https://www.conduction.nl/terms',
+            iso: 'https://www.conduction.nl/iso',
+          },
+          opts.legalLinks || {}
+        ),
         /* AI-friendly social-card defaults. `image` ships from the
            preset's static/img/og-conduction.png and gets served at every
            consuming site's /img/og-conduction.png; drop your own