npm - @pointsharp/antora-llm-generator - Versions diffs - 1.1.3 → 1.2.0 - Mend

@pointsharp/antora-llm-generator 1.1.3 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (3)

package/README.md CHANGED

@@ -1,11 +1,16 @@
 # Antora LLM Generator Extension
-An [Antora](https://antora.org) extension that creates two auxiliary text files after each site build following the [llmstxt.org specification](https://llmstxt.org/):
+An [Antora](https://antora.org) extension that creates an auxiliary text file after each site build following the [llmstxt.org specification](https://llmstxt.org/):
 - **`llms.txt`** - A structured index with links to all documentation pages, organized into sections
-- **`llms-full.txt`** - Complete page content in markdown format with URLs for each page
-Both files help large-language models ingest your documentation with proper structure, URLs, and context.
+The extension also generates compatibility and discovery files for crawlers and tools:
+- **`llm.txt`** - Same content as `llms.txt`
+- **`sitemap-llms.xml`** - Sitemap containing the `llms.txt` URL
+- Updates the main `sitemap.xml` sitemap index to include `sitemap-llms.xml`
+These files help large-language models ingest your documentation with proper structure, URLs, and context.
 ---
@@ -35,9 +40,9 @@ antora:
 ### Configuration options
-- **`summary`** - Optional. Appears as a blockquote at the top of both output files (following llmstxt.org spec).
+- **`summary`** - Optional. Appears as a blockquote at the top of `llms.txt` (following llmstxt.org spec).
 - **`details`** - Optional. Appears as regular text after the summary. Can be multi-line markdown.
-- **`skippaths`** - Optional. Array of glob patterns. Files matching these patterns are omitted from both output files.
+- **`skippaths`** - Optional. Array of glob patterns. Files matching these patterns are omitted from `llms.txt`.
 - **`debug`** - Optional. Set to `true` to enable verbose logging during build. Default: `false`.
 ### Navigation-based organization (default)
@@ -77,12 +82,10 @@ You can control how individual pages are included using AsciiDoc page attributes
 ```adoc
 :page-llms-ignore: true
-:page-llms-full-ignore: true
 :description: Brief description of this page
 ```
 - **`:page-llms-ignore:`** - Omit this page from `llms.txt` entirely
-- **`:page-llms-full-ignore:`** - Include in `llms.txt` but omit from `llms-full.txt`
 - **`:description:`** - Standard AsciiDoc attribute used for both HTML metadata and page descriptions in `llms.txt`
 ### Page descriptions
@@ -137,31 +140,14 @@ Optional details from playbook config
 - [Page Title](https://url): Description
 ```
-### llms-full.txt structure
-Contains complete page content with clear URL references:
-```markdown
-# Your Site Title - Complete Documentation
-> Optional summary from playbook config
----
-## Page Title
-**URL:** https://example.com/page
-[Page content in markdown format...]
+### Sitemap integration
----
-## Another Page Title
+The extension also adds `llms.txt` to sitemap discovery without pretending it is an Antora page:
-**URL:** https://example.com/another
+- Generates `sitemap-llms.xml` with the `llms.txt` URL
+- Updates the main `sitemap.xml` sitemap index to reference `sitemap-llms.xml`
-[Page content in markdown format...]
-```
+This matches Antora best practice by keeping `llms.txt` in the site catalog instead of inserting it into the content catalog as a fake page.
 ---
@@ -173,11 +159,13 @@ Run your Antora build as usual:
 antora antora-playbook.yml
 ```
-After completion, four files appear in the build output directory:
+After completion, these files appear in the build output directory:
 - `llms.txt` and `llm.txt` (both contain the structured index with links)
-- `llms-full.txt` and `llm-full.txt` (both contain complete page content)
+- `sitemap-llms.xml` (contains the `llms.txt` URL)
+- `sitemap.xml` is updated to include `sitemap-llms.xml`
-The duplicate filenames ensure compatibility with different naming conventions.
+The duplicate `llms.txt` and `llm.txt` filenames ensure compatibility with different naming conventions.
 ---
@@ -186,7 +174,7 @@ The duplicate filenames ensure compatibility with different naming conventions.
 The extension runs **silently** by default, showing only a success message when complete:
 ```
-Generated llms.txt and llms-full.txt
+Generated llms.txt
 ```
 ### Enabling verbose output
@@ -235,7 +223,7 @@ antora:
         - "**/internal/**"
 ```
-**This produces:** The summary appears as a blockquote and details as regular text at the top of both `llms.txt` and `llms-full.txt`, providing important context for LLMs before they see the documentation links.
+**This produces:** The summary appears as a blockquote and details as regular text at the top of `llms.txt`, providing important context for LLMs before they see the documentation links.
 ---
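The sitemap integration described in the README changes above amounts to inserting one `<sitemap>` entry before the closing tag of the sitemap index. The following is a minimal sketch of that step, not the package's own code: `addSitemapEntry`, the `docs.example.com` URL, and the `sitemap-component.xml` name are hypothetical and used only for illustration.

```javascript
"use strict";

// Hypothetical helper (not part of the package): returns the sitemap index
// with one extra <sitemap> entry inserted before the closing tag.
function addSitemapEntry(indexXml, siteUrl) {
  const closing = "</sitemapindex>";
  // If the input is not a sitemap index, return it unchanged.
  if (!indexXml.includes(closing)) return indexXml;
  const entry = `  <sitemap>\n    <loc>${siteUrl}/sitemap-llms.xml</loc>\n  </sitemap>\n`;
  // A replacer function keeps any "$" sequences in siteUrl from being
  // interpreted as replacement patterns by String.prototype.replace.
  return indexXml.replace(closing, () => entry + closing);
}

// Assumed shape of a sitemap index; the component sitemap name is made up.
const before = [
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
  "  <sitemap>",
  "    <loc>https://docs.example.com/sitemap-component.xml</loc>",
  "  </sitemap>",
  "</sitemapindex>",
].join("\n");

const after = addSitemapEntry(before, "https://docs.example.com");
console.log(after);
```

Returning the input unchanged when `</sitemapindex>` is absent mirrors the README's intent that the index is only updated when one exists.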

package/llm-generator.js CHANGED

@@ -1,11 +1,7 @@
 "use strict";
-const { NodeHtmlMarkdown } = require("node-html-markdown");
 const { minimatch } = require("minimatch");
-// HTML to Markdown converter for page content
-const nhm = new NodeHtmlMarkdown();
 // UTF-8 BOM (Byte Order Mark) prepended to output files
 // This helps browsers and editors recognize the file as UTF-8 encoded
 const UTF8_BOM = '\uFEFF';
@@ -78,12 +74,6 @@ module.exports.register = function (context, { config }) {
       indexContent += `\n${details}\n`;
     }
-    // llms-full.txt: Complete page content in markdown
-    let fullContent = `# ${siteTitle} - Complete Documentation\n\n`;
-    if (summary) {
-      fullContent += `> ${summary}\n\n`;
-    }
     // =============================================================================
     // STEP 3: Build page index
     // =============================================================================
@@ -147,57 +137,12 @@ module.exports.register = function (context, { config }) {
     }
     // =============================================================================
-    // STEP 6: Build llms-full.txt complete page content
-    // =============================================================================
-    // For each page: convert HTML to markdown, resolve relative links, add URL header
-    for (const [path, page] of pagesByPath.entries()) {
-      // Skip pages with :page-llms-full-ignore: attribute
-      if (page.asciidoc.attributes["page-llms-full-ignore"]) {
-        if (debug) logger.info(`Skipping page from full content with 'page-llms-full-ignore' attribute: ${page.src.path}`);
-        continue;
-      }
-      const pageUrl = `${siteUrl}/${path}`;
-      // Convert HTML content to markdown (decode as UTF-8)
-      const htmlContent = page.contents.toString('utf8');
-      let plainText = nhm.translate(htmlContent);
-      // Convert relative URLs to absolute (e.g., "page.html" -> "https://site.com/path/page.html")
-      plainText = convertRelativeLinksToAbsolute(plainText, pageUrl, logger);
-      // Add page section with title, URL, optional description, and content
-      fullContent += `\n---\n\n`;
-      fullContent += `## ${page.title}\n\n`;
-      fullContent += `**URL:** ${pageUrl}\n\n`;
-      // Include :description: attribute if present
-      if (page.asciidoc.attributes["description"]) {
-        fullContent += `*${page.asciidoc.attributes["description"]}*\n\n`;
-      }
-      fullContent += plainText;
-      fullContent += `\n`;
-    }
-    // =============================================================================
-    // STEP 7: Write output files
+    // STEP 6: Write output files
     // =============================================================================
-    // Create 4 files total (both singular and plural forms for compatibility):
-    // - llms-full.txt / llm-full.txt: Complete page content
+    // Create 2 files (both singular and plural forms for compatibility):
     // - llms.txt / llm.txt: Navigation index
     //
     // All files are encoded as UTF-8 with BOM for proper character display
-    siteCatalog.addFile({
-      out: { path: "llms-full.txt" },
-      contents: Buffer.from(UTF8_BOM + fullContent, 'utf8'),
-    });
-    siteCatalog.addFile({
-      out: { path: "llm-full.txt" },
-      contents: Buffer.from(UTF8_BOM + fullContent, 'utf8'),
-    });
     siteCatalog.addFile({
       out: { path: "llm.txt" },
       contents: Buffer.from(UTF8_BOM + indexContent, 'utf8'),
@@ -208,7 +153,56 @@ module.exports.register = function (context, { config }) {
       contents: Buffer.from(UTF8_BOM + indexContent, 'utf8'),
     });
-    logger.info("Generated llms.txt and llms-full.txt");
+    // =============================================================================
+    // STEP 8: Add llms.txt to the sitemap
+    // =============================================================================
+    // Per Dan Allen's recommendation: modify the sitemap XML directly rather than
+    // faking llms.txt as a page in the contentCatalog.
+    //
+    // Strategy:
+    //   1. Create sitemap-llms.xml containing only the llms.txt URL.
+    //   2. Find sitemap.xml (the sitemap index) in the siteCatalog and inject a
+    //      <sitemap> entry pointing to sitemap-llms.xml.
+    //
+    // This keeps llms.txt discoverable by crawlers without misrepresenting it as
+    // a documentation page.
+    if (siteUrl) {
+      try {
+        const llmsSitemapXml = [
+          '<?xml version="1.0" encoding="UTF-8"?>',
+          '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
+          '  <url>',
+          `    <loc>${siteUrl}/llms.txt</loc>`,
+          '  </url>',
+          '</urlset>',
+        ].join('\n');
+        siteCatalog.addFile({
+          out: { path: 'sitemap-llms.xml' },
+          contents: Buffer.from(llmsSitemapXml, 'utf8'),
+        });
+        // Find the sitemap index produced by Antora's site-mapper and add our entry
+        const sitemapIndexFile = siteCatalog.getFiles().find((f) => f.out?.path === 'sitemap.xml');
+        if (sitemapIndexFile) {
+          const current = sitemapIndexFile.contents.toString('utf8');
+          const entry = `  <sitemap>\n    <loc>${siteUrl}/sitemap-llms.xml</loc>\n  </sitemap>\n`;
+          sitemapIndexFile.contents = Buffer.from(
+            current.replace('</sitemapindex>', entry + '</sitemapindex>'),
+            'utf8'
+          );
+          if (debug) logger.info('Added sitemap-llms.xml to sitemap index.');
+        } else {
+          logger.warn('llm-generator: sitemap.xml not found in site catalog — sitemap update skipped. Ensure the site-mapper is enabled.');
+        }
+      } catch (err) {
+        logger.warn(`llm-generator: Could not update sitemap with llms.txt: ${err.message}`);
+      }
+    } else {
+      if (debug) logger.info('No site URL configured — skipping sitemap update for llms.txt.');
+    }
+    logger.info("Generated llms.txt");
   });
 };
@@ -240,15 +234,10 @@ function buildSectionsFromNavigatorData(navDataFile, siteUrl, pagesByPath, logge
       return componentSections;
     }
-    // Convert JavaScript object notation to valid JSON
+    // Evaluate JavaScript object notation safely using Function constructor
     // The navigator extension outputs JavaScript (unquoted keys), not JSON
-    let jsData = jsonMatch[1];
-    // Quote unquoted property names: name: → "name":
-    jsData = jsData.replace(/([{,]\s*)([a-zA-Z_$][a-zA-Z0-9_$]*)(\s*:)/g, '$1"$2"$3');
-    // Convert single quotes to double quotes for string values
-    jsData = jsData.replace(/'([^']*)'/g, '"$1"');
-    const siteNavigationData = JSON.parse(jsData);
+    // We can't use simple regex to convert because of complex nested HTML strings
+    const siteNavigationData = new Function('return ' + jsonMatch[1])();
     if (debug) logger.info(`Parsed ${siteNavigationData.length} components from navigator data`);
     // Process each component
@@ -344,50 +333,6 @@ function buildSectionsFromAntoraNavigation(contentCatalog, siteUrl, pagesByPath,
   return componentSections;
 }
-/**
- * Convert relative URLs in markdown links to absolute URLs
- *
- * LLMs need absolute URLs to properly reference pages. This function converts:
- * - Relative links: [text](page.html) -> [text](https://site.com/path/page.html)
- * - Fragment links: [text](#section) -> [text](https://site.com/page#section)
- * - Already absolute links are left unchanged
- *
- * @param {string} markdown - Markdown content with potential relative links
- * @param {string} baseUrl - Full URL of the current page (used as base for resolution)
- * @param {Object} logger - Antora logger instance
- * @returns {string} Markdown with all relative URLs converted to absolute
- */
-function convertRelativeLinksToAbsolute(markdown, baseUrl, logger) {
-  // Match markdown links: [text](url)
-  const linkRegex = /\[([^\]]*)\]\(([^)]+)\)/g;
-  return markdown.replace(linkRegex, (match, text, url) => {
-    // Skip if URL is already absolute (http://, https://, mailto:, ftp:, etc.)
-    if (/^[a-z][a-z0-9+.-]*:/i.test(url)) {
-      return match;
-    }
-    // Handle fragment-only links (e.g., #section-name)
-    if (url.startsWith('#')) {
-      // Remove any existing fragment from base URL, then append new fragment
-      const baseWithoutFragment = baseUrl.split('#')[0];
-      return `[${text}](${baseWithoutFragment}${url})`;
-    }
-    // Handle relative URLs (e.g., page.html, ../other.html)
-    try {
-      // Remove existing fragment from baseUrl for clean resolution
-      const baseWithoutFragment = baseUrl.split('#')[0];
-      // Use JavaScript URL API to properly resolve relative paths
-      const absoluteUrl = new URL(url, baseWithoutFragment).href;
-      return `[${text}](${absoluteUrl})`;
-    } catch (error) {
-      // Return original if resolution fails
-      return match;
-    }
-  });
-}
 /**
  * Recursively collect navigation items from Antora's navigation tree
  *
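The navigator-data hunk above swaps regex-based JSON conversion for evaluation with the `Function` constructor. The sketch below shows one way the regex approach can fail on an HTML string value, assuming navigator output resembles the hypothetical `sample`; the real data shape is not visible in this diff. Note that `new Function` executes whatever code it is given, so it is only appropriate for trusted build output.

```javascript
"use strict";

// Hypothetical sample of navigator output: a JavaScript literal, not JSON.
const sample = `[{name: 'docs', title: 'Docs', items: [{content: '<a href="intro.html">Intro</a>'}]}]`;

// The removed approach: rewrite the literal into JSON with two regexes.
function parseWithRegex(src) {
  // Quote unquoted property names: name: becomes "name":
  let js = src.replace(/([{,]\s*)([a-zA-Z_$][a-zA-Z0-9_$]*)(\s*:)/g, '$1"$2"$3');
  // Swap single-quoted strings for double-quoted ones. Double quotes that
  // already sit inside the string are left unescaped, which breaks JSON.
  js = js.replace(/'([^']*)'/g, '"$1"');
  return JSON.parse(js);
}

// The added approach: evaluate the literal as JavaScript.
function parseWithFunction(src) {
  return new Function("return " + src)();
}

let regexError = null;
try {
  parseWithRegex(sample);
} catch (err) {
  regexError = err;
}

const data = parseWithFunction(sample);
console.log(regexError instanceof SyntaxError);
console.log(data[0].items[0].content);
```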

package/package.json CHANGED

@@ -1,7 +1,7 @@
 {
   "name": "@pointsharp/antora-llm-generator",
   "private": false,
-  "version": "1.1.3",
+  "version": "1.2.0",
   "description": "An Antora extension to generate llms.txt files for LLM consumption following llmstxt.org specification.",
   "main": "llm-generator.js",
   "keywords": [
@@ -16,9 +16,5 @@
   "repository": {
     "type": "git",
     "url": "git+https://github.com/pointsharp/antora-llm-generator.git"
-  },
-  "dependencies": {
-    "minimatch": "10",
-    "node-html-markdown": "^1.3.0"
   }
 }
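
This hunk removes the `dependencies` block, while the visible lines of `llm-generator.js` still call `require("minimatch")`. Whether `minimatch` is declared elsewhere in `package.json` cannot be seen in this diff. A check of this kind can be sketched as follows; `undeclaredRequires` is a hypothetical helper, and it treats any bare specifier without the `node:` prefix as a package, so built-ins such as `fs` would be reported too.

```javascript
"use strict";

// Hypothetical helper: list bare module specifiers passed to require() in a
// source string that are not declared in the given package.json object.
function undeclaredRequires(source, pkg) {
  const declared = new Set([
    ...Object.keys(pkg.dependencies || {}),
    ...Object.keys(pkg.peerDependencies || {}),
    ...Object.keys(pkg.optionalDependencies || {}),
  ]);
  const found = new Set();
  const requireRegex = /require\(\s*["']([^"']+)["']\s*\)/g;
  for (const [, specifier] of source.matchAll(requireRegex)) {
    // Skip relative or absolute paths and built-ins written with the node: prefix.
    if (specifier.startsWith(".") || specifier.startsWith("/") || specifier.startsWith("node:")) continue;
    // Scoped packages keep two path segments, unscoped packages keep one.
    const parts = specifier.split("/");
    const name = specifier.startsWith("@") ? parts.slice(0, 2).join("/") : parts[0];
    if (!declared.has(name)) found.add(name);
  }
  return [...found];
}

// Inputs modeled on the lines shown in this diff.
const source = 'const { minimatch } = require("minimatch");';
const pkg113 = { dependencies: { minimatch: "10", "node-html-markdown": "^1.3.0" } };
const pkg120 = {}; // assumes no other dependency fields exist

console.log(undeclaredRequires(source, pkg113)); // empty: minimatch is declared
console.log(undeclaredRequires(source, pkg120)); // reports minimatch
```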