npm - @pointsharp/antora-llm-generator - Versions diffs - 1.0.0 - Mend

@pointsharp/antora-llm-generator 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (3)

package/README.md ADDED

@@ -0,0 +1,226 @@
+# Antora LLM Generator Extension
+An [Antora](https://antora.org) extension that creates two auxiliary text files after each site build following the [llmstxt.org specification](https://llmstxt.org/):
+- **`llms.txt`** - A structured index with links to all documentation pages, organized into sections
+- **`llms-full.txt`** - Complete page content in markdown format with URLs for each page
+Both files help large-language models ingest your documentation with proper structure, URLs, and context.
+---
+## Installation
+This extension is used locally. Ensure the required dependencies are installed:
+```bash
+npm install node-html-markdown minimatch
+```
+---
+## Playbook configuration
+Add the extension to your `antora-playbook.yml`:
+```yaml
+antora:
+  extensions:
+    - require: ./extensions/antora-llm-generator/llm-generator.js
+      summary: "Brief summary about your documentation site"
+      details: |
+        Optional longer description or important notes.
+        Can be multi-line markdown text.
+      skippaths:
+        - "someGlob/**/path"
+```
+### Configuration options
+- **`summary`** - Optional. Appears as a blockquote at the top of both output files (following llmstxt.org spec).
+- **`details`** - Optional. Appears as regular text after the summary. Can be multi-line markdown.
+- **`skippaths`** - Optional. Array of glob patterns. Files matching these patterns are omitted from both output files.
+### Navigation-based organization (default)
+By default, the extension uses your Antora navigation structure from `nav.adoc` files to organize pages in `llms.txt`. This works automatically with:
+- **Native Antora navigation** - Always works
+- **Navigator extension** - If you use it, the same navigation structure is used
+- **Any custom navigation** - As long as it's in your `nav.adoc` files
+Pages are organized by component titles as H2 sections, and the navigation hierarchy is preserved.
+**Example output:**
+```markdown
+# Your Site Title
+> Optional summary
+## Product A
+- [What is Product A?](https://example.com/product-a/latest/index.html)
+- **Installation**
+  - [Install on Windows](https://example.com/product-a/latest/installation-windows.html)
+  - [Install on Linux](https://example.com/product-a/latest/installation-linux.html)
+## Product B
+- [Introduction to Product B](https://example.com/product-b/index.html)
+```
+---
+## Page-level attributes
+You can control how individual pages are included using AsciiDoc page attributes:
+```adoc
+:page-llms-ignore: true
+:page-llms-full-ignore: true
+:description: Brief description of this page
+```
+- **`:page-llms-ignore:`** - Omit this page from both `llms.txt` and `llms-full.txt`
+- **`:page-llms-full-ignore:`** - Include in `llms.txt` but omit from `llms-full.txt`
+- **`:description:`** - Standard AsciiDoc attribute used for both HTML metadata and page descriptions in `llms.txt`
+### Page descriptions
+The extension uses the standard `:description:` attribute to add optional descriptions to page links in `llms.txt`, following the [llmstxt.org specification](https://llmstxt.org/):
+**Without description:**
+```markdown
+- [Getting Started](https://example.com/getting-started)
+```
+**With description:**
+```markdown
+- [Getting Started](https://example.com/getting-started): Quick start guide for new users
+```
+**Example:**
+```adoc
+= System Requirements
+:description: This section describes the system requirements for Product A.
+```
+The `:description:` attribute is a standard AsciiDoc attribute that serves dual purposes:
+- **HTML metadata** - Used in `<meta name="description">` tags for SEO
+- **LLM files** - Provides context for each page link in `llms.txt`
+The extension automatically resolves any attribute references within the description using the standard `{attribute}` syntax, so you can compose descriptions from other attributes if needed.
+---
+## Output structure
+### llms.txt structure
+Following the [llmstxt.org specification](https://llmstxt.org/), the file contains:
+```markdown
+# Your Site Title
+> Optional summary from playbook config
+Optional details from playbook config
+## Section Name
+- [Page Title](https://url): Optional description
+- [Another Page](https://url)
+## Another Section
+- [Page Title](https://url): Description
+```
+### llms-full.txt structure
+Contains complete page content with clear URL references:
+```markdown
+# Your Site Title - Complete Documentation
+> Optional summary from playbook config
+---
+## Page Title
+**URL:** https://example.com/page
+[Page content in markdown format...]
+---
+## Another Page Title
+**URL:** https://example.com/another
+[Page content in markdown format...]
+```
+---
+## Building the site
+Run your Antora build as usual:
+```bash
+antora antora-playbook.yml
+```
+After completion, four files appear in the build output directory:
+- `llms.txt` and `llm.txt` (both contain the structured index with links)
+- `llms-full.txt` and `llm-full.txt` (both contain complete page content)
+The duplicate filenames ensure compatibility with different naming conventions.
+---
+## Example configuration
+Here's a complete example following the [llmstxt.org](https://llmstxt.org/) best practices:
+```yaml
+antora:
+  extensions:
+    - require: ./extensions/antora-llm-generator/llm-generator.js
+      summary: "Comprehensive documentation for the Acme API, including authentication, endpoints, and best practices."
+      details: |
+        Important notes:
+        - All API endpoints require authentication via API key
+        - Rate limits apply to all endpoints (1000 requests/hour)
+        - WebSocket connections are available for real-time updates
+      skippaths:
+        - "admin/**"
+        - "**/internal/**"
+```
+**This produces:** The summary appears as a blockquote at the top of both `llms.txt` and `llms-full.txt`. The details follow the summary as regular text in `llms.txt`, giving LLMs important context before they see the documentation links.
+---
+## Tips for effective llms.txt files
+Following [llmstxt.org guidelines](https://llmstxt.org/):
+- **Use concise, clear language** in summaries and descriptions
+- **Include brief, informative descriptions** for important pages
+- **Avoid ambiguous terms** or unexplained jargon
+- **Use the "Optional" section** for secondary information that can be skipped for shorter context
+- **Organize related pages** into logical sections
+- **Test with actual LLMs** to ensure they can answer questions about your content
+---
+## Compatibility
+This extension follows the official [llmstxt.org specification](https://llmstxt.org/) for maximum compatibility with LLM tools and agents.
+For more details about the specification, visit [https://llmstxt.org/](https://llmstxt.org/).
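
The page-description behavior described in the README above can be sketched in isolation. This is a minimal illustration, not the package's own code path: `resolveAttributeRefs`, `formatIndexLine`, and the sample attribute names are hypothetical, and the sketch assumes page attributes arrive as a plain object of strings.

```javascript
"use strict";

// Hypothetical helper: replace {name} references with values from `attributes`.
// References to attributes that are not defined are left untouched.
function resolveAttributeRefs(text, attributes) {
  return text.replace(/\{([^}]+)\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(attributes, name)
      ? String(attributes[name])
      : match
  );
}

// Hypothetical helper: format one llms.txt entry, appending the description
// after a colon only when a non-empty :description: attribute is present.
function formatIndexLine(title, url, attributes = {}) {
  const raw = attributes.description;
  if (!raw) return `- [${title}](${url})`;
  return `- [${title}](${url}): ${resolveAttributeRefs(raw, attributes)}`;
}

const withDescription = formatIndexLine(
  "System Requirements",
  "https://example.com/product-a/latest/requirements.html",
  {
    "product-name": "Product A",
    description: "System requirements for {product-name} and {unknown-attr}.",
  }
);
const withoutDescription = formatIndexLine(
  "Getting Started",
  "https://example.com/getting-started"
);

console.log(withDescription);
console.log(withoutDescription);
```

The unresolved `{unknown-attr}` reference stays in the output as written, which makes a missing attribute easy to spot in the generated file.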

package/llm-generator.js ADDED

@@ -0,0 +1,470 @@
+"use strict";
+const { NodeHtmlMarkdown } = require("node-html-markdown");
+const { minimatch } = require("minimatch");
+// HTML to Markdown converter for page content
+const nhm = new NodeHtmlMarkdown();
+// UTF-8 BOM (Byte Order Mark) prepended to output files
+// This helps browsers and editors recognize the file as UTF-8 encoded
+const UTF8_BOM = '\uFEFF';
+/**
+ * Antora Extension: LLM Text Generator
+ *
+ * Generates two files following the llmstxt.org specification:
+ * - llms.txt: Navigation index with page links and descriptions
+ * - llms-full.txt: Complete page content in markdown format
+ *
+ * Both files help LLMs understand and answer questions about your documentation.
+ *
+ * Configuration options (in playbook):
+ * - summary: Optional blockquote at the top of both files
+ * - details: Optional text after summary
+ * - skippaths: Array of glob patterns to exclude pages
+ */
+module.exports.register = function (context, { config }) {
+  const logger = context.getLogger("antora-llm-generator");
+  const { playbook } = context.getVariables();
+  const siteTitle = playbook.site?.title || "Documentation";
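+  // Fall back to an empty string so generated URLs never start with "undefined"; drop any trailing slash
+  const siteUrl = (playbook.site?.url || "").replace(/\/+$/, "");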
+  // Extract configuration options
+  const skipPaths = config.skippaths || [];
+  const summary = config.summary || null;
+  const details = config.details || null;
+  logger.info(`Skip paths: ${JSON.stringify(skipPaths)}`);
+  // Helper: Check if a page path matches any skip pattern
+  const shouldSkipPath = (path) => {
+    return skipPaths.some((pattern) => minimatch(path, pattern));
+  };
+  context.on("beforePublish", ({ contentCatalog, siteCatalog }) => {
+    logger.info("Assembling content for LLM text files using Antora navigation.");
+    // =============================================================================
+    // STEP 1: Detect navigation source
+    // =============================================================================
+    // Two navigation sources are supported:
+    // 1. Antora native navigation (from nav.adoc files)
+    // 2. Navigator extension (if installed, creates site-navigation-data.js)
+    const navDataFile = siteCatalog.getFiles().find(f => f.out?.path === 'site-navigation-data.js');
+    const useNavigatorData = !!navDataFile;
+    if (useNavigatorData) {
+      logger.info("Found site-navigation-data.js - will use navigator extension data");
+    } else {
+      logger.info("Navigator extension not detected - will use Antora native navigation");
+    }
+    // =============================================================================
+    // STEP 2: Initialize output file content
+    // =============================================================================
+    // llms.txt: Navigation index with page links
+    let indexContent = `# ${siteTitle}\n`;
+    // Add optional summary as blockquote (llmstxt.org spec)
+    if (summary) {
+      indexContent += `\n> ${summary}\n`;
+    }
+    // Add optional details as regular text
+    if (details) {
+      indexContent += `\n${details}\n`;
+    }
+    // llms-full.txt: Complete page content in markdown
+    let fullContent = `# ${siteTitle} - Complete Documentation\n\n`;
+    if (summary) {
+      fullContent += `> ${summary}\n\n`;
+    }
+    // =============================================================================
+    // STEP 3: Build page index
+    // =============================================================================
+    // Create a Map for fast page lookup by URL path
+    // Key: page.out.path (e.g., "net-id-client/1.3/index.html")
+    // Value: page object with contents, attributes, etc.
+    const pagesByPath = new Map();
+    const pages = contentCatalog.findBy({ family: "page" });
+    for (const page of pages) {
+      // Skip pages without output path
+      if (!page.out) continue;
+      // Skip pages matching skippath patterns from config
+      if (shouldSkipPath(page.out.path)) {
+        logger.info(`Skipping page matching skip pattern: ${page.out.path}`);
+        continue;
+      }
+      // Skip pages with :page-llms-ignore: attribute
+      if (page.asciidoc.attributes["page-llms-ignore"]) {
+        logger.info(`Skipping page with 'page-llms-ignore' attribute: ${page.src.path}`);
+        continue;
+      }
+      pagesByPath.set(page.out.path, page);
+    }
+    logger.info(`Built pagesByPath map with ${pagesByPath.size} pages`);
+    // =============================================================================
+    // STEP 4: Build navigation sections
+    // =============================================================================
+    // Returns Map: sectionName -> array of formatted navigation items
+    // Each item is a markdown string: "- [Page Title](url): Optional description"
+    let componentSections;
+    if (useNavigatorData) {
+      // Use navigator extension's site-navigation-data.js
+      logger.info("Using navigator extension data");
+      componentSections = buildSectionsFromNavigatorData(navDataFile, siteUrl, pagesByPath, logger);
+    } else {
+      // Use Antora's native navigation structure from component versions
+      logger.info("Using Antora native navigation");
+      componentSections = buildSectionsFromAntoraNavigation(contentCatalog, siteUrl, pagesByPath, logger);
+    }
+    logger.info(`Component sections map has ${componentSections.size} sections`);
+    for (const [name, items] of componentSections.entries()) {
+      logger.info(`Section "${name}" has ${items.length} items`);
+    }
+    // =============================================================================
+    // STEP 5: Build llms.txt navigation index
+    // =============================================================================
+    // Format: H2 sections with bulleted lists of page links
+    for (const [sectionName, items] of componentSections.entries()) {
+      if (items.length > 0) {
+        indexContent += `\n## ${sectionName}\n\n`;
+        items.forEach(item => {
+          indexContent += `${item}\n`;
+        });
+      }
+    }
+    // =============================================================================
+    // STEP 6: Build llms-full.txt complete page content
+    // =============================================================================
+    // For each page: convert HTML to markdown, resolve relative links, add URL header
+    for (const [path, page] of pagesByPath.entries()) {
+      // Skip pages with :page-llms-full-ignore: attribute
+      if (page.asciidoc.attributes["page-llms-full-ignore"]) {
+        logger.info(`Skipping page from full content with 'page-llms-full-ignore' attribute: ${page.src.path}`);
+        continue;
+      }
+      const pageUrl = `${siteUrl}/${path}`;
+      // Convert HTML content to markdown (decode as UTF-8)
+      const htmlContent = page.contents.toString('utf8');
+      let plainText = nhm.translate(htmlContent);
+      // Convert relative URLs to absolute (e.g., "page.html" -> "https://site.com/path/page.html")
+      plainText = convertRelativeLinksToAbsolute(plainText, pageUrl, logger);
+      // Add page section with title, URL, optional description, and content
+      fullContent += `\n---\n\n`;
+      fullContent += `## ${page.title}\n\n`;
+      fullContent += `**URL:** ${pageUrl}\n\n`;
+      // Include :description: attribute if present
+      if (page.asciidoc.attributes["description"]) {
+        fullContent += `*${page.asciidoc.attributes["description"]}*\n\n`;
+      }
+      fullContent += plainText;
+      fullContent += `\n`;
+    }
+    // =============================================================================
+    // STEP 7: Write output files
+    // =============================================================================
+    // Create 4 files total (both singular and plural forms for compatibility):
+    // - llms-full.txt / llm-full.txt: Complete page content
+    // - llms.txt / llm.txt: Navigation index
+    //
+    // All files are encoded as UTF-8 with BOM for proper character display
+    siteCatalog.addFile({
+      out: { path: "llms-full.txt" },
+      contents: Buffer.from(UTF8_BOM + fullContent, 'utf8'),
+    });
+    siteCatalog.addFile({
+      out: { path: "llm-full.txt" },
+      contents: Buffer.from(UTF8_BOM + fullContent, 'utf8'),
+    });
+    siteCatalog.addFile({
+      out: { path: "llm.txt" },
+      contents: Buffer.from(UTF8_BOM + indexContent, 'utf8'),
+    });
+    siteCatalog.addFile({
+      out: { path: "llms.txt" },
+      contents: Buffer.from(UTF8_BOM + indexContent, 'utf8'),
+    });
+    logger.info("llms.txt and llms-full.txt files have been generated successfully.");
+  });
+};
+/**
+ * Build sections from navigator extension's site-navigation-data.js
+ *
+ * The navigator extension creates a JavaScript file with navigation data in this format:
+ * siteNavigationData=[{name:"component",title:"Title",versions:[...]}]
+ *
+ * @param {Object} navDataFile - The site-navigation-data.js file from siteCatalog
+ * @param {string} siteUrl - Base URL of the site
+ * @param {Map} pagesByPath - Map of page.out.path -> page object
+ * @param {Object} logger - Antora logger instance
+ * @returns {Map} Map of sectionName -> array of formatted navigation items
+ */
+function buildSectionsFromNavigatorData(navDataFile, siteUrl, pagesByPath, logger) {
+  const componentSections = new Map();
+  try {
+    // Parse the JavaScript file to extract JSON data
+    const navContent = navDataFile.contents.toString();
+    // Extract JSON array from: siteNavigationData=[...]
+    const jsonMatch = navContent.match(/siteNavigationData=(\[[\s\S]*\])/);
+    if (!jsonMatch) {
+      logger.warn("Could not parse site-navigation-data.js format");
+      return componentSections;
+    }
+    const siteNavigationData = JSON.parse(jsonMatch[1]);
+    logger.info(`Parsed ${siteNavigationData.length} components from navigator data`);
+    // Process each component
+    for (const component of siteNavigationData) {
+      const componentTitle = component.title || component.name;
+      // Initialize section array for this component
+      if (!componentSections.has(componentTitle)) {
+        componentSections.set(componentTitle, []);
+      }
+      const sectionArray = componentSections.get(componentTitle);
+      // Process all versions (e.g., 1.0, 1.1, latest)
+      for (const version of component.versions || []) {
+        // Process all navigation sets within version
+        for (const set of version.sets || []) {
+          if (set.items && set.items.length > 0) {
+            // Recursively collect navigation items
+            const counts = collectNavItems(set.items, sectionArray, siteUrl, pagesByPath, logger);
+            logger.debug(`Component ${component.name}: collected ${counts.found} items`);
+          }
+        }
+      }
+    }
+  } catch (error) {
+    logger.error(`Error parsing navigator data: ${error.message}`);
+  }
+  return componentSections;
+}
+/**
+ * Build sections using Antora's native navigation structure
+ *
+ * Antora stores parsed navigation in version.navigation array after processing nav.adoc files.
+ * Each component version can have multiple navigation trees (if multiple nav files exist).
+ *
+ * Structure:
+ * - component.versions[] -> each version has:
+ *   - version.navigation[] -> array of nav trees, each has:
+ *     - navTree.items[] -> recursive navigation items
+ *
+ * @param {Object} contentCatalog - Antora content catalog
+ * @param {string} siteUrl - Base URL of the site
+ * @param {Map} pagesByPath - Map of page.out.path -> page object
+ * @param {Object} logger - Antora logger instance
+ * @returns {Map} Map of sectionName -> array of formatted navigation items
+ */
+function buildSectionsFromAntoraNavigation(contentCatalog, siteUrl, pagesByPath, logger) {
+  const componentSections = new Map();
+  const components = contentCatalog.getComponents();
+  for (const component of components) {
+    for (const version of component.versions) {
+      const componentName = version.name;
+      const componentTitle = version.title || componentName;
+      const versionString = version.version;
+      logger.info(`Processing component: ${componentName} (${versionString || 'unversioned'})`);
+      // Check if version has parsed navigation tree
+      // (version.navigation is populated by Antora after parsing nav.adoc files)
+      if (!version.navigation || !Array.isArray(version.navigation) || version.navigation.length === 0) {
+        logger.debug(`No navigation tree found for ${componentName} ${versionString || 'unversioned'}`);
+        continue;
+      }
+      // Use component title as section name (H2 heading in llms.txt)
+      const sectionName = componentTitle;
+      if (!componentSections.has(sectionName)) {
+        componentSections.set(sectionName, []);
+      }
+      // Process each navigation tree (a component version can have multiple nav files)
+      const sectionArray = componentSections.get(sectionName);
+      for (const navTree of version.navigation) {
+        if (navTree.items && navTree.items.length > 0) {
+          // Recursively collect all navigation items
+          const counts = collectNavItems(
+            navTree.items,
+            sectionArray,
+            siteUrl,
+            pagesByPath,
+            logger
+          );
+          logger.info(`Section "${sectionName}": collected ${counts.found} pages from navigation`);
+        }
+      }
+    }
+  }
+  return componentSections;
+}
+/**
+ * Convert relative URLs in markdown links to absolute URLs
+ *
+ * LLMs need absolute URLs to properly reference pages. This function converts:
+ * - Relative links: [text](page.html) -> [text](https://site.com/path/page.html)
+ * - Fragment links: [text](#section) -> [text](https://site.com/page#section)
+ * - Already absolute links are left unchanged
+ *
+ * @param {string} markdown - Markdown content with potential relative links
+ * @param {string} baseUrl - Full URL of the current page (used as base for resolution)
+ * @param {Object} logger - Antora logger instance
+ * @returns {string} Markdown with all relative URLs converted to absolute
+ */
+function convertRelativeLinksToAbsolute(markdown, baseUrl, logger) {
+  // Match markdown links: [text](url)
+  const linkRegex = /\[([^\]]*)\]\(([^)]+)\)/g;
+  return markdown.replace(linkRegex, (match, text, url) => {
+    // Skip if URL is already absolute (http://, https://, mailto:, ftp:, etc.)
+    if (/^[a-z][a-z0-9+.-]*:/i.test(url)) {
+      return match;
+    }
+    // Handle fragment-only links (e.g., #section-name)
+    if (url.startsWith('#')) {
+      // Remove any existing fragment from base URL, then append new fragment
+      const baseWithoutFragment = baseUrl.split('#')[0];
+      return `[${text}](${baseWithoutFragment}${url})`;
+    }
+    // Handle relative URLs (e.g., page.html, ../other.html)
+    try {
+      // Remove existing fragment from baseUrl for clean resolution
+      const baseWithoutFragment = baseUrl.split('#')[0];
+      // Use JavaScript URL API to properly resolve relative paths
+      const absoluteUrl = new URL(url, baseWithoutFragment).href;
+      return `[${text}](${absoluteUrl})`;
+    } catch (error) {
+      logger.debug(`Failed to resolve relative URL "${url}" against base "${baseUrl}": ${error.message}`);
+      return match; // Return original if resolution fails
+    }
+  });
+}
+/**
+ * Recursively collect navigation items from Antora's navigation tree
+ *
+ * Navigation items can be:
+ * - Links: Have both url and content -> formatted as markdown links
+ * - Section headers: Have content but no url -> formatted as bold text
+ * - Nested items: Have items[] array -> processed recursively with increased indentation
+ *
+ * The function preserves navigation hierarchy through indentation:
+ * - Top-level items: "- [Page](url)"
+ * - Nested items: "  - [Page](url)"
+ * - Deeply nested: "    - [Page](url)"
+ *
+ * @param {Array} items - Array of navigation items to process
+ * @param {Array} collector - Array to collect formatted navigation strings
+ * @param {string} siteUrl - Base URL of the site
+ * @param {Map} pagesByPath - Map of page.out.path -> page object
+ * @param {Object} logger - Antora logger instance
+ * @param {string} indent - Current indentation level (increases with nesting)
+ * @returns {Object} Count object: { found: number, notFound: number }
+ */
+function collectNavItems(items, collector, siteUrl, pagesByPath, logger, indent = '') {
+  let foundCount = 0;
+  let notFoundCount = 0;
+  for (const item of items) {
+    // Process navigation items with URLs (actual page links)
+    if (item.url) {
+      // Remove leading slash from URL for pagesByPath lookup
+      // item.url format: "/component/version/page.html"
+      // pagesByPath key format: "component/version/page.html"
+      const urlPath = item.url.replace(/^\//, '');
+      const page = pagesByPath.get(urlPath);
+      if (page) {
+        foundCount++;
+        // Get page description from :description: attribute
+        let description = page.asciidoc.attributes["description"];
+        // Resolve attribute references in description (e.g., {product-name})
+        if (description) {
+          description = description.replace(/\{([^}]+)\}/g, (match, attrName) => {
+            return page.asciidoc.attributes[attrName] || match;
+          });
+        }
+        // Build full URL (item.url is relative to site root)
+        const pageUrl = `${siteUrl}${item.url}`;
+        // Format as markdown link with optional description
+        // Format: "- [Page Title](url): Description" or "- [Page Title](url)"
+        const link = description
+          ? `${indent}- [${item.content}](${pageUrl}): ${description}`
+          : `${indent}- [${item.content}](${pageUrl})`;
+        collector.push(link);
+      } else {
+        // URL in navigation doesn't match any page in pagesByPath
+        notFoundCount++;
+        logger.debug(`Navigation item URL not found in pages: ${urlPath}`);
+      }
+    }
+    // Process nested navigation items (recursive)
+    if (item.items && item.items.length > 0) {
+      // If this is a section header (has content but no URL), add it as bold text
+      if (!item.url && item.content) {
+        collector.push(`${indent}- **${item.content}**`);
+      }
+      // Recursively process nested items with increased indentation
+      const nestedCounts = collectNavItems(
+        item.items,
+        collector,
+        siteUrl,
+        pagesByPath,
+        logger,
+        indent + '  '  // Add 2 spaces for each nesting level
+      );
+      // Accumulate counts from nested items
+      foundCount += nestedCounts.found;
+      notFoundCount += nestedCounts.notFound;
+    }
+  }
+  return { found: foundCount, notFound: notFoundCount };
+}
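
To see how the navigation traversal above shapes the `llms.txt` list, here is a reduced sketch. `flattenNav` and `sampleNav` are hypothetical stand-ins, not part of the package: the sketch omits the `pagesByPath` lookup, the description handling, and the found/not-found counters, and it assumes navigation items carry `content`, `url`, and `items` fields as in the code above.

```javascript
"use strict";

// Simplified stand-in for the recursive traversal: links become markdown
// links, URL-less items that have children become bold section headers,
// and each nesting level adds two spaces of indentation.
function flattenNav(items, siteUrl, indent = "", out = []) {
  for (const item of items) {
    const hasChildren = Array.isArray(item.items) && item.items.length > 0;
    if (item.url) {
      out.push(`${indent}- [${item.content}](${siteUrl}${item.url})`);
    } else if (item.content && hasChildren) {
      out.push(`${indent}- **${item.content}**`);
    }
    if (hasChildren) {
      flattenNav(item.items, siteUrl, indent + "  ", out);
    }
  }
  return out;
}

// Hypothetical navigation tree mirroring the README's example output.
const sampleNav = [
  { content: "What is Product A?", url: "/product-a/latest/index.html" },
  {
    content: "Installation",
    items: [
      { content: "Install on Windows", url: "/product-a/latest/installation-windows.html" },
      { content: "Install on Linux", url: "/product-a/latest/installation-linux.html" },
    ],
  },
];

const lines = flattenNav(sampleNav, "https://example.com");
console.log(lines.join("\n"));
```

Passing the output array through the recursion keeps the items in document order, which is what lets the indentation alone convey the hierarchy.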

package/package.json ADDED

@@ -0,0 +1,24 @@
+{
+  "name": "@pointsharp/antora-llm-generator",
+  "private": false,
+  "version": "1.0.0",
+  "description": "An Antora extension to generate llms.txt files for LLM consumption following llmstxt.org specification.",
+  "main": "llm-generator.js",
+  "keywords": [
+    "antora",
+    "antora-extension",
+    "llm",
+    "llmstxt",
+    "rag"
+  ],
+  "author": "Pointsharp AB",
+  "license": "Apache-2.0",
+  "repository": {
+    "type": "git",
+    "url": "git+https://github.com/pointsharp/antora-llm-generator.git"
+  },
+  "dependencies": {
+    "minimatch": "10",
+    "node-html-markdown": "^1.3.0"
+  }
+}