@pointsharp/antora-llm-generator 1.1.3 → 1.2.0

Files changed (3)
  1. package/README.md +22 -34
  2. package/llm-generator.js +55 -110
  3. package/package.json +1 -5
package/README.md CHANGED
@@ -1,11 +1,16 @@
  # Antora LLM Generator Extension

- An [Antora](https://antora.org) extension that creates two auxiliary text files after each site build following the [llmstxt.org specification](https://llmstxt.org/):
+ An [Antora](https://antora.org) extension that creates an auxiliary text file after each site build following the [llmstxt.org specification](https://llmstxt.org/):

  - **`llms.txt`** - A structured index with links to all documentation pages, organized into sections
- - **`llms-full.txt`** - Complete page content in markdown format with URLs for each page

- Both files help large-language models ingest your documentation with proper structure, URLs, and context.
+ The extension also generates compatibility and discovery files for crawlers and tools:
+
+ - **`llm.txt`** - Same content as `llms.txt`
+ - **`sitemap-llms.xml`** - Sitemap containing the `llms.txt` URL
+ - Updates the main `sitemap.xml` sitemap index to include `sitemap-llms.xml`
+
+ These files help large-language models ingest your documentation with proper structure, URLs, and context.

  ---

@@ -35,9 +40,9 @@ antora:

  ### Configuration options

- - **`summary`** - Optional. Appears as a blockquote at the top of both output files (following llmstxt.org spec).
+ - **`summary`** - Optional. Appears as a blockquote at the top of `llms.txt` (following llmstxt.org spec).
  - **`details`** - Optional. Appears as regular text after the summary. Can be multi-line markdown.
- - **`skippaths`** - Optional. Array of glob patterns. Files matching these patterns are omitted from both output files.
+ - **`skippaths`** - Optional. Array of glob patterns. Files matching these patterns are omitted from `llms.txt`.
  - **`debug`** - Optional. Set to `true` to enable verbose logging during build. Default: `false`.

  ### Navigation-based organization (default)
@@ -77,12 +82,10 @@ You can control how individual pages are included using AsciiDoc page attributes

  ```adoc
  :page-llms-ignore: true
- :page-llms-full-ignore: true
  :description: Brief description of this page
  ```

  - **`:page-llms-ignore:`** - Omit this page from `llms.txt` entirely
- - **`:page-llms-full-ignore:`** - Include in `llms.txt` but omit from `llms-full.txt`
  - **`:description:`** - Standard AsciiDoc attribute used for both HTML metadata and page descriptions in `llms.txt`

  ### Page descriptions
@@ -137,31 +140,14 @@ Optional details from playbook config
  - [Page Title](https://url): Description
  ```

- ### llms-full.txt structure
-
- Contains complete page content with clear URL references:
-
- ```markdown
- # Your Site Title - Complete Documentation
-
- > Optional summary from playbook config
-
- ---
-
- ## Page Title
-
- **URL:** https://example.com/page
-
- [Page content in markdown format...]
+ ### Sitemap integration

- ---
-
- ## Another Page Title
+ The extension also adds `llms.txt` to sitemap discovery without pretending it is an Antora page:

- **URL:** https://example.com/another
+ - Generates `sitemap-llms.xml` with the `llms.txt` URL
+ - Updates the main `sitemap.xml` sitemap index to reference `sitemap-llms.xml`

- [Page content in markdown format...]
- ```
+ This matches Antora best practice by keeping `llms.txt` in the site catalog instead of inserting it into the content catalog as a fake page.

  ---

@@ -173,11 +159,13 @@ Run your Antora build as usual:
  antora antora-playbook.yml
  ```

- After completion, four files appear in the build output directory:
+ After completion, these files appear in the build output directory:
+
  - `llms.txt` and `llm.txt` (both contain the structured index with links)
- - `llms-full.txt` and `llm-full.txt` (both contain complete page content)
+ - `sitemap-llms.xml` (contains the `llms.txt` URL)
+ - `sitemap.xml` is updated to include `sitemap-llms.xml`

- The duplicate filenames ensure compatibility with different naming conventions.
+ The duplicate `llms.txt` and `llm.txt` filenames ensure compatibility with different naming conventions.

  ---

@@ -186,7 +174,7 @@ The duplicate filenames ensure compatibility with different naming conventions.
  The extension runs **silently** by default, showing only a success message when complete:

  ```
- Generated llms.txt and llms-full.txt
+ Generated llms.txt
  ```

  ### Enabling verbose output
@@ -235,7 +223,7 @@ antora:
  - "**/internal/**"
  ```

- **This produces:** The summary appears as a blockquote and details as regular text at the top of both `llms.txt` and `llms-full.txt`, providing important context for LLMs before they see the documentation links.
+ **This produces:** The summary appears as a blockquote and details as regular text at the top of `llms.txt`, providing important context for LLMs before they see the documentation links.

  ---

package/llm-generator.js CHANGED
@@ -1,11 +1,7 @@
  "use strict";

- const { NodeHtmlMarkdown } = require("node-html-markdown");
  const { minimatch } = require("minimatch");

- // HTML to Markdown converter for page content
- const nhm = new NodeHtmlMarkdown();
-
  // UTF-8 BOM (Byte Order Mark) prepended to output files
  // This helps browsers and editors recognize the file as UTF-8 encoded
  const UTF8_BOM = '\uFEFF';
@@ -78,12 +74,6 @@ module.exports.register = function (context, { config }) {
        indexContent += `\n${details}\n`;
      }

-     // llms-full.txt: Complete page content in markdown
-     let fullContent = `# ${siteTitle} - Complete Documentation\n\n`;
-     if (summary) {
-       fullContent += `> ${summary}\n\n`;
-     }
-
      // =============================================================================
      // STEP 3: Build page index
      // =============================================================================
@@ -147,57 +137,12 @@ module.exports.register = function (context, { config }) {
      }

      // =============================================================================
-     // STEP 6: Build llms-full.txt complete page content
-     // =============================================================================
-     // For each page: convert HTML to markdown, resolve relative links, add URL header
-     for (const [path, page] of pagesByPath.entries()) {
-       // Skip pages with :page-llms-full-ignore: attribute
-       if (page.asciidoc.attributes["page-llms-full-ignore"]) {
-         if (debug) logger.info(`Skipping page from full content with 'page-llms-full-ignore' attribute: ${page.src.path}`);
-         continue;
-       }
-
-       const pageUrl = `${siteUrl}/${path}`;
-
-       // Convert HTML content to markdown (decode as UTF-8)
-       const htmlContent = page.contents.toString('utf8');
-       let plainText = nhm.translate(htmlContent);
-
-       // Convert relative URLs to absolute (e.g., "page.html" -> "https://site.com/path/page.html")
-       plainText = convertRelativeLinksToAbsolute(plainText, pageUrl, logger);
-
-       // Add page section with title, URL, optional description, and content
-       fullContent += `\n---\n\n`;
-       fullContent += `## ${page.title}\n\n`;
-       fullContent += `**URL:** ${pageUrl}\n\n`;
-
-       // Include :description: attribute if present
-       if (page.asciidoc.attributes["description"]) {
-         fullContent += `*${page.asciidoc.attributes["description"]}*\n\n`;
-       }
-
-       fullContent += plainText;
-       fullContent += `\n`;
-     }
-
-     // =============================================================================
-     // STEP 7: Write output files
+     // STEP 6: Write output files
      // =============================================================================
-     // Create 4 files total (both singular and plural forms for compatibility):
-     // - llms-full.txt / llm-full.txt: Complete page content
+     // Create 2 files (both singular and plural forms for compatibility):
      // - llms.txt / llm.txt: Navigation index
      //
      // All files are encoded as UTF-8 with BOM for proper character display
-     siteCatalog.addFile({
-       out: { path: "llms-full.txt" },
-       contents: Buffer.from(UTF8_BOM + fullContent, 'utf8'),
-     });
-
-     siteCatalog.addFile({
-       out: { path: "llm-full.txt" },
-       contents: Buffer.from(UTF8_BOM + fullContent, 'utf8'),
-     });
-
      siteCatalog.addFile({
        out: { path: "llm.txt" },
        contents: Buffer.from(UTF8_BOM + indexContent, 'utf8'),
@@ -208,7 +153,56 @@ module.exports.register = function (context, { config }) {
        contents: Buffer.from(UTF8_BOM + indexContent, 'utf8'),
      });

-     logger.info("Generated llms.txt and llms-full.txt");
+     // =============================================================================
+     // STEP 8: Add llms.txt to the sitemap
+     // =============================================================================
+     // Per Dan Allen's recommendation: modify the sitemap XML directly rather than
+     // faking llms.txt as a page in the contentCatalog.
+     //
+     // Strategy:
+     // 1. Create sitemap-llms.xml containing only the llms.txt URL.
+     // 2. Find sitemap.xml (the sitemap index) in the siteCatalog and inject a
+     // <sitemap> entry pointing to sitemap-llms.xml.
+     //
+     // This keeps llms.txt discoverable by crawlers without misrepresenting it as
+     // a documentation page.
+     if (siteUrl) {
+       try {
+         const llmsSitemapXml = [
+           '<?xml version="1.0" encoding="UTF-8"?>',
+           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
+           ' <url>',
+           ` <loc>${siteUrl}/llms.txt</loc>`,
+           ' </url>',
+           '</urlset>',
+         ].join('\n');
+
+         siteCatalog.addFile({
+           out: { path: 'sitemap-llms.xml' },
+           contents: Buffer.from(llmsSitemapXml, 'utf8'),
+         });
+
+         // Find the sitemap index produced by Antora's site-mapper and add our entry
+         const sitemapIndexFile = siteCatalog.getFiles().find((f) => f.out?.path === 'sitemap.xml');
+         if (sitemapIndexFile) {
+           const current = sitemapIndexFile.contents.toString('utf8');
+           const entry = ` <sitemap>\n <loc>${siteUrl}/sitemap-llms.xml</loc>\n </sitemap>\n`;
+           sitemapIndexFile.contents = Buffer.from(
+             current.replace('</sitemapindex>', entry + '</sitemapindex>'),
+             'utf8'
+           );
+           if (debug) logger.info('Added sitemap-llms.xml to sitemap index.');
+         } else {
+           logger.warn('llm-generator: sitemap.xml not found in site catalog — sitemap update skipped. Ensure the site-mapper is enabled.');
+         }
+       } catch (err) {
+         logger.warn(`llm-generator: Could not update sitemap with llms.txt: ${err.message}`);
+       }
+     } else {
+       if (debug) logger.info('No site URL configured — skipping sitemap update for llms.txt.');
+     }
+
+     logger.info("Generated llms.txt");
    });
  };

@@ -240,15 +234,10 @@ function buildSectionsFromNavigatorData(navDataFile, siteUrl, pagesByPath, logge
      return componentSections;
    }

-   // Convert JavaScript object notation to valid JSON
+   // Evaluate JavaScript object notation safely using Function constructor
    // The navigator extension outputs JavaScript (unquoted keys), not JSON
-   let jsData = jsonMatch[1];
-   // Quote unquoted property names: name: "name":
-   jsData = jsData.replace(/([{,]\s*)([a-zA-Z_$][a-zA-Z0-9_$]*)(\s*:)/g, '$1"$2"$3');
-   // Convert single quotes to double quotes for string values
-   jsData = jsData.replace(/'([^']*)'/g, '"$1"');
-
-   const siteNavigationData = JSON.parse(jsData);
+   // We can't use simple regex to convert because of complex nested HTML strings
+   const siteNavigationData = new Function('return ' + jsonMatch[1])();
    if (debug) logger.info(`Parsed ${siteNavigationData.length} components from navigator data`);

    // Process each component
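
The hunk above replaces a regex rewrite followed by `JSON.parse` with direct evaluation of the object literal. The sketch below illustrates why the regex route can break on HTML strings. The sample literal and the helper names `regexToJson` and `evalLiteral` are hypothetical and not taken from the package; the two regular expressions are copied from the removed lines. Note that `new Function` runs whatever code the string contains, so this approach is only reasonable when the navigator data file comes from your own build.

```javascript
"use strict";

// Hypothetical sample resembling navigator output: unquoted keys plus an HTML
// string that contains a single-quoted attribute and a ", word:" sequence.
const literal = `[{name: "docs", title: "<a href='x.html'>Guide, part: one</a>"}]`;

// The removed approach: quote bare keys, then turn single quotes into double quotes.
function regexToJson(js) {
  let out = js.replace(/([{,]\s*)([a-zA-Z_$][a-zA-Z0-9_$]*)(\s*:)/g, '$1"$2"$3');
  out = out.replace(/'([^']*)'/g, '"$1"');
  return out;
}

// The replacement approach: evaluate the literal as a JavaScript expression.
function evalLiteral(js) {
  return new Function("return " + js)();
}

// The regex rewrite also touches text inside the HTML string, so the result
// is no longer valid JSON and JSON.parse throws.
let regexFailed = false;
try {
  JSON.parse(regexToJson(literal));
} catch (err) {
  regexFailed = true;
}

// Evaluation leaves string contents untouched.
const data = evalLiteral(literal);
console.log(regexFailed, data[0].name);
```

If the input cannot be fully trusted, a dedicated parser for relaxed JSON would be a safer choice than evaluation; that trade-off is outside what this diff shows.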
@@ -344,50 +333,6 @@ function buildSectionsFromAntoraNavigation(contentCatalog, siteUrl, pagesByPath,
    return componentSections;
  }

- /**
-  * Convert relative URLs in markdown links to absolute URLs
-  *
-  * LLMs need absolute URLs to properly reference pages. This function converts:
-  * - Relative links: [text](page.html) -> [text](https://site.com/path/page.html)
-  * - Fragment links: [text](#section) -> [text](https://site.com/page#section)
-  * - Already absolute links are left unchanged
-  *
-  * @param {string} markdown - Markdown content with potential relative links
-  * @param {string} baseUrl - Full URL of the current page (used as base for resolution)
-  * @param {Object} logger - Antora logger instance
-  * @returns {string} Markdown with all relative URLs converted to absolute
-  */
- function convertRelativeLinksToAbsolute(markdown, baseUrl, logger) {
-   // Match markdown links: [text](url)
-   const linkRegex = /\[([^\]]*)\]\(([^)]+)\)/g;
-
-   return markdown.replace(linkRegex, (match, text, url) => {
-     // Skip if URL is already absolute (http://, https://, mailto:, ftp:, etc.)
-     if (/^[a-z][a-z0-9+.-]*:/i.test(url)) {
-       return match;
-     }
-
-     // Handle fragment-only links (e.g., #section-name)
-     if (url.startsWith('#')) {
-       // Remove any existing fragment from base URL, then append new fragment
-       const baseWithoutFragment = baseUrl.split('#')[0];
-       return `[${text}](${baseWithoutFragment}${url})`;
-     }
-
-     // Handle relative URLs (e.g., page.html, ../other.html)
-     try {
-       // Remove existing fragment from baseUrl for clean resolution
-       const baseWithoutFragment = baseUrl.split('#')[0];
-       // Use JavaScript URL API to properly resolve relative paths
-       const absoluteUrl = new URL(url, baseWithoutFragment).href;
-       return `[${text}](${absoluteUrl})`;
-     } catch (error) {
-       // Return original if resolution fails
-       return match;
-     }
-   });
- }
-
  /**
   * Recursively collect navigation items from Antora's navigation tree
   *
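
The sitemap step added in this file is plain string manipulation, which can be sketched on its own without Antora. The helper names `escapeXml`, `buildLlmsSitemap`, and `addToSitemapIndex`, the `docs.example.com` URL, and the `sitemap-main.xml` entry are all made up for illustration. Two details are assumptions that go beyond the diff: the URL is XML-escaped, and the entry is not inserted a second time if it is already present.

```javascript
"use strict";

// Escape the characters that are special in XML text nodes.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build a one-URL sitemap for llms.txt (hypothetical helper).
function buildLlmsSitemap(siteUrl) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    "  <url>",
    `    <loc>${escapeXml(siteUrl)}/llms.txt</loc>`,
    "  </url>",
    "</urlset>",
  ].join("\n");
}

// Insert a <sitemap> entry before the closing tag of a sitemap index.
// Returns the input unchanged if the entry already exists or the closing tag is absent.
function addToSitemapIndex(indexXml, siteUrl) {
  const loc = `${escapeXml(siteUrl)}/sitemap-llms.xml`;
  if (indexXml.includes(`<loc>${loc}</loc>`)) return indexXml;
  const entry = `  <sitemap>\n    <loc>${loc}</loc>\n  </sitemap>\n`;
  // A function replacer avoids "$" in the URL being read as a replacement pattern.
  return indexXml.replace("</sitemapindex>", () => entry + "</sitemapindex>");
}

// Made-up sitemap index standing in for the one Antora's site mapper produces.
const index = [
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
  "  <sitemap>",
  "    <loc>https://docs.example.com/sitemap-main.xml</loc>",
  "  </sitemap>",
  "</sitemapindex>",
].join("\n");

const once = addToSitemapIndex(index, "https://docs.example.com");
const twice = addToSitemapIndex(once, "https://docs.example.com");
console.log(once);
```

Keeping the logic in pure functions like these makes the behaviour easy to check in isolation; inside an extension the resulting strings would still need to be wrapped in a `Buffer` and written to the site catalog, as the diff does.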
package/package.json CHANGED
@@ -1,7 +1,7 @@
  {
    "name": "@pointsharp/antora-llm-generator",
    "private": false,
-   "version": "1.1.3",
+   "version": "1.2.0",
    "description": "An Antora extension to generate llms.txt files for LLM consumption following llmstxt.org specification.",
    "main": "llm-generator.js",
    "keywords": [
@@ -16,9 +16,5 @@
    "repository": {
      "type": "git",
      "url": "git+https://github.com/pointsharp/antora-llm-generator.git"
-   },
-   "dependencies": {
-     "minimatch": "10",
-     "node-html-markdown": "^1.3.0"
    }
  }