llm-docs-builder 0.6.0 → 0.8.0

This diff shows the changes between publicly released versions of the package as they appear in their respective public registries. It is provided for informational purposes only.
data/README.md CHANGED
@@ -5,34 +5,21 @@
5
5
 
6
6
  **Optimize your documentation for LLMs and RAG systems. Reduce token consumption by 67-95%.**
7
7
 
8
- llm-docs-builder normalizes markdown documentation to be AI-friendly and generates llms.txt files. Transform relative links to absolute URLs, measure token savings when serving markdown vs HTML, and create standardized documentation indexes that help LLMs navigate your project.
8
+ llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, optimizes documents for LLM context windows, and enhances documents for RAG retrieval with hierarchical heading context and metadata.
9
9
 
10
10
  ## The Problem
11
11
 
12
12
  When LLMs fetch documentation, they typically get HTML pages designed for humans - complete with navigation bars, footers, JavaScript, CSS, and other overhead. This wastes 70-90% of your context window on content that doesn't help answer questions.
13
13
 
14
14
  **Real example from Karafka documentation:**
15
- - Human HTML version: 82.0 KB (~20,500 tokens)
16
- - AI markdown version: 4.1 KB (~1,025 tokens)
17
- - **Result: 95% reduction, 19,475 tokens saved, 20x smaller**
18
-
19
- With GPT-4's pricing at $2.50 per million input tokens, that's real money saved on every API call. More importantly, you can fit 30x more actual documentation into the same context window.
20
-
21
- ## What This Tool Does
22
-
23
- llm-docs-builder helps you optimize markdown documentation for AI consumption:
24
-
25
- 1. **Measure Savings** - Compare what your server sends to humans (HTML) vs AI bots (markdown) to quantify context window reduction
26
- 2. **Transform Markdown** - Normalize your markdown files with absolute links and consistent URL formats for better LLM navigation
27
- 3. **Generate llms.txt** - Create standardized documentation indexes following the [llms.txt](https://llmstxt.org/) specification
28
- 4. **Serve Efficiently** - Configure your server to automatically serve transformed markdown to AI bots while humans get HTML
15
+ - Human HTML version: 104.4 KB (~26,735 tokens)
16
+ - AI markdown version: 21.5 KB (~5,496 tokens)
17
+ - **Result: 79% reduction, 21,239 tokens saved, 5x smaller**
29
18
 
30
19
  ## Quick Start
31
20
 
32
21
  ### Measure Your Current Token Waste
33
22
 
34
- Before making any changes, see how much you could save:
35
-
36
23
  ```bash
37
24
  # Using Docker (no Ruby installation needed)
38
25
  docker pull mensfeld/llm-docs-builder:latest
@@ -42,235 +29,162 @@ docker run mensfeld/llm-docs-builder compare \
42
29
  --url https://yoursite.com/docs/getting-started.html
43
30
  ```
44
31
 
45
- **Example output:**
46
- ```
47
- ============================================================
48
- Context Window Comparison
49
- ============================================================
50
-
51
- Human version: 45.2 KB (~11,300 tokens)
52
- Source: https://yoursite.com/docs/page.html (User-Agent: human)
32
+ ### Transform Your Documentation
53
33
 
54
- AI version: 12.8 KB (~3,200 tokens)
55
- Source: https://yoursite.com/docs/page.html (User-Agent: AI)
34
+ ```bash
35
+ # Single file
36
+ llm-docs-builder transform --docs README.md
56
37
 
57
- ------------------------------------------------------------
58
- Reduction: 32.4 KB (72%)
59
- Token savings: 8,100 tokens (72%)
60
- Factor: 3.5x smaller
61
- ============================================================
38
+ # Bulk transform with config
39
+ llm-docs-builder bulk-transform --config llm-docs-builder.yml
62
40
  ```
63
41
 
64
- This single command shows you the potential ROI before you invest any time in optimization.
65
-
66
- ### Real-World Results
67
-
68
- **[Karafka Framework Documentation](https://karafka.io/docs)** (10 pages analyzed):
69
-
70
- | Page | Human HTML | AI Markdown | Reduction | Tokens Saved | Factor |
71
- |------|-----------|-------------|-----------|--------------|---------|
72
- | Getting Started | 82.0 KB | 4.1 KB | 95% | ~19,475 | 20.1x |
73
- | Configuration | 86.3 KB | 7.1 KB | 92% | ~19,800 | 12.1x |
74
- | Routing | 93.6 KB | 14.7 KB | 84% | ~19,725 | 6.4x |
75
- | Deployment | 122.1 KB | 33.3 KB | 73% | ~22,200 | 3.7x |
76
- | Producing Messages | 87.7 KB | 8.3 KB | 91% | ~19,850 | 10.6x |
77
- | Consuming Messages | 105.3 KB | 21.3 KB | 80% | ~21,000 | 4.9x |
78
- | Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | ~21,950 | 5.1x |
79
- | Active Job | 88.7 KB | 8.8 KB | 90% | ~19,975 | 10.1x |
80
- | Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | ~22,050 | 3.7x |
81
- | Error Handling | 93.8 KB | 13.1 KB | 86% | ~20,175 | 7.2x |
82
-
83
- **Average: 83% reduction, ~20,620 tokens saved per page, 8.4x smaller files**
84
-
85
- For a typical RAG system making 1,000 documentation queries per day:
86
- - **Before**: ~990 KB per day (~247,500 tokens) × 1,000 queries = ~247.5M tokens/day
87
- - **After**: ~165 KB per day (~41,250 tokens) × 1,000 queries = ~41.25M tokens/day
88
- - **Savings**: 83% reduction = ~206.25M tokens saved per day
89
-
90
- At GPT-4 pricing ($2.50/M input tokens), that's approximately **$500/day or $183,000/year saved** on a documentation site with moderate traffic.
91
-
92
42
  ## Installation
93
43
 
94
- ### Option 1: Docker (Recommended)
95
-
96
- No Ruby installation required. Perfect for CI/CD and quick usage:
44
+ ### Docker (Recommended)
97
45
 
98
46
  ```bash
99
- # Pull the image
100
47
  docker pull mensfeld/llm-docs-builder:latest
101
-
102
- # Create an alias for convenience
103
48
  alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'
104
-
105
- # Use like a native command
106
- llm-docs-builder compare --url https://yoursite.com/docs
107
49
  ```
108
50
 
109
- Multi-architecture support (amd64/arm64), ~50MB image size.
110
-
111
- ### Option 2: RubyGems
112
-
113
- For Ruby developers or when you need the Ruby API:
51
+ ### RubyGems
114
52
 
115
53
  ```bash
116
54
  gem install llm-docs-builder
117
55
  ```
118
56
 
119
- Or add to your Gemfile:
120
-
121
- ```ruby
122
- gem 'llm-docs-builder'
123
- ```
124
-
125
- ## Core Features
57
+ ## Features
126
58
 
127
- ### 1. Compare and Measure (The "Before You Start" Tool)
128
-
129
- Quantify exactly how much context window you're wasting:
59
+ ### Measure and Compare
130
60
 
131
61
  ```bash
132
- # Compare what your server sends to humans vs AI bots
62
+ # Compare what your server sends to humans vs AI
133
63
  llm-docs-builder compare --url https://yoursite.com/docs/page.html
134
64
 
135
- # Compare remote HTML with your local markdown
65
+ # Compare remote HTML with local markdown
136
66
  llm-docs-builder compare \
137
67
  --url https://yoursite.com/docs/api.html \
138
68
  --file docs/api.md
139
-
140
- # Verbose mode for debugging
141
- llm-docs-builder compare --url https://example.com/docs --verbose
142
69
  ```
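Typical output reports the human and AI sizes, estimated token counts, and the reduction factor, for example:

```
Human version: 45.2 KB (~11,300 tokens)
AI version:    12.8 KB (~3,200 tokens)
Reduction:     32.4 KB (72%)
Factor:        3.5x smaller
```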
143
70
 
144
- **Why this matters:**
145
- - Validates that optimizations actually work
146
- - Quantifies ROI before you invest time
147
- - Monitors ongoing effectiveness
148
- - Provides concrete metrics for stakeholders
149
-
150
- ### 2. Transform Markdown (The Normalizer)
151
-
152
- Normalize your markdown documentation to be LLM-friendly:
71
+ ### Generate llms.txt
153
72
 
154
- **Single file transformation:**
155
73
  ```bash
156
- # Expand relative links to absolute URLs
157
- llm-docs-builder transform \
158
- --docs README.md \
159
- --config llm-docs-builder.yml
74
+ # Create standardized documentation index
75
+ llm-docs-builder generate --config llm-docs-builder.yml
160
76
  ```
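The generated index lists your documentation with titles and short descriptions extracted from the markdown files, and links resolved against `base_url`. For the example configuration below it looks roughly like:

```markdown
# My Project

> Brief description

## Documentation

- [README](https://myproject.io/README.md): Complete overview and installation
- [Getting Started](https://myproject.io/getting-started.md): Quick start guide
- [API Reference](https://myproject.io/api-reference.md): Detailed API documentation
```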
161
77
 
162
- **Bulk transformation - two modes:**
78
+ ## Configuration
163
79
 
164
- **a) Separate files (default)** - Creates `.llm.md` versions alongside originals:
165
80
  ```yaml
166
81
  # llm-docs-builder.yml
167
82
  docs: ./docs
168
83
  base_url: https://myproject.io
169
- suffix: .llm # Creates README.llm.md alongside README.md
170
- convert_urls: true # .html → .md
171
- remove_comments: true # Remove HTML comments
172
- remove_badges: true # Remove badge/shield images
173
- remove_frontmatter: true # Remove YAML/TOML frontmatter
174
- normalize_whitespace: true # Clean up excessive blank lines
175
- ```
176
-
177
- ```bash
178
- llm-docs-builder bulk-transform --config llm-docs-builder.yml
179
- ```
84
+ title: My Project
85
+ description: Brief description
86
+ output: llms.txt
87
+ suffix: .llm
88
+ verbose: false
180
89
 
181
- Result:
182
- ```
183
- docs/
184
- ├── README.md ← Original (for humans)
185
- ├── README.llm.md ← Optimized (for AI)
186
- ├── api.md
187
- └── api.llm.md
188
- ```
90
+ # Basic options
91
+ convert_urls: true
92
+ remove_comments: true
93
+ remove_badges: true
94
+ remove_frontmatter: true
95
+ normalize_whitespace: true
189
96
 
190
- **b) In-place transformation** - Overwrites originals (for build pipelines):
191
- ```yaml
192
- # llm-docs-builder.yml
193
- docs: ./docs
194
- base_url: https://myproject.io
195
- suffix: "" # Transforms in-place
196
- convert_urls: true # Convert .html to .md
197
- remove_comments: true # Remove HTML comments
198
- remove_badges: true # Remove badge/shield images
199
- remove_frontmatter: true # Remove YAML/TOML frontmatter
200
- normalize_whitespace: true # Clean up excessive blank lines
97
+ # Additional compression options
98
+ remove_code_examples: false
99
+ remove_images: true
100
+ remove_blockquotes: true
101
+ remove_duplicates: true
102
+ remove_stopwords: false
103
+ simplify_links: true
104
+ generate_toc: true
105
+ custom_instruction: "This documentation is optimized for AI consumption"
106
+
107
+ # RAG enhancement options
108
+ normalize_headings: true # Add hierarchical context to headings
109
+ heading_separator: " / " # Separator for heading hierarchy
110
+ include_metadata: true # Enable enhanced llms.txt metadata
111
+ include_tokens: true # Include token counts in llms.txt
112
+ include_timestamps: true # Include update timestamps in llms.txt
113
+ include_priority: true # Include priority labels in llms.txt
114
+ calculate_compression: false # Calculate compression ratios (slower)
115
+
116
+ # Exclusions
201
117
  excludes:
202
118
  - "**/private/**"
119
+ - "**/drafts/**"
203
120
  ```
204
121
 
205
- ```bash
206
- llm-docs-builder bulk-transform --config llm-docs-builder.yml
207
- ```
208
-
209
- Perfect for CI/CD where you transform docs before deployment.
210
-
211
- **What gets normalized:**
212
- - **Links**: Relative → Absolute URLs (`./api.md` → `https://yoursite.com/api.md`)
213
- - **URLs**: HTML → Markdown format (`.html` → `.md`)
214
- - **Comments**: HTML comments removed (`<!-- ... -->`)
215
- - **Badges**: Shield/badge images removed (CI badges, version badges, etc.)
216
- - **Frontmatter**: YAML/TOML metadata removed (Jekyll, Hugo, etc.)
217
- - **Whitespace**: Excessive blank lines reduced (3+ → 2 max)
218
- - Clean markdown structure preserved
219
- - No content modification, just intelligent cleanup
220
-
221
- ### 3. Generate llms.txt (The Standard)
222
-
223
- Create a standardized documentation index following the [llms.txt](https://llmstxt.org/) specification:
122
+ **Configuration precedence:**
123
+ 1. CLI flags (highest)
124
+ 2. Config file
125
+ 3. Defaults
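For example, a flag given on the command line wins over the same setting in the config file:

```bash
# uses llm-docs-builder.yml but overrides the output path
llm-docs-builder generate --config llm-docs-builder.yml --output docs/llms.txt
```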
224
126
 
225
- ```yaml
226
- # llm-docs-builder.yml
227
- docs: ./docs
228
- base_url: https://myproject.io
229
- title: My Project
230
- description: A library that does amazing things
231
- output: llms.txt
232
- ```
127
+ ## CLI Commands
233
128
 
234
129
  ```bash
235
- llm-docs-builder generate --config llm-docs-builder.yml
130
+ llm-docs-builder compare [options] # Measure token savings
131
+ llm-docs-builder transform [options] # Transform single file
132
+ llm-docs-builder bulk-transform [options] # Transform directory
133
+ llm-docs-builder generate [options] # Generate llms.txt
134
+ llm-docs-builder parse [options] # Parse llms.txt
135
+ llm-docs-builder validate [options] # Validate llms.txt
136
+ llm-docs-builder version # Show version
236
137
  ```
237
138
 
238
- **Generated output:**
239
- ```markdown
240
- # My Project
139
+ **Common options:**
140
+ ```
141
+ -c, --config PATH Configuration file
142
+ -d, --docs PATH Documentation path
143
+ -o, --output PATH Output file
144
+ -u, --url URL URL for comparison
145
+ -v, --verbose Detailed output
146
+ ```
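The `parse` and `validate` commands operate on an existing llms.txt file, passed via `--docs`:

```bash
llm-docs-builder validate --docs llms.txt
llm-docs-builder parse --docs llms.txt --verbose
```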
241
147
 
242
- > A library that does amazing things
148
+ ## Ruby API
243
149
 
244
- ## Documentation
150
+ ```ruby
151
+ require 'llm_docs_builder'
245
152
 
246
- - [README](https://myproject.io/README.md): Complete overview and installation
247
- - [Getting Started](https://myproject.io/getting-started.md): Quick start guide
248
- - [API Reference](https://myproject.io/api-reference.md): Detailed API documentation
249
- ```
153
+ # Transform single file with custom options
154
+ transformed = LlmDocsBuilder.transform_markdown(
155
+ 'README.md',
156
+ base_url: 'https://myproject.io',
157
+ remove_code_examples: true,
158
+ remove_images: true,
159
+ generate_toc: true,
160
+ custom_instruction: 'AI-optimized documentation'
161
+ )
250
162
 
251
- **Smart prioritization:**
252
- 1. README files (always first)
253
- 2. Getting started guides
254
- 3. Tutorials and guides
255
- 4. API references
256
- 5. Other documentation
163
+ # Bulk transform
164
+ files = LlmDocsBuilder.bulk_transform(
165
+ './docs',
166
+ base_url: 'https://myproject.io',
167
+ suffix: '.llm',
168
+ remove_duplicates: true,
169
+ generate_toc: true
170
+ )
257
171
 
258
- The llms.txt file serves as an efficient entry point for AI systems to understand your project structure.
172
+ # Generate llms.txt
173
+ content = LlmDocsBuilder.generate_from_docs(
174
+ './docs',
175
+ base_url: 'https://myproject.io',
176
+ title: 'My Project'
177
+ )
178
+ ```
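A config file can also be passed directly instead of individual options, as shown in the previous version's README (assuming the `config_file:` keyword is still supported):

```ruby
# assumption: generate_from_docs still accepts a config_file: keyword as in 0.6.0
content = LlmDocsBuilder.generate_from_docs(config_file: 'llm-docs-builder.yml')
```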
259
179
 
260
- ### 4. Serve to AI Bots (The Deployment)
180
+ ## Serving Optimized Docs to AI Bots
261
181
 
262
- After using `bulk-transform` with `suffix: .llm`, configure your web server to automatically serve optimized versions to AI bots:
182
+ After using `bulk-transform` with `suffix: .llm`, configure your web server to serve optimized versions to AI bots:
263
183
 
264
184
  **Apache (.htaccess):**
265
185
  ```apache
266
- # Detect AI bots
267
- SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt|chatgpt)" IS_LLM_BOT
268
- SetEnvIf User-Agent "(?i)(perplexity|gemini|copilot|bard)" IS_LLM_BOT
269
-
270
- # Serve .llm.md to AI, .md to humans
271
- RewriteEngine On
186
+ SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt)" IS_LLM_BOT
272
187
  RewriteCond %{ENV:IS_LLM_BOT} !^$
273
- RewriteCond %{REQUEST_URI} ^/docs/.*\.md$ [NC]
274
188
  RewriteRule ^(.*)\.md$ $1.llm.md [L]
275
189
  ```
276
190
 
@@ -279,7 +193,6 @@ RewriteRule ^(.*)\.md$ $1.llm.md [L]
279
193
  map $http_user_agent $is_llm_bot {
280
194
  default 0;
281
195
  "~*(?i)(openai|anthropic|claude|gpt)" 1;
282
- "~*(?i)(perplexity|gemini|copilot)" 1;
283
196
  }
284
197
 
285
198
  location ~ ^/docs/(.*)\.md$ {
@@ -289,435 +202,222 @@ location ~ ^/docs/(.*)\.md$ {
289
202
  }
290
203
  ```
291
204
 
292
- **Cloudflare Workers:**
293
- ```javascript
294
- const isLLMBot = /openai|anthropic|claude|gpt|perplexity/i.test(userAgent);
295
- if (isLLMBot && url.pathname.startsWith('/docs/')) {
296
- url.pathname = url.pathname.replace(/\.md$/, '.llm.md');
297
- }
298
- ```
299
-
300
- **Result**: AI systems automatically get optimized versions, humans get the original. No manual switching, no duplicate URLs.
205
+ ## Real-World Results: Karafka Framework
301
206
 
302
- ## Configuration
303
-
304
- All commands support both config files and CLI flags. Config files are recommended for consistency:
207
+ **Before:** 140+ lines of custom transformation code
305
208
 
209
+ **After:** 6 lines of configuration
306
210
  ```yaml
307
- # llm-docs-builder.yml
308
- docs: ./docs
309
- base_url: https://myproject.io
310
- title: My Project
311
- description: Brief description
312
- output: llms.txt
211
+ docs: ./online/docs
212
+ base_url: https://karafka.io/docs
313
213
  convert_urls: true
314
214
  remove_comments: true
315
215
  remove_badges: true
316
216
  remove_frontmatter: true
317
217
  normalize_whitespace: true
318
- suffix: .llm
319
- verbose: false
320
- excludes:
321
- - "**/private/**"
322
- - "**/drafts/**"
218
+ suffix: "" # In-place for build pipeline
323
219
  ```
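The deployment script then runs a single command:

```bash
llm-docs-builder bulk-transform --config llm-docs-builder.yml
```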
324
220
 
325
- **Configuration precedence:**
326
- 1. CLI flags (highest priority)
327
- 2. Config file values
328
- 3. Defaults
329
-
330
- **Example of overriding:**
331
- ```bash
332
- # Uses config file but overrides title
333
- llm-docs-builder generate --config llm-docs-builder.yml --title "Override Title"
334
- ```
221
+ **Results:**
222
+ - 93% average token reduction
223
+ - 20-36x smaller files
224
+ - Automated via GitHub Actions
335
225
 
336
226
  ## Docker Usage
337
227
 
338
- All CLI commands work in Docker with the same syntax:
339
-
340
228
  ```bash
341
- # Basic pattern
342
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder [command] [options]
229
+ # Pull image
230
+ docker pull mensfeld/llm-docs-builder:latest
343
231
 
344
- # Examples
345
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder generate --docs ./docs
346
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder transform --docs README.md
347
- docker run mensfeld/llm-docs-builder compare --url https://example.com/docs
348
- ```
232
+ # Compare (no volume needed for remote URLs)
233
+ docker run mensfeld/llm-docs-builder compare \
234
+ --url https://yoursite.com/docs
349
235
 
350
- **CI/CD Integration:**
236
+ # Transform with volume mount
237
+ docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
238
+ bulk-transform --config llm-docs-builder.yml
239
+ ```
351
240
 
352
- GitHub Actions:
241
+ **CI/CD Example (GitHub Actions):**
353
242
  ```yaml
354
- - name: Generate llms.txt
243
+ - name: Optimize documentation
355
244
  run: |
356
245
  docker run -v ${{ github.workspace }}:/workspace \
357
- mensfeld/llm-docs-builder generate --config llm-docs-builder.yml
358
- ```
359
-
360
- GitLab CI:
361
- ```yaml
362
- generate-llms:
363
- image: mensfeld/llm-docs-builder:latest
364
- script:
365
- - llm-docs-builder generate --docs ./docs
246
+ mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
366
247
  ```
367
248
 
368
- See [Docker Usage](#detailed-docker-usage) section below for comprehensive examples.
369
-
370
- ## Ruby API
371
-
372
- For programmatic usage:
373
-
374
- ```ruby
375
- require 'llm_docs_builder'
376
-
377
- # Using config file
378
- content = LlmDocsBuilder.generate_from_docs(config_file: 'llm-docs-builder.yml')
379
-
380
- # Direct options
381
- content = LlmDocsBuilder.generate_from_docs('./docs',
382
- base_url: 'https://myproject.io',
383
- title: 'My Project'
384
- )
249
+ ## Compression Examples
385
250
 
386
- # Transform markdown
387
- transformed = LlmDocsBuilder.transform_markdown('README.md',
388
- base_url: 'https://myproject.io',
389
- convert_urls: true,
390
- remove_comments: true,
391
- remove_badges: true,
392
- remove_frontmatter: true,
393
- normalize_whitespace: true
394
- )
395
-
396
- # Bulk transform
397
- files = LlmDocsBuilder.bulk_transform('./docs',
398
- base_url: 'https://myproject.io',
399
- suffix: '.llm',
400
- remove_comments: true,
401
- remove_badges: true,
402
- remove_frontmatter: true,
403
- normalize_whitespace: true,
404
- excludes: ['**/private/**']
405
- )
406
-
407
- # In-place transformation
408
- files = LlmDocsBuilder.bulk_transform('./docs',
409
- suffix: '', # Empty for in-place
410
- base_url: 'https://myproject.io',
411
- remove_comments: true,
412
- remove_badges: true,
413
- remove_frontmatter: true,
414
- normalize_whitespace: true
415
- )
416
- ```
251
+ **Input markdown:**
252
+ ```markdown
253
+ ---
254
+ layout: docs
255
+ ---
417
256
 
418
- ## Real-World Case Study: Karafka Framework
257
+ # API Documentation
419
258
 
420
- The [Karafka framework](https://github.com/karafka/karafka) processes millions of Kafka messages daily and maintains extensive documentation. Before llm-docs-builder:
259
+ [![Build](badge.svg)](https://ci.com)
421
260
 
422
- - **140+ lines of custom Ruby code** for link expansion and URL normalization
423
- - Manual maintenance of transformation logic
424
- - No way to measure optimization effectiveness
261
+ > Important: This is a note
425
262
 
426
- **After implementing llm-docs-builder:**
263
+ [Click here to see the complete API documentation](./api.md)
427
264
 
428
- ```yaml
429
- # llm-docs-builder.yml
430
- docs: ./online/docs
431
- base_url: https://karafka.io/docs
432
- convert_urls: true
433
- remove_comments: true
434
- remove_badges: true
435
- remove_frontmatter: true
436
- normalize_whitespace: true
437
- suffix: "" # In-place transformation for build pipeline
438
- excludes:
439
- - "**/Enterprise-License-Setup/**"
265
+ ```ruby
266
+ api = API.new
440
267
  ```
441
268
 
442
- ```bash
443
- # In their deployment script
444
- llm-docs-builder bulk-transform --config llm-docs-builder.yml
269
+ ![Diagram](./diagram.png)
445
270
  ```
446
271
 
447
- **Results:**
448
- - **140 lines of code → 6 lines of config**
449
- - **93% average token reduction** across all documentation
450
- - **Quantifiable savings** via the compare command
451
- - **Automated daily deployments** via GitHub Actions
452
-
453
- The compare command revealed that their documentation was consuming 20-36x more tokens than necessary for AI systems. After optimization, RAG queries became dramatically more efficient.
272
+ **After transformation (with default options):**
273
+ ```markdown
274
+ # API Documentation
454
275
 
455
- ## CLI Reference
276
+ [complete API documentation](./api.md)
456
277
 
457
- ```bash
458
- llm-docs-builder compare [options] # Measure token savings (start here!)
459
- llm-docs-builder transform [options] # Transform single markdown file
460
- llm-docs-builder bulk-transform [options] # Transform entire documentation tree
461
- llm-docs-builder generate [options] # Generate llms.txt index
462
- llm-docs-builder parse [options] # Parse existing llms.txt
463
- llm-docs-builder validate [options] # Validate llms.txt format
464
- llm-docs-builder version # Show version
465
- ```
466
-
467
- **Common options:**
278
+ ```ruby
279
+ api = API.new
468
280
  ```
469
- -c, --config PATH Configuration file (default: llm-docs-builder.yml)
470
- -d, --docs PATH Documentation directory or file
471
- -o, --output PATH Output file path
472
- -u, --url URL URL for comparison
473
- -f, --file PATH Local file for comparison
474
- -v, --verbose Detailed output
475
- -h, --help Show help
476
281
  ```
477
282
 
478
- For advanced options (base_url, title, suffix, excludes, convert_urls), use a config file.
479
-
480
- ## Why This Matters for RAG Systems
283
+ **Token reduction:** ~40-60% depending on configuration
481
284
 
482
- Retrieval-Augmented Generation (RAG) systems fetch documentation to answer questions. Every byte of overhead in those documents:
483
-
484
- 1. **Costs money** - More tokens = higher API costs
485
- 2. **Reduces capacity** - Less room for actual documentation in context window
486
- 3. **Slows responses** - More tokens to process = longer response times
487
- 4. **Degrades quality** - Navigation noise can confuse the model
488
-
489
- llm-docs-builder addresses all four issues by transforming markdown to be AI-friendly and enabling your server to automatically serve it to AI bots while humans get HTML.
490
-
491
- **The JavaScript Problem:**
492
-
493
- Many documentation sites rely on JavaScript for rendering. AI crawlers typically don't execute JavaScript, so they either:
494
- - Get incomplete content
495
- - Get server-side rendered HTML (bloated with framework overhead)
496
- - Fail entirely
497
-
498
- By detecting AI bots and serving them clean markdown instead of HTML, you sidestep this problem entirely.
499
-
500
- ## Configuration Reference
501
-
502
- | Option | Type | Default | Description |
503
- |--------|------|---------|-------------|
504
- | `docs` | String | `./docs` | Documentation directory or file |
505
- | `base_url` | String | - | Base URL for absolute links (e.g., `https://myproject.io`) |
506
- | `title` | String | Auto-detected | Project title |
507
- | `description` | String | Auto-detected | Project description |
508
- | `output` | String | `llms.txt` | Output filename for llms.txt generation |
509
- | `convert_urls` | Boolean | `false` | Convert `.html`/`.htm` to `.md` |
510
- | `remove_comments` | Boolean | `false` | Remove HTML comments (`<!-- ... -->`) |
511
- | `remove_badges` | Boolean | `false` | Remove badge/shield images (CI, version, etc.) |
512
- | `remove_frontmatter` | Boolean | `false` | Remove YAML/TOML frontmatter (Jekyll, Hugo) |
513
- | `normalize_whitespace` | Boolean | `false` | Normalize excessive blank lines and trailing spaces |
514
- | `suffix` | String | `.llm` | Suffix for transformed files (use `""` for in-place) |
515
- | `excludes` | Array | `[]` | Glob patterns to exclude |
516
- | `verbose` | Boolean | `false` | Enable detailed output |
517
-
518
- ## Detailed Docker Usage
285
+ ## FAQ
519
286
 
520
- ### Installation and Setup
287
+ **Q: Do I need to use llms.txt?**
288
+ No. The compare and transform commands work independently.
521
289
 
522
- ```bash
523
- # Pull from Docker Hub
524
- docker pull mensfeld/llm-docs-builder:latest
290
+ **Q: Will this change how humans see my docs?**
291
+ Not with the default `suffix: .llm`. Separate `.llm.md` files are served only to AI bots; your originals stay unchanged.
525
292
 
526
- # Or from GitHub Container Registry
527
- docker pull ghcr.io/mensfeld/llm-docs-builder:latest
293
+ **Q: Can I use this in my build pipeline?**
294
+ Yes. Use `suffix: ""` for in-place transformation.
528
295
 
529
- # Create an alias for convenience
530
- alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'
531
- ```
296
+ **Q: How do I know if it's working?**
297
+ Use `llm-docs-builder compare` to measure before and after.
532
298
 
533
- ### Common Commands
299
+ **Q: What about private documentation?**
300
+ Use the `excludes` option to skip sensitive files.
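For example, glob patterns in the config keep private or internal docs out of every transformation and out of the generated index:

```yaml
excludes:
  - "**/private/**"
  - "**/internal/**"
```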
534
301
 
535
- **Compare (no volume mount needed for remote URLs):**
536
- ```bash
537
- docker run mensfeld/llm-docs-builder compare \
538
- --url https://karafka.io/docs/Getting-Started/
302
+ ## RAG Enhancement Features
539
303
 
540
- # With local file
541
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder compare \
542
- --url https://example.com/page.html \
543
- --file docs/page.md
544
- ```
304
+ ### Heading Normalization
545
305
 
546
- **Generate llms.txt:**
547
- ```bash
548
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
549
- generate --docs ./docs --output llms.txt
550
- ```
306
+ Transform headings to include hierarchical context, making each section self-contained for RAG retrieval:
551
307
 
552
- **Transform single file:**
553
- ```bash
554
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
555
- transform --docs README.md --config llm-docs-builder.yml
556
- ```
308
+ **Before:**
309
+ ```markdown
310
+ # Configuration
311
+ ## Consumer Settings
312
+ ### auto_offset_reset
557
313
 
558
- **Bulk transform:**
559
- ```bash
560
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
561
- bulk-transform --config llm-docs-builder.yml
314
+ Controls behavior when no offset exists...
562
315
  ```
563
316
 
564
- **Parse and validate:**
565
- ```bash
566
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
567
- parse --docs llms.txt --verbose
317
+ **After (with `normalize_headings: true`):**
318
+ ```markdown
319
+ # Configuration
320
+ ## Configuration / Consumer Settings
321
+ ### Configuration / Consumer Settings / auto_offset_reset
568
322
 
569
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
570
- validate --docs llms.txt
323
+ Controls behavior when no offset exists...
571
324
  ```
572
325
 
573
- ### CI/CD Examples
326
+ **Why this matters for RAG:** When documents are chunked and retrieved independently, each section retains its full context. An LLM that sees only the `auto_offset_reset` section knows it belongs to "Configuration / Consumer Settings / auto_offset_reset", not just a generic "auto_offset_reset".
574
327
 
575
- **GitHub Actions:**
576
328
  ```yaml
577
- jobs:
578
- optimize-docs:
579
- runs-on: ubuntu-latest
580
- steps:
581
- - uses: actions/checkout@v3
582
- - name: Transform documentation
583
- run: |
584
- docker run -v ${{ github.workspace }}:/workspace \
585
- mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
586
- - name: Measure savings
587
- run: |
588
- docker run mensfeld/llm-docs-builder \
589
- compare --url https://yoursite.com/docs/main.html
329
+ # Enable in config
330
+ normalize_headings: true
331
+ heading_separator: " / " # Customize separator (default: " / ")
590
332
  ```
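The same settings should work as keyword options in the Ruby API, mirroring how the other YAML keys are exposed (a minimal sketch, assuming `normalize_headings` and `heading_separator` are accepted as kwargs):

```ruby
# assumption: these kwargs mirror the YAML keys, like the other transform options
transformed = LlmDocsBuilder.transform_markdown(
  'docs/Configuration.md',
  normalize_headings: true,
  heading_separator: ' / '
)
```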
591
333
 
592
- **GitLab CI:**
593
- ```yaml
594
- optimize-docs:
595
- image: mensfeld/llm-docs-builder:latest
596
- script:
597
- - llm-docs-builder bulk-transform --docs ./docs
598
- - llm-docs-builder compare --url https://yoursite.com/docs
599
- ```
600
-
601
- **Jenkins:**
602
- ```groovy
603
- stage('Optimize Documentation') {
604
- steps {
605
- sh '''
606
- docker run -v ${WORKSPACE}:/workspace \
607
- mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
608
- '''
609
- }
610
- }
611
- ```
612
-
613
- ### Version Pinning
614
-
615
- ```bash
616
- # Use specific version
617
- docker run mensfeld/llm-docs-builder:0.3.0 version
334
+ ### Enhanced llms.txt Metadata
618
335
 
619
- # Use major version (gets latest patch)
620
- docker run mensfeld/llm-docs-builder:0 version
336
+ Generate enriched llms.txt files with token counts, timestamps, and priority labels to help AI agents make better decisions:
621
337
 
622
- # Always latest
623
- docker run mensfeld/llm-docs-builder:latest version
338
+ **Standard llms.txt:**
339
+ ```markdown
340
+ - [Getting Started](https://myproject.io/docs/Getting-Started.md)
341
+ - [Configuration](https://myproject.io/docs/Configuration.md)
624
342
  ```
625
343
 
626
- ### Platform-Specific Usage
627
-
628
- **Windows PowerShell:**
629
- ```powershell
630
- docker run -v ${PWD}:/workspace mensfeld/llm-docs-builder generate --docs ./docs
344
+ **Enhanced llms.txt (with metadata enabled):**
345
+ ```markdown
346
+ - [Getting Started](https://myproject.io/docs/Getting-Started.md) tokens:450 updated:2025-10-13 priority:high
347
+ - [Configuration](https://myproject.io/docs/Configuration.md) tokens:2800 updated:2025-10-12 priority:high
348
+ - [Advanced Topics](https://myproject.io/docs/Advanced.md) tokens:5200 updated:2025-09-15 priority:medium
631
349
  ```
632
350
 
633
- **Windows Command Prompt:**
634
- ```cmd
635
- docker run -v %cd%:/workspace mensfeld/llm-docs-builder generate --docs ./docs
636
- ```
351
+ **Benefits:**
352
+ - AI agents can see token counts and decide whether to load several small docs or one large one
353
+ - Timestamps help prefer recent documentation
354
+ - Priority signals guide which docs to fetch first
355
+ - Compression ratios show optimization effectiveness
637
356
 
638
- **macOS/Linux:**
639
- ```bash
640
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder generate --docs ./docs
357
+ ```yaml
358
+ # Enable in config
359
+ include_metadata: true # Master switch
360
+ include_tokens: true # Show token counts
361
+ include_timestamps: true # Show last modified dates
362
+ include_priority: true # Show priority labels (high/medium/low)
363
+ calculate_compression: true # Show compression ratios (slower, requires transformation)
641
364
  ```
642
365
 
643
- ## About llms.txt Standard
644
-
645
- The [llms.txt specification](https://llmstxt.org/) is a proposed standard for providing LLM-friendly content. It defines a structured format that helps AI systems:
646
-
647
- - Quickly understand project structure
648
- - Find relevant documentation efficiently
649
- - Navigate complex documentation hierarchies
650
- - Access clean, markdown-formatted content
366
+ ## Advanced Compression Options
651
367
 
652
- llm-docs-builder generates llms.txt files automatically by:
653
- 1. Scanning your documentation directory
654
- 2. Extracting titles and descriptions from markdown files
655
- 3. Prioritizing content by importance (README first, then guides, APIs, etc.)
656
- 4. Formatting everything according to the specification
368
+ All compression features can be used individually for fine-grained control:
657
369
 
658
- The llms.txt file serves as an efficient entry point, but the real token savings come from serving optimized markdown for each individual documentation page.
370
+ ### Content Removal Options
659
371
 
660
- ## How It Works
372
+ - `remove_frontmatter: true` - Remove YAML/TOML metadata blocks
373
+ - `remove_comments: true` - Remove HTML comments (`<!-- ... -->`)
374
+ - `remove_badges: true` - Remove badge/shield images (CI badges, version badges, etc.)
375
+ - `remove_images: true` - Remove all image syntax
376
+ - `remove_code_examples: true` - Remove fenced code blocks, indented code, and inline code
377
+ - `remove_blockquotes: true` - Remove blockquote formatting (preserves content)
378
+ - `remove_duplicates: true` - Remove duplicate paragraphs using fuzzy matching
379
+ - `remove_stopwords: true` - Remove common stopwords from prose (preserves code blocks)
661
380
 
662
- **Generation Process:**
663
- 1. Scan directory for `.md` files
664
- 2. Extract title (first H1) and description (first paragraph)
665
- 3. Prioritize by importance (README → Getting Started → Guides → API → Other)
666
- 4. Build formatted llms.txt with links and descriptions
381
+ ### Content Enhancement Options
667
382
 
668
- **Transformation Process:**
669
- 1. Remove frontmatter (YAML/TOML metadata)
670
- 2. Expand relative links to absolute URLs
671
- 3. Convert `.html` URLs to `.md`
672
- 4. Remove HTML comments
673
- 5. Remove badge/shield images
674
- 6. Normalize excessive whitespace
675
- 7. Write to new file or overwrite in-place
383
+ - `generate_toc: true` - Generate table of contents from headings with anchor links
384
+ - `custom_instruction: "text"` - Inject AI context message at document top
385
+ - `simplify_links: true` - Simplify verbose link text (e.g., "Click here to see the docs" → "docs")
386
+ - `convert_urls: true` - Convert `.html`/`.htm` URLs to `.md` format
387
+ - `normalize_whitespace: true` - Reduce excessive blank lines and remove trailing whitespace
676
388
 
677
- **Comparison Process:**
678
- 1. Fetch URL with human User-Agent (or read local file)
679
- 2. Fetch same URL with AI bot User-Agent
680
- 3. Calculate size difference and reduction percentage
681
- 4. Estimate token counts using character-based heuristic
682
- 5. Display human-readable comparison results with byte and token savings
389
+ ### Example Usage
683
390
 
684
- **Token Estimation:**
685
- The tool uses a simple but effective heuristic for estimating token counts: **~4 characters per token**. This approximation works well for English documentation and provides reasonable estimates without requiring external tokenizer dependencies. While not as precise as OpenAI's tiktoken, it's accurate enough (±10-15%) for understanding context window savings and making optimization decisions.
686
-
687
- ## FAQ
688
-
689
- **Q: Do I need to use llms.txt to benefit from this tool?**
690
-
691
- No. The compare and transform commands provide value independently. Many users start with `compare` to measure savings, then use `bulk-transform` to normalize their markdown files, and may never generate an llms.txt file.
692
-
693
- **Q: Will this change how humans see my documentation?**
694
-
695
- Not if you use the default `suffix: .llm` mode. This creates separate `.llm.md` files served only to AI bots. Your original files remain unchanged for human visitors.
696
-
697
- **Q: Can I use this in my build pipeline?**
698
-
699
- Yes. Use `suffix: ""` for in-place transformation. The Karafka framework does this - they transform their markdown as part of their deployment process.
700
-
701
- **Q: How do I know if it's working?**
702
-
703
- Use the `compare` command to measure before and after. It shows exact byte counts, reduction percentages, and compression factors.
704
-
705
- **Q: Does this work with static site generators?**
706
-
707
- Yes. You can transform markdown files before your static site generator processes them, or serve separate `.llm.md` versions alongside your generated HTML.
391
+ ```ruby
392
+ # Fine-grained control
393
+ LlmDocsBuilder.transform_markdown(
394
+ 'README.md',
395
+ remove_frontmatter: true,
396
+ remove_badges: true,
397
+ remove_images: true,
398
+ simplify_links: true,
399
+ generate_toc: true,
400
+ normalize_whitespace: true
401
+ )
402
+ ```
708
403
 
709
- **Q: What about private/internal documentation?**
404
+ Or configure via YAML:
710
405
 
711
- Use the `excludes` option to skip sensitive files:
712
406
  ```yaml
713
- excludes:
714
- - "**/private/**"
715
- - "**/internal/**"
716
- ```
717
-
718
- **Q: Can I customize the AI bot detection?**
407
+ # llm-docs-builder.yml
408
+ docs: ./docs
409
+ base_url: https://myproject.io
410
+ suffix: .llm
719
411
 
720
- Yes. The web server examples show the User-Agent patterns. You can add or remove patterns based on which AI systems you want to support.
412
+ # Pick exactly what you need
413
+ remove_frontmatter: true
414
+ remove_comments: true
415
+ remove_badges: true
416
+ remove_images: true
417
+ simplify_links: true
418
+ generate_toc: true
419
+ normalize_whitespace: true
420
+ ```
721
421
 
722
422
  ## Contributing
723
423