llm-docs-builder 0.6.0 → 0.8.0
- checksums.yaml +4 -4
- data/.rspec +3 -0
- data/CHANGELOG.md +59 -0
- data/Gemfile.lock +1 -1
- data/README.md +241 -541
- data/bin/rspecs +2 -1
- data/lib/llm_docs_builder/cli.rb +1 -62
- data/lib/llm_docs_builder/comparator.rb +4 -16
- data/lib/llm_docs_builder/config.rb +74 -5
- data/lib/llm_docs_builder/generator.rb +67 -8
- data/lib/llm_docs_builder/markdown_transformer.rb +61 -126
- data/lib/llm_docs_builder/output_formatter.rb +93 -0
- data/lib/llm_docs_builder/parser.rb +1 -59
- data/lib/llm_docs_builder/text_compressor.rb +164 -0
- data/lib/llm_docs_builder/token_estimator.rb +52 -0
- data/lib/llm_docs_builder/transformers/base_transformer.rb +30 -0
- data/lib/llm_docs_builder/transformers/content_cleanup_transformer.rb +106 -0
- data/lib/llm_docs_builder/transformers/enhancement_transformer.rb +95 -0
- data/lib/llm_docs_builder/transformers/heading_transformer.rb +72 -0
- data/lib/llm_docs_builder/transformers/link_transformer.rb +84 -0
- data/lib/llm_docs_builder/transformers/whitespace_transformer.rb +44 -0
- data/lib/llm_docs_builder/version.rb +1 -1
- metadata +11 -3
- data/CLAUDE.md +0 -178
- data/llm-docs-builder.yml +0 -7
data/README.md
CHANGED
@@ -5,34 +5,21 @@
|
|
5
5
|
|
6
6
|
**Optimize your documentation for LLMs and RAG systems. Reduce token consumption by 67-95%.**
|
7
7
|
|
8
|
-
llm-docs-builder
|
8
|
+
llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, optimizes documents for LLM context windows, and enhances documents for RAG retrieval with hierarchical heading context and metadata.
|
9
9
|
|
10
10
|
## The Problem
|
11
11
|
|
12
12
|
When LLMs fetch documentation, they typically get HTML pages designed for humans - complete with navigation bars, footers, JavaScript, CSS, and other overhead. This wastes 70-90% of your context window on content that doesn't help answer questions.
|
13
13
|
|
14
14
|
**Real example from Karafka documentation:**
|
15
|
-
- Human HTML version:
|
16
|
-
- AI markdown version:
|
17
|
-
- **Result:
|
18
|
-
|
19
|
-
With GPT-4's pricing at $2.50 per million input tokens, that's real money saved on every API call. More importantly, you can fit 30x more actual documentation into the same context window.
|
20
|
-
|
21
|
-
## What This Tool Does
|
22
|
-
|
23
|
-
llm-docs-builder helps you optimize markdown documentation for AI consumption:
|
24
|
-
|
25
|
-
1. **Measure Savings** - Compare what your server sends to humans (HTML) vs AI bots (markdown) to quantify context window reduction
|
26
|
-
2. **Transform Markdown** - Normalize your markdown files with absolute links and consistent URL formats for better LLM navigation
|
27
|
-
3. **Generate llms.txt** - Create standardized documentation indexes following the [llms.txt](https://llmstxt.org/) specification
|
28
|
-
4. **Serve Efficiently** - Configure your server to automatically serve transformed markdown to AI bots while humans get HTML
|
15
|
+
- Human HTML version: 104.4 KB (~26,735 tokens)
|
16
|
+
- AI markdown version: 21.5 KB (~5,496 tokens)
|
17
|
+
- **Result: 79% reduction, 21,239 tokens saved, 5x smaller**
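The result line follows directly from the two token counts quoted above. A minimal sketch of the arithmetic (the `savings` helper is hypothetical and not part of the gem):

```ruby
# Summarize the difference between the human-facing and AI-facing versions.
# `savings` is an illustrative helper, not an llm-docs-builder API.
def savings(human_tokens, ai_tokens)
  saved = human_tokens - ai_tokens
  {
    saved: saved,                                            # tokens no longer sent
    reduction_percent: (saved * 100.0 / human_tokens).round, # share of tokens removed
    factor: (human_tokens.to_f / ai_tokens).round(1)         # how many times smaller
  }
end

result = savings(26_735, 5_496)
# result[:saved]             is 21239
# result[:reduction_percent] is 79
# result[:factor]            is 4.9 (the README rounds this to "5x")
```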
|
29
18
|
|
30
19
|
## Quick Start
|
31
20
|
|
32
21
|
### Measure Your Current Token Waste
|
33
22
|
|
34
|
-
Before making any changes, see how much you could save:
|
35
|
-
|
36
23
|
```bash
|
37
24
|
# Using Docker (no Ruby installation needed)
|
38
25
|
docker pull mensfeld/llm-docs-builder:latest
|
@@ -42,235 +29,162 @@ docker run mensfeld/llm-docs-builder compare \
|
|
42
29
|
--url https://yoursite.com/docs/getting-started.html
|
43
30
|
```
|
44
31
|
|
45
|
-
|
46
|
-
```
|
47
|
-
============================================================
|
48
|
-
Context Window Comparison
|
49
|
-
============================================================
|
50
|
-
|
51
|
-
Human version: 45.2 KB (~11,300 tokens)
|
52
|
-
Source: https://yoursite.com/docs/page.html (User-Agent: human)
|
32
|
+
### Transform Your Documentation
|
53
33
|
|
54
|
-
|
55
|
-
|
34
|
+
```bash
|
35
|
+
# Single file
|
36
|
+
llm-docs-builder transform --docs README.md
|
56
37
|
|
57
|
-
|
58
|
-
|
59
|
-
Token savings: 8,100 tokens (72%)
|
60
|
-
Factor: 3.5x smaller
|
61
|
-
============================================================
|
38
|
+
# Bulk transform with config
|
39
|
+
llm-docs-builder bulk-transform --config llm-docs-builder.yml
|
62
40
|
```
|
63
41
|
|
64
|
-
This single command shows you the potential ROI before you invest any time in optimization.
|
65
|
-
|
66
|
-
### Real-World Results
|
67
|
-
|
68
|
-
**[Karafka Framework Documentation](https://karafka.io/docs)** (10 pages analyzed):
|
69
|
-
|
70
|
-
| Page | Human HTML | AI Markdown | Reduction | Tokens Saved | Factor |
|
71
|
-
|------|-----------|-------------|-----------|--------------|---------|
|
72
|
-
| Getting Started | 82.0 KB | 4.1 KB | 95% | ~19,475 | 20.1x |
|
73
|
-
| Configuration | 86.3 KB | 7.1 KB | 92% | ~19,800 | 12.1x |
|
74
|
-
| Routing | 93.6 KB | 14.7 KB | 84% | ~19,725 | 6.4x |
|
75
|
-
| Deployment | 122.1 KB | 33.3 KB | 73% | ~22,200 | 3.7x |
|
76
|
-
| Producing Messages | 87.7 KB | 8.3 KB | 91% | ~19,850 | 10.6x |
|
77
|
-
| Consuming Messages | 105.3 KB | 21.3 KB | 80% | ~21,000 | 4.9x |
|
78
|
-
| Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | ~21,950 | 5.1x |
|
79
|
-
| Active Job | 88.7 KB | 8.8 KB | 90% | ~19,975 | 10.1x |
|
80
|
-
| Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | ~22,050 | 3.7x |
|
81
|
-
| Error Handling | 93.8 KB | 13.1 KB | 86% | ~20,175 | 7.2x |
|
82
|
-
|
83
|
-
**Average: 83% reduction, ~20,620 tokens saved per page, 8.4x smaller files**
|
84
|
-
|
85
|
-
For a typical RAG system making 1,000 documentation queries per day:
|
86
|
-
- **Before**: ~990 KB per day (~247,500 tokens) × 1,000 queries = ~247.5M tokens/day
|
87
|
-
- **After**: ~165 KB per day (~41,250 tokens) × 1,000 queries = ~41.25M tokens/day
|
88
|
-
- **Savings**: 83% reduction = ~206.25M tokens saved per day
|
89
|
-
|
90
|
-
At GPT-4 pricing ($2.50/M input tokens), that's approximately **$500/day or $183,000/year saved** on a documentation site with moderate traffic.
|
91
|
-
|
92
42
|
## Installation
|
93
43
|
|
94
|
-
###
|
95
|
-
|
96
|
-
No Ruby installation required. Perfect for CI/CD and quick usage:
|
44
|
+
### Docker (Recommended)
|
97
45
|
|
98
46
|
```bash
|
99
|
-
# Pull the image
|
100
47
|
docker pull mensfeld/llm-docs-builder:latest
|
101
|
-
|
102
|
-
# Create an alias for convenience
|
103
48
|
alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'
|
104
|
-
|
105
|
-
# Use like a native command
|
106
|
-
llm-docs-builder compare --url https://yoursite.com/docs
|
107
49
|
```
|
108
50
|
|
109
|
-
|
110
|
-
|
111
|
-
### Option 2: RubyGems
|
112
|
-
|
113
|
-
For Ruby developers or when you need the Ruby API:
|
51
|
+
### RubyGems
|
114
52
|
|
115
53
|
```bash
|
116
54
|
gem install llm-docs-builder
|
117
55
|
```
|
118
56
|
|
119
|
-
|
120
|
-
|
121
|
-
```ruby
|
122
|
-
gem 'llm-docs-builder'
|
123
|
-
```
|
124
|
-
|
125
|
-
## Core Features
|
57
|
+
## Features
|
126
58
|
|
127
|
-
###
|
128
|
-
|
129
|
-
Quantify exactly how much context window you're wasting:
|
59
|
+
### Measure and Compare
|
130
60
|
|
131
61
|
```bash
|
132
|
-
# Compare what your server sends to humans vs AI
|
62
|
+
# Compare what your server sends to humans vs AI
|
133
63
|
llm-docs-builder compare --url https://yoursite.com/docs/page.html
|
134
64
|
|
135
|
-
# Compare remote HTML with
|
65
|
+
# Compare remote HTML with local markdown
|
136
66
|
llm-docs-builder compare \
|
137
67
|
--url https://yoursite.com/docs/api.html \
|
138
68
|
--file docs/api.md
|
139
|
-
|
140
|
-
# Verbose mode for debugging
|
141
|
-
llm-docs-builder compare --url https://example.com/docs --verbose
|
142
69
|
```
|
143
70
|
|
144
|
-
|
145
|
-
- Validates that optimizations actually work
|
146
|
-
- Quantifies ROI before you invest time
|
147
|
-
- Monitors ongoing effectiveness
|
148
|
-
- Provides concrete metrics for stakeholders
|
149
|
-
|
150
|
-
### 2. Transform Markdown (The Normalizer)
|
151
|
-
|
152
|
-
Normalize your markdown documentation to be LLM-friendly:
|
71
|
+
### Generate llms.txt
|
153
72
|
|
154
|
-
**Single file transformation:**
|
155
73
|
```bash
|
156
|
-
#
|
157
|
-
llm-docs-builder
|
158
|
-
--docs README.md \
|
159
|
-
--config llm-docs-builder.yml
|
74
|
+
# Create standardized documentation index
|
75
|
+
llm-docs-builder generate --config llm-docs-builder.yml
|
160
76
|
```
|
161
77
|
|
162
|
-
|
78
|
+
## Configuration
|
163
79
|
|
164
|
-
**a) Separate files (default)** - Creates `.llm.md` versions alongside originals:
|
165
80
|
```yaml
|
166
81
|
# llm-docs-builder.yml
|
167
82
|
docs: ./docs
|
168
83
|
base_url: https://myproject.io
|
169
|
-
|
170
|
-
|
171
|
-
|
172
|
-
|
173
|
-
|
174
|
-
normalize_whitespace: true # Clean up excessive blank lines
|
175
|
-
```
|
176
|
-
|
177
|
-
```bash
|
178
|
-
llm-docs-builder bulk-transform --config llm-docs-builder.yml
|
179
|
-
```
|
84
|
+
title: My Project
|
85
|
+
description: Brief description
|
86
|
+
output: llms.txt
|
87
|
+
suffix: .llm
|
88
|
+
verbose: false
|
180
89
|
|
181
|
-
|
182
|
-
|
183
|
-
|
184
|
-
|
185
|
-
|
186
|
-
|
187
|
-
└── api.llm.md
|
188
|
-
```
|
90
|
+
# Basic options
|
91
|
+
convert_urls: true
|
92
|
+
remove_comments: true
|
93
|
+
remove_badges: true
|
94
|
+
remove_frontmatter: true
|
95
|
+
normalize_whitespace: true
|
189
96
|
|
190
|
-
|
191
|
-
|
192
|
-
|
193
|
-
|
194
|
-
|
195
|
-
|
196
|
-
|
197
|
-
|
198
|
-
|
199
|
-
|
200
|
-
|
97
|
+
# Additional compression options
|
98
|
+
remove_code_examples: false
|
99
|
+
remove_images: true
|
100
|
+
remove_blockquotes: true
|
101
|
+
remove_duplicates: true
|
102
|
+
remove_stopwords: false
|
103
|
+
simplify_links: true
|
104
|
+
generate_toc: true
|
105
|
+
custom_instruction: "This documentation is optimized for AI consumption"
|
106
|
+
|
107
|
+
# RAG enhancement options
|
108
|
+
normalize_headings: true # Add hierarchical context to headings
|
109
|
+
heading_separator: " / " # Separator for heading hierarchy
|
110
|
+
include_metadata: true # Enable enhanced llms.txt metadata
|
111
|
+
include_tokens: true # Include token counts in llms.txt
|
112
|
+
include_timestamps: true # Include update timestamps in llms.txt
|
113
|
+
include_priority: true # Include priority labels in llms.txt
|
114
|
+
calculate_compression: false # Calculate compression ratios (slower)
|
115
|
+
|
116
|
+
# Exclusions
|
201
117
|
excludes:
|
202
118
|
- "**/private/**"
|
119
|
+
- "**/drafts/**"
|
203
120
|
```
|
204
121
|
|
205
|
-
|
206
|
-
|
207
|
-
|
208
|
-
|
209
|
-
Perfect for CI/CD where you transform docs before deployment.
|
210
|
-
|
211
|
-
**What gets normalized:**
|
212
|
-
- **Links**: Relative → Absolute URLs (`./api.md` → `https://yoursite.com/api.md`)
|
213
|
-
- **URLs**: HTML → Markdown format (`.html` → `.md`)
|
214
|
-
- **Comments**: HTML comments removed (`<!-- ... -->`)
|
215
|
-
- **Badges**: Shield/badge images removed (CI badges, version badges, etc.)
|
216
|
-
- **Frontmatter**: YAML/TOML metadata removed (Jekyll, Hugo, etc.)
|
217
|
-
- **Whitespace**: Excessive blank lines reduced (3+ → 2 max)
|
218
|
-
- Clean markdown structure preserved
|
219
|
-
- No content modification, just intelligent cleanup
|
220
|
-
|
221
|
-
### 3. Generate llms.txt (The Standard)
|
222
|
-
|
223
|
-
Create a standardized documentation index following the [llms.txt](https://llmstxt.org/) specification:
|
122
|
+
**Configuration precedence:**
|
123
|
+
1. CLI flags (highest)
|
124
|
+
2. Config file
|
125
|
+
3. Defaults
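One way to picture this precedence is as a chain of hash merges in which later sources win. The sketch below is an assumption about the behavior, not the gem's actual implementation; `DEFAULTS` and `resolve_options` are hypothetical names.

```ruby
# Hypothetical defaults, for illustration only.
DEFAULTS = { suffix: '.llm', output: 'llms.txt', verbose: false }.freeze

# Later hashes win: defaults < config file < CLI flags.
# A nil CLI value is treated as "flag not given" and dropped before merging.
def resolve_options(file_config, cli_flags)
  DEFAULTS.merge(file_config).merge(cli_flags.compact)
end

options = resolve_options(
  { title: 'My Project', verbose: true },   # values read from llm-docs-builder.yml
  { title: 'Override Title', verbose: nil } # values parsed from the command line
)
# options[:title]   is "Override Title" (CLI flag beats config file)
# options[:verbose] is true             (config file beats default)
# options[:suffix]  is ".llm"           (default, set nowhere else)
```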
|
224
126
|
|
225
|
-
|
226
|
-
# llm-docs-builder.yml
|
227
|
-
docs: ./docs
|
228
|
-
base_url: https://myproject.io
|
229
|
-
title: My Project
|
230
|
-
description: A library that does amazing things
|
231
|
-
output: llms.txt
|
232
|
-
```
|
127
|
+
## CLI Commands
|
233
128
|
|
234
129
|
```bash
|
235
|
-
llm-docs-builder
|
130
|
+
llm-docs-builder compare [options] # Measure token savings
|
131
|
+
llm-docs-builder transform [options] # Transform single file
|
132
|
+
llm-docs-builder bulk-transform [options] # Transform directory
|
133
|
+
llm-docs-builder generate [options] # Generate llms.txt
|
134
|
+
llm-docs-builder parse [options] # Parse llms.txt
|
135
|
+
llm-docs-builder validate [options] # Validate llms.txt
|
136
|
+
llm-docs-builder version # Show version
|
236
137
|
```
|
237
138
|
|
238
|
-
**
|
239
|
-
```
|
240
|
-
|
139
|
+
**Common options:**
|
140
|
+
```
|
141
|
+
-c, --config PATH Configuration file
|
142
|
+
-d, --docs PATH Documentation path
|
143
|
+
-o, --output PATH Output file
|
144
|
+
-u, --url URL URL for comparison
|
145
|
+
-v, --verbose Detailed output
|
146
|
+
```
|
241
147
|
|
242
|
-
|
148
|
+
## Ruby API
|
243
149
|
|
244
|
-
|
150
|
+
```ruby
|
151
|
+
require 'llm_docs_builder'
|
245
152
|
|
246
|
-
|
247
|
-
|
248
|
-
|
249
|
-
|
153
|
+
# Transform single file with custom options
|
154
|
+
transformed = LlmDocsBuilder.transform_markdown(
|
155
|
+
'README.md',
|
156
|
+
base_url: 'https://myproject.io',
|
157
|
+
remove_code_examples: true,
|
158
|
+
remove_images: true,
|
159
|
+
generate_toc: true,
|
160
|
+
custom_instruction: 'AI-optimized documentation'
|
161
|
+
)
|
250
162
|
|
251
|
-
|
252
|
-
|
253
|
-
|
254
|
-
|
255
|
-
|
256
|
-
|
163
|
+
# Bulk transform
|
164
|
+
files = LlmDocsBuilder.bulk_transform(
|
165
|
+
'./docs',
|
166
|
+
base_url: 'https://myproject.io',
|
167
|
+
suffix: '.llm',
|
168
|
+
remove_duplicates: true,
|
169
|
+
generate_toc: true
|
170
|
+
)
|
257
171
|
|
258
|
-
|
172
|
+
# Generate llms.txt
|
173
|
+
content = LlmDocsBuilder.generate_from_docs(
|
174
|
+
'./docs',
|
175
|
+
base_url: 'https://myproject.io',
|
176
|
+
title: 'My Project'
|
177
|
+
)
|
178
|
+
```
|
259
179
|
|
260
|
-
|
180
|
+
## Serving Optimized Docs to AI Bots
|
261
181
|
|
262
|
-
After using `bulk-transform` with `suffix: .llm`, configure your web server to
|
182
|
+
After using `bulk-transform` with `suffix: .llm`, configure your web server to serve optimized versions to AI bots:
|
263
183
|
|
264
184
|
**Apache (.htaccess):**
|
265
185
|
```apache
|
266
|
-
|
267
|
-
SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt|chatgpt)" IS_LLM_BOT
|
268
|
-
SetEnvIf User-Agent "(?i)(perplexity|gemini|copilot|bard)" IS_LLM_BOT
|
269
|
-
|
270
|
-
# Serve .llm.md to AI, .md to humans
|
271
|
-
RewriteEngine On
|
186
|
+
SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt)" IS_LLM_BOT
|
272
187
|
RewriteCond %{ENV:IS_LLM_BOT} !^$
|
273
|
-
RewriteCond %{REQUEST_URI} ^/docs/.*\.md$ [NC]
|
274
188
|
RewriteRule ^(.*)\.md$ $1.llm.md [L]
|
275
189
|
```
|
276
190
|
|
@@ -279,7 +193,6 @@ RewriteRule ^(.*)\.md$ $1.llm.md [L]
|
|
279
193
|
map $http_user_agent $is_llm_bot {
|
280
194
|
default 0;
|
281
195
|
"~*(?i)(openai|anthropic|claude|gpt)" 1;
|
282
|
-
"~*(?i)(perplexity|gemini|copilot)" 1;
|
283
196
|
}
|
284
197
|
|
285
198
|
location ~ ^/docs/(.*)\.md$ {
|
@@ -289,435 +202,222 @@ location ~ ^/docs/(.*)\.md$ {
|
|
289
202
|
}
|
290
203
|
```
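The same routing idea as the Apache and Nginx rules above can be expressed in application code. This is a rough sketch only: `path_for` is a hypothetical helper, and the User-Agent pattern is copied from the server rules rather than taken from any official bot list.

```ruby
# Case-insensitive User-Agent pattern, mirroring the server rules above.
LLM_BOT_PATTERN = /openai|anthropic|claude|gpt/i

# Returns the path to serve: the .llm.md variant for AI bots,
# the original path for everyone else.
def path_for(request_path, user_agent)
  return request_path unless user_agent.to_s.match?(LLM_BOT_PATTERN)
  return request_path if request_path.end_with?('.llm.md') # already optimized

  request_path.sub(/\.md\z/, '.llm.md')
end

path_for('/docs/api.md', 'Mozilla/5.0 (compatible; GPTBot/1.0)') # "/docs/api.llm.md"
path_for('/docs/api.md', 'Mozilla/5.0 (X11; Linux) Firefox/130') # "/docs/api.md"
```

User-Agent strings are easy to spoof, so this kind of check is only suitable for choosing a representation, not for access control.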
|
291
204
|
|
292
|
-
|
293
|
-
```javascript
|
294
|
-
const isLLMBot = /openai|anthropic|claude|gpt|perplexity/i.test(userAgent);
|
295
|
-
if (isLLMBot && url.pathname.startsWith('/docs/')) {
|
296
|
-
url.pathname = url.pathname.replace(/\.md$/, '.llm.md');
|
297
|
-
}
|
298
|
-
```
|
299
|
-
|
300
|
-
**Result**: AI systems automatically get optimized versions, humans get the original. No manual switching, no duplicate URLs.
|
205
|
+
## Real-World Results: Karafka Framework
|
301
206
|
|
302
|
-
|
303
|
-
|
304
|
-
All commands support both config files and CLI flags. Config files are recommended for consistency:
|
207
|
+
**Before:** 140+ lines of custom transformation code
|
305
208
|
|
209
|
+
**After:** 6 lines of configuration
|
306
210
|
```yaml
|
307
|
-
|
308
|
-
|
309
|
-
base_url: https://myproject.io
|
310
|
-
title: My Project
|
311
|
-
description: Brief description
|
312
|
-
output: llms.txt
|
211
|
+
docs: ./online/docs
|
212
|
+
base_url: https://karafka.io/docs
|
313
213
|
convert_urls: true
|
314
214
|
remove_comments: true
|
315
215
|
remove_badges: true
|
316
216
|
remove_frontmatter: true
|
317
217
|
normalize_whitespace: true
|
318
|
-
suffix:
|
319
|
-
verbose: false
|
320
|
-
excludes:
|
321
|
-
- "**/private/**"
|
322
|
-
- "**/drafts/**"
|
218
|
+
suffix: "" # In-place for build pipeline
|
323
219
|
```
|
324
220
|
|
325
|
-
**
|
326
|
-
|
327
|
-
|
328
|
-
|
329
|
-
|
330
|
-
**Example of overriding:**
|
331
|
-
```bash
|
332
|
-
# Uses config file but overrides title
|
333
|
-
llm-docs-builder generate --config llm-docs-builder.yml --title "Override Title"
|
334
|
-
```
|
221
|
+
**Results:**
|
222
|
+
- 93% average token reduction
|
223
|
+
- 20-36x smaller files
|
224
|
+
- Automated via GitHub Actions
|
335
225
|
|
336
226
|
## Docker Usage
|
337
227
|
|
338
|
-
All CLI commands work in Docker with the same syntax:
|
339
|
-
|
340
228
|
```bash
|
341
|
-
#
|
342
|
-
docker
|
229
|
+
# Pull image
|
230
|
+
docker pull mensfeld/llm-docs-builder:latest
|
343
231
|
|
344
|
-
#
|
345
|
-
docker run
|
346
|
-
|
347
|
-
docker run mensfeld/llm-docs-builder compare --url https://example.com/docs
|
348
|
-
```
|
232
|
+
# Compare (no volume needed for remote URLs)
|
233
|
+
docker run mensfeld/llm-docs-builder compare \
|
234
|
+
--url https://yoursite.com/docs
|
349
235
|
|
350
|
-
|
236
|
+
# Transform with volume mount
|
237
|
+
docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
|
238
|
+
bulk-transform --config llm-docs-builder.yml
|
239
|
+
```
|
351
240
|
|
352
|
-
GitHub Actions
|
241
|
+
**CI/CD Example (GitHub Actions):**
|
353
242
|
```yaml
|
354
|
-
- name:
|
243
|
+
- name: Optimize documentation
|
355
244
|
run: |
|
356
245
|
docker run -v ${{ github.workspace }}:/workspace \
|
357
|
-
mensfeld/llm-docs-builder
|
358
|
-
```
|
359
|
-
|
360
|
-
GitLab CI:
|
361
|
-
```yaml
|
362
|
-
generate-llms:
|
363
|
-
image: mensfeld/llm-docs-builder:latest
|
364
|
-
script:
|
365
|
-
- llm-docs-builder generate --docs ./docs
|
246
|
+
mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
|
366
247
|
```
|
367
248
|
|
368
|
-
|
369
|
-
|
370
|
-
## Ruby API
|
371
|
-
|
372
|
-
For programmatic usage:
|
373
|
-
|
374
|
-
```ruby
|
375
|
-
require 'llm_docs_builder'
|
376
|
-
|
377
|
-
# Using config file
|
378
|
-
content = LlmDocsBuilder.generate_from_docs(config_file: 'llm-docs-builder.yml')
|
379
|
-
|
380
|
-
# Direct options
|
381
|
-
content = LlmDocsBuilder.generate_from_docs('./docs',
|
382
|
-
base_url: 'https://myproject.io',
|
383
|
-
title: 'My Project'
|
384
|
-
)
|
249
|
+
## Compression Examples
|
385
250
|
|
386
|
-
|
387
|
-
|
388
|
-
|
389
|
-
|
390
|
-
|
391
|
-
remove_badges: true,
|
392
|
-
remove_frontmatter: true,
|
393
|
-
normalize_whitespace: true
|
394
|
-
)
|
395
|
-
|
396
|
-
# Bulk transform
|
397
|
-
files = LlmDocsBuilder.bulk_transform('./docs',
|
398
|
-
base_url: 'https://myproject.io',
|
399
|
-
suffix: '.llm',
|
400
|
-
remove_comments: true,
|
401
|
-
remove_badges: true,
|
402
|
-
remove_frontmatter: true,
|
403
|
-
normalize_whitespace: true,
|
404
|
-
excludes: ['**/private/**']
|
405
|
-
)
|
406
|
-
|
407
|
-
# In-place transformation
|
408
|
-
files = LlmDocsBuilder.bulk_transform('./docs',
|
409
|
-
suffix: '', # Empty for in-place
|
410
|
-
base_url: 'https://myproject.io',
|
411
|
-
remove_comments: true,
|
412
|
-
remove_badges: true,
|
413
|
-
remove_frontmatter: true,
|
414
|
-
normalize_whitespace: true
|
415
|
-
)
|
416
|
-
```
|
251
|
+
**Input markdown:**
|
252
|
+
```markdown
|
253
|
+
---
|
254
|
+
layout: docs
|
255
|
+
---
|
417
256
|
|
418
|
-
|
257
|
+
# API Documentation
|
419
258
|
|
420
|
-
|
259
|
+
[](https://ci.com)
|
421
260
|
|
422
|
-
|
423
|
-
- Manual maintenance of transformation logic
|
424
|
-
- No way to measure optimization effectiveness
|
261
|
+
> Important: This is a note
|
425
262
|
|
426
|
-
|
263
|
+
[Click here to see the complete API documentation](./api.md)
|
427
264
|
|
428
|
-
```
|
429
|
-
|
430
|
-
docs: ./online/docs
|
431
|
-
base_url: https://karafka.io/docs
|
432
|
-
convert_urls: true
|
433
|
-
remove_comments: true
|
434
|
-
remove_badges: true
|
435
|
-
remove_frontmatter: true
|
436
|
-
normalize_whitespace: true
|
437
|
-
suffix: "" # In-place transformation for build pipeline
|
438
|
-
excludes:
|
439
|
-
- "**/Enterprise-License-Setup/**"
|
265
|
+
```ruby
|
266
|
+
api = API.new
|
440
267
|
```
|
441
268
|
|
442
|
-
|
443
|
-
# In their deployment script
|
444
|
-
llm-docs-builder bulk-transform --config llm-docs-builder.yml
|
269
|
+

|
445
270
|
```
|
446
271
|
|
447
|
-
**
|
448
|
-
|
449
|
-
|
450
|
-
- **Quantifiable savings** via the compare command
|
451
|
-
- **Automated daily deployments** via GitHub Actions
|
452
|
-
|
453
|
-
The compare command revealed that their documentation was consuming 20-36x more tokens than necessary for AI systems. After optimization, RAG queries became dramatically more efficient.
|
272
|
+
**After transformation (with default options):**
|
273
|
+
```markdown
|
274
|
+
# API Documentation
|
454
275
|
|
455
|
-
|
276
|
+
[complete API documentation](./api.md)
|
456
277
|
|
457
|
-
```
|
458
|
-
|
459
|
-
llm-docs-builder transform [options] # Transform single markdown file
|
460
|
-
llm-docs-builder bulk-transform [options] # Transform entire documentation tree
|
461
|
-
llm-docs-builder generate [options] # Generate llms.txt index
|
462
|
-
llm-docs-builder parse [options] # Parse existing llms.txt
|
463
|
-
llm-docs-builder validate [options] # Validate llms.txt format
|
464
|
-
llm-docs-builder version # Show version
|
465
|
-
```
|
466
|
-
|
467
|
-
**Common options:**
|
278
|
+
```ruby
|
279
|
+
api = API.new
|
468
280
|
```
|
469
|
-
-c, --config PATH Configuration file (default: llm-docs-builder.yml)
|
470
|
-
-d, --docs PATH Documentation directory or file
|
471
|
-
-o, --output PATH Output file path
|
472
|
-
-u, --url URL URL for comparison
|
473
|
-
-f, --file PATH Local file for comparison
|
474
|
-
-v, --verbose Detailed output
|
475
|
-
-h, --help Show help
|
476
281
|
```
|
477
282
|
|
478
|
-
|
479
|
-
|
480
|
-
## Why This Matters for RAG Systems
|
283
|
+
**Token reduction:** ~40-60% depending on configuration
|
481
284
|
|
482
|
-
|
483
|
-
|
484
|
-
1. **Costs money** - More tokens = higher API costs
|
485
|
-
2. **Reduces capacity** - Less room for actual documentation in context window
|
486
|
-
3. **Slows responses** - More tokens to process = longer response times
|
487
|
-
4. **Degrades quality** - Navigation noise can confuse the model
|
488
|
-
|
489
|
-
llm-docs-builder addresses all four issues by transforming markdown to be AI-friendly and enabling your server to automatically serve it to AI bots while humans get HTML.
|
490
|
-
|
491
|
-
**The JavaScript Problem:**
|
492
|
-
|
493
|
-
Many documentation sites rely on JavaScript for rendering. AI crawlers typically don't execute JavaScript, so they either:
|
494
|
-
- Get incomplete content
|
495
|
-
- Get server-side rendered HTML (bloated with framework overhead)
|
496
|
-
- Fail entirely
|
497
|
-
|
498
|
-
By detecting AI bots and serving them clean markdown instead of HTML, you sidestep this problem entirely.
|
499
|
-
|
500
|
-
## Configuration Reference
|
501
|
-
|
502
|
-
| Option | Type | Default | Description |
|
503
|
-
|--------|------|---------|-------------|
|
504
|
-
| `docs` | String | `./docs` | Documentation directory or file |
|
505
|
-
| `base_url` | String | - | Base URL for absolute links (e.g., `https://myproject.io`) |
|
506
|
-
| `title` | String | Auto-detected | Project title |
|
507
|
-
| `description` | String | Auto-detected | Project description |
|
508
|
-
| `output` | String | `llms.txt` | Output filename for llms.txt generation |
|
509
|
-
| `convert_urls` | Boolean | `false` | Convert `.html`/`.htm` to `.md` |
|
510
|
-
| `remove_comments` | Boolean | `false` | Remove HTML comments (`<!-- ... -->`) |
|
511
|
-
| `remove_badges` | Boolean | `false` | Remove badge/shield images (CI, version, etc.) |
|
512
|
-
| `remove_frontmatter` | Boolean | `false` | Remove YAML/TOML frontmatter (Jekyll, Hugo) |
|
513
|
-
| `normalize_whitespace` | Boolean | `false` | Normalize excessive blank lines and trailing spaces |
|
514
|
-
| `suffix` | String | `.llm` | Suffix for transformed files (use `""` for in-place) |
|
515
|
-
| `excludes` | Array | `[]` | Glob patterns to exclude |
|
516
|
-
| `verbose` | Boolean | `false` | Enable detailed output |
|
517
|
-
|
518
|
-
## Detailed Docker Usage
|
285
|
+
## FAQ
|
519
286
|
|
520
|
-
|
287
|
+
**Q: Do I need to use llms.txt?**
|
288
|
+
No. The compare and transform commands work independently.
|
521
289
|
|
522
|
-
|
523
|
-
|
524
|
-
docker pull mensfeld/llm-docs-builder:latest
|
290
|
+
**Q: Will this change how humans see my docs?**
|
291
|
+
Not with default `suffix: .llm`. Separate files are served only to AI bots.
|
525
292
|
|
526
|
-
|
527
|
-
|
293
|
+
**Q: Can I use this in my build pipeline?**
|
294
|
+
Yes. Use `suffix: ""` for in-place transformation.
|
528
295
|
|
529
|
-
|
530
|
-
|
531
|
-
```
|
296
|
+
**Q: How do I know if it's working?**
|
297
|
+
Use `llm-docs-builder compare` to measure before and after.
|
532
298
|
|
533
|
-
|
299
|
+
**Q: What about private documentation?**
|
300
|
+
Use the `excludes` option to skip sensitive files.
|
534
301
|
|
535
|
-
|
536
|
-
```bash
|
537
|
-
docker run mensfeld/llm-docs-builder compare \
|
538
|
-
--url https://karafka.io/docs/Getting-Started/
|
302
|
+
## RAG Enhancement Features
|
539
303
|
|
540
|
-
|
541
|
-
docker run -v $(pwd):/workspace mensfeld/llm-docs-builder compare \
|
542
|
-
--url https://example.com/page.html \
|
543
|
-
--file docs/page.md
|
544
|
-
```
|
304
|
+
### Heading Normalization
|
545
305
|
|
546
|
-
|
547
|
-
```bash
|
548
|
-
docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
|
549
|
-
generate --docs ./docs --output llms.txt
|
550
|
-
```
|
306
|
+
Transform headings to include hierarchical context, making each section self-contained for RAG retrieval:
|
551
307
|
|
552
|
-
**
|
553
|
-
```
|
554
|
-
|
555
|
-
|
556
|
-
|
308
|
+
**Before:**
|
309
|
+
```markdown
|
310
|
+
# Configuration
|
311
|
+
## Consumer Settings
|
312
|
+
### auto_offset_reset
|
557
313
|
|
558
|
-
|
559
|
-
```bash
|
560
|
-
docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
|
561
|
-
bulk-transform --config llm-docs-builder.yml
|
314
|
+
Controls behavior when no offset exists...
|
562
315
|
```
|
563
316
|
|
564
|
-
**
|
565
|
-
```
|
566
|
-
|
567
|
-
|
317
|
+
**After (with `normalize_headings: true`):**
|
318
|
+
```markdown
|
319
|
+
# Configuration
|
320
|
+
## Configuration / Consumer Settings
|
321
|
+
### Configuration / Consumer Settings / auto_offset_reset
|
568
322
|
|
569
|
-
|
570
|
-
validate --docs llms.txt
|
323
|
+
Controls behavior when no offset exists...
|
571
324
|
```
|
572
325
|
|
573
|
-
|
326
|
+
**Why this matters for RAG:** When documents are chunked and retrieved independently, each section retains full context. An LLM seeing just the `auto_offset_reset` section knows it's about "Configuration / Consumer Settings / auto_offset_reset" not just generic "auto_offset_reset".
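The transformation shown above can be approximated with a small stack of ancestor titles. The following is a simplified sketch under stated assumptions (ATX `#` headings only, fenced code blocks skipped); `normalize_headings` here is an illustrative function, not the gem's `HeadingTransformer`.

```ruby
FENCE = '`' * 3 # three backticks, the marker that opens or closes a code fence

# Prefix each heading with the titles of its ancestors, joined by `separator`.
def normalize_headings(markdown, separator: ' / ')
  stack = []       # stack[n] holds the title of the current level n+1 heading
  in_fence = false # true while inside a fenced code block

  markdown.each_line.map do |line|
    in_fence = !in_fence if line.start_with?(FENCE)
    match = in_fence ? nil : line.match(/\A(\#{1,6})\s+(.+?)\s*\z/)
    next line unless match

    level = match[1].length
    stack = stack.first(level - 1) # forget siblings and deeper headings
    stack[level - 1] = match[2]    # record this heading's own title
    "#{match[1]} #{stack.compact.join(separator)}\n"
  end.join
end

doc = "# Configuration\n## Consumer Settings\n### auto_offset_reset\n"
puts normalize_headings(doc)
# prints:
# # Configuration
# ## Configuration / Consumer Settings
# ### Configuration / Consumer Settings / auto_offset_reset
```

Skipped heading levels (an H3 directly under an H1) leave a gap in the stack, which `compact` removes so the separator is never doubled.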
|
574
327
|
|
575
|
-
**GitHub Actions:**
|
576
328
|
```yaml
|
577
|
-
|
578
|
-
|
579
|
-
|
580
|
-
steps:
|
581
|
-
- uses: actions/checkout@v3
|
582
|
-
- name: Transform documentation
|
583
|
-
run: |
|
584
|
-
docker run -v ${{ github.workspace }}:/workspace \
|
585
|
-
mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
|
586
|
-
- name: Measure savings
|
587
|
-
run: |
|
588
|
-
docker run mensfeld/llm-docs-builder \
|
589
|
-
compare --url https://yoursite.com/docs/main.html
|
329
|
+
# Enable in config
|
330
|
+
normalize_headings: true
|
331
|
+
heading_separator: " / " # Customize separator (default: " / ")
|
590
332
|
```
|
591
333
|
|
592
|
-
|
593
|
-
```yaml
|
594
|
-
optimize-docs:
|
595
|
-
image: mensfeld/llm-docs-builder:latest
|
596
|
-
script:
|
597
|
-
- llm-docs-builder bulk-transform --docs ./docs
|
598
|
-
- llm-docs-builder compare --url https://yoursite.com/docs
|
599
|
-
```
|
600
|
-
|
601
|
-
**Jenkins:**
|
602
|
-
```groovy
|
603
|
-
stage('Optimize Documentation') {
|
604
|
-
steps {
|
605
|
-
sh '''
|
606
|
-
docker run -v ${WORKSPACE}:/workspace \
|
607
|
-
mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
|
608
|
-
'''
|
609
|
-
}
|
610
|
-
}
|
611
|
-
```
|
612
|
-
|
613
|
-
### Version Pinning
|
614
|
-
|
615
|
-
```bash
|
616
|
-
# Use specific version
|
617
|
-
docker run mensfeld/llm-docs-builder:0.3.0 version
|
334
|
+
### Enhanced llms.txt Metadata
|
618
335
|
|
619
|
-
|
620
|
-
docker run mensfeld/llm-docs-builder:0 version
|
336
|
+
Generate enriched llms.txt files with token counts, timestamps, and priority labels to help AI agents make better decisions:
|
621
337
|
|
622
|
-
|
623
|
-
|
338
|
+
**Standard llms.txt:**
|
339
|
+
```markdown
|
340
|
+
- [Getting Started](https://myproject.io/docs/Getting-Started.md)
|
341
|
+
- [Configuration](https://myproject.io/docs/Configuration.md)
|
624
342
|
```
|
625
343
|
|
626
|
-
|
627
|
-
|
628
|
-
|
629
|
-
|
630
|
-
|
344
|
+
**Enhanced llms.txt (with metadata enabled):**
|
345
|
+
```markdown
|
346
|
+
- [Getting Started](https://myproject.io/docs/Getting-Started.md) tokens:450 updated:2025-10-13 priority:high
|
347
|
+
- [Configuration](https://myproject.io/docs/Configuration.md) tokens:2800 updated:2025-10-12 priority:high
|
348
|
+
- [Advanced Topics](https://myproject.io/docs/Advanced.md) tokens:5200 updated:2025-09-15 priority:medium
|
631
349
|
```
|
632
350
|
|
633
|
-
**
|
634
|
-
|
635
|
-
|
636
|
-
|
351
|
+
**Benefits:**
|
352
|
+
- AI agents can see token counts → load multiple small docs vs one large doc
|
353
|
+
- Timestamps help prefer recent documentation
|
354
|
+
- Priority signals guide which docs to fetch first
|
355
|
+
- Compression ratios show optimization effectiveness
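To make the entry format concrete, here is a small sketch that assembles one enhanced line from its parts. The `llms_entry` helper is hypothetical; only the output format is taken from the example above.

```ruby
require 'date'

# Build one llms.txt entry, appending only the metadata that is present.
def llms_entry(title:, url:, tokens: nil, updated: nil, priority: nil)
  line = "- [#{title}](#{url})"
  line += " tokens:#{tokens}" if tokens
  line += " updated:#{updated.strftime('%Y-%m-%d')}" if updated
  line += " priority:#{priority}" if priority
  line
end

puts llms_entry(
  title: 'Getting Started',
  url: 'https://myproject.io/docs/Getting-Started.md',
  tokens: 450,
  updated: Date.new(2025, 10, 13),
  priority: :high
)
# prints "- [Getting Started](https://myproject.io/docs/Getting-Started.md) tokens:450 updated:2025-10-13 priority:high"
```

Leaving every optional argument out yields the standard entry form, so one formatter can cover both outputs.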
|
637
356
|
|
638
|
-
|
639
|
-
|
640
|
-
|
357
|
+
```yaml
|
358
|
+
# Enable in config
|
359
|
+
include_metadata: true # Master switch
|
360
|
+
include_tokens: true # Show token counts
|
361
|
+
include_timestamps: true # Show last modified dates
|
362
|
+
include_priority: true # Show priority labels (high/medium/low)
|
363
|
+
calculate_compression: true # Show compression ratios (slower, requires transformation)
|
641
364
|
```
|
642
365
|
|
643
|
-
##
|
644
|
-
|
645
|
-
The [llms.txt specification](https://llmstxt.org/) is a proposed standard for providing LLM-friendly content. It defines a structured format that helps AI systems:
|
646
|
-
|
647
|
-
- Quickly understand project structure
|
648
|
-
- Find relevant documentation efficiently
|
649
|
-
- Navigate complex documentation hierarchies
|
650
|
-
- Access clean, markdown-formatted content
|
366
|
+
## Advanced Compression Options
|
651
367
|
|
652
|
-
|
653
|
-
1. Scanning your documentation directory
|
654
|
-
2. Extracting titles and descriptions from markdown files
|
655
|
-
3. Prioritizing content by importance (README first, then guides, APIs, etc.)
|
656
|
-
4. Formatting everything according to the specification
|
368
|
+
All compression features can be used individually for fine-grained control:
|
657
369
|
|
658
|
-
|
370
|
+
### Content Removal Options
|
659
371
|
|
660
|
-
|
372
|
+
- `remove_frontmatter: true` - Remove YAML/TOML metadata blocks
|
373
|
+
- `remove_comments: true` - Remove HTML comments (`<!-- ... -->`)
|
374
|
+
- `remove_badges: true` - Remove badge/shield images (CI badges, version badges, etc.)
|
375
|
+
- `remove_images: true` - Remove all image syntax
|
376
|
+
- `remove_code_examples: true` - Remove fenced code blocks, indented code, and inline code
|
377
|
+
- `remove_blockquotes: true` - Remove blockquote formatting (preserves content)
|
378
|
+
- `remove_duplicates: true` - Remove duplicate paragraphs using fuzzy matching
|
379
|
+
- `remove_stopwords: true` - Remove common stopwords from prose (preserves code blocks)
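Two of the removal options above are simple enough to sketch with regular expressions. These are illustrative approximations, not the gem's code: the frontmatter version handles only YAML blocks delimited by `---` (not TOML `+++`), and the comment version does not protect comments that appear inside code blocks.

```ruby
# Strip a leading YAML frontmatter block delimited by '---' lines.
def remove_frontmatter(markdown)
  markdown.sub(/\A---[ \t]*\n(?:.*?\n)?---[ \t]*\n/m, '')
end

# Strip HTML comments, including ones that span several lines.
def remove_comments(markdown)
  markdown.gsub(/<!--.*?-->/m, '')
end

doc = "---\nlayout: docs\n---\n\n# API Documentation\n<!-- TODO: expand -->\n"
puts remove_comments(remove_frontmatter(doc))
```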
|
661
380
|
|
662
|
-
|
663
|
-
1. Scan directory for `.md` files
|
664
|
-
2. Extract title (first H1) and description (first paragraph)
|
665
|
-
3. Prioritize by importance (README → Getting Started → Guides → API → Other)
|
666
|
-
4. Build formatted llms.txt with links and descriptions
|
381
|
+
### Content Enhancement Options
|
667
382
|
|
668
|
-
|
669
|
-
|
670
|
-
|
671
|
-
|
672
|
-
|
673
|
-
5. Remove badge/shield images
|
674
|
-
6. Normalize excessive whitespace
|
675
|
-
7. Write to new file or overwrite in-place
|
383
|
+
- `generate_toc: true` - Generate table of contents from headings with anchor links
|
384
|
+
- `custom_instruction: "text"` - Inject AI context message at document top
|
385
|
+
- `simplify_links: true` - Simplify verbose link text (e.g., "Click here to see the docs" → "docs")
|
386
|
+
- `convert_urls: true` - Convert `.html`/`.htm` URLs to `.md` format
|
387
|
+
- `normalize_whitespace: true` - Reduce excessive blank lines and remove trailing whitespace
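As an illustration of the last option, whitespace normalization can be sketched in two substitutions. The cap of two consecutive blank lines is an assumption carried over from the 0.6.0 README ("3+ → 2 max"); the function below is not the gem's `WhitespaceTransformer`.

```ruby
# Remove trailing whitespace and cap runs of blank lines at two.
def normalize_whitespace(markdown)
  markdown
    .gsub(/[ \t]+$/, '')      # drop trailing spaces and tabs on every line
    .gsub(/\n{4,}/, "\n\n\n") # 3+ blank lines (4+ newlines) become 2 blank lines
end

normalize_whitespace("a  \n\n\n\n\nb\t\n") # "a\n\n\nb\n"
```

Note that stripping trailing spaces also removes Markdown hard line breaks written as two trailing spaces, which is usually acceptable for LLM-facing output.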
|
676
388
|
|
677
|
-
|
678
|
-
1. Fetch URL with human User-Agent (or read local file)
|
679
|
-
2. Fetch same URL with AI bot User-Agent
|
680
|
-
3. Calculate size difference and reduction percentage
|
681
|
-
4. Estimate token counts using character-based heuristic
|
682
|
-
5. Display human-readable comparison results with byte and token savings
|
389
|
+
### Example Usage
|
683
390
|
|
684
|
-
|
685
|
-
|
686
|
-
|
687
|
-
|
688
|
-
|
689
|
-
|
690
|
-
|
691
|
-
|
692
|
-
|
693
|
-
|
694
|
-
|
695
|
-
|
696
|
-
|
697
|
-
**Q: Can I use this in my build pipeline?**
|
698
|
-
|
699
|
-
Yes. Use `suffix: ""` for in-place transformation. The Karafka framework does this - they transform their markdown as part of their deployment process.
|
700
|
-
|
701
|
-
**Q: How do I know if it's working?**
|
702
|
-
|
703
|
-
Use the `compare` command to measure before and after. It shows exact byte counts, reduction percentages, and compression factors.
|
704
|
-
|
705
|
-
**Q: Does this work with static site generators?**
|
706
|
-
|
707
|
-
Yes. You can transform markdown files before your static site generator processes them, or serve separate `.llm.md` versions alongside your generated HTML.
|
391
|
+
```ruby
|
392
|
+
# Fine-grained control
|
393
|
+
LlmDocsBuilder.transform_markdown(
|
394
|
+
'README.md',
|
395
|
+
remove_frontmatter: true,
|
396
|
+
remove_badges: true,
|
397
|
+
remove_images: true,
|
398
|
+
simplify_links: true,
|
399
|
+
generate_toc: true,
|
400
|
+
normalize_whitespace: true
|
401
|
+
)
|
402
|
+
```
|
708
403
|
|
709
|
-
|
404
|
+
Or configure via YAML:
|
710
405
|
|
711
|
-
Use the `excludes` option to skip sensitive files:
|
712
406
|
```yaml
|
713
|
-
|
714
|
-
|
715
|
-
|
716
|
-
|
717
|
-
|
718
|
-
**Q: Can I customize the AI bot detection?**
|
407
|
+
# llm-docs-builder.yml
|
408
|
+
docs: ./docs
|
409
|
+
base_url: https://myproject.io
|
410
|
+
suffix: .llm
|
719
411
|
|
720
|
-
|
412
|
+
# Pick exactly what you need
|
413
|
+
remove_frontmatter: true
|
414
|
+
remove_comments: true
|
415
|
+
remove_badges: true
|
416
|
+
remove_images: true
|
417
|
+
simplify_links: true
|
418
|
+
generate_toc: true
|
419
|
+
normalize_whitespace: true
|
420
|
+
```
|
721
421
|
|
722
422
|
## Contributing
|
723
423
|
|