llm-docs-builder 0.6.0 → 0.7.0
- checksums.yaml +4 -4
- data/.rspec +3 -0
- data/CHANGELOG.md +37 -0
- data/Gemfile.lock +1 -1
- data/README.md +182 -555
- data/bin/rspecs +2 -1
- data/lib/llm_docs_builder/cli.rb +1 -62
- data/lib/llm_docs_builder/comparator.rb +4 -16
- data/lib/llm_docs_builder/config.rb +42 -5
- data/lib/llm_docs_builder/markdown_transformer.rb +54 -128
- data/lib/llm_docs_builder/output_formatter.rb +93 -0
- data/lib/llm_docs_builder/parser.rb +1 -59
- data/lib/llm_docs_builder/text_compressor.rb +164 -0
- data/lib/llm_docs_builder/token_estimator.rb +52 -0
- data/lib/llm_docs_builder/transformers/base_transformer.rb +30 -0
- data/lib/llm_docs_builder/transformers/content_cleanup_transformer.rb +106 -0
- data/lib/llm_docs_builder/transformers/enhancement_transformer.rb +95 -0
- data/lib/llm_docs_builder/transformers/link_transformer.rb +84 -0
- data/lib/llm_docs_builder/transformers/whitespace_transformer.rb +44 -0
- data/lib/llm_docs_builder/version.rb +1 -1
- metadata +10 -3
- data/CLAUDE.md +0 -178
- data/llm-docs-builder.yml +0 -7
data/README.md
CHANGED
@@ -5,34 +5,21 @@
 
 **Optimize your documentation for LLMs and RAG systems. Reduce token consumption by 67-95%.**
 
-llm-docs-builder
+llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, and optimizes documents for LLM context windows.
 
 ## The Problem
 
 When LLMs fetch documentation, they typically get HTML pages designed for humans - complete with navigation bars, footers, JavaScript, CSS, and other overhead. This wastes 70-90% of your context window on content that doesn't help answer questions.
 
 **Real example from Karafka documentation:**
-- Human HTML version:
-- AI markdown version:
-- **Result:
-
-With GPT-4's pricing at $2.50 per million input tokens, that's real money saved on every API call. More importantly, you can fit 30x more actual documentation into the same context window.
-
-## What This Tool Does
-
-llm-docs-builder helps you optimize markdown documentation for AI consumption:
-
-1. **Measure Savings** - Compare what your server sends to humans (HTML) vs AI bots (markdown) to quantify context window reduction
-2. **Transform Markdown** - Normalize your markdown files with absolute links and consistent URL formats for better LLM navigation
-3. **Generate llms.txt** - Create standardized documentation indexes following the [llms.txt](https://llmstxt.org/) specification
-4. **Serve Efficiently** - Configure your server to automatically serve transformed markdown to AI bots while humans get HTML
+- Human HTML version: 104.4 KB (~26,735 tokens)
+- AI markdown version: 21.5 KB (~5,496 tokens)
+- **Result: 79% reduction, 21,239 tokens saved, 5x smaller**
 
 ## Quick Start
 
 ### Measure Your Current Token Waste
 
-Before making any changes, see how much you could save:
-
 ```bash
 # Using Docker (no Ruby installation needed)
 docker pull mensfeld/llm-docs-builder:latest
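The Karafka figures added to the README in the hunk above can be checked with a little arithmetic. The following is a minimal Ruby sketch that uses only the two token counts quoted there; the variable names are illustrative and are not part of the gem.

```ruby
# Token counts quoted in the README's Karafka example.
human_tokens = 26_735 # HTML version served to humans
ai_tokens    = 5_496  # markdown version served to AI bots

tokens_saved  = human_tokens - ai_tokens
reduction_pct = (tokens_saved.to_f / human_tokens * 100).round
factor        = (human_tokens.to_f / ai_tokens).round(1)

puts "Tokens saved: #{tokens_saved}"
puts "Reduction: #{reduction_pct}%"
puts "Factor: #{factor}x smaller"
```

This gives 21,239 tokens saved and a 79% reduction, matching the README. The factor comes out at about 4.9, which the README rounds to 5x.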
@@ -42,267 +29,54 @@ docker run mensfeld/llm-docs-builder compare \
 --url https://yoursite.com/docs/getting-started.html
 ```
 
-
-```
-============================================================
-Context Window Comparison
-============================================================
-
-Human version: 45.2 KB (~11,300 tokens)
-Source: https://yoursite.com/docs/page.html (User-Agent: human)
+### Transform Your Documentation
 
-
-
+```bash
+# Single file
+llm-docs-builder transform --docs README.md
 
-
-
-Token savings: 8,100 tokens (72%)
-Factor: 3.5x smaller
-============================================================
+# Bulk transform with config
+llm-docs-builder bulk-transform --config llm-docs-builder.yml
 ```
 
-This single command shows you the potential ROI before you invest any time in optimization.
-
-### Real-World Results
-
-**[Karafka Framework Documentation](https://karafka.io/docs)** (10 pages analyzed):
-
-| Page | Human HTML | AI Markdown | Reduction | Tokens Saved | Factor |
-|------|-----------|-------------|-----------|--------------|---------|
-| Getting Started | 82.0 KB | 4.1 KB | 95% | ~19,475 | 20.1x |
-| Configuration | 86.3 KB | 7.1 KB | 92% | ~19,800 | 12.1x |
-| Routing | 93.6 KB | 14.7 KB | 84% | ~19,725 | 6.4x |
-| Deployment | 122.1 KB | 33.3 KB | 73% | ~22,200 | 3.7x |
-| Producing Messages | 87.7 KB | 8.3 KB | 91% | ~19,850 | 10.6x |
-| Consuming Messages | 105.3 KB | 21.3 KB | 80% | ~21,000 | 4.9x |
-| Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | ~21,950 | 5.1x |
-| Active Job | 88.7 KB | 8.8 KB | 90% | ~19,975 | 10.1x |
-| Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | ~22,050 | 3.7x |
-| Error Handling | 93.8 KB | 13.1 KB | 86% | ~20,175 | 7.2x |
-
-**Average: 83% reduction, ~20,620 tokens saved per page, 8.4x smaller files**
-
-For a typical RAG system making 1,000 documentation queries per day:
-- **Before**: ~990 KB per day (~247,500 tokens) × 1,000 queries = ~247.5M tokens/day
-- **After**: ~165 KB per day (~41,250 tokens) × 1,000 queries = ~41.25M tokens/day
-- **Savings**: 83% reduction = ~206.25M tokens saved per day
-
-At GPT-4 pricing ($2.50/M input tokens), that's approximately **$500/day or $183,000/year saved** on a documentation site with moderate traffic.
-
 ## Installation
 
-###
-
-No Ruby installation required. Perfect for CI/CD and quick usage:
+### Docker (Recommended)
 
 ```bash
-# Pull the image
 docker pull mensfeld/llm-docs-builder:latest
-
-# Create an alias for convenience
 alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'
-
-# Use like a native command
-llm-docs-builder compare --url https://yoursite.com/docs
 ```
 
-
-
-### Option 2: RubyGems
-
-For Ruby developers or when you need the Ruby API:
+### RubyGems
 
 ```bash
 gem install llm-docs-builder
 ```
 
-
-
-```ruby
-gem 'llm-docs-builder'
-```
-
-## Core Features
-
-### 1. Compare and Measure (The "Before You Start" Tool)
+## Features
 
-
+### Measure and Compare
 
 ```bash
-# Compare what your server sends to humans vs AI
+# Compare what your server sends to humans vs AI
 llm-docs-builder compare --url https://yoursite.com/docs/page.html
 
-# Compare remote HTML with
+# Compare remote HTML with local markdown
 llm-docs-builder compare \
 --url https://yoursite.com/docs/api.html \
 --file docs/api.md
-
-# Verbose mode for debugging
-llm-docs-builder compare --url https://example.com/docs --verbose
 ```
 
-
-- Validates that optimizations actually work
-- Quantifies ROI before you invest time
-- Monitors ongoing effectiveness
-- Provides concrete metrics for stakeholders
-
-### 2. Transform Markdown (The Normalizer)
-
-Normalize your markdown documentation to be LLM-friendly:
-
-**Single file transformation:**
-```bash
-# Expand relative links to absolute URLs
-llm-docs-builder transform \
---docs README.md \
---config llm-docs-builder.yml
-```
-
-**Bulk transformation - two modes:**
-
-**a) Separate files (default)** - Creates `.llm.md` versions alongside originals:
-```yaml
-# llm-docs-builder.yml
-docs: ./docs
-base_url: https://myproject.io
-suffix: .llm # Creates README.llm.md alongside README.md
-convert_urls: true # .html → .md
-remove_comments: true # Remove HTML comments
-remove_badges: true # Remove badge/shield images
-remove_frontmatter: true # Remove YAML/TOML frontmatter
-normalize_whitespace: true # Clean up excessive blank lines
-```
-
-```bash
-llm-docs-builder bulk-transform --config llm-docs-builder.yml
-```
-
-Result:
-```
-docs/
-├── README.md ← Original (for humans)
-├── README.llm.md ← Optimized (for AI)
-├── api.md
-└── api.llm.md
-```
-
-**b) In-place transformation** - Overwrites originals (for build pipelines):
-```yaml
-# llm-docs-builder.yml
-docs: ./docs
-base_url: https://myproject.io
-suffix: "" # Transforms in-place
-convert_urls: true # Convert .html to .md
-remove_comments: true # Remove HTML comments
-remove_badges: true # Remove badge/shield images
-remove_frontmatter: true # Remove YAML/TOML frontmatter
-normalize_whitespace: true # Clean up excessive blank lines
-excludes:
-- "**/private/**"
-```
-
-```bash
-llm-docs-builder bulk-transform --config llm-docs-builder.yml
-```
-
-Perfect for CI/CD where you transform docs before deployment.
-
-**What gets normalized:**
-- **Links**: Relative → Absolute URLs (`./api.md` → `https://yoursite.com/api.md`)
-- **URLs**: HTML → Markdown format (`.html` → `.md`)
-- **Comments**: HTML comments removed (`<!-- ... -->`)
-- **Badges**: Shield/badge images removed (CI badges, version badges, etc.)
-- **Frontmatter**: YAML/TOML metadata removed (Jekyll, Hugo, etc.)
-- **Whitespace**: Excessive blank lines reduced (3+ → 2 max)
-- Clean markdown structure preserved
-- No content modification, just intelligent cleanup
-
-### 3. Generate llms.txt (The Standard)
-
-Create a standardized documentation index following the [llms.txt](https://llmstxt.org/) specification:
-
-```yaml
-# llm-docs-builder.yml
-docs: ./docs
-base_url: https://myproject.io
-title: My Project
-description: A library that does amazing things
-output: llms.txt
-```
+### Generate llms.txt
 
 ```bash
+# Create standardized documentation index
 llm-docs-builder generate --config llm-docs-builder.yml
 ```
 
-**Generated output:**
-```markdown
-# My Project
-
-> A library that does amazing things
-
-## Documentation
-
-- [README](https://myproject.io/README.md): Complete overview and installation
-- [Getting Started](https://myproject.io/getting-started.md): Quick start guide
-- [API Reference](https://myproject.io/api-reference.md): Detailed API documentation
-```
-
-**Smart prioritization:**
-1. README files (always first)
-2. Getting started guides
-3. Tutorials and guides
-4. API references
-5. Other documentation
-
-The llms.txt file serves as an efficient entry point for AI systems to understand your project structure.
-
-### 4. Serve to AI Bots (The Deployment)
-
-After using `bulk-transform` with `suffix: .llm`, configure your web server to automatically serve optimized versions to AI bots:
-
-**Apache (.htaccess):**
-```apache
-# Detect AI bots
-SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt|chatgpt)" IS_LLM_BOT
-SetEnvIf User-Agent "(?i)(perplexity|gemini|copilot|bard)" IS_LLM_BOT
-
-# Serve .llm.md to AI, .md to humans
-RewriteEngine On
-RewriteCond %{ENV:IS_LLM_BOT} !^$
-RewriteCond %{REQUEST_URI} ^/docs/.*\.md$ [NC]
-RewriteRule ^(.*)\.md$ $1.llm.md [L]
-```
-
-**Nginx:**
-```nginx
-map $http_user_agent $is_llm_bot {
-default 0;
-"~*(?i)(openai|anthropic|claude|gpt)" 1;
-"~*(?i)(perplexity|gemini|copilot)" 1;
-}
-
-location ~ ^/docs/(.*)\.md$ {
-if ($is_llm_bot) {
-rewrite ^(.*)\.md$ $1.llm.md last;
-}
-}
-```
-
-**Cloudflare Workers:**
-```javascript
-const isLLMBot = /openai|anthropic|claude|gpt|perplexity/i.test(userAgent);
-if (isLLMBot && url.pathname.startsWith('/docs/')) {
-url.pathname = url.pathname.replace(/\.md$/, '.llm.md');
-}
-```
-
-**Result**: AI systems automatically get optimized versions, humans get the original. No manual switching, no duplicate URLs.
-
 ## Configuration
 
-All commands support both config files and CLI flags. Config files are recommended for consistency:
-
 ```yaml
 # llm-docs-builder.yml
 docs: ./docs
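The server snippets removed in the hunk above all follow the same idea: match the User-Agent against a list of AI bot names, then rewrite `.md` paths under `/docs/` to their `.llm.md` counterparts. As a rough illustration, the same logic can be sketched in plain Ruby. The `path_for` helper and `LLM_BOT_PATTERN` constant are hypothetical names introduced here, not part of llm-docs-builder, and the pattern simply copies the one from the Cloudflare Workers snippet.

```ruby
# Hypothetical helper (not part of llm-docs-builder): mirrors the
# Cloudflare Workers snippet from the old README in plain Ruby.
LLM_BOT_PATTERN = /openai|anthropic|claude|gpt|perplexity/i

# Returns the path to serve: the .llm.md variant for AI bots requesting
# markdown under /docs/, otherwise the path unchanged.
def path_for(user_agent, path)
  return path unless LLM_BOT_PATTERN.match?(user_agent.to_s)
  return path unless path.start_with?('/docs/')

  path.sub(/\.md\z/, '.llm.md')
end

puts path_for('Mozilla/5.0 (compatible; GPTBot/1.0)', '/docs/api.md')
puts path_for('Mozilla/5.0 (X11; Linux x86_64) Firefox/130.0', '/docs/api.md')
```

Substring matching on the User-Agent is a heuristic: it will miss bots that do not identify themselves and can match unrelated agents that happen to contain one of the names.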
@@ -310,123 +84,121 @@ base_url: https://myproject.io
 title: My Project
 description: Brief description
 output: llms.txt
+suffix: .llm
+verbose: false
+
+# Basic options
 convert_urls: true
 remove_comments: true
 remove_badges: true
 remove_frontmatter: true
 normalize_whitespace: true
-
-
+
+# Additional compression options
+remove_code_examples: false
+remove_images: true
+remove_blockquotes: true
+remove_duplicates: true
+remove_stopwords: false
+simplify_links: true
+generate_toc: true
+custom_instruction: "This documentation is optimized for AI consumption"
+
+# Exclusions
 excludes:
 - "**/private/**"
 - "**/drafts/**"
 ```
 
 **Configuration precedence:**
-1. CLI flags (highest
-2. Config file
+1. CLI flags (highest)
+2. Config file
 3. Defaults
 
-
-```bash
-# Uses config file but overrides title
-llm-docs-builder generate --config llm-docs-builder.yml --title "Override Title"
-```
-
-## Docker Usage
-
-All CLI commands work in Docker with the same syntax:
+## CLI Commands
 
 ```bash
-#
-
-
-#
-
-
-
+llm-docs-builder compare [options] # Measure token savings
+llm-docs-builder transform [options] # Transform single file
+llm-docs-builder bulk-transform [options] # Transform directory
+llm-docs-builder generate [options] # Generate llms.txt
+llm-docs-builder parse [options] # Parse llms.txt
+llm-docs-builder validate [options] # Validate llms.txt
+llm-docs-builder version # Show version
 ```
 
-**
-
-GitHub Actions:
-```yaml
-- name: Generate llms.txt
-run: |
-docker run -v ${{ github.workspace }}:/workspace \
-mensfeld/llm-docs-builder generate --config llm-docs-builder.yml
+**Common options:**
 ```
-
-
-
-
-
-script:
-- llm-docs-builder generate --docs ./docs
+-c, --config PATH Configuration file
+-d, --docs PATH Documentation path
+-o, --output PATH Output file
+-u, --url URL URL for comparison
+-v, --verbose Detailed output
 ```
 
-See [Docker Usage](#detailed-docker-usage) section below for comprehensive examples.
-
 ## Ruby API
 
-For programmatic usage:
-
 ```ruby
 require 'llm_docs_builder'
 
-#
-
-
-# Direct options
-content = LlmDocsBuilder.generate_from_docs('./docs',
+# Transform single file with custom options
+transformed = LlmDocsBuilder.transform_markdown(
+'README.md',
 base_url: 'https://myproject.io',
-
-
-
-
-transformed = LlmDocsBuilder.transform_markdown('README.md',
-base_url: 'https://myproject.io',
-convert_urls: true,
-remove_comments: true,
-remove_badges: true,
-remove_frontmatter: true,
-normalize_whitespace: true
+remove_code_examples: true,
+remove_images: true,
+generate_toc: true,
+custom_instruction: 'AI-optimized documentation'
 )
 
 # Bulk transform
-files = LlmDocsBuilder.bulk_transform(
+files = LlmDocsBuilder.bulk_transform(
+'./docs',
 base_url: 'https://myproject.io',
 suffix: '.llm',
-
-
-remove_frontmatter: true,
-normalize_whitespace: true,
-excludes: ['**/private/**']
+remove_duplicates: true,
+generate_toc: true
 )
 
-#
-
-
+# Generate llms.txt
+content = LlmDocsBuilder.generate_from_docs(
+'./docs',
 base_url: 'https://myproject.io',
-
-remove_badges: true,
-remove_frontmatter: true,
-normalize_whitespace: true
+title: 'My Project'
 )
 ```
 
-##
+## Serving Optimized Docs to AI Bots
 
-
+After using `bulk-transform` with `suffix: .llm`, configure your web server to serve optimized versions to AI bots:
 
-
-
--
+**Apache (.htaccess):**
+```apache
+SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt)" IS_LLM_BOT
+RewriteCond %{ENV:IS_LLM_BOT} !^$
+RewriteRule ^(.*)\.md$ $1.llm.md [L]
+```
 
-**
+**Nginx:**
+```nginx
+map $http_user_agent $is_llm_bot {
+default 0;
+"~*(?i)(openai|anthropic|claude|gpt)" 1;
+}
+
+location ~ ^/docs/(.*)\.md$ {
+if ($is_llm_bot) {
+rewrite ^(.*)\.md$ $1.llm.md last;
+}
+}
+```
 
+## Real-World Results: Karafka Framework
+
+**Before:** 140+ lines of custom transformation code
+
+**After:** 6 lines of configuration
 ```yaml
-# llm-docs-builder.yml
 docs: ./online/docs
 base_url: https://karafka.io/docs
 convert_urls: true
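The configuration precedence listed in the hunk above (CLI flags, then config file, then defaults) is the usual layered-merge pattern. Below is a minimal Ruby sketch of that pattern. `DEFAULTS` and `resolve_options` are hypothetical names used for illustration only, and the default values shown are assumptions taken from the old README's configuration table rather than from the gem's source.

```ruby
# Hypothetical defaults for illustration; the gem's real defaults may differ.
DEFAULTS = { output: 'llms.txt', suffix: '.llm', verbose: false }.freeze

# Later hashes win: defaults < config file < CLI flags.
# CLI flags that were not given (nil) are dropped so they do not
# mask values coming from the config file.
def resolve_options(file_config, cli_flags)
  DEFAULTS.merge(file_config).merge(cli_flags.compact)
end

file_config = { title: 'My Project', suffix: '' }
cli_flags   = { title: 'Override Title', verbose: nil }

p resolve_options(file_config, cli_flags)
```

Here the title comes from the CLI flag, the empty suffix from the config file, and `output` and `verbose` fall back to the defaults.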
@@ -434,290 +206,145 @@ remove_comments: true
 remove_badges: true
 remove_frontmatter: true
 normalize_whitespace: true
-suffix: "" # In-place
-excludes:
-- "**/Enterprise-License-Setup/**"
-```
-
-```bash
-# In their deployment script
-llm-docs-builder bulk-transform --config llm-docs-builder.yml
+suffix: "" # In-place for build pipeline
 ```
 
 **Results:**
--
--
--
-- **Automated daily deployments** via GitHub Actions
-
-The compare command revealed that their documentation was consuming 20-36x more tokens than necessary for AI systems. After optimization, RAG queries became dramatically more efficient.
-
-## CLI Reference
-
-```bash
-llm-docs-builder compare [options] # Measure token savings (start here!)
-llm-docs-builder transform [options] # Transform single markdown file
-llm-docs-builder bulk-transform [options] # Transform entire documentation tree
-llm-docs-builder generate [options] # Generate llms.txt index
-llm-docs-builder parse [options] # Parse existing llms.txt
-llm-docs-builder validate [options] # Validate llms.txt format
-llm-docs-builder version # Show version
-```
-
-**Common options:**
-```
--c, --config PATH Configuration file (default: llm-docs-builder.yml)
--d, --docs PATH Documentation directory or file
--o, --output PATH Output file path
--u, --url URL URL for comparison
--f, --file PATH Local file for comparison
--v, --verbose Detailed output
--h, --help Show help
-```
-
-For advanced options (base_url, title, suffix, excludes, convert_urls), use a config file.
+- 93% average token reduction
+- 20-36x smaller files
+- Automated via GitHub Actions
 
-##
-
-Retrieval-Augmented Generation (RAG) systems fetch documentation to answer questions. Every byte of overhead in those documents:
-
-1. **Costs money** - More tokens = higher API costs
-2. **Reduces capacity** - Less room for actual documentation in context window
-3. **Slows responses** - More tokens to process = longer response times
-4. **Degrades quality** - Navigation noise can confuse the model
-
-llm-docs-builder addresses all four issues by transforming markdown to be AI-friendly and enabling your server to automatically serve it to AI bots while humans get HTML.
-
-**The JavaScript Problem:**
-
-Many documentation sites rely on JavaScript for rendering. AI crawlers typically don't execute JavaScript, so they either:
-- Get incomplete content
-- Get server-side rendered HTML (bloated with framework overhead)
-- Fail entirely
-
-By detecting AI bots and serving them clean markdown instead of HTML, you sidestep this problem entirely.
-
-## Configuration Reference
-
-| Option | Type | Default | Description |
-|--------|------|---------|-------------|
-| `docs` | String | `./docs` | Documentation directory or file |
-| `base_url` | String | - | Base URL for absolute links (e.g., `https://myproject.io`) |
-| `title` | String | Auto-detected | Project title |
-| `description` | String | Auto-detected | Project description |
-| `output` | String | `llms.txt` | Output filename for llms.txt generation |
-| `convert_urls` | Boolean | `false` | Convert `.html`/`.htm` to `.md` |
-| `remove_comments` | Boolean | `false` | Remove HTML comments (`<!-- ... -->`) |
-| `remove_badges` | Boolean | `false` | Remove badge/shield images (CI, version, etc.) |
-| `remove_frontmatter` | Boolean | `false` | Remove YAML/TOML frontmatter (Jekyll, Hugo) |
-| `normalize_whitespace` | Boolean | `false` | Normalize excessive blank lines and trailing spaces |
-| `suffix` | String | `.llm` | Suffix for transformed files (use `""` for in-place) |
-| `excludes` | Array | `[]` | Glob patterns to exclude |
-| `verbose` | Boolean | `false` | Enable detailed output |
-
-## Detailed Docker Usage
-
-### Installation and Setup
+## Docker Usage
 
 ```bash
-# Pull
+# Pull image
 docker pull mensfeld/llm-docs-builder:latest
 
-#
-docker pull ghcr.io/mensfeld/llm-docs-builder:latest
-
-# Create an alias for convenience
-alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'
-```
-
-### Common Commands
-
-**Compare (no volume mount needed for remote URLs):**
-```bash
+# Compare (no volume needed for remote URLs)
 docker run mensfeld/llm-docs-builder compare \
---url https://
-
-# With local file
-docker run -v $(pwd):/workspace mensfeld/llm-docs-builder compare \
---url https://example.com/page.html \
---file docs/page.md
-```
+--url https://yoursite.com/docs
 
-
-```bash
-docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
-generate --docs ./docs --output llms.txt
-```
-
-**Transform single file:**
-```bash
-docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
-transform --docs README.md --config llm-docs-builder.yml
-```
-
-**Bulk transform:**
-```bash
+# Transform with volume mount
 docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
 bulk-transform --config llm-docs-builder.yml
 ```
 
-**
-```bash
-docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
-parse --docs llms.txt --verbose
-
-docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
-validate --docs llms.txt
-```
-
-### CI/CD Examples
-
-**GitHub Actions:**
-```yaml
-jobs:
-optimize-docs:
-runs-on: ubuntu-latest
-steps:
-- uses: actions/checkout@v3
-- name: Transform documentation
-run: |
-docker run -v ${{ github.workspace }}:/workspace \
-mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
-- name: Measure savings
-run: |
-docker run mensfeld/llm-docs-builder \
-compare --url https://yoursite.com/docs/main.html
-```
-
-**GitLab CI:**
+**CI/CD Example (GitHub Actions):**
 ```yaml
-
-
-
-
-- llm-docs-builder compare --url https://yoursite.com/docs
-```
-
-**Jenkins:**
-```groovy
-stage('Optimize Documentation') {
-steps {
-sh '''
-docker run -v ${WORKSPACE}:/workspace \
-mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
-'''
-}
-}
+- name: Optimize documentation
+run: |
+docker run -v ${{ github.workspace }}:/workspace \
+mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
 ```
 
-
+## Compression Examples
 
-
-
-
+**Input markdown:**
+```markdown
+---
+layout: docs
+---
 
-#
-docker run mensfeld/llm-docs-builder:0 version
+# API Documentation
 
-
-docker run mensfeld/llm-docs-builder:latest version
-```
+[](https://ci.com)
 
-
+> Important: This is a note
 
-
-```powershell
-docker run -v ${PWD}:/workspace mensfeld/llm-docs-builder generate --docs ./docs
-```
+[Click here to see the complete API documentation](./api.md)
 
-
-
-docker run -v %cd%:/workspace mensfeld/llm-docs-builder generate --docs ./docs
+```ruby
+api = API.new
 ```
 
-
-```bash
-docker run -v $(pwd):/workspace mensfeld/llm-docs-builder generate --docs ./docs
+
 ```
 
-
-
-
-
-- Quickly understand project structure
-- Find relevant documentation efficiently
-- Navigate complex documentation hierarchies
-- Access clean, markdown-formatted content
-
-llm-docs-builder generates llms.txt files automatically by:
-1. Scanning your documentation directory
-2. Extracting titles and descriptions from markdown files
-3. Prioritizing content by importance (README first, then guides, APIs, etc.)
-4. Formatting everything according to the specification
+**After transformation (with default options):**
+```markdown
+# API Documentation
 
-
+[complete API documentation](./api.md)
 
-
+```ruby
+api = API.new
+```
+```
 
-**
-1. Scan directory for `.md` files
-2. Extract title (first H1) and description (first paragraph)
-3. Prioritize by importance (README → Getting Started → Guides → API → Other)
-4. Build formatted llms.txt with links and descriptions
+**Token reduction:** ~40-60% depending on configuration
 
-
-1. Remove frontmatter (YAML/TOML metadata)
-2. Expand relative links to absolute URLs
-3. Convert `.html` URLs to `.md`
-4. Remove HTML comments
-5. Remove badge/shield images
-6. Normalize excessive whitespace
-7. Write to new file or overwrite in-place
+## FAQ
 
-**
-
-2. Fetch same URL with AI bot User-Agent
-3. Calculate size difference and reduction percentage
-4. Estimate token counts using character-based heuristic
-5. Display human-readable comparison results with byte and token savings
+**Q: Do I need to use llms.txt?**
+No. The compare and transform commands work independently.
 
-**
-
+**Q: Will this change how humans see my docs?**
+Not with default `suffix: .llm`. Separate files are served only to AI bots.
 
-
+**Q: Can I use this in my build pipeline?**
+Yes. Use `suffix: ""` for in-place transformation.
 
-**Q:
+**Q: How do I know if it's working?**
+Use `llm-docs-builder compare` to measure before and after.
 
-
+**Q: What about private documentation?**
+Use the `excludes` option to skip sensitive files.
 
-
+## Advanced Compression Options
 
-
+All compression features can be used individually for fine-grained control:
 
-
+### Content Removal Options
 
-
+- `remove_frontmatter: true` - Remove YAML/TOML metadata blocks
+- `remove_comments: true` - Remove HTML comments (`<!-- ... -->`)
+- `remove_badges: true` - Remove badge/shield images (CI badges, version badges, etc.)
+- `remove_images: true` - Remove all image syntax
+- `remove_code_examples: true` - Remove fenced code blocks, indented code, and inline code
+- `remove_blockquotes: true` - Remove blockquote formatting (preserves content)
+- `remove_duplicates: true` - Remove duplicate paragraphs using fuzzy matching
+- `remove_stopwords: true` - Remove common stopwords from prose (preserves code blocks)
 
-
+### Content Enhancement Options
 
-
+- `generate_toc: true` - Generate table of contents from headings with anchor links
+- `custom_instruction: "text"` - Inject AI context message at document top
+- `simplify_links: true` - Simplify verbose link text (e.g., "Click here to see the docs" → "docs")
+- `convert_urls: true` - Convert `.html`/`.htm` URLs to `.md` format
+- `normalize_whitespace: true` - Reduce excessive blank lines and remove trailing whitespace
 
-
+### Example Usage
 
-
+```ruby
+# Fine-grained control
+LlmDocsBuilder.transform_markdown(
+'README.md',
+remove_frontmatter: true,
+remove_badges: true,
+remove_images: true,
+simplify_links: true,
+generate_toc: true,
+normalize_whitespace: true
+)
+```
 
-
+Or configure via YAML:
 
-Use the `excludes` option to skip sensitive files:
 ```yaml
-
-
-
-
-
-**Q: Can I customize the AI bot detection?**
+# llm-docs-builder.yml
+docs: ./docs
+base_url: https://myproject.io
+suffix: .llm
 
-
+# Pick exactly what you need
+remove_frontmatter: true
+remove_comments: true
+remove_badges: true
+remove_images: true
+simplify_links: true
+generate_toc: true
+normalize_whitespace: true
+```
 
 ## Contributing
 
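To make the content removal options added in the hunk above more concrete, here is a rough Ruby sketch of what four of them could look like as plain regular-expression passes. It is not the gem's implementation: `sketch_transform` is a hypothetical name, only YAML frontmatter is handled, fenced code blocks get no special treatment, and collapsing to a single blank line is an assumption about what "excessive" means.

```ruby
# Illustrative only: rough regex equivalents of four options,
# not llm-docs-builder's actual code.
def sketch_transform(markdown)
  text = markdown.dup
  text = text.sub(/\A---[ \t]*\n.*?\n---[ \t]*\n/m, '') # remove_frontmatter (YAML only)
  text = text.gsub(/<!--.*?-->/m, '')                   # remove_comments
  text = text.gsub(/^>[ \t]?/, '')                      # remove_blockquotes (keeps the text)
  text = text.gsub(/[ \t]+$/, '')                       # normalize_whitespace: trailing spaces
  text = text.gsub(/\n{3,}/, "\n\n")                    # normalize_whitespace: extra blank lines
  text.strip + "\n"
end

input = <<~MD
  ---
  layout: docs
  ---

  # API Documentation

  <!-- internal note -->

  > Important: This is a note
MD

puts sketch_transform(input)
```

Applying the passes in a fixed order matters: frontmatter is stripped first because its regex is anchored to the start of the file, and whitespace is normalized last so it can clean up the gaps left behind by the earlier removals.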