llm-docs-builder 0.6.0 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -5,34 +5,21 @@
5
5
 
6
6
  **Optimize your documentation for LLMs and RAG systems. Reduce token consumption by 67-95%.**
7
7
 
8
- llm-docs-builder normalizes markdown documentation to be AI-friendly and generates llms.txt files. Transform relative links to absolute URLs, measure token savings when serving markdown vs HTML, and create standardized documentation indexes that help LLMs navigate your project.
8
+ llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, and optimizes documents for LLM context windows.
9
9
 
10
10
  ## The Problem
11
11
 
12
12
  When LLMs fetch documentation, they typically get HTML pages designed for humans - complete with navigation bars, footers, JavaScript, CSS, and other overhead. This wastes 70-90% of your context window on content that doesn't help answer questions.
13
13
 
14
14
  **Real example from Karafka documentation:**
15
- - Human HTML version: 82.0 KB (~20,500 tokens)
16
- - AI markdown version: 4.1 KB (~1,025 tokens)
17
- - **Result: 95% reduction, 19,475 tokens saved, 20x smaller**
18
-
19
- With GPT-4's pricing at $2.50 per million input tokens, that's real money saved on every API call. More importantly, you can fit 30x more actual documentation into the same context window.
20
-
21
- ## What This Tool Does
22
-
23
- llm-docs-builder helps you optimize markdown documentation for AI consumption:
24
-
25
- 1. **Measure Savings** - Compare what your server sends to humans (HTML) vs AI bots (markdown) to quantify context window reduction
26
- 2. **Transform Markdown** - Normalize your markdown files with absolute links and consistent URL formats for better LLM navigation
27
- 3. **Generate llms.txt** - Create standardized documentation indexes following the [llms.txt](https://llmstxt.org/) specification
28
- 4. **Serve Efficiently** - Configure your server to automatically serve transformed markdown to AI bots while humans get HTML
15
+ - Human HTML version: 104.4 KB (~26,735 tokens)
16
+ - AI markdown version: 21.5 KB (~5,496 tokens)
17
+ - **Result: 79% reduction, 21,239 tokens saved, 5x smaller**
29
18
 
30
19
  ## Quick Start
31
20
 
32
21
  ### Measure Your Current Token Waste
33
22
 
34
- Before making any changes, see how much you could save:
35
-
36
23
  ```bash
37
24
  # Using Docker (no Ruby installation needed)
38
25
  docker pull mensfeld/llm-docs-builder:latest
@@ -42,267 +29,54 @@ docker run mensfeld/llm-docs-builder compare \
42
29
  --url https://yoursite.com/docs/getting-started.html
43
30
  ```
44
31
 
45
- **Example output:**
46
- ```
47
- ============================================================
48
- Context Window Comparison
49
- ============================================================
50
-
51
- Human version: 45.2 KB (~11,300 tokens)
52
- Source: https://yoursite.com/docs/page.html (User-Agent: human)
32
+ ### Transform Your Documentation
53
33
 
54
- AI version: 12.8 KB (~3,200 tokens)
55
- Source: https://yoursite.com/docs/page.html (User-Agent: AI)
34
+ ```bash
35
+ # Single file
36
+ llm-docs-builder transform --docs README.md
56
37
 
57
- ------------------------------------------------------------
58
- Reduction: 32.4 KB (72%)
59
- Token savings: 8,100 tokens (72%)
60
- Factor: 3.5x smaller
61
- ============================================================
38
+ # Bulk transform with config
39
+ llm-docs-builder bulk-transform --config llm-docs-builder.yml
62
40
  ```
63
41
 
64
- This single command shows you the potential ROI before you invest any time in optimization.
65
-
66
- ### Real-World Results
67
-
68
- **[Karafka Framework Documentation](https://karafka.io/docs)** (10 pages analyzed):
69
-
70
- | Page | Human HTML | AI Markdown | Reduction | Tokens Saved | Factor |
71
- |------|-----------|-------------|-----------|--------------|---------|
72
- | Getting Started | 82.0 KB | 4.1 KB | 95% | ~19,475 | 20.1x |
73
- | Configuration | 86.3 KB | 7.1 KB | 92% | ~19,800 | 12.1x |
74
- | Routing | 93.6 KB | 14.7 KB | 84% | ~19,725 | 6.4x |
75
- | Deployment | 122.1 KB | 33.3 KB | 73% | ~22,200 | 3.7x |
76
- | Producing Messages | 87.7 KB | 8.3 KB | 91% | ~19,850 | 10.6x |
77
- | Consuming Messages | 105.3 KB | 21.3 KB | 80% | ~21,000 | 4.9x |
78
- | Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | ~21,950 | 5.1x |
79
- | Active Job | 88.7 KB | 8.8 KB | 90% | ~19,975 | 10.1x |
80
- | Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | ~22,050 | 3.7x |
81
- | Error Handling | 93.8 KB | 13.1 KB | 86% | ~20,175 | 7.2x |
82
-
83
- **Average: 83% reduction, ~20,620 tokens saved per page, 8.4x smaller files**
84
-
85
- For a typical RAG system making 1,000 documentation queries per day:
86
- - **Before**: ~990 KB per day (~247,500 tokens) × 1,000 queries = ~247.5M tokens/day
87
- - **After**: ~165 KB per day (~41,250 tokens) × 1,000 queries = ~41.25M tokens/day
88
- - **Savings**: 83% reduction = ~206.25M tokens saved per day
89
-
90
- At GPT-4 pricing ($2.50/M input tokens), that's approximately **$500/day or $183,000/year saved** on a documentation site with moderate traffic.
91
-
92
42
  ## Installation
93
43
 
94
- ### Option 1: Docker (Recommended)
95
-
96
- No Ruby installation required. Perfect for CI/CD and quick usage:
44
+ ### Docker (Recommended)
97
45
 
98
46
  ```bash
99
- # Pull the image
100
47
  docker pull mensfeld/llm-docs-builder:latest
101
-
102
- # Create an alias for convenience
103
48
  alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'
104
-
105
- # Use like a native command
106
- llm-docs-builder compare --url https://yoursite.com/docs
107
49
  ```
108
50
 
109
- Multi-architecture support (amd64/arm64), ~50MB image size.
110
-
111
- ### Option 2: RubyGems
112
-
113
- For Ruby developers or when you need the Ruby API:
51
+ ### RubyGems
114
52
 
115
53
  ```bash
116
54
  gem install llm-docs-builder
117
55
  ```
118
56
 
119
- Or add to your Gemfile:
120
-
121
- ```ruby
122
- gem 'llm-docs-builder'
123
- ```
124
-
125
- ## Core Features
126
-
127
- ### 1. Compare and Measure (The "Before You Start" Tool)
57
+ ## Features
128
58
 
129
- Quantify exactly how much context window you're wasting:
59
+ ### Measure and Compare
130
60
 
131
61
  ```bash
132
- # Compare what your server sends to humans vs AI bots
62
+ # Compare what your server sends to humans vs AI
133
63
  llm-docs-builder compare --url https://yoursite.com/docs/page.html
134
64
 
135
- # Compare remote HTML with your local markdown
65
+ # Compare remote HTML with local markdown
136
66
  llm-docs-builder compare \
137
67
  --url https://yoursite.com/docs/api.html \
138
68
  --file docs/api.md
139
-
140
- # Verbose mode for debugging
141
- llm-docs-builder compare --url https://example.com/docs --verbose
142
69
  ```
143
70
 
144
- **Why this matters:**
145
- - Validates that optimizations actually work
146
- - Quantifies ROI before you invest time
147
- - Monitors ongoing effectiveness
148
- - Provides concrete metrics for stakeholders
149
-
150
- ### 2. Transform Markdown (The Normalizer)
151
-
152
- Normalize your markdown documentation to be LLM-friendly:
153
-
154
- **Single file transformation:**
155
- ```bash
156
- # Expand relative links to absolute URLs
157
- llm-docs-builder transform \
158
- --docs README.md \
159
- --config llm-docs-builder.yml
160
- ```
161
-
162
- **Bulk transformation - two modes:**
163
-
164
- **a) Separate files (default)** - Creates `.llm.md` versions alongside originals:
165
- ```yaml
166
- # llm-docs-builder.yml
167
- docs: ./docs
168
- base_url: https://myproject.io
169
- suffix: .llm # Creates README.llm.md alongside README.md
170
- convert_urls: true # .html → .md
171
- remove_comments: true # Remove HTML comments
172
- remove_badges: true # Remove badge/shield images
173
- remove_frontmatter: true # Remove YAML/TOML frontmatter
174
- normalize_whitespace: true # Clean up excessive blank lines
175
- ```
176
-
177
- ```bash
178
- llm-docs-builder bulk-transform --config llm-docs-builder.yml
179
- ```
180
-
181
- Result:
182
- ```
183
- docs/
184
- ├── README.md ← Original (for humans)
185
- ├── README.llm.md ← Optimized (for AI)
186
- ├── api.md
187
- └── api.llm.md
188
- ```
189
-
190
- **b) In-place transformation** - Overwrites originals (for build pipelines):
191
- ```yaml
192
- # llm-docs-builder.yml
193
- docs: ./docs
194
- base_url: https://myproject.io
195
- suffix: "" # Transforms in-place
196
- convert_urls: true # Convert .html to .md
197
- remove_comments: true # Remove HTML comments
198
- remove_badges: true # Remove badge/shield images
199
- remove_frontmatter: true # Remove YAML/TOML frontmatter
200
- normalize_whitespace: true # Clean up excessive blank lines
201
- excludes:
202
- - "**/private/**"
203
- ```
204
-
205
- ```bash
206
- llm-docs-builder bulk-transform --config llm-docs-builder.yml
207
- ```
208
-
209
- Perfect for CI/CD where you transform docs before deployment.
210
-
211
- **What gets normalized:**
212
- - **Links**: Relative → Absolute URLs (`./api.md` → `https://yoursite.com/api.md`)
213
- - **URLs**: HTML → Markdown format (`.html` → `.md`)
214
- - **Comments**: HTML comments removed (`<!-- ... -->`)
215
- - **Badges**: Shield/badge images removed (CI badges, version badges, etc.)
216
- - **Frontmatter**: YAML/TOML metadata removed (Jekyll, Hugo, etc.)
217
- - **Whitespace**: Excessive blank lines reduced (3+ → 2 max)
218
- - Clean markdown structure preserved
219
- - No content modification, just intelligent cleanup
220
-
221
- ### 3. Generate llms.txt (The Standard)
222
-
223
- Create a standardized documentation index following the [llms.txt](https://llmstxt.org/) specification:
224
-
225
- ```yaml
226
- # llm-docs-builder.yml
227
- docs: ./docs
228
- base_url: https://myproject.io
229
- title: My Project
230
- description: A library that does amazing things
231
- output: llms.txt
232
- ```
71
+ ### Generate llms.txt
233
72
 
234
73
  ```bash
74
+ # Create standardized documentation index
235
75
  llm-docs-builder generate --config llm-docs-builder.yml
236
76
  ```
237
77
 
238
- **Generated output:**
239
- ```markdown
240
- # My Project
241
-
242
- > A library that does amazing things
243
-
244
- ## Documentation
245
-
246
- - [README](https://myproject.io/README.md): Complete overview and installation
247
- - [Getting Started](https://myproject.io/getting-started.md): Quick start guide
248
- - [API Reference](https://myproject.io/api-reference.md): Detailed API documentation
249
- ```
250
-
251
- **Smart prioritization:**
252
- 1. README files (always first)
253
- 2. Getting started guides
254
- 3. Tutorials and guides
255
- 4. API references
256
- 5. Other documentation
257
-
258
- The llms.txt file serves as an efficient entry point for AI systems to understand your project structure.
259
-
260
- ### 4. Serve to AI Bots (The Deployment)
261
-
262
- After using `bulk-transform` with `suffix: .llm`, configure your web server to automatically serve optimized versions to AI bots:
263
-
264
- **Apache (.htaccess):**
265
- ```apache
266
- # Detect AI bots
267
- SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt|chatgpt)" IS_LLM_BOT
268
- SetEnvIf User-Agent "(?i)(perplexity|gemini|copilot|bard)" IS_LLM_BOT
269
-
270
- # Serve .llm.md to AI, .md to humans
271
- RewriteEngine On
272
- RewriteCond %{ENV:IS_LLM_BOT} !^$
273
- RewriteCond %{REQUEST_URI} ^/docs/.*\.md$ [NC]
274
- RewriteRule ^(.*)\.md$ $1.llm.md [L]
275
- ```
276
-
277
- **Nginx:**
278
- ```nginx
279
- map $http_user_agent $is_llm_bot {
280
- default 0;
281
- "~*(?i)(openai|anthropic|claude|gpt)" 1;
282
- "~*(?i)(perplexity|gemini|copilot)" 1;
283
- }
284
-
285
- location ~ ^/docs/(.*)\.md$ {
286
- if ($is_llm_bot) {
287
- rewrite ^(.*)\.md$ $1.llm.md last;
288
- }
289
- }
290
- ```
291
-
292
- **Cloudflare Workers:**
293
- ```javascript
294
- const isLLMBot = /openai|anthropic|claude|gpt|perplexity/i.test(userAgent);
295
- if (isLLMBot && url.pathname.startsWith('/docs/')) {
296
- url.pathname = url.pathname.replace(/\.md$/, '.llm.md');
297
- }
298
- ```
299
-
300
- **Result**: AI systems automatically get optimized versions, humans get the original. No manual switching, no duplicate URLs.
301
-
302
78
  ## Configuration
303
79
 
304
- All commands support both config files and CLI flags. Config files are recommended for consistency:
305
-
306
80
  ```yaml
307
81
  # llm-docs-builder.yml
308
82
  docs: ./docs
@@ -310,123 +84,121 @@ base_url: https://myproject.io
310
84
  title: My Project
311
85
  description: Brief description
312
86
  output: llms.txt
87
+ suffix: .llm
88
+ verbose: false
89
+
90
+ # Basic options
313
91
  convert_urls: true
314
92
  remove_comments: true
315
93
  remove_badges: true
316
94
  remove_frontmatter: true
317
95
  normalize_whitespace: true
318
- suffix: .llm
319
- verbose: false
96
+
97
+ # Additional compression options
98
+ remove_code_examples: false
99
+ remove_images: true
100
+ remove_blockquotes: true
101
+ remove_duplicates: true
102
+ remove_stopwords: false
103
+ simplify_links: true
104
+ generate_toc: true
105
+ custom_instruction: "This documentation is optimized for AI consumption"
106
+
107
+ # Exclusions
320
108
  excludes:
321
109
  - "**/private/**"
322
110
  - "**/drafts/**"
323
111
  ```
324
112
 
325
113
  **Configuration precedence:**
326
- 1. CLI flags (highest priority)
327
- 2. Config file values
114
+ 1. CLI flags (highest)
115
+ 2. Config file
328
116
  3. Defaults
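For example, a value passed as a CLI flag wins over the same key in the config file (a hedged sketch; `./other-docs` is a placeholder path):

```bash
# llm-docs-builder.yml supplies the defaults; --docs overrides its docs setting
llm-docs-builder generate --config llm-docs-builder.yml --docs ./other-docs
```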
329
117
 
330
- **Example of overriding:**
331
- ```bash
332
- # Uses config file but overrides title
333
- llm-docs-builder generate --config llm-docs-builder.yml --title "Override Title"
334
- ```
335
-
336
- ## Docker Usage
337
-
338
- All CLI commands work in Docker with the same syntax:
118
+ ## CLI Commands
339
119
 
340
120
  ```bash
341
- # Basic pattern
342
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder [command] [options]
343
-
344
- # Examples
345
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder generate --docs ./docs
346
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder transform --docs README.md
347
- docker run mensfeld/llm-docs-builder compare --url https://example.com/docs
121
+ llm-docs-builder compare [options] # Measure token savings
122
+ llm-docs-builder transform [options] # Transform single file
123
+ llm-docs-builder bulk-transform [options] # Transform directory
124
+ llm-docs-builder generate [options] # Generate llms.txt
125
+ llm-docs-builder parse [options] # Parse llms.txt
126
+ llm-docs-builder validate [options] # Validate llms.txt
127
+ llm-docs-builder version # Show version
348
128
  ```
349
129
 
350
- **CI/CD Integration:**
351
-
352
- GitHub Actions:
353
- ```yaml
354
- - name: Generate llms.txt
355
- run: |
356
- docker run -v ${{ github.workspace }}:/workspace \
357
- mensfeld/llm-docs-builder generate --config llm-docs-builder.yml
130
+ **Common options:**
358
131
  ```
359
-
360
- GitLab CI:
361
- ```yaml
362
- generate-llms:
363
- image: mensfeld/llm-docs-builder:latest
364
- script:
365
- - llm-docs-builder generate --docs ./docs
132
+ -c, --config PATH Configuration file
133
+ -d, --docs PATH Documentation path
134
+ -o, --output PATH Output file
135
+ -u, --url URL URL for comparison
136
+ -v, --verbose Detailed output
366
137
  ```
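For instance, flags can be combined with a config file, with the flags taking precedence (a hedged sketch; the URL is a placeholder):

```bash
# Verbose comparison; remaining settings come from the config file
llm-docs-builder compare -c llm-docs-builder.yml -u https://yoursite.com/docs/api.html -v
```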
367
138
 
368
- See [Docker Usage](#detailed-docker-usage) section below for comprehensive examples.
369
-
370
139
  ## Ruby API
371
140
 
372
- For programmatic usage:
373
-
374
141
  ```ruby
375
142
  require 'llm_docs_builder'
376
143
 
377
- # Using config file
378
- content = LlmDocsBuilder.generate_from_docs(config_file: 'llm-docs-builder.yml')
379
-
380
- # Direct options
381
- content = LlmDocsBuilder.generate_from_docs('./docs',
144
+ # Transform single file with custom options
145
+ transformed = LlmDocsBuilder.transform_markdown(
146
+ 'README.md',
382
147
  base_url: 'https://myproject.io',
383
- title: 'My Project'
384
- )
385
-
386
- # Transform markdown
387
- transformed = LlmDocsBuilder.transform_markdown('README.md',
388
- base_url: 'https://myproject.io',
389
- convert_urls: true,
390
- remove_comments: true,
391
- remove_badges: true,
392
- remove_frontmatter: true,
393
- normalize_whitespace: true
148
+ remove_code_examples: true,
149
+ remove_images: true,
150
+ generate_toc: true,
151
+ custom_instruction: 'AI-optimized documentation'
394
152
  )
395
153
 
396
154
  # Bulk transform
397
- files = LlmDocsBuilder.bulk_transform('./docs',
155
+ files = LlmDocsBuilder.bulk_transform(
156
+ './docs',
398
157
  base_url: 'https://myproject.io',
399
158
  suffix: '.llm',
400
- remove_comments: true,
401
- remove_badges: true,
402
- remove_frontmatter: true,
403
- normalize_whitespace: true,
404
- excludes: ['**/private/**']
159
+ remove_duplicates: true,
160
+ generate_toc: true
405
161
  )
406
162
 
407
- # In-place transformation
408
- files = LlmDocsBuilder.bulk_transform('./docs',
409
- suffix: '', # Empty for in-place
163
+ # Generate llms.txt
164
+ content = LlmDocsBuilder.generate_from_docs(
165
+ './docs',
410
166
  base_url: 'https://myproject.io',
411
- remove_comments: true,
412
- remove_badges: true,
413
- remove_frontmatter: true,
414
- normalize_whitespace: true
167
+ title: 'My Project'
415
168
  )
416
169
  ```
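The generator returns the llms.txt content as a string, so persisting it is plain Ruby (a minimal sketch; the filename mirrors the `output` config option):

```ruby
# Write the generated index wherever your site serves it from
File.write('llms.txt', content)
```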
417
170
 
418
- ## Real-World Case Study: Karafka Framework
171
+ ## Serving Optimized Docs to AI Bots
419
172
 
420
- The [Karafka framework](https://github.com/karafka/karafka) processes millions of Kafka messages daily and maintains extensive documentation. Before llm-docs-builder:
173
+ After using `bulk-transform` with `suffix: .llm`, configure your web server to serve optimized versions to AI bots:
421
174
 
422
- - **140+ lines of custom Ruby code** for link expansion and URL normalization
423
- - Manual maintenance of transformation logic
424
- - No way to measure optimization effectiveness
175
+ **Apache (.htaccess):**
176
+ ```apache
177
+ SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt)" IS_LLM_BOT
178
+ RewriteEngine On
+ RewriteCond %{ENV:IS_LLM_BOT} !^$
179
+ RewriteRule ^(.*)\.md$ $1.llm.md [L]
180
+ ```
425
181
 
426
- **After implementing llm-docs-builder:**
182
+ **Nginx:**
183
+ ```nginx
184
+ map $http_user_agent $is_llm_bot {
185
+ default 0;
186
+ "~*(?i)(openai|anthropic|claude|gpt)" 1;
187
+ }
188
+
189
+ location ~ ^/docs/(.*)\.md$ {
190
+ if ($is_llm_bot) {
191
+ rewrite ^(.*)\.md$ $1.llm.md last;
192
+ }
193
+ }
194
+ ```
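To spot-check the rewrite, request the same page with different User-Agent strings (a hedged sketch; substitute your own domain and any agent string that matches the patterns above):

```bash
# Human-style request: should return the original document
curl -A "Mozilla/5.0" https://yoursite.com/docs/getting-started.md | head

# AI-style request: the agent matches the bot patterns, so the server
# should respond with the getting-started.llm.md variant
curl -A "gpt" https://yoursite.com/docs/getting-started.md | head
```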
427
195
 
196
+ ## Real-World Results: Karafka Framework
197
+
198
+ **Before:** 140+ lines of custom transformation code
199
+
200
+ **After:** 6 lines of configuration
428
201
  ```yaml
429
- # llm-docs-builder.yml
430
202
  docs: ./online/docs
431
203
  base_url: https://karafka.io/docs
432
204
  convert_urls: true
@@ -434,290 +206,145 @@ remove_comments: true
434
206
  remove_badges: true
435
207
  remove_frontmatter: true
436
208
  normalize_whitespace: true
437
- suffix: "" # In-place transformation for build pipeline
438
- excludes:
439
- - "**/Enterprise-License-Setup/**"
440
- ```
441
-
442
- ```bash
443
- # In their deployment script
444
- llm-docs-builder bulk-transform --config llm-docs-builder.yml
209
+ suffix: "" # In-place for build pipeline
445
210
  ```
446
211
 
447
212
  **Results:**
448
- - **140 lines of code → 6 lines of config**
449
- - **93% average token reduction** across all documentation
450
- - **Quantifiable savings** via the compare command
451
- - **Automated daily deployments** via GitHub Actions
452
-
453
- The compare command revealed that their documentation was consuming 20-36x more tokens than necessary for AI systems. After optimization, RAG queries became dramatically more efficient.
454
-
455
- ## CLI Reference
456
-
457
- ```bash
458
- llm-docs-builder compare [options] # Measure token savings (start here!)
459
- llm-docs-builder transform [options] # Transform single markdown file
460
- llm-docs-builder bulk-transform [options] # Transform entire documentation tree
461
- llm-docs-builder generate [options] # Generate llms.txt index
462
- llm-docs-builder parse [options] # Parse existing llms.txt
463
- llm-docs-builder validate [options] # Validate llms.txt format
464
- llm-docs-builder version # Show version
465
- ```
466
-
467
- **Common options:**
468
- ```
469
- -c, --config PATH Configuration file (default: llm-docs-builder.yml)
470
- -d, --docs PATH Documentation directory or file
471
- -o, --output PATH Output file path
472
- -u, --url URL URL for comparison
473
- -f, --file PATH Local file for comparison
474
- -v, --verbose Detailed output
475
- -h, --help Show help
476
- ```
477
-
478
- For advanced options (base_url, title, suffix, excludes, convert_urls), use a config file.
213
+ - 93% average token reduction
214
+ - 20-36x smaller files
215
+ - Automated via GitHub Actions
479
216
 
480
- ## Why This Matters for RAG Systems
481
-
482
- Retrieval-Augmented Generation (RAG) systems fetch documentation to answer questions. Every byte of overhead in those documents:
483
-
484
- 1. **Costs money** - More tokens = higher API costs
485
- 2. **Reduces capacity** - Less room for actual documentation in context window
486
- 3. **Slows responses** - More tokens to process = longer response times
487
- 4. **Degrades quality** - Navigation noise can confuse the model
488
-
489
- llm-docs-builder addresses all four issues by transforming markdown to be AI-friendly and enabling your server to automatically serve it to AI bots while humans get HTML.
490
-
491
- **The JavaScript Problem:**
492
-
493
- Many documentation sites rely on JavaScript for rendering. AI crawlers typically don't execute JavaScript, so they either:
494
- - Get incomplete content
495
- - Get server-side rendered HTML (bloated with framework overhead)
496
- - Fail entirely
497
-
498
- By detecting AI bots and serving them clean markdown instead of HTML, you sidestep this problem entirely.
499
-
500
- ## Configuration Reference
501
-
502
- | Option | Type | Default | Description |
503
- |--------|------|---------|-------------|
504
- | `docs` | String | `./docs` | Documentation directory or file |
505
- | `base_url` | String | - | Base URL for absolute links (e.g., `https://myproject.io`) |
506
- | `title` | String | Auto-detected | Project title |
507
- | `description` | String | Auto-detected | Project description |
508
- | `output` | String | `llms.txt` | Output filename for llms.txt generation |
509
- | `convert_urls` | Boolean | `false` | Convert `.html`/`.htm` to `.md` |
510
- | `remove_comments` | Boolean | `false` | Remove HTML comments (`<!-- ... -->`) |
511
- | `remove_badges` | Boolean | `false` | Remove badge/shield images (CI, version, etc.) |
512
- | `remove_frontmatter` | Boolean | `false` | Remove YAML/TOML frontmatter (Jekyll, Hugo) |
513
- | `normalize_whitespace` | Boolean | `false` | Normalize excessive blank lines and trailing spaces |
514
- | `suffix` | String | `.llm` | Suffix for transformed files (use `""` for in-place) |
515
- | `excludes` | Array | `[]` | Glob patterns to exclude |
516
- | `verbose` | Boolean | `false` | Enable detailed output |
517
-
518
- ## Detailed Docker Usage
519
-
520
- ### Installation and Setup
217
+ ## Docker Usage
521
218
 
522
219
  ```bash
523
- # Pull from Docker Hub
220
+ # Pull image
524
221
  docker pull mensfeld/llm-docs-builder:latest
525
222
 
526
- # Or from GitHub Container Registry
527
- docker pull ghcr.io/mensfeld/llm-docs-builder:latest
528
-
529
- # Create an alias for convenience
530
- alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'
531
- ```
532
-
533
- ### Common Commands
534
-
535
- **Compare (no volume mount needed for remote URLs):**
536
- ```bash
223
+ # Compare (no volume needed for remote URLs)
537
224
  docker run mensfeld/llm-docs-builder compare \
538
- --url https://karafka.io/docs/Getting-Started/
539
-
540
- # With local file
541
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder compare \
542
- --url https://example.com/page.html \
543
- --file docs/page.md
544
- ```
225
+ --url https://yoursite.com/docs
545
226
 
546
- **Generate llms.txt:**
547
- ```bash
548
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
549
- generate --docs ./docs --output llms.txt
550
- ```
551
-
552
- **Transform single file:**
553
- ```bash
554
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
555
- transform --docs README.md --config llm-docs-builder.yml
556
- ```
557
-
558
- **Bulk transform:**
559
- ```bash
227
+ # Transform with volume mount
560
228
  docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
561
229
  bulk-transform --config llm-docs-builder.yml
562
230
  ```
563
231
 
564
- **Parse and validate:**
565
- ```bash
566
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
567
- parse --docs llms.txt --verbose
568
-
569
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
570
- validate --docs llms.txt
571
- ```
572
-
573
- ### CI/CD Examples
574
-
575
- **GitHub Actions:**
576
- ```yaml
577
- jobs:
578
- optimize-docs:
579
- runs-on: ubuntu-latest
580
- steps:
581
- - uses: actions/checkout@v3
582
- - name: Transform documentation
583
- run: |
584
- docker run -v ${{ github.workspace }}:/workspace \
585
- mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
586
- - name: Measure savings
587
- run: |
588
- docker run mensfeld/llm-docs-builder \
589
- compare --url https://yoursite.com/docs/main.html
590
- ```
591
-
592
- **GitLab CI:**
232
+ **CI/CD Example (GitHub Actions):**
593
233
  ```yaml
594
- optimize-docs:
595
- image: mensfeld/llm-docs-builder:latest
596
- script:
597
- - llm-docs-builder bulk-transform --docs ./docs
598
- - llm-docs-builder compare --url https://yoursite.com/docs
599
- ```
600
-
601
- **Jenkins:**
602
- ```groovy
603
- stage('Optimize Documentation') {
604
- steps {
605
- sh '''
606
- docker run -v ${WORKSPACE}:/workspace \
607
- mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
608
- '''
609
- }
610
- }
234
+ - name: Optimize documentation
235
+   run: |
236
+     docker run -v ${{ github.workspace }}:/workspace \
237
+       mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
611
238
  ```
612
239
 
613
- ### Version Pinning
240
+ ## Compression Examples
614
241
 
615
- ```bash
616
- # Use specific version
617
- docker run mensfeld/llm-docs-builder:0.3.0 version
242
+ **Input markdown:**
243
+ ```markdown
244
+ ---
245
+ layout: docs
246
+ ---
618
247
 
619
- # Use major version (gets latest patch)
620
- docker run mensfeld/llm-docs-builder:0 version
248
+ # API Documentation
621
249
 
622
- # Always latest
623
- docker run mensfeld/llm-docs-builder:latest version
624
- ```
250
+ [![Build](badge.svg)](https://ci.com)
625
251
 
626
- ### Platform-Specific Usage
252
+ > Important: This is a note
627
253
 
628
- **Windows PowerShell:**
629
- ```powershell
630
- docker run -v ${PWD}:/workspace mensfeld/llm-docs-builder generate --docs ./docs
631
- ```
254
+ [Click here to see the complete API documentation](./api.md)
632
255
 
633
- **Windows Command Prompt:**
634
- ```cmd
635
- docker run -v %cd%:/workspace mensfeld/llm-docs-builder generate --docs ./docs
256
+ ```ruby
257
+ api = API.new
636
258
  ```
637
259
 
638
- **macOS/Linux:**
639
- ```bash
640
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder generate --docs ./docs
260
+ ![Diagram](./diagram.png)
641
261
  ```
642
262
 
643
- ## About llms.txt Standard
644
-
645
- The [llms.txt specification](https://llmstxt.org/) is a proposed standard for providing LLM-friendly content. It defines a structured format that helps AI systems:
646
-
647
- - Quickly understand project structure
648
- - Find relevant documentation efficiently
649
- - Navigate complex documentation hierarchies
650
- - Access clean, markdown-formatted content
651
-
652
- llm-docs-builder generates llms.txt files automatically by:
653
- 1. Scanning your documentation directory
654
- 2. Extracting titles and descriptions from markdown files
655
- 3. Prioritizing content by importance (README first, then guides, APIs, etc.)
656
- 4. Formatting everything according to the specification
263
+ **After transformation (with default options):**
264
+ ```markdown
265
+ # API Documentation
657
266
 
658
- The llms.txt file serves as an efficient entry point, but the real token savings come from serving optimized markdown for each individual documentation page.
267
+ [complete API documentation](./api.md)
659
268
 
660
- ## How It Works
269
+ ```ruby
270
+ api = API.new
271
+ ```
272
+ ```
661
273
 
662
- **Generation Process:**
663
- 1. Scan directory for `.md` files
664
- 2. Extract title (first H1) and description (first paragraph)
665
- 3. Prioritize by importance (README → Getting Started → Guides → API → Other)
666
- 4. Build formatted llms.txt with links and descriptions
274
+ **Token reduction:** ~40-60% depending on configuration
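The percentage above follows the character-count heuristic the README uses for token estimates (roughly 4 characters per token). A minimal sketch of that arithmetic, assuming the original and transformed files sit side by side (paths are placeholders):

```ruby
# Estimate token counts with the ~4 characters-per-token heuristic
tokens = ->(text) { (text.length / 4.0).round }

original  = File.read('docs/api.md')
optimized = File.read('docs/api.llm.md')

saved = tokens.call(original) - tokens.call(optimized)
pct   = (100.0 * saved / tokens.call(original)).round

puts "Original:  ~#{tokens.call(original)} tokens"
puts "Optimized: ~#{tokens.call(optimized)} tokens"
puts "Saved:     ~#{saved} tokens (#{pct}%)"
```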
667
275
 
668
- **Transformation Process:**
669
- 1. Remove frontmatter (YAML/TOML metadata)
670
- 2. Expand relative links to absolute URLs
671
- 3. Convert `.html` URLs to `.md`
672
- 4. Remove HTML comments
673
- 5. Remove badge/shield images
674
- 6. Normalize excessive whitespace
675
- 7. Write to new file or overwrite in-place
276
+ ## FAQ
676
277
 
677
- **Comparison Process:**
678
- 1. Fetch URL with human User-Agent (or read local file)
679
- 2. Fetch same URL with AI bot User-Agent
680
- 3. Calculate size difference and reduction percentage
681
- 4. Estimate token counts using character-based heuristic
682
- 5. Display human-readable comparison results with byte and token savings
278
+ **Q: Do I need to use llms.txt?**
279
+ No. The compare and transform commands work independently.
683
280
 
684
- **Token Estimation:**
685
- The tool uses a simple but effective heuristic for estimating token counts: **~4 characters per token**. This approximation works well for English documentation and provides reasonable estimates without requiring external tokenizer dependencies. While not as precise as OpenAI's tiktoken, it's accurate enough (±10-15%) for understanding context window savings and making optimization decisions.
281
+ **Q: Will this change how humans see my docs?**
282
+ Not with the default `suffix: .llm`. Separate `.llm.md` files are created and served only to AI bots; your original files stay unchanged for humans.
686
283
 
687
- ## FAQ
284
+ **Q: Can I use this in my build pipeline?**
285
+ Yes. Use `suffix: ""` for in-place transformation.
688
286
 
689
- **Q: Do I need to use llms.txt to benefit from this tool?**
287
+ **Q: How do I know if it's working?**
288
+ Use `llm-docs-builder compare` to measure before and after.
690
289
 
691
- No. The compare and transform commands provide value independently. Many users start with `compare` to measure savings, then use `bulk-transform` to normalize their markdown files, and may never generate an llms.txt file.
290
+ **Q: What about private documentation?**
291
+ Use the `excludes` option to skip sensitive files.
692
292
 
693
- **Q: Will this change how humans see my documentation?**
293
+ ## Advanced Compression Options
694
294
 
695
- Not if you use the default `suffix: .llm` mode. This creates separate `.llm.md` files served only to AI bots. Your original files remain unchanged for human visitors.
295
+ All compression features can be used individually for fine-grained control:
696
296
 
697
- **Q: Can I use this in my build pipeline?**
297
+ ### Content Removal Options
698
298
 
699
- Yes. Use `suffix: ""` for in-place transformation. The Karafka framework does this - they transform their markdown as part of their deployment process.
299
+ - `remove_frontmatter: true` - Remove YAML/TOML metadata blocks
300
+ - `remove_comments: true` - Remove HTML comments (`<!-- ... -->`)
301
+ - `remove_badges: true` - Remove badge/shield images (CI badges, version badges, etc.)
302
+ - `remove_images: true` - Remove all image syntax
303
+ - `remove_code_examples: true` - Remove fenced code blocks, indented code, and inline code
304
+ - `remove_blockquotes: true` - Remove blockquote formatting (preserves content)
305
+ - `remove_duplicates: true` - Remove duplicate paragraphs using fuzzy matching
306
+ - `remove_stopwords: true` - Remove common stopwords from prose (preserves code blocks)
700
307
 
701
- **Q: How do I know if it's working?**
308
+ ### Content Enhancement Options
702
309
 
703
- Use the `compare` command to measure before and after. It shows exact byte counts, reduction percentages, and compression factors.
310
+ - `generate_toc: true` - Generate table of contents from headings with anchor links
311
+ - `custom_instruction: "text"` - Inject AI context message at document top
312
+ - `simplify_links: true` - Simplify verbose link text (e.g., "Click here to see the docs" → "docs")
313
+ - `convert_urls: true` - Convert `.html`/`.htm` URLs to `.md` format
314
+ - `normalize_whitespace: true` - Reduce excessive blank lines and remove trailing whitespace
704
315
 
705
- **Q: Does this work with static site generators?**
316
+ ### Example Usage
706
317
 
707
- Yes. You can transform markdown files before your static site generator processes them, or serve separate `.llm.md` versions alongside your generated HTML.
318
+ ```ruby
319
+ # Fine-grained control
320
+ LlmDocsBuilder.transform_markdown(
321
+ 'README.md',
322
+ remove_frontmatter: true,
323
+ remove_badges: true,
324
+ remove_images: true,
325
+ simplify_links: true,
326
+ generate_toc: true,
327
+ normalize_whitespace: true
328
+ )
329
+ ```
708
330
 
709
- **Q: What about private/internal documentation?**
331
+ Or configure via YAML:
710
332
 
711
- Use the `excludes` option to skip sensitive files:
712
333
  ```yaml
713
- excludes:
714
- - "**/private/**"
715
- - "**/internal/**"
716
- ```
717
-
718
- **Q: Can I customize the AI bot detection?**
334
+ # llm-docs-builder.yml
335
+ docs: ./docs
336
+ base_url: https://myproject.io
337
+ suffix: .llm
719
338
 
720
- Yes. The web server examples show the User-Agent patterns. You can add or remove patterns based on which AI systems you want to support.
339
+ # Pick exactly what you need
340
+ remove_frontmatter: true
341
+ remove_comments: true
342
+ remove_badges: true
343
+ remove_images: true
344
+ simplify_links: true
345
+ generate_toc: true
346
+ normalize_whitespace: true
347
+ ```
721
348
 
722
349
  ## Contributing
723
350