llm-docs-builder 0.3.0 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -5,34 +5,21 @@
5
5
 
6
6
  **Optimize your documentation for LLMs and RAG systems. Reduce token consumption by 67-95%.**
7
7
 
8
- llm-docs-builder normalizes markdown documentation to be AI-friendly and generates llms.txt files. Transform relative links to absolute URLs, measure token savings when serving markdown vs HTML, and create standardized documentation indexes that help LLMs navigate your project.
8
+ llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, and optimizes documents for LLM context windows.
9
9
 
10
10
  ## The Problem
11
11
 
12
12
  When LLMs fetch documentation, they typically get HTML pages designed for humans - complete with navigation bars, footers, JavaScript, CSS, and other overhead. This wastes 70-90% of your context window on content that doesn't help answer questions.
13
13
 
14
14
  **Real example from Karafka documentation:**
15
- - Human HTML version: 82.0 KB
16
- - AI markdown version: 4.1 KB
17
- - **Result: 95% reduction, 20x smaller**
18
-
19
- With GPT-4's pricing at $2.50 per million input tokens, that's real money saved on every API call. More importantly, you can fit 30x more actual documentation into the same context window.
20
-
21
- ## What This Tool Does
22
-
23
- llm-docs-builder helps you optimize markdown documentation for AI consumption:
24
-
25
- 1. **Measure Savings** - Compare what your server sends to humans (HTML) vs AI bots (markdown) to quantify context window reduction
26
- 2. **Transform Markdown** - Normalize your markdown files with absolute links and consistent URL formats for better LLM navigation
27
- 3. **Generate llms.txt** - Create standardized documentation indexes following the [llms.txt](https://llmstxt.org/) specification
28
- 4. **Serve Efficiently** - Configure your server to automatically serve transformed markdown to AI bots while humans get HTML
15
+ - Human HTML version: 104.4 KB (~26,735 tokens)
16
+ - AI markdown version: 21.5 KB (~5,496 tokens)
17
+ - **Result: 79% reduction, 21,239 tokens saved, 5x smaller**
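(Arithmetic check: 26,735 − 5,496 = 21,239 tokens saved; 5,496 ÷ 26,735 ≈ 0.21 of the original, i.e. roughly a 79% reduction and about 4.9× smaller.)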
29
18
 
30
19
  ## Quick Start
31
20
 
32
21
  ### Measure Your Current Token Waste
33
22
 
34
- Before making any changes, see how much you could save:
35
-
36
23
  ```bash
37
24
  # Using Docker (no Ruby installation needed)
38
25
  docker pull mensfeld/llm-docs-builder:latest
@@ -42,254 +29,54 @@ docker run mensfeld/llm-docs-builder compare \
42
29
  --url https://yoursite.com/docs/getting-started.html
43
30
  ```
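Under the hood, `compare` fetches the same URL twice, once with a browser-style User-Agent and once with an AI-style one, and reports the size difference. A rough manual equivalent with plain curl (URL and agent strings are illustrative):

```bash
curl -s -A "Mozilla/5.0" https://yoursite.com/docs/getting-started.html | wc -c
curl -s -A "GPTBot" https://yoursite.com/docs/getting-started.html | wc -c
```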
44
31
 
45
- **Example output:**
46
- ```
47
- ============================================================
48
- Context Window Comparison
49
- ============================================================
50
-
51
- Human version: 45.2 KB
52
- Source: https://yoursite.com/docs/page.html (User-Agent: human)
32
+ ### Transform Your Documentation
53
33
 
54
- AI version: 12.8 KB
55
- Source: https://yoursite.com/docs/page.html (User-Agent: AI)
34
+ ```bash
35
+ # Single file
36
+ llm-docs-builder transform --docs README.md
56
37
 
57
- ------------------------------------------------------------
58
- Reduction: 32.4 KB (72%)
59
- Factor: 3.5x smaller
60
- ============================================================
38
+ # Bulk transform with config
39
+ llm-docs-builder bulk-transform --config llm-docs-builder.yml
61
40
  ```
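As a rough illustration of the link handling (domain and file names are placeholders): with `base_url: https://yoursite.com` and `convert_urls: true`, a relative link is rewritten in place:

```markdown
<!-- before -->
See the [Configuration](./configuration.html) guide.
<!-- after -->
See the [Configuration](https://yoursite.com/configuration.md) guide.
```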
62
41
 
63
- This single command shows you the potential ROI before you invest any time in optimization.
64
-
65
- ### Real-World Results
66
-
67
- **[Karafka Framework Documentation](https://karafka.io/docs)** (10 pages analyzed):
68
-
69
- | Page | Human HTML | AI Markdown | Reduction | Factor |
70
- |------|-----------|-------------|-----------|---------|
71
- | Getting Started | 82.0 KB | 4.1 KB | 95% | 20.1x |
72
- | Configuration | 86.3 KB | 7.1 KB | 92% | 12.1x |
73
- | Routing | 93.6 KB | 14.7 KB | 84% | 6.4x |
74
- | Deployment | 122.1 KB | 33.3 KB | 73% | 3.7x |
75
- | Producing Messages | 87.7 KB | 8.3 KB | 91% | 10.6x |
76
- | Consuming Messages | 105.3 KB | 21.3 KB | 80% | 4.9x |
77
- | Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | 5.1x |
78
- | Active Job | 88.7 KB | 8.8 KB | 90% | 10.1x |
79
- | Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | 3.7x |
80
- | Error Handling | 93.8 KB | 13.1 KB | 86% | 7.2x |
81
-
82
- **Average: 83% reduction, 8.4x smaller files**
83
-
84
- For a typical RAG system making 1,000 documentation queries per day:
85
- - **Before**: ~990 KB per day × 1,000 queries = ~990 MB processed
86
- - **After**: ~165 KB per day × 1,000 queries = ~165 MB processed
87
- - **Savings**: 83% reduction in token costs
88
-
89
- At GPT-4 pricing ($2.50/M input tokens), that's approximately **$2,000-5,000 saved annually** on a documentation site with moderate traffic.
90
-
91
42
  ## Installation
92
43
 
93
- ### Option 1: Docker (Recommended)
94
-
95
- No Ruby installation required. Perfect for CI/CD and quick usage:
44
+ ### Docker (Recommended)
96
45
 
97
46
  ```bash
98
- # Pull the image
99
47
  docker pull mensfeld/llm-docs-builder:latest
100
-
101
- # Create an alias for convenience
102
48
  alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'
103
-
104
- # Use like a native command
105
- llm-docs-builder compare --url https://yoursite.com/docs
106
49
  ```
107
50
 
108
- Multi-architecture support (amd64/arm64), ~50MB image size.
109
-
110
- ### Option 2: RubyGems
111
-
112
- For Ruby developers or when you need the Ruby API:
51
+ ### RubyGems
113
52
 
114
53
  ```bash
115
54
  gem install llm-docs-builder
116
55
  ```
117
56
 
118
- Or add to your Gemfile:
119
-
120
- ```ruby
121
- gem 'llm-docs-builder'
122
- ```
57
+ ## Features
123
58
 
124
- ## Core Features
125
-
126
- ### 1. Compare and Measure (The "Before You Start" Tool)
127
-
128
- Quantify exactly how much context window you're wasting:
59
+ ### Measure and Compare
129
60
 
130
61
  ```bash
131
- # Compare what your server sends to humans vs AI bots
62
+ # Compare what your server sends to humans vs AI
132
63
  llm-docs-builder compare --url https://yoursite.com/docs/page.html
133
64
 
134
- # Compare remote HTML with your local markdown
65
+ # Compare remote HTML with local markdown
135
66
  llm-docs-builder compare \
136
67
  --url https://yoursite.com/docs/api.html \
137
68
  --file docs/api.md
138
-
139
- # Verbose mode for debugging
140
- llm-docs-builder compare --url https://example.com/docs --verbose
141
- ```
142
-
143
- **Why this matters:**
144
- - Validates that optimizations actually work
145
- - Quantifies ROI before you invest time
146
- - Monitors ongoing effectiveness
147
- - Provides concrete metrics for stakeholders
148
-
149
- ### 2. Transform Markdown (The Normalizer)
150
-
151
- Normalize your markdown documentation to be LLM-friendly:
152
-
153
- **Single file transformation:**
154
- ```bash
155
- # Expand relative links to absolute URLs
156
- llm-docs-builder transform \
157
- --docs README.md \
158
- --config llm-docs-builder.yml
159
- ```
160
-
161
- **Bulk transformation - two modes:**
162
-
163
- **a) Separate files (default)** - Creates `.llm.md` versions alongside originals:
164
- ```yaml
165
- # llm-docs-builder.yml
166
- docs: ./docs
167
- base_url: https://myproject.io
168
- suffix: .llm # Creates README.llm.md alongside README.md
169
- convert_urls: true # .html → .md
170
- ```
171
-
172
- ```bash
173
- llm-docs-builder bulk-transform --config llm-docs-builder.yml
174
- ```
175
-
176
- Result:
177
- ```
178
- docs/
179
- ├── README.md ← Original (for humans)
180
- ├── README.llm.md ← Optimized (for AI)
181
- ├── api.md
182
- └── api.llm.md
183
69
  ```
184
70
 
185
- **b) In-place transformation** - Overwrites originals (for build pipelines):
186
- ```yaml
187
- # llm-docs-builder.yml
188
- docs: ./docs
189
- base_url: https://myproject.io
190
- suffix: "" # Transforms in-place
191
- convert_urls: true
192
- excludes:
193
- - "**/private/**"
194
- ```
195
-
196
- ```bash
197
- llm-docs-builder bulk-transform --config llm-docs-builder.yml
198
- ```
199
-
200
- Perfect for CI/CD where you transform docs before deployment.
201
-
202
- **What gets normalized:**
203
- - Relative links → Absolute URLs (`./api.md` → `https://yoursite.com/api.md`)
204
- - HTML URLs → Markdown format (`.html` → `.md`)
205
- - Clean markdown structure preserved
206
- - No content modification, just link normalization
207
-
208
- ### 3. Generate llms.txt (The Standard)
209
-
210
- Create a standardized documentation index following the [llms.txt](https://llmstxt.org/) specification:
211
-
212
- ```yaml
213
- # llm-docs-builder.yml
214
- docs: ./docs
215
- base_url: https://myproject.io
216
- title: My Project
217
- description: A library that does amazing things
218
- output: llms.txt
219
- ```
71
+ ### Generate llms.txt
220
72
 
221
73
  ```bash
74
+ # Create standardized documentation index
222
75
  llm-docs-builder generate --config llm-docs-builder.yml
223
76
  ```
224
77
 
225
- **Generated output:**
226
- ```markdown
227
- # My Project
228
-
229
- > A library that does amazing things
230
-
231
- ## Documentation
232
-
233
- - [README](https://myproject.io/README.md): Complete overview and installation
234
- - [Getting Started](https://myproject.io/getting-started.md): Quick start guide
235
- - [API Reference](https://myproject.io/api-reference.md): Detailed API documentation
236
- ```
237
-
238
- **Smart prioritization:**
239
- 1. README files (always first)
240
- 2. Getting started guides
241
- 3. Tutorials and guides
242
- 4. API references
243
- 5. Other documentation
244
-
245
- The llms.txt file serves as an efficient entry point for AI systems to understand your project structure.
246
-
247
- ### 4. Serve to AI Bots (The Deployment)
248
-
249
- After using `bulk-transform` with `suffix: .llm`, configure your web server to automatically serve optimized versions to AI bots:
250
-
251
- **Apache (.htaccess):**
252
- ```apache
253
- # Detect AI bots
254
- SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt|chatgpt)" IS_LLM_BOT
255
- SetEnvIf User-Agent "(?i)(perplexity|gemini|copilot|bard)" IS_LLM_BOT
256
-
257
- # Serve .llm.md to AI, .md to humans
258
- RewriteEngine On
259
- RewriteCond %{ENV:IS_LLM_BOT} !^$
260
- RewriteCond %{REQUEST_URI} ^/docs/.*\.md$ [NC]
261
- RewriteRule ^(.*)\.md$ $1.llm.md [L]
262
- ```
263
-
264
- **Nginx:**
265
- ```nginx
266
- map $http_user_agent $is_llm_bot {
267
- default 0;
268
- "~*(?i)(openai|anthropic|claude|gpt)" 1;
269
- "~*(?i)(perplexity|gemini|copilot)" 1;
270
- }
271
-
272
- location ~ ^/docs/(.*)\.md$ {
273
- if ($is_llm_bot) {
274
- rewrite ^(.*)\.md$ $1.llm.md last;
275
- }
276
- }
277
- ```
278
-
279
- **Cloudflare Workers:**
280
- ```javascript
281
- const isLLMBot = /openai|anthropic|claude|gpt|perplexity/i.test(userAgent);
282
- if (isLLMBot && url.pathname.startsWith('/docs/')) {
283
- url.pathname = url.pathname.replace(/\.md$/, '.llm.md');
284
- }
285
- ```
286
-
287
- **Result**: AI systems automatically get optimized versions, humans get the original. No manual switching, no duplicate URLs.
288
-
289
78
  ## Configuration
290
79
 
291
- All commands support both config files and CLI flags. Config files are recommended for consistency:
292
-
293
80
  ```yaml
294
81
  # llm-docs-builder.yml
295
82
  docs: ./docs
@@ -297,383 +84,267 @@ base_url: https://myproject.io
297
84
  title: My Project
298
85
  description: Brief description
299
86
  output: llms.txt
300
- convert_urls: true
301
87
  suffix: .llm
302
88
  verbose: false
89
+
90
+ # Basic options
91
+ convert_urls: true
92
+ remove_comments: true
93
+ remove_badges: true
94
+ remove_frontmatter: true
95
+ normalize_whitespace: true
96
+
97
+ # Additional compression options
98
+ remove_code_examples: false
99
+ remove_images: true
100
+ remove_blockquotes: true
101
+ remove_duplicates: true
102
+ remove_stopwords: false
103
+ simplify_links: true
104
+ generate_toc: true
105
+ custom_instruction: "This documentation is optimized for AI consumption"
106
+
107
+ # Exclusions
303
108
  excludes:
304
109
  - "**/private/**"
305
110
  - "**/drafts/**"
306
111
  ```
307
112
 
308
113
  **Configuration precedence:**
309
- 1. CLI flags (highest priority)
310
- 2. Config file values
114
+ 1. CLI flags (highest)
115
+ 2. Config file
311
116
  3. Defaults
312
117
 
313
- **Example of overriding:**
314
- ```bash
315
- # Uses config file but overrides title
316
- llm-docs-builder generate --config llm-docs-builder.yml --title "Override Title"
317
- ```
318
-
319
- ## Docker Usage
320
-
321
- All CLI commands work in Docker with the same syntax:
118
+ ## CLI Commands
322
119
 
323
120
  ```bash
324
- # Basic pattern
325
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder [command] [options]
326
-
327
- # Examples
328
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder generate --docs ./docs
329
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder transform --docs README.md
330
- docker run mensfeld/llm-docs-builder compare --url https://example.com/docs
121
+ llm-docs-builder compare [options] # Measure token savings
122
+ llm-docs-builder transform [options] # Transform single file
123
+ llm-docs-builder bulk-transform [options] # Transform directory
124
+ llm-docs-builder generate [options] # Generate llms.txt
125
+ llm-docs-builder parse [options] # Parse llms.txt
126
+ llm-docs-builder validate [options] # Validate llms.txt
127
+ llm-docs-builder version # Show version
331
128
  ```
332
129
 
333
- **CI/CD Integration:**
334
-
335
- GitHub Actions:
336
- ```yaml
337
- - name: Generate llms.txt
338
- run: |
339
- docker run -v ${{ github.workspace }}:/workspace \
340
- mensfeld/llm-docs-builder generate --config llm-docs-builder.yml
130
+ **Common options:**
341
131
  ```
342
-
343
- GitLab CI:
344
- ```yaml
345
- generate-llms:
346
- image: mensfeld/llm-docs-builder:latest
347
- script:
348
- - llm-docs-builder generate --docs ./docs
132
+ -c, --config PATH Configuration file
133
+ -d, --docs PATH Documentation path
134
+ -o, --output PATH Output file
135
+ -u, --url URL URL for comparison
136
+ -v, --verbose Detailed output
349
137
  ```
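Flags can be combined with a config file; per the precedence rules above, anything passed on the command line wins (the `./guides` path is illustrative):

```bash
# Uses llm-docs-builder.yml, but overrides the docs path and enables verbose output
llm-docs-builder bulk-transform --config llm-docs-builder.yml --docs ./guides --verbose
```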
350
138
 
351
- See [Docker Usage](#detailed-docker-usage) section below for comprehensive examples.
352
-
353
139
  ## Ruby API
354
140
 
355
- For programmatic usage:
356
-
357
141
  ```ruby
358
142
  require 'llm_docs_builder'
359
143
 
360
- # Using config file
361
- content = LlmDocsBuilder.generate_from_docs(config_file: 'llm-docs-builder.yml')
362
-
363
- # Direct options
364
- content = LlmDocsBuilder.generate_from_docs('./docs',
365
- base_url: 'https://myproject.io',
366
- title: 'My Project'
367
- )
368
-
369
- # Transform markdown
370
- transformed = LlmDocsBuilder.transform_markdown('README.md',
144
+ # Transform single file with custom options
145
+ transformed = LlmDocsBuilder.transform_markdown(
146
+ 'README.md',
371
147
  base_url: 'https://myproject.io',
372
- convert_urls: true
148
+ remove_code_examples: true,
149
+ remove_images: true,
150
+ generate_toc: true,
151
+ custom_instruction: 'AI-optimized documentation'
373
152
  )
374
153
 
375
154
  # Bulk transform
376
- files = LlmDocsBuilder.bulk_transform('./docs',
155
+ files = LlmDocsBuilder.bulk_transform(
156
+ './docs',
377
157
  base_url: 'https://myproject.io',
378
158
  suffix: '.llm',
379
- excludes: ['**/private/**']
159
+ remove_duplicates: true,
160
+ generate_toc: true
380
161
  )
381
162
 
382
- # In-place transformation
383
- files = LlmDocsBuilder.bulk_transform('./docs',
384
- suffix: '', # Empty for in-place
385
- base_url: 'https://myproject.io'
163
+ # Generate llms.txt
164
+ content = LlmDocsBuilder.generate_from_docs(
165
+ './docs',
166
+ base_url: 'https://myproject.io',
167
+ title: 'My Project'
386
168
  )
387
169
  ```
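The generated index comes back as a plain string here, so persisting it is a one-line follow-up (the filename matches the `output` default from the Configuration section):

```ruby
File.write('llms.txt', content)
```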
388
170
 
389
- ## Real-World Case Study: Karafka Framework
171
+ ## Serving Optimized Docs to AI Bots
172
+
173
+ After using `bulk-transform` with `suffix: .llm`, configure your web server to serve optimized versions to AI bots:
390
174
 
391
- The [Karafka framework](https://github.com/karafka/karafka) processes millions of Kafka messages daily and maintains extensive documentation. Before llm-docs-builder:
175
+ **Apache (.htaccess):**
176
+ ```apache
177
+ SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt)" IS_LLM_BOT
178
+ RewriteEngine On
+ RewriteCond %{ENV:IS_LLM_BOT} !^$
179
+ RewriteRule ^(.*)\.md$ $1.llm.md [L]
180
+ ```
181
+
182
+ **Nginx:**
183
+ ```nginx
184
+ map $http_user_agent $is_llm_bot {
185
+   default 0;
186
+   "~*(?i)(openai|anthropic|claude|gpt)" 1;
187
+ }
188
+
189
+ location ~ ^/docs/(.*)\.md$ {
190
+   if ($is_llm_bot) {
191
+     rewrite ^(.*)\.md$ $1.llm.md last;
192
+   }
193
+ }
194
+ ```
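To spot-check the rewrite, request the same path with and without a bot-like User-Agent and confirm the responses differ (hostname is a placeholder; any agent matching the patterns above will do):

```bash
curl -s -A "anthropic" https://yoursite.com/docs/getting-started.md | head -n 3
curl -s https://yoursite.com/docs/getting-started.md | head -n 3
```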
392
195
 
393
- - **140+ lines of custom Ruby code** for link expansion and URL normalization
394
- - Manual maintenance of transformation logic
395
- - No way to measure optimization effectiveness
196
+ ## Real-World Results: Karafka Framework
396
197
 
397
- **After implementing llm-docs-builder:**
198
+ **Before:** 140+ lines of custom transformation code
398
199
 
200
+ **After:** a few lines of configuration
399
201
  ```yaml
400
- # llm-docs-builder.yml
401
202
  docs: ./online/docs
402
203
  base_url: https://karafka.io/docs
403
204
  convert_urls: true
404
- suffix: "" # In-place transformation for build pipeline
405
- excludes:
406
- - "**/Enterprise-License-Setup/**"
407
- ```
408
-
409
- ```bash
410
- # In their deployment script
411
- llm-docs-builder bulk-transform --config llm-docs-builder.yml
205
+ remove_comments: true
206
+ remove_badges: true
207
+ remove_frontmatter: true
208
+ normalize_whitespace: true
209
+ suffix: "" # In-place for build pipeline
412
210
  ```
413
211
 
414
212
  **Results:**
415
- - **140 lines of code → 6 lines of config**
416
- - **93% average token reduction** across all documentation
417
- - **Quantifiable savings** via the compare command
418
- - **Automated daily deployments** via GitHub Actions
419
-
420
- The compare command revealed that their documentation was consuming 20-36x more tokens than necessary for AI systems. After optimization, RAG queries became dramatically more efficient.
421
-
422
- ## CLI Reference
213
+ - 93% average token reduction
214
+ - 20-36x smaller files
215
+ - Automated via GitHub Actions
423
216
 
424
- ```bash
425
- llm-docs-builder compare [options] # Measure token savings (start here!)
426
- llm-docs-builder transform [options] # Transform single markdown file
427
- llm-docs-builder bulk-transform [options] # Transform entire documentation tree
428
- llm-docs-builder generate [options] # Generate llms.txt index
429
- llm-docs-builder parse [options] # Parse existing llms.txt
430
- llm-docs-builder validate [options] # Validate llms.txt format
431
- llm-docs-builder version # Show version
432
- ```
433
-
434
- **Common options:**
435
- ```
436
- -c, --config PATH Configuration file (default: llm-docs-builder.yml)
437
- -d, --docs PATH Documentation directory or file
438
- -o, --output PATH Output file path
439
- -u, --url URL URL for comparison
440
- -f, --file PATH Local file for comparison
441
- -v, --verbose Detailed output
442
- -h, --help Show help
443
- ```
444
-
445
- For advanced options (base_url, title, suffix, excludes, convert_urls), use a config file.
446
-
447
- ## Why This Matters for RAG Systems
448
-
449
- Retrieval-Augmented Generation (RAG) systems fetch documentation to answer questions. Every byte of overhead in those documents:
450
-
451
- 1. **Costs money** - More tokens = higher API costs
452
- 2. **Reduces capacity** - Less room for actual documentation in context window
453
- 3. **Slows responses** - More tokens to process = longer response times
454
- 4. **Degrades quality** - Navigation noise can confuse the model
455
-
456
- llm-docs-builder addresses all four issues by transforming markdown to be AI-friendly and enabling your server to automatically serve it to AI bots while humans get HTML.
457
-
458
- **The JavaScript Problem:**
459
-
460
- Many documentation sites rely on JavaScript for rendering. AI crawlers typically don't execute JavaScript, so they either:
461
- - Get incomplete content
462
- - Get server-side rendered HTML (bloated with framework overhead)
463
- - Fail entirely
464
-
465
- By detecting AI bots and serving them clean markdown instead of HTML, you sidestep this problem entirely.
466
-
467
- ## Configuration Reference
468
-
469
- | Option | Type | Default | Description |
470
- |--------|------|---------|-------------|
471
- | `docs` | String | `./docs` | Documentation directory or file |
472
- | `base_url` | String | - | Base URL for absolute links (e.g., `https://myproject.io`) |
473
- | `title` | String | Auto-detected | Project title |
474
- | `description` | String | Auto-detected | Project description |
475
- | `output` | String | `llms.txt` | Output filename for llms.txt generation |
476
- | `convert_urls` | Boolean | `false` | Convert `.html`/`.htm` to `.md` |
477
- | `suffix` | String | `.llm` | Suffix for transformed files (use `""` for in-place) |
478
- | `excludes` | Array | `[]` | Glob patterns to exclude |
479
- | `verbose` | Boolean | `false` | Enable detailed output |
480
-
481
- ## Detailed Docker Usage
482
-
483
- ### Installation and Setup
217
+ ## Docker Usage
484
218
 
485
219
  ```bash
486
- # Pull from Docker Hub
220
+ # Pull image
487
221
  docker pull mensfeld/llm-docs-builder:latest
488
222
 
489
- # Or from GitHub Container Registry
490
- docker pull ghcr.io/mensfeld/llm-docs-builder:latest
491
-
492
- # Create an alias for convenience
493
- alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'
494
- ```
495
-
496
- ### Common Commands
497
-
498
- **Compare (no volume mount needed for remote URLs):**
499
- ```bash
223
+ # Compare (no volume needed for remote URLs)
500
224
  docker run mensfeld/llm-docs-builder compare \
501
- --url https://karafka.io/docs/Getting-Started/
502
-
503
- # With local file
504
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder compare \
505
- --url https://example.com/page.html \
506
- --file docs/page.md
507
- ```
508
-
509
- **Generate llms.txt:**
510
- ```bash
511
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
512
- generate --docs ./docs --output llms.txt
513
- ```
514
-
515
- **Transform single file:**
516
- ```bash
517
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
518
- transform --docs README.md --config llm-docs-builder.yml
519
- ```
225
+ --url https://yoursite.com/docs
520
226
 
521
- **Bulk transform:**
522
- ```bash
227
+ # Transform with volume mount
523
228
  docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
524
229
  bulk-transform --config llm-docs-builder.yml
525
230
  ```
526
231
 
527
- **Parse and validate:**
528
- ```bash
529
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
530
- parse --docs llms.txt --verbose
531
-
532
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
533
- validate --docs llms.txt
534
- ```
535
-
536
- ### CI/CD Examples
537
-
538
- **GitHub Actions:**
232
+ **CI/CD Example (GitHub Actions):**
539
233
  ```yaml
540
- jobs:
541
- optimize-docs:
542
- runs-on: ubuntu-latest
543
- steps:
544
- - uses: actions/checkout@v3
545
- - name: Transform documentation
546
- run: |
547
- docker run -v ${{ github.workspace }}:/workspace \
548
- mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
549
- - name: Measure savings
550
- run: |
551
- docker run mensfeld/llm-docs-builder \
552
- compare --url https://yoursite.com/docs/main.html
553
- ```
554
-
555
- **GitLab CI:**
556
- ```yaml
557
- optimize-docs:
558
- image: mensfeld/llm-docs-builder:latest
559
- script:
560
- - llm-docs-builder bulk-transform --docs ./docs
561
- - llm-docs-builder compare --url https://yoursite.com/docs
562
- ```
563
-
564
- **Jenkins:**
565
- ```groovy
566
- stage('Optimize Documentation') {
567
- steps {
568
- sh '''
569
- docker run -v ${WORKSPACE}:/workspace \
570
- mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
571
- '''
572
- }
573
- }
234
+ - name: Optimize documentation
235
+   run: |
236
+     docker run -v ${{ github.workspace }}:/workspace \
237
+       mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
574
238
  ```
575
239
 
576
- ### Version Pinning
240
+ ## Compression Examples
577
241
 
578
- ```bash
579
- # Use specific version
580
- docker run mensfeld/llm-docs-builder:0.3.0 version
242
+ **Input markdown:**
243
+ ```markdown
244
+ ---
245
+ layout: docs
246
+ ---
581
247
 
582
- # Use major version (gets latest patch)
583
- docker run mensfeld/llm-docs-builder:0 version
248
+ # API Documentation
584
249
 
585
- # Always latest
586
- docker run mensfeld/llm-docs-builder:latest version
587
- ```
250
+ [![Build](badge.svg)](https://ci.com)
588
251
 
589
- ### Platform-Specific Usage
252
+ > Important: This is a note
590
253
 
591
- **Windows PowerShell:**
592
- ```powershell
593
- docker run -v ${PWD}:/workspace mensfeld/llm-docs-builder generate --docs ./docs
594
- ```
254
+ [Click here to see the complete API documentation](./api.md)
595
255
 
596
- **Windows Command Prompt:**
597
- ```cmd
598
- docker run -v %cd%:/workspace mensfeld/llm-docs-builder generate --docs ./docs
256
+ ```ruby
257
+ api = API.new
599
258
  ```
600
259
 
601
- **macOS/Linux:**
602
- ```bash
603
- docker run -v $(pwd):/workspace mensfeld/llm-docs-builder generate --docs ./docs
260
+ ![Diagram](./diagram.png)
604
261
  ```
605
262
 
606
- ## About llms.txt Standard
607
-
608
- The [llms.txt specification](https://llmstxt.org/) is a proposed standard for providing LLM-friendly content. It defines a structured format that helps AI systems:
609
-
610
- - Quickly understand project structure
611
- - Find relevant documentation efficiently
612
- - Navigate complex documentation hierarchies
613
- - Access clean, markdown-formatted content
263
+ **After transformation (with default options):**
264
+ ```markdown
265
+ # API Documentation
614
266
 
615
- llm-docs-builder generates llms.txt files automatically by:
616
- 1. Scanning your documentation directory
617
- 2. Extracting titles and descriptions from markdown files
618
- 3. Prioritizing content by importance (README first, then guides, APIs, etc.)
619
- 4. Formatting everything according to the specification
267
+ [complete API documentation](./api.md)
620
268
 
621
- The llms.txt file serves as an efficient entry point, but the real token savings come from serving optimized markdown for each individual documentation page.
269
+ ```ruby
270
+ api = API.new
271
+ ```
272
+ ```
622
273
 
623
- ## How It Works
274
+ **Token reduction:** ~40-60% depending on configuration
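A configuration producing roughly this result would look like the sketch below, built from the option names listed under Configuration, with code blocks deliberately kept:

```yaml
remove_frontmatter: true
remove_badges: true
remove_blockquotes: true
remove_images: true
simplify_links: true
normalize_whitespace: true
remove_code_examples: false  # keep the fenced Ruby snippet
```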
624
275
 
625
- **Generation Process:**
626
- 1. Scan directory for `.md` files
627
- 2. Extract title (first H1) and description (first paragraph)
628
- 3. Prioritize by importance (README → Getting Started → Guides → API → Other)
629
- 4. Build formatted llms.txt with links and descriptions
276
+ ## FAQ
630
277
 
631
- **Transformation Process:**
632
- 1. Expand relative links to absolute URLs
633
- 2. Optionally convert `.html` to `.md`
634
- 3. Preserve all content unchanged
635
- 4. Write to new file or overwrite in-place
278
+ **Q: Do I need to use llms.txt?**
279
+ No. The compare and transform commands work independently.
636
280
 
637
- **Comparison Process:**
638
- 1. Fetch URL with human User-Agent (or read local file)
639
- 2. Fetch same URL with AI bot User-Agent
640
- 3. Calculate size difference and reduction percentage
641
- 4. Display human-readable comparison results
281
+ **Q: Will this change how humans see my docs?**
282
+ Not with default `suffix: .llm`. Separate files are served only to AI bots.
642
283
 
643
- ## FAQ
284
+ **Q: Can I use this in my build pipeline?**
285
+ Yes. Use `suffix: ""` for in-place transformation.
644
286
 
645
- **Q: Do I need to use llms.txt to benefit from this tool?**
287
+ **Q: How do I know if it's working?**
288
+ Use `llm-docs-builder compare` to measure before and after.
646
289
 
647
- No. The compare and transform commands provide value independently. Many users start with `compare` to measure savings, then use `bulk-transform` to normalize their markdown files, and may never generate an llms.txt file.
290
+ **Q: What about private documentation?**
291
+ Use the `excludes` option to skip sensitive files.
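For example (patterns are illustrative):

```yaml
excludes:
  - "**/private/**"
  - "**/internal/**"
```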
648
292
 
649
- **Q: Will this change how humans see my documentation?**
293
+ ## Advanced Compression Options
650
294
 
651
- Not if you use the default `suffix: .llm` mode. This creates separate `.llm.md` files served only to AI bots. Your original files remain unchanged for human visitors.
295
+ All compression features can be used individually for fine-grained control:
652
296
 
653
- **Q: Can I use this in my build pipeline?**
297
+ ### Content Removal Options
654
298
 
655
- Yes. Use `suffix: ""` for in-place transformation. The Karafka framework does this - they transform their markdown as part of their deployment process.
299
+ - `remove_frontmatter: true` - Remove YAML/TOML metadata blocks
300
+ - `remove_comments: true` - Remove HTML comments (`<!-- ... -->`)
301
+ - `remove_badges: true` - Remove badge/shield images (CI badges, version badges, etc.)
302
+ - `remove_images: true` - Remove all image syntax
303
+ - `remove_code_examples: true` - Remove fenced code blocks, indented code, and inline code
304
+ - `remove_blockquotes: true` - Remove blockquote formatting (preserves content)
305
+ - `remove_duplicates: true` - Remove duplicate paragraphs using fuzzy matching
306
+ - `remove_stopwords: true` - Remove common stopwords from prose (preserves code blocks)
656
307
 
657
- **Q: How do I know if it's working?**
308
+ ### Content Enhancement Options
658
309
 
659
- Use the `compare` command to measure before and after. It shows exact byte counts, reduction percentages, and compression factors.
310
+ - `generate_toc: true` - Generate table of contents from headings with anchor links
311
+ - `custom_instruction: "text"` - Inject AI context message at document top
312
+ - `simplify_links: true` - Simplify verbose link text (e.g., "Click here to see the docs" → "docs")
313
+ - `convert_urls: true` - Convert `.html`/`.htm` URLs to `.md` format
314
+ - `normalize_whitespace: true` - Reduce excessive blank lines and remove trailing whitespace
660
315
 
661
- **Q: Does this work with static site generators?**
316
+ ### Example Usage
662
317
 
663
- Yes. You can transform markdown files before your static site generator processes them, or serve separate `.llm.md` versions alongside your generated HTML.
318
+ ```ruby
319
+ # Fine-grained control
320
+ LlmDocsBuilder.transform_markdown(
321
+ 'README.md',
322
+ remove_frontmatter: true,
323
+ remove_badges: true,
324
+ remove_images: true,
325
+ simplify_links: true,
326
+ generate_toc: true,
327
+ normalize_whitespace: true
328
+ )
329
+ ```
664
330
 
665
- **Q: What about private/internal documentation?**
331
+ Or configure via YAML:
666
332
 
667
- Use the `excludes` option to skip sensitive files:
668
333
  ```yaml
669
- excludes:
670
- - "**/private/**"
671
- - "**/internal/**"
672
- ```
673
-
674
- **Q: Can I customize the AI bot detection?**
334
+ # llm-docs-builder.yml
335
+ docs: ./docs
336
+ base_url: https://myproject.io
337
+ suffix: .llm
675
338
 
676
- Yes. The web server examples show the User-Agent patterns. You can add or remove patterns based on which AI systems you want to support.
339
+ # Pick exactly what you need
340
+ remove_frontmatter: true
341
+ remove_comments: true
342
+ remove_badges: true
343
+ remove_images: true
344
+ simplify_links: true
345
+ generate_toc: true
346
+ normalize_whitespace: true
347
+ ```
677
348
 
678
349
  ## Contributing
679
350