llm-docs-builder 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md ADDED
@@ -0,0 +1,684 @@
1
+ # llm-docs-builder
2
+
3
+ [![CI](https://github.com/mensfeld/llm-docs-builder/actions/workflows/ci.yml/badge.svg)](
4
+ https://github.com/mensfeld/llm-docs-builder/actions/workflows/ci.yml)
5
+
6
+ **Optimize your documentation for LLMs and RAG systems. Reduce token consumption by 67-95%.**
7
+
8
+ llm-docs-builder normalizes markdown documentation to be AI-friendly and generates llms.txt files. Transform relative links to absolute URLs, measure token savings when serving markdown vs HTML, and create standardized documentation indexes that help LLMs navigate your project.
9
+
10
+ ## The Problem
11
+
12
+ When LLMs fetch documentation, they typically get HTML pages designed for humans - complete with navigation bars, footers, JavaScript, CSS, and other overhead. This wastes 70-90% of your context window on content that doesn't help answer questions.
13
+
14
+ **Real example from Karafka documentation:**
15
+ - Human HTML version: 82.0 KB
16
+ - AI markdown version: 4.1 KB
17
+ - **Result: 95% reduction, 20x smaller**
18
+
19
+ With GPT-4's pricing at $2.50 per million input tokens, that's real money saved on every API call. More importantly, you can fit 30x more actual documentation into the same context window.
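+
+ As a rough sanity check (assuming the common heuristic of about 4 bytes per token rather than an exact tokenizer count), the 82.0 KB vs 4.1 KB example above works out like this in plain Ruby:
+
+ ```ruby
+ # Back-of-the-envelope estimate for the 82.0 KB HTML vs 4.1 KB markdown example.
+ # Assumes ~4 bytes per token and $2.50 per million input tokens; real token
+ # counts depend on the model's tokenizer.
+ PRICE_PER_MILLION_TOKENS = 2.50
+
+ def estimate(bytes)
+   tokens = bytes / 4.0
+   { tokens: tokens.round, cost: (tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS).round(4) }
+ end
+
+ html     = estimate(82_000) # human HTML version
+ markdown = estimate(4_100)  # AI markdown version
+
+ puts "HTML:     ~#{html[:tokens]} tokens, ~$#{html[:cost]} per fetch"
+ puts "Markdown: ~#{markdown[:tokens]} tokens, ~$#{markdown[:cost]} per fetch"
+ ```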
20
+
21
+ ## What This Tool Does
22
+
23
+ llm-docs-builder helps you optimize markdown documentation for AI consumption:
24
+
25
+ 1. **Measure Savings** - Compare what your server sends to humans (HTML) vs AI bots (markdown) to quantify context window reduction
26
+ 2. **Transform Markdown** - Normalize your markdown files with absolute links and consistent URL formats for better LLM navigation
27
+ 3. **Generate llms.txt** - Create standardized documentation indexes following the [llms.txt](https://llmstxt.org/) specification
28
+ 4. **Serve Efficiently** - Configure your server to automatically serve transformed markdown to AI bots while humans get HTML
29
+
30
+ ## Quick Start
31
+
32
+ ### Measure Your Current Token Waste
33
+
34
+ Before making any changes, see how much you could save:
35
+
36
+ ```bash
37
+ # Using Docker (no Ruby installation needed)
38
+ docker pull mensfeld/llm-docs-builder:latest
39
+
40
+ # Compare your documentation page
41
+ docker run mensfeld/llm-docs-builder compare \
42
+ --url https://yoursite.com/docs/getting-started.html
43
+ ```
44
+
45
+ **Example output:**
46
+ ```
47
+ ============================================================
48
+ Context Window Comparison
49
+ ============================================================
50
+
51
+ Human version: 45.2 KB
52
+ Source: https://yoursite.com/docs/page.html (User-Agent: human)
53
+
54
+ AI version: 12.8 KB
55
+ Source: https://yoursite.com/docs/page.html (User-Agent: AI)
56
+
57
+ ------------------------------------------------------------
58
+ Reduction: 32.4 KB (72%)
59
+ Factor: 3.5x smaller
60
+ ============================================================
61
+ ```
62
+
63
+ This single command shows you the potential ROI before you invest any time in optimization.
64
+
65
+ ### Real-World Results
66
+
67
+ **[Karafka Framework Documentation](https://karafka.io/docs)** (10 pages analyzed):
68
+
69
+ | Page | Human HTML | AI Markdown | Reduction | Factor |
70
+ |------|-----------|-------------|-----------|---------|
71
+ | Getting Started | 82.0 KB | 4.1 KB | 95% | 20.1x |
72
+ | Configuration | 86.3 KB | 7.1 KB | 92% | 12.1x |
73
+ | Routing | 93.6 KB | 14.7 KB | 84% | 6.4x |
74
+ | Deployment | 122.1 KB | 33.3 KB | 73% | 3.7x |
75
+ | Producing Messages | 87.7 KB | 8.3 KB | 91% | 10.6x |
76
+ | Consuming Messages | 105.3 KB | 21.3 KB | 80% | 4.9x |
77
+ | Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | 5.1x |
78
+ | Active Job | 88.7 KB | 8.8 KB | 90% | 10.1x |
79
+ | Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | 3.7x |
80
+ | Error Handling | 93.8 KB | 13.1 KB | 86% | 7.2x |
81
+
82
+ **Average: 83% reduction, 8.4x smaller files**
83
+
84
+ For a typical RAG system making 1,000 documentation queries per day:
85
+ - **Before**: ~990 KB fetched per query × 1,000 queries = ~990 MB processed per day
86
+ - **After**: ~165 KB fetched per query × 1,000 queries = ~165 MB processed per day
87
+ - **Savings**: 83% reduction in token costs
88
+
89
+ At GPT-4 pricing ($2.50/M input tokens), that's approximately **$2,000-5,000 saved annually** on a documentation site with moderate traffic.
90
+
91
+ ## Installation
92
+
93
+ ### Option 1: Docker (Recommended)
94
+
95
+ No Ruby installation required. Perfect for CI/CD and quick usage:
96
+
97
+ ```bash
98
+ # Pull the image
99
+ docker pull mensfeld/llm-docs-builder:latest
100
+
101
+ # Create an alias for convenience
102
+ alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'
103
+
104
+ # Use like a native command
105
+ llm-docs-builder compare --url https://yoursite.com/docs
106
+ ```
107
+
108
+ Multi-architecture support (amd64/arm64), ~50MB image size.
109
+
110
+ ### Option 2: RubyGems
111
+
112
+ For Ruby developers or when you need the Ruby API:
113
+
114
+ ```bash
115
+ gem install llm-docs-builder
116
+ ```
117
+
118
+ Or add to your Gemfile:
119
+
120
+ ```ruby
121
+ gem 'llm-docs-builder'
122
+ ```
123
+
124
+ ## Core Features
125
+
126
+ ### 1. Compare and Measure (The "Before You Start" Tool)
127
+
128
+ Quantify exactly how much context window you're wasting:
129
+
130
+ ```bash
131
+ # Compare what your server sends to humans vs AI bots
132
+ llm-docs-builder compare --url https://yoursite.com/docs/page.html
133
+
134
+ # Compare remote HTML with your local markdown
135
+ llm-docs-builder compare \
136
+ --url https://yoursite.com/docs/api.html \
137
+ --file docs/api.md
138
+
139
+ # Verbose mode for debugging
140
+ llm-docs-builder compare --url https://example.com/docs --verbose
141
+ ```
142
+
143
+ **Why this matters:**
144
+ - Validates that optimizations actually work
145
+ - Quantifies ROI before you invest time
146
+ - Monitors ongoing effectiveness
147
+ - Provides concrete metrics for stakeholders
148
+
149
+ ### 2. Transform Markdown (The Normalizer)
150
+
151
+ Normalize your markdown documentation to be LLM-friendly:
152
+
153
+ **Single file transformation:**
154
+ ```bash
155
+ # Expand relative links to absolute URLs
156
+ llm-docs-builder transform \
157
+ --docs README.md \
158
+ --config llm-docs-builder.yml
159
+ ```
160
+
161
+ **Bulk transformation - two modes:**
162
+
163
+ **a) Separate files (default)** - Creates `.llm.md` versions alongside originals:
164
+ ```yaml
165
+ # llm-docs-builder.yml
166
+ docs: ./docs
167
+ base_url: https://myproject.io
168
+ suffix: .llm # Creates README.llm.md alongside README.md
169
+ convert_urls: true # .html → .md
170
+ ```
171
+
172
+ ```bash
173
+ llm-docs-builder bulk-transform --config llm-docs-builder.yml
174
+ ```
175
+
176
+ Result:
177
+ ```
178
+ docs/
179
+ ├── README.md ← Original (for humans)
180
+ ├── README.llm.md ← Optimized (for AI)
181
+ ├── api.md
182
+ └── api.llm.md
183
+ ```
184
+
185
+ **b) In-place transformation** - Overwrites originals (for build pipelines):
186
+ ```yaml
187
+ # llm-docs-builder.yml
188
+ docs: ./docs
189
+ base_url: https://myproject.io
190
+ suffix: "" # Transforms in-place
191
+ convert_urls: true
192
+ excludes:
193
+ - "**/private/**"
194
+ ```
195
+
196
+ ```bash
197
+ llm-docs-builder bulk-transform --config llm-docs-builder.yml
198
+ ```
199
+
200
+ Perfect for CI/CD where you transform docs before deployment.
201
+
202
+ **What gets normalized:**
203
+ - Relative links → Absolute URLs (`./api.md` → `https://yoursite.com/api.md`)
204
+ - HTML URLs → Markdown format (`.html` → `.md`)
205
+ - Clean markdown structure preserved
206
+ - No content modification, just link normalization
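+
+ The same normalization is also available programmatically; a minimal sketch using the Ruby API documented later in this README:
+
+ ```ruby
+ require 'llm_docs_builder'
+
+ # Expand relative links against base_url and rewrite .html links to .md.
+ # transform_markdown returns the normalized markdown as a string
+ # (see the Ruby API section below).
+ transformed = LlmDocsBuilder.transform_markdown(
+   'README.md',
+   base_url: 'https://yoursite.com',
+   convert_urls: true
+ )
+
+ puts transformed
+ ```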
207
+
208
+ ### 3. Generate llms.txt (The Standard)
209
+
210
+ Create a standardized documentation index following the [llms.txt](https://llmstxt.org/) specification:
211
+
212
+ ```yaml
213
+ # llm-docs-builder.yml
214
+ docs: ./docs
215
+ base_url: https://myproject.io
216
+ title: My Project
217
+ description: A library that does amazing things
218
+ output: llms.txt
219
+ ```
220
+
221
+ ```bash
222
+ llm-docs-builder generate --config llm-docs-builder.yml
223
+ ```
224
+
225
+ **Generated output:**
226
+ ```markdown
227
+ # My Project
228
+
229
+ > A library that does amazing things
230
+
231
+ ## Documentation
232
+
233
+ - [README](https://myproject.io/README.md): Complete overview and installation
234
+ - [Getting Started](https://myproject.io/getting-started.md): Quick start guide
235
+ - [API Reference](https://myproject.io/api-reference.md): Detailed API documentation
236
+ ```
237
+
238
+ **Smart prioritization:**
239
+ 1. README files (always first)
240
+ 2. Getting started guides
241
+ 3. Tutorials and guides
242
+ 4. API references
243
+ 5. Other documentation
244
+
245
+ The llms.txt file serves as an efficient entry point for AI systems to understand your project structure.
246
+
247
+ ### 4. Serve to AI Bots (The Deployment)
248
+
249
+ After using `bulk-transform` with `suffix: .llm`, configure your web server to automatically serve optimized versions to AI bots:
250
+
251
+ **Apache (.htaccess):**
252
+ ```apache
253
+ # Detect AI bots
254
+ SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt|chatgpt)" IS_LLM_BOT
255
+ SetEnvIf User-Agent "(?i)(perplexity|gemini|copilot|bard)" IS_LLM_BOT
256
+
257
+ # Serve .llm.md to AI, .md to humans
258
+ RewriteEngine On
259
+ RewriteCond %{ENV:IS_LLM_BOT} !^$
260
+ RewriteCond %{REQUEST_URI} ^/docs/.*\.md$ [NC]
261
+ RewriteRule ^(.*)\.md$ $1.llm.md [L]
262
+ ```
263
+
264
+ **Nginx:**
265
+ ```nginx
266
+ map $http_user_agent $is_llm_bot {
267
+ default 0;
268
+ "~*(?i)(openai|anthropic|claude|gpt)" 1;
269
+ "~*(?i)(perplexity|gemini|copilot)" 1;
270
+ }
271
+
272
+ location ~ ^/docs/(.*)\.md$ {
273
+ if ($is_llm_bot) {
274
+ rewrite ^(.*)\.md$ $1.llm.md last;
275
+ }
276
+ }
277
+ ```
278
+
279
+ **Cloudflare Workers:**
280
+ ```javascript
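+ // Snippet from inside a Worker's fetch handler; assumes
+ // userAgent = request.headers.get('User-Agent') || '' and url = new URL(request.url),
+ // with the rewritten URL fetched afterwards, e.g. fetch(url.toString(), request).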
281
+ const isLLMBot = /openai|anthropic|claude|gpt|perplexity/i.test(userAgent);
282
+ if (isLLMBot && url.pathname.startsWith('/docs/')) {
283
+ url.pathname = url.pathname.replace(/\.md$/, '.llm.md');
284
+ }
285
+ ```
286
+
287
+ **Result**: AI systems automatically get optimized versions, humans get the original. No manual switching, no duplicate URLs.
288
+
289
+ ## Configuration
290
+
291
+ All commands support both config files and CLI flags. Config files are recommended for consistency:
292
+
293
+ ```yaml
294
+ # llm-docs-builder.yml
295
+ docs: ./docs
296
+ base_url: https://myproject.io
297
+ title: My Project
298
+ description: Brief description
299
+ output: llms.txt
300
+ convert_urls: true
301
+ suffix: .llm
302
+ verbose: false
303
+ excludes:
304
+ - "**/private/**"
305
+ - "**/drafts/**"
306
+ ```
307
+
308
+ **Configuration precedence:**
309
+ 1. CLI flags (highest priority)
310
+ 2. Config file values
311
+ 3. Defaults
312
+
313
+ **Example of overriding:**
314
+ ```bash
315
+ # Uses config file but overrides title
316
+ llm-docs-builder generate --config llm-docs-builder.yml --title "Override Title"
317
+ ```
318
+
319
+ ## Docker Usage
320
+
321
+ All CLI commands work in Docker with the same syntax:
322
+
323
+ ```bash
324
+ # Basic pattern
325
+ docker run -v $(pwd):/workspace mensfeld/llm-docs-builder [command] [options]
326
+
327
+ # Examples
328
+ docker run -v $(pwd):/workspace mensfeld/llm-docs-builder generate --docs ./docs
329
+ docker run -v $(pwd):/workspace mensfeld/llm-docs-builder transform --docs README.md
330
+ docker run mensfeld/llm-docs-builder compare --url https://example.com/docs
331
+ ```
332
+
333
+ **CI/CD Integration:**
334
+
335
+ GitHub Actions:
336
+ ```yaml
337
+ - name: Generate llms.txt
338
+ run: |
339
+ docker run -v ${{ github.workspace }}:/workspace \
340
+ mensfeld/llm-docs-builder generate --config llm-docs-builder.yml
341
+ ```
342
+
343
+ GitLab CI:
344
+ ```yaml
345
+ generate-llms:
346
+ image: mensfeld/llm-docs-builder:latest
347
+ script:
348
+ - llm-docs-builder generate --docs ./docs
349
+ ```
350
+
351
+ See the [Detailed Docker Usage](#detailed-docker-usage) section below for comprehensive examples.
352
+
353
+ ## Ruby API
354
+
355
+ For programmatic usage:
356
+
357
+ ```ruby
358
+ require 'llm_docs_builder'
359
+
360
+ # Using config file
361
+ content = LlmDocsBuilder.generate_from_docs(config_file: 'llm-docs-builder.yml')
362
+
363
+ # Direct options
364
+ content = LlmDocsBuilder.generate_from_docs('./docs',
365
+ base_url: 'https://myproject.io',
366
+ title: 'My Project'
367
+ )
368
+
369
+ # Transform markdown
370
+ transformed = LlmDocsBuilder.transform_markdown('README.md',
371
+ base_url: 'https://myproject.io',
372
+ convert_urls: true
373
+ )
374
+
375
+ # Bulk transform
376
+ files = LlmDocsBuilder.bulk_transform('./docs',
377
+ base_url: 'https://myproject.io',
378
+ suffix: '.llm',
379
+ excludes: ['**/private/**']
380
+ )
381
+
382
+ # In-place transformation
383
+ files = LlmDocsBuilder.bulk_transform('./docs',
384
+ suffix: '', # Empty for in-place
385
+ base_url: 'https://myproject.io'
386
+ )
387
+ ```
388
+
389
+ ## Real-World Case Study: Karafka Framework
390
+
391
+ The [Karafka framework](https://github.com/karafka/karafka) processes millions of Kafka messages daily and maintains extensive documentation. Before llm-docs-builder:
392
+
393
+ - **140+ lines of custom Ruby code** for link expansion and URL normalization
394
+ - Manual maintenance of transformation logic
395
+ - No way to measure optimization effectiveness
396
+
397
+ **After implementing llm-docs-builder:**
398
+
399
+ ```yaml
400
+ # llm-docs-builder.yml
401
+ docs: ./online/docs
402
+ base_url: https://karafka.io/docs
403
+ convert_urls: true
404
+ suffix: "" # In-place transformation for build pipeline
405
+ excludes:
406
+ - "**/Enterprise-License-Setup/**"
407
+ ```
408
+
409
+ ```bash
410
+ # In their deployment script
411
+ llm-docs-builder bulk-transform --config llm-docs-builder.yml
412
+ ```
413
+
414
+ **Results:**
415
+ - **140 lines of code → 6 lines of config**
416
+ - **93% average token reduction** across all documentation
417
+ - **Quantifiable savings** via the compare command
418
+ - **Automated daily deployments** via GitHub Actions
419
+
420
+ The compare command revealed that their documentation was consuming 20-36x more tokens than necessary for AI systems. After optimization, RAG queries became dramatically more efficient.
421
+
422
+ ## CLI Reference
423
+
424
+ ```bash
425
+ llm-docs-builder compare [options] # Measure token savings (start here!)
426
+ llm-docs-builder transform [options] # Transform single markdown file
427
+ llm-docs-builder bulk-transform [options] # Transform entire documentation tree
428
+ llm-docs-builder generate [options] # Generate llms.txt index
429
+ llm-docs-builder parse [options] # Parse existing llms.txt
430
+ llm-docs-builder validate [options] # Validate llms.txt format
431
+ llm-docs-builder version # Show version
432
+ ```
433
+
434
+ **Common options:**
435
+ ```
436
+ -c, --config PATH Configuration file (default: llm-docs-builder.yml)
437
+ -d, --docs PATH Documentation directory or file
438
+ -o, --output PATH Output file path
439
+ -u, --url URL URL for comparison
440
+ -f, --file PATH Local file for comparison
441
+ -v, --verbose Detailed output
442
+ -h, --help Show help
443
+ ```
444
+
445
+ For advanced options (base_url, title, suffix, excludes, convert_urls), use a config file.
446
+
447
+ ## Why This Matters for RAG Systems
448
+
449
+ Retrieval-Augmented Generation (RAG) systems fetch documentation to answer questions. Every byte of overhead in those documents:
450
+
451
+ 1. **Costs money** - More tokens = higher API costs
452
+ 2. **Reduces capacity** - Less room for actual documentation in context window
453
+ 3. **Slows responses** - More tokens to process = longer response times
454
+ 4. **Degrades quality** - Navigation noise can confuse the model
455
+
456
+ llm-docs-builder addresses all four issues by transforming markdown to be AI-friendly and enabling your server to automatically serve it to AI bots while humans get HTML.
457
+
458
+ **The JavaScript Problem:**
459
+
460
+ Many documentation sites rely on JavaScript for rendering. AI crawlers typically don't execute JavaScript, so they:
461
+ - Get incomplete content
462
+ - Get server-side rendered HTML (bloated with framework overhead)
463
+ - Fail entirely
464
+
465
+ By detecting AI bots and serving them clean markdown instead of HTML, you sidestep this problem entirely.
466
+
467
+ ## Configuration Reference
468
+
469
+ | Option | Type | Default | Description |
470
+ |--------|------|---------|-------------|
471
+ | `docs` | String | `./docs` | Documentation directory or file |
472
+ | `base_url` | String | - | Base URL for absolute links (e.g., `https://myproject.io`) |
473
+ | `title` | String | Auto-detected | Project title |
474
+ | `description` | String | Auto-detected | Project description |
475
+ | `output` | String | `llms.txt` | Output filename for llms.txt generation |
476
+ | `convert_urls` | Boolean | `false` | Convert `.html`/`.htm` to `.md` |
477
+ | `suffix` | String | `.llm` | Suffix for transformed files (use `""` for in-place) |
478
+ | `excludes` | Array | `[]` | Glob patterns to exclude |
479
+ | `verbose` | Boolean | `false` | Enable detailed output |
480
+
481
+ ## Detailed Docker Usage
482
+
483
+ ### Installation and Setup
484
+
485
+ ```bash
486
+ # Pull from Docker Hub
487
+ docker pull mensfeld/llm-docs-builder:latest
488
+
489
+ # Or from GitHub Container Registry
490
+ docker pull ghcr.io/mensfeld/llm-docs-builder:latest
491
+
492
+ # Create an alias for convenience
493
+ alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'
494
+ ```
495
+
496
+ ### Common Commands
497
+
498
+ **Compare (no volume mount needed for remote URLs):**
499
+ ```bash
500
+ docker run mensfeld/llm-docs-builder compare \
501
+ --url https://karafka.io/docs/Getting-Started/
502
+
503
+ # With local file
504
+ docker run -v $(pwd):/workspace mensfeld/llm-docs-builder compare \
505
+ --url https://example.com/page.html \
506
+ --file docs/page.md
507
+ ```
508
+
509
+ **Generate llms.txt:**
510
+ ```bash
511
+ docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
512
+ generate --docs ./docs --output llms.txt
513
+ ```
514
+
515
+ **Transform single file:**
516
+ ```bash
517
+ docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
518
+ transform --docs README.md --config llm-docs-builder.yml
519
+ ```
520
+
521
+ **Bulk transform:**
522
+ ```bash
523
+ docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
524
+ bulk-transform --config llm-docs-builder.yml
525
+ ```
526
+
527
+ **Parse and validate:**
528
+ ```bash
529
+ docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
530
+ parse --docs llms.txt --verbose
531
+
532
+ docker run -v $(pwd):/workspace mensfeld/llm-docs-builder \
533
+ validate --docs llms.txt
534
+ ```
535
+
536
+ ### CI/CD Examples
537
+
538
+ **GitHub Actions:**
539
+ ```yaml
540
+ jobs:
541
+ optimize-docs:
542
+ runs-on: ubuntu-latest
543
+ steps:
544
+ - uses: actions/checkout@v3
545
+ - name: Transform documentation
546
+ run: |
547
+ docker run -v ${{ github.workspace }}:/workspace \
548
+ mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
549
+ - name: Measure savings
550
+ run: |
551
+ docker run mensfeld/llm-docs-builder \
552
+ compare --url https://yoursite.com/docs/main.html
553
+ ```
554
+
555
+ **GitLab CI:**
556
+ ```yaml
557
+ optimize-docs:
558
+ image: mensfeld/llm-docs-builder:latest
559
+ script:
560
+ - llm-docs-builder bulk-transform --docs ./docs
561
+ - llm-docs-builder compare --url https://yoursite.com/docs
562
+ ```
563
+
564
+ **Jenkins:**
565
+ ```groovy
566
+ stage('Optimize Documentation') {
567
+ steps {
568
+ sh '''
569
+ docker run -v ${WORKSPACE}:/workspace \
570
+ mensfeld/llm-docs-builder bulk-transform --config llm-docs-builder.yml
571
+ '''
572
+ }
573
+ }
574
+ ```
575
+
576
+ ### Version Pinning
577
+
578
+ ```bash
579
+ # Use specific version
580
+ docker run mensfeld/llm-docs-builder:0.3.0 version
581
+
582
+ # Use major version (gets latest patch)
583
+ docker run mensfeld/llm-docs-builder:0 version
584
+
585
+ # Always latest
586
+ docker run mensfeld/llm-docs-builder:latest version
587
+ ```
588
+
589
+ ### Platform-Specific Usage
590
+
591
+ **Windows PowerShell:**
592
+ ```powershell
593
+ docker run -v ${PWD}:/workspace mensfeld/llm-docs-builder generate --docs ./docs
594
+ ```
595
+
596
+ **Windows Command Prompt:**
597
+ ```cmd
598
+ docker run -v %cd%:/workspace mensfeld/llm-docs-builder generate --docs ./docs
599
+ ```
600
+
601
+ **macOS/Linux:**
602
+ ```bash
603
+ docker run -v $(pwd):/workspace mensfeld/llm-docs-builder generate --docs ./docs
604
+ ```
605
+
606
+ ## About llms.txt Standard
607
+
608
+ The [llms.txt specification](https://llmstxt.org/) is a proposed standard for providing LLM-friendly content. It defines a structured format that helps AI systems:
609
+
610
+ - Quickly understand project structure
611
+ - Find relevant documentation efficiently
612
+ - Navigate complex documentation hierarchies
613
+ - Access clean, markdown-formatted content
614
+
615
+ llm-docs-builder generates llms.txt files automatically by:
616
+ 1. Scanning your documentation directory
617
+ 2. Extracting titles and descriptions from markdown files
618
+ 3. Prioritizing content by importance (README first, then guides, APIs, etc.)
619
+ 4. Formatting everything according to the specification
620
+
621
+ The llms.txt file serves as an efficient entry point, but the real token savings come from serving optimized markdown for each individual documentation page.
622
+
623
+ ## How It Works
624
+
625
+ **Generation Process:**
626
+ 1. Scan directory for `.md` files
627
+ 2. Extract title (first H1) and description (first paragraph)
628
+ 3. Prioritize by importance (README → Getting Started → Guides → API → Other)
629
+ 4. Build formatted llms.txt with links and descriptions
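+
+ A rough stand-alone illustration of those four steps, not the gem's actual implementation; the directory layout and priority heuristic here are only examples:
+
+ ```ruby
+ require 'pathname'
+
+ # Scan docs, pull the first H1 and first paragraph from each file,
+ # order entries by a simple priority, and print an llms.txt-style index.
+ PRIORITY = [/readme/i, /getting.started/i, /guide|tutorial/i, /api/i].freeze
+
+ def priority(path)
+   PRIORITY.index { |re| path.to_s.match?(re) } || PRIORITY.size
+ end
+
+ entries = Pathname.glob('docs/**/*.md').map do |path|
+   lines = path.read.lines.map(&:strip)
+   title = lines.find { |l| l.start_with?('# ') }&.delete_prefix('# ') || path.basename('.md').to_s
+   desc  = lines.find { |l| !l.empty? && !l.start_with?('#') }
+   { path: path, title: title, desc: desc }
+ end
+
+ puts '# My Project', ''
+ puts '## Documentation', ''
+ entries.sort_by { |e| priority(e[:path]) }.each do |e|
+   puts "- [#{e[:title]}](https://myproject.io/#{e[:path].basename}): #{e[:desc]}"
+ end
+ ```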
630
+
631
+ **Transformation Process:**
632
+ 1. Expand relative links to absolute URLs
633
+ 2. Optionally convert `.html` to `.md`
634
+ 3. Preserve all content unchanged
635
+ 4. Write to new file or overwrite in-place
636
+
637
+ **Comparison Process:**
638
+ 1. Fetch URL with human User-Agent (or read local file)
639
+ 2. Fetch same URL with AI bot User-Agent
640
+ 3. Calculate size difference and reduction percentage
641
+ 4. Display human-readable comparison results
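+
+ A minimal stand-alone sketch of the same comparison using only Ruby's standard library; the User-Agent strings are illustrative and not necessarily the ones the gem sends:
+
+ ```ruby
+ require 'net/http'
+ require 'uri'
+
+ # Fetch the same URL twice with different User-Agents and compare body sizes.
+ # Redirects are ignored for brevity.
+ def fetch_size(url, user_agent)
+   uri = URI(url)
+   req = Net::HTTP::Get.new(uri)
+   req['User-Agent'] = user_agent
+   res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
+     http.request(req)
+   end
+   res.body.to_s.bytesize
+ end
+
+ url   = 'https://yoursite.com/docs/page.html'
+ human = fetch_size(url, 'Mozilla/5.0')
+ ai    = fetch_size(url, 'GPTBot/1.0')
+
+ reduction = 100.0 * (human - ai) / human
+ puts format('Human: %.1f KB, AI: %.1f KB, reduction: %.0f%%, %.1fx smaller',
+             human / 1024.0, ai / 1024.0, reduction, human.to_f / ai)
+ ```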
642
+
643
+ ## FAQ
644
+
645
+ **Q: Do I need to use llms.txt to benefit from this tool?**
646
+
647
+ No. The compare and transform commands provide value independently. Many users start with `compare` to measure savings, then use `bulk-transform` to normalize their markdown files, and may never generate an llms.txt file.
648
+
649
+ **Q: Will this change how humans see my documentation?**
650
+
651
+ Not if you use the default `suffix: .llm` mode. This creates separate `.llm.md` files served only to AI bots. Your original files remain unchanged for human visitors.
652
+
653
+ **Q: Can I use this in my build pipeline?**
654
+
655
+ Yes. Use `suffix: ""` for in-place transformation. The Karafka framework does this - they transform their markdown as part of their deployment process.
656
+
657
+ **Q: How do I know if it's working?**
658
+
659
+ Use the `compare` command to measure before and after. It shows exact byte counts, reduction percentages, and compression factors.
660
+
661
+ **Q: Does this work with static site generators?**
662
+
663
+ Yes. You can transform markdown files before your static site generator processes them, or serve separate `.llm.md` versions alongside your generated HTML.
664
+
665
+ **Q: What about private/internal documentation?**
666
+
667
+ Use the `excludes` option to skip sensitive files:
668
+ ```yaml
669
+ excludes:
670
+ - "**/private/**"
671
+ - "**/internal/**"
672
+ ```
673
+
674
+ **Q: Can I customize the AI bot detection?**
675
+
676
+ Yes. The web server examples show the User-Agent patterns. You can add or remove patterns based on which AI systems you want to support.
677
+
678
+ ## Contributing
679
+
680
+ Bug reports and pull requests welcome at [github.com/mensfeld/llm-docs-builder](https://github.com/mensfeld/llm-docs-builder).
681
+
682
+ ## License
683
+
684
+ llm-docs-builder is available as open source under the [MIT License](https://opensource.org/licenses/MIT).
data/Rakefile ADDED
@@ -0,0 +1,10 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'bundler/gem_tasks'
4
+ require 'rspec/core/rake_task'
5
+ require 'rubocop/rake_task'
6
+
7
+ RSpec::Core::RakeTask.new(:spec)
8
+ RuboCop::RakeTask.new
9
+
10
+ task default: %i[spec rubocop]
@@ -0,0 +1,7 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require 'llm_docs_builder'
5
+ require 'llm_docs_builder/cli'
6
+
7
+ LlmDocsBuilder::CLI.run
data/bin/rspecs ADDED
@@ -0,0 +1,7 @@
1
+ #!/usr/bin/env bash
2
+ # Run all tests (unit and integration specs)
3
+
4
+ set -e
5
+
6
+ echo "Running all tests..."
7
+ bundle exec rspec --format documentation