llm-docs-builder 0.3.0 → 0.6.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.github/workflows/docker.yml +6 -6
- data/CHANGELOG.md +1 -1
- data/Gemfile.lock +1 -1
- data/README.md +81 -37
- data/lib/llm_docs_builder/cli.rb +14 -2
- data/lib/llm_docs_builder/comparator.rb +42 -7
- data/lib/llm_docs_builder/config.rb +20 -0
- data/lib/llm_docs_builder/markdown_transformer.rb +103 -0
- data/lib/llm_docs_builder/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5f8bf86d722ce6f0727ada59374b507a0f8ce6d240469649ca241ff2c35b2a40
+  data.tar.gz: 745f6669e002326127d72252704b956d9c385076115abb7be8be807a476e7449
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0ebb2ae3a082748f5b407cafdc90ea36a9b47b1852fd6e05f5370eaa8e671f6160bee17c33d5773232b2b81365aaee87e29a8b2b2d863c9a9fd9b0a96e14da58
+  data.tar.gz: d6a2c67eca5455a5fb19962f44f0b1d5e6ebf424a1f3ba268e0c97b351f3f9321568d68fc16496e2383af9023f14f8ffdace517b79056e69c6e7a4a6dce5a4cf
data/.github/workflows/docker.yml
CHANGED

@@ -36,7 +36,7 @@ jobs:
       - name: Docker meta
         id: meta
-        uses: docker/metadata-action@v5
+        uses: docker/metadata-action@c1e51972afc2121e065aed6d45c65596fe445f3f # v5
         with:
           images: |
             mensfeld/llm-docs-builder
@@ -50,28 +50,28 @@ jobs:
             type=raw,value=latest,enable={{is_default_branch}}

       - name: Set up QEMU
-        uses: docker/setup-qemu-action@v3
+        uses: docker/setup-qemu-action@29109295f81e9208d7d86ff1c6c12d2833863392 # v3

       - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3
+        uses: docker/setup-buildx-action@e468171a9de216ec08956ac3ada2f0791b6bd435 # v3

       - name: Login to Docker Hub
         if: github.event_name != 'pull_request'
-        uses: docker/login-action@v3
+        uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef # v3
         with:
           username: ${{ secrets.DOCKERHUB_USERNAME }}
           password: ${{ secrets.DOCKERHUB_TOKEN }}

       - name: Login to GitHub Container Registry
         if: github.event_name != 'pull_request'
-        uses: docker/login-action@v3
+        uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef # v3
         with:
           registry: ghcr.io
           username: ${{ github.actor }}
           password: ${{ secrets.GITHUB_TOKEN }}

       - name: Build and push
-        uses: docker/build-push-action@
+        uses: docker/build-push-action@263435318d21b8e681c14492fe198d362a7d2c83 # v6
         with:
           context: .
           platforms: linux/amd64,linux/arm64
data/CHANGELOG.md
CHANGED

@@ -1,6 +1,6 @@
 # Changelog

-##
+## 0.6.0 (2025-10-09)
 - [Breaking] **Project renamed from `llms-txt-ruby` to `llm-docs-builder`** to better reflect expanded functionality beyond just llms.txt generation.
   - Gem name: `llms-txt-ruby` → `llm-docs-builder`
   - Module name: `LlmsTxt` → `LlmDocsBuilder`
data/Gemfile.lock
CHANGED
data/README.md
CHANGED

@@ -12,9 +12,9 @@ llm-docs-builder normalizes markdown documentation to be AI-friendly and generat
 When LLMs fetch documentation, they typically get HTML pages designed for humans - complete with navigation bars, footers, JavaScript, CSS, and other overhead. This wastes 70-90% of your context window on content that doesn't help answer questions.

 **Real example from Karafka documentation:**
-- Human HTML version: 82.0 KB
-- AI markdown version: 4.1 KB
-- **Result: 95% reduction, 20x smaller**
+- Human HTML version: 82.0 KB (~20,500 tokens)
+- AI markdown version: 4.1 KB (~1,025 tokens)
+- **Result: 95% reduction, 19,475 tokens saved, 20x smaller**

 With GPT-4's pricing at $2.50 per million input tokens, that's real money saved on every API call. More importantly, you can fit 30x more actual documentation into the same context window.
@@ -48,14 +48,15 @@ docker run mensfeld/llm-docs-builder compare \
 Context Window Comparison
 ============================================================

-Human version: 45.2 KB
+Human version: 45.2 KB (~11,300 tokens)
   Source: https://yoursite.com/docs/page.html (User-Agent: human)

-AI version: 12.8 KB
+AI version: 12.8 KB (~3,200 tokens)
   Source: https://yoursite.com/docs/page.html (User-Agent: AI)

 ------------------------------------------------------------
 Reduction: 32.4 KB (72%)
+Token savings: 8,100 tokens (72%)
 Factor: 3.5x smaller
 ============================================================
 ```
@@ -66,27 +67,27 @@ This single command shows you the potential ROI before you invest any time in op

 **[Karafka Framework Documentation](https://karafka.io/docs)** (10 pages analyzed):

-| Page | Human HTML | AI Markdown | Reduction | Factor |
-|------|-----------|-------------|-----------|--------|
-| Getting Started | 82.0 KB | 4.1 KB | 95% | 20.1x |
-| Configuration | 86.3 KB | 7.1 KB | 92% | 12.1x |
-| Routing | 93.6 KB | 14.7 KB | 84% | 6.4x |
-| Deployment | 122.1 KB | 33.3 KB | 73% | 3.7x |
-| Producing Messages | 87.7 KB | 8.3 KB | 91% | 10.6x |
-| Consuming Messages | 105.3 KB | 21.3 KB | 80% | 4.9x |
-| Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | 5.1x |
-| Active Job | 88.7 KB | 8.8 KB | 90% | 10.1x |
-| Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | 3.7x |
-| Error Handling | 93.8 KB | 13.1 KB | 86% | 7.2x |
-
-**Average: 83% reduction, 8.4x smaller files**
+| Page | Human HTML | AI Markdown | Reduction | Tokens Saved | Factor |
+|------|-----------|-------------|-----------|--------------|---------|
+| Getting Started | 82.0 KB | 4.1 KB | 95% | ~19,475 | 20.1x |
+| Configuration | 86.3 KB | 7.1 KB | 92% | ~19,800 | 12.1x |
+| Routing | 93.6 KB | 14.7 KB | 84% | ~19,725 | 6.4x |
+| Deployment | 122.1 KB | 33.3 KB | 73% | ~22,200 | 3.7x |
+| Producing Messages | 87.7 KB | 8.3 KB | 91% | ~19,850 | 10.6x |
+| Consuming Messages | 105.3 KB | 21.3 KB | 80% | ~21,000 | 4.9x |
+| Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | ~21,950 | 5.1x |
+| Active Job | 88.7 KB | 8.8 KB | 90% | ~19,975 | 10.1x |
+| Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | ~22,050 | 3.7x |
+| Error Handling | 93.8 KB | 13.1 KB | 86% | ~20,175 | 7.2x |
+
+**Average: 83% reduction, ~20,620 tokens saved per page, 8.4x smaller files**

 For a typical RAG system making 1,000 documentation queries per day:
-- **Before**: ~990 KB per day × 1,000 queries = ~
-- **After**: ~165 KB per day × 1,000 queries = ~
-- **Savings**: 83% reduction
+- **Before**: ~990 KB per day (~247,500 tokens) × 1,000 queries = ~247.5M tokens/day
+- **After**: ~165 KB per day (~41,250 tokens) × 1,000 queries = ~41.25M tokens/day
+- **Savings**: 83% reduction = ~206.25M tokens saved per day

-At GPT-4 pricing ($2.50/M input tokens), that's approximately **$
+At GPT-4 pricing ($2.50/M input tokens), that's approximately **$500/day or $183,000/year saved** on a documentation site with moderate traffic.

 ## Installation

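The daily-savings arithmetic above follows directly from the ~4-characters-per-token heuristic this README describes in its "How It Works" section. A short Ruby sketch (illustrative, not part of the gem) reproduces the figures:

```ruby
# Convert kilobytes of text into an approximate token count (~4 chars/token).
def tokens_for_kb(kilobytes)
  (kilobytes * 1_000 / 4.0).round
end

before_tokens = tokens_for_kb(990) # ~247,500 tokens per query
after_tokens  = tokens_for_kb(165) # ~41,250 tokens per query

queries_per_day = 1_000
tokens_saved    = (before_tokens - after_tokens) * queries_per_day # 206,250,000/day

# GPT-4 input pricing: $2.50 per million tokens
dollars_saved = tokens_saved / 1_000_000.0 * 2.50 # ≈ $515.6/day
```

The exact product is ≈$515.6/day, which the README rounds to ~$500/day; the $183,000/year figure is ~$500 × 365.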
@@ -165,8 +166,12 @@ llm-docs-builder transform \
 # llm-docs-builder.yml
 docs: ./docs
 base_url: https://myproject.io
-suffix: .llm
-convert_urls: true
+suffix: .llm # Creates README.llm.md alongside README.md
+convert_urls: true # .html → .md
+remove_comments: true # Remove HTML comments
+remove_badges: true # Remove badge/shield images
+remove_frontmatter: true # Remove YAML/TOML frontmatter
+normalize_whitespace: true # Clean up excessive blank lines
 ```

 ```bash
@@ -187,8 +192,12 @@ docs/
 # llm-docs-builder.yml
 docs: ./docs
 base_url: https://myproject.io
-suffix: ""
-convert_urls: true
+suffix: "" # Transforms in-place
+convert_urls: true # Convert .html to .md
+remove_comments: true # Remove HTML comments
+remove_badges: true # Remove badge/shield images
+remove_frontmatter: true # Remove YAML/TOML frontmatter
+normalize_whitespace: true # Clean up excessive blank lines
 excludes:
   - "**/private/**"
 ```
@@ -200,10 +209,14 @@ llm-docs-builder bulk-transform --config llm-docs-builder.yml
 Perfect for CI/CD where you transform docs before deployment.

 **What gets normalized:**
-- Relative
-- HTML
+- **Links**: Relative → Absolute URLs (`./api.md` → `https://yoursite.com/api.md`)
+- **URLs**: HTML → Markdown format (`.html` → `.md`)
+- **Comments**: HTML comments removed (`<!-- ... -->`)
+- **Badges**: Shield/badge images removed (CI badges, version badges, etc.)
+- **Frontmatter**: YAML/TOML metadata removed (Jekyll, Hugo, etc.)
+- **Whitespace**: Excessive blank lines reduced (3+ → 2 max)
 - Clean markdown structure preserved
-- No content modification, just
+- No content modification, just intelligent cleanup

 ### 3. Generate llms.txt (The Standard)
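The first two normalizations are plain string rewrites; this standalone sketch approximates them (the helper names are illustrative — the gem's real logic lives in `MarkdownTransformer`, and its `.html → .md` rewrite is the same `sub` shown in the markdown_transformer.rb section of this diff):

```ruby
# Rewrite an .html/.htm link target to .md
def html_to_md(url)
  url.sub(/\.html?$/, '.md')
end

# Expand a relative link against a base URL (simplified; the gem handles more cases)
def expand_relative(link, base_url)
  return link if link.start_with?('http')
  File.join(base_url, link.sub(%r{\A\./}, ''))
end

html_to_md('guide.html')                            # => "guide.md"
expand_relative('./api.md', 'https://yoursite.com') # => "https://yoursite.com/api.md"
```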
@@ -298,6 +311,10 @@ title: My Project
 description: Brief description
 output: llms.txt
 convert_urls: true
+remove_comments: true
+remove_badges: true
+remove_frontmatter: true
+normalize_whitespace: true
 suffix: .llm
 verbose: false
 excludes:
@@ -369,20 +386,32 @@ content = LlmDocsBuilder.generate_from_docs('./docs',
 # Transform markdown
 transformed = LlmDocsBuilder.transform_markdown('README.md',
   base_url: 'https://myproject.io',
-  convert_urls: true
+  convert_urls: true,
+  remove_comments: true,
+  remove_badges: true,
+  remove_frontmatter: true,
+  normalize_whitespace: true
 )

 # Bulk transform
 files = LlmDocsBuilder.bulk_transform('./docs',
   base_url: 'https://myproject.io',
   suffix: '.llm',
+  remove_comments: true,
+  remove_badges: true,
+  remove_frontmatter: true,
+  normalize_whitespace: true,
   excludes: ['**/private/**']
 )

 # In-place transformation
 files = LlmDocsBuilder.bulk_transform('./docs',
   suffix: '', # Empty for in-place
-  base_url: 'https://myproject.io'
+  base_url: 'https://myproject.io',
+  remove_comments: true,
+  remove_badges: true,
+  remove_frontmatter: true,
+  normalize_whitespace: true
 )
 ```
@@ -401,6 +430,10 @@ The [Karafka framework](https://github.com/karafka/karafka) processes millions o
 docs: ./online/docs
 base_url: https://karafka.io/docs
 convert_urls: true
+remove_comments: true
+remove_badges: true
+remove_frontmatter: true
+normalize_whitespace: true
 suffix: "" # In-place transformation for build pipeline
 excludes:
   - "**/Enterprise-License-Setup/**"
@@ -474,6 +507,10 @@ By detecting AI bots and serving them clean markdown instead of HTML, you sidest
 | `description` | String | Auto-detected | Project description |
 | `output` | String | `llms.txt` | Output filename for llms.txt generation |
 | `convert_urls` | Boolean | `false` | Convert `.html`/`.htm` to `.md` |
+| `remove_comments` | Boolean | `false` | Remove HTML comments (`<!-- ... -->`) |
+| `remove_badges` | Boolean | `false` | Remove badge/shield images (CI, version, etc.) |
+| `remove_frontmatter` | Boolean | `false` | Remove YAML/TOML frontmatter (Jekyll, Hugo) |
+| `normalize_whitespace` | Boolean | `false` | Normalize excessive blank lines and trailing spaces |
 | `suffix` | String | `.llm` | Suffix for transformed files (use `""` for in-place) |
 | `excludes` | Array | `[]` | Glob patterns to exclude |
 | `verbose` | Boolean | `false` | Enable detailed output |
@@ -629,16 +666,23 @@ The llms.txt file serves as an efficient entry point, but the real token savings
 4. Build formatted llms.txt with links and descriptions

 **Transformation Process:**
-1.
-2.
-3.
-4.
+1. Remove frontmatter (YAML/TOML metadata)
+2. Expand relative links to absolute URLs
+3. Convert `.html` URLs to `.md`
+4. Remove HTML comments
+5. Remove badge/shield images
+6. Normalize excessive whitespace
+7. Write to new file or overwrite in-place

 **Comparison Process:**
 1. Fetch URL with human User-Agent (or read local file)
 2. Fetch same URL with AI bot User-Agent
 3. Calculate size difference and reduction percentage
-4.
+4. Estimate token counts using character-based heuristic
+5. Display human-readable comparison results with byte and token savings
+
+**Token Estimation:**
+The tool uses a simple but effective heuristic for estimating token counts: **~4 characters per token**. This approximation works well for English documentation and provides reasonable estimates without requiring external tokenizer dependencies. While not as precise as OpenAI's tiktoken, it's accurate enough (±10-15%) for understanding context window savings and making optimization decisions.

 ## FAQ

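The heuristic is a single line of Ruby; this sketch mirrors the `estimate_tokens` method this release adds to the comparator:

```ruby
# ~4 characters per token, rounded to the nearest whole token
def estimate_tokens(content)
  (content.length / 4.0).round
end

estimate_tokens('a' * 82_000) # => 20500  (82.0 KB of HTML ≈ 20,500 tokens)
estimate_tokens('a' * 4_100)  # => 1025   (4.1 KB of markdown ≈ 1,025 tokens)
```

These are exactly the figures quoted in the Karafka example earlier in the README.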
data/lib/llm_docs_builder/cli.rb
CHANGED

@@ -351,21 +351,25 @@ module LlmDocsBuilder
 puts 'Context Window Comparison'
 puts '=' * 60
 puts ''
-puts "Human version: #{format_bytes(result[:human_size])}"
+puts "Human version: #{format_bytes(result[:human_size])} (~#{format_number(result[:human_tokens])} tokens)"
 puts " Source: #{result[:human_source]}"
 puts ''
-puts "AI version: #{format_bytes(result[:ai_size])}"
+puts "AI version: #{format_bytes(result[:ai_size])} (~#{format_number(result[:ai_tokens])} tokens)"
 puts " Source: #{result[:ai_source]}"
 puts ''
 puts '-' * 60

 if result[:reduction_bytes].positive?
   puts "Reduction: #{format_bytes(result[:reduction_bytes])} (#{result[:reduction_percent]}%)"
+  puts "Token savings: #{format_number(result[:token_reduction])} tokens (#{result[:token_reduction_percent]}%)"
   puts "Factor: #{result[:factor]}x smaller"
 elsif result[:reduction_bytes].negative?
   increase_bytes = result[:reduction_bytes].abs
   increase_percent = result[:reduction_percent].abs
+  token_increase = result[:token_reduction].abs
+  token_increase_percent = result[:token_reduction_percent].abs
   puts "Increase: #{format_bytes(increase_bytes)} (#{increase_percent}%)"
+  puts "Token increase: #{format_number(token_increase)} tokens (#{token_increase_percent}%)"
   puts "Factor: #{result[:factor]}x larger"
 else
   puts 'Same size'
@@ -389,6 +393,14 @@ module LlmDocsBuilder
   end
 end

+# Format number with comma separators for readability
+#
+# @param number [Integer] number to format
+# @return [String] formatted number with commas
+def format_number(number)
+  number.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
+end
+
 # Validate llms.txt file format
 #
 # Checks if llms.txt file follows proper format with title, description, and documentation links.
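`format_number` works by reversing the digit string, inserting a comma after every group of three digits that is followed by another digit, then reversing back. A quick check of the pattern in isolation:

```ruby
# Same implementation as the new CLI helper
def format_number(number)
  number.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
end

format_number(1025)        # => "1,025"
format_number(206_250_000) # => "206,250,000"
format_number(42)          # => "42" (no comma for fewer than four digits)
```

The `(?=\d)` lookahead is what stops a stray leading comma: a three-digit group at the end of the reversed string (i.e. the front of the number) has no following digit, so it is left alone.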
data/lib/llm_docs_builder/comparator.rb
CHANGED

@@ -62,6 +62,10 @@ module LlmDocsBuilder
 # - :reduction_bytes [Integer] bytes saved
 # - :reduction_percent [Integer] percentage reduction
 # - :factor [Float] compression factor
+# - :human_tokens [Integer] estimated tokens in human version
+# - :ai_tokens [Integer] estimated tokens in AI version
+# - :token_reduction [Integer] estimated tokens saved
+# - :token_reduction_percent [Integer] percentage of tokens saved
 # - :human_source [String] source description (URL or file)
 # - :ai_source [String] source description (URL or file)
 def compare
@@ -85,8 +89,8 @@ module LlmDocsBuilder
 ai_content = fetch_url(url, options[:ai_user_agent])

 calculate_results(
-  human_content
-  ai_content
+  human_content,
+  ai_content,
   "#{url} (User-Agent: human)",
   "#{url} (User-Agent: AI)"
 )
@@ -112,8 +116,8 @@ module LlmDocsBuilder
 ai_content = File.read(local_file)

 calculate_results(
-  human_content
-  ai_content
+  human_content,
+  ai_content,
   url,
   local_file
 )
@@ -205,12 +209,15 @@ module LlmDocsBuilder

 # Calculate comparison statistics
 #
-# @param
-# @param
+# @param human_content [String] content of human version
+# @param ai_content [String] content of AI version
 # @param human_source [String] description of human source
 # @param ai_source [String] description of AI source
 # @return [Hash] comparison results
-def calculate_results(
+def calculate_results(human_content, ai_content, human_source, ai_source)
+  human_size = human_content.bytesize
+  ai_size = ai_content.bytesize
+
   reduction_bytes = human_size - ai_size
   reduction_percent = if human_size.positive?
     ((reduction_bytes.to_f / human_size) * 100).round
@@ -224,15 +231,43 @@ module LlmDocsBuilder
     Float::INFINITY
   end

+  # Estimate tokens
+  human_tokens = estimate_tokens(human_content)
+  ai_tokens = estimate_tokens(ai_content)
+  token_reduction = human_tokens - ai_tokens
+  token_reduction_percent = if human_tokens.positive?
+    ((token_reduction.to_f / human_tokens) * 100).round
+  else
+    0
+  end
+
   {
     human_size: human_size,
     ai_size: ai_size,
     reduction_bytes: reduction_bytes,
     reduction_percent: reduction_percent,
     factor: factor,
+    human_tokens: human_tokens,
+    ai_tokens: ai_tokens,
+    token_reduction: token_reduction,
+    token_reduction_percent: token_reduction_percent,
     human_source: human_source,
     ai_source: ai_source
   }
 end
+
+# Estimate token count using character-based approximation
+#
+# Uses the common heuristic that ~4 characters equals 1 token for English text.
+# This provides reasonable estimates for documentation content without requiring
+# external tokenizer dependencies.
+#
+# @param content [String] text content to estimate tokens for
+# @return [Integer] estimated number of tokens
+def estimate_tokens(content)
+  # Use 4 characters per token as a reasonable approximation
+  # This is a common heuristic for English text and works well for documentation
+  (content.length / 4.0).round
+end
 end
 end
data/lib/llm_docs_builder/config.rb
CHANGED

@@ -67,6 +67,26 @@ module LlmDocsBuilder
 else
   self['convert_urls'] || false
 end,
+remove_comments: if options.key?(:remove_comments)
+  options[:remove_comments]
+else
+  self['remove_comments'] || false
+end,
+normalize_whitespace: if options.key?(:normalize_whitespace)
+  options[:normalize_whitespace]
+else
+  self['normalize_whitespace'] || false
+end,
+remove_badges: if options.key?(:remove_badges)
+  options[:remove_badges]
+else
+  self['remove_badges'] || false
+end,
+remove_frontmatter: if options.key?(:remove_frontmatter)
+  options[:remove_frontmatter]
+else
+  self['remove_frontmatter'] || false
+end,
 verbose: options.key?(:verbose) ? options[:verbose] : (self['verbose'] || false),
 # Bulk transformation options
 suffix: options[:suffix] || self['suffix'] || '.llm',
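The verbose `if options.key?` form exists so that an explicit `false` passed programmatically still overrides a `true` in the YAML config; a plain `options[:x] || self['x']` would silently fall through to the config value. A minimal illustration with a hypothetical `resolve` helper (not part of the gem):

```ruby
# Resolve one boolean flag: explicit option wins, then config file, then false.
def resolve(options, config, key)
  return options[key] if options.key?(key)
  config[key.to_s] || false
end

config = { 'remove_badges' => true }

resolve({}, config, :remove_badges)                       # => true  (config value used)
resolve({ remove_badges: false }, config, :remove_badges) # => false (explicit override)
resolve({}, {}, :remove_badges)                           # => false (default)
```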
data/lib/llm_docs_builder/markdown_transformer.rb
CHANGED

@@ -27,6 +27,10 @@ module LlmDocsBuilder
 # @param options [Hash] transformation options
 # @option options [String] :base_url base URL for expanding relative links
 # @option options [Boolean] :convert_urls convert HTML URLs to markdown format
+# @option options [Boolean] :remove_comments remove HTML comments from markdown
+# @option options [Boolean] :normalize_whitespace normalize excessive whitespace
+# @option options [Boolean] :remove_badges remove badge/shield images
+# @option options [Boolean] :remove_frontmatter remove YAML/TOML frontmatter
 def initialize(file_path, options = {})
   @file_path = file_path
   @options = options
@@ -35,16 +39,31 @@ module LlmDocsBuilder
 # Transform markdown content to be AI-friendly
 #
 # Applies transformations to make the markdown more suitable for LLM processing:
+# - Removes YAML/TOML frontmatter (if remove_frontmatter enabled)
 # - Expands relative links to absolute URLs (if base_url provided)
 # - Converts HTML URLs to markdown format (if convert_urls enabled)
+# - Removes HTML comments (if remove_comments enabled)
+# - Removes badge/shield images (if remove_badges enabled)
+# - Normalizes excessive whitespace (if normalize_whitespace enabled)
 #
 # @return [String] transformed markdown content
 def transform
   content = File.read(file_path)

+  # Remove frontmatter first (before any other processing)
+  content = remove_frontmatter(content) if options[:remove_frontmatter]
+
+  # Link transformations
   content = expand_relative_links(content) if options[:base_url]
   content = convert_html_urls(content) if options[:convert_urls]

+  # Content cleanup
+  content = remove_comments(content) if options[:remove_comments]
+  content = remove_badges(content) if options[:remove_badges]
+
+  # Whitespace normalization last (after all other transformations)
+  content = normalize_whitespace(content) if options[:normalize_whitespace]
+
   content
 end

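The ordering in `transform` matters: frontmatter is stripped first so later regexes never see the metadata block, and whitespace normalization runs last so it can collapse blank lines left behind by comment and badge removal. A condensed, standalone sketch of that pipeline (approximating the private methods shown in this diff):

```ruby
def transform(content, options = {})
  # 1. Frontmatter first, so later steps never see the metadata block
  content = content.sub(/\A---\s*$.*?^---\s*$/m, '') if options[:remove_frontmatter]
  # 2. Content cleanup (non-greedy so each comment stops at its own -->)
  content = content.gsub(/<!--.*?-->/m, '') if options[:remove_comments]
  # 3. Whitespace last, collapsing blank lines the steps above left behind
  if options[:normalize_whitespace]
    content = content.gsub(/ +$/, '').gsub(/\n{4,}/, "\n\n\n").strip
  end
  content
end

input = "---\ntitle: Doc\n---\n\n# Hello\n<!-- internal note -->\n\n\n\n\nBody\n"
transform(input, remove_frontmatter: true, remove_comments: true,
                 normalize_whitespace: true)
# => "# Hello\n\n\nBody"
```

Running whitespace normalization first instead would miss the run of blank lines created when the comment between "Hello" and "Body" is deleted.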
@@ -86,5 +105,89 @@ module LlmDocsBuilder
   url.sub(/\.html?$/, '.md')
 end
 end
+
+# Remove HTML comments from markdown content
+#
+# Strips out HTML comments (<!-- ... -->) which are typically metadata for developers
+# and not relevant for LLM consumption. This reduces token usage and improves clarity.
+#
+# Handles:
+# - Single-line comments: <!-- comment -->
+# - Multi-line comments spanning multiple lines
+# - Multiple comments in the same content
+#
+# @param content [String] markdown content to process
+# @return [String] content with comments removed
+def remove_comments(content)
+  # Remove HTML comments (single and multi-line)
+  # The .*? makes it non-greedy so it stops at the first -->
+  content.gsub(/<!--.*?-->/m, '')
+end
+
+# Remove badge and shield images from markdown
+#
+# Strips out badge/shield images (typically from shields.io, badge.fury.io, etc.)
+# which are visual indicators for humans but provide no value to LLMs.
+#
+# Recognizes common patterns:
+# - [![alt](badge-url)](link-url) (linked badges)
+# - ![alt](badge-url) (unlinked badges)
+# - Common badge domains: shields.io, badge.fury.io, travis-ci.org, etc.
+#
+# @param content [String] markdown content to process
+# @return [String] content with badges removed
+def remove_badges(content)
+  # Remove linked badges: [![alt](badge-url)](link-url)
+  content = content.gsub(/\[\!\[([^\]]*)\]\([^\)]*(?:badge|shield|svg|travis|coveralls|fury)[^\)]*\)\]\([^\)]*\)/i, '')
+
+  # Remove standalone badges: ![alt](badge-url)
+  content = content.gsub(/!\[([^\]]*)\]\([^\)]*(?:badge|shield|svg|travis|coveralls|fury)[^\)]*\)/i, '')
+
+  content
+end
+
+# Remove YAML or TOML frontmatter from markdown
+#
+# Strips out frontmatter blocks which are metadata used by static site generators
+# (Jekyll, Hugo, etc.) but not relevant for LLM consumption.
+#
+# Recognizes:
+# - YAML frontmatter: --- ... ---
+# - TOML frontmatter: +++ ... +++
+#
+# @param content [String] markdown content to process
+# @return [String] content with frontmatter removed
+def remove_frontmatter(content)
+  # Remove YAML frontmatter (--- ... ---)
+  content = content.sub(/\A---\s*$.*?^---\s*$/m, '')
+
+  # Remove TOML frontmatter (+++ ... +++)
+  content = content.sub(/\A\+\+\+\s*$.*?^\+\+\+\s*$/m, '')
+
+  content
+end
+
+# Normalize excessive whitespace in markdown
+#
+# Reduces excessive blank lines and trailing whitespace to make content more compact
+# for LLM consumption without affecting readability.
+#
+# Transformations:
+# - Multiple consecutive blank lines (3+) → 2 blank lines max
+# - Trailing whitespace on lines → removed
+# - Leading/trailing whitespace in file → trimmed
+#
+# @param content [String] markdown content to process
+# @return [String] content with normalized whitespace
+def normalize_whitespace(content)
+  # Remove trailing whitespace from each line
+  content = content.gsub(/ +$/, '')
+
+  # Reduce multiple consecutive blank lines to maximum of 2
+  content = content.gsub(/\n{4,}/, "\n\n\n")
+
+  # Trim leading and trailing whitespace from the entire content
+  content.strip
+end
 end
 end
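The badge regexes match on URL substrings (`badge`, `shield`, `svg`, `travis`, `coveralls`, `fury`) rather than parsing markdown structure, so images whose URLs contain none of those keywords survive. A standalone check (sample URLs are illustrative):

```ruby
# Same two substitutions as MarkdownTransformer#remove_badges
def remove_badges(content)
  # Linked badges: [![alt](badge-url)](link-url)
  content = content.gsub(/\[\!\[([^\]]*)\]\([^\)]*(?:badge|shield|svg|travis|coveralls|fury)[^\)]*\)\]\([^\)]*\)/i, '')
  # Standalone badges: ![alt](badge-url)
  content.gsub(/!\[([^\]]*)\]\([^\)]*(?:badge|shield|svg|travis|coveralls|fury)[^\)]*\)/i, '')
end

md = <<~MD
  [![Build](https://img.shields.io/badge/build-passing-green)](https://ci.example.com)
  ![Gem](https://badge.fury.io/rb/llm-docs-builder.svg)
  ![Diagram](https://example.com/architecture.png)
MD

remove_badges(md) # keeps only the non-badge architecture image
```

Note the trade-off: any ordinary image whose URL happens to contain one of the keywords (for example an inline `.svg` diagram) would also be stripped.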