llm-docs-builder 0.3.0 → 0.6.0

This diff shows the changes between publicly released package versions as they appear in their respective public registries. It is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 68bc5131179dcb94c393cb2973c9fa1ffbf868616effb6814a7e99d340923de2
- data.tar.gz: c660150ac38f2687951ebd6f1e2f2014f4c5636aa4c322ef9114427b9d49a235
+ metadata.gz: 5f8bf86d722ce6f0727ada59374b507a0f8ce6d240469649ca241ff2c35b2a40
+ data.tar.gz: 745f6669e002326127d72252704b956d9c385076115abb7be8be807a476e7449
  SHA512:
- metadata.gz: 45bf6443d05c9173bf308b80240595e509d596447343aca937158ebddc44d9d0d07c0b3bc0db61b743e49f9ff634f799a90e8f72c0105ed3221fc3f11bdd7e05
- data.tar.gz: 4a2e9e04aa39dfa2954988a20e825c1d306299084e120f3332c413e008353be45336819bc0127aa03e05aaaa47db6a90b78f1c4033ea95b665262035b0998ace
+ metadata.gz: 0ebb2ae3a082748f5b407cafdc90ea36a9b47b1852fd6e05f5370eaa8e671f6160bee17c33d5773232b2b81365aaee87e29a8b2b2d863c9a9fd9b0a96e14da58
+ data.tar.gz: d6a2c67eca5455a5fb19962f44f0b1d5e6ebf424a1f3ba268e0c97b351f3f9321568d68fc16496e2383af9023f14f8ffdace517b79056e69c6e7a4a6dce5a4cf
@@ -36,7 +36,7 @@ jobs:
 
  - name: Docker meta
  id: meta
- uses: docker/metadata-action@v5
+ uses: docker/metadata-action@c1e51972afc2121e065aed6d45c65596fe445f3f # v5
  with:
  images: |
  mensfeld/llm-docs-builder
@@ -50,28 +50,28 @@ jobs:
  type=raw,value=latest,enable={{is_default_branch}}
 
  - name: Set up QEMU
- uses: docker/setup-qemu-action@v3
+ uses: docker/setup-qemu-action@29109295f81e9208d7d86ff1c6c12d2833863392 # v3
 
  - name: Set up Docker Buildx
- uses: docker/setup-buildx-action@v3
+ uses: docker/setup-buildx-action@e468171a9de216ec08956ac3ada2f0791b6bd435 # v3
 
  - name: Login to Docker Hub
  if: github.event_name != 'pull_request'
- uses: docker/login-action@v3
+ uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef # v3
  with:
  username: ${{ secrets.DOCKERHUB_USERNAME }}
  password: ${{ secrets.DOCKERHUB_TOKEN }}
 
  - name: Login to GitHub Container Registry
  if: github.event_name != 'pull_request'
- uses: docker/login-action@v3
+ uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef # v3
  with:
  registry: ghcr.io
  username: ${{ github.actor }}
  password: ${{ secrets.GITHUB_TOKEN }}
 
  - name: Build and push
- uses: docker/build-push-action@v5
+ uses: docker/build-push-action@263435318d21b8e681c14492fe198d362a7d2c83 # v6
  with:
  context: .
  platforms: linux/amd64,linux/arm64
data/CHANGELOG.md CHANGED
@@ -1,6 +1,6 @@
  # Changelog
 
- ## Unreleased
+ ## 0.6.0 (2025-10-09)
  - [Breaking] **Project renamed from `llms-txt-ruby` to `llm-docs-builder`** to better reflect expanded functionality beyond just llms.txt generation.
  - Gem name: `llms-txt-ruby` → `llm-docs-builder`
  - Module name: `LlmsTxt` → `LlmDocsBuilder`
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- llm-docs-builder (0.3.0)
+ llm-docs-builder (0.6.0)
  zeitwerk (~> 2.6)
 
  GEM
data/README.md CHANGED
@@ -12,9 +12,9 @@ llm-docs-builder normalizes markdown documentation to be AI-friendly and generat
  When LLMs fetch documentation, they typically get HTML pages designed for humans - complete with navigation bars, footers, JavaScript, CSS, and other overhead. This wastes 70-90% of your context window on content that doesn't help answer questions.
 
  **Real example from Karafka documentation:**
- - Human HTML version: 82.0 KB
- - AI markdown version: 4.1 KB
- - **Result: 95% reduction, 20x smaller**
+ - Human HTML version: 82.0 KB (~20,500 tokens)
+ - AI markdown version: 4.1 KB (~1,025 tokens)
+ - **Result: 95% reduction, 19,475 tokens saved, 20x smaller**
 
  With GPT-4's pricing at $2.50 per million input tokens, that's real money saved on every API call. More importantly, you can fit 30x more actual documentation into the same context window.
 
@@ -48,14 +48,15 @@ docker run mensfeld/llm-docs-builder compare \
  Context Window Comparison
  ============================================================
 
- Human version: 45.2 KB
+ Human version: 45.2 KB (~11,300 tokens)
  Source: https://yoursite.com/docs/page.html (User-Agent: human)
 
- AI version: 12.8 KB
+ AI version: 12.8 KB (~3,200 tokens)
  Source: https://yoursite.com/docs/page.html (User-Agent: AI)
 
  ------------------------------------------------------------
  Reduction: 32.4 KB (72%)
+ Token savings: 8,100 tokens (72%)
  Factor: 3.5x smaller
  ============================================================
  ```
@@ -66,27 +67,27 @@ This single command shows you the potential ROI before you invest any time in op
 
  **[Karafka Framework Documentation](https://karafka.io/docs)** (10 pages analyzed):
 
- | Page | Human HTML | AI Markdown | Reduction | Factor |
- |------|-----------|-------------|-----------|---------|
- | Getting Started | 82.0 KB | 4.1 KB | 95% | 20.1x |
- | Configuration | 86.3 KB | 7.1 KB | 92% | 12.1x |
- | Routing | 93.6 KB | 14.7 KB | 84% | 6.4x |
- | Deployment | 122.1 KB | 33.3 KB | 73% | 3.7x |
- | Producing Messages | 87.7 KB | 8.3 KB | 91% | 10.6x |
- | Consuming Messages | 105.3 KB | 21.3 KB | 80% | 4.9x |
- | Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | 5.1x |
- | Active Job | 88.7 KB | 8.8 KB | 90% | 10.1x |
- | Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | 3.7x |
- | Error Handling | 93.8 KB | 13.1 KB | 86% | 7.2x |
-
- **Average: 83% reduction, 8.4x smaller files**
+ | Page | Human HTML | AI Markdown | Reduction | Tokens Saved | Factor |
+ |------|-----------|-------------|-----------|--------------|---------|
+ | Getting Started | 82.0 KB | 4.1 KB | 95% | ~19,475 | 20.1x |
+ | Configuration | 86.3 KB | 7.1 KB | 92% | ~19,800 | 12.1x |
+ | Routing | 93.6 KB | 14.7 KB | 84% | ~19,725 | 6.4x |
+ | Deployment | 122.1 KB | 33.3 KB | 73% | ~22,200 | 3.7x |
+ | Producing Messages | 87.7 KB | 8.3 KB | 91% | ~19,850 | 10.6x |
+ | Consuming Messages | 105.3 KB | 21.3 KB | 80% | ~21,000 | 4.9x |
+ | Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | ~21,950 | 5.1x |
+ | Active Job | 88.7 KB | 8.8 KB | 90% | ~19,975 | 10.1x |
+ | Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | ~22,050 | 3.7x |
+ | Error Handling | 93.8 KB | 13.1 KB | 86% | ~20,175 | 7.2x |
+
+ **Average: 83% reduction, ~20,620 tokens saved per page, 8.4x smaller files**
 
  For a typical RAG system making 1,000 documentation queries per day:
- - **Before**: ~990 KB per day × 1,000 queries = ~990 MB processed
- - **After**: ~165 KB per day × 1,000 queries = ~165 MB processed
- - **Savings**: 83% reduction in token costs
+ - **Before**: ~990 KB per day (~247,500 tokens) × 1,000 queries = ~247.5M tokens/day
+ - **After**: ~165 KB per day (~41,250 tokens) × 1,000 queries = ~41.25M tokens/day
+ - **Savings**: 83% reduction = ~206.25M tokens saved per day
 
- At GPT-4 pricing ($2.50/M input tokens), that's approximately **$2,000-5,000 saved annually** on a documentation site with moderate traffic.
+ At GPT-4 pricing ($2.50/M input tokens), that's approximately **$500/day or $183,000/year saved** on a documentation site with moderate traffic.
 
  ## Installation
 
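The updated RAG math above can be sanity-checked with the same ~4-characters-per-token heuristic the gem itself adopts later in this diff (`estimate_tokens`). A back-of-envelope sketch in Ruby, using the README's own figures and the 1 KB ≈ 1,000 characters convention its numbers imply:

```ruby
# Sanity check of the README's RAG savings, assuming ~4 chars/token
# and 1 KB ≈ 1,000 characters (the convention the README's numbers imply).
human_kb = 990.0          # ~10 human HTML pages per query batch
ai_kb    = 165.0          # the same pages as AI markdown
queries_per_day = 1_000
tokens_per_kb   = 1_000 / 4.0

saved_tokens = ((human_kb - ai_kb) * tokens_per_kb * queries_per_day).round
daily_saving = saved_tokens / 1_000_000.0 * 2.50 # $2.50 per million input tokens

puts saved_tokens # => 206250000, i.e. the ~206.25M tokens/day cited above
puts daily_saving # => 515.625, which the README rounds to ~$500/day
```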
@@ -165,8 +166,12 @@ llm-docs-builder transform \
  # llm-docs-builder.yml
  docs: ./docs
  base_url: https://myproject.io
- suffix: .llm # Creates README.llm.md alongside README.md
- convert_urls: true # .html → .md
+ suffix: .llm # Creates README.llm.md alongside README.md
+ convert_urls: true # .html → .md
+ remove_comments: true # Remove HTML comments
+ remove_badges: true # Remove badge/shield images
+ remove_frontmatter: true # Remove YAML/TOML frontmatter
+ normalize_whitespace: true # Clean up excessive blank lines
  ```
 
  ```bash
@@ -187,8 +192,12 @@ docs/
  # llm-docs-builder.yml
  docs: ./docs
  base_url: https://myproject.io
- suffix: "" # Transforms in-place
- convert_urls: true
+ suffix: "" # Transforms in-place
+ convert_urls: true # Convert .html to .md
+ remove_comments: true # Remove HTML comments
+ remove_badges: true # Remove badge/shield images
+ remove_frontmatter: true # Remove YAML/TOML frontmatter
+ normalize_whitespace: true # Clean up excessive blank lines
  excludes:
  - "**/private/**"
  ```
@@ -200,10 +209,14 @@ llm-docs-builder bulk-transform --config llm-docs-builder.yml
  Perfect for CI/CD where you transform docs before deployment.
 
  **What gets normalized:**
- - Relative links → Absolute URLs (`./api.md` → `https://yoursite.com/api.md`)
- - HTML URLs → Markdown format (`.html` → `.md`)
+ - **Links**: Relative → Absolute URLs (`./api.md` → `https://yoursite.com/api.md`)
+ - **URLs**: HTML → Markdown format (`.html` → `.md`)
+ - **Comments**: HTML comments removed (`<!-- ... -->`)
+ - **Badges**: Shield/badge images removed (CI badges, version badges, etc.)
+ - **Frontmatter**: YAML/TOML metadata removed (Jekyll, Hugo, etc.)
+ - **Whitespace**: Excessive blank lines reduced (3+ → 2 max)
  - Clean markdown structure preserved
- - No content modification, just link normalization
+ - No content modification, just intelligent cleanup
 
  ### 3. Generate llms.txt (The Standard)
 
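Each item in the "What gets normalized" list above corresponds to a private method added later in this diff (`remove_frontmatter`, `remove_comments`, `remove_badges`, `normalize_whitespace`). A minimal sketch of their combined effect, reusing the same regexes the diff introduces (link expansion omitted, sample content hypothetical):

```ruby
# Applies the new cleanup passes, in the order transform uses them.
raw = <<~MD
  ---
  layout: docs
  ---

  [![Build](https://img.shields.io/badge/build-passing.svg)](https://ci.example.com)

  # Getting Started
  <!-- TODO: expand this section -->


  Install the gem.
MD

md = raw.sub(/\A---\s*$.*?^---\s*$/m, '') # remove_frontmatter
md = md.gsub(/<!--.*?-->/m, '')           # remove_comments
md = md.gsub(/\[\!\[([^\]]*)\]\([^\)]*(?:badge|shield|svg|travis|coveralls|fury)[^\)]*\)\]\([^\)]*\)/i, '') # remove_badges (linked form)
md = md.gsub(/ +$/, '').gsub(/\n{4,}/, "\n\n\n").strip # normalize_whitespace

puts md # => "# Getting Started", two blank lines, then "Install the gem."
```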
@@ -298,6 +311,10 @@ title: My Project
  description: Brief description
  output: llms.txt
  convert_urls: true
+ remove_comments: true
+ remove_badges: true
+ remove_frontmatter: true
+ normalize_whitespace: true
  suffix: .llm
  verbose: false
  excludes:
@@ -369,20 +386,32 @@ content = LlmDocsBuilder.generate_from_docs('./docs',
  # Transform markdown
  transformed = LlmDocsBuilder.transform_markdown('README.md',
  base_url: 'https://myproject.io',
- convert_urls: true
+ convert_urls: true,
+ remove_comments: true,
+ remove_badges: true,
+ remove_frontmatter: true,
+ normalize_whitespace: true
  )
 
  # Bulk transform
  files = LlmDocsBuilder.bulk_transform('./docs',
  base_url: 'https://myproject.io',
  suffix: '.llm',
+ remove_comments: true,
+ remove_badges: true,
+ remove_frontmatter: true,
+ normalize_whitespace: true,
  excludes: ['**/private/**']
  )
 
  # In-place transformation
  files = LlmDocsBuilder.bulk_transform('./docs',
  suffix: '', # Empty for in-place
- base_url: 'https://myproject.io'
+ base_url: 'https://myproject.io',
+ remove_comments: true,
+ remove_badges: true,
+ remove_frontmatter: true,
+ normalize_whitespace: true
  )
  ```
 
@@ -401,6 +430,10 @@ The [Karafka framework](https://github.com/karafka/karafka) processes millions o
  docs: ./online/docs
  base_url: https://karafka.io/docs
  convert_urls: true
+ remove_comments: true
+ remove_badges: true
+ remove_frontmatter: true
+ normalize_whitespace: true
  suffix: "" # In-place transformation for build pipeline
  excludes:
  - "**/Enterprise-License-Setup/**"
@@ -474,6 +507,10 @@ By detecting AI bots and serving them clean markdown instead of HTML, you sidest
  | `description` | String | Auto-detected | Project description |
  | `output` | String | `llms.txt` | Output filename for llms.txt generation |
  | `convert_urls` | Boolean | `false` | Convert `.html`/`.htm` to `.md` |
+ | `remove_comments` | Boolean | `false` | Remove HTML comments (`<!-- ... -->`) |
+ | `remove_badges` | Boolean | `false` | Remove badge/shield images (CI, version, etc.) |
+ | `remove_frontmatter` | Boolean | `false` | Remove YAML/TOML frontmatter (Jekyll, Hugo) |
+ | `normalize_whitespace` | Boolean | `false` | Normalize excessive blank lines and trailing spaces |
  | `suffix` | String | `.llm` | Suffix for transformed files (use `""` for in-place) |
  | `excludes` | Array | `[]` | Glob patterns to exclude |
  | `verbose` | Boolean | `false` | Enable detailed output |
@@ -629,16 +666,23 @@ The llms.txt file serves as an efficient entry point, but the real token savings
  4. Build formatted llms.txt with links and descriptions
 
  **Transformation Process:**
- 1. Expand relative links to absolute URLs
- 2. Optionally convert `.html` to `.md`
- 3. Preserve all content unchanged
- 4. Write to new file or overwrite in-place
+ 1. Remove frontmatter (YAML/TOML metadata)
+ 2. Expand relative links to absolute URLs
+ 3. Convert `.html` URLs to `.md`
+ 4. Remove HTML comments
+ 5. Remove badge/shield images
+ 6. Normalize excessive whitespace
+ 7. Write to new file or overwrite in-place
 
  **Comparison Process:**
  1. Fetch URL with human User-Agent (or read local file)
  2. Fetch same URL with AI bot User-Agent
  3. Calculate size difference and reduction percentage
- 4. Display human-readable comparison results
+ 4. Estimate token counts using character-based heuristic
+ 5. Display human-readable comparison results with byte and token savings
+
+ **Token Estimation:**
+ The tool uses a simple but effective heuristic for estimating token counts: **~4 characters per token**. This approximation works well for English documentation and provides reasonable estimates without requiring external tokenizer dependencies. While not as precise as OpenAI's tiktoken, it's accurate enough (±10-15%) for understanding context window savings and making optimization decisions.
 
  ## FAQ
 
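The new Token Estimation paragraph above describes the `estimate_tokens` helper added further down in this diff. A worked example against the README's Getting Started figures (the strings are stand-ins, not real page content):

```ruby
# ~4 characters per token, exactly as estimate_tokens computes it.
human_html = 'x' * 82_000 # stand-in for the 82.0 KB human HTML page
ai_md      = 'x' * 4_100  # stand-in for the 4.1 KB AI markdown version

human_tokens = (human_html.length / 4.0).round # => 20500 (~20,500 tokens)
ai_tokens    = (ai_md.length / 4.0).round      # => 1025  (~1,025 tokens)

puts human_tokens - ai_tokens                                  # => 19475 tokens saved
puts ((human_tokens - ai_tokens) * 100.0 / human_tokens).round # => 95 (% reduction)
```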
@@ -351,21 +351,25 @@ module LlmDocsBuilder
  puts 'Context Window Comparison'
  puts '=' * 60
  puts ''
- puts "Human version: #{format_bytes(result[:human_size])}"
+ puts "Human version: #{format_bytes(result[:human_size])} (~#{format_number(result[:human_tokens])} tokens)"
  puts " Source: #{result[:human_source]}"
  puts ''
- puts "AI version: #{format_bytes(result[:ai_size])}"
+ puts "AI version: #{format_bytes(result[:ai_size])} (~#{format_number(result[:ai_tokens])} tokens)"
  puts " Source: #{result[:ai_source]}"
  puts ''
  puts '-' * 60
 
  if result[:reduction_bytes].positive?
  puts "Reduction: #{format_bytes(result[:reduction_bytes])} (#{result[:reduction_percent]}%)"
+ puts "Token savings: #{format_number(result[:token_reduction])} tokens (#{result[:token_reduction_percent]}%)"
  puts "Factor: #{result[:factor]}x smaller"
  elsif result[:reduction_bytes].negative?
  increase_bytes = result[:reduction_bytes].abs
  increase_percent = result[:reduction_percent].abs
+ token_increase = result[:token_reduction].abs
+ token_increase_percent = result[:token_reduction_percent].abs
  puts "Increase: #{format_bytes(increase_bytes)} (#{increase_percent}%)"
+ puts "Token increase: #{format_number(token_increase)} tokens (#{token_increase_percent}%)"
  puts "Factor: #{result[:factor]}x larger"
  else
  puts 'Same size'
@@ -389,6 +393,14 @@ module LlmDocsBuilder
  end
  end
 
+ # Format number with comma separators for readability
+ #
+ # @param number [Integer] number to format
+ # @return [String] formatted number with commas
+ def format_number(number)
+ number.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
+ end
+
  # Validate llms.txt file format
  #
  # Checks if llms.txt file follows proper format with title, description, and documentation links.
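For reference, the new `format_number` helper works by reversing the digit string, inserting a comma after every third digit that is still followed by another digit, and reversing back:

```ruby
def format_number(number)
  number.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\1,').reverse
end

format_number(1_025)       # => "1,025"
format_number(19_475)      # => "19,475"
format_number(206_250_000) # => "206,250,000"
```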
@@ -62,6 +62,10 @@ module LlmDocsBuilder
  # - :reduction_bytes [Integer] bytes saved
  # - :reduction_percent [Integer] percentage reduction
  # - :factor [Float] compression factor
+ # - :human_tokens [Integer] estimated tokens in human version
+ # - :ai_tokens [Integer] estimated tokens in AI version
+ # - :token_reduction [Integer] estimated tokens saved
+ # - :token_reduction_percent [Integer] percentage of tokens saved
  # - :human_source [String] source description (URL or file)
  # - :ai_source [String] source description (URL or file)
  def compare
@@ -85,8 +89,8 @@ module LlmDocsBuilder
  ai_content = fetch_url(url, options[:ai_user_agent])
 
  calculate_results(
- human_content.bytesize,
- ai_content.bytesize,
+ human_content,
+ ai_content,
  "#{url} (User-Agent: human)",
  "#{url} (User-Agent: AI)"
  )
@@ -112,8 +116,8 @@ module LlmDocsBuilder
  ai_content = File.read(local_file)
 
  calculate_results(
- human_content.bytesize,
- ai_content.bytesize,
+ human_content,
+ ai_content,
  url,
  local_file
  )
@@ -205,12 +209,15 @@ module LlmDocsBuilder
 
  # Calculate comparison statistics
  #
- # @param human_size [Integer] size of human version in bytes
- # @param ai_size [Integer] size of AI version in bytes
+ # @param human_content [String] content of human version
+ # @param ai_content [String] content of AI version
  # @param human_source [String] description of human source
  # @param ai_source [String] description of AI source
  # @return [Hash] comparison results
- def calculate_results(human_size, ai_size, human_source, ai_source)
+ def calculate_results(human_content, ai_content, human_source, ai_source)
+ human_size = human_content.bytesize
+ ai_size = ai_content.bytesize
+
  reduction_bytes = human_size - ai_size
  reduction_percent = if human_size.positive?
  ((reduction_bytes.to_f / human_size) * 100).round
@@ -224,15 +231,43 @@ module LlmDocsBuilder
  Float::INFINITY
  end
 
+ # Estimate tokens
+ human_tokens = estimate_tokens(human_content)
+ ai_tokens = estimate_tokens(ai_content)
+ token_reduction = human_tokens - ai_tokens
+ token_reduction_percent = if human_tokens.positive?
+ ((token_reduction.to_f / human_tokens) * 100).round
+ else
+ 0
+ end
+
  {
  human_size: human_size,
  ai_size: ai_size,
  reduction_bytes: reduction_bytes,
  reduction_percent: reduction_percent,
  factor: factor,
+ human_tokens: human_tokens,
+ ai_tokens: ai_tokens,
+ token_reduction: token_reduction,
+ token_reduction_percent: token_reduction_percent,
  human_source: human_source,
  ai_source: ai_source
  }
  end
+
+ # Estimate token count using character-based approximation
+ #
+ # Uses the common heuristic that ~4 characters equals 1 token for English text.
+ # This provides reasonable estimates for documentation content without requiring
+ # external tokenizer dependencies.
+ #
+ # @param content [String] text content to estimate tokens for
+ # @return [Integer] estimated number of tokens
+ def estimate_tokens(content)
+ # Use 4 characters per token as a reasonable approximation
+ # This is a common heuristic for English text and works well for documentation
+ (content.length / 4.0).round
+ end
  end
  end
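With the signature change from sizes to full content, `calculate_results` derives both byte and token statistics from the same inputs. An illustrative call with hypothetical in-memory strings (the `factor` value assumes factor = human size / AI size, matching the "Nx smaller" output; the exact computation is outside this hunk):

```ruby
# Hypothetical inputs: 8,000 bytes of HTML vs 2,000 bytes of markdown.
human = 'a' * 8_000
ai    = 'a' * 2_000

# calculate_results(human, ai, 'page (human)', 'page (AI)') would return:
# {
#   human_size: 8000, ai_size: 2000,
#   reduction_bytes: 6000, reduction_percent: 75,
#   factor: 4.0,                        # assumed: 8000 / 2000
#   human_tokens: 2000, ai_tokens: 500, # 4 chars per token
#   token_reduction: 1500, token_reduction_percent: 75,
#   human_source: 'page (human)', ai_source: 'page (AI)'
# }
```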
@@ -67,6 +67,26 @@ module LlmDocsBuilder
  else
  self['convert_urls'] || false
  end,
+ remove_comments: if options.key?(:remove_comments)
+ options[:remove_comments]
+ else
+ self['remove_comments'] || false
+ end,
+ normalize_whitespace: if options.key?(:normalize_whitespace)
+ options[:normalize_whitespace]
+ else
+ self['normalize_whitespace'] || false
+ end,
+ remove_badges: if options.key?(:remove_badges)
+ options[:remove_badges]
+ else
+ self['remove_badges'] || false
+ end,
+ remove_frontmatter: if options.key?(:remove_frontmatter)
+ options[:remove_frontmatter]
+ else
+ self['remove_frontmatter'] || false
+ end,
  verbose: options.key?(:verbose) ? options[:verbose] : (self['verbose'] || false),
  # Bulk transformation options
  suffix: options[:suffix] || self['suffix'] || '.llm',
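Each new key follows the same precedence rule as `convert_urls`: an explicitly passed option wins even when it is `false`, then the config-file value, then `false`. Inside this config class, the four `if/else` blocks are equivalent to a small helper along these lines (a sketch, not the shipped code):

```ruby
# Hypothetical condensed form of the flag resolution above.
def resolve_flag(key, options)
  return options[key] if options.key?(key) # explicit option wins, even if false
  self[key.to_s] || false                  # otherwise config file value, else false
end

# remove_comments: resolve_flag(:remove_comments, options),
# remove_badges:   resolve_flag(:remove_badges, options),
# ...
```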
@@ -27,6 +27,10 @@ module LlmDocsBuilder
  # @param options [Hash] transformation options
  # @option options [String] :base_url base URL for expanding relative links
  # @option options [Boolean] :convert_urls convert HTML URLs to markdown format
+ # @option options [Boolean] :remove_comments remove HTML comments from markdown
+ # @option options [Boolean] :normalize_whitespace normalize excessive whitespace
+ # @option options [Boolean] :remove_badges remove badge/shield images
+ # @option options [Boolean] :remove_frontmatter remove YAML/TOML frontmatter
  def initialize(file_path, options = {})
  @file_path = file_path
  @options = options
@@ -35,16 +39,31 @@ module LlmDocsBuilder
  # Transform markdown content to be AI-friendly
  #
  # Applies transformations to make the markdown more suitable for LLM processing:
+ # - Removes YAML/TOML frontmatter (if remove_frontmatter enabled)
  # - Expands relative links to absolute URLs (if base_url provided)
  # - Converts HTML URLs to markdown format (if convert_urls enabled)
+ # - Removes HTML comments (if remove_comments enabled)
+ # - Removes badge/shield images (if remove_badges enabled)
+ # - Normalizes excessive whitespace (if normalize_whitespace enabled)
  #
  # @return [String] transformed markdown content
  def transform
  content = File.read(file_path)
 
+ # Remove frontmatter first (before any other processing)
+ content = remove_frontmatter(content) if options[:remove_frontmatter]
+
+ # Link transformations
  content = expand_relative_links(content) if options[:base_url]
  content = convert_html_urls(content) if options[:convert_urls]
 
+ # Content cleanup
+ content = remove_comments(content) if options[:remove_comments]
+ content = remove_badges(content) if options[:remove_badges]
+
+ # Whitespace normalization last (after all other transformations)
+ content = normalize_whitespace(content) if options[:normalize_whitespace]
+
  content
  end
 
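The ordering in `transform` is deliberate: frontmatter is stripped first so later passes never see it, and whitespace normalization runs last so it can collapse the blank lines the other passes leave behind. Exercising the full pipeline through the public API shown in the README (require path inferred from the gem name, file path hypothetical):

```ruby
require 'llm_docs_builder' # require path assumed from the gem name

# Runs every new transformation flag end-to-end on one file.
transformed = LlmDocsBuilder.transform_markdown(
  'docs/getting-started.md',        # hypothetical input path
  base_url: 'https://myproject.io',
  convert_urls: true,
  remove_frontmatter: true,         # applied first
  remove_comments: true,
  remove_badges: true,
  normalize_whitespace: true        # applied last
)
puts transformed
```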
@@ -86,5 +105,89 @@ module LlmDocsBuilder
  url.sub(/\.html?$/, '.md')
  end
  end
+
+ # Remove HTML comments from markdown content
+ #
+ # Strips out HTML comments (<!-- ... -->) which are typically metadata for developers
+ # and not relevant for LLM consumption. This reduces token usage and improves clarity.
+ #
+ # Handles:
+ # - Single-line comments: <!-- comment -->
+ # - Multi-line comments spanning multiple lines
+ # - Multiple comments in the same content
+ #
+ # @param content [String] markdown content to process
+ # @return [String] content with comments removed
+ def remove_comments(content)
+ # Remove HTML comments (single and multi-line)
+ # The .*? makes it non-greedy so it stops at the first -->
+ content.gsub(/<!--.*?-->/m, '')
+ end
+
+ # Remove badge and shield images from markdown
+ #
+ # Strips out badge/shield images (typically from shields.io, badge.fury.io, etc.)
+ # which are visual indicators for humans but provide no value to LLMs.
+ #
+ # Recognizes common patterns:
+ # - [![Badge](badge.svg)](link) (linked badges)
+ # - ![Badge](badge.svg) (unlinked badges)
+ # - Common badge domains: shields.io, badge.fury.io, travis-ci.org, etc.
+ #
+ # @param content [String] markdown content to process
+ # @return [String] content with badges removed
+ def remove_badges(content)
+ # Remove linked badges: [![...](badge-url)](link-url)
+ content = content.gsub(/\[\!\[([^\]]*)\]\([^\)]*(?:badge|shield|svg|travis|coveralls|fury)[^\)]*\)\]\([^\)]*\)/i, '')
+
+ # Remove standalone badges: ![...](badge-url)
+ content = content.gsub(/!\[([^\]]*)\]\([^\)]*(?:badge|shield|svg|travis|coveralls|fury)[^\)]*\)/i, '')
+
+ content
+ end
+
+ # Remove YAML or TOML frontmatter from markdown
+ #
+ # Strips out frontmatter blocks which are metadata used by static site generators
+ # (Jekyll, Hugo, etc.) but not relevant for LLM consumption.
+ #
+ # Recognizes:
+ # - YAML frontmatter: --- ... ---
+ # - TOML frontmatter: +++ ... +++
+ #
+ # @param content [String] markdown content to process
+ # @return [String] content with frontmatter removed
+ def remove_frontmatter(content)
+ # Remove YAML frontmatter (--- ... ---)
+ content = content.sub(/\A---\s*$.*?^---\s*$/m, '')
+
+ # Remove TOML frontmatter (+++ ... +++)
+ content = content.sub(/\A\+\+\+\s*$.*?^\+\+\+\s*$/m, '')
+
+ content
+ end
+
+ # Normalize excessive whitespace in markdown
+ #
+ # Reduces excessive blank lines and trailing whitespace to make content more compact
+ # for LLM consumption without affecting readability.
+ #
+ # Transformations:
+ # - Multiple consecutive blank lines (3+) → 2 blank lines max
+ # - Trailing whitespace on lines → removed
+ # - Leading/trailing whitespace in file → trimmed
+ #
+ # @param content [String] markdown content to process
+ # @return [String] content with normalized whitespace
+ def normalize_whitespace(content)
+ # Remove trailing whitespace from each line
+ content = content.gsub(/ +$/, '')
+
+ # Reduce multiple consecutive blank lines to maximum of 2
+ content = content.gsub(/\n{4,}/, "\n\n\n")
+
+ # Trim leading and trailing whitespace from the entire content
+ content.strip
+ end
  end
  end
@@ -2,5 +2,5 @@
 
  module LlmDocsBuilder
  # Current version of the LlmDocsBuilder gem
- VERSION = '0.3.0'
+ VERSION = '0.6.0'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: llm-docs-builder
  version: !ruby/object:Gem::Version
- version: 0.3.0
+ version: 0.6.0
  platform: ruby
  authors:
  - Maciej Mensfeld