llm-docs-builder 0.3.0 → 0.6.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.github/workflows/docker.yml +6 -6
- data/CHANGELOG.md +1 -1
- data/Gemfile.lock +1 -1
- data/README.md +81 -37
- data/lib/llm_docs_builder/cli.rb +14 -2
- data/lib/llm_docs_builder/comparator.rb +42 -7
- data/lib/llm_docs_builder/config.rb +20 -0
- data/lib/llm_docs_builder/markdown_transformer.rb +103 -0
- data/lib/llm_docs_builder/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5f8bf86d722ce6f0727ada59374b507a0f8ce6d240469649ca241ff2c35b2a40
+  data.tar.gz: 745f6669e002326127d72252704b956d9c385076115abb7be8be807a476e7449
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0ebb2ae3a082748f5b407cafdc90ea36a9b47b1852fd6e05f5370eaa8e671f6160bee17c33d5773232b2b81365aaee87e29a8b2b2d863c9a9fd9b0a96e14da58
+  data.tar.gz: d6a2c67eca5455a5fb19962f44f0b1d5e6ebf424a1f3ba268e0c97b351f3f9321568d68fc16496e2383af9023f14f8ffdace517b79056e69c6e7a4a6dce5a4cf
data/.github/workflows/docker.yml
CHANGED

@@ -36,7 +36,7 @@ jobs:
       - name: Docker meta
         id: meta
-        uses: docker/metadata-action@v5
+        uses: docker/metadata-action@c1e51972afc2121e065aed6d45c65596fe445f3f # v5
         with:
           images: |
             mensfeld/llm-docs-builder
@@ -50,28 +50,28 @@ jobs:
             type=raw,value=latest,enable={{is_default_branch}}

       - name: Set up QEMU
-        uses: docker/setup-qemu-action@v3
+        uses: docker/setup-qemu-action@29109295f81e9208d7d86ff1c6c12d2833863392 # v3

       - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3
+        uses: docker/setup-buildx-action@e468171a9de216ec08956ac3ada2f0791b6bd435 # v3

       - name: Login to Docker Hub
         if: github.event_name != 'pull_request'
-        uses: docker/login-action@v3
+        uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef # v3
         with:
           username: ${{ secrets.DOCKERHUB_USERNAME }}
           password: ${{ secrets.DOCKERHUB_TOKEN }}

       - name: Login to GitHub Container Registry
         if: github.event_name != 'pull_request'
-        uses: docker/login-action@v3
+        uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef # v3
         with:
           registry: ghcr.io
           username: ${{ github.actor }}
           password: ${{ secrets.GITHUB_TOKEN }}

       - name: Build and push
-        uses: docker/build-push-action@
+        uses: docker/build-push-action@263435318d21b8e681c14492fe198d362a7d2c83 # v6
         with:
           context: .
           platforms: linux/amd64,linux/arm64
data/CHANGELOG.md
CHANGED

@@ -1,6 +1,6 @@
 # Changelog

-##
+## 0.6.0 (2025-10-09)
 - [Breaking] **Project renamed from `llms-txt-ruby` to `llm-docs-builder`** to better reflect expanded functionality beyond just llms.txt generation.
   - Gem name: `llms-txt-ruby` → `llm-docs-builder`
   - Module name: `LlmsTxt` → `LlmDocsBuilder`
data/Gemfile.lock
CHANGED
data/README.md
CHANGED

@@ -12,9 +12,9 @@ llm-docs-builder normalizes markdown documentation to be AI-friendly and generat
 When LLMs fetch documentation, they typically get HTML pages designed for humans - complete with navigation bars, footers, JavaScript, CSS, and other overhead. This wastes 70-90% of your context window on content that doesn't help answer questions.

 **Real example from Karafka documentation:**
-- Human HTML version: 82.0 KB
-- AI markdown version: 4.1 KB
-- **Result: 95% reduction, 20x smaller**
+- Human HTML version: 82.0 KB (~20,500 tokens)
+- AI markdown version: 4.1 KB (~1,025 tokens)
+- **Result: 95% reduction, 19,475 tokens saved, 20x smaller**

 With GPT-4's pricing at $2.50 per million input tokens, that's real money saved on every API call. More importantly, you can fit 30x more actual documentation into the same context window.
@@ -48,14 +48,15 @@ docker run mensfeld/llm-docs-builder compare \
 Context Window Comparison
 ============================================================

-Human version: 45.2 KB
+Human version: 45.2 KB (~11,300 tokens)
   Source: https://yoursite.com/docs/page.html (User-Agent: human)

-AI version: 12.8 KB
+AI version: 12.8 KB (~3,200 tokens)
   Source: https://yoursite.com/docs/page.html (User-Agent: AI)

 ------------------------------------------------------------
 Reduction: 32.4 KB (72%)
+Token savings: 8,100 tokens (72%)
 Factor: 3.5x smaller
 ============================================================
 ```
@@ -66,27 +67,27 @@ This single command shows you the potential ROI before you invest any time in op

 **[Karafka Framework Documentation](https://karafka.io/docs)** (10 pages analyzed):

-| Page | Human HTML | AI Markdown | Reduction | Factor |
-|------|-----------|-------------|-----------|--------|
-| Getting Started | 82.0 KB | 4.1 KB | 95% | 20.1x |
-| Configuration | 86.3 KB | 7.1 KB | 92% | 12.1x |
-| Routing | 93.6 KB | 14.7 KB | 84% | 6.4x |
-| Deployment | 122.1 KB | 33.3 KB | 73% | 3.7x |
-| Producing Messages | 87.7 KB | 8.3 KB | 91% | 10.6x |
-| Consuming Messages | 105.3 KB | 21.3 KB | 80% | 4.9x |
-| Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | 5.1x |
-| Active Job | 88.7 KB | 8.8 KB | 90% | 10.1x |
-| Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | 3.7x |
-| Error Handling | 93.8 KB | 13.1 KB | 86% | 7.2x |
-
-**Average: 83% reduction, 8.4x smaller files**
+| Page | Human HTML | AI Markdown | Reduction | Tokens Saved | Factor |
+|------|-----------|-------------|-----------|--------------|---------|
+| Getting Started | 82.0 KB | 4.1 KB | 95% | ~19,475 | 20.1x |
+| Configuration | 86.3 KB | 7.1 KB | 92% | ~19,800 | 12.1x |
+| Routing | 93.6 KB | 14.7 KB | 84% | ~19,725 | 6.4x |
+| Deployment | 122.1 KB | 33.3 KB | 73% | ~22,200 | 3.7x |
+| Producing Messages | 87.7 KB | 8.3 KB | 91% | ~19,850 | 10.6x |
+| Consuming Messages | 105.3 KB | 21.3 KB | 80% | ~21,000 | 4.9x |
+| Web UI Getting Started | 109.3 KB | 21.5 KB | 80% | ~21,950 | 5.1x |
+| Active Job | 88.7 KB | 8.8 KB | 90% | ~19,975 | 10.1x |
+| Monitoring and Logging | 120.7 KB | 32.5 KB | 73% | ~22,050 | 3.7x |
+| Error Handling | 93.8 KB | 13.1 KB | 86% | ~20,175 | 7.2x |
+
+**Average: 83% reduction, ~20,620 tokens saved per page, 8.4x smaller files**

 For a typical RAG system making 1,000 documentation queries per day:
-- **Before**: ~990 KB per day × 1,000 queries = ~
-- **After**: ~165 KB per day × 1,000 queries = ~
-- **Savings**: 83% reduction
+- **Before**: ~990 KB per day (~247,500 tokens) × 1,000 queries = ~247.5M tokens/day
+- **After**: ~165 KB per day (~41,250 tokens) × 1,000 queries = ~41.25M tokens/day
+- **Savings**: 83% reduction = ~206.25M tokens saved per day

-At GPT-4 pricing ($2.50/M input tokens), that's approximately **$
+At GPT-4 pricing ($2.50/M input tokens), that's approximately **$500/day or $183,000/year saved** on a documentation site with moderate traffic.

 ## Installation

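The daily-savings arithmetic above follows directly from the ~4-characters-per-token heuristic this README describes in its "How It Works" section. A short Ruby sketch (illustrative, not part of the gem) reproduces the figures:

```ruby
# Convert kilobytes of text into an approximate token count (~4 chars/token).
def tokens_for_kb(kilobytes)
  (kilobytes * 1_000 / 4.0).round
end

before_tokens = tokens_for_kb(990) # ~247,500 tokens per query
after_tokens  = tokens_for_kb(165) # ~41,250 tokens per query

queries_per_day = 1_000
tokens_saved    = (before_tokens - after_tokens) * queries_per_day # 206,250,000/day

# GPT-4 input pricing: $2.50 per million tokens
dollars_saved = tokens_saved / 1_000_000.0 * 2.50 # ≈ $515.6/day
```

The exact product is ≈$515.6/day, which the README rounds to ~$500/day; the $183,000/year figure is ~$500 × 365.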
@@ -165,8 +166,12 @@ llm-docs-builder transform \
 # llm-docs-builder.yml
 docs: ./docs
 base_url: https://myproject.io
-suffix: .llm
-convert_urls: true
+suffix: .llm # Creates README.llm.md alongside README.md
+convert_urls: true # .html → .md
+remove_comments: true # Remove HTML comments
+remove_badges: true # Remove badge/shield images
+remove_frontmatter: true # Remove YAML/TOML frontmatter
+normalize_whitespace: true # Clean up excessive blank lines
 ```

 ```bash
@@ -187,8 +192,12 @@ docs/
 # llm-docs-builder.yml
 docs: ./docs
 base_url: https://myproject.io
-suffix: ""
-convert_urls: true
+suffix: "" # Transforms in-place
+convert_urls: true # Convert .html to .md
+remove_comments: true # Remove HTML comments
+remove_badges: true # Remove badge/shield images
+remove_frontmatter: true # Remove YAML/TOML frontmatter
+normalize_whitespace: true # Clean up excessive blank lines
 excludes:
   - "**/private/**"
 ```
@@ -200,10 +209,14 @@ llm-docs-builder bulk-transform --config llm-docs-builder.yml
 Perfect for CI/CD where you transform docs before deployment.

 **What gets normalized:**
-- Relative
-- HTML
+- **Links**: Relative → Absolute URLs (`./api.md` → `https://yoursite.com/api.md`)
+- **URLs**: HTML → Markdown format (`.html` → `.md`)
+- **Comments**: HTML comments removed (`<!-- ... -->`)
+- **Badges**: Shield/badge images removed (CI badges, version badges, etc.)
+- **Frontmatter**: YAML/TOML metadata removed (Jekyll, Hugo, etc.)
+- **Whitespace**: Excessive blank lines reduced (3+ → 2 max)
 - Clean markdown structure preserved
-- No content modification, just
+- No content modification, just intelligent cleanup

 ### 3. Generate llms.txt (The Standard)
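The first two normalizations are plain string rewrites; this standalone sketch approximates them (the helper names are illustrative — the gem's real logic lives in `MarkdownTransformer`, and its `.html → .md` rewrite is the same `sub` shown in the markdown_transformer.rb section of this diff):

```ruby
# Rewrite an .html/.htm link target to .md
def html_to_md(url)
  url.sub(/\.html?$/, '.md')
end

# Expand a relative link against a base URL (simplified; the gem handles more cases)
def expand_relative(link, base_url)
  return link if link.start_with?('http')
  File.join(base_url, link.sub(%r{\A\./}, ''))
end

html_to_md('guide.html')                            # => "guide.md"
expand_relative('./api.md', 'https://yoursite.com') # => "https://yoursite.com/api.md"
```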
@@ -298,6 +311,10 @@ title: My Project
 description: Brief description
 output: llms.txt
 convert_urls: true
+remove_comments: true
+remove_badges: true
+remove_frontmatter: true
+normalize_whitespace: true
 suffix: .llm
 verbose: false
 excludes:
@@ -369,20 +386,32 @@ content = LlmDocsBuilder.generate_from_docs('./docs',
 # Transform markdown
 transformed = LlmDocsBuilder.transform_markdown('README.md',
   base_url: 'https://myproject.io',
-  convert_urls: true
+  convert_urls: true,
+  remove_comments: true,
+  remove_badges: true,
+  remove_frontmatter: true,
+  normalize_whitespace: true
 )

 # Bulk transform
 files = LlmDocsBuilder.bulk_transform('./docs',
   base_url: 'https://myproject.io',
   suffix: '.llm',
+  remove_comments: true,
+  remove_badges: true,
+  remove_frontmatter: true,
+  normalize_whitespace: true,
   excludes: ['**/private/**']
 )

 # In-place transformation
 files = LlmDocsBuilder.bulk_transform('./docs',
   suffix: '', # Empty for in-place
-  base_url: 'https://myproject.io'
+  base_url: 'https://myproject.io',
+  remove_comments: true,
+  remove_badges: true,
+  remove_frontmatter: true,
+  normalize_whitespace: true
 )
 ```
@@ -401,6 +430,10 @@ The [Karafka framework](https://github.com/karafka/karafka) processes millions o
 docs: ./online/docs
 base_url: https://karafka.io/docs
 convert_urls: true
+remove_comments: true
+remove_badges: true
+remove_frontmatter: true
+normalize_whitespace: true
 suffix: "" # In-place transformation for build pipeline
 excludes:
   - "**/Enterprise-License-Setup/**"
@@ -474,6 +507,10 @@ By detecting AI bots and serving them clean markdown instead of HTML, you sidest
 | `description` | String | Auto-detected | Project description |
 | `output` | String | `llms.txt` | Output filename for llms.txt generation |
 | `convert_urls` | Boolean | `false` | Convert `.html`/`.htm` to `.md` |
+| `remove_comments` | Boolean | `false` | Remove HTML comments (`<!-- ... -->`) |
+| `remove_badges` | Boolean | `false` | Remove badge/shield images (CI, version, etc.) |
+| `remove_frontmatter` | Boolean | `false` | Remove YAML/TOML frontmatter (Jekyll, Hugo) |
+| `normalize_whitespace` | Boolean | `false` | Normalize excessive blank lines and trailing spaces |
 | `suffix` | String | `.llm` | Suffix for transformed files (use `""` for in-place) |
 | `excludes` | Array | `[]` | Glob patterns to exclude |
 | `verbose` | Boolean | `false` | Enable detailed output |
@@ -629,16 +666,23 @@ The llms.txt file serves as an efficient entry point, but the real token savings
 4. Build formatted llms.txt with links and descriptions

 **Transformation Process:**
-1.
-2.
-3.
-4.
+1. Remove frontmatter (YAML/TOML metadata)
+2. Expand relative links to absolute URLs
+3. Convert `.html` URLs to `.md`
+4. Remove HTML comments
+5. Remove badge/shield images
+6. Normalize excessive whitespace
+7. Write to new file or overwrite in-place

 **Comparison Process:**
 1. Fetch URL with human User-Agent (or read local file)
 2. Fetch same URL with AI bot User-Agent
 3. Calculate size difference and reduction percentage
-4.
+4. Estimate token counts using character-based heuristic
+5. Display human-readable comparison results with byte and token savings
+
+**Token Estimation:**
+The tool uses a simple but effective heuristic for estimating token counts: **~4 characters per token**. This approximation works well for English documentation and provides reasonable estimates without requiring external tokenizer dependencies. While not as precise as OpenAI's tiktoken, it's accurate enough (±10-15%) for understanding context window savings and making optimization decisions.

 ## FAQ

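The heuristic is a single line of Ruby; this sketch mirrors the `estimate_tokens` method this release adds to the comparator:

```ruby
# ~4 characters per token, rounded to the nearest whole token
def estimate_tokens(content)
  (content.length / 4.0).round
end

estimate_tokens('a' * 82_000) # => 20500  (82.0 KB of HTML ≈ 20,500 tokens)
estimate_tokens('a' * 4_100)  # => 1025   (4.1 KB of markdown ≈ 1,025 tokens)
```

These are exactly the figures quoted in the Karafka example earlier in the README.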
data/lib/llm_docs_builder/cli.rb
CHANGED

@@ -351,21 +351,25 @@ module LlmDocsBuilder
 puts 'Context Window Comparison'
 puts '=' * 60
 puts ''
-puts "Human version: #{format_bytes(result[:human_size])}"
+puts "Human version: #{format_bytes(result[:human_size])} (~#{format_number(result[:human_tokens])} tokens)"
 puts " Source: #{result[:human_source]}"
 puts ''
-puts "AI version: #{format_bytes(result[:ai_size])}"
+puts "AI version: #{format_bytes(result[:ai_size])} (~#{format_number(result[:ai_tokens])} tokens)"
 puts " Source: #{result[:ai_source]}"
 puts ''
 puts '-' * 60

 if result[:reduction_bytes].positive?
   puts "Reduction: #{format_bytes(result[:reduction_bytes])} (#{result[:reduction_percent]}%)"
+  puts "Token savings: #{format_number(result[:token_reduction])} tokens (#{result[:token_reduction_percent]}%)"
   puts "Factor: #{result[:factor]}x smaller"
 elsif result[:reduction_bytes].negative?
   increase_bytes = result[:reduction_bytes].abs
   increase_percent = result[:reduction_percent].abs
+  token_increase = result[:token_reduction].abs
+  token_increase_percent = result[:token_reduction_percent].abs
   puts "Increase: #{format_bytes(increase_bytes)} (#{increase_percent}%)"
+  puts "Token increase: #{format_number(token_increase)} tokens (#{token_increase_percent}%)"
   puts "Factor: #{result[:factor]}x larger"
 else
   puts 'Same size'
@@ -389,6 +393,14 @@ module LlmDocsBuilder
   end
 end

+# Format number with comma separators for readability
+#
+# @param number [Integer] number to format
+# @return [String] formatted number with commas
+def format_number(number)
+  number.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
+end
+
 # Validate llms.txt file format
 #
 # Checks if llms.txt file follows proper format with title, description, and documentation links.
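`format_number` works by reversing the digit string, inserting a comma after every group of three digits that is followed by another digit, then reversing back. A quick check of the pattern in isolation:

```ruby
# Same implementation as the new CLI helper
def format_number(number)
  number.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
end

format_number(1025)        # => "1,025"
format_number(206_250_000) # => "206,250,000"
format_number(42)          # => "42" (no comma for fewer than four digits)
```

The `(?=\d)` lookahead is what stops a stray leading comma: a three-digit group at the end of the reversed string (i.e. the front of the number) has no following digit, so it is left alone.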
data/lib/llm_docs_builder/comparator.rb
CHANGED

@@ -62,6 +62,10 @@ module LlmDocsBuilder
 # - :reduction_bytes [Integer] bytes saved
 # - :reduction_percent [Integer] percentage reduction
 # - :factor [Float] compression factor
+# - :human_tokens [Integer] estimated tokens in human version
+# - :ai_tokens [Integer] estimated tokens in AI version
+# - :token_reduction [Integer] estimated tokens saved
+# - :token_reduction_percent [Integer] percentage of tokens saved
 # - :human_source [String] source description (URL or file)
 # - :ai_source [String] source description (URL or file)
 def compare
@@ -85,8 +89,8 @@ module LlmDocsBuilder
 ai_content = fetch_url(url, options[:ai_user_agent])

 calculate_results(
-  human_content
-  ai_content
+  human_content,
+  ai_content,
   "#{url} (User-Agent: human)",
   "#{url} (User-Agent: AI)"
 )
@@ -112,8 +116,8 @@ module LlmDocsBuilder
 ai_content = File.read(local_file)

 calculate_results(
-  human_content
-  ai_content
+  human_content,
+  ai_content,
   url,
   local_file
 )
@@ -205,12 +209,15 @@ module LlmDocsBuilder

 # Calculate comparison statistics
 #
-# @param
-# @param
+# @param human_content [String] content of human version
+# @param ai_content [String] content of AI version
 # @param human_source [String] description of human source
 # @param ai_source [String] description of AI source
 # @return [Hash] comparison results
-def calculate_results(
+def calculate_results(human_content, ai_content, human_source, ai_source)
+  human_size = human_content.bytesize
+  ai_size = ai_content.bytesize
+
   reduction_bytes = human_size - ai_size
   reduction_percent = if human_size.positive?
     ((reduction_bytes.to_f / human_size) * 100).round
@@ -224,15 +231,43 @@ module LlmDocsBuilder
     Float::INFINITY
   end

+  # Estimate tokens
+  human_tokens = estimate_tokens(human_content)
+  ai_tokens = estimate_tokens(ai_content)
+  token_reduction = human_tokens - ai_tokens
+  token_reduction_percent = if human_tokens.positive?
+    ((token_reduction.to_f / human_tokens) * 100).round
+  else
+    0
+  end
+
   {
     human_size: human_size,
     ai_size: ai_size,
     reduction_bytes: reduction_bytes,
     reduction_percent: reduction_percent,
     factor: factor,
+    human_tokens: human_tokens,
+    ai_tokens: ai_tokens,
+    token_reduction: token_reduction,
+    token_reduction_percent: token_reduction_percent,
     human_source: human_source,
     ai_source: ai_source
   }
 end
+
+# Estimate token count using character-based approximation
+#
+# Uses the common heuristic that ~4 characters equals 1 token for English text.
+# This provides reasonable estimates for documentation content without requiring
+# external tokenizer dependencies.
+#
+# @param content [String] text content to estimate tokens for
+# @return [Integer] estimated number of tokens
+def estimate_tokens(content)
+  # Use 4 characters per token as a reasonable approximation
+  # This is a common heuristic for English text and works well for documentation
+  (content.length / 4.0).round
+end
 end
 end
data/lib/llm_docs_builder/config.rb
CHANGED

@@ -67,6 +67,26 @@ module LlmDocsBuilder
 else
   self['convert_urls'] || false
 end,
+remove_comments: if options.key?(:remove_comments)
+  options[:remove_comments]
+else
+  self['remove_comments'] || false
+end,
+normalize_whitespace: if options.key?(:normalize_whitespace)
+  options[:normalize_whitespace]
+else
+  self['normalize_whitespace'] || false
+end,
+remove_badges: if options.key?(:remove_badges)
+  options[:remove_badges]
+else
+  self['remove_badges'] || false
+end,
+remove_frontmatter: if options.key?(:remove_frontmatter)
+  options[:remove_frontmatter]
+else
+  self['remove_frontmatter'] || false
+end,
 verbose: options.key?(:verbose) ? options[:verbose] : (self['verbose'] || false),
 # Bulk transformation options
 suffix: options[:suffix] || self['suffix'] || '.llm',
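The verbose `if options.key?` form exists so that an explicit `false` passed programmatically still overrides a `true` in the YAML config; a plain `options[:x] || self['x']` would silently fall through to the config value. A minimal illustration with a hypothetical `resolve` helper (not part of the gem):

```ruby
# Resolve one boolean flag: explicit option wins, then config file, then false.
def resolve(options, config, key)
  return options[key] if options.key?(key)
  config[key.to_s] || false
end

config = { 'remove_badges' => true }

resolve({}, config, :remove_badges)                       # => true  (config value used)
resolve({ remove_badges: false }, config, :remove_badges) # => false (explicit override)
resolve({}, {}, :remove_badges)                           # => false (default)
```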
data/lib/llm_docs_builder/markdown_transformer.rb
CHANGED

@@ -27,6 +27,10 @@ module LlmDocsBuilder
 # @param options [Hash] transformation options
 # @option options [String] :base_url base URL for expanding relative links
 # @option options [Boolean] :convert_urls convert HTML URLs to markdown format
+# @option options [Boolean] :remove_comments remove HTML comments from markdown
+# @option options [Boolean] :normalize_whitespace normalize excessive whitespace
+# @option options [Boolean] :remove_badges remove badge/shield images
+# @option options [Boolean] :remove_frontmatter remove YAML/TOML frontmatter
 def initialize(file_path, options = {})
   @file_path = file_path
   @options = options
@@ -35,16 +39,31 @@ module LlmDocsBuilder
 # Transform markdown content to be AI-friendly
 #
 # Applies transformations to make the markdown more suitable for LLM processing:
+# - Removes YAML/TOML frontmatter (if remove_frontmatter enabled)
 # - Expands relative links to absolute URLs (if base_url provided)
 # - Converts HTML URLs to markdown format (if convert_urls enabled)
+# - Removes HTML comments (if remove_comments enabled)
+# - Removes badge/shield images (if remove_badges enabled)
+# - Normalizes excessive whitespace (if normalize_whitespace enabled)
 #
 # @return [String] transformed markdown content
 def transform
   content = File.read(file_path)

+  # Remove frontmatter first (before any other processing)
+  content = remove_frontmatter(content) if options[:remove_frontmatter]
+
+  # Link transformations
   content = expand_relative_links(content) if options[:base_url]
   content = convert_html_urls(content) if options[:convert_urls]

+  # Content cleanup
+  content = remove_comments(content) if options[:remove_comments]
+  content = remove_badges(content) if options[:remove_badges]
+
+  # Whitespace normalization last (after all other transformations)
+  content = normalize_whitespace(content) if options[:normalize_whitespace]
+
   content
 end

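The ordering in `transform` matters: frontmatter is stripped first so later regexes never see the metadata block, and whitespace normalization runs last so it can collapse blank lines left behind by comment and badge removal. A condensed, standalone sketch of that pipeline (approximating the private methods shown in this diff):

```ruby
def transform(content, options = {})
  # 1. Frontmatter first, so later steps never see the metadata block
  content = content.sub(/\A---\s*$.*?^---\s*$/m, '') if options[:remove_frontmatter]
  # 2. Content cleanup (non-greedy so each comment stops at its own -->)
  content = content.gsub(/<!--.*?-->/m, '') if options[:remove_comments]
  # 3. Whitespace last, collapsing blank lines the steps above left behind
  if options[:normalize_whitespace]
    content = content.gsub(/ +$/, '').gsub(/\n{4,}/, "\n\n\n").strip
  end
  content
end

input = "---\ntitle: Doc\n---\n\n# Hello\n<!-- internal note -->\n\n\n\n\nBody\n"
transform(input, remove_frontmatter: true, remove_comments: true,
                 normalize_whitespace: true)
# => "# Hello\n\n\nBody"
```

Running whitespace normalization first instead would miss the run of blank lines created when the comment between "Hello" and "Body" is deleted.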
@@ -86,5 +105,89 @@ module LlmDocsBuilder
   url.sub(/\.html?$/, '.md')
 end
 end
+
+# Remove HTML comments from markdown content
+#
+# Strips out HTML comments (<!-- ... -->) which are typically metadata for developers
+# and not relevant for LLM consumption. This reduces token usage and improves clarity.
+#
+# Handles:
+# - Single-line comments: <!-- comment -->
+# - Multi-line comments spanning multiple lines
+# - Multiple comments in the same content
+#
+# @param content [String] markdown content to process
+# @return [String] content with comments removed
+def remove_comments(content)
+  # Remove HTML comments (single and multi-line)
+  # The .*? makes it non-greedy so it stops at the first -->
+  content.gsub(/<!--.*?-->/m, '')
+end
+
+# Remove badge and shield images from markdown
+#
+# Strips out badge/shield images (typically from shields.io, badge.fury.io, etc.)
+# which are visual indicators for humans but provide no value to LLMs.
+#
+# Recognizes common patterns:
+# - [![alt](badge-url)](link-url) (linked badges)
+# - ![alt](badge-url) (unlinked badges)
+# - Common badge domains: shields.io, badge.fury.io, travis-ci.org, etc.
+#
+# @param content [String] markdown content to process
+# @return [String] content with badges removed
+def remove_badges(content)
+  # Remove linked badges: [![alt](badge-url)](link-url)
+  content = content.gsub(/\[\!\[([^\]]*)\]\([^\)]*(?:badge|shield|svg|travis|coveralls|fury)[^\)]*\)\]\([^\)]*\)/i, '')
+
+  # Remove standalone badges: ![alt](badge-url)
+  content = content.gsub(/!\[([^\]]*)\]\([^\)]*(?:badge|shield|svg|travis|coveralls|fury)[^\)]*\)/i, '')
+
+  content
+end
+
+# Remove YAML or TOML frontmatter from markdown
+#
+# Strips out frontmatter blocks which are metadata used by static site generators
+# (Jekyll, Hugo, etc.) but not relevant for LLM consumption.
+#
+# Recognizes:
+# - YAML frontmatter: --- ... ---
+# - TOML frontmatter: +++ ... +++
+#
+# @param content [String] markdown content to process
+# @return [String] content with frontmatter removed
+def remove_frontmatter(content)
+  # Remove YAML frontmatter (--- ... ---)
+  content = content.sub(/\A---\s*$.*?^---\s*$/m, '')
+
+  # Remove TOML frontmatter (+++ ... +++)
+  content = content.sub(/\A\+\+\+\s*$.*?^\+\+\+\s*$/m, '')
+
+  content
+end
+
+# Normalize excessive whitespace in markdown
+#
+# Reduces excessive blank lines and trailing whitespace to make content more compact
+# for LLM consumption without affecting readability.
+#
+# Transformations:
+# - Multiple consecutive blank lines (3+) → 2 blank lines max
+# - Trailing whitespace on lines → removed
+# - Leading/trailing whitespace in file → trimmed
+#
+# @param content [String] markdown content to process
+# @return [String] content with normalized whitespace
+def normalize_whitespace(content)
+  # Remove trailing whitespace from each line
+  content = content.gsub(/ +$/, '')
+
+  # Reduce multiple consecutive blank lines to maximum of 2
+  content = content.gsub(/\n{4,}/, "\n\n\n")
+
+  # Trim leading and trailing whitespace from the entire content
+  content.strip
+end
 end
 end
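The badge regexes match on URL substrings (`badge`, `shield`, `svg`, `travis`, `coveralls`, `fury`) rather than parsing markdown structure, so images whose URLs contain none of those keywords survive. A standalone check (sample URLs are illustrative):

```ruby
# Same two substitutions as MarkdownTransformer#remove_badges
def remove_badges(content)
  # Linked badges: [![alt](badge-url)](link-url)
  content = content.gsub(/\[\!\[([^\]]*)\]\([^\)]*(?:badge|shield|svg|travis|coveralls|fury)[^\)]*\)\]\([^\)]*\)/i, '')
  # Standalone badges: ![alt](badge-url)
  content.gsub(/!\[([^\]]*)\]\([^\)]*(?:badge|shield|svg|travis|coveralls|fury)[^\)]*\)/i, '')
end

md = <<~MD
  [![Build](https://img.shields.io/badge/build-passing-green)](https://ci.example.com)
  ![Gem](https://badge.fury.io/rb/llm-docs-builder.svg)
  ![Diagram](https://example.com/architecture.png)
MD

remove_badges(md) # keeps only the non-badge architecture image
```

Note the trade-off: any ordinary image whose URL happens to contain one of the keywords (for example an inline `.svg` diagram) would also be stripped.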