RubyGems - llm-docs-builder - Versions diffs - 0.7.0 → 0.8.1 - Mend

llm-docs-builder 0.7.0 → 0.8.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

checksums.yaml +4 -4
data/.github/workflows/docker.yml +10 -15
data/.github/workflows/push.yml +1 -1
data/CHANGELOG.md +25 -0
data/Gemfile.lock +1 -1
data/README.md +93 -1
data/lib/llm_docs_builder/config.rb +33 -1
data/lib/llm_docs_builder/generator.rb +67 -8
data/lib/llm_docs_builder/markdown_transformer.rb +12 -3
data/lib/llm_docs_builder/transformers/heading_transformer.rb +72 -0
data/lib/llm_docs_builder/version.rb +1 -1
metadata +2 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: b8fed4a1c362dec44db4b8091bc81a7d1fc2c01e3c104993f74a810c048d5d02
-  data.tar.gz: 575cf59762de9438336d3c4127277f0ba89e6e540772427cdee9af0407b90983
+  metadata.gz: 8104ee730f22ce1c3ff1f13dea2e784e82b617ea5c20332185351922c1d75710
+  data.tar.gz: 4cc077637f04d658a35b97aee35aed031f27c44a59f87d961e64af4aed268f77
 SHA512:
-  metadata.gz: f8474aa5ed99b00c9b7d143a48dae066e7eb3d726fc9570d16b4bedae57e78e09bdf028bb679bf1d4f6c87ee94e690f48420ec8d67c5a22f9c97c5c4d8a064cd
-  data.tar.gz: 4ea7338aed98ff58d1173ceaf426bcd6d79949508e1ec0441652ef0202d58bf3f88d600f3284648d98210de33ead0201aac0de3fde3cfb9bc92ce4e5e3b05132
+  metadata.gz: 22c9b908b4473c8207027107f68c85fc7a8751d459a781ba67bcebc4034da5dcd8b0d748c1931b78b9e7d3ee891ac9f814350fb052930d1715fc38fb0e870eae
+  data.tar.gz: 6dea4a2987218ca92c9eb0162460870a15cdf24a0feaca559e31d734075d40be1da46186d437ccde47bf5cc0c9b2d2784180625fb0754926d0e1c7baf6708e86

data/.github/workflows/docker.yml CHANGED

@@ -4,30 +4,25 @@ concurrency:
   group: ${{ github.workflow }}-${{ github.ref }}
   cancel-in-progress: true
-# Temporarily disabled - only runs on manual trigger
 on:
+  push:
+    tags:
+      - 'v*'
   workflow_dispatch:
-# Automatic triggers disabled for now:
-#   push:
-#     branches:
-#       - master
-#     tags:
-#       - 'v*'
-#   pull_request:
-#     branches:
-#       - master
-#   schedule:
-#     # Rebuild weekly to get latest base image security updates
-#     - cron: '0 2 * * 0'
 permissions:
   contents: read
-  packages: write
 jobs:
   docker:
+    if: github.repository_owner == 'mensfeld'
     runs-on: ubuntu-latest
+    environment: deployment
+    permissions:
+      contents: read
+      packages: write
+      id-token: write
     steps:
       - name: Checkout
         uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5

data/.github/workflows/push.yml CHANGED

@@ -24,7 +24,7 @@ jobs:
           fetch-depth: 0
       - name: Set up Ruby
-        uses: ruby/setup-ruby@6797dcbb9a1889fd411d07e8aba7eded53fb8b48 # v1.264.0
+        uses: ruby/setup-ruby@ab177d40ee5483edb974554986f56b33477e21d0 # v1.265.0
         with:
           bundler-cache: false

data/CHANGELOG.md CHANGED

@@ -1,5 +1,30 @@
 # Changelog
+## 0.8.1 (2025-10-17)
+- [Enhancement] Ship the docker container.
+## 0.8.0 (2025-10-14)
+- [Feature] **RAG Enhancement: Heading Normalization** - Transform headings to include hierarchical context for better RAG retrieval.
+  - Adds parent context to H2-H6 headings (e.g., "Configuration / Consumer Settings / auto_offset_reset")
+  - Makes each section self-contained when documents are chunked
+  - Configurable separator (default: " / ")
+  - Enable with `normalize_headings: true`
+  - Perfect for vector databases and RAG systems
+- [Feature] **RAG Enhancement: Enhanced llms.txt Metadata** - Generate enriched llms.txt files with machine-readable metadata.
+  - Token counts per document (helps AI agents manage context windows)
+  - Last modified timestamps (helps prefer recent docs)
+  - Priority labels: high/medium/low (helps guide which docs to fetch first)
+  - Optional compression ratios (shows optimization effectiveness)
+  - Enable with `include_metadata: true`, `include_tokens: true`, `include_timestamps: true`, `include_priority: true`
+- [Enhancement] Added `HeadingTransformer` class with comprehensive heading hierarchy tracking.
+- [Enhancement] Added priority calculation in Generator (README=high, getting started=high, tutorials=medium, etc.).
+- [Enhancement] Updated `Config#merge_with_options` to support all new RAG options.
+- [Testing] Added 10 comprehensive tests for HeadingTransformer covering edge cases.
+- [Testing] All 303 tests passing with 96.94% line coverage and 85.59% branch coverage.
+- [Documentation] Added "RAG Enhancement Features" section to README with examples and use cases.
+- [Documentation] Added detailed implementation guide in RAG_FEATURES.md.
+- [Documentation] Added example RAG configuration in examples/rag-config.yml.
 ## 0.7.0 (2025-10-09)
 - [Feature] **Advanced Token Optimization** - Added 8 new compression options to reduce token consumption:
   - `remove_code_examples`: Remove code blocks and inline code

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    llm-docs-builder (0.7.0)
+    llm-docs-builder (0.8.1)
       zeitwerk (~> 2.6)
 GEM

data/README.md CHANGED

@@ -5,7 +5,7 @@
 **Optimize your documentation for LLMs and RAG systems. Reduce token consumption by 67-95%.**
-llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, and optimizes documents for LLM context windows.
+llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, optimizes documents for LLM context windows, and enhances documents for RAG retrieval with hierarchical heading context and metadata.
 ## The Problem
@@ -29,6 +29,25 @@ docker run mensfeld/llm-docs-builder compare \
   --url https://yoursite.com/docs/getting-started.html
 ```
+**Example output:**
+```
+============================================================
+Context Window Comparison
+============================================================
+Human version:  127.4 KB (~32,620 tokens)
+  Source: https://karafka.io/docs/Pro-Virtual-Partitions/ (User-Agent: human)
+AI version:     46.3 KB (~11,854 tokens)
+  Source: https://karafka.io/docs/Pro-Virtual-Partitions/ (User-Agent: AI)
+------------------------------------------------------------
+Reduction:      81.1 KB (64%)
+Token savings:  20,766 tokens (64%)
+Factor:         2.8x smaller
+============================================================
+```
 ### Transform Your Documentation
 ```bash
@@ -104,6 +123,15 @@ simplify_links: true
 generate_toc: true
 custom_instruction: "This documentation is optimized for AI consumption"
+# RAG enhancement options
+normalize_headings: true          # Add hierarchical context to headings
+heading_separator: " / "          # Separator for heading hierarchy
+include_metadata: true            # Enable enhanced llms.txt metadata
+include_tokens: true              # Include token counts in llms.txt
+include_timestamps: true          # Include update timestamps in llms.txt
+include_priority: true            # Include priority labels in llms.txt
+calculate_compression: false      # Calculate compression ratios (slower)
 # Exclusions
 excludes:
   - "**/private/**"
@@ -290,6 +318,70 @@ Use `llm-docs-builder compare` to measure before and after.
 **Q: What about private documentation?**
 Use the `excludes` option to skip sensitive files.
+## RAG Enhancement Features
+### Heading Normalization
+Transform headings to include hierarchical context, making each section self-contained for RAG retrieval:
+**Before:**
+```markdown
+# Configuration
+## Consumer Settings
+### auto_offset_reset
+Controls behavior when no offset exists...
+```
+**After (with `normalize_headings: true`):**
+```markdown
+# Configuration
+## Configuration / Consumer Settings
+### Configuration / Consumer Settings / auto_offset_reset
+Controls behavior when no offset exists...
+```
+**Why this matters for RAG:** When documents are chunked and retrieved independently, each section retains full context. An LLM seeing just the `auto_offset_reset` section knows it's about "Configuration / Consumer Settings / auto_offset_reset" not just generic "auto_offset_reset".
+```yaml
+# Enable in config
+normalize_headings: true
+heading_separator: " / "  # Customize separator (default: " / ")
+```
+### Enhanced llms.txt Metadata
+Generate enriched llms.txt files with token counts, timestamps, and priority labels to help AI agents make better decisions:
+**Standard llms.txt:**
+```markdown
+- [Getting Started](https://myproject.io/docs/Getting-Started.md)
+- [Configuration](https://myproject.io/docs/Configuration.md)
+```
+**Enhanced llms.txt (with metadata enabled):**
+```markdown
+- [Getting Started](https://myproject.io/docs/Getting-Started.md) tokens:450 updated:2025-10-13 priority:high
+- [Configuration](https://myproject.io/docs/Configuration.md) tokens:2800 updated:2025-10-12 priority:high
+- [Advanced Topics](https://myproject.io/docs/Advanced.md) tokens:5200 updated:2025-09-15 priority:medium
+```
+**Benefits:**
+- AI agents can see token counts → load multiple small docs vs one large doc
+- Timestamps help prefer recent documentation
+- Priority signals guide which docs to fetch first
+- Compression ratios show optimization effectiveness
+```yaml
+# Enable in config
+include_metadata: true      # Master switch
+include_tokens: true        # Show token counts
+include_timestamps: true    # Show last modified dates
+include_priority: true      # Show priority labels (high/medium/low)
+calculate_compression: true # Show compression ratios (slower, requires transformation)
+```
 ## Advanced Compression Options
 All compression features can be used individually for fine-grained control:
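
The enhanced llms.txt entries added to the README above carry `tokens:`, `updated:` and `priority:` fields at the end of each line. As a rough illustration of how a consumer might use them, here is a minimal sketch that parses such lines and picks documents that fit a token budget. `parse_entry`, `select_within_budget` and the budget value are hypothetical names chosen for this example; they are not part of llm-docs-builder.

```ruby
# Hypothetical consumer-side helpers; not part of llm-docs-builder.
ENTRY = /\A- \[(?<title>[^\]]+)\]\((?<url>[^)]+)\)/
PRIORITY_ORDER = { 'high' => 0, 'medium' => 1, 'low' => 2 }.freeze

# Parse one llms.txt entry line into a hash, or nil if it is not an entry.
def parse_entry(line)
  match = ENTRY.match(line)
  return nil unless match

  {
    title: match[:title],
    url: match[:url],
    tokens: line[/\btokens:(\d+)/, 1]&.to_i,
    updated: line[/\bupdated:(\d{4}-\d{2}-\d{2})/, 1],
    priority: line[/\bpriority:(high|medium|low)/, 1]
  }
end

# Pick entries in priority order until the token budget is used up.
# Entries without a token count are skipped because their cost is unknown.
def select_within_budget(entries, budget)
  remaining = budget
  ordered = entries
            .select { |entry| entry[:tokens] }
            .each_with_index
            .sort_by { |entry, index| [PRIORITY_ORDER.fetch(entry[:priority], 3), index] }
            .map(&:first)

  ordered.each_with_object([]) do |entry, picked|
    next if entry[:tokens] > remaining

    picked << entry
    remaining -= entry[:tokens]
  end
end

lines = [
  '- [Getting Started](https://myproject.io/docs/Getting-Started.md) tokens:450 updated:2025-10-13 priority:high',
  '- [Configuration](https://myproject.io/docs/Configuration.md) tokens:2800 updated:2025-10-12 priority:high',
  '- [Advanced Topics](https://myproject.io/docs/Advanced.md) tokens:5200 updated:2025-09-15 priority:medium'
]
entries = lines.map { |line| parse_entry(line) }.compact
picked = select_within_budget(entries, 6000)
puts picked.map { |entry| entry[:title] }.inspect
```

With a budget of 6000 tokens the two high-priority documents fit (450 + 2800) and the 5200-token document is left out. The original index is used as a tie-breaker because Ruby's `sort_by` is not guaranteed to be stable.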

data/lib/llm_docs_builder/config.rb CHANGED

@@ -128,7 +128,39 @@ module LlmDocsBuilder
                              options[:remove_duplicates]
                            else
                              self['remove_duplicates'] || false
-                           end
+                           end,
+        # New RAG enhancement options
+        normalize_headings: if options.key?(:normalize_headings)
+                              options[:normalize_headings]
+                            else
+                              self['normalize_headings'] || false
+                            end,
+        heading_separator: options[:heading_separator] || self['heading_separator'] || ' / ',
+        include_metadata: if options.key?(:include_metadata)
+                            options[:include_metadata]
+                          else
+                            self['include_metadata'] || false
+                          end,
+        include_tokens: if options.key?(:include_tokens)
+                          options[:include_tokens]
+                        else
+                          self['include_tokens'] || false
+                        end,
+        include_timestamps: if options.key?(:include_timestamps)
+                              options[:include_timestamps]
+                            else
+                              self['include_timestamps'] || false
+                            end,
+        include_priority: if options.key?(:include_priority)
+                            options[:include_priority]
+                          else
+                            self['include_priority'] || false
+                          end,
+        calculate_compression: if options.key?(:calculate_compression)
+                                 options[:calculate_compression]
+                               else
+                                 self['calculate_compression'] || false
+                               end
       }
     end
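
The boolean options in the hunk above all use `options.key?` instead of `||`. A likely reason is that `||` would treat an explicit `false` as "not given" and fall through to the config file value. The sketch below contrasts the two lookups. `SketchConfig`, `boolean_option` and `naive_boolean_option` are hypothetical names for this illustration and do not exist in the gem.

```ruby
# Hypothetical stand-in for the gem's Config class, reduced to the lookup logic.
class SketchConfig
  def initialize(file_values)
    @file_values = file_values
  end

  def [](key)
    @file_values[key]
  end

  # An explicit option wins, even when it is false; otherwise fall back to
  # the config file value, then to false.
  def boolean_option(options, name)
    return options[name] if options.key?(name)

    self[name.to_s] || false
  end

  # The `||` shortcut: an explicit false falls through to the file value.
  def naive_boolean_option(options, name)
    options[name] || self[name.to_s] || false
  end
end

config = SketchConfig.new('normalize_headings' => true)
explicit_off = { normalize_headings: false }

puts config.boolean_option(explicit_off, :normalize_headings)       # false
puts config.naive_boolean_option(explicit_off, :normalize_headings) # true
```

The `||` form is still fine for `heading_separator` in the diff, because a string option has no meaningful `false` value to preserve.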

data/lib/llm_docs_builder/generator.rb CHANGED

@@ -88,10 +88,10 @@ module LlmDocsBuilder
     # Extracts metadata from a documentation file
     #
-    # Analyzes file content to extract title, description, and priority
+    # Analyzes file content to extract title, description, priority, and optional metadata
     #
     # @param file_path [String] path to file to analyze
-    # @return [Hash] file metadata with :path, :title, :description, :priority
+    # @return [Hash] file metadata with :path, :title, :description, :priority, :tokens, :updated
     def analyze_file(file_path)
       # Handle single file case differently
       relative_path = if File.file?(docs_path)
@@ -102,12 +102,28 @@ module LlmDocsBuilder
       content = File.read(file_path)
-      {
+      metadata = {
         path: relative_path,
         title: extract_title(content, file_path),
         description: extract_description(content),
         priority: calculate_priority(file_path)
       }
+      # Add optional enhanced metadata
+      if options[:include_metadata]
+        metadata[:tokens] = TokenEstimator.estimate(content) if options[:include_tokens]
+        metadata[:updated] = File.mtime(file_path).strftime('%Y-%m-%d') if options[:include_timestamps]
+        # Calculate compression ratio if transformation is enabled
+        if options[:calculate_compression]
+          transformed = apply_transformations(content, file_path)
+          original_tokens = TokenEstimator.estimate(content)
+          transformed_tokens = TokenEstimator.estimate(transformed)
+          metadata[:compression] = (transformed_tokens.to_f / original_tokens).round(2)
+        end
+      end
+      metadata
     end
     # Extracts title from file content or generates from filename
@@ -164,6 +180,21 @@ module LlmDocsBuilder
       7 # default priority
     end
+    # Applies transformations to content for compression ratio calculation
+    #
+    # @param content [String] original content
+    # @param file_path [String] path to file
+    # @return [String] transformed content
+    def apply_transformations(content, file_path)
+      transformer = MarkdownTransformer.new(file_path, options)
+      # Read file again through transformer to get transformed version
+      transformer.transform
+    rescue StandardError
+      # If transformation fails, return original content
+      content
+    end
     # Constructs llms.txt content from analyzed documentation files
     #
     # Combines title, description, and documentation links into formatted output
@@ -186,11 +217,24 @@ module LlmDocsBuilder
         docs.each do |doc|
           url = build_url(doc[:path])
-          content << if doc[:description] && !doc[:description].empty?
-                       "- [#{doc[:title]}](#{url}): #{doc[:description]}"
-                     else
-                       "- [#{doc[:title]}](#{url})"
-                     end
+          line = if doc[:description] && !doc[:description].empty?
+                   "- [#{doc[:title]}](#{url}): #{doc[:description]}"
+                 else
+                   "- [#{doc[:title]}](#{url})"
+                 end
+          # Append metadata if enabled
+          if options[:include_metadata]
+            metadata_parts = []
+            metadata_parts << "tokens:#{doc[:tokens]}" if doc[:tokens]
+            metadata_parts << "compression:#{doc[:compression]}" if doc[:compression]
+            metadata_parts << "updated:#{doc[:updated]}" if doc[:updated]
+            metadata_parts << priority_label(doc[:priority]) if options[:include_priority]
+            line += " #{metadata_parts.join(' ')}" unless metadata_parts.empty?
+          end
+          content << line
         end
       end
@@ -230,5 +274,20 @@ module LlmDocsBuilder
         path
       end
     end
+    # Converts numeric priority to human-readable label
+    #
+    # @param priority [Integer] priority value (1-7)
+    # @return [String] priority label (high, medium, low)
+    def priority_label(priority)
+      case priority
+      when 1..2
+        'priority:high'
+      when 3..5
+        'priority:medium'
+      when 6..7
+        'priority:low'
+      end
+    end
   end
 end
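
Two details of the generator changes above are easy to miss: the compression value is transformed tokens divided by original tokens (so lower means more savings), and `priority_label` returns `nil` for values outside 1..7. The following is a minimal standalone sketch of the same calculations with guards for those edge cases. `estimate_tokens` is a hypothetical stand-in that assumes roughly four characters per token (the gem's `TokenEstimator` is not shown in this diff), and `compression_ratio` and `metadata_suffix` are names invented for the example.

```ruby
# Hypothetical token estimator: roughly four characters per token.
def estimate_tokens(text)
  (text.length / 4.0).round
end

# Ratio of transformed to original tokens; lower means more savings.
# Returns nil for empty input instead of dividing by zero.
def compression_ratio(original, transformed)
  original_tokens = estimate_tokens(original)
  return nil if original_tokens.zero?

  (estimate_tokens(transformed).to_f / original_tokens).round(2)
end

# Same ranges as priority_label in the diff; other values give nil.
def priority_label(priority)
  case priority
  when 1..2 then 'priority:high'
  when 3..5 then 'priority:medium'
  when 6..7 then 'priority:low'
  end
end

# Build the metadata suffix in the same field order as the diff.
# compact drops fields that are missing or that have no label.
def metadata_suffix(doc, include_priority: true)
  parts = [
    doc[:tokens] && "tokens:#{doc[:tokens]}",
    doc[:compression] && "compression:#{doc[:compression]}",
    doc[:updated] && "updated:#{doc[:updated]}",
    include_priority ? priority_label(doc[:priority]) : nil
  ].compact
  parts.empty? ? '' : " #{parts.join(' ')}"
end

puts compression_ratio('a' * 400, 'a' * 100)
puts metadata_suffix({ tokens: 450, updated: '2025-10-13', priority: 1 })
```

A ratio of 0.25 here means the transformed text needs a quarter of the original tokens.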

data/lib/llm_docs_builder/markdown_transformer.rb CHANGED

@@ -48,9 +48,10 @@ module LlmDocsBuilder
     # Processes content through specialized transformers in order:
     # 1. ContentCleanupTransformer - Removes unwanted elements
     # 2. LinkTransformer - Processes links
-    # 3. TextCompressor - Advanced compression (if enabled)
-    # 4. EnhancementTransformer - Adds TOC and instructions
-    # 5. WhitespaceTransformer - Normalizes whitespace
+    # 3. HeadingTransformer - Normalizes heading hierarchy (if enabled)
+    # 4. TextCompressor - Advanced compression (if enabled)
+    # 5. EnhancementTransformer - Adds TOC and instructions
+    # 6. WhitespaceTransformer - Normalizes whitespace
     #
     # @return [String] transformed markdown content
     def transform
@@ -59,6 +60,7 @@ module LlmDocsBuilder
       # Build and execute transformation pipeline
       content = cleanup_transformer.transform(content, options)
       content = link_transformer.transform(content, options)
+      content = heading_transformer.transform(content, options)
       content = compress_content(content) if should_compress?
       content = enhancement_transformer.transform(content, options)
       content = whitespace_transformer.transform(content, options)
@@ -82,6 +84,13 @@ module LlmDocsBuilder
       @link_transformer ||= Transformers::LinkTransformer.new
     end
+    # Get heading transformer instance
+    #
+    # @return [Transformers::HeadingTransformer]
+    def heading_transformer
+      @heading_transformer ||= Transformers::HeadingTransformer.new
+    end
     # Get enhancement transformer instance
     #
     # @return [Transformers::EnhancementTransformer]
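
The `transform` method above calls `heading_transformer.transform` unconditionally, which works because that transformer returns the content untouched when its option is off. The same shape can be sketched as an ordered list of steps folded over the content. The three lambdas and the `:upcase_headings` option below are hypothetical stand-ins, not the gem's transformers.

```ruby
# Hypothetical steps: each takes (content, options) and returns a string,
# leaving the input untouched when its option is off.
STEPS = [
  # Cleanup: drop HTML comments.
  ->(content, _options) { content.gsub(/<!--.*?-->/m, '') },
  # Optional step, gated by a made-up option for this sketch.
  lambda do |content, options|
    options[:upcase_headings] ? content.gsub(/^#+ .+$/, &:upcase) : content
  end,
  # Whitespace: collapse runs of three or more newlines into one blank line.
  ->(content, _options) { content.gsub(/\n{3,}/, "\n\n") }
].freeze

# Run every step in order, feeding each result into the next step.
def run_pipeline(content, options)
  STEPS.reduce(content) { |acc, step| step.call(acc, options) }
end

source = "# a\n<!-- x -->\n\n\n\ntext\n"
puts run_pipeline(source, {}).inspect
puts run_pipeline(source, { upcase_headings: true }).inspect
```

Order matters in this sketch for the same reason it does in the diff: the whitespace step runs last so it can tidy up gaps left by earlier steps.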

data/lib/llm_docs_builder/transformers/heading_transformer.rb ADDED

@@ -0,0 +1,72 @@
+# frozen_string_literal: true
+module LlmDocsBuilder
+  module Transformers
+    # Normalizes headings to include hierarchical context
+    #
+    # Transforms markdown headings to include parent context, making each section
+    # self-contained for RAG systems. This is particularly useful when documents
+    # are chunked and retrieved independently.
+    #
+    # @example Basic heading normalization
+    #   # Configuration
+    #   ## Consumer Settings
+    #   ### auto_offset_reset
+    #
+    #   Becomes:
+    #   # Configuration
+    #   ## Configuration / Consumer Settings
+    #   ### Configuration / Consumer Settings / auto_offset_reset
+    #
+    # @api public
+    class HeadingTransformer
+      include BaseTransformer
+      # Transform content by normalizing heading hierarchy
+      #
+      # Parses markdown headings and adds parent context to each heading,
+      # making sections self-documenting when retrieved independently.
+      #
+      # @param content [String] markdown content to transform
+      # @param options [Hash] transformation options
+      # @option options [Boolean] :normalize_headings enable heading normalization
+      # @option options [String] :heading_separator separator between heading levels (default: ' / ')
+      # @return [String] transformed content with normalized headings
+      def transform(content, options = {})
+        return content unless options[:normalize_headings]
+        separator = options[:heading_separator] || ' / '
+        heading_stack = []
+        lines = content.lines
+        transformed_lines = lines.map do |line|
+          # Match markdown headings (1-6 hash symbols followed by space and text)
+          heading_match = line.match(/^(#+)\s+(.+)$/)
+          if heading_match && heading_match[1].count('#').between?(1, 6)
+            level = heading_match[1].count('#')
+            title = heading_match[2].strip
+            # Update heading stack to current level
+            heading_stack = heading_stack[0...level - 1]
+            heading_stack << title
+            # Build hierarchical heading
+            if level == 1
+              # H1 stays as-is (top level)
+              line
+            else
+              # H2+ gets parent context
+              hierarchical_title = heading_stack.join(separator)
+              "#{'#' * level} #{hierarchical_title}\n"
+            end
+          else
+            line
+          end
+        end
+        transformed_lines.join
+      end
+    end
+  end
+end
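
The transformer above matches any line that starts with `#` characters followed by whitespace, so a shell comment inside a fenced code block would also be treated as a heading. The standalone sketch below uses the same stack-based approach and adds a simple guard for fenced blocks. The `normalize_headings` function and its fence handling are assumptions made for illustration, not the gem's behaviour, and the fence toggle is deliberately naive (it does not handle nested fences or fence lengths).

```ruby
# Sketch only: same stack-based approach as HeadingTransformer in the diff,
# plus a hypothetical guard that ignores lines inside fenced code blocks.
def normalize_headings(content, separator: ' / ')
  stack = []
  in_fence = false

  content.lines.map do |line|
    # Toggle fence state on lines that open or close a fenced block.
    if line.match?(/\A\s{0,3}(```|~~~)/)
      in_fence = !in_fence
      next line
    end
    next line if in_fence

    match = line.match(/\A([#]{1,6})\s+(.+)$/)
    next line unless match

    level = match[1].length
    # Keep only the ancestors above this level, then push the current title.
    stack = stack[0...level - 1]
    stack << match[2].strip

    # H1 stays as-is; deeper headings get the full parent path.
    level == 1 ? line : "#{'#' * level} #{stack.join(separator)}\n"
  end.join
end

input = "# Configuration\n## Consumer Settings\n```bash\n# not a heading\n```\n### auto_offset_reset\n"
puts normalize_headings(input)
```

The comment line inside the `bash` block passes through unchanged, while the H2 and H3 headings pick up their parent context.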

data/lib/llm_docs_builder/version.rb CHANGED

@@ -2,5 +2,5 @@
 module LlmDocsBuilder
   # Current version of the LlmDocsBuilder gem
-  VERSION = '0.7.0'
+  VERSION = '0.8.1'
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: llm-docs-builder
 version: !ruby/object:Gem::Version
-  version: 0.7.0
+  version: 0.8.1
 platform: ruby
 authors:
 - Maciej Mensfeld
@@ -140,6 +140,7 @@ files:
 - lib/llm_docs_builder/transformers/base_transformer.rb
 - lib/llm_docs_builder/transformers/content_cleanup_transformer.rb
 - lib/llm_docs_builder/transformers/enhancement_transformer.rb
+- lib/llm_docs_builder/transformers/heading_transformer.rb
 - lib/llm_docs_builder/transformers/link_transformer.rb
 - lib/llm_docs_builder/transformers/whitespace_transformer.rb
 - lib/llm_docs_builder/validator.rb