llm-docs-builder 0.7.0 → 0.8.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: b8fed4a1c362dec44db4b8091bc81a7d1fc2c01e3c104993f74a810c048d5d02
-   data.tar.gz: 575cf59762de9438336d3c4127277f0ba89e6e540772427cdee9af0407b90983
+   metadata.gz: 8104ee730f22ce1c3ff1f13dea2e784e82b617ea5c20332185351922c1d75710
+   data.tar.gz: 4cc077637f04d658a35b97aee35aed031f27c44a59f87d961e64af4aed268f77
  SHA512:
-   metadata.gz: f8474aa5ed99b00c9b7d143a48dae066e7eb3d726fc9570d16b4bedae57e78e09bdf028bb679bf1d4f6c87ee94e690f48420ec8d67c5a22f9c97c5c4d8a064cd
-   data.tar.gz: 4ea7338aed98ff58d1173ceaf426bcd6d79949508e1ec0441652ef0202d58bf3f88d600f3284648d98210de33ead0201aac0de3fde3cfb9bc92ce4e5e3b05132
+   metadata.gz: 22c9b908b4473c8207027107f68c85fc7a8751d459a781ba67bcebc4034da5dcd8b0d748c1931b78b9e7d3ee891ac9f814350fb052930d1715fc38fb0e870eae
+   data.tar.gz: 6dea4a2987218ca92c9eb0162460870a15cdf24a0feaca559e31d734075d40be1da46186d437ccde47bf5cc0c9b2d2784180625fb0754926d0e1c7baf6708e86
@@ -4,30 +4,25 @@ concurrency:
    group: ${{ github.workflow }}-${{ github.ref }}
    cancel-in-progress: true
 
- # Temporarily disabled - only runs on manual trigger
  on:
+   push:
+     tags:
+       - 'v*'
    workflow_dispatch:
 
- # Automatic triggers disabled for now:
- # push:
- # branches:
- # - master
- # tags:
- # - 'v*'
- # pull_request:
- # branches:
- # - master
- # schedule:
- # # Rebuild weekly to get latest base image security updates
- # - cron: '0 2 * * 0'
-
  permissions:
    contents: read
-   packages: write
 
  jobs:
    docker:
+     if: github.repository_owner == 'mensfeld'
      runs-on: ubuntu-latest
+     environment: deployment
+
+     permissions:
+       contents: read
+       packages: write
+       id-token: write
      steps:
        - name: Checkout
          uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5
@@ -24,7 +24,7 @@ jobs:
            fetch-depth: 0
 
        - name: Set up Ruby
-         uses: ruby/setup-ruby@6797dcbb9a1889fd411d07e8aba7eded53fb8b48 # v1.264.0
+         uses: ruby/setup-ruby@ab177d40ee5483edb974554986f56b33477e21d0 # v1.265.0
          with:
            bundler-cache: false
 
data/CHANGELOG.md CHANGED
@@ -1,5 +1,30 @@
  # Changelog
 
+ ## 0.8.1 (2025-10-17)
+ - [Enhancement] Ship the docker container.
+
+ ## 0.8.0 (2025-10-14)
+ - [Feature] **RAG Enhancement: Heading Normalization** - Transform headings to include hierarchical context for better RAG retrieval.
+   - Adds parent context to H2-H6 headings (e.g., "Configuration / Consumer Settings / auto_offset_reset")
+   - Makes each section self-contained when documents are chunked
+   - Configurable separator (default: " / ")
+   - Enable with `normalize_headings: true`
+   - Perfect for vector databases and RAG systems
+ - [Feature] **RAG Enhancement: Enhanced llms.txt Metadata** - Generate enriched llms.txt files with machine-readable metadata.
+   - Token counts per document (helps AI agents manage context windows)
+   - Last modified timestamps (helps prefer recent docs)
+   - Priority labels: high/medium/low (helps guide which docs to fetch first)
+   - Optional compression ratios (shows optimization effectiveness)
+   - Enable with `include_metadata: true`, `include_tokens: true`, `include_timestamps: true`, `include_priority: true`
+ - [Enhancement] Added `HeadingTransformer` class with comprehensive heading hierarchy tracking.
+ - [Enhancement] Added priority calculation in Generator (README=high, getting started=high, tutorials=medium, etc.).
+ - [Enhancement] Updated `Config#merge_with_options` to support all new RAG options.
+ - [Testing] Added 10 comprehensive tests for HeadingTransformer covering edge cases.
+ - [Testing] All 303 tests passing with 96.94% line coverage and 85.59% branch coverage.
+ - [Documentation] Added "RAG Enhancement Features" section to README with examples and use cases.
+ - [Documentation] Added detailed implementation guide in RAG_FEATURES.md.
+ - [Documentation] Added example RAG configuration in examples/rag-config.yml.
+
  ## 0.7.0 (2025-10-09)
  - [Feature] **Advanced Token Optimization** - Added 8 new compression options to reduce token consumption:
    - `remove_code_examples`: Remove code blocks and inline code
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     llm-docs-builder (0.7.0)
+     llm-docs-builder (0.8.1)
        zeitwerk (~> 2.6)
 
  GEM
data/README.md CHANGED
@@ -5,7 +5,7 @@
 
  **Optimize your documentation for LLMs and RAG systems. Reduce token consumption by 67-95%.**
 
- llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, and optimizes documents for LLM context windows.
+ llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, optimizes documents for LLM context windows, and enhances documents for RAG retrieval with hierarchical heading context and metadata.
 
  ## The Problem
 
@@ -29,6 +29,25 @@ docker run mensfeld/llm-docs-builder compare \
    --url https://yoursite.com/docs/getting-started.html
  ```
 
+ **Example output:**
+ ```
+ ============================================================
+ Context Window Comparison
+ ============================================================
+
+ Human version: 127.4 KB (~32,620 tokens)
+ Source: https://karafka.io/docs/Pro-Virtual-Partitions/ (User-Agent: human)
+
+ AI version: 46.3 KB (~11,854 tokens)
+ Source: https://karafka.io/docs/Pro-Virtual-Partitions/ (User-Agent: AI)
+
+ ------------------------------------------------------------
+ Reduction: 81.1 KB (64%)
+ Token savings: 20,766 tokens (64%)
+ Factor: 2.8x smaller
+ ============================================================
+ ```
+
  ### Transform Your Documentation
 
  ```bash
@@ -104,6 +123,15 @@ simplify_links: true
  generate_toc: true
  custom_instruction: "This documentation is optimized for AI consumption"
 
+ # RAG enhancement options
+ normalize_headings: true # Add hierarchical context to headings
+ heading_separator: " / " # Separator for heading hierarchy
+ include_metadata: true # Enable enhanced llms.txt metadata
+ include_tokens: true # Include token counts in llms.txt
+ include_timestamps: true # Include update timestamps in llms.txt
+ include_priority: true # Include priority labels in llms.txt
+ calculate_compression: false # Calculate compression ratios (slower)
+
  # Exclusions
  excludes:
    - "**/private/**"
@@ -290,6 +318,70 @@ Use `llm-docs-builder compare` to measure before and after.
  **Q: What about private documentation?**
  Use the `excludes` option to skip sensitive files.
 
+ ## RAG Enhancement Features
+
+ ### Heading Normalization
+
+ Transform headings to include hierarchical context, making each section self-contained for RAG retrieval:
+
+ **Before:**
+ ```markdown
+ # Configuration
+ ## Consumer Settings
+ ### auto_offset_reset
+
+ Controls behavior when no offset exists...
+ ```
+
+ **After (with `normalize_headings: true`):**
+ ```markdown
+ # Configuration
+ ## Configuration / Consumer Settings
+ ### Configuration / Consumer Settings / auto_offset_reset
+
+ Controls behavior when no offset exists...
+ ```
+
+ **Why this matters for RAG:** When documents are chunked and retrieved independently, each section retains full context. An LLM seeing just the `auto_offset_reset` section knows it's about "Configuration / Consumer Settings / auto_offset_reset" not just generic "auto_offset_reset".
+
+ ```yaml
+ # Enable in config
+ normalize_headings: true
+ heading_separator: " / " # Customize separator (default: " / ")
+ ```
+
+ ### Enhanced llms.txt Metadata
+
+ Generate enriched llms.txt files with token counts, timestamps, and priority labels to help AI agents make better decisions:
+
+ **Standard llms.txt:**
+ ```markdown
+ - [Getting Started](https://myproject.io/docs/Getting-Started.md)
+ - [Configuration](https://myproject.io/docs/Configuration.md)
+ ```
+
+ **Enhanced llms.txt (with metadata enabled):**
+ ```markdown
+ - [Getting Started](https://myproject.io/docs/Getting-Started.md) tokens:450 updated:2025-10-13 priority:high
+ - [Configuration](https://myproject.io/docs/Configuration.md) tokens:2800 updated:2025-10-12 priority:high
+ - [Advanced Topics](https://myproject.io/docs/Advanced.md) tokens:5200 updated:2025-09-15 priority:medium
+ ```
+
+ **Benefits:**
+ - AI agents can see token counts → load multiple small docs vs one large doc
+ - Timestamps help prefer recent documentation
+ - Priority signals guide which docs to fetch first
+ - Compression ratios show optimization effectiveness
+
+ ```yaml
+ # Enable in config
+ include_metadata: true # Master switch
+ include_tokens: true # Show token counts
+ include_timestamps: true # Show last modified dates
+ include_priority: true # Show priority labels (high/medium/low)
+ calculate_compression: true # Show compression ratios (slower, requires transformation)
+ ```
+
  ## Advanced Compression Options
 
  All compression features can be used individually for fine-grained control:
@@ -128,7 +128,39 @@ module LlmDocsBuilder
            options[:remove_duplicates]
          else
            self['remove_duplicates'] || false
-         end
+         end,
+         # New RAG enhancement options
+         normalize_headings: if options.key?(:normalize_headings)
+           options[:normalize_headings]
+         else
+           self['normalize_headings'] || false
+         end,
+         heading_separator: options[:heading_separator] || self['heading_separator'] || ' / ',
+         include_metadata: if options.key?(:include_metadata)
+           options[:include_metadata]
+         else
+           self['include_metadata'] || false
+         end,
+         include_tokens: if options.key?(:include_tokens)
+           options[:include_tokens]
+         else
+           self['include_tokens'] || false
+         end,
+         include_timestamps: if options.key?(:include_timestamps)
+           options[:include_timestamps]
+         else
+           self['include_timestamps'] || false
+         end,
+         include_priority: if options.key?(:include_priority)
+           options[:include_priority]
+         else
+           self['include_priority'] || false
+         end,
+         calculate_compression: if options.key?(:calculate_compression)
+           options[:calculate_compression]
+         else
+           self['calculate_compression'] || false
+         end
        }
      end
 
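The boolean options in this hunk are resolved with `options.key?(...)` rather than a plain `||` chain, presumably so that an explicit `false` passed at call time can still override a `true` from the config file. A minimal Ruby sketch of that precedence rule follows. `resolve_flag` and the plain hashes standing in for the call-time options and the loaded config are hypothetical names for illustration only; they are not part of the gem.

```ruby
# Hypothetical helper (not part of llm-docs-builder) illustrating the
# precedence used for boolean options: an explicitly passed option wins,
# even when it is false; otherwise the config value; otherwise false.
def resolve_flag(options, config, name)
  return options[name] if options.key?(name)

  config[name.to_s] || false
end

# Stand-in for values loaded from a YAML config file (string keys).
config = { 'normalize_headings' => true }

explicit_false = resolve_flag({ normalize_headings: false }, config, :normalize_headings)
from_config    = resolve_flag({}, config, :normalize_headings)
default_value  = resolve_flag({}, {}, :include_tokens)

# A plain `||` chain cannot express "explicitly false": the false is skipped
# and the config value is used instead.
naive = { normalize_headings: false }[:normalize_headings] || config['normalize_headings']

puts explicit_false, from_config, default_value, naive
```

String options such as `heading_separator` have no meaningful `false` value, which is likely why the hunk resolves that one with a simple `||` chain.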
@@ -88,10 +88,10 @@ module LlmDocsBuilder
 
      # Extracts metadata from a documentation file
      #
-     # Analyzes file content to extract title, description, and priority
+     # Analyzes file content to extract title, description, priority, and optional metadata
      #
      # @param file_path [String] path to file to analyze
-     # @return [Hash] file metadata with :path, :title, :description, :priority
+     # @return [Hash] file metadata with :path, :title, :description, :priority, :tokens, :updated
      def analyze_file(file_path)
        # Handle single file case differently
        relative_path = if File.file?(docs_path)
@@ -102,12 +102,28 @@ module LlmDocsBuilder
 
        content = File.read(file_path)
 
-       {
+       metadata = {
          path: relative_path,
          title: extract_title(content, file_path),
          description: extract_description(content),
          priority: calculate_priority(file_path)
        }
+
+       # Add optional enhanced metadata
+       if options[:include_metadata]
+         metadata[:tokens] = TokenEstimator.estimate(content) if options[:include_tokens]
+         metadata[:updated] = File.mtime(file_path).strftime('%Y-%m-%d') if options[:include_timestamps]
+
+         # Calculate compression ratio if transformation is enabled
+         if options[:calculate_compression]
+           transformed = apply_transformations(content, file_path)
+           original_tokens = TokenEstimator.estimate(content)
+           transformed_tokens = TokenEstimator.estimate(transformed)
+           metadata[:compression] = (transformed_tokens.to_f / original_tokens).round(2)
+         end
+       end
+
+       metadata
      end
 
      # Extracts title from file content or generates from filename
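The `:compression` value computed in this hunk is the ratio of transformed to original token estimates, so smaller values mean stronger compression (0.25 means the transformed file needs about a quarter of the tokens). The Ruby sketch below reproduces that calculation in isolation. `estimate_tokens` is a stand-in that assumes roughly four characters per token; the gem's `TokenEstimator` may use a different heuristic. The zero-token guard is an addition for illustration and does not appear in the diff.

```ruby
# Stand-in estimator: ASSUMES ~4 characters per token. The real
# TokenEstimator in the gem may work differently.
def estimate_tokens(text)
  (text.length / 4.0).ceil
end

# Ratio of transformed to original tokens, rounded to two decimals.
# Returns nil for empty input; this guard is an illustrative addition.
def compression_ratio(original, transformed)
  original_tokens = estimate_tokens(original)
  return nil if original_tokens.zero?

  (estimate_tokens(transformed).to_f / original_tokens).round(2)
end

original    = 'a' * 400 # estimated at 100 tokens
transformed = 'a' * 100 # estimated at 25 tokens

puts compression_ratio(original, transformed)
```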
@@ -164,6 +180,21 @@ module LlmDocsBuilder
        7 # default priority
      end
 
+     # Applies transformations to content for compression ratio calculation
+     #
+     # @param content [String] original content
+     # @param file_path [String] path to file
+     # @return [String] transformed content
+     def apply_transformations(content, file_path)
+       transformer = MarkdownTransformer.new(file_path, options)
+
+       # Read file again through transformer to get transformed version
+       transformer.transform
+     rescue StandardError
+       # If transformation fails, return original content
+       content
+     end
+
      # Constructs llms.txt content from analyzed documentation files
      #
      # Combines title, description, and documentation links into formatted output
@@ -186,11 +217,24 @@ module LlmDocsBuilder
 
          docs.each do |doc|
            url = build_url(doc[:path])
-           content << if doc[:description] && !doc[:description].empty?
-             "- [#{doc[:title]}](#{url}): #{doc[:description]}"
-           else
-             "- [#{doc[:title]}](#{url})"
-           end
+           line = if doc[:description] && !doc[:description].empty?
+             "- [#{doc[:title]}](#{url}): #{doc[:description]}"
+           else
+             "- [#{doc[:title]}](#{url})"
+           end
+
+           # Append metadata if enabled
+           if options[:include_metadata]
+             metadata_parts = []
+             metadata_parts << "tokens:#{doc[:tokens]}" if doc[:tokens]
+             metadata_parts << "compression:#{doc[:compression]}" if doc[:compression]
+             metadata_parts << "updated:#{doc[:updated]}" if doc[:updated]
+             metadata_parts << priority_label(doc[:priority]) if options[:include_priority]
+
+             line += " #{metadata_parts.join(' ')}" unless metadata_parts.empty?
+           end
+
+           content << line
          end
        end
 
@@ -230,5 +274,20 @@ module LlmDocsBuilder
          path
        end
      end
+
+     # Converts numeric priority to human-readable label
+     #
+     # @param priority [Integer] priority value (1-7)
+     # @return [String] priority label (high, medium, low)
+     def priority_label(priority)
+       case priority
+       when 1..2
+         'priority:high'
+       when 3..5
+         'priority:medium'
+       when 6..7
+         'priority:low'
+       end
+     end
    end
  end
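Taken together, the two generator hunks above build each llms.txt entry as a markdown link followed by space-separated `key:value` pairs. The standalone Ruby sketch below shows how such a line could be assembled. `llms_line` is a hypothetical helper, the `doc` hash stands in for the analyzed-file metadata, and the URL is the placeholder from the README example. Dropping a `nil` label with `compact!` (for a priority outside 1-7) is an illustrative addition, not something the diff does.

```ruby
# Maps the numeric priority (1-7) to a label, using the same ranges as the
# diff. Returns nil for values outside 1-7.
def priority_label(priority)
  case priority
  when 1..2 then 'priority:high'
  when 3..5 then 'priority:medium'
  when 6..7 then 'priority:low'
  end
end

# Hypothetical helper: builds one llms.txt entry from a metadata hash.
def llms_line(doc, url, include_priority: true)
  line = "- [#{doc[:title]}](#{url})"
  line += ": #{doc[:description]}" unless doc[:description].to_s.empty?

  parts = []
  parts << "tokens:#{doc[:tokens]}" if doc[:tokens]
  parts << "compression:#{doc[:compression]}" if doc[:compression]
  parts << "updated:#{doc[:updated]}" if doc[:updated]
  parts << priority_label(doc[:priority]) if include_priority
  parts.compact! # illustrative addition: skip a nil label for out-of-range priorities

  parts.empty? ? line : "#{line} #{parts.join(' ')}"
end

doc = { title: 'Getting Started', tokens: 450, updated: '2025-10-13', priority: 1 }
entry = llms_line(doc, 'https://myproject.io/docs/Getting-Started.md')
puts entry
```

Because the metadata is appended as plain `key:value` tokens after the link, a consumer can split the trailing text on spaces and colons without a full markdown parser.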
@@ -48,9 +48,10 @@ module LlmDocsBuilder
      # Processes content through specialized transformers in order:
      # 1. ContentCleanupTransformer - Removes unwanted elements
      # 2. LinkTransformer - Processes links
-     # 3. TextCompressor - Advanced compression (if enabled)
-     # 4. EnhancementTransformer - Adds TOC and instructions
-     # 5. WhitespaceTransformer - Normalizes whitespace
+     # 3. HeadingTransformer - Normalizes heading hierarchy (if enabled)
+     # 4. TextCompressor - Advanced compression (if enabled)
+     # 5. EnhancementTransformer - Adds TOC and instructions
+     # 6. WhitespaceTransformer - Normalizes whitespace
      #
      # @return [String] transformed markdown content
      def transform
@@ -59,6 +60,7 @@ module LlmDocsBuilder
        # Build and execute transformation pipeline
        content = cleanup_transformer.transform(content, options)
        content = link_transformer.transform(content, options)
+       content = heading_transformer.transform(content, options)
        content = compress_content(content) if should_compress?
        content = enhancement_transformer.transform(content, options)
        content = whitespace_transformer.transform(content, options)
@@ -82,6 +84,13 @@ module LlmDocsBuilder
        @link_transformer ||= Transformers::LinkTransformer.new
      end
 
+     # Get heading transformer instance
+     #
+     # @return [Transformers::HeadingTransformer]
+     def heading_transformer
+       @heading_transformer ||= Transformers::HeadingTransformer.new
+     end
+
      # Get enhancement transformer instance
      #
      # @return [Transformers::EnhancementTransformer]
@@ -0,0 +1,72 @@
+ # frozen_string_literal: true
+
+ module LlmDocsBuilder
+   module Transformers
+     # Normalizes headings to include hierarchical context
+     #
+     # Transforms markdown headings to include parent context, making each section
+     # self-contained for RAG systems. This is particularly useful when documents
+     # are chunked and retrieved independently.
+     #
+     # @example Basic heading normalization
+     #   # Configuration
+     #   ## Consumer Settings
+     #   ### auto_offset_reset
+     #
+     #   Becomes:
+     #   # Configuration
+     #   ## Configuration / Consumer Settings
+     #   ### Configuration / Consumer Settings / auto_offset_reset
+     #
+     # @api public
+     class HeadingTransformer
+       include BaseTransformer
+
+       # Transform content by normalizing heading hierarchy
+       #
+       # Parses markdown headings and adds parent context to each heading,
+       # making sections self-documenting when retrieved independently.
+       #
+       # @param content [String] markdown content to transform
+       # @param options [Hash] transformation options
+       # @option options [Boolean] :normalize_headings enable heading normalization
+       # @option options [String] :heading_separator separator between heading levels (default: ' / ')
+       # @return [String] transformed content with normalized headings
+       def transform(content, options = {})
+         return content unless options[:normalize_headings]
+
+         separator = options[:heading_separator] || ' / '
+         heading_stack = []
+         lines = content.lines
+
+         transformed_lines = lines.map do |line|
+           # Match markdown headings (1-6 hash symbols followed by space and text)
+           heading_match = line.match(/^(#+)\s+(.+)$/)
+
+           if heading_match && heading_match[1].count('#').between?(1, 6)
+             level = heading_match[1].count('#')
+             title = heading_match[2].strip
+
+             # Update heading stack to current level
+             heading_stack = heading_stack[0...level - 1]
+             heading_stack << title
+
+             # Build hierarchical heading
+             if level == 1
+               # H1 stays as-is (top level)
+               line
+             else
+               # H2+ gets parent context
+               hierarchical_title = heading_stack.join(separator)
+               "#{'#' * level} #{hierarchical_title}\n"
+             end
+           else
+             line
+           end
+         end
+
+         transformed_lines.join
+       end
+     end
+   end
+ end
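The heading-stack technique in the new file can be tried outside the gem with a short standalone sketch. The version below follows the same steps as the diff (truncate the stack to the parent level, push the current title, join with the separator), but `normalize_headings` is a hypothetical free function written for illustration, not the gem's `HeadingTransformer` API.

```ruby
# Standalone sketch of the heading-stack technique. `normalize_headings` is a
# hypothetical helper name, not part of llm-docs-builder.
def normalize_headings(markdown, separator: ' / ')
  stack = []

  markdown.lines.map do |line|
    match = line.match(/^(#+)\s+(.+)$/)
    # Leave non-heading lines (and runs of more than six hashes) untouched.
    next line unless match && match[1].length <= 6

    level = match[1].length
    # Keep only the ancestors of this heading, then record its own title.
    stack = stack[0...level - 1]
    stack << match[2].strip

    # H1 stays as written; deeper levels get their parent context prepended.
    level == 1 ? line : "#{'#' * level} #{stack.join(separator)}\n"
  end.join
end

input = "# Configuration\n## Consumer Settings\n### auto_offset_reset\n\nControls behavior...\n"
puts normalize_headings(input)
```

One limitation of a purely line-based match like this: a line such as `# install deps` inside a fenced shell block also looks like an H1, so a production version may want to track fence state before touching headings.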
@@ -2,5 +2,5 @@
 
  module LlmDocsBuilder
    # Current version of the LlmDocsBuilder gem
-   VERSION = '0.7.0'
+   VERSION = '0.8.1'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: llm-docs-builder
  version: !ruby/object:Gem::Version
-   version: 0.7.0
+   version: 0.8.1
  platform: ruby
  authors:
  - Maciej Mensfeld
@@ -140,6 +140,7 @@ files:
  - lib/llm_docs_builder/transformers/base_transformer.rb
  - lib/llm_docs_builder/transformers/content_cleanup_transformer.rb
  - lib/llm_docs_builder/transformers/enhancement_transformer.rb
+ - lib/llm_docs_builder/transformers/heading_transformer.rb
  - lib/llm_docs_builder/transformers/link_transformer.rb
  - lib/llm_docs_builder/transformers/whitespace_transformer.rb
  - lib/llm_docs_builder/validator.rb