llm-docs-builder 0.7.0 → 0.8.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: b8fed4a1c362dec44db4b8091bc81a7d1fc2c01e3c104993f74a810c048d5d02
- data.tar.gz: 575cf59762de9438336d3c4127277f0ba89e6e540772427cdee9af0407b90983
+ metadata.gz: 85874aed492c7a756acd3fec52228dedc3e3961b2e413a482aafd155e9c01d5e
+ data.tar.gz: de9d71e8bf15aace848995366cf1f6e46f758cbc86f9fa9b5bdd40be9e4695ce
  SHA512:
- metadata.gz: f8474aa5ed99b00c9b7d143a48dae066e7eb3d726fc9570d16b4bedae57e78e09bdf028bb679bf1d4f6c87ee94e690f48420ec8d67c5a22f9c97c5c4d8a064cd
- data.tar.gz: 4ea7338aed98ff58d1173ceaf426bcd6d79949508e1ec0441652ef0202d58bf3f88d600f3284648d98210de33ead0201aac0de3fde3cfb9bc92ce4e5e3b05132
+ metadata.gz: ee2f1c859b24726a812a3d162f6ea59b79e5a19c86d0df4202b4c4b3ad9fcfbe11ef28ea61784ce3594260952fec937b47858f01a2f0e37929173e367aab4547
+ data.tar.gz: 4d40ca6817d4b7b5d1ce9d9e6290e76a7cd145159b0ea8ac984e0e12d127cdc861e5a9e351e1cfbf10f93243c1e63956c552b5e3db4edbfdf9cd76c97859efea
data/CHANGELOG.md CHANGED
@@ -1,5 +1,27 @@
  # Changelog
 
+ ## 0.8.0 (2025-10-14)
+ - [Feature] **RAG Enhancement: Heading Normalization** - Transform headings to include hierarchical context for better RAG retrieval.
+ - Adds parent context to H2-H6 headings (e.g., "Configuration / Consumer Settings / auto_offset_reset")
+ - Makes each section self-contained when documents are chunked
+ - Configurable separator (default: " / ")
+ - Enable with `normalize_headings: true`
+ - Perfect for vector databases and RAG systems
+ - [Feature] **RAG Enhancement: Enhanced llms.txt Metadata** - Generate enriched llms.txt files with machine-readable metadata.
+ - Token counts per document (helps AI agents manage context windows)
+ - Last modified timestamps (helps prefer recent docs)
+ - Priority labels: high/medium/low (helps guide which docs to fetch first)
+ - Optional compression ratios (shows optimization effectiveness)
+ - Enable with `include_metadata: true`, `include_tokens: true`, `include_timestamps: true`, `include_priority: true`
+ - [Enhancement] Added `HeadingTransformer` class with comprehensive heading hierarchy tracking.
+ - [Enhancement] Added priority calculation in Generator (README=high, getting started=high, tutorials=medium, etc.).
+ - [Enhancement] Updated `Config#merge_with_options` to support all new RAG options.
+ - [Testing] Added 10 comprehensive tests for HeadingTransformer covering edge cases.
+ - [Testing] All 303 tests passing with 96.94% line coverage and 85.59% branch coverage.
+ - [Documentation] Added "RAG Enhancement Features" section to README with examples and use cases.
+ - [Documentation] Added detailed implementation guide in RAG_FEATURES.md.
+ - [Documentation] Added example RAG configuration in examples/rag-config.yml.
+
  ## 0.7.0 (2025-10-09)
  - [Feature] **Advanced Token Optimization** - Added 8 new compression options to reduce token consumption:
  - `remove_code_examples`: Remove code blocks and inline code
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- llm-docs-builder (0.7.0)
+ llm-docs-builder (0.8.0)
  zeitwerk (~> 2.6)
 
  GEM
data/README.md CHANGED
@@ -5,7 +5,7 @@
 
  **Optimize your documentation for LLMs and RAG systems. Reduce token consumption by 67-95%.**
 
- llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, and optimizes documents for LLM context windows.
+ llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, optimizes documents for LLM context windows, and enhances documents for RAG retrieval with hierarchical heading context and metadata.
 
  ## The Problem
 
@@ -104,6 +104,15 @@ simplify_links: true
  generate_toc: true
  custom_instruction: "This documentation is optimized for AI consumption"
 
+ # RAG enhancement options
+ normalize_headings: true # Add hierarchical context to headings
+ heading_separator: " / " # Separator for heading hierarchy
+ include_metadata: true # Enable enhanced llms.txt metadata
+ include_tokens: true # Include token counts in llms.txt
+ include_timestamps: true # Include update timestamps in llms.txt
+ include_priority: true # Include priority labels in llms.txt
+ calculate_compression: false # Calculate compression ratios (slower)
+
  # Exclusions
  excludes:
  - "**/private/**"
@@ -290,6 +299,70 @@ Use `llm-docs-builder compare` to measure before and after.
  **Q: What about private documentation?**
  Use the `excludes` option to skip sensitive files.
 
+ ## RAG Enhancement Features
+
+ ### Heading Normalization
+
+ Transform headings to include hierarchical context, making each section self-contained for RAG retrieval:
+
+ **Before:**
+ ```markdown
+ # Configuration
+ ## Consumer Settings
+ ### auto_offset_reset
+
+ Controls behavior when no offset exists...
+ ```
+
+ **After (with `normalize_headings: true`):**
+ ```markdown
+ # Configuration
+ ## Configuration / Consumer Settings
+ ### Configuration / Consumer Settings / auto_offset_reset
+
+ Controls behavior when no offset exists...
+ ```
+
+ **Why this matters for RAG:** When documents are chunked and retrieved independently, each section retains full context. An LLM seeing just the `auto_offset_reset` section knows it's about "Configuration / Consumer Settings / auto_offset_reset" not just generic "auto_offset_reset".
+
+ ```yaml
+ # Enable in config
+ normalize_headings: true
+ heading_separator: " / " # Customize separator (default: " / ")
+ ```
+
+ ### Enhanced llms.txt Metadata
+
+ Generate enriched llms.txt files with token counts, timestamps, and priority labels to help AI agents make better decisions:
+
+ **Standard llms.txt:**
+ ```markdown
+ - [Getting Started](https://myproject.io/docs/Getting-Started.md)
+ - [Configuration](https://myproject.io/docs/Configuration.md)
+ ```
+
+ **Enhanced llms.txt (with metadata enabled):**
+ ```markdown
+ - [Getting Started](https://myproject.io/docs/Getting-Started.md) tokens:450 updated:2025-10-13 priority:high
+ - [Configuration](https://myproject.io/docs/Configuration.md) tokens:2800 updated:2025-10-12 priority:high
+ - [Advanced Topics](https://myproject.io/docs/Advanced.md) tokens:5200 updated:2025-09-15 priority:medium
+ ```
+
+ **Benefits:**
+ - AI agents can see token counts → load multiple small docs vs one large doc
+ - Timestamps help prefer recent documentation
+ - Priority signals guide which docs to fetch first
+ - Compression ratios show optimization effectiveness
+
+ ```yaml
+ # Enable in config
+ include_metadata: true # Master switch
+ include_tokens: true # Show token counts
+ include_timestamps: true # Show last modified dates
+ include_priority: true # Show priority labels (high/medium/low)
+ calculate_compression: true # Show compression ratios (slower, requires transformation)
+ ```
+
  ## Advanced Compression Options
 
  All compression features can be used individually for fine-grained control:
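Note on consuming the enhanced llms.txt format shown in the README hunk above: the metadata is appended as plain `key:value` tokens after the link, so a reader can split it off with a small parser. The sketch below is one way an agent might parse entries and pick documents under a token budget. `parse_entry` and `pick_within_budget` are hypothetical helper names that do not exist in llm-docs-builder, and the entry layout is assumed to match the README example exactly.

```ruby
# Hypothetical consumer-side helpers; not part of llm-docs-builder.

# Parses one llms.txt list entry into a hash, or returns nil for other lines.
def parse_entry(line)
  entry_pattern = /\A- \[(?<title>[^\]]+)\]\((?<url>[^)]+)\)(?<rest>.*)\z/
  meta_pattern = /\b(tokens|compression|updated|priority):(\S+)/

  match = entry_pattern.match(line.chomp)
  return nil unless match

  meta = match[:rest].scan(meta_pattern).to_h
  {
    title: match[:title],
    url: match[:url],
    tokens: meta['tokens']&.to_i,
    updated: meta['updated'],
    priority: meta['priority']
  }
end

# Greedily picks entries by priority until the token budget is used up.
def pick_within_budget(entries, budget)
  rank = { 'high' => 0, 'medium' => 1, 'low' => 2 }
  used = 0
  # Sort by priority, keeping file order within the same priority.
  ordered = entries.each_with_index.sort_by { |entry, i| [rank.fetch(entry[:priority], 3), i] }
  ordered.map(&:first).select do |entry|
    next false if entry[:tokens].nil? || used + entry[:tokens] > budget

    used += entry[:tokens]
    true
  end
end

lines = [
  '- [Getting Started](https://myproject.io/docs/Getting-Started.md) tokens:450 updated:2025-10-13 priority:high',
  '- [Configuration](https://myproject.io/docs/Configuration.md) tokens:2800 updated:2025-10-12 priority:high',
  '- [Advanced Topics](https://myproject.io/docs/Advanced.md) tokens:5200 updated:2025-09-15 priority:medium'
]
entries = lines.filter_map { |line| parse_entry(line) }
selected = pick_within_budget(entries, 4000)
puts selected.map { |entry| entry[:title] }.inspect
```

With a budget of 4000 tokens the two `high` entries fit (450 + 2800) and the 5200-token entry is skipped. Restricting the scan to the four known keys keeps a colon inside a description from being misread as metadata.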
@@ -128,7 +128,39 @@ module LlmDocsBuilder
  options[:remove_duplicates]
  else
  self['remove_duplicates'] || false
- end
+ end,
+ # New RAG enhancement options
+ normalize_headings: if options.key?(:normalize_headings)
+ options[:normalize_headings]
+ else
+ self['normalize_headings'] || false
+ end,
+ heading_separator: options[:heading_separator] || self['heading_separator'] || ' / ',
+ include_metadata: if options.key?(:include_metadata)
+ options[:include_metadata]
+ else
+ self['include_metadata'] || false
+ end,
+ include_tokens: if options.key?(:include_tokens)
+ options[:include_tokens]
+ else
+ self['include_tokens'] || false
+ end,
+ include_timestamps: if options.key?(:include_timestamps)
+ options[:include_timestamps]
+ else
+ self['include_timestamps'] || false
+ end,
+ include_priority: if options.key?(:include_priority)
+ options[:include_priority]
+ else
+ self['include_priority'] || false
+ end,
+ calculate_compression: if options.key?(:calculate_compression)
+ options[:calculate_compression]
+ else
+ self['calculate_compression'] || false
+ end
  }
  end
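Note on the option merge above: the boolean options are resolved with `options.key?` rather than `||`, which lets an explicit `false` passed at call time override a `true` from the config file. Only `heading_separator`, a string, uses the `||` chain. A minimal sketch of the difference, with plain hashes standing in for the `Config` object (the `resolve_*` helper names are hypothetical and not part of the gem):

```ruby
# Plain hashes stand in for the Config object and the call-time options.
# The helper names are illustrative and do not exist in the gem.

# Mirrors the pattern in the hunk: presence of the key decides, not truthiness.
def resolve_with_key(options, file_config, name)
  if options.key?(name)
    options[name]
  else
    file_config[name.to_s] || false
  end
end

# Naive alternative: an explicit false falls through to the file value.
def resolve_with_or(options, file_config, name)
  options[name] || file_config[name.to_s] || false
end

file_config = { 'normalize_headings' => true }
options = { normalize_headings: false }

puts resolve_with_key(options, file_config, :normalize_headings) # => false
puts resolve_with_or(options, file_config, :normalize_headings)  # => true
```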
134
166
 
@@ -88,10 +88,10 @@ module LlmDocsBuilder
 
  # Extracts metadata from a documentation file
  #
- # Analyzes file content to extract title, description, and priority
+ # Analyzes file content to extract title, description, priority, and optional metadata
  #
  # @param file_path [String] path to file to analyze
- # @return [Hash] file metadata with :path, :title, :description, :priority
+ # @return [Hash] file metadata with :path, :title, :description, :priority, :tokens, :updated
  def analyze_file(file_path)
  # Handle single file case differently
  relative_path = if File.file?(docs_path)
@@ -102,12 +102,28 @@ module LlmDocsBuilder
 
  content = File.read(file_path)
 
- {
+ metadata = {
  path: relative_path,
  title: extract_title(content, file_path),
  description: extract_description(content),
  priority: calculate_priority(file_path)
  }
+
+ # Add optional enhanced metadata
+ if options[:include_metadata]
+ metadata[:tokens] = TokenEstimator.estimate(content) if options[:include_tokens]
+ metadata[:updated] = File.mtime(file_path).strftime('%Y-%m-%d') if options[:include_timestamps]
+
+ # Calculate compression ratio if transformation is enabled
+ if options[:calculate_compression]
+ transformed = apply_transformations(content, file_path)
+ original_tokens = TokenEstimator.estimate(content)
+ transformed_tokens = TokenEstimator.estimate(transformed)
+ metadata[:compression] = (transformed_tokens.to_f / original_tokens).round(2)
+ end
+ end
+
+ metadata
  end
 
  # Extracts title from file content or generates from filename
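Note on the compression value computed above: it is `transformed_tokens / original_tokens`, so a lower number means a larger saving (0.4 means the transformed document is 40% of the original size). The sketch below restates that arithmetic as a standalone helper. The name `compression_ratio` and the zero guard are assumptions added for illustration; the diff divides without a guard, which would produce a non-finite float if the estimator ever returned 0 for the original content.

```ruby
# Hypothetical helper mirroring the ratio computed in analyze_file.
# The zero guard is an addition for this sketch and is not in the diff.
def compression_ratio(original_tokens, transformed_tokens)
  return nil if original_tokens.zero?

  (transformed_tokens.to_f / original_tokens).round(2)
end

puts compression_ratio(2800, 1120) # => 0.4 (transformed is 40% of original)
puts compression_ratio(450, 450)   # => 1.0 (no reduction)
p compression_ratio(0, 0)          # => nil
```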
@@ -164,6 +180,21 @@ module LlmDocsBuilder
  7 # default priority
  end
 
+ # Applies transformations to content for compression ratio calculation
+ #
+ # @param content [String] original content
+ # @param file_path [String] path to file
+ # @return [String] transformed content
+ def apply_transformations(content, file_path)
+ transformer = MarkdownTransformer.new(file_path, options)
+
+ # Read file again through transformer to get transformed version
+ transformer.transform
+ rescue StandardError
+ # If transformation fails, return original content
+ content
+ end
+
  # Constructs llms.txt content from analyzed documentation files
  #
  # Combines title, description, and documentation links into formatted output
@@ -186,11 +217,24 @@ module LlmDocsBuilder
 
  docs.each do |doc|
  url = build_url(doc[:path])
- content << if doc[:description] && !doc[:description].empty?
- "- [#{doc[:title]}](#{url}): #{doc[:description]}"
- else
- "- [#{doc[:title]}](#{url})"
- end
+ line = if doc[:description] && !doc[:description].empty?
+ "- [#{doc[:title]}](#{url}): #{doc[:description]}"
+ else
+ "- [#{doc[:title]}](#{url})"
+ end
+
+ # Append metadata if enabled
+ if options[:include_metadata]
+ metadata_parts = []
+ metadata_parts << "tokens:#{doc[:tokens]}" if doc[:tokens]
+ metadata_parts << "compression:#{doc[:compression]}" if doc[:compression]
+ metadata_parts << "updated:#{doc[:updated]}" if doc[:updated]
+ metadata_parts << priority_label(doc[:priority]) if options[:include_priority]
+
+ line += " #{metadata_parts.join(' ')}" unless metadata_parts.empty?
+ end
+
+ content << line
  end
  end
 
@@ -230,5 +274,20 @@ module LlmDocsBuilder
  path
  end
  end
+
+ # Converts numeric priority to human-readable label
+ #
+ # @param priority [Integer] priority value (1-7)
+ # @return [String] priority label (high, medium, low)
+ def priority_label(priority)
+ case priority
+ when 1..2
+ 'priority:high'
+ when 3..5
+ 'priority:medium'
+ when 6..7
+ 'priority:low'
+ end
+ end
  end
  end
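Note on the generator hunks above: the metadata suffix is assembled in a fixed order (tokens, compression, updated, priority), and `priority_label` returns `nil` for any value outside 1-7. The standalone sketch below restates both pieces so the output can be checked in isolation. `metadata_suffix` is a hypothetical name, and the `compact` call is an addition: in the diff a `nil` label would be pushed into the array and joined as an empty string.

```ruby
# Standalone re-statement of the labelling and suffix logic from the hunks above.
# A sketch for illustration, not the gem's code path.
def priority_label(priority)
  case priority
  when 1..2 then 'priority:high'
  when 3..5 then 'priority:medium'
  when 6..7 then 'priority:low'
  end
end

# Builds the metadata suffix in the same order the generator appends it.
# `compact` drops a nil label; that guard is an assumption of this sketch.
def metadata_suffix(doc, include_priority: true)
  parts = []
  parts << "tokens:#{doc[:tokens]}" if doc[:tokens]
  parts << "compression:#{doc[:compression]}" if doc[:compression]
  parts << "updated:#{doc[:updated]}" if doc[:updated]
  parts << priority_label(doc[:priority]) if include_priority
  parts.compact.join(' ')
end

doc = { tokens: 450, updated: '2025-10-13', priority: 1 }
puts metadata_suffix(doc)             # => tokens:450 updated:2025-10-13 priority:high
puts metadata_suffix({ priority: 7 }) # => priority:low
p priority_label(9)                   # => nil (outside 1-7)
```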
@@ -48,9 +48,10 @@ module LlmDocsBuilder
  # Processes content through specialized transformers in order:
  # 1. ContentCleanupTransformer - Removes unwanted elements
  # 2. LinkTransformer - Processes links
- # 3. TextCompressor - Advanced compression (if enabled)
- # 4. EnhancementTransformer - Adds TOC and instructions
- # 5. WhitespaceTransformer - Normalizes whitespace
+ # 3. HeadingTransformer - Normalizes heading hierarchy (if enabled)
+ # 4. TextCompressor - Advanced compression (if enabled)
+ # 5. EnhancementTransformer - Adds TOC and instructions
+ # 6. WhitespaceTransformer - Normalizes whitespace
  #
  # @return [String] transformed markdown content
  def transform
@@ -59,6 +60,7 @@ module LlmDocsBuilder
  # Build and execute transformation pipeline
  content = cleanup_transformer.transform(content, options)
  content = link_transformer.transform(content, options)
+ content = heading_transformer.transform(content, options)
  content = compress_content(content) if should_compress?
  content = enhancement_transformer.transform(content, options)
  content = whitespace_transformer.transform(content, options)
@@ -82,6 +84,13 @@ module LlmDocsBuilder
  @link_transformer ||= Transformers::LinkTransformer.new
  end
 
+ # Get heading transformer instance
+ #
+ # @return [Transformers::HeadingTransformer]
+ def heading_transformer
+ @heading_transformer ||= Transformers::HeadingTransformer.new
+ end
+
  # Get enhancement transformer instance
  #
  # @return [Transformers::EnhancementTransformer]
data/lib/llm_docs_builder/transformers/heading_transformer.rb ADDED
@@ -0,0 +1,72 @@
+ # frozen_string_literal: true
+
+ module LlmDocsBuilder
+ module Transformers
+ # Normalizes headings to include hierarchical context
+ #
+ # Transforms markdown headings to include parent context, making each section
+ # self-contained for RAG systems. This is particularly useful when documents
+ # are chunked and retrieved independently.
+ #
+ # @example Basic heading normalization
+ # # Configuration
+ # ## Consumer Settings
+ # ### auto_offset_reset
+ #
+ # Becomes:
+ # # Configuration
+ # ## Configuration / Consumer Settings
+ # ### Configuration / Consumer Settings / auto_offset_reset
+ #
+ # @api public
+ class HeadingTransformer
+ include BaseTransformer
+
+ # Transform content by normalizing heading hierarchy
+ #
+ # Parses markdown headings and adds parent context to each heading,
+ # making sections self-documenting when retrieved independently.
+ #
+ # @param content [String] markdown content to transform
+ # @param options [Hash] transformation options
+ # @option options [Boolean] :normalize_headings enable heading normalization
+ # @option options [String] :heading_separator separator between heading levels (default: ' / ')
+ # @return [String] transformed content with normalized headings
+ def transform(content, options = {})
+ return content unless options[:normalize_headings]
+
+ separator = options[:heading_separator] || ' / '
+ heading_stack = []
+ lines = content.lines
+
+ transformed_lines = lines.map do |line|
+ # Match markdown headings (1-6 hash symbols followed by space and text)
+ heading_match = line.match(/^(#+)\s+(.+)$/)
+
+ if heading_match && heading_match[1].count('#').between?(1, 6)
+ level = heading_match[1].count('#')
+ title = heading_match[2].strip
+
+ # Update heading stack to current level
+ heading_stack = heading_stack[0...level - 1]
+ heading_stack << title
+
+ # Build hierarchical heading
+ if level == 1
+ # H1 stays as-is (top level)
+ line
+ else
+ # H2+ gets parent context
+ hierarchical_title = heading_stack.join(separator)
+ "#{'#' * level} #{hierarchical_title}\n"
+ end
+ else
+ line
+ end
+ end
+
+ transformed_lines.join
+ end
+ end
+ end
+ end
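Note on `HeadingTransformer#transform` as shown above: the heading regex is applied to every line, so a `#` comment inside a fenced code block (for example `# Enable in config` in a YAML fence) matches as an H1 and resets the heading stack for the headings that follow. The sketch below re-implements the same stack algorithm as a standalone method and adds fence tracking. The fence handling and the method name `normalize_headings` are assumptions for illustration; the 0.8.0 code in this diff does not do this.

```ruby
# Standalone sketch of the heading-stack algorithm with fence tracking added.
# Not the gem's implementation; `normalize_headings` is an illustrative name.
def normalize_headings(markdown, separator: ' / ')
  stack = []
  in_fence = false

  markdown.lines.map do |line|
    # Toggle on backtick or tilde fence markers and leave fenced content untouched.
    if line.match?(/^\s*(`{3,}|~{3,})/)
      in_fence = !in_fence
      next line
    end
    next line if in_fence

    # One to six hash symbols, whitespace, then the title text.
    match = line.match(/^([#]{1,6})\s+(.+)$/)
    next line unless match

    level = match[1].length
    # Drop deeper and sibling entries, then record the current title.
    stack = stack[0...level - 1]
    stack << match[2].strip

    level == 1 ? line : "#{'#' * level} #{stack.join(separator)}\n"
  end.join
end

doc = <<~MD
  # Configuration
  ## Consumer Settings
  ~~~yaml
  # Enable in config
  ~~~
  ### auto_offset_reset
MD

puts normalize_headings(doc)
```

In this sketch the H3 keeps its full `Configuration / Consumer Settings / auto_offset_reset` path because the comment inside the fence is skipped. The simple toggle treats any fence marker as opening or closing, so nested or mismatched fences are out of scope.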
@@ -2,5 +2,5 @@
 
  module LlmDocsBuilder
  # Current version of the LlmDocsBuilder gem
- VERSION = '0.7.0'
+ VERSION = '0.8.0'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: llm-docs-builder
  version: !ruby/object:Gem::Version
- version: 0.7.0
+ version: 0.8.0
  platform: ruby
  authors:
  - Maciej Mensfeld
@@ -140,6 +140,7 @@ files:
  - lib/llm_docs_builder/transformers/base_transformer.rb
  - lib/llm_docs_builder/transformers/content_cleanup_transformer.rb
  - lib/llm_docs_builder/transformers/enhancement_transformer.rb
+ - lib/llm_docs_builder/transformers/heading_transformer.rb
  - lib/llm_docs_builder/transformers/link_transformer.rb
  - lib/llm_docs_builder/transformers/whitespace_transformer.rb
  - lib/llm_docs_builder/validator.rb