llm-docs-builder 0.7.0 → 0.8.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +22 -0
- data/Gemfile.lock +1 -1
- data/README.md +74 -1
- data/lib/llm_docs_builder/config.rb +33 -1
- data/lib/llm_docs_builder/generator.rb +67 -8
- data/lib/llm_docs_builder/markdown_transformer.rb +12 -3
- data/lib/llm_docs_builder/transformers/heading_transformer.rb +72 -0
- data/lib/llm_docs_builder/version.rb +1 -1
- metadata +2 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 85874aed492c7a756acd3fec52228dedc3e3961b2e413a482aafd155e9c01d5e
+  data.tar.gz: de9d71e8bf15aace848995366cf1f6e46f758cbc86f9fa9b5bdd40be9e4695ce
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ee2f1c859b24726a812a3d162f6ea59b79e5a19c86d0df4202b4c4b3ad9fcfbe11ef28ea61784ce3594260952fec937b47858f01a2f0e37929173e367aab4547
+  data.tar.gz: 4d40ca6817d4b7b5d1ce9d9e6290e76a7cd145159b0ea8ac984e0e12d127cdc861e5a9e351e1cfbf10f93243c1e63956c552b5e3db4edbfdf9cd76c97859efea
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,27 @@
 # Changelog
 
+## 0.8.0 (2025-10-14)
+- [Feature] **RAG Enhancement: Heading Normalization** - Transform headings to include hierarchical context for better RAG retrieval.
+  - Adds parent context to H2-H6 headings (e.g., "Configuration / Consumer Settings / auto_offset_reset")
+  - Makes each section self-contained when documents are chunked
+  - Configurable separator (default: " / ")
+  - Enable with `normalize_headings: true`
+  - Perfect for vector databases and RAG systems
+- [Feature] **RAG Enhancement: Enhanced llms.txt Metadata** - Generate enriched llms.txt files with machine-readable metadata.
+  - Token counts per document (helps AI agents manage context windows)
+  - Last modified timestamps (helps prefer recent docs)
+  - Priority labels: high/medium/low (helps guide which docs to fetch first)
+  - Optional compression ratios (shows optimization effectiveness)
+  - Enable with `include_metadata: true`, `include_tokens: true`, `include_timestamps: true`, `include_priority: true`
+- [Enhancement] Added `HeadingTransformer` class with comprehensive heading hierarchy tracking.
+- [Enhancement] Added priority calculation in Generator (README=high, getting started=high, tutorials=medium, etc.).
+- [Enhancement] Updated `Config#merge_with_options` to support all new RAG options.
+- [Testing] Added 10 comprehensive tests for HeadingTransformer covering edge cases.
+- [Testing] All 303 tests passing with 96.94% line coverage and 85.59% branch coverage.
+- [Documentation] Added "RAG Enhancement Features" section to README with examples and use cases.
+- [Documentation] Added detailed implementation guide in RAG_FEATURES.md.
+- [Documentation] Added example RAG configuration in examples/rag-config.yml.
+
 ## 0.7.0 (2025-10-09)
 - [Feature] **Advanced Token Optimization** - Added 8 new compression options to reduce token consumption:
   - `remove_code_examples`: Remove code blocks and inline code
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -5,7 +5,7 @@
 
 **Optimize your documentation for LLMs and RAG systems. Reduce token consumption by 67-95%.**
 
-llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content,
+llm-docs-builder transforms markdown documentation to be AI-friendly and generates llms.txt files. It normalizes links, removes unnecessary content, optimizes documents for LLM context windows, and enhances documents for RAG retrieval with hierarchical heading context and metadata.
 
 ## The Problem
 
@@ -104,6 +104,15 @@ simplify_links: true
 generate_toc: true
 custom_instruction: "This documentation is optimized for AI consumption"
 
+# RAG enhancement options
+normalize_headings: true # Add hierarchical context to headings
+heading_separator: " / " # Separator for heading hierarchy
+include_metadata: true # Enable enhanced llms.txt metadata
+include_tokens: true # Include token counts in llms.txt
+include_timestamps: true # Include update timestamps in llms.txt
+include_priority: true # Include priority labels in llms.txt
+calculate_compression: false # Calculate compression ratios (slower)
+
 # Exclusions
 excludes:
   - "**/private/**"
@@ -290,6 +299,70 @@ Use `llm-docs-builder compare` to measure before and after.
 **Q: What about private documentation?**
 Use the `excludes` option to skip sensitive files.
 
+## RAG Enhancement Features
+
+### Heading Normalization
+
+Transform headings to include hierarchical context, making each section self-contained for RAG retrieval:
+
+**Before:**
+```markdown
+# Configuration
+## Consumer Settings
+### auto_offset_reset
+
+Controls behavior when no offset exists...
+```
+
+**After (with `normalize_headings: true`):**
+```markdown
+# Configuration
+## Configuration / Consumer Settings
+### Configuration / Consumer Settings / auto_offset_reset
+
+Controls behavior when no offset exists...
+```
+
+**Why this matters for RAG:** When documents are chunked and retrieved independently, each section retains full context. An LLM seeing just the `auto_offset_reset` section knows it's about "Configuration / Consumer Settings / auto_offset_reset" not just generic "auto_offset_reset".
+
+```yaml
+# Enable in config
+normalize_headings: true
+heading_separator: " / " # Customize separator (default: " / ")
+```
+
+### Enhanced llms.txt Metadata
+
+Generate enriched llms.txt files with token counts, timestamps, and priority labels to help AI agents make better decisions:
+
+**Standard llms.txt:**
+```markdown
+- [Getting Started](https://myproject.io/docs/Getting-Started.md)
+- [Configuration](https://myproject.io/docs/Configuration.md)
+```
+
+**Enhanced llms.txt (with metadata enabled):**
+```markdown
+- [Getting Started](https://myproject.io/docs/Getting-Started.md) tokens:450 updated:2025-10-13 priority:high
+- [Configuration](https://myproject.io/docs/Configuration.md) tokens:2800 updated:2025-10-12 priority:high
+- [Advanced Topics](https://myproject.io/docs/Advanced.md) tokens:5200 updated:2025-09-15 priority:medium
+```
+
+**Benefits:**
+- AI agents can see token counts → load multiple small docs vs one large doc
+- Timestamps help prefer recent documentation
+- Priority signals guide which docs to fetch first
+- Compression ratios show optimization effectiveness
+
+```yaml
+# Enable in config
+include_metadata: true # Master switch
+include_tokens: true # Show token counts
+include_timestamps: true # Show last modified dates
+include_priority: true # Show priority labels (high/medium/low)
+calculate_compression: true # Show compression ratios (slower, requires transformation)
+```
+
 ## Advanced Compression Options
 
 All compression features can be used individually for fine-grained control:
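Illustrative sketch, not part of the released package: the README describes the enhanced llms.txt line format from the producer side. A consumer could read those lines back and use the token counts to stay inside a context budget, roughly as below. The helper names (`parse_llms_entry`, `select_within_budget`) and the greedy selection rule are assumptions made for this example; only the line format and the `tokens:`, `compression:`, `updated:` and `priority:` keys come from the diff.

```ruby
# Matches "- [Title](url)" followed by anything (description and/or metadata).
ENTRY_PATTERN = /\A- \[(?<title>[^\]]+)\]\((?<url>[^)]+)\)(?<rest>.*)\z/
# key:value tokens used by the enhanced format shown in the README hunk.
META_PATTERN = /(?<!\S)(tokens|compression|updated|priority):(\S+)/
PRIORITY_RANK = { 'high' => 0, 'medium' => 1, 'low' => 2 }.freeze

# Returns a hash for one llms.txt entry line, or nil if the line is not an entry.
def parse_llms_entry(line)
  match = ENTRY_PATTERN.match(line.strip)
  return nil unless match

  meta = match[:rest].scan(META_PATTERN).to_h
  {
    title: match[:title],
    url: match[:url],
    tokens: meta['tokens']&.to_i,
    updated: meta['updated'],
    priority: meta['priority']
  }
end

# Greedy selection: highest priority first, then smallest documents,
# skipping any document that no longer fits in the remaining budget.
def select_within_budget(entries, budget)
  remaining = budget
  ordered = entries.sort_by { |e| [PRIORITY_RANK.fetch(e[:priority], 3), e[:tokens] || 0] }
  ordered.select do |entry|
    cost = entry[:tokens] || 0
    next false if cost > remaining

    remaining -= cost
    true
  end
end

SAMPLE_LINES = [
  '- [Getting Started](https://myproject.io/docs/Getting-Started.md) tokens:450 updated:2025-10-13 priority:high',
  '- [Configuration](https://myproject.io/docs/Configuration.md) tokens:2800 updated:2025-10-12 priority:high',
  '- [Advanced Topics](https://myproject.io/docs/Advanced.md) tokens:5200 updated:2025-09-15 priority:medium'
].freeze

entries = SAMPLE_LINES.map { |line| parse_llms_entry(line) }
puts select_within_budget(entries, 4000).map { |e| e[:title] }.inspect
```

With a budget of 4000 tokens the two high-priority documents fit and the 5200-token one is skipped. A description that itself contains text such as `priority:high` would confuse this simple scanner, so a stricter parser may be needed for real files.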
data/lib/llm_docs_builder/config.rb
CHANGED
@@ -128,7 +128,39 @@ module LlmDocsBuilder
           options[:remove_duplicates]
         else
           self['remove_duplicates'] || false
-        end
+        end,
+        # New RAG enhancement options
+        normalize_headings: if options.key?(:normalize_headings)
+          options[:normalize_headings]
+        else
+          self['normalize_headings'] || false
+        end,
+        heading_separator: options[:heading_separator] || self['heading_separator'] || ' / ',
+        include_metadata: if options.key?(:include_metadata)
+          options[:include_metadata]
+        else
+          self['include_metadata'] || false
+        end,
+        include_tokens: if options.key?(:include_tokens)
+          options[:include_tokens]
+        else
+          self['include_tokens'] || false
+        end,
+        include_timestamps: if options.key?(:include_timestamps)
+          options[:include_timestamps]
+        else
+          self['include_timestamps'] || false
+        end,
+        include_priority: if options.key?(:include_priority)
+          options[:include_priority]
+        else
+          self['include_priority'] || false
+        end,
+        calculate_compression: if options.key?(:calculate_compression)
+          options[:calculate_compression]
+        else
+          self['calculate_compression'] || false
+        end
       }
     end
 
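Illustrative sketch, not part of the released package: the boolean options in the hunk above are resolved with `options.key?` rather than a plain `||` chain, which lets an explicit `false` passed by the caller override a `true` in the config file. The standalone helpers below (`resolve_flag`, `resolve_separator`, both hypothetical names) reproduce that pattern with plain hashes so the difference can be seen in isolation; `file_config` stands in for the values read via `self[...]`.

```ruby
# Hypothetical helper mirroring the `options.key?` pattern used for boolean flags.
def resolve_flag(options, file_config, name)
  if options.key?(name)
    options[name] # an explicit value wins, even when it is false
  else
    file_config[name.to_s] || false
  end
end

# Hypothetical helper mirroring the `||` chain used for heading_separator.
# A nil or missing value falls through to the next source.
def resolve_separator(options, file_config)
  options[:heading_separator] || file_config['heading_separator'] || ' / '
end

file_config = { 'normalize_headings' => true, 'heading_separator' => ' | ' }

p resolve_flag({ normalize_headings: false }, file_config, :normalize_headings) # false
p resolve_flag({}, file_config, :normalize_headings)                            # true
p resolve_flag({}, file_config, :include_tokens)                                # false
p resolve_separator({}, file_config)                                            # " | "
p resolve_separator({}, {})                                                     # " / "
```

A `||` chain would be wrong for the flags because `false || true` evaluates to `true`, silently discarding the caller's explicit `false`. For the separator, which is a string, the simpler chain is sufficient.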
data/lib/llm_docs_builder/generator.rb
CHANGED
@@ -88,10 +88,10 @@ module LlmDocsBuilder
 
     # Extracts metadata from a documentation file
     #
-    # Analyzes file content to extract title, description, and
+    # Analyzes file content to extract title, description, priority, and optional metadata
     #
     # @param file_path [String] path to file to analyze
-    # @return [Hash] file metadata with :path, :title, :description, :priority
+    # @return [Hash] file metadata with :path, :title, :description, :priority, :tokens, :updated
     def analyze_file(file_path)
       # Handle single file case differently
       relative_path = if File.file?(docs_path)
@@ -102,12 +102,28 @@ module LlmDocsBuilder
 
       content = File.read(file_path)
 
-      {
+      metadata = {
         path: relative_path,
         title: extract_title(content, file_path),
         description: extract_description(content),
         priority: calculate_priority(file_path)
       }
+
+      # Add optional enhanced metadata
+      if options[:include_metadata]
+        metadata[:tokens] = TokenEstimator.estimate(content) if options[:include_tokens]
+        metadata[:updated] = File.mtime(file_path).strftime('%Y-%m-%d') if options[:include_timestamps]
+
+        # Calculate compression ratio if transformation is enabled
+        if options[:calculate_compression]
+          transformed = apply_transformations(content, file_path)
+          original_tokens = TokenEstimator.estimate(content)
+          transformed_tokens = TokenEstimator.estimate(transformed)
+          metadata[:compression] = (transformed_tokens.to_f / original_tokens).round(2)
+        end
+      end
+
+      metadata
     end
 
     # Extracts title from file content or generates from filename
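Illustrative sketch, not part of the released package: the `:compression` value computed above is the transformed token count divided by the original token count, so smaller numbers mean more compression (0.6 means the transformed text is about 60% of the original). The example below shows that arithmetic in isolation. `estimate_tokens` is a stand-in that assumes roughly four characters per token; the gem's `TokenEstimator` is not shown in this diff and may work differently. The zero-token guard is an addition made for this sketch.

```ruby
# Stand-in estimator (assumption): about four characters per token.
def estimate_tokens(text)
  (text.length / 4.0).ceil
end

# Ratio of transformed tokens to original tokens, rounded to two decimals.
# Returns nil for empty input instead of dividing by zero.
def compression_ratio(original, transformed)
  original_tokens = estimate_tokens(original)
  return nil if original_tokens.zero?

  (estimate_tokens(transformed).to_f / original_tokens).round(2)
end

original = 'a' * 400    # about 100 tokens under the stand-in estimator
transformed = 'a' * 240 # about 60 tokens

ratio = compression_ratio(original, transformed)
puts "compression:#{ratio}" # prints "compression:0.6"
```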
@@ -164,6 +180,21 @@ module LlmDocsBuilder
       7 # default priority
     end
 
+    # Applies transformations to content for compression ratio calculation
+    #
+    # @param content [String] original content
+    # @param file_path [String] path to file
+    # @return [String] transformed content
+    def apply_transformations(content, file_path)
+      transformer = MarkdownTransformer.new(file_path, options)
+
+      # Read file again through transformer to get transformed version
+      transformer.transform
+    rescue StandardError
+      # If transformation fails, return original content
+      content
+    end
+
     # Constructs llms.txt content from analyzed documentation files
     #
     # Combines title, description, and documentation links into formatted output
@@ -186,11 +217,24 @@ module LlmDocsBuilder
 
         docs.each do |doc|
           url = build_url(doc[:path])
-
-
-
-
-
+          line = if doc[:description] && !doc[:description].empty?
+            "- [#{doc[:title]}](#{url}): #{doc[:description]}"
+          else
+            "- [#{doc[:title]}](#{url})"
+          end
+
+          # Append metadata if enabled
+          if options[:include_metadata]
+            metadata_parts = []
+            metadata_parts << "tokens:#{doc[:tokens]}" if doc[:tokens]
+            metadata_parts << "compression:#{doc[:compression]}" if doc[:compression]
+            metadata_parts << "updated:#{doc[:updated]}" if doc[:updated]
+            metadata_parts << priority_label(doc[:priority]) if options[:include_priority]
+
+            line += " #{metadata_parts.join(' ')}" unless metadata_parts.empty?
+          end
+
+          content << line
         end
       end
 
@@ -230,5 +274,20 @@ module LlmDocsBuilder
         path
       end
     end
+
+    # Converts numeric priority to human-readable label
+    #
+    # @param priority [Integer] priority value (1-7)
+    # @return [String] priority label (high, medium, low)
+    def priority_label(priority)
+      case priority
+      when 1..2
+        'priority:high'
+      when 3..5
+        'priority:medium'
+      when 6..7
+        'priority:low'
+      end
+    end
   end
 end
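Illustrative sketch, not part of the released package: the two generator.rb hunks above build each llms.txt line and then append metadata in a fixed order (tokens, compression, updated, priority). The standalone version below reproduces that logic on plain hashes so the resulting line can be inspected. `format_entry` and `metadata_suffix` are hypothetical names; `priority_label` follows the ranges shown in the diff. The `compact` call is an addition for this sketch: it drops the `nil` that `priority_label` returns for values outside 1..7.

```ruby
# Maps the numeric priority to the label used in llms.txt (ranges from the diff).
def priority_label(priority)
  case priority
  when 1..2 then 'priority:high'
  when 3..5 then 'priority:medium'
  when 6..7 then 'priority:low'
  end
end

# Builds the trailing metadata string, or '' when metadata is disabled or empty.
def metadata_suffix(doc, options)
  return '' unless options[:include_metadata]

  parts = []
  parts << "tokens:#{doc[:tokens]}" if doc[:tokens]
  parts << "compression:#{doc[:compression]}" if doc[:compression]
  parts << "updated:#{doc[:updated]}" if doc[:updated]
  parts << priority_label(doc[:priority]) if options[:include_priority]
  parts = parts.compact # sketch-only safeguard against out-of-range priorities
  parts.empty? ? '' : " #{parts.join(' ')}"
end

# Builds one complete llms.txt entry line.
def format_entry(doc, url, options)
  base = "- [#{doc[:title]}](#{url})"
  base += ": #{doc[:description]}" if doc[:description] && !doc[:description].empty?
  base + metadata_suffix(doc, options)
end

doc = { title: 'Getting Started', tokens: 450, updated: '2025-10-13', priority: 1 }
options = { include_metadata: true, include_priority: true }
puts format_entry(doc, 'https://myproject.io/docs/Getting-Started.md', options)
```

The printed line matches the first entry of the "Enhanced llms.txt" example in the README hunk. Note that `include_metadata` acts as the master switch: with it off, the other flags have no effect on the output.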
data/lib/llm_docs_builder/markdown_transformer.rb
CHANGED
@@ -48,9 +48,10 @@ module LlmDocsBuilder
     # Processes content through specialized transformers in order:
     # 1. ContentCleanupTransformer - Removes unwanted elements
     # 2. LinkTransformer - Processes links
-    # 3.
-    # 4.
-    # 5.
+    # 3. HeadingTransformer - Normalizes heading hierarchy (if enabled)
+    # 4. TextCompressor - Advanced compression (if enabled)
+    # 5. EnhancementTransformer - Adds TOC and instructions
+    # 6. WhitespaceTransformer - Normalizes whitespace
     #
     # @return [String] transformed markdown content
     def transform
@@ -59,6 +60,7 @@ module LlmDocsBuilder
       # Build and execute transformation pipeline
       content = cleanup_transformer.transform(content, options)
       content = link_transformer.transform(content, options)
+      content = heading_transformer.transform(content, options)
       content = compress_content(content) if should_compress?
       content = enhancement_transformer.transform(content, options)
       content = whitespace_transformer.transform(content, options)
@@ -82,6 +84,13 @@ module LlmDocsBuilder
       @link_transformer ||= Transformers::LinkTransformer.new
     end
 
+    # Get heading transformer instance
+    #
+    # @return [Transformers::HeadingTransformer]
+    def heading_transformer
+      @heading_transformer ||= Transformers::HeadingTransformer.new
+    end
+
     # Get enhancement transformer instance
     #
     # @return [Transformers::EnhancementTransformer]
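Illustrative sketch, not part of the released package: the pipeline above passes `content` through a fixed sequence of steps, each taking `(content, options)` and returning new content, with disabled steps returning their input unchanged. The same shape can be written as a fold over a list of callables. The two lambdas here are simplified stand-ins invented for this example (including the `:collapse_blank_lines` flag); they are not the gem's transformers.

```ruby
# Stand-in cleanup step: strips HTML comments.
STRIP_COMMENTS = ->(content, _options) { content.gsub(/<!--.*?-->/m, '') }

# Stand-in optional step: collapses runs of blank lines when enabled.
COLLAPSE_BLANKS = lambda do |content, options|
  # A disabled step returns its input unchanged.
  return content unless options[:collapse_blank_lines]

  content.gsub(/\n{3,}/, "\n\n")
end

# Applies each step in order, feeding the output of one into the next.
def run_pipeline(content, options, steps)
  steps.reduce(content) { |acc, step| step.call(acc, options) }
end

input = "# Title\n<!-- note -->\n\n\n\nBody\n"
puts run_pipeline(input, { collapse_blank_lines: true }, [STRIP_COMMENTS, COLLAPSE_BLANKS]).inspect
```

Order matters in such a chain: here the blank-line step only has something to collapse once the comment has been removed, which is the same reason the whitespace transformer runs last in the diff.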
data/lib/llm_docs_builder/transformers/heading_transformer.rb
ADDED
@@ -0,0 +1,72 @@
+# frozen_string_literal: true
+
+module LlmDocsBuilder
+  module Transformers
+    # Normalizes headings to include hierarchical context
+    #
+    # Transforms markdown headings to include parent context, making each section
+    # self-contained for RAG systems. This is particularly useful when documents
+    # are chunked and retrieved independently.
+    #
+    # @example Basic heading normalization
+    #   # Configuration
+    #   ## Consumer Settings
+    #   ### auto_offset_reset
+    #
+    #   Becomes:
+    #   # Configuration
+    #   ## Configuration / Consumer Settings
+    #   ### Configuration / Consumer Settings / auto_offset_reset
+    #
+    # @api public
+    class HeadingTransformer
+      include BaseTransformer
+
+      # Transform content by normalizing heading hierarchy
+      #
+      # Parses markdown headings and adds parent context to each heading,
+      # making sections self-documenting when retrieved independently.
+      #
+      # @param content [String] markdown content to transform
+      # @param options [Hash] transformation options
+      # @option options [Boolean] :normalize_headings enable heading normalization
+      # @option options [String] :heading_separator separator between heading levels (default: ' / ')
+      # @return [String] transformed content with normalized headings
+      def transform(content, options = {})
+        return content unless options[:normalize_headings]
+
+        separator = options[:heading_separator] || ' / '
+        heading_stack = []
+        lines = content.lines
+
+        transformed_lines = lines.map do |line|
+          # Match markdown headings (1-6 hash symbols followed by space and text)
+          heading_match = line.match(/^(#+)\s+(.+)$/)
+
+          if heading_match && heading_match[1].count('#').between?(1, 6)
+            level = heading_match[1].count('#')
+            title = heading_match[2].strip
+
+            # Update heading stack to current level
+            heading_stack = heading_stack[0...level - 1]
+            heading_stack << title
+
+            # Build hierarchical heading
+            if level == 1
+              # H1 stays as-is (top level)
+              line
+            else
+              # H2+ gets parent context
+              hierarchical_title = heading_stack.join(separator)
+              "#{'#' * level} #{hierarchical_title}\n"
+            end
+          else
+            line
+          end
+        end
+
+        transformed_lines.join
+      end
+    end
+  end
+end
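Illustrative sketch, not part of the released package: the heading-stack technique above can be tried outside the gem with a small standalone function. This is a variant, not the shipped code, and it differs in two labeled ways. First, it skips lines inside fenced code blocks, because the regex in the diff is applied to every line and would appear to treat a `# comment` inside a fence as an H1. Second, it stores each title at the index of its level, because appending to the stack (as the diff does) would appear to nest a second H3 under the first when an H2 is skipped. The function name `normalize_headings` and its keyword argument are assumptions for this example.

```ruby
# Adds parent context to H2-H6 headings. Variant of the technique in the diff:
# fence-aware and level-indexed (both are additions made for this sketch).
def normalize_headings(markdown, separator: ' / ')
  stack = []
  in_fence = false

  markdown.lines.map do |line|
    # Toggle fence state on lines opening or closing a fenced code block.
    if line.match?(/\A\s{0,3}(`{3,}|~{3,})/)
      in_fence = !in_fence
      next line
    end
    next line if in_fence

    match = line.match(/\A(#+)\s+(.+?)\s*\z/)
    next line unless match && match[1].length <= 6

    level = match[1].length
    # Keep only ancestors, then store the title at the slot for its level.
    # Assigning past the end pads skipped levels with nil.
    stack = stack[0...level - 1]
    stack[level - 1] = match[2]
    next line if level == 1 # H1 stays as-is

    "#{'#' * level} #{stack.compact.join(separator)}\n"
  end.join
end

sample = <<~MD
  # Configuration
  ## Consumer Settings
  ### auto_offset_reset

  Controls behavior when no offset exists...
MD

puts normalize_headings(sample)
```

For the sample input this prints the same "After" text as the README example. Passing `separator: ' > '` changes only the joining string.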
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: llm-docs-builder
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.8.0
 platform: ruby
 authors:
 - Maciej Mensfeld
@@ -140,6 +140,7 @@ files:
 - lib/llm_docs_builder/transformers/base_transformer.rb
 - lib/llm_docs_builder/transformers/content_cleanup_transformer.rb
 - lib/llm_docs_builder/transformers/enhancement_transformer.rb
+- lib/llm_docs_builder/transformers/heading_transformer.rb
 - lib/llm_docs_builder/transformers/link_transformer.rb
 - lib/llm_docs_builder/transformers/whitespace_transformer.rb
 - lib/llm_docs_builder/validator.rb