llm-docs-builder 0.9.4 → 0.11.0
- checksums.yaml +4 -4
- data/.github/workflows/push.yml +1 -1
- data/AGENTS.md +20 -0
- data/CHANGELOG.md +20 -0
- data/Gemfile.lock +17 -17
- data/README.md +69 -3
- data/lib/llm_docs_builder/cli.rb +32 -10
- data/lib/llm_docs_builder/comparator.rb +5 -75
- data/lib/llm_docs_builder/config.rb +10 -2
- data/lib/llm_docs_builder/generator.rb +54 -21
- data/lib/llm_docs_builder/markdown_transformer.rb +12 -1
- data/lib/llm_docs_builder/parser.rb +5 -3
- data/lib/llm_docs_builder/url_fetcher.rb +120 -0
- data/lib/llm_docs_builder/version.rb +1 -1
- data/lib/llm_docs_builder.rb +1 -0
- metadata +3 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: deeae74a329018b4a43d7845a3be8b7c31347699c3ff7abd93d7b697a48982a3
|
|
4
|
+
data.tar.gz: f9c842caa93a45b4d75c45a15e116e6f98d8463e5268e2de32e498c725877e4f
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 94575eced147bd6740b5395acd41d3f46ffcadf40908831df081d5e03f56b35a2e1e9acfdfc7642af775b2aa86fe48ea322dd11baf48512d2f2ef43a1a491079
|
|
7
|
+
data.tar.gz: 52b7d40d4a95acd20a408f4d453f32c7154637ce42ec210cf67010ce10dbe14ad711c4cbfd4060d8ce5018b91668315d552732ea7118374e69228a869792ff0f
|
data/.github/workflows/push.yml
CHANGED
data/AGENTS.md
ADDED
|
@@ -0,0 +1,20 @@
|
|
|
1
|
+
# Repository Guidelines
|
|
2
|
+
|
|
3
|
+
## Project Structure & Module Organization
|
|
4
|
+
Core gem code lives in `lib/llm_docs_builder`, with single-responsibility modules such as `generator.rb`, `validator.rb`, and the CLI glue in `cli.rb`. Shared entrypoint `lib/llm_docs_builder.rb` wires dependencies. Executables reside in `bin/`: `llm-docs-builder` boots the CLI, while `rspecs` runs the full test matrix. Specs mirror library files under `spec/` with command-level coverage in `spec/integrations`. Static assets (logos, diff screenshots) are in `misc/`. Example configuration templates live at `llm-docs-builder.yml.example`.
|
|
5
|
+
|
|
6
|
+
## Build, Test, and Development Commands
|
|
7
|
+
- `bundle install` — sync gem dependencies defined in `Gemfile`.
|
|
8
|
+
- `bundle exec rake` — default task; runs RSpec and RuboCop together.
|
|
9
|
+
- `bundle exec rspec` or `bin/rspecs` — execute unit and integration specs with doc formatter.
|
|
10
|
+
- `bundle exec rubocop` — enforce the Ruby style guide; mirrors CI.
|
|
11
|
+
- `bin/llm-docs-builder transform --docs README.md` — smoke-test the CLI against a local file.
|
|
12
|
+
|
|
13
|
+
## Coding Style & Naming Conventions
|
|
14
|
+
Target Ruby 3.2 with two-space indentation and trailing newline. Prefer single-quoted strings; enable `# frozen_string_literal: true` headers on Ruby files. Keep lines ≤120 characters except where the RuboCop config allows. Use descriptive module/class names (e.g., `LlmDocsBuilder::Generator`) and predicate methods ending with `?` when returning booleans. Place supporting fixtures in `spec/support` if added, and name files after the class they extend.
|
|
15
|
+
|
|
16
|
+
## Testing Guidelines
|
|
17
|
+
RSpec is the sole testing framework. Name files `*_spec.rb` and align describe blocks with constant paths. Integration scenarios belong in `spec/integrations` to capture CLI behaviors. SimpleCov is enabled by default for line and branch coverage; export `SIMPLECOV=false` for quick local runs. Persist example statuses with the automatically managed `spec/examples.txt`.
|
|
18
|
+
|
|
19
|
+
## Commit & Pull Request Guidelines
|
|
20
|
+
Keep commit subjects short, present-tense, and focused (e.g., `Align CLI config (#27)`). Group related changes together so `git log` remains readable. Pull requests should describe motivation, summarize behavioral impact, link related issues or discussions, and include CLI output or screenshots when touching generated docs. Ensure CI passes (`bundle exec rake`) before requesting review, and note any follow-up work in the PR description.
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,25 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## 0.11.0 (2025-11-03)
|
|
4
|
+
- [Feature] **Transform from URL** — The `transform` command now accepts a remote URL via `--url` and processes fetched content through the standard transformer pipeline.
|
|
5
|
+
- Example: `llm-docs-builder transform --url https://example.com/docs/page.html`
|
|
6
|
+
- Applies all configured transformations and output options identically to local files
|
|
7
|
+
- By @Eric-Guo and @codex in PR #28.
|
|
8
|
+
|
|
9
|
+
## 0.10.0 (2025-10-27)
|
|
10
|
+
- [Feature] **llms.txt Specification Compliance** - Updated output format to fully comply with the llms.txt specification from llmstxt.org.
|
|
11
|
+
- **Metadata Format**: Metadata now appears within the description field using parentheses and comma separators: `- [title](url): description (tokens:450, updated:2025-10-13, priority:high)`
|
|
12
|
+
- **Optional Descriptions**: Parser now correctly handles links without descriptions: `- [title](url)` per spec
|
|
13
|
+
- **Multi-Section Support**: Documents automatically organized into `Documentation`, `Examples`, and `Optional` sections based on priority
|
|
14
|
+
- **Body Content Support**: Added optional `body` config parameter for custom content between description and sections
|
|
15
|
+
- Priority-based categorization: 1-3 → Documentation, 4-5 → Examples, 6-7 → Optional
|
|
16
|
+
- Empty sections are automatically omitted from output
|
|
17
|
+
- Updated parser regex from `/^[-*]\s*\[([^\]]+)\]\(([^)]+)\):\s*(.*)$/m` to `/^[-*]\s*\[([^\]]+)\]\(([^)]+)\)(?::\s*([^\n]*))?$/` to make descriptions optional
|
|
18
|
+
- Fixed multiline regex greedy matching issue that was capturing only one link per section
|
|
19
|
+
- [Test] Added comprehensive test suite for spec compliance (8 new parser tests, 7 new generator tests)
|
|
20
|
+
- [Docs] Updated README with multi-section organization examples and body content usage
|
|
21
|
+
- **Breaking Change**: Metadata format has changed from `tokens:450 updated:2025-10-13` to `(tokens:450, updated:2025-10-13)` for spec compliance
|
|
22
|
+
|
|
3
23
|
## 0.9.4 (2025-10-27)
|
|
4
24
|
- [Feature] **Auto-Exclude Hidden Directories** - Hidden directories (starting with `.`) are now automatically excluded by default to prevent noise from `.git`, `.lint`, `.github`, etc.
|
|
5
25
|
- Adds `include_hidden: false` as default behavior
|
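The 0.10.0 entry above moves metadata into the description field as `(key:value, key:value)`. For a consumer that needs to read it back, a minimal sketch could look like the following. `split_metadata` is a hypothetical helper written for illustration and is not part of the gem; it assumes the metadata group is the last parenthesized part of the description.

```ruby
# Hypothetical helper (not part of llm-docs-builder): separates the free-text
# description from a trailing "(key:value, key:value)" metadata group.
def split_metadata(description)
  match = description.match(/\A(?<text>.*?)\s*\((?<meta>[^()]*:[^()]*)\)\z/)
  return [description, {}] unless match

  pairs = match[:meta]
          .split(',')
          .map { |pair| pair.strip.split(':', 2) }
          .select { |key_value| key_value.size == 2 } # ignore parts without a colon
  [match[:text], pairs.to_h]
end

text, meta = split_metadata('Quick start guide (tokens:450, updated:2025-10-13, priority:high)')
puts text          # Quick start guide
puts meta.inspect
```

Descriptions without a trailing metadata group are returned unchanged with an empty hash, so the helper can be applied to every entry.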
data/Gemfile.lock
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
PATH
|
|
2
2
|
remote: .
|
|
3
3
|
specs:
|
|
4
|
-
llm-docs-builder (0.
|
|
4
|
+
llm-docs-builder (0.11.0)
|
|
5
5
|
zeitwerk (~> 2.6)
|
|
6
6
|
|
|
7
7
|
GEM
|
|
@@ -12,15 +12,15 @@ GEM
|
|
|
12
12
|
coderay (1.1.3)
|
|
13
13
|
diff-lcs (1.6.2)
|
|
14
14
|
docile (1.4.1)
|
|
15
|
-
json (2.
|
|
15
|
+
json (2.15.2)
|
|
16
16
|
language_server-protocol (3.17.0.5)
|
|
17
17
|
lint_roller (1.1.0)
|
|
18
18
|
method_source (1.1.0)
|
|
19
19
|
parallel (1.27.0)
|
|
20
|
-
parser (3.3.
|
|
20
|
+
parser (3.3.10.0)
|
|
21
21
|
ast (~> 2.4.1)
|
|
22
22
|
racc
|
|
23
|
-
prism (1.
|
|
23
|
+
prism (1.6.0)
|
|
24
24
|
pry (0.15.2)
|
|
25
25
|
coderay (~> 1.1)
|
|
26
26
|
method_source (~> 1.0)
|
|
@@ -29,22 +29,22 @@ GEM
|
|
|
29
29
|
pry (>= 0.13, < 0.16)
|
|
30
30
|
racc (1.8.1)
|
|
31
31
|
rainbow (3.1.1)
|
|
32
|
-
rake (13.3.
|
|
33
|
-
regexp_parser (2.11.
|
|
34
|
-
rspec (3.13.
|
|
32
|
+
rake (13.3.1)
|
|
33
|
+
regexp_parser (2.11.3)
|
|
34
|
+
rspec (3.13.2)
|
|
35
35
|
rspec-core (~> 3.13.0)
|
|
36
36
|
rspec-expectations (~> 3.13.0)
|
|
37
37
|
rspec-mocks (~> 3.13.0)
|
|
38
|
-
rspec-core (3.13.
|
|
38
|
+
rspec-core (3.13.6)
|
|
39
39
|
rspec-support (~> 3.13.0)
|
|
40
40
|
rspec-expectations (3.13.5)
|
|
41
41
|
diff-lcs (>= 1.2.0, < 2.0)
|
|
42
42
|
rspec-support (~> 3.13.0)
|
|
43
|
-
rspec-mocks (3.13.
|
|
43
|
+
rspec-mocks (3.13.6)
|
|
44
44
|
diff-lcs (>= 1.2.0, < 2.0)
|
|
45
45
|
rspec-support (~> 3.13.0)
|
|
46
|
-
rspec-support (3.13.
|
|
47
|
-
rubocop (1.
|
|
46
|
+
rspec-support (3.13.6)
|
|
47
|
+
rubocop (1.81.6)
|
|
48
48
|
json (~> 2.3)
|
|
49
49
|
language_server-protocol (~> 3.17.0.2)
|
|
50
50
|
lint_roller (~> 1.1.0)
|
|
@@ -52,10 +52,10 @@ GEM
|
|
|
52
52
|
parser (>= 3.3.0.2)
|
|
53
53
|
rainbow (>= 2.2.2, < 4.0)
|
|
54
54
|
regexp_parser (>= 2.9.3, < 3.0)
|
|
55
|
-
rubocop-ast (>= 1.
|
|
55
|
+
rubocop-ast (>= 1.47.1, < 2.0)
|
|
56
56
|
ruby-progressbar (~> 1.7)
|
|
57
57
|
unicode-display_width (>= 2.4.0, < 4.0)
|
|
58
|
-
rubocop-ast (1.
|
|
58
|
+
rubocop-ast (1.47.1)
|
|
59
59
|
parser (>= 3.3.7.2)
|
|
60
60
|
prism (~> 1.4)
|
|
61
61
|
ruby-progressbar (1.13.0)
|
|
@@ -65,9 +65,9 @@ GEM
|
|
|
65
65
|
simplecov_json_formatter (~> 0.1)
|
|
66
66
|
simplecov-html (0.13.2)
|
|
67
67
|
simplecov_json_formatter (0.1.4)
|
|
68
|
-
unicode-display_width (3.
|
|
69
|
-
unicode-emoji (~> 4.
|
|
70
|
-
unicode-emoji (4.0
|
|
68
|
+
unicode-display_width (3.2.0)
|
|
69
|
+
unicode-emoji (~> 4.1)
|
|
70
|
+
unicode-emoji (4.1.0)
|
|
71
71
|
zeitwerk (2.7.3)
|
|
72
72
|
|
|
73
73
|
PLATFORMS
|
|
@@ -85,4 +85,4 @@ DEPENDENCIES
|
|
|
85
85
|
simplecov (~> 0.21)
|
|
86
86
|
|
|
87
87
|
BUNDLED WITH
|
|
88
|
-
2.7.
|
|
88
|
+
2.7.2
|
data/README.md
CHANGED
|
@@ -61,6 +61,9 @@ Factor: 2.8x smaller
|
|
|
61
61
|
# Single file
|
|
62
62
|
llm-docs-builder transform --docs README.md
|
|
63
63
|
|
|
64
|
+
# Fetch and transform a remote page
|
|
65
|
+
llm-docs-builder transform --url https://yoursite.com/docs/page.html
|
|
66
|
+
|
|
64
67
|
# Bulk transform with config
|
|
65
68
|
llm-docs-builder bulk-transform --config llm-docs-builder.yml
|
|
66
69
|
```
|
|
@@ -109,6 +112,7 @@ docs: ./docs
|
|
|
109
112
|
base_url: https://myproject.io
|
|
110
113
|
title: My Project
|
|
111
114
|
description: Brief description
|
|
115
|
+
body: Optional body content between description and sections
|
|
112
116
|
output: llms.txt
|
|
113
117
|
suffix: .llm
|
|
114
118
|
verbose: false
|
|
@@ -289,9 +293,9 @@ Generate enriched llms.txt files with token counts, timestamps, and priority lab
|
|
|
289
293
|
|
|
290
294
|
**Enhanced llms.txt (with metadata enabled):**
|
|
291
295
|
```markdown
|
|
292
|
-
- [Getting Started](https://myproject.io/docs/Getting-Started.md) tokens:450 updated:2025-10-13 priority:high
|
|
293
|
-
- [Configuration](https://myproject.io/docs/Configuration.md) tokens:2800 updated:2025-10-12 priority:high
|
|
294
|
-
- [Advanced Topics](https://myproject.io/docs/Advanced.md) tokens:5200 updated:2025-09-15 priority:medium
|
|
296
|
+
- [Getting Started](https://myproject.io/docs/Getting-Started.md): Quick start guide (tokens:450, updated:2025-10-13, priority:high)
|
|
297
|
+
- [Configuration](https://myproject.io/docs/Configuration.md): Configuration options (tokens:2800, updated:2025-10-12, priority:high)
|
|
298
|
+
- [Advanced Topics](https://myproject.io/docs/Advanced.md): Deep dive topics (tokens:5200, updated:2025-09-15, priority:medium)
|
|
295
299
|
```
|
|
296
300
|
|
|
297
301
|
**Benefits:**
|
|
@@ -309,6 +313,68 @@ include_priority: true # Show priority labels (high/medium/low)
|
|
|
309
313
|
calculate_compression: true # Show compression ratios (slower, requires transformation)
|
|
310
314
|
```
|
|
311
315
|
|
|
316
|
+
**Note:** Metadata is formatted according to the llms.txt specification, appearing within the description field using parentheses and comma separators for spec compliance.
|
|
317
|
+
|
|
318
|
+
### Multi-Section Organization
|
|
319
|
+
|
|
320
|
+
Documents are automatically organized into multiple sections based on priority, following the llms.txt specification:
|
|
321
|
+
|
|
322
|
+
**Priority-based categorization:**
|
|
323
|
+
- **Documentation** (priority 1-3): Essential docs like README, getting started guides, user guides
|
|
324
|
+
- **Examples** (priority 4-5): Tutorials and example files
|
|
325
|
+
- **Optional** (priority 6-7): Advanced topics and reference documentation
|
|
326
|
+
|
|
327
|
+
**Example output:**
|
|
328
|
+
```markdown
|
|
329
|
+
# My Project
|
|
330
|
+
|
|
331
|
+
> Project description
|
|
332
|
+
|
|
333
|
+
## Documentation
|
|
334
|
+
|
|
335
|
+
- [README](README.md): Main documentation
|
|
336
|
+
- [Getting Started](getting-started.md): Quick start guide
|
|
337
|
+
|
|
338
|
+
## Examples
|
|
339
|
+
|
|
340
|
+
- [Basic Tutorial](tutorial.md): Step-by-step tutorial
|
|
341
|
+
- [Code Examples](examples.md): Example code
|
|
342
|
+
|
|
343
|
+
## Optional
|
|
344
|
+
|
|
345
|
+
- [Advanced Topics](advanced.md): Deep dive into advanced features
|
|
346
|
+
- [API Reference](reference.md): Complete API reference
|
|
347
|
+
```
|
|
348
|
+
|
|
349
|
+
Empty sections are automatically omitted. The "Optional" section aligns with the llms.txt spec for marking secondary content that can be skipped when context windows are limited.
|
|
350
|
+
|
|
351
|
+
### Body Content
|
|
352
|
+
|
|
353
|
+
Add custom body content between the description and documentation sections:
|
|
354
|
+
|
|
355
|
+
```yaml
|
|
356
|
+
# llm-docs-builder.yml
|
|
357
|
+
title: My Project
|
|
358
|
+
description: Brief description
|
|
359
|
+
body: |
|
|
360
|
+
This framework is built on Ruby and focuses on performance.
|
|
361
|
+
Key concepts: streaming, batching, and parallel processing.
|
|
362
|
+
docs: ./docs
|
|
363
|
+
```
|
|
364
|
+
|
|
365
|
+
This produces:
|
|
366
|
+
```markdown
|
|
367
|
+
# My Project
|
|
368
|
+
|
|
369
|
+
> Brief description
|
|
370
|
+
|
|
371
|
+
This framework is built on Ruby and focuses on performance.
|
|
372
|
+
Key concepts: streaming, batching, and parallel processing.
|
|
373
|
+
|
|
374
|
+
## Documentation
|
|
375
|
+
...
|
|
376
|
+
```
|
|
377
|
+
|
|
312
378
|
## Advanced Compression Options
|
|
313
379
|
|
|
314
380
|
All compression features can be used individually for fine-grained control:
|
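The priority thresholds described in the README hunk above (1-3, 4-5, 6-7) can be sketched as a small grouping step. This is an illustrative approximation, not the gem's code: `section_for` and the sample `docs` array are assumptions, and only the thresholds and section names come from the diff.

```ruby
# Hypothetical helper: maps a numeric priority to the section name used in llms.txt.
def section_for(priority)
  if priority <= 3
    'Documentation'
  elsif priority <= 5
    'Examples'
  else
    'Optional'
  end
end

# Sample data invented for the example.
docs = [
  { title: 'README', priority: 1 },
  { title: 'Basic Tutorial', priority: 4 },
  { title: 'API Reference', priority: 7 }
]

# group_by only creates keys that have at least one document,
# so empty sections never appear in the result.
sections = docs.group_by { |doc| section_for(doc[:priority]) }
sections.each { |name, entries| puts "## #{name}: #{entries.map { |d| d[:title] }.join(', ')}" }
```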
data/lib/llm_docs_builder/cli.rb
CHANGED
|
@@ -68,8 +68,9 @@ module LlmDocsBuilder
|
|
|
68
68
|
# @param argv [Array<String>] command-line arguments
|
|
69
69
|
# @return [Hash] parsed options including :command, :config, :docs, :output, :verbose
|
|
70
70
|
def parse_options(argv)
|
|
71
|
+
command_token = argv.first
|
|
71
72
|
options = {
|
|
72
|
-
command:
|
|
73
|
+
command: command_token&.match?(/\A[a-z](?:[a-z-]*[a-z])?\z/) ? argv.shift : nil
|
|
73
74
|
}
|
|
74
75
|
|
|
75
76
|
OptionParser.new do |opts|
|
|
@@ -100,7 +101,7 @@ module LlmDocsBuilder
|
|
|
100
101
|
options[:output] = path
|
|
101
102
|
end
|
|
102
103
|
|
|
103
|
-
opts.on('-u', '--url URL', 'URL to fetch for comparison') do |url|
|
|
104
|
+
opts.on('-u', '--url URL', 'URL to fetch for transform or comparison') do |url|
|
|
104
105
|
options[:url] = url
|
|
105
106
|
end
|
|
106
107
|
|
|
@@ -185,21 +186,42 @@ module LlmDocsBuilder
|
|
|
185
186
|
config = LlmDocsBuilder::Config.new(options[:config])
|
|
186
187
|
merged_options = config.merge_with_options(options)
|
|
187
188
|
|
|
188
|
-
|
|
189
|
+
url = options[:url]
|
|
190
|
+
cli_file_path = options[:docs]
|
|
191
|
+
config_file_path = config['docs']
|
|
192
|
+
file_path = url ? cli_file_path : (cli_file_path || config_file_path)
|
|
189
193
|
|
|
190
|
-
|
|
191
|
-
puts '
|
|
194
|
+
if url && cli_file_path
|
|
195
|
+
puts 'Cannot use both --docs and --url for transform command'
|
|
192
196
|
exit 1
|
|
193
197
|
end
|
|
194
198
|
|
|
195
|
-
unless
|
|
196
|
-
|
|
197
|
-
|
|
199
|
+
unless file_path
|
|
200
|
+
unless url
|
|
201
|
+
puts 'File path required for transform command (use -d/--docs)'
|
|
202
|
+
exit 1
|
|
203
|
+
end
|
|
198
204
|
end
|
|
199
205
|
|
|
200
|
-
|
|
206
|
+
content =
|
|
207
|
+
if url
|
|
208
|
+
puts "Fetching #{url}..." if merged_options[:verbose]
|
|
209
|
+
fetcher = LlmDocsBuilder::UrlFetcher.new(verbose: merged_options[:verbose])
|
|
210
|
+
remote_content = fetcher.fetch(url)
|
|
211
|
+
puts "Transforming content from #{url}..." if merged_options[:verbose]
|
|
212
|
+
transform_options = merged_options.merge(content: remote_content, docs: nil, source_url: url)
|
|
213
|
+
LlmDocsBuilder.transform_markdown(nil, transform_options)
|
|
214
|
+
else
|
|
215
|
+
unless File.exist?(file_path)
|
|
216
|
+
puts "File not found: #{file_path}"
|
|
217
|
+
exit 1
|
|
218
|
+
end
|
|
201
219
|
|
|
202
|
-
|
|
220
|
+
puts "Transforming #{file_path}..." if merged_options[:verbose]
|
|
221
|
+
|
|
222
|
+
merged_options[:docs] = file_path
|
|
223
|
+
LlmDocsBuilder.transform_markdown(file_path, merged_options)
|
|
224
|
+
end
|
|
203
225
|
|
|
204
226
|
if merged_options[:output] && merged_options[:output] != 'llms.txt'
|
|
205
227
|
File.write(merged_options[:output], content)
|
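The new `parse_options` line in the hunk above only treats the first argument as a command when it looks like a lowercase, optionally hyphenated word. The sketch below isolates that check so its effect on different argument lists is easier to see. `split_command` is a hypothetical wrapper; the regular expression is copied from the diff.

```ruby
# Pattern copied from the diff: lowercase letters, inner hyphens allowed,
# no leading or trailing hyphen.
COMMAND_TOKEN = /\A[a-z](?:[a-z-]*[a-z])?\z/

# Hypothetical wrapper: returns the command (or nil) and the remaining arguments
# without mutating the caller's array.
def split_command(argv)
  args = argv.dup
  command = args.first&.match?(COMMAND_TOKEN) ? args.shift : nil
  [command, args]
end

p split_command(['transform', '--url', 'https://example.com/docs/page.html'])
p split_command(['--docs', 'README.md'])
p split_command([])
```

With this check an option such as `--docs` in first position is left in the argument list for `OptionParser` instead of being consumed as the command.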
data/lib/llm_docs_builder/comparator.rb
CHANGED
|
@@ -1,8 +1,5 @@
|
|
|
1
1
|
# frozen_string_literal: true
|
|
2
2
|
|
|
3
|
-
require 'net/http'
|
|
4
|
-
require 'uri'
|
|
5
|
-
|
|
6
3
|
module LlmDocsBuilder
|
|
7
4
|
# Compares content sizes between human and AI versions
|
|
8
5
|
#
|
|
@@ -30,7 +27,7 @@ module LlmDocsBuilder
|
|
|
30
27
|
AI_USER_AGENT = 'Claude-Web/1.0 (Anthropic AI Assistant)'
|
|
31
28
|
|
|
32
29
|
# Maximum number of redirects to follow before raising an error
|
|
33
|
-
MAX_REDIRECTS =
|
|
30
|
+
MAX_REDIRECTS = UrlFetcher::MAX_REDIRECTS
|
|
34
31
|
|
|
35
32
|
# @return [String] URL to compare
|
|
36
33
|
attr_reader :url
|
|
@@ -133,78 +130,11 @@ module LlmDocsBuilder
|
|
|
133
130
|
# @return [String] response body
|
|
134
131
|
# @raise [Errors::GenerationError] if fetch fails or too many redirects
|
|
135
132
|
def fetch_url(url_string, user_agent, redirect_count = 0)
|
|
136
|
-
|
|
137
|
-
|
|
138
|
-
|
|
139
|
-
"Too many redirects (#{MAX_REDIRECTS}) when fetching #{url_string}"
|
|
140
|
-
)
|
|
141
|
-
end
|
|
142
|
-
|
|
143
|
-
uri = validate_and_parse_url(url_string)
|
|
144
|
-
|
|
145
|
-
http = Net::HTTP.new(uri.host, uri.port)
|
|
146
|
-
http.use_ssl = uri.scheme == 'https'
|
|
147
|
-
http.open_timeout = 10
|
|
148
|
-
http.read_timeout = 30
|
|
149
|
-
|
|
150
|
-
request = Net::HTTP::Get.new(uri.request_uri)
|
|
151
|
-
request['User-Agent'] = user_agent
|
|
152
|
-
|
|
153
|
-
response = http.request(request)
|
|
154
|
-
|
|
155
|
-
case response
|
|
156
|
-
when Net::HTTPSuccess
|
|
157
|
-
response.body
|
|
158
|
-
when Net::HTTPRedirection
|
|
159
|
-
# Follow redirect with incremented counter
|
|
160
|
-
redirect_url = response['location']
|
|
161
|
-
puts " Redirecting to #{redirect_url}..." if options[:verbose] && redirect_count.positive?
|
|
162
|
-
fetch_url(redirect_url, user_agent, redirect_count + 1)
|
|
163
|
-
else
|
|
164
|
-
raise(
|
|
165
|
-
Errors::GenerationError,
|
|
166
|
-
"Failed to fetch #{url_string}: #{response.code} #{response.message}"
|
|
167
|
-
)
|
|
168
|
-
end
|
|
169
|
-
rescue Errors::GenerationError
|
|
170
|
-
raise
|
|
171
|
-
rescue StandardError => e
|
|
172
|
-
raise(
|
|
173
|
-
Errors::GenerationError,
|
|
174
|
-
"Error fetching #{url_string}: #{e.message}"
|
|
175
|
-
)
|
|
176
|
-
end
|
|
177
|
-
|
|
178
|
-
# Validates and parses URL to prevent malformed URLs
|
|
179
|
-
#
|
|
180
|
-
# @param url_string [String] URL to validate and parse
|
|
181
|
-
# @return [URI::HTTP, URI::HTTPS] parsed URI
|
|
182
|
-
# @raise [Errors::GenerationError] if URL is invalid or uses unsupported scheme
|
|
183
|
-
def validate_and_parse_url(url_string)
|
|
184
|
-
uri = URI.parse(url_string)
|
|
185
|
-
|
|
186
|
-
# Only allow HTTP and HTTPS schemes
|
|
187
|
-
unless %w[http https].include?(uri.scheme&.downcase)
|
|
188
|
-
raise(
|
|
189
|
-
Errors::GenerationError,
|
|
190
|
-
"Unsupported URL scheme: #{uri.scheme || 'none'} (only http/https allowed)"
|
|
191
|
-
)
|
|
192
|
-
end
|
|
193
|
-
|
|
194
|
-
# Ensure host is present
|
|
195
|
-
if uri.host.nil? || uri.host.empty?
|
|
196
|
-
raise(
|
|
197
|
-
Errors::GenerationError,
|
|
198
|
-
"Invalid URL: missing host in #{url_string}"
|
|
199
|
-
)
|
|
200
|
-
end
|
|
201
|
-
|
|
202
|
-
uri
|
|
203
|
-
rescue URI::InvalidURIError => e
|
|
204
|
-
raise(
|
|
205
|
-
Errors::GenerationError,
|
|
206
|
-
"Invalid URL format: #{e.message}"
|
|
133
|
+
fetcher = UrlFetcher.new(
|
|
134
|
+
user_agent: user_agent,
|
|
135
|
+
verbose: options[:verbose]
|
|
207
136
|
)
|
|
137
|
+
fetcher.fetch(url_string, redirect_count)
|
|
208
138
|
end
|
|
209
139
|
|
|
210
140
|
# Calculate comparison statistics
|
data/lib/llm_docs_builder/config.rb
CHANGED
|
@@ -57,10 +57,15 @@ module LlmDocsBuilder
|
|
|
57
57
|
def merge_with_options(options)
|
|
58
58
|
# CLI options override config file, config file provides defaults
|
|
59
59
|
{
|
|
60
|
-
docs: options
|
|
60
|
+
docs: if options.key?(:docs)
|
|
61
|
+
options[:docs]
|
|
62
|
+
else
|
|
63
|
+
self['docs'] || '.'
|
|
64
|
+
end,
|
|
61
65
|
base_url: options[:base_url] || self['base_url'],
|
|
62
66
|
title: options[:title] || self['title'],
|
|
63
67
|
description: options[:description] || self['description'],
|
|
68
|
+
body: options[:body] || self['body'],
|
|
64
69
|
output: options[:output] || self['output'] || 'llms.txt',
|
|
65
70
|
convert_urls: if options.key?(:convert_urls)
|
|
66
71
|
options[:convert_urls]
|
|
@@ -170,7 +175,10 @@ module LlmDocsBuilder
|
|
|
170
175
|
else
|
|
171
176
|
self['calculate_compression'] || false
|
|
172
177
|
end
|
|
173
|
-
}
|
|
178
|
+
}.tap do |merged|
|
|
179
|
+
merged[:content] = options[:content] if options.key?(:content)
|
|
180
|
+
merged[:source_url] = options[:source_url] if options.key?(:source_url)
|
|
181
|
+
end
|
|
174
182
|
end
|
|
175
183
|
|
|
176
184
|
# Check if a config file was found and exists
|
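The `docs:` change in `merge_with_options` above switches to a `key?` check, which behaves differently from a plain `||` fallback when the caller passes the key with a `nil` value. A minimal sketch of that difference, using a plain hash as a stand-in for the config object (`resolve_docs` is a hypothetical name):

```ruby
# Hypothetical stand-in for the merge logic of the :docs key only.
def resolve_docs(options, config)
  if options.key?(:docs)
    options[:docs]          # an explicit value wins, even when it is nil
  else
    config['docs'] || '.'   # otherwise fall back to config, then to the current directory
  end
end

config = { 'docs' => './docs' }

p resolve_docs({ docs: 'README.md' }, config)
p resolve_docs({ docs: nil }, config)
p resolve_docs({}, config)
p resolve_docs({}, {})
```

The second call returns `nil` rather than `'./docs'`, which is what allows a caller to say "no local docs path" explicitly.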
data/lib/llm_docs_builder/generator.rb
CHANGED
|
@@ -210,7 +210,11 @@ module LlmDocsBuilder
|
|
|
210
210
|
|
|
211
211
|
# Constructs llms.txt content from analyzed documentation files
|
|
212
212
|
#
|
|
213
|
-
# Combines title, description, and documentation links into formatted output
|
|
213
|
+
# Combines title, description, body content, and documentation links into formatted output.
|
|
214
|
+
# Organizes documents into sections based on priority:
|
|
215
|
+
# - Priority 1-3: Documentation (essential docs like README, getting started)
|
|
216
|
+
# - Priority 4-5: Examples (tutorials, example files)
|
|
217
|
+
# - Priority 6-7: Optional (advanced topics, reference docs)
|
|
214
218
|
#
|
|
215
219
|
# @param docs [Array<Hash>] analyzed file metadata
|
|
216
220
|
# @return [String] formatted llms.txt content
|
|
@@ -224,31 +228,60 @@ module LlmDocsBuilder
|
|
|
224
228
|
content << "> #{description}" if description
|
|
225
229
|
content << ''
|
|
226
230
|
|
|
227
|
-
|
|
228
|
-
|
|
231
|
+
# Add optional body content
|
|
232
|
+
if options[:body] && !options[:body].empty?
|
|
233
|
+
content << options[:body]
|
|
229
234
|
content << ''
|
|
235
|
+
end
|
|
230
236
|
|
|
231
|
-
|
|
232
|
-
|
|
233
|
-
|
|
234
|
-
|
|
235
|
-
|
|
236
|
-
|
|
237
|
-
|
|
238
|
-
|
|
239
|
-
|
|
240
|
-
|
|
241
|
-
|
|
242
|
-
|
|
243
|
-
|
|
244
|
-
|
|
245
|
-
|
|
246
|
-
|
|
247
|
-
|
|
237
|
+
if docs.any?
|
|
238
|
+
# Categorize docs by priority into sections
|
|
239
|
+
sections = {
|
|
240
|
+
'Documentation' => docs.select { |d| d[:priority] <= 3 },
|
|
241
|
+
'Examples' => docs.select { |d| d[:priority] >= 4 && d[:priority] <= 5 },
|
|
242
|
+
'Optional' => docs.select { |d| d[:priority] >= 6 }
|
|
243
|
+
}
|
|
244
|
+
|
|
245
|
+
# Build each section (skip empty ones)
|
|
246
|
+
sections.each do |section_name, section_docs|
|
|
247
|
+
next if section_docs.empty?
|
|
248
|
+
|
|
249
|
+
content << "## #{section_name}"
|
|
250
|
+
content << ''
|
|
251
|
+
|
|
252
|
+
section_docs.each do |doc|
|
|
253
|
+
url = build_url(doc[:path])
|
|
254
|
+
|
|
255
|
+
# Build metadata string if enabled
|
|
256
|
+
metadata_str = nil
|
|
257
|
+
if options[:include_metadata]
|
|
258
|
+
metadata_parts = []
|
|
259
|
+
metadata_parts << "tokens:#{doc[:tokens]}" if doc[:tokens]
|
|
260
|
+
metadata_parts << "compression:#{doc[:compression]}" if doc[:compression]
|
|
261
|
+
metadata_parts << "updated:#{doc[:updated]}" if doc[:updated]
|
|
262
|
+
metadata_parts << priority_label(doc[:priority]) if options[:include_priority]
|
|
263
|
+
|
|
264
|
+
metadata_str = "(#{metadata_parts.join(', ')})" unless metadata_parts.empty?
|
|
265
|
+
end
|
|
266
|
+
|
|
267
|
+
# Build line according to spec: - [title](url): description (metadata)
|
|
268
|
+
line = if doc[:description] && !doc[:description].empty?
|
|
269
|
+
base = "- [#{doc[:title]}](#{url}): #{doc[:description]}"
|
|
270
|
+
metadata_str ? "#{base} #{metadata_str}" : base
|
|
271
|
+
else
|
|
272
|
+
# No description: - [title](url) (metadata)
|
|
273
|
+
base = "- [#{doc[:title]}](#{url})"
|
|
274
|
+
metadata_str ? "#{base}: #{metadata_str}" : base
|
|
275
|
+
end
|
|
276
|
+
|
|
277
|
+
content << line
|
|
248
278
|
end
|
|
249
279
|
|
|
250
|
-
content <<
|
|
280
|
+
content << ''
|
|
251
281
|
end
|
|
282
|
+
|
|
283
|
+
# Remove trailing empty line
|
|
284
|
+
content.pop if content.last == ''
|
|
252
285
|
end
|
|
253
286
|
|
|
254
287
|
"#{content.join("\n")}\n"
|
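The link-line construction in the generator hunk above has four cases (description or not, metadata or not). The following sketch condenses them into one function to make the resulting formats easy to compare. `entry_line` and its keyword arguments are assumptions made for illustration; the output shapes follow the diff.

```ruby
# Hypothetical condensed version of the line-building logic shown in the diff.
def entry_line(title:, url:, description: nil, metadata: [])
  meta = metadata.empty? ? nil : "(#{metadata.join(', ')})"
  line = "- [#{title}](#{url})"
  # Description and metadata share the part after the colon; either may be absent.
  tail = [description, meta].compact.reject(&:empty?).join(' ')
  tail.empty? ? line : "#{line}: #{tail}"
end

puts entry_line(title: 'Getting Started',
                url: 'https://myproject.io/docs/Getting-Started.md',
                description: 'Quick start guide',
                metadata: ['tokens:450', 'updated:2025-10-13', 'priority:high'])
puts entry_line(title: 'Configuration', url: 'https://myproject.io/docs/Configuration.md')
puts entry_line(title: 'Advanced Topics', url: 'https://myproject.io/docs/Advanced.md',
                metadata: ['tokens:5200'])
```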
data/lib/llm_docs_builder/markdown_transformer.rb
CHANGED
|
@@ -55,7 +55,7 @@ module LlmDocsBuilder
|
|
|
55
55
|
#
|
|
56
56
|
# @return [String] transformed markdown content
|
|
57
57
|
def transform
|
|
58
|
-
content =
|
|
58
|
+
content = load_content
|
|
59
59
|
|
|
60
60
|
# Build and execute transformation pipeline
|
|
61
61
|
content = cleanup_transformer.transform(content, options)
|
|
@@ -124,5 +124,16 @@ module LlmDocsBuilder
|
|
|
124
124
|
}
|
|
125
125
|
compressor.compress(content, compression_methods)
|
|
126
126
|
end
|
|
127
|
+
|
|
128
|
+
# Load source content either from provided string or file path
|
|
129
|
+
#
|
|
130
|
+
# @return [String] markdown content to transform
|
|
131
|
+
def load_content
|
|
132
|
+
if options[:content]
|
|
133
|
+
options[:content].dup
|
|
134
|
+
else
|
|
135
|
+
File.read(file_path)
|
|
136
|
+
end
|
|
137
|
+
end
|
|
127
138
|
end
|
|
128
139
|
end
|
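The `load_content` method added above prefers an in-memory string over reading a file and returns a copy of it. One plausible reason for the `dup`, assumed here rather than stated in the diff, is that later pipeline steps can then modify the content without touching the caller's string, including a frozen one. A standalone sketch with the method rewritten as a free function:

```ruby
require 'tempfile'

# Standalone approximation of the method in the diff; file_path and options
# are passed in explicitly instead of being instance state.
def load_content(file_path, options)
  options[:content] ? options[:content].dup : File.read(file_path)
end

remote = '# Fetched page'.freeze
copy = load_content(nil, { content: remote })
copy << "\n\nMore text" # safe: dup returns an unfrozen copy
puts copy

Tempfile.create(['sample', '.md']) do |file|
  file.write("# From disk\n")
  file.flush
  puts load_content(file.path, {})
end
```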
data/lib/llm_docs_builder/parser.rb
CHANGED
|
@@ -85,7 +85,7 @@ module LlmDocsBuilder
|
|
|
85
85
|
|
|
86
86
|
# Extracts markdown links from section content into structured format
|
|
87
87
|
#
|
|
88
|
-
# Scans for markdown list items with links and descriptions. Returns raw content
|
|
88
|
+
# Scans for markdown list items with links and optional descriptions. Returns raw content
|
|
89
89
|
# if no links are found in the expected format.
|
|
90
90
|
#
|
|
91
91
|
# @param content [String] raw section content
|
|
@@ -93,11 +93,13 @@ module LlmDocsBuilder
|
|
|
93
93
|
def parse_section_content(content)
|
|
94
94
|
links = []
|
|
95
95
|
|
|
96
|
-
|
|
96
|
+
# Updated regex: description is optional (non-capturing group with ?)
|
|
97
|
+
# Use [^\n]* instead of .* to avoid matching across lines
|
|
98
|
+
content.scan(/^[-*]\s*\[([^\]]+)\]\(([^)]+)\)(?::\s*([^\n]*))?$/) do |title, url, description|
|
|
97
99
|
links << {
|
|
98
100
|
title: title,
|
|
99
101
|
url: url,
|
|
100
|
-
description: description
|
|
102
|
+
description: description&.strip || ''
|
|
101
103
|
}
|
|
102
104
|
end
|
|
103
105
|
|
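The regex change in the parser hunk above can be checked directly against a small section. The sketch below runs the old pattern (as quoted in the CHANGELOG) and the new one over the same text; the sample section is invented for the example.

```ruby
# Old pattern as quoted in the CHANGELOG: /m lets `.` cross newlines, so the
# description group swallows the rest of the section.
OLD_LINK = /^[-*]\s*\[([^\]]+)\]\(([^)]+)\):\s*(.*)$/m
# New pattern from the diff: description is optional and limited to one line.
NEW_LINK = /^[-*]\s*\[([^\]]+)\]\(([^)]+)\)(?::\s*([^\n]*))?$/

section = <<~MD
  - [README](README.md): Main documentation
  - [Guide](guide.md)
  - [API](api.md): Reference (tokens:450, updated:2025-10-13)
MD

old_links = section.scan(OLD_LINK)
new_links = section.scan(NEW_LINK)

puts "old pattern: #{old_links.size} link(s)"
puts "new pattern: #{new_links.size} link(s)"
new_links.each { |title, url, description| puts "#{title} -> #{url} (#{description.inspect})" }
```

The old pattern yields a single match whose description contains the remaining lines, while the new one yields one match per line and `nil` for the missing description, which the diff then normalizes with `description&.strip || ''`.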
data/lib/llm_docs_builder/url_fetcher.rb
ADDED
|
@@ -0,0 +1,120 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require 'net/http'
|
|
4
|
+
require 'uri'
|
|
5
|
+
|
|
6
|
+
module LlmDocsBuilder
|
|
7
|
+
# Lightweight HTTP client for fetching remote documentation pages.
|
|
8
|
+
#
|
|
9
|
+
# Provides common functionality needed by multiple commands (transform, compare)
|
|
10
|
+
# including strict scheme validation, redirect handling and sensible timeouts.
|
|
11
|
+
class UrlFetcher
|
|
12
|
+
DEFAULT_USER_AGENT = 'llm-docs-builder/1.0 (+https://github.com/mensfeld/llm-docs-builder)'
|
|
13
|
+
MAX_REDIRECTS = 10
|
|
14
|
+
|
|
15
|
+
# @param user_agent [String] HTTP user agent header value
|
|
16
|
+
# @param verbose [Boolean] enable redirect logging
|
|
17
|
+
# @param output [IO] IO stream used for redirect logging
|
|
18
|
+
def initialize(user_agent: DEFAULT_USER_AGENT, verbose: false, output: $stdout)
|
|
19
|
+
@user_agent = user_agent
|
|
20
|
+
@verbose = verbose
|
|
21
|
+
@output = output
|
|
22
|
+
end
|
|
23
|
+
|
|
24
|
+
# Fetch remote URL content while following redirects.
|
|
25
|
+
#
|
|
26
|
+
# @param url_string [String] URL to fetch
|
|
27
|
+
# @param redirect_count [Integer] current redirect depth (internal use)
|
|
28
|
+
# @return [String] response body
|
|
29
|
+
# @raise [Errors::GenerationError] on invalid URLs, network failures, or redirect loops
|
|
30
|
+
def fetch(url_string, redirect_count = 0)
|
|
31
|
+
if redirect_count >= MAX_REDIRECTS
|
|
32
|
+
raise(
|
|
33
|
+
Errors::GenerationError,
|
|
34
|
+
"Too many redirects (#{MAX_REDIRECTS}) when fetching #{url_string}"
|
|
35
|
+
)
|
|
36
|
+
end
|
|
37
|
+
|
|
38
|
+
uri = validate_and_parse_url(url_string)
|
|
39
|
+
|
|
40
|
+
http = Net::HTTP.new(uri.host, uri.port)
|
|
41
|
+
http.use_ssl = uri.scheme == 'https'
|
|
42
|
+
http.open_timeout = 10
|
|
43
|
+
http.read_timeout = 30
|
|
44
|
+
|
|
45
|
+
request = Net::HTTP::Get.new(uri.request_uri)
|
|
46
|
+
request['User-Agent'] = @user_agent
|
|
47
|
+
|
|
48
|
+
response = http.request(request)
|
|
49
|
+
|
|
50
|
+
case response
|
|
51
|
+
when Net::HTTPSuccess
|
|
52
|
+
response.body
|
|
53
|
+
when Net::HTTPRedirection
|
|
54
|
+
redirect_url = absolute_redirect_url(uri, response['location'])
|
|
55
|
+
log_redirect(redirect_url)
|
|
56
|
+
fetch(redirect_url, redirect_count + 1)
|
|
57
|
+
else
|
|
58
|
+
raise(
|
|
59
|
+
Errors::GenerationError,
|
|
60
|
+
"Failed to fetch #{url_string}: #{response.code} #{response.message}"
|
|
61
|
+
)
|
|
62
|
+
end
|
|
63
|
+
rescue Errors::GenerationError
|
|
64
|
+
raise
|
|
65
|
+
rescue StandardError => e
|
|
66
|
+
raise(
|
|
67
|
+
Errors::GenerationError,
|
|
68
|
+
"Error fetching #{url_string}: #{e.message}"
|
|
69
|
+
)
|
|
70
|
+
end
|
|
71
|
+
|
|
72
|
+
private
|
|
73
|
+
|
|
74
|
+
def validate_and_parse_url(url_string)
|
|
75
|
+
uri = URI.parse(url_string)
|
|
76
|
+
|
|
77
|
+
unless %w[http https].include?(uri.scheme&.downcase)
|
|
78
|
+
raise(
|
|
79
|
+
Errors::GenerationError,
|
|
80
|
+
"Unsupported URL scheme: #{uri.scheme || 'none'} (only http/https allowed)"
|
|
81
|
+
)
|
|
82
|
+
end
|
|
83
|
+
|
|
84
|
+
if uri.host.nil? || uri.host.empty?
|
|
85
|
+
raise(
|
|
86
|
+
Errors::GenerationError,
|
|
87
|
+
"Invalid URL: missing host in #{url_string}"
|
|
88
|
+
)
|
|
89
|
+
end
|
|
90
|
+
|
|
91
|
+
uri
|
|
92
|
+
rescue URI::InvalidURIError => e
|
|
93
|
+
raise(
|
|
94
|
+
Errors::GenerationError,
|
|
95
|
+
"Invalid URL format: #{e.message}"
|
|
96
|
+
)
|
|
97
|
+
end
|
|
98
|
+
|
|
99
|
+
def absolute_redirect_url(base_uri, location)
|
|
100
|
+
raise(
|
|
101
|
+
Errors::GenerationError,
|
|
102
|
+
"Redirect missing location header for #{base_uri}"
|
|
103
|
+
) if location.nil? || location.empty?
|
|
104
|
+
|
|
105
|
+
URI.join(base_uri, location).to_s
|
|
106
|
+
rescue URI::InvalidURIError => e
|
|
107
|
+
raise(
|
|
108
|
+
Errors::GenerationError,
|
|
109
|
+
"Invalid redirect URL from #{base_uri}: #{e.message}"
|
|
110
|
+
)
|
|
111
|
+
end
|
|
112
|
+
|
|
113
|
+
def log_redirect(url)
|
|
114
|
+
return unless @verbose
|
|
115
|
+
|
|
116
|
+
@output.puts(" Redirecting to #{url}...")
|
|
117
|
+
end
|
|
118
|
+
end
|
|
119
|
+
end
|
|
120
|
+
|
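Two pieces of the new `UrlFetcher` above can be tried without any network access: the http/https scheme check and the resolution of a redirect `Location` header against the current URL. The sketch below uses only Ruby's standard `uri` library; `http_url?` and `absolute_redirect` are hypothetical free functions that mirror the private methods in the diff, and they return values instead of raising `Errors::GenerationError`.

```ruby
require 'uri'

# Mirrors the scheme check in validate_and_parse_url, returning a boolean
# instead of raising.
def http_url?(url_string)
  %w[http https].include?(URI.parse(url_string).scheme&.downcase)
rescue URI::InvalidURIError
  false
end

# Mirrors absolute_redirect_url: a relative Location is resolved against
# the URL that produced the redirect.
def absolute_redirect(base_url, location)
  URI.join(base_url, location).to_s
end

base = 'https://example.com/docs/page.html'
puts absolute_redirect(base, '/new/page.html')
puts absolute_redirect(base, 'other.html')
puts absolute_redirect(base, 'https://example.org/moved')
puts http_url?('ftp://example.com/file')
```

Resolving against the base URL is the behavioral difference from the removed comparator code, which passed `response['location']` to the next request unchanged.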
data/lib/llm_docs_builder.rb
CHANGED
|
@@ -25,6 +25,7 @@ module LlmDocsBuilder
|
|
|
25
25
|
# @option options [Boolean] :convert_urls convert HTML URLs to markdown format (overrides
|
|
26
26
|
# config)
|
|
27
27
|
# @option options [Boolean] :verbose enable verbose output (overrides config)
|
|
28
|
+
# @option options [String] :content raw markdown content (used for remote sources)
|
|
28
29
|
# @return [String] generated llms.txt content
|
|
29
30
|
#
|
|
30
31
|
# @example Generate from docs directory
|
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: llm-docs-builder
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.
|
|
4
|
+
version: 0.11.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Maciej Mensfeld
|
|
@@ -116,6 +116,7 @@ files:
|
|
|
116
116
|
- ".rspec"
|
|
117
117
|
- ".rubocop.yml"
|
|
118
118
|
- ".ruby-version"
|
|
119
|
+
- AGENTS.md
|
|
119
120
|
- CHANGELOG.md
|
|
120
121
|
- Dockerfile
|
|
121
122
|
- Gemfile
|
|
@@ -143,6 +144,7 @@ files:
|
|
|
143
144
|
- lib/llm_docs_builder/transformers/heading_transformer.rb
|
|
144
145
|
- lib/llm_docs_builder/transformers/link_transformer.rb
|
|
145
146
|
- lib/llm_docs_builder/transformers/whitespace_transformer.rb
|
|
147
|
+
- lib/llm_docs_builder/url_fetcher.rb
|
|
146
148
|
- lib/llm_docs_builder/validator.rb
|
|
147
149
|
- lib/llm_docs_builder/version.rb
|
|
148
150
|
- llm-docs-builder.gemspec
|