llm-docs-builder 0.10.0 → 0.11.0
- checksums.yaml +4 -4
- data/.github/workflows/push.yml +1 -1
- data/AGENTS.md +20 -0
- data/CHANGELOG.md +6 -0
- data/Gemfile.lock +17 -17
- data/README.md +3 -0
- data/lib/llm_docs_builder/cli.rb +32 -10
- data/lib/llm_docs_builder/comparator.rb +5 -75
- data/lib/llm_docs_builder/config.rb +9 -2
- data/lib/llm_docs_builder/markdown_transformer.rb +12 -1
- data/lib/llm_docs_builder/url_fetcher.rb +120 -0
- data/lib/llm_docs_builder/version.rb +1 -1
- data/lib/llm_docs_builder.rb +1 -0
- metadata +3 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: deeae74a329018b4a43d7845a3be8b7c31347699c3ff7abd93d7b697a48982a3
+  data.tar.gz: f9c842caa93a45b4d75c45a15e116e6f98d8463e5268e2de32e498c725877e4f
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 94575eced147bd6740b5395acd41d3f46ffcadf40908831df081d5e03f56b35a2e1e9acfdfc7642af775b2aa86fe48ea322dd11baf48512d2f2ef43a1a491079
+  data.tar.gz: 52b7d40d4a95acd20a408f4d453f32c7154637ce42ec210cf67010ce10dbe14ad711c4cbfd4060d8ce5018b91668315d552732ea7118374e69228a869792ff0f
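The digests in checksums.yaml cover the two archives packed inside the `.gem` file. As a minimal sketch, assuming the gem has already been unpacked so that `metadata.gz` and `data.tar.gz` are available on disk, a digest can be recomputed with Ruby's standard `digest` library and compared with the published value. The helper name `checksum_matches?` and the paths in the usage comments are hypothetical.

```ruby
require 'digest'

# Hypothetical helper: recompute a file's digest and compare it with the
# expected hex string copied from checksums.yaml.
def checksum_matches?(path, expected_hex, algorithm: Digest::SHA256)
  algorithm.file(path).hexdigest == expected_hex.strip.downcase
end

# Usage (the paths are assumptions about where the gem was unpacked):
# checksum_matches?('unpacked/data.tar.gz',
#                   'f9c842caa93a45b4d75c45a15e116e6f98d8463e5268e2de32e498c725877e4f')
# checksum_matches?('unpacked/metadata.gz', sha512_hex, algorithm: Digest::SHA512)
```

Passing the algorithm as a keyword keeps one helper usable for both the SHA256 and SHA512 sections.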
data/.github/workflows/push.yml
CHANGED
data/AGENTS.md
ADDED
@@ -0,0 +1,20 @@
+# Repository Guidelines
+
+## Project Structure & Module Organization
+Core gem code lives in `lib/llm_docs_builder`, with single-responsibility modules such as `generator.rb`, `validator.rb`, and the CLI glue in `cli.rb`. Shared entrypoint `lib/llm_docs_builder.rb` wires dependencies. Executables reside in `bin/`: `llm-docs-builder` boots the CLI, while `rspecs` runs the full test matrix. Specs mirror library files under `spec/` with command-level coverage in `spec/integrations`. Static assets (logos, diff screenshots) are in `misc/`. Example configuration templates live at `llm-docs-builder.yml.example`.
+
+## Build, Test, and Development Commands
+- `bundle install` — sync gem dependencies defined in `Gemfile`.
+- `bundle exec rake` — default task; runs RSpec and RuboCop together.
+- `bundle exec rspec` or `bin/rspecs` — execute unit and integration specs with doc formatter.
+- `bundle exec rubocop` — enforce the Ruby style guide; mirrors CI.
+- `bin/llm-docs-builder transform --docs README.md` — smoke-test the CLI against a local file.
+
+## Coding Style & Naming Conventions
+Target Ruby 3.2 with two-space indentation and trailing newline. Prefer single-quoted strings; enable `# frozen_string_literal: true` headers on Ruby files. Keep lines ≤120 characters except where the RuboCop config allows. Use descriptive module/class names (e.g., `LlmDocsBuilder::Generator`) and predicate methods ending with `?` when returning booleans. Place supporting fixtures in `spec/support` if added, and name files after the class they extend.
+
+## Testing Guidelines
+RSpec is the sole testing framework. Name files `*_spec.rb` and align describe blocks with constant paths. Integration scenarios belong in `spec/integrations` to capture CLI behaviors. SimpleCov is enabled by default for line and branch coverage; export `SIMPLECOV=false` for quick local runs. Persist example statuses with the automatically managed `spec/examples.txt`.
+
+## Commit & Pull Request Guidelines
+Keep commit subjects short, present-tense, and focused (e.g., `Align CLI config (#27)`). Group related changes together so `git log` remains readable. Pull requests should describe motivation, summarize behavioral impact, link related issues or discussions, and include CLI output or screenshots when touching generated docs. Ensure CI passes (`bundle exec rake`) before requesting review, and note any follow-up work in the PR description.
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,11 @@
 # Changelog

+## 0.11.0 (2025-11-03)
+- [Feature] **Transform from URL** — The `transform` command now accepts a remote URL via `--url` and processes fetched content through the standard transformer pipeline.
+  - Example: `llm-docs-builder transform --url https://example.com/docs/page.html`
+  - Applies all configured transformations and output options identically to local files
+  - By @Eric-Guo and @codex in PR #28.
+
 ## 0.10.0 (2025-10-27)
 - [Feature] **llms.txt Specification Compliance** - Updated output format to fully comply with the llms.txt specification from llmstxt.org.
   - **Metadata Format**: Metadata now appears within the description field using parentheses and comma separators: `- [title](url): description (tokens:450, updated:2025-10-13, priority:high)`
data/Gemfile.lock
CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    llm-docs-builder (0.
+    llm-docs-builder (0.11.0)
       zeitwerk (~> 2.6)

 GEM
@@ -12,15 +12,15 @@ GEM
     coderay (1.1.3)
     diff-lcs (1.6.2)
     docile (1.4.1)
-    json (2.
+    json (2.15.2)
     language_server-protocol (3.17.0.5)
     lint_roller (1.1.0)
     method_source (1.1.0)
     parallel (1.27.0)
-    parser (3.3.
+    parser (3.3.10.0)
       ast (~> 2.4.1)
       racc
-    prism (1.
+    prism (1.6.0)
     pry (0.15.2)
       coderay (~> 1.1)
       method_source (~> 1.0)
@@ -29,22 +29,22 @@ GEM
       pry (>= 0.13, < 0.16)
     racc (1.8.1)
     rainbow (3.1.1)
-    rake (13.3.
-    regexp_parser (2.11.
-    rspec (3.13.
+    rake (13.3.1)
+    regexp_parser (2.11.3)
+    rspec (3.13.2)
       rspec-core (~> 3.13.0)
       rspec-expectations (~> 3.13.0)
       rspec-mocks (~> 3.13.0)
-    rspec-core (3.13.
+    rspec-core (3.13.6)
       rspec-support (~> 3.13.0)
     rspec-expectations (3.13.5)
       diff-lcs (>= 1.2.0, < 2.0)
       rspec-support (~> 3.13.0)
-    rspec-mocks (3.13.
+    rspec-mocks (3.13.6)
       diff-lcs (>= 1.2.0, < 2.0)
       rspec-support (~> 3.13.0)
-    rspec-support (3.13.
-    rubocop (1.
+    rspec-support (3.13.6)
+    rubocop (1.81.6)
       json (~> 2.3)
       language_server-protocol (~> 3.17.0.2)
       lint_roller (~> 1.1.0)
@@ -52,10 +52,10 @@ GEM
       parser (>= 3.3.0.2)
       rainbow (>= 2.2.2, < 4.0)
       regexp_parser (>= 2.9.3, < 3.0)
-      rubocop-ast (>= 1.
+      rubocop-ast (>= 1.47.1, < 2.0)
       ruby-progressbar (~> 1.7)
       unicode-display_width (>= 2.4.0, < 4.0)
-    rubocop-ast (1.
+    rubocop-ast (1.47.1)
       parser (>= 3.3.7.2)
       prism (~> 1.4)
     ruby-progressbar (1.13.0)
@@ -65,9 +65,9 @@ GEM
       simplecov_json_formatter (~> 0.1)
     simplecov-html (0.13.2)
     simplecov_json_formatter (0.1.4)
-    unicode-display_width (3.
-      unicode-emoji (~> 4.
-    unicode-emoji (4.0
+    unicode-display_width (3.2.0)
+      unicode-emoji (~> 4.1)
+    unicode-emoji (4.1.0)
     zeitwerk (2.7.3)

 PLATFORMS
@@ -85,4 +85,4 @@ DEPENDENCIES
   simplecov (~> 0.21)

 BUNDLED WITH
-   2.7.
+   2.7.2
data/README.md
CHANGED
@@ -61,6 +61,9 @@ Factor: 2.8x smaller
 # Single file
 llm-docs-builder transform --docs README.md

+# Fetch and transform a remote page
+llm-docs-builder transform --url https://yoursite.com/docs/page.html
+
 # Bulk transform with config
 llm-docs-builder bulk-transform --config llm-docs-builder.yml
 ```
data/lib/llm_docs_builder/cli.rb
CHANGED
@@ -68,8 +68,9 @@ module LlmDocsBuilder
     # @param argv [Array<String>] command-line arguments
     # @return [Hash] parsed options including :command, :config, :docs, :output, :verbose
     def parse_options(argv)
+      command_token = argv.first
       options = {
-        command:
+        command: command_token&.match?(/\A[a-z](?:[a-z-]*[a-z])?\z/) ? argv.shift : nil
       }

       OptionParser.new do |opts|
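The added pattern means the first argument is consumed as a command only when it looks like a lowercase, optionally hyphenated word, so a leading flag such as `--url` stays in `argv` for `OptionParser`. The following is a small sketch of that behaviour; the pattern is copied from the hunk above, while the helper name `extract_command` and the sample arguments are hypothetical.

```ruby
# Pattern copied from the diff: a lowercase letter, optionally followed by
# lowercase letters or hyphens, and ending in a lowercase letter.
COMMAND_PATTERN = /\A[a-z](?:[a-z-]*[a-z])?\z/

# Hypothetical helper mirroring the ternary in parse_options: shift the first
# argument off argv only when it looks like a command name.
def extract_command(argv)
  token = argv.first
  token&.match?(COMMAND_PATTERN) ? argv.shift : nil
end

args = ['bulk-transform', '--config', 'llm-docs-builder.yml']
extract_command(args)        # "bulk-transform"; args now starts at '--config'

flags_only = ['--url', 'https://example.com/docs/page.html']
extract_command(flags_only)  # nil; the flag is left in place for OptionParser
```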
@@ -100,7 +101,7 @@ module LlmDocsBuilder
           options[:output] = path
         end

-        opts.on('-u', '--url URL', 'URL to fetch for comparison') do |url|
+        opts.on('-u', '--url URL', 'URL to fetch for transform or comparison') do |url|
           options[:url] = url
         end

@@ -185,21 +186,42 @@ module LlmDocsBuilder
       config = LlmDocsBuilder::Config.new(options[:config])
       merged_options = config.merge_with_options(options)

-
+      url = options[:url]
+      cli_file_path = options[:docs]
+      config_file_path = config['docs']
+      file_path = url ? cli_file_path : (cli_file_path || config_file_path)

-
-        puts '
+      if url && cli_file_path
+        puts 'Cannot use both --docs and --url for transform command'
         exit 1
       end

-      unless
-
-
+      unless file_path
+        unless url
+          puts 'File path required for transform command (use -d/--docs)'
+          exit 1
+        end
       end

-
+      content =
+        if url
+          puts "Fetching #{url}..." if merged_options[:verbose]
+          fetcher = LlmDocsBuilder::UrlFetcher.new(verbose: merged_options[:verbose])
+          remote_content = fetcher.fetch(url)
+          puts "Transforming content from #{url}..." if merged_options[:verbose]
+          transform_options = merged_options.merge(content: remote_content, docs: nil, source_url: url)
+          LlmDocsBuilder.transform_markdown(nil, transform_options)
+        else
+          unless File.exist?(file_path)
+            puts "File not found: #{file_path}"
+            exit 1
+          end

-
+          puts "Transforming #{file_path}..." if merged_options[:verbose]
+
+          merged_options[:docs] = file_path
+          LlmDocsBuilder.transform_markdown(file_path, merged_options)
+        end

       if merged_options[:output] && merged_options[:output] != 'llms.txt'
         File.write(merged_options[:output], content)
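The new branch in the transform command picks exactly one input source: `--url` and `--docs` are mutually exclusive on the command line, and a `docs` value from the config file is ignored whenever a URL is given. The sketch below restates that decision as a side-effect-free function so the cases are easy to read. The function name `resolve_transform_source` and its return shape are hypothetical; the real CLI prints the message and exits instead of returning it, and it also checks that the file exists.

```ruby
# Hypothetical restatement of the source-selection logic in the hunk above.
# Returns [:url, value], [:file, path], or [:error, message].
def resolve_transform_source(cli_docs:, url:, config_docs:)
  return [:error, 'Cannot use both --docs and --url for transform command'] if url && cli_docs
  return [:url, url] if url

  file_path = cli_docs || config_docs
  return [:error, 'File path required for transform command (use -d/--docs)'] unless file_path

  [:file, file_path]
end

resolve_transform_source(cli_docs: nil, url: 'https://example.com/docs/page.html', config_docs: 'docs')
# [:url, "https://example.com/docs/page.html"]; the config value is not consulted
resolve_transform_source(cli_docs: nil, url: nil, config_docs: 'docs')
# [:file, "docs"]
```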
data/lib/llm_docs_builder/comparator.rb
CHANGED
@@ -1,8 +1,5 @@
 # frozen_string_literal: true

-require 'net/http'
-require 'uri'
-
 module LlmDocsBuilder
   # Compares content sizes between human and AI versions
   #
@@ -30,7 +27,7 @@ module LlmDocsBuilder
     AI_USER_AGENT = 'Claude-Web/1.0 (Anthropic AI Assistant)'

     # Maximum number of redirects to follow before raising an error
-    MAX_REDIRECTS =
+    MAX_REDIRECTS = UrlFetcher::MAX_REDIRECTS

     # @return [String] URL to compare
     attr_reader :url
@@ -133,78 +130,11 @@ module LlmDocsBuilder
     # @return [String] response body
     # @raise [Errors::GenerationError] if fetch fails or too many redirects
     def fetch_url(url_string, user_agent, redirect_count = 0)
-
-
-
-          "Too many redirects (#{MAX_REDIRECTS}) when fetching #{url_string}"
-        )
-      end
-
-      uri = validate_and_parse_url(url_string)
-
-      http = Net::HTTP.new(uri.host, uri.port)
-      http.use_ssl = uri.scheme == 'https'
-      http.open_timeout = 10
-      http.read_timeout = 30
-
-      request = Net::HTTP::Get.new(uri.request_uri)
-      request['User-Agent'] = user_agent
-
-      response = http.request(request)
-
-      case response
-      when Net::HTTPSuccess
-        response.body
-      when Net::HTTPRedirection
-        # Follow redirect with incremented counter
-        redirect_url = response['location']
-        puts " Redirecting to #{redirect_url}..." if options[:verbose] && redirect_count.positive?
-        fetch_url(redirect_url, user_agent, redirect_count + 1)
-      else
-        raise(
-          Errors::GenerationError,
-          "Failed to fetch #{url_string}: #{response.code} #{response.message}"
-        )
-      end
-    rescue Errors::GenerationError
-      raise
-    rescue StandardError => e
-      raise(
-        Errors::GenerationError,
-        "Error fetching #{url_string}: #{e.message}"
-      )
-    end
-
-    # Validates and parses URL to prevent malformed URLs
-    #
-    # @param url_string [String] URL to validate and parse
-    # @return [URI::HTTP, URI::HTTPS] parsed URI
-    # @raise [Errors::GenerationError] if URL is invalid or uses unsupported scheme
-    def validate_and_parse_url(url_string)
-      uri = URI.parse(url_string)
-
-      # Only allow HTTP and HTTPS schemes
-      unless %w[http https].include?(uri.scheme&.downcase)
-        raise(
-          Errors::GenerationError,
-          "Unsupported URL scheme: #{uri.scheme || 'none'} (only http/https allowed)"
-        )
-      end
-
-      # Ensure host is present
-      if uri.host.nil? || uri.host.empty?
-        raise(
-          Errors::GenerationError,
-          "Invalid URL: missing host in #{url_string}"
-        )
-      end
-
-      uri
-    rescue URI::InvalidURIError => e
-      raise(
-        Errors::GenerationError,
-        "Invalid URL format: #{e.message}"
+      fetcher = UrlFetcher.new(
+        user_agent: user_agent,
+        verbose: options[:verbose]
       )
+      fetcher.fetch(url_string, redirect_count)
     end

     # Calculate comparison statistics
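`Comparator#fetch_url` now forwards its `redirect_count` to the shared fetcher, which stops once the count reaches `MAX_REDIRECTS`. The bounded recursion can be sketched without any network access. Everything below is illustrative: the `follow` function and the `pages` lookup table are hypothetical stand-ins for the HTTP round trip, and a plain `RuntimeError` is raised where the gem raises `Errors::GenerationError`.

```ruby
MAX_REDIRECTS = 10 # same value as UrlFetcher::MAX_REDIRECTS in this release

# Hypothetical stand-in for the network: `pages` maps a URL either to a body
# string or to { redirect: 'next-url' }.
def follow(url, pages, redirect_count = 0)
  raise "Too many redirects (#{MAX_REDIRECTS}) when fetching #{url}" if redirect_count >= MAX_REDIRECTS

  entry = pages.fetch(url) { raise "Failed to fetch #{url}" }
  return entry unless entry.is_a?(Hash)

  # Each hop increments the counter, so a redirect loop terminates.
  follow(entry[:redirect], pages, redirect_count + 1)
end

pages = {
  'https://example.com/old' => { redirect: 'https://example.com/new' },
  'https://example.com/new' => '# Docs',
  'https://example.com/loop' => { redirect: 'https://example.com/loop' }
}

follow('https://example.com/old', pages) # returns "# Docs"
# follow('https://example.com/loop', pages) raises once the counter reaches the limit
```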
data/lib/llm_docs_builder/config.rb
CHANGED
@@ -57,7 +57,11 @@ module LlmDocsBuilder
     def merge_with_options(options)
       # CLI options override config file, config file provides defaults
       {
-        docs: options
+        docs: if options.key?(:docs)
+                options[:docs]
+              else
+                self['docs'] || '.'
+              end,
         base_url: options[:base_url] || self['base_url'],
         title: options[:title] || self['title'],
         description: options[:description] || self['description'],
@@ -171,7 +175,10 @@ module LlmDocsBuilder
                                else
                                  self['calculate_compression'] || false
                                end
-      }
+      }.tap do |merged|
+        merged[:content] = options[:content] if options.key?(:content)
+        merged[:source_url] = options[:source_url] if options.key?(:source_url)
+      end
     end

     # Check if a config file was found and exists
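Switching the `docs` entry to `options.key?(:docs)` changes what an explicit `nil` means: a caller that passes `docs: nil` (as the URL path in the CLI does) keeps `nil`, whereas an `||` chain would fall through to the config file or to `'.'`. A minimal sketch of the difference follows; the helper `merged_docs` is hypothetical and reduces the merge to this single key.

```ruby
# Hypothetical reduction of Config#merge_with_options to the :docs key only.
def merged_docs(options, config_docs)
  if options.key?(:docs)
    options[:docs]      # an explicit value wins, even when it is nil
  else
    config_docs || '.'  # fall back to the config file, then to the current directory
  end
end

merged_docs({}, nil)                        # "."
merged_docs({}, 'docs')                     # "docs"
merged_docs({ docs: 'README.md' }, 'docs')  # "README.md"
merged_docs({ docs: nil }, 'docs')          # nil, where `options[:docs] || config_docs` would give "docs"
```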
data/lib/llm_docs_builder/markdown_transformer.rb
CHANGED
@@ -55,7 +55,7 @@ module LlmDocsBuilder
     #
     # @return [String] transformed markdown content
     def transform
-      content =
+      content = load_content

       # Build and execute transformation pipeline
       content = cleanup_transformer.transform(content, options)
@@ -124,5 +124,16 @@ module LlmDocsBuilder
       }
       compressor.compress(content, compression_methods)
     end
+
+    # Load source content either from provided string or file path
+    #
+    # @return [String] markdown content to transform
+    def load_content
+      if options[:content]
+        options[:content].dup
+      else
+        File.read(file_path)
+      end
+    end
   end
 end
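`load_content` returns a copy of `options[:content]` rather than the string itself. The diff does not state why; one plausible reason is that a later in-place edit would otherwise alter the caller's string. The sketch below illustrates that effect under this assumption. The free-standing `load_content` here is a hypothetical stand-in for the private method, and the `gsub!` call is only an example of an in-place transformation, not a step taken from the gem.

```ruby
# Hypothetical stand-in for MarkdownTransformer#load_content.
def load_content(options, file_path)
  if options[:content]
    options[:content].dup  # copy, so in-place edits do not touch the caller's string
  else
    File.read(file_path)
  end
end

original = +"# Title\n\n\nBody\n"
content = load_content({ content: original }, nil)
content.gsub!(/\n{3,}/, "\n\n") # an in-place edit, for illustration only

original # still "# Title\n\n\nBody\n"
content  # "# Title\n\nBody\n"
```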
data/lib/llm_docs_builder/url_fetcher.rb
ADDED
@@ -0,0 +1,120 @@
+# frozen_string_literal: true
+
+require 'net/http'
+require 'uri'
+
+module LlmDocsBuilder
+  # Lightweight HTTP client for fetching remote documentation pages.
+  #
+  # Provides common functionality needed by multiple commands (transform, compare)
+  # including strict scheme validation, redirect handling and sensible timeouts.
+  class UrlFetcher
+    DEFAULT_USER_AGENT = 'llm-docs-builder/1.0 (+https://github.com/mensfeld/llm-docs-builder)'
+    MAX_REDIRECTS = 10
+
+    # @param user_agent [String] HTTP user agent header value
+    # @param verbose [Boolean] enable redirect logging
+    # @param output [IO] IO stream used for redirect logging
+    def initialize(user_agent: DEFAULT_USER_AGENT, verbose: false, output: $stdout)
+      @user_agent = user_agent
+      @verbose = verbose
+      @output = output
+    end
+
+    # Fetch remote URL content while following redirects.
+    #
+    # @param url_string [String] URL to fetch
+    # @param redirect_count [Integer] current redirect depth (internal use)
+    # @return [String] response body
+    # @raise [Errors::GenerationError] on invalid URLs, network failures, or redirect loops
+    def fetch(url_string, redirect_count = 0)
+      if redirect_count >= MAX_REDIRECTS
+        raise(
+          Errors::GenerationError,
+          "Too many redirects (#{MAX_REDIRECTS}) when fetching #{url_string}"
+        )
+      end
+
+      uri = validate_and_parse_url(url_string)
+
+      http = Net::HTTP.new(uri.host, uri.port)
+      http.use_ssl = uri.scheme == 'https'
+      http.open_timeout = 10
+      http.read_timeout = 30
+
+      request = Net::HTTP::Get.new(uri.request_uri)
+      request['User-Agent'] = @user_agent
+
+      response = http.request(request)
+
+      case response
+      when Net::HTTPSuccess
+        response.body
+      when Net::HTTPRedirection
+        redirect_url = absolute_redirect_url(uri, response['location'])
+        log_redirect(redirect_url)
+        fetch(redirect_url, redirect_count + 1)
+      else
+        raise(
+          Errors::GenerationError,
+          "Failed to fetch #{url_string}: #{response.code} #{response.message}"
+        )
+      end
+    rescue Errors::GenerationError
+      raise
+    rescue StandardError => e
+      raise(
+        Errors::GenerationError,
+        "Error fetching #{url_string}: #{e.message}"
+      )
+    end
+
+    private
+
+    def validate_and_parse_url(url_string)
+      uri = URI.parse(url_string)
+
+      unless %w[http https].include?(uri.scheme&.downcase)
+        raise(
+          Errors::GenerationError,
+          "Unsupported URL scheme: #{uri.scheme || 'none'} (only http/https allowed)"
+        )
+      end
+
+      if uri.host.nil? || uri.host.empty?
+        raise(
+          Errors::GenerationError,
+          "Invalid URL: missing host in #{url_string}"
+        )
+      end
+
+      uri
+    rescue URI::InvalidURIError => e
+      raise(
+        Errors::GenerationError,
+        "Invalid URL format: #{e.message}"
+      )
+    end
+
+    def absolute_redirect_url(base_uri, location)
+      raise(
+        Errors::GenerationError,
+        "Redirect missing location header for #{base_uri}"
+      ) if location.nil? || location.empty?
+
+      URI.join(base_uri, location).to_s
+    rescue URI::InvalidURIError => e
+      raise(
+        Errors::GenerationError,
+        "Invalid redirect URL from #{base_uri}: #{e.message}"
+      )
+    end
+
+    def log_redirect(url)
+      return unless @verbose
+
+      @output.puts(" Redirecting to #{url}...")
+    end
+  end
+end
+
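Compared with the code removed from the comparator, the new fetcher resolves the `Location` header against the URL that produced it, so relative redirects work. The sketch below shows how `URI.join` from Ruby's standard library behaves for the common cases, together with a scheme check in the style of `validate_and_parse_url`. The helper names `resolve_location` and `http_url?` are hypothetical, and the predicate returns `false` where the gem raises `Errors::GenerationError`.

```ruby
require 'uri'

# Resolve a Location header against the URL that produced it, as
# UrlFetcher#absolute_redirect_url does with URI.join.
def resolve_location(base_url, location)
  URI.join(base_url, location).to_s
end

resolve_location('https://example.com/docs/page.html', '/v2/page.html')
# "https://example.com/v2/page.html"
resolve_location('https://example.com/docs/page.html', 'other.html')
# "https://example.com/docs/other.html"
resolve_location('https://example.com/docs/page.html', 'https://cdn.example.net/page')
# "https://cdn.example.net/page"

# Hypothetical predicate in the style of validate_and_parse_url: only http and
# https URLs with a host are accepted.
def http_url?(url_string)
  uri = URI.parse(url_string)
  %w[http https].include?(uri.scheme&.downcase) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end
```

Because every hop is re-validated by `fetch`, a redirect to a non-HTTP scheme is rejected in the same way as an invalid starting URL.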
data/lib/llm_docs_builder.rb
CHANGED
@@ -25,6 +25,7 @@ module LlmDocsBuilder
   # @option options [Boolean] :convert_urls convert HTML URLs to markdown format (overrides
   # config)
   # @option options [Boolean] :verbose enable verbose output (overrides config)
+  # @option options [String] :content raw markdown content (used for remote sources)
   # @return [String] generated llms.txt content
   #
   # @example Generate from docs directory
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: llm-docs-builder
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.11.0
 platform: ruby
 authors:
 - Maciej Mensfeld
@@ -116,6 +116,7 @@ files:
 - ".rspec"
 - ".rubocop.yml"
 - ".ruby-version"
+- AGENTS.md
 - CHANGELOG.md
 - Dockerfile
 - Gemfile
@@ -143,6 +144,7 @@ files:
 - lib/llm_docs_builder/transformers/heading_transformer.rb
 - lib/llm_docs_builder/transformers/link_transformer.rb
 - lib/llm_docs_builder/transformers/whitespace_transformer.rb
+- lib/llm_docs_builder/url_fetcher.rb
 - lib/llm_docs_builder/validator.rb
 - lib/llm_docs_builder/version.rb
 - llm-docs-builder.gemspec