nukitori 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+ metadata.gz: 9aa7b220b6a1cfe138ce6a644fe38bf2aa4c3cd699f1cffd21ec67ccadcb451b
+ data.tar.gz: c94ac2f7da447a988c8e6b72049f0d6f908c18ad84c315f7d89988b009aa10a0
+ SHA512:
+ metadata.gz: 1ee431bc34a28cf4554eec19fe1be3599fa14f3de7f0aeff34efd702198c2e7fd3b7b0a69409b69e87d68811bb695e67deaf41eca755a369f0dfb53b5b00c414
+ data.tar.gz: e624dc374ca0d52b1e4a7bef3b892b1ad8284589089df829190d52be95ccd6307a10824e9614c9b7f4d00d3d77808995390553ea8192e8704489a0868940267e
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
+ # CHANGELOG
+
+ ## [0.1.0] - 2026-01-06
+
+ - Initial release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2026 Victor Afanasev
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,313 @@
+ # Nukitori
+
+ <img align="right" height="175px" src="https://habrastorage.org/webt/cc/se/er/ccseeryjqt-rto5biycw4twgyue.png" alt="Nukitori gem logo" />
+
+ Nukitori is a Ruby gem for HTML data extraction that uses an LLM once to generate reusable XPath schemas, then extracts data using plain Nokogiri (without AI) from similarly structured HTML pages. You describe the data you want to extract; Nukitori generates and reuses the scraping logic for you:
+
+ - **One-time LLM call** — generates a reusable XPath schema; all subsequent extractions run without AI
+ - **Robust reusable schemas** — avoids page-specific IDs, dynamic hashes, and fragile selectors
+ - **Transparent output** — generated schemas are plain JSON, easy to inspect, diff, and version
+ - **Token-optimized** — strips scripts, styles, and redundant DOM before sending HTML to the LLM
+ - **Any LLM provider** — works with OpenAI, Anthropic, Gemini, and local models
+
+ Define what you want to extract from HTML using a simple schema DSL:
+
+ ```ruby
+ # github_extract.rb
+ require 'nukitori'
+ require 'json'
+
+ html = "<HTML DOM from https://github.com/search?q=ruby+web+scraping&type=repositories>"
+
+ data = Nukitori(html, 'schema.json') do
+   integer :repositories_found_count
+   array :repositories do
+     object do
+       string :name
+       string :description
+       string :url
+       string :stars
+       array :tags, of: :string
+     end
+   end
+ end
+
+ File.write('results.json', JSON.pretty_generate(data))
+ ```
+
+ On the first run of `$ ruby github_extract.rb`, Nukitori uses AI to generate a reusable XPath extraction schema:
+
+ <details>
+ <summary><code>schema.json</code> (click to expand)</summary><br>
+
+ ```json
+ {
+   "repositories_found_count": {
+     "xpath": "//a[@data-testid='nav-item-repositories']//span[@data-testid='resolved-count-label']",
+     "type": "integer"
+   },
+   "repositories": {
+     "type": "array",
+     "container_xpath": "//div[@data-testid='results-list']/*[.//div[contains(@class, 'search-title')]]",
+     "items": {
+       "name": {
+         "xpath": ".//div[contains(@class, 'search-title')]//a",
+         "type": "string"
+       },
+       "description": {
+         "xpath": ".//h3/following-sibling::div[1]",
+         "type": "string"
+       },
+       "url": {
+         "xpath": ".//div[contains(@class, 'search-title')]//a/@href",
+         "type": "string"
+       },
+       "stars": {
+         "xpath": ".//a[contains(@href, '/stargazers')]",
+         "type": "string"
+       },
+       "tags": {
+         "type": "array",
+         "container_xpath": ".//a[contains(@href, '/topics/')]",
+         "items": {
+           "xpath": ".",
+           "type": "string"
+         }
+       }
+     }
+   }
+ }
+ ```
+ </details>
+
+ After that, Nukitori extracts structured data from similar HTML pages without any LLM calls, in milliseconds:
+
+ <details>
+ <summary><code>results.json</code> (click to expand)</summary><br>
+
+ ```json
+ {
+   "repositories_found_count": 314,
+   "repositories": [
+     {
+       "name": "sparklemotion/mechanize",
+       "description": "Mechanize is a ruby library that makes automated web interaction easy.",
+       "url": "/sparklemotion/mechanize",
+       "stars": "4.4k",
+       "tags": ["ruby", "web", "scraping"]
+     },
+     {
+       "name": "jaimeiniesta/metainspector",
+       "description": "Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...",
+       "url": "/jaimeiniesta/metainspector",
+       "stars": "1k",
+       "tags": []
+     },
+     {
+       "name": "vifreefly/kimuraframework",
+       "description": "Kimurai is a modern Ruby web scraping framework designed to scrape and interact with JavaScript-rendered websites using headless Chromium…",
+       "url": "/vifreefly/kimuraframework",
+       "stars": "1.1k",
+       "tags": ["ruby", "crawler", "scraper", "web-scraping", "scrapy"]
+     },
+     //...
+   ]
+ }
+ ```
+ </details>
+
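Under the hood, applying a generated schema is conceptually just a series of XPath lookups: one `container_xpath` to find the repeated nodes, then per-field XPaths evaluated relative to each node. The sketch below is a hypothetical, dependency-free illustration of that idea — Nukitori itself uses Nokogiri, and the markup and selectors here are invented for the example (stdlib REXML is used only to keep it runnable without gems):

```ruby
require 'rexml/document'

# Invented, well-formed markup standing in for a real results page
html = <<~HTML
  <div id="results">
    <div class="repo"><a href="/a/b">a/b</a><span class="stars">4.4k</span></div>
    <div class="repo"><a href="/c/d">c/d</a><span class="stars">1k</span></div>
  </div>
HTML

doc = REXML::Document.new(html)

# container_xpath selects each repeated record; field XPaths are
# evaluated relative to the container node
repos = REXML::XPath.match(doc, "//div[@class='repo']").map do |node|
  {
    'name'  => REXML::XPath.first(node, './/a').text,
    'url'   => REXML::XPath.first(node, './/a/@href').value,
    'stars' => REXML::XPath.first(node, ".//span[@class='stars']").text
  }
end

repos.first['stars'] # => "4.4k"
```

Because this step is plain XPath evaluation, it runs in milliseconds and is fully deterministic — the LLM is only needed to produce the selectors once.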
+ ## Installation
+
+ `$ gem install nukitori`, or add `gem 'nukitori'` to your Gemfile. Requires Ruby `3.2` or later.
+
+
+ ## Configuration
+
+ ```ruby
+ require 'nukitori'
+
+ Nukitori.configure do |config|
+   config.default_model = 'gpt-5.2'
+   config.openai_api_key = '<OPENAI_API_KEY>'
+
+   # or
+   config.default_model = 'claude-haiku-4-5-20251001'
+   config.anthropic_api_key = '<ANTHROPIC_API_KEY>'
+
+   # or
+   config.default_model = 'gemini-3-flash-preview'
+   config.gemini_api_key = '<GEMINI_API_KEY>'
+
+   # or
+   config.default_model = 'deepseek-chat'
+   config.deepseek_api_key = '<DEEPSEEK_API_KEY>'
+ end
+ ```
+
+ You can also use custom OpenAI API-compatible models (including local ones) by setting the API base URL. Example with Z.AI:
+
+ ```ruby
+ Nukitori.configure do |config|
+   config.default_model = 'glm-4.7'
+
+   config.openai_use_system_role = true # optional, depends on the API
+   config.openai_api_base = 'https://api.z.ai/api/paas/v4/'
+   config.openai_api_key = '<ZAI_API_KEY>'
+ end
+ ```
+
+ ## Usage
+
+ Use the [RubyLLM::Schema format](https://github.com/danielfriis/ruby_llm-schema) to define extraction schemas. Supported schema property types:
+ * `string` - the type to use in most cases
+ * `integer` - parses the extracted string into a Ruby Integer
+ * `number` - parses the extracted string into a Ruby Float
+
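As a rough sketch of what these type declarations imply (assumed behavior for illustration, not Nukitori's actual implementation), the value is always extracted from the DOM as text, and the declared type drives a plain Ruby conversion:

```ruby
# Hypothetical coercion sketch: extracted values arrive as strings,
# and the declared schema type picks the conversion. Field names here
# are invented for the example.
raw = { repositories_found_count: '314', average_rating: '4.75' }

coerced = {
  repositories_found_count: Integer(raw[:repositories_found_count], 10), # integer
  average_rating:           Float(raw[:average_rating])                  # number
}

coerced[:repositories_found_count] # => 314
coerced[:average_rating]           # => 4.75
```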
+ Tip: if the LLM has trouble finding the correct XPath for a field, use the `description` option to spell out exactly what should be scraped for it:
+
+ ```ruby
+ data = Nukitori(html, 'product_schema.json') do
+   string :name, description: 'Product name'
+   string :availability, description: 'Product availability, in stock or out of stock'
+   string :description, description: 'Short product description'
+   string :manufacturer
+   string :price
+ end
+ ```
+
+ ### Extended API
+
+ ```ruby
+ require 'nukitori'
+
+ # Define extraction schema
+ schema_generator = Nukitori::SchemaGenerator.new do
+   array :products do
+     object do
+       string :name
+       string :price
+       string :availability
+     end
+   end
+ end
+
+ # Generate the extraction schema (uses the LLM); returns the schema as a Ruby hash
+ extraction_schema = schema_generator.create_extraction_schema_for(html)
+
+ # Optionally save it to a file or database for reuse
+ # File.write('extraction_schema.json', JSON.pretty_generate(extraction_schema))
+
+ # Extract data from HTML using the previously generated extraction_schema (no LLM)
+ schema_extractor = Nukitori::SchemaExtractor.new(extraction_schema)
+ data = schema_extractor.extract(html)
+ ```
+
+ ### With Custom Model
+
+ ```ruby
+ schema_generator = Nukitori::SchemaGenerator.new(model: 'claude-haiku-4-5-20251001') do
+   string :title
+   number :price
+ end
+
+ extraction_schema = schema_generator.create_extraction_schema_for(html)
+ ```
+
+ ### LLM-only extraction (no schemas)
+
+ Nukitori can also extract data directly with an LLM, without generating or using XPath schemas. In this mode, every extraction call invokes the LLM and relies on its structured output capabilities.
+
+ This approach trades higher cost and latency for greater flexibility: the LLM can not only extract values from HTML, but also normalize, convert, and transform them based on the declared field types.
+
+ ```ruby
+ # If no schema path is provided, Nukitori uses the LLM
+ # for data extraction on every run
+ data = Nukitori(html) do
+   string :repo_name
+   number :stars_count
+ end
+ ```
+
+ <details>
+ <summary>When is LLM-only extraction useful? (click to expand)</summary><br>
+
+ Consider scraping a GitHub repository page that shows 1.1k stars. With a reusable XPath schema, Nukitori extracts exactly what appears in the HTML. If the value is rendered as `"1.1k"`, that is what the extractor receives.
+
+ ```ruby
+ # XPath-based extraction (LLM used only once to generate the schema)
+ data = Nukitori(html, 'schema.json') do
+   number :stars_count
+ end
+
+ # The result reflects the literal HTML value `1.1k` converted to a float:
+ # => { "stars_count" => 1.1 }
+ ```
+
+ To convert `"1.1k"` into `1100`, you would need to extract the value as a string (`string :stars_count`) and then add custom post-processing logic.
+
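Such post-processing could look like this hypothetical helper (not part of Nukitori; shown only to illustrate the extra work the schema-based mode would require):

```ruby
# Hypothetical helper: expand GitHub-style abbreviated counts like
# "1.1k" into integers after extracting them with `string :stars_count`.
def expand_count(str)
  case str
  when /\A([\d.]+)k\z/i then (Regexp.last_match(1).to_f * 1_000).round
  when /\A([\d.]+)m\z/i then (Regexp.last_match(1).to_f * 1_000_000).round
  else Integer(str, 10)
  end
end

expand_count('1.1k') # => 1100
expand_count('314')  # => 314
```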
+ With LLM-only extraction, Nukitori can infer the intended numeric value directly:
+
+ ```ruby
+ # LLM-only extraction (LLM called on every run)
+ data = Nukitori(html) do
+   number :stars_count
+ end
+
+ # The LLM interprets "1.1k" as 1100
+ # => { "stars_count" => 1100 }
+ ```
+
+ **Pros**
+ * Flexible output schemas
+ * Automatic normalization and value conversion
+ * Useful for semantic or non-trivial transformations
+
+ **Cons**
+ * LLM call on every extraction
+ * Higher cost and latency
+ * Less deterministic than schema-based extraction
+
+ Use LLM-only extraction when you need semantic understanding or complex value normalization, or when running against cheap or local LLMs. For high-volume or long-running scrapers, reusable XPath schemas are usually the better choice.
+
+ </details>
+
+
+ ## Model Benchmarks
+
+ Benchmarked by generating the following extraction schema from this page's HTML DOM:
+
+ ```ruby
+ data = Nukitori(html, 'schema.json') do
+   string :name
+   string :desc
+   string :stars_count
+   array :tags, of: :string
+ end
+ ```
+
+ | Provider | Model | Time |
+ |----------|-------|------|
+ | OpenAI | `gpt-5.2` | ~7s |
+ | OpenAI | `gpt-5` | ~35s |
+ | OpenAI | `gpt-5-mini` | ~18s |
+ | OpenAI | `gpt-5-nano` | ~32s (may generate incomplete schemas) |
+ | Gemini | `gemini-3-flash-preview` | ~11s |
+ | Gemini | `gemini-3-pro-preview` | ~30s |
+ | Anthropic | `claude-opus-4-5-20251101` | ~6.5s |
+ | Anthropic | `claude-sonnet-4-5-20250929` | ~7s |
+ | Anthropic | `claude-haiku-4-5-20251001` | ~3.5s |
+ | DeepSeek | `deepseek-chat` (V3.2) | ~10s |
+ | Z.AI | `glm-4.7` | ~1m |
+ | Z.AI | `glm-4.5-airx` | ~30s |
+
+ **Recommendation:** Based on my testing, models like `gpt-5.2` or `gemini-3-flash-preview` offer the best balance of speed and reliability for generating complex nested extraction schemas. They consistently generate robust XPaths that work across similar HTML pages.
+
+ ## Thanks to
+ * [Nokogiri](https://github.com/sparklemotion/nokogiri)
+ * [RubyLLM](https://github.com/crmne/ruby_llm)
+
+ ## License
+
+ MIT
data/Rakefile ADDED
@@ -0,0 +1,12 @@
+ # frozen_string_literal: true
+
+ require 'bundler/gem_tasks'
+ require 'rspec/core/rake_task'
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ require 'rubocop/rake_task'
+
+ RuboCop::RakeTask.new
+
+ task default: %i[spec rubocop]
@@ -0,0 +1,31 @@
+ # frozen_string_literal: true
+
+ module Nukitori
+   class ChatFactory
+     class << self
+       def create(model: nil)
+         options = {}
+         options[:model] = model if model
+
+         begin
+           RubyLLM.chat(**options)
+         rescue RubyLLM::ModelNotFoundError
+           # If custom OpenAI-compatible API is configured, add required options
+           if custom_openai_api?
+             options[:provider] = :openai
+             options[:assume_model_exists] = true
+           end
+
+           RubyLLM.chat(**options)
+         end
+       end
+
+       private
+
+       def custom_openai_api?
+         base = RubyLLM.config.openai_api_base
+         base && base != 'https://api.openai.com/v1/'
+       end
+     end
+   end
+ end
@@ -0,0 +1,21 @@
+ # frozen_string_literal: true
+
+ module Nukitori
+   # Preprocesses HTML to reduce token size for LLM
+   class HtmlPreprocessor
+     # @param html [String, Nokogiri::HTML::Document] HTML string or Nokogiri document
+     # @return [String] Cleaned HTML
+     def self.process(html)
+       doc = html.is_a?(Nokogiri::HTML::Document) ? html.dup : Nokogiri::HTML(html)
+
+       # Remove non-content elements
+       doc.css('script, style, noscript, svg, path, meta, link, head').remove
+
+       # Remove style attributes
+       doc.css('*').each { |node| node.remove_attribute('style') }
+
+       # Collapse whitespace
+       doc.to_html.gsub(/\s+/, ' ')
+     end
+   end
+ end
@@ -0,0 +1,52 @@
+ # frozen_string_literal: true
+
+ module Nukitori
+   # Extracts data directly using LLM (no schema generation/caching)
+   class LlmExtractor
+     class << self
+       # Extract data from HTML using LLM directly
+       # @param html [String, Nokogiri::HTML::Document] HTML content
+       # @param model [String, nil] LLM model to use (overrides default_model)
+       # @param block [Proc] Schema definition block
+       # @return [Hash] Extracted data
+       def extract(html, model: nil, &block)
+         raise ArgumentError, 'Block required for schema definition' unless block_given?
+
+         schema_class = Class.new(RubyLLM::Schema, &block)
+         processed_html = HtmlPreprocessor.process(html)
+
+         chat = ChatFactory.create(model:)
+         chat.with_schema(schema_class) if support_structured_output?(chat.model)
+         chat.with_instructions(build_prompt(schema_class))
+
+         response = chat.ask(processed_html)
+         ResponseParser.parse(response.content)
+       end
+
+       private
+
+       def support_structured_output?(model)
+         model.capabilities.include?('structured_output') && !model.id.include?('deepseek')
+       end
+
+       def build_prompt(schema_class)
+         schema = JSON.parse(schema_class.new.to_json)
+         properties = schema.dig('schema', 'properties')
+
+         <<~PROMPT
+           You are a web data extraction expert.
+
+           ## Task
+           Extract data from the provided HTML according to the JSON schema.
+           Return ONLY valid JSON, no other text.
+           STRICTLY FOLLOW the requirements schema provided.
+
+           ## Requirements Schema (what to extract)
+           ```json
+           #{properties.to_json}
+           ```
+         PROMPT
+       end
+     end
+   end
+ end