nukitori 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+ metadata.gz: 9aa7b220b6a1cfe138ce6a644fe38bf2aa4c3cd699f1cffd21ec67ccadcb451b
+ data.tar.gz: c94ac2f7da447a988c8e6b72049f0d6f908c18ad84c315f7d89988b009aa10a0
+ SHA512:
+ metadata.gz: 1ee431bc34a28cf4554eec19fe1be3599fa14f3de7f0aeff34efd702198c2e7fd3b7b0a69409b69e87d68811bb695e67deaf41eca755a369f0dfb53b5b00c414
+ data.tar.gz: e624dc374ca0d52b1e4a7bef3b892b1ad8284589089df829190d52be95ccd6307a10824e9614c9b7f4d00d3d77808995390553ea8192e8704489a0868940267e
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
+ # CHANGELOG
+
+ ## [0.1.0] - 2026-01-06
+
+ - Initial release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2026 Victor Afanasev
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,313 @@
+ # Nukitori
+
+ <img align="right" height="175px" src="https://habrastorage.org/webt/cc/se/er/ccseeryjqt-rto5biycw4twgyue.png" alt="Nukitori gem logo" />
+
+ Nukitori is a Ruby gem for HTML data extraction that uses an LLM once to generate reusable XPath schemas, then extracts data using plain Nokogiri (without AI) from similarly structured HTML pages. You describe the data you want to extract; Nukitori generates and reuses the scraping logic for you:
+
+ - **One-time LLM call** — generates a reusable XPath schema; all subsequent extractions run without AI
+ - **Robust reusable schemas** — avoids page-specific IDs, dynamic hashes, and fragile selectors
+ - **Transparent output** — generated schemas are plain JSON, easy to inspect, diff, and version
+ - **Token-optimized** — strips scripts, styles, and redundant DOM before sending HTML to the LLM
+ - **Any LLM provider** — works with OpenAI, Anthropic, Gemini, and local models
+
+ Define what you want to extract from HTML using a simple schema DSL:
+
+ ```ruby
+ # github_extract.rb
+ require 'nukitori'
+ require 'json'
+
+ html = "<HTML DOM from https://github.com/search?q=ruby+web+scraping&type=repositories>"
+
+ data = Nukitori(html, 'schema.json') do
+   integer :repositories_found_count
+   array :repositories do
+     object do
+       string :name
+       string :description
+       string :url
+       string :stars
+       array :tags, of: :string
+     end
+   end
+ end
+
+ File.write('results.json', JSON.pretty_generate(data))
+ ```
+
+ On the first run of `$ ruby github_extract.rb`, Nukitori uses AI to generate a reusable XPath extraction schema:
+
+ <details>
+ <summary><code>schema.json</code> (click to expand)</summary><br>
+
+ ```json
+ {
+   "repositories_found_count": {
+     "xpath": "//a[@data-testid='nav-item-repositories']//span[@data-testid='resolved-count-label']",
+     "type": "integer"
+   },
+   "repositories": {
+     "type": "array",
+     "container_xpath": "//div[@data-testid='results-list']/*[.//div[contains(@class, 'search-title')]]",
+     "items": {
+       "name": {
+         "xpath": ".//div[contains(@class, 'search-title')]//a",
+         "type": "string"
+       },
+       "description": {
+         "xpath": ".//h3/following-sibling::div[1]",
+         "type": "string"
+       },
+       "url": {
+         "xpath": ".//div[contains(@class, 'search-title')]//a/@href",
+         "type": "string"
+       },
+       "stars": {
+         "xpath": ".//a[contains(@href, '/stargazers')]",
+         "type": "string"
+       },
+       "tags": {
+         "type": "array",
+         "container_xpath": ".//a[contains(@href, '/topics/')]",
+         "items": {
+           "xpath": ".",
+           "type": "string"
+         }
+       }
+     }
+   }
+ }
+ ```
+ </details>
+
+ After that, Nukitori extracts structured data from similar HTML pages without any LLM calls, in milliseconds:
+
+ <details>
+ <summary><code>results.json</code> (click to expand)</summary><br>
+
+ ```json
+ {
+   "repositories_found_count": 314,
+   "repositories": [
+     {
+       "name": "sparklemotion/mechanize",
+       "description": "Mechanize is a ruby library that makes automated web interaction easy.",
+       "url": "/sparklemotion/mechanize",
+       "stars": "4.4k",
+       "tags": ["ruby", "web", "scraping"]
+     },
+     {
+       "name": "jaimeiniesta/metainspector",
+       "description": "Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...",
+       "url": "/jaimeiniesta/metainspector",
+       "stars": "1k",
+       "tags": []
+     },
+     {
+       "name": "vifreefly/kimuraframework",
+       "description": "Kimurai is a modern Ruby web scraping framework designed to scrape and interact with JavaScript-rendered websites using headless Chromium…",
+       "url": "/vifreefly/kimuraframework",
+       "stars": "1.1k",
+       "tags": ["ruby", "crawler", "scraper", "web-scraping", "scrapy"]
+     },
+     //...
+   ]
+ }
+ ```
+ </details>
+
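Under the hood, applying a generated schema is conceptually just a series of XPath lookups: one `container_xpath` to find the repeated nodes, then per-field XPaths evaluated relative to each node. The sketch below is a hypothetical, dependency-free illustration of that idea — Nukitori itself uses Nokogiri, and the markup and selectors here are invented for the example (stdlib REXML is used only to keep it runnable without gems):

```ruby
require 'rexml/document'

# Invented, well-formed markup standing in for a real results page
html = <<~HTML
  <div id="results">
    <div class="repo"><a href="/a/b">a/b</a><span class="stars">4.4k</span></div>
    <div class="repo"><a href="/c/d">c/d</a><span class="stars">1k</span></div>
  </div>
HTML

doc = REXML::Document.new(html)

# container_xpath selects each repeated record; field XPaths are
# evaluated relative to the container node
repos = REXML::XPath.match(doc, "//div[@class='repo']").map do |node|
  {
    'name'  => REXML::XPath.first(node, './/a').text,
    'url'   => REXML::XPath.first(node, './/a/@href').value,
    'stars' => REXML::XPath.first(node, ".//span[@class='stars']").text
  }
end

repos.first['stars'] # => "4.4k"
```

Because this step is plain XPath evaluation, it runs in milliseconds and is fully deterministic — the LLM is only needed to produce the selectors once.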
+ ## Installation
+
+ `$ gem install nukitori`, or add `gem 'nukitori'` to your Gemfile. Requires Ruby `3.2` or later.
+
+
+ ## Configuration
+
+ ```ruby
+ require 'nukitori'
+
+ Nukitori.configure do |config|
+   config.default_model = 'gpt-5.2'
+   config.openai_api_key = '<OPENAI_API_KEY>'
+
+   # or
+   config.default_model = 'claude-haiku-4-5-20251001'
+   config.anthropic_api_key = '<ANTHROPIC_API_KEY>'
+
+   # or
+   config.default_model = 'gemini-3-flash-preview'
+   config.gemini_api_key = '<GEMINI_API_KEY>'
+
+   # or
+   config.default_model = 'deepseek-chat'
+   config.deepseek_api_key = '<DEEPSEEK_API_KEY>'
+ end
+ ```
+
+ You can also use custom OpenAI API-compatible models (including local ones) by setting the API base URL. Example with Z.AI:
+
+ ```ruby
+ Nukitori.configure do |config|
+   config.default_model = 'glm-4.7'
+
+   config.openai_use_system_role = true # optional, depends on the API
+   config.openai_api_base = 'https://api.z.ai/api/paas/v4/'
+   config.openai_api_key = '<ZAI_API_KEY>'
+ end
+ ```
+
+ ## Usage
+
+ Use the [RubyLLM::Schema format](https://github.com/danielfriis/ruby_llm-schema) to define extraction schemas. Supported schema property types:
+ * `string` - the type to use in most cases
+ * `integer` - parses the extracted string into a Ruby Integer
+ * `number` - parses the extracted string into a Ruby Float
+
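As a rough sketch of what these type declarations imply (assumed behavior for illustration, not Nukitori's actual implementation), the value is always extracted from the DOM as text, and the declared type drives a plain Ruby conversion:

```ruby
# Hypothetical coercion sketch: extracted values arrive as strings,
# and the declared schema type picks the conversion. Field names here
# are invented for the example.
raw = { repositories_found_count: '314', average_rating: '4.75' }

coerced = {
  repositories_found_count: Integer(raw[:repositories_found_count], 10), # integer
  average_rating:           Float(raw[:average_rating])                  # number
}

coerced[:repositories_found_count] # => 314
coerced[:average_rating]           # => 4.75
```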
+ Tip: if the LLM has trouble finding the correct XPath for a field, use the `description` option to spell out exactly what should be scraped for it:
+
+ ```ruby
+ data = Nukitori(html, 'product_schema.json') do
+   string :name, description: 'Product name'
+   string :availability, description: 'Product availability, in stock or out of stock'
+   string :description, description: 'Short product description'
+   string :manufacturer
+   string :price
+ end
+ ```
+
+ ### Extended API
+
+ ```ruby
+ require 'nukitori'
+
+ # Define extraction schema
+ schema_generator = Nukitori::SchemaGenerator.new do
+   array :products do
+     object do
+       string :name
+       string :price
+       string :availability
+     end
+   end
+ end
+
+ # Generate the extraction schema (uses the LLM); returns the schema as a Ruby hash
+ extraction_schema = schema_generator.create_extraction_schema_for(html)
+
+ # Optionally save it to a file or database for reuse
+ # File.write('extraction_schema.json', JSON.pretty_generate(extraction_schema))
+
+ # Extract data from HTML using the previously generated extraction_schema (no LLM)
+ schema_extractor = Nukitori::SchemaExtractor.new(extraction_schema)
+ data = schema_extractor.extract(html)
+ ```
+
+ ### With Custom Model
+
+ ```ruby
+ schema_generator = Nukitori::SchemaGenerator.new(model: 'claude-haiku-4-5-20251001') do
+   string :title
+   number :price
+ end
+
+ extraction_schema = schema_generator.create_extraction_schema_for(html)
+ ```
+
+ ### LLM-only extraction (no schemas)
+
+ Nukitori can also extract data directly with an LLM, without generating or using XPath schemas. In this mode, every extraction call invokes the LLM and relies on its structured output capabilities.
+
+ This approach trades higher cost and latency for greater flexibility: the LLM can not only extract values from HTML, but also normalize, convert, and transform them based on the declared field types.
+
+ ```ruby
+ # If no schema path is provided, Nukitori uses the LLM
+ # for data extraction on every run
+ data = Nukitori(html) do
+   string :repo_name
+   number :stars_count
+ end
+ ```
+
+ <details>
+ <summary>When is LLM-only extraction useful? (click to expand)</summary><br>
+
+ Consider scraping a GitHub repository page that shows 1.1k stars. With a reusable XPath schema, Nukitori extracts exactly what appears in the HTML. If the value is rendered as `"1.1k"`, that is what the extractor receives.
+
+ ```ruby
+ # XPath-based extraction (LLM used only once to generate the schema)
+ data = Nukitori(html, 'schema.json') do
+   number :stars_count
+ end
+
+ # The result reflects the literal HTML value `1.1k` converted to a float:
+ # => { "stars_count" => 1.1 }
+ ```
+
+ To convert `"1.1k"` into `1100`, you would need to extract the value as a string (`string :stars_count`) and then add custom post-processing logic.
+
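Such post-processing could look like this hypothetical helper (not part of Nukitori; shown only to illustrate the extra work the schema-based mode would require):

```ruby
# Hypothetical helper: expand GitHub-style abbreviated counts like
# "1.1k" into integers after extracting them with `string :stars_count`.
def expand_count(str)
  case str
  when /\A([\d.]+)k\z/i then (Regexp.last_match(1).to_f * 1_000).round
  when /\A([\d.]+)m\z/i then (Regexp.last_match(1).to_f * 1_000_000).round
  else Integer(str, 10)
  end
end

expand_count('1.1k') # => 1100
expand_count('314')  # => 314
```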
+ With LLM-only extraction, Nukitori can infer the intended numeric value directly:
+
+ ```ruby
+ # LLM-only extraction (LLM called on every run)
+ data = Nukitori(html) do
+   number :stars_count
+ end
+
+ # The LLM interprets "1.1k" as 1100
+ # => { "stars_count" => 1100 }
+ ```
+
+ **Pros**
+ * Flexible output schemas
+ * Automatic normalization and value conversion
+ * Useful for semantic or non-trivial transformations
+
+ **Cons**
+ * LLM call on every extraction
+ * Higher cost and latency
+ * Less deterministic than schema-based extraction
+
+ Use LLM-only extraction when you need semantic understanding or complex value normalization, or when running against cheap or local LLMs. For high-volume or long-running scrapers, reusable XPath schemas are usually the better choice.
+
+ </details>
+
+
+ ## Model Benchmarks
+
+ Benchmarked by generating the following extraction schema from this page's HTML DOM:
+
+ ```ruby
+ data = Nukitori(html, 'schema.json') do
+   string :name
+   string :desc
+   string :stars_count
+   array :tags, of: :string
+ end
+ ```
+
+ | Provider | Model | Time |
+ |----------|-------|------|
+ | OpenAI | `gpt-5.2` | ~7s |
+ | OpenAI | `gpt-5` | ~35s |
+ | OpenAI | `gpt-5-mini` | ~18s |
+ | OpenAI | `gpt-5-nano` | ~32s (may generate incomplete schemas) |
+ | Gemini | `gemini-3-flash-preview` | ~11s |
+ | Gemini | `gemini-3-pro-preview` | ~30s |
+ | Anthropic | `claude-opus-4-5-20251101` | ~6.5s |
+ | Anthropic | `claude-sonnet-4-5-20250929` | ~7s |
+ | Anthropic | `claude-haiku-4-5-20251001` | ~3.5s |
+ | DeepSeek | `deepseek-chat` (V3.2) | ~10s |
+ | Z.AI | `glm-4.7` | ~1m |
+ | Z.AI | `glm-4.5-airx` | ~30s |
+
+ **Recommendation:** Based on my testing, models like `gpt-5.2` or `gemini-3-flash-preview` offer the best balance of speed and reliability for generating complex nested extraction schemas. They consistently generate robust XPaths that work across similar HTML pages.
+
+ ## Thanks to
+ * [Nokogiri](https://github.com/sparklemotion/nokogiri)
+ * [RubyLLM](https://github.com/crmne/ruby_llm)
+
+ ## License
+
+ MIT
data/Rakefile ADDED
@@ -0,0 +1,12 @@
+ # frozen_string_literal: true
+
+ require 'bundler/gem_tasks'
+ require 'rspec/core/rake_task'
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ require 'rubocop/rake_task'
+
+ RuboCop::RakeTask.new
+
+ task default: %i[spec rubocop]
@@ -0,0 +1,31 @@
+ # frozen_string_literal: true
+
+ module Nukitori
+   class ChatFactory
+     class << self
+       def create(model: nil)
+         options = {}
+         options[:model] = model if model
+
+         begin
+           RubyLLM.chat(**options)
+         rescue RubyLLM::ModelNotFoundError
+           # If custom OpenAI-compatible API is configured, add required options
+           if custom_openai_api?
+             options[:provider] = :openai
+             options[:assume_model_exists] = true
+           end
+
+           RubyLLM.chat(**options)
+         end
+       end
+
+       private
+
+       def custom_openai_api?
+         base = RubyLLM.config.openai_api_base
+         base && base != 'https://api.openai.com/v1/'
+       end
+     end
+   end
+ end
@@ -0,0 +1,21 @@
+ # frozen_string_literal: true
+
+ module Nukitori
+   # Preprocesses HTML to reduce token size for LLM
+   class HtmlPreprocessor
+     # @param html [String, Nokogiri::HTML::Document] HTML string or Nokogiri document
+     # @return [String] Cleaned HTML
+     def self.process(html)
+       doc = html.is_a?(Nokogiri::HTML::Document) ? html.dup : Nokogiri::HTML(html)
+
+       # Remove non-content elements
+       doc.css('script, style, noscript, svg, path, meta, link, head').remove
+
+       # Remove style attributes
+       doc.css('*').each { |node| node.remove_attribute('style') }
+
+       # Collapse whitespace
+       doc.to_html.gsub(/\s+/, ' ')
+     end
+   end
+ end
@@ -0,0 +1,52 @@
+ # frozen_string_literal: true
+
+ module Nukitori
+   # Extracts data directly using LLM (no schema generation/caching)
+   class LlmExtractor
+     class << self
+       # Extract data from HTML using LLM directly
+       # @param html [String, Nokogiri::HTML::Document] HTML content
+       # @param model [String, nil] LLM model to use (overrides default_model)
+       # @param block [Proc] Schema definition block
+       # @return [Hash] Extracted data
+       def extract(html, model: nil, &block)
+         raise ArgumentError, 'Block required for schema definition' unless block_given?
+
+         schema_class = Class.new(RubyLLM::Schema, &block)
+         processed_html = HtmlPreprocessor.process(html)
+
+         chat = ChatFactory.create(model:)
+         chat.with_schema(schema_class) if support_structured_output?(chat.model)
+         chat.with_instructions(build_prompt(schema_class))
+
+         response = chat.ask(processed_html)
+         ResponseParser.parse(response.content)
+       end
+
+       private
+
+       def support_structured_output?(model)
+         model.capabilities.include?('structured_output') && !model.id.include?('deepseek')
+       end
+
+       def build_prompt(schema_class)
+         schema = JSON.parse(schema_class.new.to_json)
+         properties = schema.dig('schema', 'properties')
+
+         <<~PROMPT
+           You are a web data extraction expert.
+
+           ## Task
+           Extract data from the provided HTML according to the JSON schema.
+           Return ONLY valid JSON, no other text.
+           STRICTLY FOLLOW the requirements schema provided.
+
+           ## Requirements Schema (what to extract)
+           ```json
+           #{properties.to_json}
+           ```
+         PROMPT
+       end
+     end
+   end
+ end