kimurai 2.0.1 → 2.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/.rspec +3 -0
- data/.rubocop.yml +2 -0
- data/CHANGELOG.md +14 -0
- data/LICENSE.txt +1 -1
- data/README.md +183 -39
- data/Rakefile +3 -7
- data/kimurai.gemspec +5 -1
- data/lib/kimurai/base/saver.rb +34 -20
- data/lib/kimurai/base.rb +15 -4
- data/lib/kimurai/base_helper.rb +10 -0
- data/lib/kimurai/browser_builder.rb +7 -0
- data/lib/kimurai/capybara_ext/session.rb +5 -3
- data/lib/kimurai/version.rb +1 -1
- data/lib/kimurai.rb +44 -0
- metadata +45 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b0f990c2292eebb911b6036b7515fdbe4b844f75dc20ec032c1da352de740c80
+  data.tar.gz: 13f35756781bb2a0c8f14fe246edc87e428cd5de6bc46c269128d764b9f2763c
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b2f236b701e505bba6e03083fc5e4308b125ba09203db04d41c1e72b0ccfbcb6957c35e6cc05e8f10ae2d3d96aeb7cdb1e7f24e7770406a8894f6020ec53c5c4
+  data.tar.gz: d47288711341145af98b0ad547583dae4ce26f38d6b024b4994a40eb9fc13f802238b0bf93eb0cd8b06d91ff098eceb2b837b50ef9e84ef3d0a98a25a70d0ce8
data/.gitignore
CHANGED
data/.rspec
ADDED
data/.rubocop.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,4 +1,18 @@
 # CHANGELOG
+## 2.1.0
+### New
+* Min. required Ruby version is 3.2.0
+* **AI-powered data extraction with `extract` method** — Powered by [Nukitori](https://github.com/vifreefly/nukitori). Describe the data structure you want and let AI generate XPath selectors automatically. Selectors are cached for reuse, so AI is only called once per page type
+* **Configure Nukitori via Kimurai** — Set LLM provider settings (OpenAI, Anthropic, Gemini, etc.) directly in `Kimurai.configure` block
+* **Engine aliases** — Use shorter engine names: `:chrome` (alias for `:selenium_chrome`), `:firefox` (alias for `:selenium_firefox`)
+* **Top-level `@delay` option** — Set request delay directly as `@delay = 2..5` instead of nested `@config = { before_request: { delay: 2..5 } }`
+* **Auto spider name** — If `@name` is not provided, it's automatically derived from the class name
+* **Save array of items** — `save_to` helper now accepts an array of items to save at once
+
+### Improvements
+* `save_to` helper now uses pretty JSON by default for `:json` format (use `format: :compact_json` for compact output)
+* Request delay is now applied before the response is passed to the callback
+
 ## 2.0.1
 ### Fixes
 * Remove xpath as default Capybara selector type (fixes https://github.com/vifreefly/kimuraframework/issues/28)
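
To make the 2.1.0 conveniences above concrete, here is a minimal illustrative spider exercising the new options (the URL and XPath are placeholders, not from the package):

```ruby
require 'kimurai'

class ProductsSpider < Kimurai::Base
  # No @name needed: it is derived from the class name => "products_spider"
  @engine = :chrome                               # new alias for :selenium_chrome
  @start_urls = ["https://example.com/products"]  # placeholder URL
  @delay = 2..5                                   # top-level delay, no nested @config needed

  def parse(response, url:, data: {})
    items = response.xpath("//div[@class='product']").map do |node|
      { name: node.text.strip }
    end

    # save_to now accepts a whole array of items; :json output is pretty-printed
    save_to "products.json", items, format: :json
  end
end

ProductsSpider.crawl!
```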
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
@@ -1,20 +1,21 @@
-
+<div align="center">
+  <a href="https://github.com/vifreefly/kimuraframework">
+    <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
+  </a>
 
-Kimurai
+  <h1>Kimurai</h1>
+</div>
 
-Kimurai is
+Kimurai is a modern Ruby web scraping framework designed to scrape and interact with JavaScript-rendered websites using headless antidetect Chromium, Firefox, or simple HTTP requests — right out of the box:
 
 ```ruby
 # github_spider.rb
 require 'kimurai'
 
 class GithubSpider < Kimurai::Base
-  @
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
-  @
-  before_request: { delay: 3..5 }
-  }
+  @delay = 3..5
 
   def parse(response, url:, data: {})
    response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
@@ -149,8 +150,7 @@ Okay, that was easy. How about JavaScript rendered websites with dynamic HTML?
 require 'kimurai'
 
 class InfiniteScrollSpider < Kimurai::Base
-  @
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://infinite-scroll.com/demo/full-page/"]
 
   def parse(response, url:, data: {})
@@ -194,14 +194,82 @@ I, [2025-12-16 12:47:15] INFO -- infinite_scroll_spider: > Continue scrolling,
 I, [2025-12-16 12:47:17] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
 I, [2025-12-16 12:47:19] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
 I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > Pagination is done
-I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page:
+I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page:
+
+1a - Infinite Scroll full page demo;
+1b - RGB Schemes logo in Computer Arts;
+2a - RGB Schemes logo;
+2b - Masonry gets horizontalOrder;
+2c - Every vector 2016;
+3a - Logo Pizza delivered;
+3b - Some CodePens;
+3c - 365daysofmusic.com;
+3d - Holograms;
+4a - Huebee: 1-click color picker;
+4b - Word is Flickity is good;
+Flickity v2 released: groupCells, adaptiveHeight, parallax;
+New tech gets chatter; Isotope v3 released: stagger in, IE8 out;
+Packery v2 released
+
 I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
 I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spider_name: "infinite_scroll_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 12:47:05.372053 +0300, stop_time: 2025-12-16 12:47:21.505078 +0300, running_time: "16s", visits: {requests: 1, responses: 1}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
 ```
-</details
+</details>
 
+## AI-Powered Extraction
+
+What if you could just describe the data you want and let AI figure out how to extract it? With the built-in `extract` method powered by [Nukitori](https://github.com/vifreefly/nukitori), you can:
+
+```ruby
+# github_spider_ai.rb
+require 'kimurai'
+
+Kimurai.configure do |config|
+  config.default_model = "gemini-3-flash-preview" # OpenAI, Anthropic, Gemini, local LLMs, etc.
+  config.gemini_api_key = ENV["GEMINI_API_KEY"]
+end
+
+class GithubSpider < Kimurai::Base
+  @engine = :chrome
+  @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
+  @delay = 3..5
+
+  def parse(response, url:, data: {})
+    data = extract(response) do
+      string :next_page_url, description: "Next page path url"
+      array :repos do
+        object do
+          string :name
+          string :url
+          string :description
+          string :stars
+          string :language
+          array :tags, of: :string
+        end
+      end
+    end
+
+    save_to "results.json", data[:repos], format: :json
+
+    if data[:next_page_url]
+      request_to :parse, url: absolute_url(data[:next_page_url], base: url)
+    end
+  end
+end
+
+GithubSpider.crawl!
+```
+
+**How it works:**
+1. On the first page, `extract` sends the HTML to an LLM which generates XPath rules for your schema
+2. These rules are cached in a JSON file alongside your spider
+3. **All subsequent pages use the cached XPath — no more AI calls, pure fast extraction**
+4. When there's no "Next" link on the last page, the extracted value is `nil` and pagination stops
+
+Zero manual selectors. The AI figured out where everything lives, and that knowledge is reused for the entire crawl.
 
 ## Features
+* **AI-powered data extraction**: Use [Nukitori](https://github.com/vifreefly/nukitori) to extract structured data without writing XPath/CSS selectors — just describe what you want, and AI figures out how to extract it
 * Scrape JavaScript rendered websites out of the box
 * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
 * Write spider code once, and use it with any supported engine later
@@ -229,6 +297,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
   * [browser object](#browser-object)
   * [request_to method](#request_to-method)
   * [save_to helper](#save_to-helper)
+  * [AI-powered extraction with extract](#ai-powered-extraction-with-extract)
   * [Skip duplicates](#skip-duplicates)
     * [Automatically skip all duplicate request urls](#automatically-skip-all-duplicate-request-urls)
   * [Storage object](#storage-object)
@@ -262,7 +331,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
 
 
 ## Installation
-Kimurai requires Ruby version `>= 3.
+Kimurai requires Ruby version `>= 3.2.0`. Officially supported platforms: `Linux` and `macOS`.
 
 1) If your system doesn't have the appropriate Ruby version, install it:
 
@@ -312,7 +381,7 @@ gem update --system
 
 ```bash
 # Install basic tools
-sudo apt install -q -y unzip wget tar openssl
+sudo apt install -q -y unzip wget tar openssl lsof
 
 # Install xvfb (for virtual_display headless mode, in addition to native)
 sudo apt install -q -y xvfb
@@ -409,8 +478,8 @@ CLI arguments:
 Kimurai has support for the following engines and can mostly switch between them without the need to rewrite any code:
 
 * `:mechanize` – [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is. It can only parse the original HTML code of a page. Because of that, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. It's recommended to use mechanize when possible, i.e. if the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize tries to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
-* `:selenium_chrome` – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
-* `:selenium_firefox` – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.
+* `:chrome` (`:selenium_chrome` alias) – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
+* `:firefox` (`:selenium_firefox` alias) – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.
 
 **Tip:** prepend a `HEADLESS=false` environment variable on the command line (i.e. `$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.
 
@@ -423,7 +492,7 @@ require 'kimurai'
 
 class SimpleSpider < Kimurai::Base
   @name = "simple_spider"
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def parse(response, url:, data: {})
@@ -434,8 +503,8 @@ SimpleSpider.crawl!
 ```
 
 Where:
-* `@name` – a name for the spider
-* `@engine` – engine to use for the spider
+* `@name` – a name for the spider (optional)
+* `@engine` – engine to use for the spider (optional, default is `:mechanize`)
 * `@start_urls` – array of urls to process one-by-one inside the `parse` method
 * The `parse` method is the entry point, and should always be present in a spider class
 
@@ -458,7 +527,7 @@ Imagine that there is a product page that doesn't contain a category name. The c
 
 ```ruby
 class ProductsSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example-shop.com/example-product-category"]
 
   def parse(response, url:, data: {})
@@ -497,8 +566,7 @@ But, if you need to interact with a page (like filling form fields, clicking ele
 
 ```ruby
 class GoogleSpider < Kimurai::Base
-  @
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://www.google.com/"]
 
   def parse(response, url:, data: {})
@@ -529,7 +597,7 @@ For making requests to a particular method, there is `request_to`. It requires a
 
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def parse(response, url:, data: {})
@@ -565,7 +633,7 @@ The `request_to` helper method makes things simpler. We could also do something
 
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def parse(response, url:, data: {})
@@ -588,7 +656,7 @@ Sometimes all you need is to simply save scraped data to a file. You can use the
 
 ```ruby
 class ProductsSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example-shop.com/"]
 
   # ...
@@ -607,12 +675,12 @@ end
 ```
 
 Supported formats:
-* `:json` – JSON
-* `:
+* `:json` – JSON (`JSON.pretty_generate`)
+* `:compact_json` – JSON
 * `:jsonlines` – [JSON Lines](http://jsonlines.org/)
 * `:csv` – CSV
 
-Note: `save_to` requires the data (item) to save to be a
+Note: `save_to` requires the data (item) to save to be a Hash or Array of Hashes.
 
 By default, `save_to` will add a position key to an item hash. You can disable it like so: `save_to "scraped_products.json", item, format: :json, position: false`
 
@@ -622,13 +690,91 @@ While the spider is running, each new item will be appended to the output file.
 
 > If you don't want the file to be cleared before each run, pass `append: true` like so: `save_to "scraped_products.json", item, format: :json, append: true`
 
+### AI-powered extraction with `extract`
+
+Writing and maintaining XPath/CSS selectors is tedious and error-prone. The `extract` method uses AI to generate selectors automatically — you just describe the data structure you want.
+
+**Configuration:**
+
+First, configure an LLM provider in your application:
+
+```ruby
+Kimurai.configure do |config|
+  config.default_model = 'gemini-3-flash-preview'
+  config.gemini_api_key = ENV['GEMINI_API_KEY']
+
+  # Or use OpenAI
+  # config.default_model = 'gpt-5.2'
+  # config.openai_api_key = ENV['OPENAI_API_KEY']
+
+  # Or Anthropic
+  # config.default_model = 'claude-sonnet-4-5'
+  # config.anthropic_api_key = ENV['ANTHROPIC_API_KEY']
+end
+```
+
+**Usage:**
+
+```ruby
+def parse(response, url:, data: {})
+  data = extract(response) do
+    string :title
+    string :price
+    string :description
+    array :features, of: :string
+  end
+
+  save_to "products.json", data, format: :json
+end
+```
+
+**Schema DSL:**
+
+- `string :field_name` — extracts text
+- `integer :field_name` — extracts integer
+- `number :field_name` — extracts float/decimal
+- `array :items do ... end` — extracts list of objects
+- `array :tags, of: :string` — extracts list of strings
+- `object do ... end` — nested structure
+- `description: '...'` — hint for AI about what to look for
+
+**How it works:**
+
+1. On first run, `extract` sends the HTML and your schema to an LLM
+2. The LLM returns XPath rules for each field
+3. These rules are cached in `SpiderName.json` alongside your spider file
+4. All subsequent extractions use cached XPath — fast and free, no more AI calls
+5. Each method gets its own prefix in the schema file, so different parse methods can have different schemas
+
+**Automatic pagination:**
+
+Include a next page field in your schema:
+
+```ruby
+data = extract(response) do
+  string :next_page_url, description: 'Next page link'
+  array :products do
+    object do
+      string :name
+      string :price
+    end
+  end
+end
+
+if data[:next_page_url]
+  request_to :parse, url: absolute_url(data[:next_page_url], base: url)
end
+```
+
+When the last page has no "Next" link, the extracted value is `nil` and pagination stops naturally.
+
 ### Skip duplicates
 
 It's pretty common for websites to have duplicate pages. For example, when an e-commerce site has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:
 
 ```ruby
 class ProductsSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example-shop.com/"]
 
   def parse(response, url:, data: {})
@@ -842,8 +988,7 @@ The `run_info` method is available from the `open_spider` and `close_spider` cla
 
 ```ruby
 class ExampleSpider < Kimurai::Base
-  @
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def self.close_spider
@@ -895,7 +1040,7 @@ You can also use the additional methods `completed?` or `failed?`
 
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def self.close_spider
@@ -933,7 +1078,7 @@ Kimurai supports environments. The default is `development`. To provide a custom
 Usage example:
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def self.close_spider
@@ -956,7 +1101,6 @@ Kimurai can process web pages concurrently: `in_parallel(:parse_product, urls, t
 require 'kimurai'
 
 class AmazonSpider < Kimurai::Base
-  @name = "amazon_spider"
   @engine = :mechanize
   @start_urls = ["https://www.amazon.com/"]
 
@@ -1068,7 +1212,7 @@ vic@Vics-MacBook-Air single %
 
 * `data:` – pass custom data like so: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
 * `delay:` – set delay between requests like so: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, the delay (in seconds) will be set randomly for each request: `rand (2..5) # => 3`
-* `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :
+* `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :chrome)`
 * `config:` – set custom [config](#spider-config) options
 
 ### Active Support included
@@ -1170,7 +1314,7 @@ Kimurai.configure do |config|
 
   # Custom time zone (for logs):
   # config.time_zone = "UTC"
-  # config.time_zone = "Europe/
+  # config.time_zone = "Europe/Berlin"
 
   # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
   # config.selenium_chrome_path = "/usr/bin/chromium-browser"
@@ -1286,7 +1430,7 @@ class Spider < Kimurai::Base
   USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
   PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]
 
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
   @config = {
     headers: { "custom_header" => "custom_value" },
@@ -1649,7 +1793,7 @@ end
 spiders/application_spider.rb
 ```ruby
 class ApplicationSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
 
   # Define pipelines (by order) for all spiders:
   @pipelines = [:validator, :saver]
@@ -1726,7 +1870,7 @@ spiders/github_spider.rb
 ```ruby
 class GithubSpider < Kimurai::Base
   @name = "github_spider"
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
   @config = {
     before_request: { delay: 3..5 }
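
As a reference for the "Supported formats" change above, a sketch of what the two JSON formats write to disk (the item fields are made up):

```ruby
require 'json'

item = { name: "Phone", price: 42 }

# format: :json — pretty-printed via JSON.pretty_generate:
puts JSON.pretty_generate([item])
# [
#   {
#     "name": "Phone",
#     "price": 42
#   }
# ]

# format: :compact_json — single line via JSON.generate:
puts JSON.generate([item])
# [{"name":"Phone","price":42}]
```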
data/Rakefile
CHANGED
@@ -1,10 +1,6 @@
 require 'bundler/gem_tasks'
-require '
+require 'rspec/core/rake_task'
 
-
-  t.libs << 'test'
-  t.libs << 'lib'
-  t.test_files = FileList['test/**/*_test.rb']
-end
+RSpec::Core::RakeTask.new(:spec)
 
-task default: :
+task default: :spec
data/kimurai.gemspec
CHANGED
@@ -20,7 +20,7 @@ Gem::Specification.new do |spec|
   spec.bindir = 'exe'
   spec.executables = 'kimurai'
   spec.require_paths = ['lib']
-  spec.required_ruby_version = '>= 3.
+  spec.required_ruby_version = '>= 3.2.0'
 
   spec.add_dependency 'activesupport'
   spec.add_dependency 'cliver'
@@ -46,4 +46,8 @@ Gem::Specification.new do |spec|
 
   spec.add_dependency 'pry'
   spec.add_dependency 'rbcat', '~> 1.0'
+  spec.add_dependency 'nukitori'
+
+  spec.add_development_dependency 'rake', '~> 13.0'
+  spec.add_development_dependency 'rspec', '~> 3.13'
 end
data/lib/kimurai/base/saver.rb
CHANGED
@@ -7,10 +7,11 @@ module Kimurai
     attr_reader :format, :path, :position, :append
 
     def initialize(path, format:, position: true, append: false)
-      raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json jsonlines csv].include?(format)
+      raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json compact_json jsonlines csv].include?(format)
 
       @path = path
       @format = format
+      @format = :json if format == :pretty_json # :pretty_json is now an alias for :json
       @position = position
       @index = 0
       @append = append
@@ -19,44 +20,57 @@ module Kimurai
 
     def save(item)
       @mutex.synchronize do
-
-
-
-
-
-
-
-
-
-
-
-
+        if item.is_a?(Array)
+          item.each do |it|
+            @index += 1
+            it[:position] = @index if position
+
+            save_item(it)
+          end
+        else
+          @index += 1
+          item[:position] = @index if position
+
+          save_item(item)
         end
       end
     end
 
     private
 
+    def save_item(item)
+      case format
+      when :json
+        save_to_json(item)
+      when :compact_json
+        save_to_compact_json(item)
+      when :jsonlines
+        save_to_jsonlines(item)
+      when :csv
+        save_to_csv(item)
+      end
+    end
+
     def save_to_json(item)
-      data = JSON.
+      data = JSON.pretty_generate([item])
 
      if @index > 1 || append && File.exist?(path)
-        file_content = File.read(path).sub(/\}\]\Z/, "\}
+        file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
        File.open(path, 'w') do |f|
-          f.write(file_content + data.sub(/\A\[/, ''))
+          f.write(file_content + data.sub(/\A\[\n/, ''))
        end
      else
        File.open(path, 'w') { |f| f.write(data) }
      end
    end

-    def
-      data = JSON.
+    def save_to_compact_json(item)
+      data = JSON.generate([item])

      if @index > 1 || append && File.exist?(path)
-        file_content = File.read(path).sub(/\}\
+        file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
        File.open(path, 'w') do |f|
-          f.write(file_content + data.sub(/\A\[
+          f.write(file_content + data.sub(/\A\[/, ''))
        end
      else
        File.open(path, 'w') { |f| f.write(data) }
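
The regex changes above follow from the switch to pretty output: a pretty-printed array ends in `}\n]` rather than `}]`, so appending a new item means snipping the closing bracket, adding a comma, and gluing on the next element with its leading `[` removed. A standalone sketch of that stitching for the pretty `:json` case (the file path is illustrative):

```ruby
require 'json'

path = "items.json"
File.write(path, JSON.pretty_generate([{ id: 1 }]))

# Same transformation the saver applies on each subsequent save:
next_data = JSON.pretty_generate([{ id: 2 }])
stitched  = File.read(path).sub(/\}\n\]\Z/, "},\n") + # drop closing "]", add comma
            next_data.sub(/\A\[\n/, '')               # append next element sans "["
File.write(path, stitched)

puts JSON.parse(File.read(path)).size # => 2, still one valid pretty-printed JSON array
```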
data/lib/kimurai/base.rb
CHANGED
@@ -69,7 +69,7 @@ module Kimurai
     @config = {}
 
     def self.name
-      @name
+      @name || to_s.underscore
     end
 
     def self.engine
@@ -84,11 +84,22 @@ module Kimurai
       @start_urls
     end
 
+    def self.delay
+      @delay ||= superclass.respond_to?(:delay) ? superclass.delay : nil
+    end
+
     def self.config
-      if superclass.equal?(::Object)
-
+      base_config = if superclass.equal?(::Object)
+        @config
+      else
+        superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+      end
+
+      # Merge @delay shortcut into config if set
+      if delay
+        base_config.deep_merge_excl({ before_request: { delay: delay } }, DMERGE_EXCLUDE)
       else
-
+        base_config
       end
     end
 
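
A quick sketch of what the two `Base` additions above do in practice (the class and values are illustrative): the spider name falls back to the underscored class name, and a top-level `@delay` is folded into `config[:before_request][:delay]`:

```ruby
class BooksToScrapeSpider < Kimurai::Base
  @engine = :mechanize
  @delay = 2..5
end

BooksToScrapeSpider.name                    # => "books_to_scrape_spider" (no @name set)
BooksToScrapeSpider.config[:before_request] # => { delay: 2..5 }
```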
data/lib/kimurai/base_helper.rb
CHANGED
@@ -1,5 +1,15 @@
 module Kimurai
   module BaseHelper
+    def extract(response, model: nil, &block)
+      caller_info = caller_locations(1, 1).first
+      method_name = caller_info.base_label
+      spider_dir = File.dirname(caller_info.path)
+      schema_path = File.join(spider_dir, "#{self.class.name}.json")
+
+      data = Nukitori(response, schema_path, prefix: method_name, model:, &block)
+      data.deep_symbolize_keys
+    end
+
     private
 
     def absolute_url(url, base:)
data/lib/kimurai/browser_builder.rb
CHANGED
@@ -1,6 +1,13 @@
 module Kimurai
   module BrowserBuilder
+    ENGINE_ALIASES = {
+      chrome: :selenium_chrome,
+      firefox: :selenium_firefox
+    }.freeze
+
     def self.build(engine, config = {}, spider:)
+      engine = ENGINE_ALIASES.fetch(engine, engine)
+
       begin
         require "kimurai/browser_builder/#{engine}_builder"
       rescue LoadError
data/lib/kimurai/capybara_ext/session.rb
CHANGED
@@ -10,7 +10,6 @@ module Capybara
     alias original_visit visit
     def visit(visit_uri, delay: config.before_request[:delay], skip_request_options: false, max_retries: 3)
       if spider
-        process_delay(delay) if delay
         retries = 0
         sleep_interval = 0
 
@@ -20,6 +19,9 @@ module Capybara
         spider.class.update(:visits, :requests) if spider.with_info
 
         original_visit(visit_uri)
+
+        logger.info "Browser: finished get request to: #{visit_uri}"
+        process_delay(delay) if delay
       rescue StandardError => e
         if match_error?(e, type: :to_skip)
           logger.error "Browser: skip request error: #{e.inspect}, url: #{visit_uri}"
@@ -40,7 +42,7 @@ module Capybara
           raise e
         end
       else
-        driver.responses += 1
+        driver.responses += 1
         spider.class.update(:visits, :responses) if spider.with_info
         driver.visited = true unless driver.visited
         true
@@ -170,7 +172,7 @@ module Capybara
 
     def process_delay(delay)
       interval = (delay.instance_of?(Range) ? rand(delay) : delay)
-      logger.debug "Browser:
+      logger.debug "Browser: delay #{interval.round(2)} #{'second'.pluralize(interval)}..."
       sleep interval
     end
 
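
Putting the three library changes above together (the spider path is illustrative): `extract` keys its cached selectors by spider file and calling method, the engine alias is resolved once in `BrowserBuilder.build`, and the per-request delay now runs after a successful visit rather than before it:

```ruby
# spiders/github_spider.rb — illustrative location
class GithubSpider < Kimurai::Base
  @engine = :chrome # resolved to :selenium_chrome via ENGINE_ALIASES at build time

  def parse(response, url:, data: {})
    # Called from #parse, so the generated XPath rules are cached under the
    # "parse" prefix in spiders/github_spider.json (self.class.name => "github_spider")
    extract(response) do
      string :title
    end
    # => { title: "..." } with symbolized keys
  end
end
```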
data/lib/kimurai/version.rb
CHANGED
data/lib/kimurai.rb
CHANGED
@@ -6,6 +6,7 @@ require 'uri'
 require 'active_support'
 require 'active_support/core_ext'
 require 'rbcat'
+require 'nukitori'
 
 require_relative 'kimurai/version'
 
@@ -20,6 +21,33 @@ require_relative 'kimurai/pipeline'
 require_relative 'kimurai/base'
 
 module Kimurai
+  # Settings that will be forwarded to Nukitori configuration
+  NUKITORI_SETTINGS = %i[
+    openai_api_key
+    anthropic_api_key
+    gemini_api_key
+    vertexai_project_id
+    vertexai_location
+    deepseek_api_key
+    mistral_api_key
+    perplexity_api_key
+    openrouter_api_key
+    gpustack_api_key
+    openai_api_base
+    gemini_api_base
+    ollama_api_base
+    gpustack_api_base
+    openai_organization_id
+    openai_project_id
+    openai_use_system_role
+    bedrock_api_key
+    bedrock_secret_key
+    bedrock_region
+    bedrock_session_token
+    default_model
+    model_registry_file
+  ].freeze
+
   class << self
     def configuration
       @configuration ||= OpenStruct.new
@@ -27,6 +55,22 @@
 
     def configure
       yield(configuration)
+      apply_nukitori_configuration
+    end
+
+    def apply_nukitori_configuration
+      nukitori_settings = NUKITORI_SETTINGS.filter_map do |setting|
+        value = configuration[setting]
+        [setting, value] if value
+      end.to_h
+
+      return if nukitori_settings.empty?
+
+      Nukitori.configure do |config|
+        nukitori_settings.each do |setting, value|
+          config.public_send("#{setting}=", value)
+        end
+      end
     end
 
     def env
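
The effect of the forwarding above: every `NUKITORI_SETTINGS` entry assigned in `Kimurai.configure` is copied onto Nukitori's configuration; unset (nil) entries are skipped, and nothing is forwarded if none are set. A usage sketch:

```ruby
Kimurai.configure do |config|
  config.default_model  = "gemini-3-flash-preview"
  config.gemini_api_key = ENV["GEMINI_API_KEY"]
  config.time_zone      = "UTC" # not in NUKITORI_SETTINGS, stays local to Kimurai
end

# Internally equivalent to:
# Nukitori.configure do |c|
#   c.default_model  = "gemini-3-flash-preview"
#   c.gemini_api_key = ENV["GEMINI_API_KEY"]
# end
```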
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: kimurai
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 2.0
|
|
4
|
+
version: 2.1.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Victor Afanasev
|
|
@@ -261,6 +261,48 @@ dependencies:
|
|
|
261
261
|
- - "~>"
|
|
262
262
|
- !ruby/object:Gem::Version
|
|
263
263
|
version: '1.0'
|
|
264
|
+
- !ruby/object:Gem::Dependency
|
|
265
|
+
name: nukitori
|
|
266
|
+
requirement: !ruby/object:Gem::Requirement
|
|
267
|
+
requirements:
|
|
268
|
+
- - ">="
|
|
269
|
+
- !ruby/object:Gem::Version
|
|
270
|
+
version: '0'
|
|
271
|
+
type: :runtime
|
|
272
|
+
prerelease: false
|
|
273
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
274
|
+
requirements:
|
|
275
|
+
- - ">="
|
|
276
|
+
- !ruby/object:Gem::Version
|
|
277
|
+
version: '0'
|
|
278
|
+
- !ruby/object:Gem::Dependency
|
|
279
|
+
name: rake
|
|
280
|
+
requirement: !ruby/object:Gem::Requirement
|
|
281
|
+
requirements:
|
|
282
|
+
- - "~>"
|
|
283
|
+
- !ruby/object:Gem::Version
|
|
284
|
+
version: '13.0'
|
|
285
|
+
type: :development
|
|
286
|
+
prerelease: false
|
|
287
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
288
|
+
requirements:
|
|
289
|
+
- - "~>"
|
|
290
|
+
- !ruby/object:Gem::Version
|
|
291
|
+
version: '13.0'
|
|
292
|
+
- !ruby/object:Gem::Dependency
|
|
293
|
+
name: rspec
|
|
294
|
+
requirement: !ruby/object:Gem::Requirement
|
|
295
|
+
requirements:
|
|
296
|
+
- - "~>"
|
|
297
|
+
- !ruby/object:Gem::Version
|
|
298
|
+
version: '3.13'
|
|
299
|
+
type: :development
|
|
300
|
+
prerelease: false
|
|
301
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
302
|
+
requirements:
|
|
303
|
+
- - "~>"
|
|
304
|
+
- !ruby/object:Gem::Version
|
|
305
|
+
version: '3.13'
|
|
264
306
|
email:
|
|
265
307
|
- vicfreefly@gmail.com
|
|
266
308
|
executables:
|
|
@@ -269,6 +311,7 @@ extensions: []
|
|
|
269
311
|
extra_rdoc_files: []
|
|
270
312
|
files:
|
|
271
313
|
- ".gitignore"
|
|
314
|
+
- ".rspec"
|
|
272
315
|
- ".rubocop.yml"
|
|
273
316
|
- CHANGELOG.md
|
|
274
317
|
- Gemfile
|
|
@@ -329,7 +372,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
|
329
372
|
requirements:
|
|
330
373
|
- - ">="
|
|
331
374
|
- !ruby/object:Gem::Version
|
|
332
|
-
version: 3.
|
|
375
|
+
version: 3.2.0
|
|
333
376
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
|
334
377
|
requirements:
|
|
335
378
|
- - ">="
|