kimurai 2.0.1 → 2.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 326378ff2c70df034e5e13ce8006f0c6efbbd53f228e55c241be8d50ed3ee5e7
- data.tar.gz: de727e434b146f8671d3cd524b9b617f147ba1d2e96d4346a07511dc6dc59a88
+ metadata.gz: eeeca8fc2ae390e6c557f435478ee4ea8273920e3ab7c590800c338574f364d0
+ data.tar.gz: d7d2d799a97c51c0e1837080c249316651beedb58b312df3df9bc69fefabac31
  SHA512:
- metadata.gz: 3e1776777a7e65328d0ad42a931edf867c4240a5e173083246157a1376880e84608a2fba4564f61752190defbe080af1271695a5329e3f83e4a635dfb43ecae8
- data.tar.gz: a011359fc944037d0305f5041c018d56a0b47d7c30d0e199bd7f5d62570a741ba6810dbdbad4edd2f702b72823712718a7b04495711772340900a2ca49bac4a1
+ metadata.gz: 80f449d68068d238da99dbbb83b710e8071e7fd4ded76bc17f096fac1446a307e841d87300c461299c6c01fe9abaecbcc5c91d9a2fa9878d5af5d1cac888349a
+ data.tar.gz: 793f3301353e135484ad0283973cd4c3c181fa43a9510339f06a61152a557f758f1d9e32a2fec12954ce5b9e3a409c554efea13485f7d66c67ba8654e4f6baeb
data/.gitignore CHANGED
@@ -1,3 +1,4 @@
+ /.claude/
  /.bundle/
  /.yardoc
  /_yardoc/
data/.rspec ADDED
@@ -0,0 +1,3 @@
+ --require spec_helper
+ --color
+ --format documentation
data/.rubocop.yml CHANGED
@@ -6,4 +6,6 @@ Style/TrivialAccessors:
  Style/RescueModifier:
    Enabled: false
  Style/FrozenStringLiteralComment:
+   Enabled: false
+ Style/Documentation:
    Enabled: false
data/CHANGELOG.md CHANGED
@@ -1,4 +1,23 @@
  # CHANGELOG
+
+ ## 2.2.0
+ ### New
+ * The default engine is now `:chrome` (was `:mechanize`)
+
+ ## 2.1.0
+ ### New
+ * Minimum required Ruby version is now 3.2.0
+ * **AI-powered data extraction with `extract` method** — Powered by [Nukitori](https://github.com/vifreefly/nukitori). Describe the data structure you want and let AI generate XPath selectors automatically. Selectors are cached for reuse, so AI is only called once per page type
+ * **Configure Nukitori via Kimurai** — Set LLM provider settings (OpenAI, Anthropic, Gemini, etc.) directly in the `Kimurai.configure` block
+ * **Engine aliases** — Use shorter engine names: `:chrome` (alias for `:selenium_chrome`), `:firefox` (alias for `:selenium_firefox`)
+ * **Top-level `@delay` option** — Set the request delay directly as `@delay = 2..5` instead of the nested `@config = { before_request: { delay: 2..5 } }`
+ * **Auto spider name** — If `@name` is not provided, it's automatically derived from the class name
+ * **Save array of items** — The `save_to` helper now accepts an array of items to save at once
+
+ ### Improvements
+ * The `save_to` helper now uses pretty JSON by default for the `:json` format (use `format: :compact_json` for compact output)
+ * The request delay is now applied after the request completes, before the response is passed to the callback
+
  ## 2.0.1
  ### Fixes
  * Remove xpath as default Capybara selector type (fixes https://github.com/vifreefly/kimuraframework/issues/28)
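Taken together, the 2.1.0 and 2.2.0 entries above change how a minimal spider reads. A short illustrative sketch (the spider class and URL are hypothetical, not taken from the gem):

```ruby
require 'kimurai'

class ProductsSpider < Kimurai::Base
  # No @name needed: it is derived from the class name ("products_spider").
  # No @engine needed either: :chrome (:selenium_chrome) is now the default.
  @delay = 2..5   # top-level delay instead of @config = { before_request: { delay: 2..5 } }
  @start_urls = ['https://example.com/products']

  def parse(response, url:, data: {})
    items = response.xpath("//div[@class='product']").map do |product|
      { title: product.at_xpath(".//h2")&.text, price: product.at_xpath(".//span[@class='price']")&.text }
    end

    # save_to now accepts an array of items; :json output is pretty-printed by default
    save_to 'products.json', items, format: :json
  end
end

ProductsSpider.crawl!
```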
data/LICENSE.txt CHANGED
@@ -1,6 +1,6 @@
  The MIT License (MIT)

- Copyright (c) 2018 Victor Afanasev
+ Copyright (c) 2026 Victor Afanasev

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
data/README.md CHANGED
@@ -1,20 +1,75 @@
- # Kimurai
+ <div align="center">
+   <a href="https://github.com/vifreefly/kimuraframework">
+     <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
+   </a>

- Kimurai is a modern web scraping framework written in Ruby which **works out of the box with Headless Chromium/Firefox** or simple HTTP requests and **allows you to scrape and interact with JavaScript rendered websites.**
+   <h1>Kimurai: AI-First Web Scraping Framework for Ruby</h1>
+ </div>

- Kimurai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's try an example:
+ Write web scrapers in Ruby using a clean, AI-assisted DSL. Kimurai uses AI to figure out where the data lives, then caches the selectors and scrapes with pure Ruby. Get the intelligence of an LLM without the per-request latency or token costs:
+
+ ```ruby
+ # google_spider.rb
+ require 'kimurai'
+
+ class GoogleSpider < Kimurai::Base
+   @start_urls = ['https://www.google.com/search?q=web+scraping+ai']
+   @delay = 1
+
+   def parse(response, url:, data: {})
+     results = extract(response) do
+       array :organic_results do
+         object do
+           string :title
+           string :snippet
+           string :url
+         end
+       end
+
+       array :sponsored_results do
+         object do
+           string :title
+           string :snippet
+           string :url
+         end
+       end
+
+       array :people_also_search_for, of: :string
+
+       string :next_page_link
+       number :current_page_number
+     end
+
+     save_to 'google_results.json', results, format: :json
+
+     if results[:next_page_link] && results[:current_page_number] < 3
+       request_to :parse, url: absolute_url(results[:next_page_link], base: url)
+     end
+   end
+ end
+
+ GoogleSpider.crawl!
+ ```
+
+ **How it works:**
+ 1. On the first request, `extract` sends the HTML + your schema to an LLM
+ 2. The LLM generates XPath selectors and caches them in `google_spider.json`
+ 3. **All subsequent requests use cached XPath — zero AI calls, pure fast Ruby extraction**
+ 4. Supports OpenAI, Anthropic, Gemini, or local LLMs via [Nukitori](https://github.com/vifreefly/nukitori)
+
+
+ ## Traditional Mode
+
+ Prefer writing your own selectors? Kimurai works great as a traditional scraper too — with headless antidetect Chromium, Firefox, or simple HTTP requests:

  ```ruby
  # github_spider.rb
  require 'kimurai'

  class GithubSpider < Kimurai::Base
-   @name = "github_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
-   @config = {
-     before_request: { delay: 3..5 }
-   }
+   @delay = 3..5

    def parse(response, url:, data: {})
      response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
@@ -149,8 +204,7 @@ Okay, that was easy. How about JavaScript rendered websites with dynamic HTML? L
  require 'kimurai'

  class InfiniteScrollSpider < Kimurai::Base
-   @name = "infinite_scroll_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://infinite-scroll.com/demo/full-page/"]

    def parse(response, url:, data: {})
@@ -194,14 +248,76 @@ I, [2025-12-16 12:47:15] INFO -- infinite_scroll_spider: > Continue scrolling,
  I, [2025-12-16 12:47:17] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
  I, [2025-12-16 12:47:19] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
  I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > Pagination is done
- I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page:
+
+ 1a - Infinite Scroll full page demo;
+ 1b - RGB Schemes logo in Computer Arts;
+ 2a - RGB Schemes logo;
+ 2b - Masonry gets horizontalOrder;
+ 2c - Every vector 2016;
+ 3a - Logo Pizza delivered;
+ 3b - Some CodePens;
+ 3c - 365daysofmusic.com;
+ 3d - Holograms;
+ 4a - Huebee: 1-click color picker;
+ 4b - Word is Flickity is good;
+ Flickity v2 released: groupCells, adaptiveHeight, parallax;
+ New tech gets chatter; Isotope v3 released: stagger in, IE8 out;
+ Packery v2 released
+
  I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
  I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spider_name: "infinite_scroll_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 12:47:05.372053 +0300, stop_time: 2025-12-16 12:47:21.505078 +0300, running_time: "16s", visits: {requests: 1, responses: 1}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
  ```
- </details><br>
+ </details>
+
+ ## AI Extraction — Configuration
+
+ Configure your LLM provider to start using AI extraction. The `extract` method is powered by [Nukitori](https://github.com/vifreefly/nukitori):
+
+ ```ruby
+ # github_spider_ai.rb
+ require 'kimurai'
+
+ Kimurai.configure do |config|
+   config.default_model = "gemini-3-flash-preview" # OpenAI, Anthropic, Gemini, local LLMs, etc.
+   config.gemini_api_key = ENV["GEMINI_API_KEY"]
+ end
+
+ class GithubSpider < Kimurai::Base
+   @engine = :chrome
+   @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
+   @delay = 3..5

+   def parse(response, url:, data: {})
+     data = extract(response) do
+       string :next_page_url, description: "Next page path url"
+       array :repos do
+         object do
+           string :name
+           string :url
+           string :description
+           string :stars
+           string :language
+           array :tags, of: :string
+         end
+       end
+     end
+
+     save_to "results.json", data[:repos], format: :json
+
+     if data[:next_page_url]
+       request_to :parse, url: absolute_url(data[:next_page_url], base: url)
+     end
+   end
+ end
+
+ GithubSpider.crawl!
+ ```
+
+ Selectors are cached in `github_spider_ai.json` after the first AI call — all subsequent requests use pure Ruby extraction.

  ## Features
+ * **AI-powered data extraction**: Use [Nukitori](https://github.com/vifreefly/nukitori) to extract structured data without writing XPath/CSS selectors — just describe what you want, and AI figures out how to extract it
  * Scrape JavaScript rendered websites out of the box
  * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
  * Write spider code once, and use it with any supported engine later
@@ -218,6 +334,8 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid

  ## Table of Contents
  * [Kimurai](#kimurai)
+ * [Traditional Mode](#traditional-mode)
+ * [AI Extraction — Configuration](#ai-extraction--configuration)
  * [Features](#features)
  * [Table of Contents](#table-of-contents)
  * [Installation](#installation)
@@ -229,6 +347,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
  * [browser object](#browser-object)
  * [request_to method](#request_to-method)
  * [save_to helper](#save_to-helper)
+ * [AI-powered extraction with extract](#ai-powered-extraction-with-extract)
  * [Skip duplicates](#skip-duplicates)
  * [Automatically skip all duplicate request urls](#automatically-skip-all-duplicate-request-urls)
  * [Storage object](#storage-object)
@@ -262,7 +381,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid


  ## Installation
- Kimurai requires Ruby version `>= 3.1.0`. Officially supported platforms: `Linux` and `macOS`.
+ Kimurai requires Ruby version `>= 3.2.0`. Officially supported platforms: `Linux` and `macOS`.

  1) If your system doesn't have the appropriate Ruby version, install it:

@@ -312,7 +431,7 @@ gem update --system

  ```bash
  # Install basic tools
- sudo apt install -q -y unzip wget tar openssl
+ sudo apt install -q -y unzip wget tar openssl lsof

  # Install xvfb (for virtual_display headless mode, in addition to native)
  sudo apt install -q -y xvfb
@@ -409,8 +528,8 @@ CLI arguments:
  Kimurai has support for the following engines and can mostly switch between them without the need to rewrite any code:

  * `:mechanize` – [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is; it can only parse the original HTML code of a page. Because of this, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. It's recommended to use mechanize when possible, i.e. if the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize is trying to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
- * `:selenium_chrome` – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
- * `:selenium_firefox` – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.
+ * `:chrome` (`:selenium_chrome` alias) – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
+ * `:firefox` (`:selenium_firefox` alias) – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.

  **Tip:** prepend a `HEADLESS=false` environment variable on the command line (i.e. `$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.

@@ -423,7 +542,7 @@ require 'kimurai'

  class SimpleSpider < Kimurai::Base
    @name = "simple_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def parse(response, url:, data: {})
@@ -434,8 +553,8 @@ SimpleSpider.crawl!
  ```

  Where:
- * `@name` – a name for the spider
- * `@engine` – engine to use for the spider
+ * `@name` – a name for the spider (optional)
+ * `@engine` – engine to use for the spider (optional, default is `:selenium_chrome`)
  * `@start_urls` – array of urls to process one-by-one inside the `parse` method
  * The `parse` method is the entry point, and should always be present in a spider class

@@ -458,7 +577,7 @@ Imagine that there is a product page that doesn't contain a category name. The c

  ```ruby
  class ProductsSpider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example-shop.com/example-product-category"]

    def parse(response, url:, data: {})
@@ -497,8 +616,7 @@ But, if you need to interact with a page (like filling form fields, clicking ele

  ```ruby
  class GoogleSpider < Kimurai::Base
-   @name = "google_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://www.google.com/"]

    def parse(response, url:, data: {})
@@ -529,7 +647,7 @@ For making requests to a particular method, there is `request_to`. It requires a

  ```ruby
  class Spider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def parse(response, url:, data: {})
@@ -565,7 +683,7 @@ The `request_to` helper method makes things simpler. We could also do something

  ```ruby
  class Spider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def parse(response, url:, data: {})
@@ -588,7 +706,7 @@ Sometimes all you need is to simply save scraped data to a file. You can use the

  ```ruby
  class ProductsSpider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example-shop.com/"]

    # ...
@@ -607,12 +725,12 @@ end
  ```

  Supported formats:
- * `:json` – JSON
- * `:pretty_json` – "pretty" JSON (`JSON.pretty_generate`)
+ * `:json` – pretty-printed JSON (`JSON.pretty_generate`)
+ * `:compact_json` – compact JSON (`JSON.generate`)
  * `:jsonlines` – [JSON Lines](http://jsonlines.org/)
  * `:csv` – CSV

- Note: `save_to` requires the data (item) to save to be a `Hash`.
+ Note: `save_to` requires the data (item) to save to be a `Hash` or an `Array` of Hashes.

  By default, `save_to` will add a position key to an item hash. You can disable it like so: `save_to "scraped_products.json", item, format: :json, position: false`

@@ -622,13 +740,91 @@ While the spider is running, each new item will be appended to the output file.

  > If you don't want the file to be cleared before each run, pass `append: true` like so: `save_to "scraped_products.json", item, format: :json, append: true`

+ ### AI-powered extraction with `extract`
+
+ Writing and maintaining XPath/CSS selectors is tedious and error-prone. The `extract` method uses AI to generate selectors automatically — you just describe the data structure you want.
+
+ **Configuration:**
+
+ First, configure an LLM provider in your application:
+
+ ```ruby
+ Kimurai.configure do |config|
+   config.default_model = 'gemini-3-flash-preview'
+   config.gemini_api_key = ENV['GEMINI_API_KEY']
+
+   # Or use OpenAI
+   # config.default_model = 'gpt-5.2'
+   # config.openai_api_key = ENV['OPENAI_API_KEY']
+
+   # Or Anthropic
+   # config.default_model = 'claude-sonnet-4-5'
+   # config.anthropic_api_key = ENV['ANTHROPIC_API_KEY']
+ end
+ ```
+
+ **Usage:**
+
+ ```ruby
+ def parse(response, url:, data: {})
+   data = extract(response) do
+     string :title
+     string :price
+     string :description
+     array :features, of: :string
+   end
+
+   save_to "products.json", data, format: :json
+ end
+ ```
+
+ **Schema DSL:**
+
+ - `string :field_name` — extracts text
+ - `integer :field_name` — extracts integer
+ - `number :field_name` — extracts float/decimal
+ - `array :items do ... end` — extracts list of objects
+ - `array :tags, of: :string` — extracts list of strings
+ - `object do ... end` — nested structure
+ - `description: '...'` — hint for AI about what to look for
+
+ **How it works:**
+
+ 1. On first run, `extract` sends the HTML and your schema to an LLM
+ 2. The LLM returns XPath rules for each field
+ 3. These rules are cached in `SpiderName.json` alongside your spider file
+ 4. All subsequent extractions use cached XPath — fast and free, no more AI calls
+ 5. Each method gets its own prefix in the schema file, so different parse methods can have different schemas
+
+ **Automatic pagination:**
+
+ Include a next page field in your schema:
+
+ ```ruby
+ data = extract(response) do
+   string :next_page_url, description: 'Next page link'
+   array :products do
+     object do
+       string :name
+       string :price
+     end
+   end
+ end
+
+ if data[:next_page_url]
+   request_to :parse, url: absolute_url(data[:next_page_url], base: url)
+ end
+ ```
+
+ When the last page has no "Next" link, the extracted value is `nil` and pagination stops naturally.
+
  ### Skip duplicates

  It's pretty common for websites to have duplicate pages. For example, when an e-commerce site has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:

  ```ruby
  class ProductsSpider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example-shop.com/"]

    def parse(response, url:, data: {})
@@ -842,8 +1038,7 @@ The `run_info` method is available from the `open_spider` and `close_spider` cla

  ```ruby
  class ExampleSpider < Kimurai::Base
-   @name = "example_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def self.close_spider
@@ -895,7 +1090,7 @@ You can also use the additional methods `completed?` or `failed?`

  ```ruby
  class Spider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def self.close_spider
@@ -933,7 +1128,7 @@ Kimurai supports environments. The default is `development`. To provide a custom
  Usage example:
  ```ruby
  class Spider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def self.close_spider
@@ -956,7 +1151,6 @@ Kimurai can process web pages concurrently: `in_parallel(:parse_product, urls, t
  require 'kimurai'

  class AmazonSpider < Kimurai::Base
-   @name = "amazon_spider"
    @engine = :mechanize
    @start_urls = ["https://www.amazon.com/"]

@@ -1068,7 +1262,7 @@ vic@Vics-MacBook-Air single %

  * `data:` – pass custom data like so: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
  * `delay:` – set delay between requests like so: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, the delay (in seconds) will be set randomly for each request: `rand (2..5) # => 3`
- * `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :selenium_chrome)`
+ * `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :chrome)`
  * `config:` – set custom [config](#spider-config) options

  ### Active Support included
@@ -1170,7 +1364,7 @@ Kimurai.configure do |config|

  # Custom time zone (for logs):
  # config.time_zone = "UTC"
- # config.time_zone = "Europe/Moscow"
+ # config.time_zone = "Europe/Berlin"

  # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
  # config.selenium_chrome_path = "/usr/bin/chromium-browser"
@@ -1286,7 +1480,7 @@ class Spider < Kimurai::Base
    USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
    PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]

-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]
    @config = {
      headers: { "custom_header" => "custom_value" },
@@ -1328,7 +1522,7 @@ end
      # Custom User Agent – string or lambda
      #
      # Use lambda if you want to rotate user agents before each run:
-     # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
+     # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
      #
      # Works for all engines
      user_agent: "Mozilla/5.0 Firefox/61.0",
@@ -1340,10 +1534,10 @@ end
      cookies: [],

      # Proxy – string or lambda. Format for a proxy string: "ip:port:protocol:user:password"
-     # `protocol` can be http or socks5. User and password are optional.
+     # `protocol` can be http or socks5. User and password are optional.
      #
      # Use lambda if you want to rotate proxies before each run:
-     # proxy: -> { ARRAY_OF_PROXIES.sample }
+     # proxy: -> { ARRAY_OF_PROXIES.sample }
      #
      # Works for all engines, but keep in mind that Selenium drivers don't support proxies
      # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http).
@@ -1387,10 +1581,10 @@ end
      # and if the url already exists in this scope, the request will be skipped.
      #
      # You can configure this setting by providing additional options as hash:
-     # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
-     # `scope:` – use a custom scope other than `:requests_urls`
-     # `check_only:` – if true, the url will not be added to the scope
-     #
+     # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
+     # `scope:` – use a custom scope other than `:requests_urls`
+     # `check_only:` – if true, the url will not be added to the scope
+     #
      # Works for all drivers
      skip_duplicate_requests: true,

@@ -1421,8 +1615,8 @@ end
      # Handle page encoding while parsing html response using Nokogiri
      #
      # There are two ways to use this option:
-     # encoding: :auto # auto-detect from <meta http-equiv="Content-Type"> or <meta charset> tags
-     # encoding: "GB2312" # set encoding manually
+     # encoding: :auto # auto-detect from <meta http-equiv="Content-Type"> or <meta charset> tags
+     # encoding: "GB2312" # set encoding manually
      #
      # This option is not set by default
      encoding: nil,
@@ -1649,7 +1843,7 @@ end
  spiders/application_spider.rb
  ```ruby
  class ApplicationSpider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome

    # Define pipelines (by order) for all spiders:
    @pipelines = [:validator, :saver]
@@ -1726,7 +1920,7 @@ spiders/github_spider.rb
  ```ruby
  class GithubSpider < Kimurai::Base
    @name = "github_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
    @config = {
      before_request: { delay: 3..5 }
data/Rakefile CHANGED
@@ -1,10 +1,6 @@
  require 'bundler/gem_tasks'
- require 'rake/testtask'
+ require 'rspec/core/rake_task'

- Rake::TestTask.new(:test) do |t|
-   t.libs << 'test'
-   t.libs << 'lib'
-   t.test_files = FileList['test/**/*_test.rb']
- end
+ RSpec::Core::RakeTask.new(:spec)

- task default: :test
+ task default: :spec
data/kimurai.gemspec CHANGED
@@ -20,7 +20,7 @@ Gem::Specification.new do |spec|
    spec.bindir = 'exe'
    spec.executables = 'kimurai'
    spec.require_paths = ['lib']
-   spec.required_ruby_version = '>= 3.1.0'
+   spec.required_ruby_version = '>= 3.2.0'

    spec.add_dependency 'activesupport'
    spec.add_dependency 'cliver'
@@ -46,4 +46,8 @@ Gem::Specification.new do |spec|

    spec.add_dependency 'pry'
    spec.add_dependency 'rbcat', '~> 1.0'
+   spec.add_dependency 'nukitori'
+
+   spec.add_development_dependency 'rake', '~> 13.0'
+   spec.add_development_dependency 'rspec', '~> 3.13'
  end
@@ -7,10 +7,11 @@ module Kimurai
    attr_reader :format, :path, :position, :append

    def initialize(path, format:, position: true, append: false)
-     raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json jsonlines csv].include?(format)
+     raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json compact_json jsonlines csv].include?(format)

      @path = path
      @format = format
+     @format = :json if format == :pretty_json # :pretty_json is now an alias for :json
      @position = position
      @index = 0
      @append = append
@@ -19,44 +20,57 @@ module Kimurai

    def save(item)
      @mutex.synchronize do
-       @index += 1
-       item[:position] = @index if position
-
-       case format
-       when :json
-         save_to_json(item)
-       when :pretty_json
-         save_to_pretty_json(item)
-       when :jsonlines
-         save_to_jsonlines(item)
-       when :csv
-         save_to_csv(item)
+       if item.is_a?(Array)
+         item.each do |it|
+           @index += 1
+           it[:position] = @index if position
+
+           save_item(it)
+         end
+       else
+         @index += 1
+         item[:position] = @index if position
+
+         save_item(item)
        end
      end
    end

    private

+   def save_item(item)
+     case format
+     when :json
+       save_to_json(item)
+     when :compact_json
+       save_to_compact_json(item)
+     when :jsonlines
+       save_to_jsonlines(item)
+     when :csv
+       save_to_csv(item)
+     end
+   end
+
    def save_to_json(item)
-     data = JSON.generate([item])
+     data = JSON.pretty_generate([item])

      if @index > 1 || append && File.exist?(path)
-       file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
+       file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
        File.open(path, 'w') do |f|
-         f.write(file_content + data.sub(/\A\[/, ''))
+         f.write(file_content + data.sub(/\A\[\n/, ''))
        end
      else
        File.open(path, 'w') { |f| f.write(data) }
      end
    end

-   def save_to_pretty_json(item)
-     data = JSON.pretty_generate([item])
+   def save_to_compact_json(item)
+     data = JSON.generate([item])

      if @index > 1 || append && File.exist?(path)
-       file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
+       file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
        File.open(path, 'w') do |f|
-         f.write(file_content + data.sub(/\A\[\n/, ''))
+         f.write(file_content + data.sub(/\A\[/, ''))
        end
      else
        File.open(path, 'w') { |f| f.write(data) }
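For reference, a small sketch of what the reworked saver means on the spider side (file names and item hashes are illustrative, not taken from the diff):

```ruby
# :json is now pretty-printed (JSON.pretty_generate); :pretty_json is kept as an alias for it.
save_to "products.json", { name: "Phone", price: "$100" }, format: :json

# The previous compact, single-line output is still available as :compact_json.
save_to "products_compact.json", { name: "Phone", price: "$100" }, format: :compact_json

# Arrays are unpacked and written item by item, each receiving its own sequential :position key.
save_to "batch.json", [{ name: "Phone" }, { name: "Laptop" }], format: :json
```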
data/lib/kimurai/base.rb CHANGED
@@ -64,12 +64,12 @@ module Kimurai

    ###

-   @engine = :mechanize
+   @engine = :selenium_chrome
    @pipelines = []
    @config = {}

    def self.name
-     @name
+     @name || to_s.underscore
    end

    def self.engine
@@ -84,11 +84,22 @@ module Kimurai
      @start_urls
    end

+   def self.delay
+     @delay ||= superclass.respond_to?(:delay) ? superclass.delay : nil
+   end
+
    def self.config
-     if superclass.equal?(::Object)
-       @config
+     base_config = if superclass.equal?(::Object)
+       @config
+     else
+       superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+     end
+
+     # Merge @delay shortcut into config if set
+     if delay
+       base_config.deep_merge_excl({ before_request: { delay: delay } }, DMERGE_EXCLUDE)
      else
-       superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+       base_config
      end
    end

@@ -1,5 +1,15 @@
  module Kimurai
    module BaseHelper
+     def extract(response, model: nil, &block)
+       caller_info = caller_locations(1, 1).first
+       method_name = caller_info.base_label
+       spider_dir = File.dirname(caller_info.path)
+       schema_path = File.join(spider_dir, "#{self.class.name}.json")
+
+       data = Nukitori(response, schema_path, prefix: method_name, model:, &block)
+       data.deep_symbolize_keys
+     end
+
      private

      def absolute_url(url, base:)
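Given the helper above, the schema cache location and prefix follow from the call site. A sketch under assumed paths (the directory and class name are hypothetical):

```ruby
# /home/user/spiders/google_spider.rb
class GoogleSpider < Kimurai::Base
  def parse(response, url:, data: {})
    data = extract(response) { string :title }
    # Inside extract:
    #   caller method   => "parse"          (used as the prefix in the schema file)
    #   self.class.name => "google_spider"  (auto-derived when @name is not set)
    #   schema_path     => "/home/user/spiders/google_spider.json"
    data
  end
end
```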
@@ -1,6 +1,13 @@
  module Kimurai
    module BrowserBuilder
+     ENGINE_ALIASES = {
+       chrome: :selenium_chrome,
+       firefox: :selenium_firefox
+     }.freeze
+
      def self.build(engine, config = {}, spider:)
+       engine = ENGINE_ALIASES.fetch(engine, engine)
+
        begin
          require "kimurai/browser_builder/#{engine}_builder"
        rescue LoadError
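The alias lookup is a plain hash fetch, so unknown engine symbols pass through unchanged. A quick sketch:

```ruby
aliases = Kimurai::BrowserBuilder::ENGINE_ALIASES

aliases.fetch(:chrome, :chrome)       #=> :selenium_chrome
aliases.fetch(:firefox, :firefox)     #=> :selenium_firefox
aliases.fetch(:mechanize, :mechanize) #=> :mechanize (no alias defined, used as-is)
```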
@@ -10,7 +10,6 @@ module Capybara
      alias original_visit visit
      def visit(visit_uri, delay: config.before_request[:delay], skip_request_options: false, max_retries: 3)
        if spider
-         process_delay(delay) if delay
          retries = 0
          sleep_interval = 0

@@ -20,6 +19,9 @@ module Capybara
          spider.class.update(:visits, :requests) if spider.with_info

          original_visit(visit_uri)
+
+         logger.info "Browser: finished get request to: #{visit_uri}"
+         process_delay(delay) if delay
        rescue StandardError => e
          if match_error?(e, type: :to_skip)
            logger.error "Browser: skip request error: #{e.inspect}, url: #{visit_uri}"
@@ -40,7 +42,7 @@ module Capybara
            raise e
          end
        else
-         driver.responses += 1 and logger.info "Browser: finished get request to: #{visit_uri}"
+         driver.responses += 1
          spider.class.update(:visits, :responses) if spider.with_info
          driver.visited = true unless driver.visited
          true
@@ -170,7 +172,7 @@ module Capybara

      def process_delay(delay)
        interval = (delay.instance_of?(Range) ? rand(delay) : delay)
-       logger.debug "Browser: sleep #{interval.round(2)} #{'second'.pluralize(interval)} before request..."
+       logger.debug "Browser: delay #{interval.round(2)} #{'second'.pluralize(interval)}..."
        sleep interval
      end

@@ -35,7 +35,7 @@ module Kimurai

        return if in_project

-       insert_into_file spider_path, " @engine = :mechanize\n", after: "@name = \"#{spider_name}\"\n"
+       insert_into_file spider_path, " @engine = :chrome\n", after: "@name = \"#{spider_name}\"\n"
        prepend_to_file spider_path, "require 'kimurai'\n\n"
        append_to_file spider_path, "\n#{spider_class}.crawl!"
      end
@@ -1,3 +1,3 @@
  module Kimurai
-   VERSION = '2.0.1'.freeze
+   VERSION = '2.2.0'.freeze
  end
data/lib/kimurai.rb CHANGED
@@ -6,6 +6,7 @@ require 'uri'
  require 'active_support'
  require 'active_support/core_ext'
  require 'rbcat'
+ require 'nukitori'

  require_relative 'kimurai/version'

@@ -20,6 +21,33 @@ require_relative 'kimurai/pipeline'
  require_relative 'kimurai/base'

  module Kimurai
+   # Settings that will be forwarded to Nukitori configuration
+   NUKITORI_SETTINGS = %i[
+     openai_api_key
+     anthropic_api_key
+     gemini_api_key
+     vertexai_project_id
+     vertexai_location
+     deepseek_api_key
+     mistral_api_key
+     perplexity_api_key
+     openrouter_api_key
+     gpustack_api_key
+     openai_api_base
+     gemini_api_base
+     ollama_api_base
+     gpustack_api_base
+     openai_organization_id
+     openai_project_id
+     openai_use_system_role
+     bedrock_api_key
+     bedrock_secret_key
+     bedrock_region
+     bedrock_session_token
+     default_model
+     model_registry_file
+   ].freeze
+
    class << self
      def configuration
        @configuration ||= OpenStruct.new
@@ -27,6 +55,22 @@ module Kimurai

      def configure
        yield(configuration)
+       apply_nukitori_configuration
+     end
+
+     def apply_nukitori_configuration
+       nukitori_settings = NUKITORI_SETTINGS.filter_map do |setting|
+         value = configuration[setting]
+         [setting, value] if value
+       end.to_h
+
+       return if nukitori_settings.empty?
+
+       Nukitori.configure do |config|
+         nukitori_settings.each do |setting, value|
+           config.public_send("#{setting}=", value)
+         end
+       end
      end

      def env
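In practice a single `Kimurai.configure` block now also configures Nukitori: any setting listed in `NUKITORI_SETTINGS` is forwarded, while everything else stays on Kimurai's own configuration. A sketch (the model name and environment variable are placeholders taken from the README example):

```ruby
require 'kimurai'

Kimurai.configure do |config|
  # Kimurai-only setting, kept on Kimurai's configuration object:
  config.time_zone = "UTC"

  # Settings from NUKITORI_SETTINGS, forwarded to Nukitori.configure after the block runs:
  config.default_model  = "gemini-3-flash-preview"
  config.gemini_api_key = ENV["GEMINI_API_KEY"]
end
```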
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: kimurai
  version: !ruby/object:Gem::Version
-   version: 2.0.1
+   version: 2.2.0
  platform: ruby
  authors:
  - Victor Afanasev
@@ -261,6 +261,48 @@ dependencies:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.0'
+ - !ruby/object:Gem::Dependency
+   name: nukitori
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '13.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '13.0'
+ - !ruby/object:Gem::Dependency
+   name: rspec
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.13'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.13'
  email:
  - vicfreefly@gmail.com
  executables:
@@ -269,6 +311,7 @@ extensions: []
  extra_rdoc_files: []
  files:
  - ".gitignore"
+ - ".rspec"
  - ".rubocop.yml"
  - CHANGELOG.md
  - Gemfile
@@ -329,7 +372,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
-       version: 3.1.0
+       version: 3.2.0
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="