kimurai 2.0.1 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 326378ff2c70df034e5e13ce8006f0c6efbbd53f228e55c241be8d50ed3ee5e7
- data.tar.gz: de727e434b146f8671d3cd524b9b617f147ba1d2e96d4346a07511dc6dc59a88
+ metadata.gz: b0f990c2292eebb911b6036b7515fdbe4b844f75dc20ec032c1da352de740c80
+ data.tar.gz: 13f35756781bb2a0c8f14fe246edc87e428cd5de6bc46c269128d764b9f2763c
  SHA512:
- metadata.gz: 3e1776777a7e65328d0ad42a931edf867c4240a5e173083246157a1376880e84608a2fba4564f61752190defbe080af1271695a5329e3f83e4a635dfb43ecae8
- data.tar.gz: a011359fc944037d0305f5041c018d56a0b47d7c30d0e199bd7f5d62570a741ba6810dbdbad4edd2f702b72823712718a7b04495711772340900a2ca49bac4a1
+ metadata.gz: b2f236b701e505bba6e03083fc5e4308b125ba09203db04d41c1e72b0ccfbcb6957c35e6cc05e8f10ae2d3d96aeb7cdb1e7f24e7770406a8894f6020ec53c5c4
+ data.tar.gz: d47288711341145af98b0ad547583dae4ce26f38d6b024b4994a40eb9fc13f802238b0bf93eb0cd8b06d91ff098eceb2b837b50ef9e84ef3d0a98a25a70d0ce8
data/.gitignore CHANGED
@@ -1,3 +1,4 @@
+ /.claude/
  /.bundle/
  /.yardoc
  /_yardoc/
data/.rspec ADDED
@@ -0,0 +1,3 @@
+ --require spec_helper
+ --color
+ --format documentation
data/.rubocop.yml CHANGED
@@ -6,4 +6,6 @@ Style/TrivialAccessors:
  Style/RescueModifier:
  Enabled: false
  Style/FrozenStringLiteralComment:
+ Enabled: false
+ Style/Documentation:
  Enabled: false
data/CHANGELOG.md CHANGED
@@ -1,4 +1,18 @@
  # CHANGELOG
+ ## 2.1.0
+ ### New
+ * Min. required Ruby version is 3.2.0
+ * **AI-powered data extraction with `extract` method** — Powered by [Nukitori](https://github.com/vifreefly/nukitori). Describe the data structure you want and let AI generate XPath selectors automatically. Selectors are cached for reuse, so AI is only called once per page type
+ * **Configure Nukitori via Kimurai** — Set LLM provider settings (OpenAI, Anthropic, Gemini, etc.) directly in `Kimurai.configure` block
+ * **Engine aliases** — Use shorter engine names: `:chrome` (alias for `:selenium_chrome`), `:firefox` (alias for `:selenium_firefox`)
+ * **Top-level `@delay` option** — Set request delay directly as `@delay = 2..5` instead of nested `@config = { before_request: { delay: 2..5 } }`
+ * **Auto spider name** — If `@name` is not provided, it's automatically derived from the class name
+ * **Save array of items** — `save_to` helper now accepts an array of items to save at once
+
+ ### Improvements
+ * `save_to` helper now uses pretty JSON by default for `:json` format (use `format: :compact_json` for compact output)
+ * Request delay is now applied before the response is passed to the callback
+
  ## 2.0.1
  ### Fixes
  * Remove xpath as default Capybara selector type (fixes https://github.com/vifreefly/kimuraframework/issues/28)
data/LICENSE.txt CHANGED
@@ -1,6 +1,6 @@
  The MIT License (MIT)

- Copyright (c) 2018 Victor Afanasev
+ Copyright (c) 2026 Victor Afanasev

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
data/README.md CHANGED
@@ -1,20 +1,21 @@
- # Kimurai
+ <div align="center">
+ <a href="https://github.com/vifreefly/kimuraframework">
+ <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
+ </a>

- Kimurai is a modern web scraping framework written in Ruby which **works out of the box with Headless Chromium/Firefox** or simple HTTP requests and **allows you to scrape and interact with JavaScript rendered websites.**
+ <h1>Kimurai</h1>
+ </div>

- Kimurai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's try an example:
+ Kimurai is a modern Ruby web scraping framework designed to scrape and interact with JavaScript-rendered websites using headless antidetect Chromium, Firefox, or simple HTTP requests right out of the box:

  ```ruby
  # github_spider.rb
  require 'kimurai'

  class GithubSpider < Kimurai::Base
- @name = "github_spider"
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
- @config = {
- before_request: { delay: 3..5 }
- }
+ @delay = 3..5

  def parse(response, url:, data: {})
  response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
@@ -149,8 +150,7 @@ Okay, that was easy. How about JavaScript rendered websites with dynamic HTML? L
  require 'kimurai'

  class InfiniteScrollSpider < Kimurai::Base
- @name = "infinite_scroll_spider"
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://infinite-scroll.com/demo/full-page/"]

  def parse(response, url:, data: {})
@@ -194,14 +194,82 @@ I, [2025-12-16 12:47:15] INFO -- infinite_scroll_spider: > Continue scrolling,
  I, [2025-12-16 12:47:17] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
  I, [2025-12-16 12:47:19] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
  I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > Pagination is done
- I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page:
+
+ 1a - Infinite Scroll full page demo;
+ 1b - RGB Schemes logo in Computer Arts;
+ 2a - RGB Schemes logo;
+ 2b - Masonry gets horizontalOrder;
+ 2c - Every vector 2016;
+ 3a - Logo Pizza delivered;
+ 3b - Some CodePens;
+ 3c - 365daysofmusic.com;
+ 3d - Holograms;
+ 4a - Huebee: 1-click color picker;
+ 4b - Word is Flickity is good;
+ Flickity v2 released: groupCells, adaptiveHeight, parallax;
+ New tech gets chatter; Isotope v3 released: stagger in, IE8 out;
+ Packery v2 released
+
  I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
  I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spider_name: "infinite_scroll_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 12:47:05.372053 +0300, stop_time: 2025-12-16 12:47:21.505078 +0300, running_time: "16s", visits: {requests: 1, responses: 1}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
  ```
- </details><br>
+ </details>

+ ## AI-Powered Extraction
+
+ What if you could just describe the data you want and let AI figure out how to extract it? With the built-in `extract` method powered by [Nukitori](https://github.com/vifreefly/nukitori), you can:
+
+ ```ruby
+ # github_spider_ai.rb
+ require 'kimurai'
+
+ Kimurai.configure do |config|
+ config.default_model = "gemini-3-flash-preview" # OpenAI, Anthropic, Gemini, local LLMs, etc.
+ config.gemini_api_key = ENV["GEMINI_API_KEY"]
+ end
+
+ class GithubSpider < Kimurai::Base
+ @engine = :chrome
+ @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
+ @delay = 3..5
+
+ def parse(response, url:, data: {})
+ data = extract(response) do
+ string :next_page_url, description: "Next page path url"
+ array :repos do
+ object do
+ string :name
+ string :url
+ string :description
+ string :stars
+ string :language
+ array :tags, of: :string
+ end
+ end
+ end
+
+ save_to "results.json", data[:repos], format: :json
+
+ if data[:next_page_url]
+ request_to :parse, url: absolute_url(data[:next_page_url], base: url)
+ end
+ end
+ end
+
+ GithubSpider.crawl!
+ ```
+
+ **How it works:**
+ 1. On the first page, `extract` sends the HTML to an LLM which generates XPath rules for your schema
+ 2. These rules are cached in a JSON file alongside your spider
+ 3. **All subsequent pages use the cached XPath — no more AI calls, pure fast extraction**
+ 4. When there's no "Next" link on the last page, the extracted value is `nil` and pagination stops
+
+ Zero manual selectors. The AI figured out where everything lives, and that knowledge is reused for the entire crawl.

  ## Features
+ * **AI-powered data extraction**: Use [Nukitori](https://github.com/vifreefly/nukitori) to extract structured data without writing XPath/CSS selectors — just describe what you want, and AI figures out how to extract it
  * Scrape JavaScript rendered websites out of the box
  * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
  * Write spider code once, and use it with any supported engine later
@@ -229,6 +297,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
  * [browser object](#browser-object)
  * [request_to method](#request_to-method)
  * [save_to helper](#save_to-helper)
+ * [AI-powered extraction with extract](#ai-powered-extraction-with-extract)
  * [Skip duplicates](#skip-duplicates)
  * [Automatically skip all duplicate request urls](#automatically-skip-all-duplicate-request-urls)
  * [Storage object](#storage-object)
@@ -262,7 +331,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid


  ## Installation
- Kimurai requires Ruby version `>= 3.1.0`. Officially supported platforms: `Linux` and `macOS`.
+ Kimurai requires Ruby version `>= 3.2.0`. Officially supported platforms: `Linux` and `macOS`.

  1) If your system doesn't have the appropriate Ruby version, install it:

@@ -312,7 +381,7 @@ gem update --system

  ```bash
  # Install basic tools
- sudo apt install -q -y unzip wget tar openssl
+ sudo apt install -q -y unzip wget tar openssl lsof

  # Install xvfb (for virtual_display headless mode, in addition to native)
  sudo apt install -q -y xvfb
@@ -409,8 +478,8 @@ CLI arguments:
  Kimurai has support for the following engines and can mostly switch between them without the need to rewrite any code:

  * `:mechanize` – [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is. It can only parse the original HTML code of a page. Because of this, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. It's recommended to use mechanize when possible, i.e. when the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize tries to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
- * `:selenium_chrome` – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
- * `:selenium_firefox` – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.
+ * `:chrome` (`:selenium_chrome` alias) – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
+ * `:firefox` (`:selenium_firefox` alias) – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.

  **Tip:** prepend a `HEADLESS=false` environment variable on the command line (i.e. `$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.

@@ -423,7 +492,7 @@ require 'kimurai'

  class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
@@ -434,8 +503,8 @@ SimpleSpider.crawl!
  ```

  Where:
- * `@name` – a name for the spider
- * `@engine` – engine to use for the spider
+ * `@name` – a name for the spider (optional)
+ * `@engine` – engine to use for the spider (optional, default is `:mechanize`)
  * `@start_urls` – array of urls to process one-by-one inside the `parse` method
  * The `parse` method is the entry point, and should always be present in a spider class

@@ -458,7 +527,7 @@ Imagine that there is a product page that doesn't contain a category name. The c

  ```ruby
  class ProductsSpider < Kimurai::Base
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://example-shop.com/example-product-category"]

  def parse(response, url:, data: {})
@@ -497,8 +566,7 @@ But, if you need to interact with a page (like filling form fields, clicking ele

  ```ruby
  class GoogleSpider < Kimurai::Base
- @name = "google_spider"
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://www.google.com/"]

  def parse(response, url:, data: {})
@@ -529,7 +597,7 @@ For making requests to a particular method, there is `request_to`. It requires a

  ```ruby
  class Spider < Kimurai::Base
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
@@ -565,7 +633,7 @@ The `request_to` helper method makes things simpler. We could also do something

  ```ruby
  class Spider < Kimurai::Base
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
@@ -588,7 +656,7 @@ Sometimes all you need is to simply save scraped data to a file. You can use the

  ```ruby
  class ProductsSpider < Kimurai::Base
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://example-shop.com/"]

  # ...
@@ -607,12 +675,12 @@ end
  ```

  Supported formats:
- * `:json` – JSON
- * `:pretty_json` – "pretty" JSON (`JSON.pretty_generate`)
+ * `:json` – pretty-printed JSON (`JSON.pretty_generate`)
+ * `:compact_json` – compact JSON (`JSON.generate`)
  * `:jsonlines` – [JSON Lines](http://jsonlines.org/)
  * `:csv` – CSV

- Note: `save_to` requires the data (item) to save to be a `Hash`.
+ Note: `save_to` requires the data (item) to save to be a Hash or Array of Hashes.

  By default, `save_to` will add a position key to an item hash. You can disable it like so: `save_to "scraped_products.json", item, format: :json, position: false`

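The two formats above differ only in which JSON writer produces the output; a standalone sketch (the item hash is illustrative):

```ruby
require 'json'

# :json now writes what JSON.pretty_generate produces;
# :compact_json writes the single-line JSON.generate form.
item = { name: "kimurai", stars: "1.5k" }

pretty  = JSON.pretty_generate([item]) # multi-line, indented
compact = JSON.generate([item])        # one line
```

Both forms parse back to the same data; pretty output is simply easier to read while a spider is running.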
@@ -622,13 +690,91 @@ While the spider is running, each new item will be appended to the output file.

  > If you don't want the file to be cleared before each run, pass `append: true` like so: `save_to "scraped_products.json", item, format: :json, append: true`

+ ### AI-powered extraction with `extract`
+
+ Writing and maintaining XPath/CSS selectors is tedious and error-prone. The `extract` method uses AI to generate selectors automatically — you just describe the data structure you want.
+
+ **Configuration:**
+
+ First, configure an LLM provider in your application:
+
+ ```ruby
+ Kimurai.configure do |config|
+ config.default_model = 'gemini-3-flash-preview'
+ config.gemini_api_key = ENV['GEMINI_API_KEY']
+
+ # Or use OpenAI
+ # config.default_model = 'gpt-5.2'
+ # config.openai_api_key = ENV['OPENAI_API_KEY']
+
+ # Or Anthropic
+ # config.default_model = 'claude-sonnet-4-5'
+ # config.anthropic_api_key = ENV['ANTHROPIC_API_KEY']
+ end
+ ```
+
+ **Usage:**
+
+ ```ruby
+ def parse(response, url:, data: {})
+ data = extract(response) do
+ string :title
+ string :price
+ string :description
+ array :features, of: :string
+ end
+
+ save_to "products.json", data, format: :json
+ end
+ ```
+
+ **Schema DSL:**
+
+ - `string :field_name` — extracts text
+ - `integer :field_name` — extracts integer
+ - `number :field_name` — extracts float/decimal
+ - `array :items do ... end` — extracts list of objects
+ - `array :tags, of: :string` — extracts list of strings
+ - `object do ... end` — nested structure
+ - `description: '...'` — hint for AI about what to look for
+
+ **How it works:**
+
+ 1. On first run, `extract` sends the HTML and your schema to an LLM
+ 2. The LLM returns XPath rules for each field
+ 3. These rules are cached in `SpiderName.json` alongside your spider file
+ 4. All subsequent extractions use cached XPath — fast and free, no more AI calls
+ 5. Each method gets its own prefix in the schema file, so different parse methods can have different schemas
+
+ **Automatic pagination:**
+
+ Include a next page field in your schema:
+
+ ```ruby
+ data = extract(response) do
+ string :next_page_url, description: 'Next page link'
+ array :products do
+ object do
+ string :name
+ string :price
+ end
+ end
+ end
+
+ if data[:next_page_url]
+ request_to :parse, url: absolute_url(data[:next_page_url], base: url)
+ end
+ ```
+
+ When the last page has no "Next" link, the extracted value is `nil` and pagination stops naturally.
+
  ### Skip duplicates

  It's pretty common for websites to have duplicate pages. For example, when an e-commerce site has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:

  ```ruby
  class ProductsSpider < Kimurai::Base
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://example-shop.com/"]

  def parse(response, url:, data: {})
@@ -842,8 +988,7 @@ The `run_info` method is available from the `open_spider` and `close_spider` cla

  ```ruby
  class ExampleSpider < Kimurai::Base
- @name = "example_spider"
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://example.com/"]

  def self.close_spider
@@ -895,7 +1040,7 @@ You can also use the additional methods `completed?` or `failed?`

  ```ruby
  class Spider < Kimurai::Base
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://example.com/"]

  def self.close_spider
@@ -933,7 +1078,7 @@ Kimurai supports environments. The default is `development`. To provide a custom
  Usage example:
  ```ruby
  class Spider < Kimurai::Base
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://example.com/"]

  def self.close_spider
@@ -956,7 +1101,6 @@ Kimurai can process web pages concurrently: `in_parallel(:parse_product, urls, t
  require 'kimurai'

  class AmazonSpider < Kimurai::Base
- @name = "amazon_spider"
  @engine = :mechanize
  @start_urls = ["https://www.amazon.com/"]

@@ -1068,7 +1212,7 @@ vic@Vics-MacBook-Air single %

  * `data:` – pass custom data like so: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
  * `delay:` – set delay between requests like so: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, the delay (in seconds) will be set randomly for each request: `rand (2..5) # => 3`
- * `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :selenium_chrome)`
+ * `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :chrome)`
  * `config:` – set custom [config](#spider-config) options

  ### Active Support included
@@ -1170,7 +1314,7 @@ Kimurai.configure do |config|

  # Custom time zone (for logs):
  # config.time_zone = "UTC"
- # config.time_zone = "Europe/Moscow"
+ # config.time_zone = "Europe/Berlin"

  # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
  # config.selenium_chrome_path = "/usr/bin/chromium-browser"
@@ -1286,7 +1430,7 @@ class Spider < Kimurai::Base
  USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
  PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]

- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://example.com/"]
  @config = {
  headers: { "custom_header" => "custom_value" },
@@ -1649,7 +1793,7 @@ end
  spiders/application_spider.rb
  ```ruby
  class ApplicationSpider < Kimurai::Base
- @engine = :selenium_chrome
+ @engine = :chrome

  # Define pipelines (by order) for all spiders:
  @pipelines = [:validator, :saver]
@@ -1726,7 +1870,7 @@ spiders/github_spider.rb
  ```ruby
  class GithubSpider < Kimurai::Base
  @name = "github_spider"
- @engine = :selenium_chrome
+ @engine = :chrome
  @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
  @config = {
  before_request: { delay: 3..5 }
data/Rakefile CHANGED
@@ -1,10 +1,6 @@
  require 'bundler/gem_tasks'
- require 'rake/testtask'
+ require 'rspec/core/rake_task'

- Rake::TestTask.new(:test) do |t|
- t.libs << 'test'
- t.libs << 'lib'
- t.test_files = FileList['test/**/*_test.rb']
- end
+ RSpec::Core::RakeTask.new(:spec)

- task default: :test
+ task default: :spec
data/kimurai.gemspec CHANGED
@@ -20,7 +20,7 @@ Gem::Specification.new do |spec|
  spec.bindir = 'exe'
  spec.executables = 'kimurai'
  spec.require_paths = ['lib']
- spec.required_ruby_version = '>= 3.1.0'
+ spec.required_ruby_version = '>= 3.2.0'

  spec.add_dependency 'activesupport'
  spec.add_dependency 'cliver'
@@ -46,4 +46,8 @@ Gem::Specification.new do |spec|

  spec.add_dependency 'pry'
  spec.add_dependency 'rbcat', '~> 1.0'
+ spec.add_dependency 'nukitori'
+
+ spec.add_development_dependency 'rake', '~> 13.0'
+ spec.add_development_dependency 'rspec', '~> 3.13'
  end
@@ -7,10 +7,11 @@ module Kimurai
  attr_reader :format, :path, :position, :append

  def initialize(path, format:, position: true, append: false)
- raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json jsonlines csv].include?(format)
+ raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json compact_json jsonlines csv].include?(format)

  @path = path
  @format = format
+ @format = :json if format == :pretty_json # :pretty_json is now an alias for :json
  @position = position
  @index = 0
  @append = append
@@ -19,44 +20,57 @@ module Kimurai

  def save(item)
  @mutex.synchronize do
- @index += 1
- item[:position] = @index if position
-
- case format
- when :json
- save_to_json(item)
- when :pretty_json
- save_to_pretty_json(item)
- when :jsonlines
- save_to_jsonlines(item)
- when :csv
- save_to_csv(item)
+ if item.is_a?(Array)
+ item.each do |it|
+ @index += 1
+ it[:position] = @index if position
+
+ save_item(it)
+ end
+ else
+ @index += 1
+ item[:position] = @index if position
+
+ save_item(item)
  end
  end
  end

  private

+ def save_item(item)
+ case format
+ when :json
+ save_to_json(item)
+ when :compact_json
+ save_to_compact_json(item)
+ when :jsonlines
+ save_to_jsonlines(item)
+ when :csv
+ save_to_csv(item)
+ end
+ end
+
  def save_to_json(item)
- data = JSON.generate([item])
+ data = JSON.pretty_generate([item])

  if @index > 1 || append && File.exist?(path)
- file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
+ file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
  File.open(path, 'w') do |f|
- f.write(file_content + data.sub(/\A\[/, ''))
+ f.write(file_content + data.sub(/\A\[\n/, ''))
  end
  else
  File.open(path, 'w') { |f| f.write(data) }
  end
  end

- def save_to_pretty_json(item)
- data = JSON.pretty_generate([item])
+ def save_to_compact_json(item)
+ data = JSON.generate([item])

  if @index > 1 || append && File.exist?(path)
- file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
+ file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
  File.open(path, 'w') do |f|
- f.write(file_content + data.sub(/\A\[\n/, ''))
+ f.write(file_content + data.sub(/\A\[/, ''))
  end
  else
  File.open(path, 'w') { |f| f.write(data) }
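The append strategy in `save_to_json` above keeps the file a valid JSON array after every write: the closing brace-and-bracket of the existing pretty-printed array is rewritten to a trailing comma, and the new item (minus its opening bracket) is appended. A minimal stand-alone version (the `append_pretty_json` helper is hypothetical, not the gem's code):

```ruby
require 'json'
require 'tmpdir'

# Sketch of the pretty-JSON append trick: replace the trailing "}\n]"
# with "},\n", then append the new item without its leading "[\n".
def append_pretty_json(path, item)
  data = JSON.pretty_generate([item])
  if File.exist?(path)
    content = File.read(path).sub(/\}\n\]\z/, "},\n")
    File.write(path, content + data.sub(/\A\[\n/, ''))
  else
    File.write(path, data)
  end
end

path = File.join(Dir.mktmpdir, "items.json")
append_pretty_json(path, { "position" => 1, "name" => "first" })
append_pretty_json(path, { "position" => 2, "name" => "second" })
# The file on disk still parses as one JSON array of both items
```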
data/lib/kimurai/base.rb CHANGED
@@ -69,7 +69,7 @@ module Kimurai
  @config = {}

  def self.name
- @name
+ @name || to_s.underscore
  end

  def self.engine
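The `to_s.underscore` fallback above comes from ActiveSupport. As a rough illustration of what that derivation does, here is a stand-in regexp version (the `snake_case` helper is an assumption for illustration, not the gem's code):

```ruby
# Derive a spider name from its class name when @name is absent,
# e.g. InfiniteScrollSpider -> "infinite_scroll_spider".
def snake_case(class_name)
  class_name.gsub("::", "/")
            .gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2')
            .gsub(/([a-z\d])([A-Z])/, '\1_\2')
            .downcase
end

class InfiniteScrollSpider; end

spider_name = snake_case(InfiniteScrollSpider.to_s)
```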
@@ -84,11 +84,22 @@ module Kimurai
  @start_urls
  end

+ def self.delay
+ @delay ||= superclass.respond_to?(:delay) ? superclass.delay : nil
+ end
+
  def self.config
- if superclass.equal?(::Object)
- @config
+ base_config = if superclass.equal?(::Object)
+ @config
+ else
+ superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+ end
+
+ # Merge @delay shortcut into config if set
+ if delay
+ base_config.deep_merge_excl({ before_request: { delay: delay } }, DMERGE_EXCLUDE)
  else
- superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+ base_config
  end
  end

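The folding of the `@delay` shortcut into the nested `before_request` config can be pictured with a plain recursive merge (the gem's `deep_merge_excl` also takes an exclusion list, which this sketch omits):

```ruby
# Plain recursive hash merge standing in for deep_merge_excl.
def deep_merge(left, right)
  left.merge(right) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

config = { headers: { "X-Custom" => "1" } }
delay  = 3..5

# When @delay is set, it becomes config[:before_request][:delay].
merged = delay ? deep_merge(config, { before_request: { delay: delay } }) : config
```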
@@ -1,5 +1,15 @@
  module Kimurai
  module BaseHelper
+ def extract(response, model: nil, &block)
+ caller_info = caller_locations(1, 1).first
+ method_name = caller_info.base_label
+ spider_dir = File.dirname(caller_info.path)
+ schema_path = File.join(spider_dir, "#{self.class.name}.json")
+
+ data = Nukitori(response, schema_path, prefix: method_name, model:, &block)
+ data.deep_symbolize_keys
+ end
+
  private

  def absolute_url(url, base:)
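The `caller_locations(1, 1)` call above is how `extract` learns which parse method invoked it, so cached selectors can be keyed per method. The trick works on its own:

```ruby
# base_label of the first caller frame is the name of the method
# that called us (method names here are illustrative).
def calling_method_name
  caller_locations(1, 1).first.base_label
end

def parse_product
  calling_method_name
end

label = parse_product
```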
@@ -1,6 +1,13 @@
  module Kimurai
  module BrowserBuilder
+ ENGINE_ALIASES = {
+ chrome: :selenium_chrome,
+ firefox: :selenium_firefox
+ }.freeze
+
  def self.build(engine, config = {}, spider:)
+ engine = ENGINE_ALIASES.fetch(engine, engine)
+
  begin
  require "kimurai/browser_builder/#{engine}_builder"
  rescue LoadError
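Alias resolution is a single `Hash#fetch` with the original engine as the fallback, so unknown symbols pass through untouched:

```ruby
# Same table as above; fetch's second argument is the default
# returned when the key is absent.
ENGINE_ALIASES = { chrome: :selenium_chrome, firefox: :selenium_firefox }.freeze

def resolve_engine(engine)
  ENGINE_ALIASES.fetch(engine, engine)
end
```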
@@ -10,7 +10,6 @@ module Capybara
  alias original_visit visit
  def visit(visit_uri, delay: config.before_request[:delay], skip_request_options: false, max_retries: 3)
  if spider
- process_delay(delay) if delay
  retries = 0
  sleep_interval = 0

@@ -20,6 +19,9 @@ module Capybara
  spider.class.update(:visits, :requests) if spider.with_info

  original_visit(visit_uri)
+
+ logger.info "Browser: finished get request to: #{visit_uri}"
+ process_delay(delay) if delay
  rescue StandardError => e
  if match_error?(e, type: :to_skip)
  logger.error "Browser: skip request error: #{e.inspect}, url: #{visit_uri}"
@@ -40,7 +42,7 @@ module Capybara
  raise e
  end
  else
- driver.responses += 1 and logger.info "Browser: finished get request to: #{visit_uri}"
+ driver.responses += 1
  spider.class.update(:visits, :responses) if spider.with_info
  driver.visited = true unless driver.visited
  true
@@ -170,7 +172,7 @@ module Capybara

  def process_delay(delay)
  interval = (delay.instance_of?(Range) ? rand(delay) : delay)
- logger.debug "Browser: sleep #{interval.round(2)} #{'second'.pluralize(interval)} before request..."
+ logger.debug "Browser: delay #{interval.round(2)} #{'second'.pluralize(interval)}..."
  sleep interval
  end

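`process_delay` above treats a `Range` as a per-request random interval and a plain number as a fixed one; the interval choice can be sketched without the sleep:

```ruby
# rand(2..5) picks an integer in 2..5 inclusive; a Float or Integer
# delay is used as-is.
def pick_interval(delay)
  delay.instance_of?(Range) ? rand(delay) : delay
end

interval = pick_interval(2..5) # some integer between 2 and 5
```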
@@ -1,3 +1,3 @@
  module Kimurai
- VERSION = '2.0.1'.freeze
+ VERSION = '2.1.0'.freeze
  end
data/lib/kimurai.rb CHANGED
@@ -6,6 +6,7 @@ require 'uri'
  require 'active_support'
  require 'active_support/core_ext'
  require 'rbcat'
+ require 'nukitori'

  require_relative 'kimurai/version'

@@ -20,6 +21,33 @@ require_relative 'kimurai/pipeline'
  require_relative 'kimurai/base'

  module Kimurai
+ # Settings that will be forwarded to Nukitori configuration
+ NUKITORI_SETTINGS = %i[
+ openai_api_key
+ anthropic_api_key
+ gemini_api_key
+ vertexai_project_id
+ vertexai_location
+ deepseek_api_key
+ mistral_api_key
+ perplexity_api_key
+ openrouter_api_key
+ gpustack_api_key
+ openai_api_base
+ gemini_api_base
+ ollama_api_base
+ gpustack_api_base
+ openai_organization_id
+ openai_project_id
+ openai_use_system_role
+ bedrock_api_key
+ bedrock_secret_key
+ bedrock_region
+ bedrock_session_token
+ default_model
+ model_registry_file
+ ].freeze
+
  class << self
  def configuration
  @configuration ||= OpenStruct.new
@@ -27,6 +55,22 @@ module Kimurai

  def configure
  yield(configuration)
+ apply_nukitori_configuration
+ end
+
+ def apply_nukitori_configuration
+ nukitori_settings = NUKITORI_SETTINGS.filter_map do |setting|
+ value = configuration[setting]
+ [setting, value] if value
+ end.to_h
+
+ return if nukitori_settings.empty?
+
+ Nukitori.configure do |config|
+ nukitori_settings.each do |setting, value|
+ config.public_send("#{setting}=", value)
+ end
+ end
  end

  def env
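The forwarding in `apply_nukitori_configuration` copies only the settings that were actually assigned. The filtering step can be sketched with a plain Hash standing in for the OpenStruct configuration, and a key list reduced for illustration:

```ruby
# Subset of the real NUKITORI_SETTINGS list, for illustration only.
NUKITORI_SETTINGS = %i[default_model gemini_api_key openai_api_key].freeze

configuration = {
  default_model: "gemini-3-flash-preview",
  gemini_api_key: "test-key",
  openai_api_key: nil # unset keys are skipped
}

# filter_map keeps only [setting, value] pairs with a truthy value.
forwarded = NUKITORI_SETTINGS.filter_map do |setting|
  value = configuration[setting]
  [setting, value] if value
end.to_h
```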
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: kimurai
  version: !ruby/object:Gem::Version
- version: 2.0.1
+ version: 2.1.0
  platform: ruby
  authors:
  - Victor Afanasev
@@ -261,6 +261,48 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '1.0'
+ - !ruby/object:Gem::Dependency
+ name: nukitori
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ - !ruby/object:Gem::Dependency
+ name: rake
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '13.0'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '13.0'
+ - !ruby/object:Gem::Dependency
+ name: rspec
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '3.13'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '3.13'
  email:
  - vicfreefly@gmail.com
  executables:
@@ -269,6 +311,7 @@ extensions: []
  extra_rdoc_files: []
  files:
  - ".gitignore"
+ - ".rspec"
  - ".rubocop.yml"
  - CHANGELOG.md
  - Gemfile
@@ -329,7 +372,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: 3.1.0
+ version: 3.2.0
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="