kimurai 2.0.1 → 2.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 326378ff2c70df034e5e13ce8006f0c6efbbd53f228e55c241be8d50ed3ee5e7
- data.tar.gz: de727e434b146f8671d3cd524b9b617f147ba1d2e96d4346a07511dc6dc59a88
+ metadata.gz: eeeca8fc2ae390e6c557f435478ee4ea8273920e3ab7c590800c338574f364d0
+ data.tar.gz: d7d2d799a97c51c0e1837080c249316651beedb58b312df3df9bc69fefabac31
  SHA512:
- metadata.gz: 3e1776777a7e65328d0ad42a931edf867c4240a5e173083246157a1376880e84608a2fba4564f61752190defbe080af1271695a5329e3f83e4a635dfb43ecae8
- data.tar.gz: a011359fc944037d0305f5041c018d56a0b47d7c30d0e199bd7f5d62570a741ba6810dbdbad4edd2f702b72823712718a7b04495711772340900a2ca49bac4a1
+ metadata.gz: 80f449d68068d238da99dbbb83b710e8071e7fd4ded76bc17f096fac1446a307e841d87300c461299c6c01fe9abaecbcc5c91d9a2fa9878d5af5d1cac888349a
+ data.tar.gz: 793f3301353e135484ad0283973cd4c3c181fa43a9510339f06a61152a557f758f1d9e32a2fec12954ce5b9e3a409c554efea13485f7d66c67ba8654e4f6baeb
data/.gitignore CHANGED
@@ -1,3 +1,4 @@
+ /.claude/
  /.bundle/
  /.yardoc
  /_yardoc/
data/.rspec ADDED
@@ -0,0 +1,3 @@
+ --require spec_helper
+ --color
+ --format documentation
data/.rubocop.yml CHANGED
@@ -6,4 +6,6 @@ Style/TrivialAccessors:
  Style/RescueModifier:
    Enabled: false
  Style/FrozenStringLiteralComment:
+   Enabled: false
+ Style/Documentation:
    Enabled: false
data/CHANGELOG.md CHANGED
@@ -1,4 +1,23 @@
  # CHANGELOG
+
+ ## 2.2.0
+ ### New
+ * The default engine is now `:chrome` (was `:mechanize`)
+
+ ## 2.1.0
+ ### New
+ * Minimum required Ruby version is now 3.2.0
+ * **AI-powered data extraction with `extract` method** — Powered by [Nukitori](https://github.com/vifreefly/nukitori). Describe the data structure you want and let AI generate XPath selectors automatically. Selectors are cached for reuse, so AI is only called once per page type
+ * **Configure Nukitori via Kimurai** — Set LLM provider settings (OpenAI, Anthropic, Gemini, etc.) directly in the `Kimurai.configure` block
+ * **Engine aliases** — Use shorter engine names: `:chrome` (alias for `:selenium_chrome`), `:firefox` (alias for `:selenium_firefox`)
+ * **Top-level `@delay` option** — Set the request delay directly as `@delay = 2..5` instead of the nested `@config = { before_request: { delay: 2..5 } }`
+ * **Auto spider name** — If `@name` is not provided, it's automatically derived from the class name
+ * **Save array of items** — The `save_to` helper now accepts an array of items to save at once
+
+ ### Improvements
+ * The `save_to` helper now uses pretty JSON by default for the `:json` format (use `format: :compact_json` for compact output)
+ * The request delay is now applied after the request completes, before the response is passed to the callback
+
  ## 2.0.1
  ### Fixes
  * Remove xpath as default Capybara selector type (fixes https://github.com/vifreefly/kimuraframework/issues/28)
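Taken together, the 2.1.0 and 2.2.0 entries above change how a minimal spider reads. A short illustrative sketch (the spider class and URL are hypothetical, not taken from the gem):

```ruby
require 'kimurai'

class ProductsSpider < Kimurai::Base
  # No @name needed: it is derived from the class name ("products_spider").
  # No @engine needed either: :chrome (:selenium_chrome) is now the default.
  @delay = 2..5   # top-level delay instead of @config = { before_request: { delay: 2..5 } }
  @start_urls = ['https://example.com/products']

  def parse(response, url:, data: {})
    items = response.xpath("//div[@class='product']").map do |product|
      { title: product.at_xpath(".//h2")&.text, price: product.at_xpath(".//span[@class='price']")&.text }
    end

    # save_to now accepts an array of items; :json output is pretty-printed by default
    save_to 'products.json', items, format: :json
  end
end

ProductsSpider.crawl!
```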
data/LICENSE.txt CHANGED
@@ -1,6 +1,6 @@
  The MIT License (MIT)

- Copyright (c) 2018 Victor Afanasev
+ Copyright (c) 2026 Victor Afanasev

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
data/README.md CHANGED
@@ -1,20 +1,75 @@
- # Kimurai
+ <div align="center">
+   <a href="https://github.com/vifreefly/kimuraframework">
+     <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
+   </a>

- Kimurai is a modern web scraping framework written in Ruby which **works out of the box with Headless Chromium/Firefox** or simple HTTP requests and **allows you to scrape and interact with JavaScript rendered websites.**
+   <h1>Kimurai: AI-First Web Scraping Framework for Ruby</h1>
+ </div>

- Kimurai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's try an example:
+ Write web scrapers in Ruby using a clean, AI-assisted DSL. Kimurai uses AI to figure out where the data lives, then caches the selectors and scrapes with pure Ruby. Get the intelligence of an LLM without the per-request latency or token costs:
+
+ ```ruby
+ # google_spider.rb
+ require 'kimurai'
+
+ class GoogleSpider < Kimurai::Base
+   @start_urls = ['https://www.google.com/search?q=web+scraping+ai']
+   @delay = 1
+
+   def parse(response, url:, data: {})
+     results = extract(response) do
+       array :organic_results do
+         object do
+           string :title
+           string :snippet
+           string :url
+         end
+       end
+
+       array :sponsored_results do
+         object do
+           string :title
+           string :snippet
+           string :url
+         end
+       end
+
+       array :people_also_search_for, of: :string
+
+       string :next_page_link
+       number :current_page_number
+     end
+
+     save_to 'google_results.json', results, format: :json
+
+     if results[:next_page_link] && results[:current_page_number] < 3
+       request_to :parse, url: absolute_url(results[:next_page_link], base: url)
+     end
+   end
+ end
+
+ GoogleSpider.crawl!
+ ```
+
+ **How it works:**
+ 1. On the first request, `extract` sends the HTML + your schema to an LLM
+ 2. The LLM generates XPath selectors and caches them in `google_spider.json`
+ 3. **All subsequent requests use cached XPath — zero AI calls, pure fast Ruby extraction**
+ 4. Supports OpenAI, Anthropic, Gemini, or local LLMs via [Nukitori](https://github.com/vifreefly/nukitori)
+
+
+ ## Traditional Mode
+
+ Prefer writing your own selectors? Kimurai works great as a traditional scraper too — with headless antidetect Chromium, Firefox, or simple HTTP requests:

  ```ruby
  # github_spider.rb
  require 'kimurai'

  class GithubSpider < Kimurai::Base
-   @name = "github_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
-   @config = {
-     before_request: { delay: 3..5 }
-   }
+   @delay = 3..5

    def parse(response, url:, data: {})
      response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
@@ -149,8 +204,7 @@ Okay, that was easy. How about JavaScript rendered websites with dynamic HTML? L
  require 'kimurai'

  class InfiniteScrollSpider < Kimurai::Base
-   @name = "infinite_scroll_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://infinite-scroll.com/demo/full-page/"]

    def parse(response, url:, data: {})
@@ -194,14 +248,76 @@ I, [2025-12-16 12:47:15] INFO -- infinite_scroll_spider: > Continue scrolling,
  I, [2025-12-16 12:47:17] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
  I, [2025-12-16 12:47:19] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
  I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > Pagination is done
- I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page:
+
+ 1a - Infinite Scroll full page demo;
+ 1b - RGB Schemes logo in Computer Arts;
+ 2a - RGB Schemes logo;
+ 2b - Masonry gets horizontalOrder;
+ 2c - Every vector 2016;
+ 3a - Logo Pizza delivered;
+ 3b - Some CodePens;
+ 3c - 365daysofmusic.com;
+ 3d - Holograms;
+ 4a - Huebee: 1-click color picker;
+ 4b - Word is Flickity is good;
+ Flickity v2 released: groupCells, adaptiveHeight, parallax;
+ New tech gets chatter; Isotope v3 released: stagger in, IE8 out;
+ Packery v2 released
+
  I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
  I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spider_name: "infinite_scroll_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 12:47:05.372053 +0300, stop_time: 2025-12-16 12:47:21.505078 +0300, running_time: "16s", visits: {requests: 1, responses: 1}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
  ```
- </details><br>
+ </details>
+
+ ## AI Extraction — Configuration
+
+ Configure your LLM provider to start using AI extraction. The `extract` method is powered by [Nukitori](https://github.com/vifreefly/nukitori):
+
+ ```ruby
+ # github_spider_ai.rb
+ require 'kimurai'
+
+ Kimurai.configure do |config|
+   config.default_model = "gemini-3-flash-preview" # OpenAI, Anthropic, Gemini, local LLMs, etc.
+   config.gemini_api_key = ENV["GEMINI_API_KEY"]
+ end
+
+ class GithubSpider < Kimurai::Base
+   @engine = :chrome
+   @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
+   @delay = 3..5

+   def parse(response, url:, data: {})
+     data = extract(response) do
+       string :next_page_url, description: "Next page path url"
+       array :repos do
+         object do
+           string :name
+           string :url
+           string :description
+           string :stars
+           string :language
+           array :tags, of: :string
+         end
+       end
+     end
+
+     save_to "results.json", data[:repos], format: :json
+
+     if data[:next_page_url]
+       request_to :parse, url: absolute_url(data[:next_page_url], base: url)
+     end
+   end
+ end
+
+ GithubSpider.crawl!
+ ```
+
+ Selectors are cached in `github_spider_ai.json` after the first AI call — all subsequent requests use pure Ruby extraction.

  ## Features
+ * **AI-powered data extraction**: Use [Nukitori](https://github.com/vifreefly/nukitori) to extract structured data without writing XPath/CSS selectors — just describe what you want, and AI figures out how to extract it
  * Scrape JavaScript rendered websites out of the box
  * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
  * Write spider code once, and use it with any supported engine later
@@ -218,6 +334,8 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid

  ## Table of Contents
  * [Kimurai](#kimurai)
+ * [Traditional Mode](#traditional-mode)
+ * [AI Extraction — Configuration](#ai-extraction--configuration)
  * [Features](#features)
  * [Table of Contents](#table-of-contents)
  * [Installation](#installation)
@@ -229,6 +347,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
  * [browser object](#browser-object)
  * [request_to method](#request_to-method)
  * [save_to helper](#save_to-helper)
+ * [AI-powered extraction with extract](#ai-powered-extraction-with-extract)
  * [Skip duplicates](#skip-duplicates)
  * [Automatically skip all duplicate request urls](#automatically-skip-all-duplicate-request-urls)
  * [Storage object](#storage-object)
@@ -262,7 +381,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid


  ## Installation
- Kimurai requires Ruby version `>= 3.1.0`. Officially supported platforms: `Linux` and `macOS`.
+ Kimurai requires Ruby version `>= 3.2.0`. Officially supported platforms: `Linux` and `macOS`.

  1) If your system doesn't have the appropriate Ruby version, install it:

@@ -312,7 +431,7 @@ gem update --system

  ```bash
  # Install basic tools
- sudo apt install -q -y unzip wget tar openssl
+ sudo apt install -q -y unzip wget tar openssl lsof

  # Install xvfb (for virtual_display headless mode, in addition to native)
  sudo apt install -q -y xvfb
@@ -409,8 +528,8 @@ CLI arguments:
  Kimurai has support for the following engines and can mostly switch between them without the need to rewrite any code:

  * `:mechanize` – [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is; it can only parse the original HTML code of a page. Because of this, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. It's recommended to use mechanize when possible, i.e. if the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize is trying to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
- * `:selenium_chrome` – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
- * `:selenium_firefox` – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.
+ * `:chrome` (`:selenium_chrome` alias) – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
+ * `:firefox` (`:selenium_firefox` alias) – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.

  **Tip:** prepend a `HEADLESS=false` environment variable on the command line (i.e. `$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.

@@ -423,7 +542,7 @@ require 'kimurai'

  class SimpleSpider < Kimurai::Base
    @name = "simple_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def parse(response, url:, data: {})
@@ -434,8 +553,8 @@ SimpleSpider.crawl!
  ```

  Where:
- * `@name` – a name for the spider
- * `@engine` – engine to use for the spider
+ * `@name` – a name for the spider (optional)
+ * `@engine` – engine to use for the spider (optional, default is `:selenium_chrome`)
  * `@start_urls` – array of urls to process one-by-one inside the `parse` method
  * The `parse` method is the entry point, and should always be present in a spider class

@@ -458,7 +577,7 @@ Imagine that there is a product page that doesn't contain a category name. The c

  ```ruby
  class ProductsSpider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example-shop.com/example-product-category"]

    def parse(response, url:, data: {})
@@ -497,8 +616,7 @@ But, if you need to interact with a page (like filling form fields, clicking ele

  ```ruby
  class GoogleSpider < Kimurai::Base
-   @name = "google_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://www.google.com/"]

    def parse(response, url:, data: {})
@@ -529,7 +647,7 @@ For making requests to a particular method, there is `request_to`. It requires a

  ```ruby
  class Spider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def parse(response, url:, data: {})
@@ -565,7 +683,7 @@ The `request_to` helper method makes things simpler. We could also do something

  ```ruby
  class Spider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def parse(response, url:, data: {})
@@ -588,7 +706,7 @@ Sometimes all you need is to simply save scraped data to a file. You can use the

  ```ruby
  class ProductsSpider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example-shop.com/"]

    # ...
@@ -607,12 +725,12 @@ end
  ```

  Supported formats:
- * `:json` – JSON
- * `:pretty_json` – "pretty" JSON (`JSON.pretty_generate`)
+ * `:json` – pretty-printed JSON (`JSON.pretty_generate`)
+ * `:compact_json` – compact JSON (`JSON.generate`)
  * `:jsonlines` – [JSON Lines](http://jsonlines.org/)
  * `:csv` – CSV

- Note: `save_to` requires the data (item) to save to be a `Hash`.
+ Note: `save_to` requires the data (item) to save to be a `Hash` or an `Array` of Hashes.

  By default, `save_to` will add a position key to an item hash. You can disable it like so: `save_to "scraped_products.json", item, format: :json, position: false`

@@ -622,13 +740,91 @@ While the spider is running, each new item will be appended to the output file.

  > If you don't want the file to be cleared before each run, pass `append: true` like so: `save_to "scraped_products.json", item, format: :json, append: true`

+ ### AI-powered extraction with `extract`
+
+ Writing and maintaining XPath/CSS selectors is tedious and error-prone. The `extract` method uses AI to generate selectors automatically — you just describe the data structure you want.
+
+ **Configuration:**
+
+ First, configure an LLM provider in your application:
+
+ ```ruby
+ Kimurai.configure do |config|
+   config.default_model = 'gemini-3-flash-preview'
+   config.gemini_api_key = ENV['GEMINI_API_KEY']
+
+   # Or use OpenAI
+   # config.default_model = 'gpt-5.2'
+   # config.openai_api_key = ENV['OPENAI_API_KEY']
+
+   # Or Anthropic
+   # config.default_model = 'claude-sonnet-4-5'
+   # config.anthropic_api_key = ENV['ANTHROPIC_API_KEY']
+ end
+ ```
+
+ **Usage:**
+
+ ```ruby
+ def parse(response, url:, data: {})
+   data = extract(response) do
+     string :title
+     string :price
+     string :description
+     array :features, of: :string
+   end
+
+   save_to "products.json", data, format: :json
+ end
+ ```
+
+ **Schema DSL:**
+
+ - `string :field_name` — extracts text
+ - `integer :field_name` — extracts integer
+ - `number :field_name` — extracts float/decimal
+ - `array :items do ... end` — extracts list of objects
+ - `array :tags, of: :string` — extracts list of strings
+ - `object do ... end` — nested structure
+ - `description: '...'` — hint for AI about what to look for
+
+ **How it works:**
+
+ 1. On first run, `extract` sends the HTML and your schema to an LLM
+ 2. The LLM returns XPath rules for each field
+ 3. These rules are cached in `SpiderName.json` alongside your spider file
+ 4. All subsequent extractions use cached XPath — fast and free, no more AI calls
+ 5. Each method gets its own prefix in the schema file, so different parse methods can have different schemas
+
+ **Automatic pagination:**
+
+ Include a next page field in your schema:
+
+ ```ruby
+ data = extract(response) do
+   string :next_page_url, description: 'Next page link'
+   array :products do
+     object do
+       string :name
+       string :price
+     end
+   end
+ end
+
+ if data[:next_page_url]
+   request_to :parse, url: absolute_url(data[:next_page_url], base: url)
+ end
+ ```
+
+ When the last page has no "Next" link, the extracted value is `nil` and pagination stops naturally.
+
  ### Skip duplicates

  It's pretty common for websites to have duplicate pages. For example, when an e-commerce site has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:

  ```ruby
  class ProductsSpider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example-shop.com/"]

    def parse(response, url:, data: {})
@@ -842,8 +1038,7 @@ The `run_info` method is available from the `open_spider` and `close_spider` cla

  ```ruby
  class ExampleSpider < Kimurai::Base
-   @name = "example_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def self.close_spider
@@ -895,7 +1090,7 @@ You can also use the additional methods `completed?` or `failed?`

  ```ruby
  class Spider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def self.close_spider
@@ -933,7 +1128,7 @@ Kimurai supports environments. The default is `development`. To provide a custom
  Usage example:
  ```ruby
  class Spider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]

    def self.close_spider
@@ -956,7 +1151,6 @@ Kimurai can process web pages concurrently: `in_parallel(:parse_product, urls, t
  require 'kimurai'

  class AmazonSpider < Kimurai::Base
-   @name = "amazon_spider"
    @engine = :mechanize
    @start_urls = ["https://www.amazon.com/"]

@@ -1068,7 +1262,7 @@ vic@Vics-MacBook-Air single %

  * `data:` – pass custom data like so: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
  * `delay:` – set delay between requests like so: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, the delay (in seconds) will be set randomly for each request: `rand (2..5) # => 3`
- * `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :selenium_chrome)`
+ * `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :chrome)`
  * `config:` – set custom [config](#spider-config) options

  ### Active Support included
@@ -1170,7 +1364,7 @@ Kimurai.configure do |config|

  # Custom time zone (for logs):
  # config.time_zone = "UTC"
- # config.time_zone = "Europe/Moscow"
+ # config.time_zone = "Europe/Berlin"

  # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
  # config.selenium_chrome_path = "/usr/bin/chromium-browser"
@@ -1286,7 +1480,7 @@ class Spider < Kimurai::Base
    USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
    PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]

-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://example.com/"]
    @config = {
      headers: { "custom_header" => "custom_value" },
@@ -1328,7 +1522,7 @@ end
      # Custom User Agent – string or lambda
      #
      # Use lambda if you want to rotate user agents before each run:
-     # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
+     # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
      #
      # Works for all engines
      user_agent: "Mozilla/5.0 Firefox/61.0",
@@ -1340,10 +1534,10 @@ end
      cookies: [],

      # Proxy – string or lambda. Format for a proxy string: "ip:port:protocol:user:password"
-     # `protocol` can be http or socks5. User and password are optional.
+     # `protocol` can be http or socks5. User and password are optional.
      #
      # Use lambda if you want to rotate proxies before each run:
-     # proxy: -> { ARRAY_OF_PROXIES.sample }
+     # proxy: -> { ARRAY_OF_PROXIES.sample }
      #
      # Works for all engines, but keep in mind that Selenium drivers don't support proxies
      # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http).
@@ -1387,10 +1581,10 @@ end
      # and if the url already exists in this scope, the request will be skipped.
      #
      # You can configure this setting by providing additional options as hash:
-     # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
-     # `scope:` – use a custom scope other than `:requests_urls`
-     # `check_only:` – if true, the url will not be added to the scope
-     #
+     # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
+     # `scope:` – use a custom scope other than `:requests_urls`
+     # `check_only:` – if true, the url will not be added to the scope
+     #
      # Works for all drivers
      skip_duplicate_requests: true,

@@ -1421,8 +1615,8 @@ end
      # Handle page encoding while parsing html response using Nokogiri
      #
      # There are two ways to use this option:
-     # encoding: :auto # auto-detect from <meta http-equiv="Content-Type"> or <meta charset> tags
-     # encoding: "GB2312" # set encoding manually
+     # encoding: :auto # auto-detect from <meta http-equiv="Content-Type"> or <meta charset> tags
+     # encoding: "GB2312" # set encoding manually
      #
      # This option is not set by default
      encoding: nil,
@@ -1649,7 +1843,7 @@ end
  spiders/application_spider.rb
  ```ruby
  class ApplicationSpider < Kimurai::Base
-   @engine = :selenium_chrome
+   @engine = :chrome

    # Define pipelines (by order) for all spiders:
    @pipelines = [:validator, :saver]
@@ -1726,7 +1920,7 @@ spiders/github_spider.rb
  ```ruby
  class GithubSpider < Kimurai::Base
    @name = "github_spider"
-   @engine = :selenium_chrome
+   @engine = :chrome
    @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
    @config = {
      before_request: { delay: 3..5 }
data/Rakefile CHANGED
@@ -1,10 +1,6 @@
  require 'bundler/gem_tasks'
- require 'rake/testtask'
+ require 'rspec/core/rake_task'

- Rake::TestTask.new(:test) do |t|
-   t.libs << 'test'
-   t.libs << 'lib'
-   t.test_files = FileList['test/**/*_test.rb']
- end
+ RSpec::Core::RakeTask.new(:spec)

- task default: :test
+ task default: :spec
data/kimurai.gemspec CHANGED
@@ -20,7 +20,7 @@ Gem::Specification.new do |spec|
    spec.bindir = 'exe'
    spec.executables = 'kimurai'
    spec.require_paths = ['lib']
-   spec.required_ruby_version = '>= 3.1.0'
+   spec.required_ruby_version = '>= 3.2.0'

    spec.add_dependency 'activesupport'
    spec.add_dependency 'cliver'
@@ -46,4 +46,8 @@ Gem::Specification.new do |spec|

    spec.add_dependency 'pry'
    spec.add_dependency 'rbcat', '~> 1.0'
+   spec.add_dependency 'nukitori'
+
+   spec.add_development_dependency 'rake', '~> 13.0'
+   spec.add_development_dependency 'rspec', '~> 3.13'
  end
@@ -7,10 +7,11 @@ module Kimurai
    attr_reader :format, :path, :position, :append

    def initialize(path, format:, position: true, append: false)
-     raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json jsonlines csv].include?(format)
+     raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json compact_json jsonlines csv].include?(format)

      @path = path
      @format = format
+     @format = :json if format == :pretty_json # :pretty_json is now an alias for :json
      @position = position
      @index = 0
      @append = append
@@ -19,44 +20,57 @@ module Kimurai

    def save(item)
      @mutex.synchronize do
-       @index += 1
-       item[:position] = @index if position
-
-       case format
-       when :json
-         save_to_json(item)
-       when :pretty_json
-         save_to_pretty_json(item)
-       when :jsonlines
-         save_to_jsonlines(item)
-       when :csv
-         save_to_csv(item)
+       if item.is_a?(Array)
+         item.each do |it|
+           @index += 1
+           it[:position] = @index if position
+
+           save_item(it)
+         end
+       else
+         @index += 1
+         item[:position] = @index if position
+
+         save_item(item)
        end
      end
    end

    private

+   def save_item(item)
+     case format
+     when :json
+       save_to_json(item)
+     when :compact_json
+       save_to_compact_json(item)
+     when :jsonlines
+       save_to_jsonlines(item)
+     when :csv
+       save_to_csv(item)
+     end
+   end
+
    def save_to_json(item)
-     data = JSON.generate([item])
+     data = JSON.pretty_generate([item])

      if @index > 1 || append && File.exist?(path)
-       file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
+       file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
        File.open(path, 'w') do |f|
-         f.write(file_content + data.sub(/\A\[/, ''))
+         f.write(file_content + data.sub(/\A\[\n/, ''))
        end
      else
        File.open(path, 'w') { |f| f.write(data) }
      end
    end

-   def save_to_pretty_json(item)
-     data = JSON.pretty_generate([item])
+   def save_to_compact_json(item)
+     data = JSON.generate([item])

      if @index > 1 || append && File.exist?(path)
-       file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
+       file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
        File.open(path, 'w') do |f|
-         f.write(file_content + data.sub(/\A\[\n/, ''))
+         f.write(file_content + data.sub(/\A\[/, ''))
        end
      else
        File.open(path, 'w') { |f| f.write(data) }
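For reference, a small sketch of what the reworked saver means on the spider side (file names and item hashes are illustrative, not taken from the diff):

```ruby
# :json is now pretty-printed (JSON.pretty_generate); :pretty_json is kept as an alias for it.
save_to "products.json", { name: "Phone", price: "$100" }, format: :json

# The previous compact, single-line output is still available as :compact_json.
save_to "products_compact.json", { name: "Phone", price: "$100" }, format: :compact_json

# Arrays are unpacked and written item by item, each receiving its own sequential :position key.
save_to "batch.json", [{ name: "Phone" }, { name: "Laptop" }], format: :json
```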
data/lib/kimurai/base.rb CHANGED
@@ -64,12 +64,12 @@ module Kimurai

    ###

-   @engine = :mechanize
+   @engine = :selenium_chrome
    @pipelines = []
    @config = {}

    def self.name
-     @name
+     @name || to_s.underscore
    end

    def self.engine
@@ -84,11 +84,22 @@ module Kimurai
      @start_urls
    end

+   def self.delay
+     @delay ||= superclass.respond_to?(:delay) ? superclass.delay : nil
+   end
+
    def self.config
-     if superclass.equal?(::Object)
-       @config
+     base_config = if superclass.equal?(::Object)
+       @config
+     else
+       superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+     end
+
+     # Merge @delay shortcut into config if set
+     if delay
+       base_config.deep_merge_excl({ before_request: { delay: delay } }, DMERGE_EXCLUDE)
      else
-       superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+       base_config
      end
    end

@@ -1,5 +1,15 @@
  module Kimurai
    module BaseHelper
+     def extract(response, model: nil, &block)
+       caller_info = caller_locations(1, 1).first
+       method_name = caller_info.base_label
+       spider_dir = File.dirname(caller_info.path)
+       schema_path = File.join(spider_dir, "#{self.class.name}.json")
+
+       data = Nukitori(response, schema_path, prefix: method_name, model:, &block)
+       data.deep_symbolize_keys
+     end
+
      private

      def absolute_url(url, base:)
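Given the helper above, the schema cache location and prefix follow from the call site. A sketch under assumed paths (the directory and class name are hypothetical):

```ruby
# /home/user/spiders/google_spider.rb
class GoogleSpider < Kimurai::Base
  def parse(response, url:, data: {})
    data = extract(response) { string :title }
    # Inside extract:
    #   caller method   => "parse"          (used as the prefix in the schema file)
    #   self.class.name => "google_spider"  (auto-derived when @name is not set)
    #   schema_path     => "/home/user/spiders/google_spider.json"
    data
  end
end
```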
@@ -1,6 +1,13 @@
  module Kimurai
    module BrowserBuilder
+     ENGINE_ALIASES = {
+       chrome: :selenium_chrome,
+       firefox: :selenium_firefox
+     }.freeze
+
      def self.build(engine, config = {}, spider:)
+       engine = ENGINE_ALIASES.fetch(engine, engine)
+
        begin
          require "kimurai/browser_builder/#{engine}_builder"
        rescue LoadError
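The alias lookup is a plain hash fetch, so unknown engine symbols pass through unchanged. A quick sketch:

```ruby
aliases = Kimurai::BrowserBuilder::ENGINE_ALIASES

aliases.fetch(:chrome, :chrome)       #=> :selenium_chrome
aliases.fetch(:firefox, :firefox)     #=> :selenium_firefox
aliases.fetch(:mechanize, :mechanize) #=> :mechanize (no alias defined, used as-is)
```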
@@ -10,7 +10,6 @@ module Capybara
      alias original_visit visit
      def visit(visit_uri, delay: config.before_request[:delay], skip_request_options: false, max_retries: 3)
        if spider
-         process_delay(delay) if delay
          retries = 0
          sleep_interval = 0

@@ -20,6 +19,9 @@ module Capybara
          spider.class.update(:visits, :requests) if spider.with_info

          original_visit(visit_uri)
+
+         logger.info "Browser: finished get request to: #{visit_uri}"
+         process_delay(delay) if delay
        rescue StandardError => e
          if match_error?(e, type: :to_skip)
            logger.error "Browser: skip request error: #{e.inspect}, url: #{visit_uri}"
@@ -40,7 +42,7 @@ module Capybara
            raise e
          end
        else
-         driver.responses += 1 and logger.info "Browser: finished get request to: #{visit_uri}"
+         driver.responses += 1
          spider.class.update(:visits, :responses) if spider.with_info
          driver.visited = true unless driver.visited
          true
@@ -170,7 +172,7 @@ module Capybara

      def process_delay(delay)
        interval = (delay.instance_of?(Range) ? rand(delay) : delay)
-       logger.debug "Browser: sleep #{interval.round(2)} #{'second'.pluralize(interval)} before request..."
+       logger.debug "Browser: delay #{interval.round(2)} #{'second'.pluralize(interval)}..."
        sleep interval
      end

@@ -35,7 +35,7 @@ module Kimurai

        return if in_project

-       insert_into_file spider_path, " @engine = :mechanize\n", after: "@name = \"#{spider_name}\"\n"
+       insert_into_file spider_path, " @engine = :chrome\n", after: "@name = \"#{spider_name}\"\n"
        prepend_to_file spider_path, "require 'kimurai'\n\n"
        append_to_file spider_path, "\n#{spider_class}.crawl!"
      end
@@ -1,3 +1,3 @@
  module Kimurai
-   VERSION = '2.0.1'.freeze
+   VERSION = '2.2.0'.freeze
  end
data/lib/kimurai.rb CHANGED
@@ -6,6 +6,7 @@ require 'uri'
  require 'active_support'
  require 'active_support/core_ext'
  require 'rbcat'
+ require 'nukitori'

  require_relative 'kimurai/version'

@@ -20,6 +21,33 @@ require_relative 'kimurai/pipeline'
  require_relative 'kimurai/base'

  module Kimurai
+   # Settings that will be forwarded to Nukitori configuration
+   NUKITORI_SETTINGS = %i[
+     openai_api_key
+     anthropic_api_key
+     gemini_api_key
+     vertexai_project_id
+     vertexai_location
+     deepseek_api_key
+     mistral_api_key
+     perplexity_api_key
+     openrouter_api_key
+     gpustack_api_key
+     openai_api_base
+     gemini_api_base
+     ollama_api_base
+     gpustack_api_base
+     openai_organization_id
+     openai_project_id
+     openai_use_system_role
+     bedrock_api_key
+     bedrock_secret_key
+     bedrock_region
+     bedrock_session_token
+     default_model
+     model_registry_file
+   ].freeze
+
    class << self
      def configuration
        @configuration ||= OpenStruct.new
@@ -27,6 +55,22 @@ module Kimurai

      def configure
        yield(configuration)
+       apply_nukitori_configuration
+     end
+
+     def apply_nukitori_configuration
+       nukitori_settings = NUKITORI_SETTINGS.filter_map do |setting|
+         value = configuration[setting]
+         [setting, value] if value
+       end.to_h
+
+       return if nukitori_settings.empty?
+
+       Nukitori.configure do |config|
+         nukitori_settings.each do |setting, value|
+           config.public_send("#{setting}=", value)
+         end
+       end
      end

      def env
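In practice a single `Kimurai.configure` block now also configures Nukitori: any setting listed in `NUKITORI_SETTINGS` is forwarded, while everything else stays on Kimurai's own configuration. A sketch (the model name and environment variable are placeholders taken from the README example):

```ruby
require 'kimurai'

Kimurai.configure do |config|
  # Kimurai-only setting, kept on Kimurai's configuration object:
  config.time_zone = "UTC"

  # Settings from NUKITORI_SETTINGS, forwarded to Nukitori.configure after the block runs:
  config.default_model  = "gemini-3-flash-preview"
  config.gemini_api_key = ENV["GEMINI_API_KEY"]
end
```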
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: kimurai
  version: !ruby/object:Gem::Version
-   version: 2.0.1
+   version: 2.2.0
  platform: ruby
  authors:
  - Victor Afanasev
@@ -261,6 +261,48 @@ dependencies:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.0'
+ - !ruby/object:Gem::Dependency
+   name: nukitori
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '13.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '13.0'
+ - !ruby/object:Gem::Dependency
+   name: rspec
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.13'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.13'
  email:
  - vicfreefly@gmail.com
  executables:
@@ -269,6 +311,7 @@ extensions: []
  extra_rdoc_files: []
  files:
  - ".gitignore"
+ - ".rspec"
  - ".rubocop.yml"
  - CHANGELOG.md
  - Gemfile
@@ -329,7 +372,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
-       version: 3.1.0
+       version: 3.2.0
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="