kimurai 2.0.1 → 2.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/.rspec +3 -0
- data/.rubocop.yml +2 -0
- data/CHANGELOG.md +19 -0
- data/LICENSE.txt +1 -1
- data/README.md +242 -48
- data/Rakefile +3 -7
- data/kimurai.gemspec +5 -1
- data/lib/kimurai/base/saver.rb +34 -20
- data/lib/kimurai/base.rb +16 -5
- data/lib/kimurai/base_helper.rb +10 -0
- data/lib/kimurai/browser_builder.rb +7 -0
- data/lib/kimurai/capybara_ext/session.rb +5 -3
- data/lib/kimurai/cli/generator.rb +1 -1
- data/lib/kimurai/version.rb +1 -1
- data/lib/kimurai.rb +44 -0
- metadata +45 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: eeeca8fc2ae390e6c557f435478ee4ea8273920e3ab7c590800c338574f364d0
+  data.tar.gz: d7d2d799a97c51c0e1837080c249316651beedb58b312df3df9bc69fefabac31
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 80f449d68068d238da99dbbb83b710e8071e7fd4ded76bc17f096fac1446a307e841d87300c461299c6c01fe9abaecbcc5c91d9a2fa9878d5af5d1cac888349a
+  data.tar.gz: 793f3301353e135484ad0283973cd4c3c181fa43a9510339f06a61152a557f758f1d9e32a2fec12954ce5b9e3a409c554efea13485f7d66c67ba8654e4f6baeb
data/.gitignore
CHANGED
data/.rspec
ADDED
data/.rubocop.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,4 +1,23 @@
 # CHANGELOG
+
+## 2.2.0
+### New
+* Default engine now is `:chrome` (was `:mechanize`)
+
+## 2.1.0
+### New
+* Min. required Ruby version is 3.2.0
+* **AI-powered data extraction with `extract` method** — Powered by [Nukitori](https://github.com/vifreefly/nukitori). Describe the data structure you want and let AI generate XPath selectors automatically. Selectors are cached for reuse, so AI is only called once per page type
+* **Configure Nukitori via Kimurai** — Set LLM provider settings (OpenAI, Anthropic, Gemini, etc.) directly in `Kimurai.configure` block
+* **Engine aliases** — Use shorter engine names: `:chrome` (alias for `:selenium_chrome`), `:firefox` (alias for `:selenium_firefox`)
+* **Top-level `@delay` option** — Set request delay directly as `@delay = 2..5` instead of nested `@config = { before_request: { delay: 2..5 } }`
+* **Auto spider name** — If `@name` is not provided, it's automatically derived from the class name
+* **Save array of items** — `save_to` helper now accepts an array of items to save at once
+
+### Improvements
+* `save_to` helper now uses pretty JSON by default for `:json` format (use `format: :compact_json` for compact output)
+* Request delay is now applied before the response is passed to the callback
+
 ## 2.0.1
 ### Fixes
 * Remove xpath as default Capybara selector type (fixes https://github.com/vifreefly/kimuraframework/issues/28)
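Taken together, a minimal spider using several of these additions might look like the sketch below (illustrative only; the class, target URL, and selectors are hypothetical, not taken from the package):

```ruby
require 'kimurai'

# No @name (derived from the class name => "quotes_spider"),
# no @engine (the default is now :chrome), and a top-level @delay.
class QuotesSpider < Kimurai::Base
  @start_urls = ['https://quotes.toscrape.com/']
  @delay = 2..5

  def parse(response, url:, data: {})
    items = response.xpath("//div[@class='quote']").map do |quote|
      { text: quote.xpath(".//span[@class='text']").text.strip }
    end

    # save_to now accepts a whole array and writes pretty JSON by default
    save_to 'quotes.json', items, format: :json
  end
end

QuotesSpider.crawl!
```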
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
@@ -1,20 +1,75 @@
-
+<div align="center">
+  <a href="https://github.com/vifreefly/kimuraframework">
+    <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
+  </a>
 
-Kimurai
+  <h1>Kimurai: AI-First Web Scraping Framework for Ruby</h1>
+</div>
 
-
+Write web scrapers in Ruby using a clean, AI-assisted DSL. Kimurai uses AI to figure out where the data lives, then caches the selectors and scrapes with pure Ruby. Get the intelligence of an LLM without the per-request latency or token costs:
+
+```ruby
+# google_spider.rb
+require 'kimurai'
+
+class GoogleSpider < Kimurai::Base
+  @start_urls = ['https://www.google.com/search?q=web+scraping+ai']
+  @delay = 1
+
+  def parse(response, url:, data: {})
+    results = extract(response) do
+      array :organic_results do
+        object do
+          string :title
+          string :snippet
+          string :url
+        end
+      end
+
+      array :sponsored_results do
+        object do
+          string :title
+          string :snippet
+          string :url
+        end
+      end
+
+      array :people_also_search_for, of: :string
+
+      string :next_page_link
+      number :current_page_number
+    end
+
+    save_to 'google_results.json', results, format: :json
+
+    if results[:next_page_link] && results[:current_page_number] < 3
+      request_to :parse, url: absolute_url(results[:next_page_link], base: url)
+    end
+  end
+end
+
+GoogleSpider.crawl!
+```
+
+**How it works:**
+1. On the first request, `extract` sends the HTML + your schema to an LLM
+2. The LLM generates XPath selectors and caches them in `google_spider.json`
+3. **All subsequent requests use cached XPath — zero AI calls, pure fast Ruby extraction**
+4. Supports OpenAI, Anthropic, Gemini, or local LLMs via [Nukitori](https://github.com/vifreefly/nukitori)
+
+
+## Traditional Mode
+
+Prefer writing your own selectors? Kimurai works great as a traditional scraper too — with headless antidetect Chromium, Firefox, or simple HTTP requests:
 
 ```ruby
 # github_spider.rb
 require 'kimurai'
 
 class GithubSpider < Kimurai::Base
-  @name = "github_spider"
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
-  @config = {
-    before_request: { delay: 3..5 }
-  }
+  @delay = 3..5
 
   def parse(response, url:, data: {})
    response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
@@ -149,8 +204,7 @@ Okay, that was easy. How about JavaScript rendered websites with dynamic HTML?
 require 'kimurai'
 
 class InfiniteScrollSpider < Kimurai::Base
-  @name = "infinite_scroll_spider"
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://infinite-scroll.com/demo/full-page/"]
 
   def parse(response, url:, data: {})
@@ -194,14 +248,76 @@ I, [2025-12-16 12:47:15] INFO -- infinite_scroll_spider: > Continue scrolling,
 I, [2025-12-16 12:47:17] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
 I, [2025-12-16 12:47:19] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
 I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > Pagination is done
-I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page:
+I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page:
+
+1a - Infinite Scroll full page demo;
+1b - RGB Schemes logo in Computer Arts;
+2a - RGB Schemes logo;
+2b - Masonry gets horizontalOrder;
+2c - Every vector 2016;
+3a - Logo Pizza delivered;
+3b - Some CodePens;
+3c - 365daysofmusic.com;
+3d - Holograms;
+4a - Huebee: 1-click color picker;
+4b - Word is Flickity is good;
+Flickity v2 released: groupCells, adaptiveHeight, parallax;
+New tech gets chatter; Isotope v3 released: stagger in, IE8 out;
+Packery v2 released
+
 I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
 I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spider_name: "infinite_scroll_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 12:47:05.372053 +0300, stop_time: 2025-12-16 12:47:21.505078 +0300, running_time: "16s", visits: {requests: 1, responses: 1}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
 ```
-</details
+</details>
+
+## AI Extraction — Configuration
+
+Configure your LLM provider to start using AI extraction. The `extract` method is powered by [Nukitori](https://github.com/vifreefly/nukitori):
+
+```ruby
+# github_spider_ai.rb
+require 'kimurai'
+
+Kimurai.configure do |config|
+  config.default_model = "gemini-3-flash-preview" # OpenAI, Anthropic, Gemini, local LLMs, etc.
+  config.gemini_api_key = ENV["GEMINI_API_KEY"]
+end
+
+class GithubSpider < Kimurai::Base
+  @engine = :chrome
+  @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
+  @delay = 3..5
 
+  def parse(response, url:, data: {})
+    data = extract(response) do
+      string :next_page_url, description: "Next page path url"
+      array :repos do
+        object do
+          string :name
+          string :url
+          string :description
+          string :stars
+          string :language
+          array :tags, of: :string
+        end
+      end
+    end
+
+    save_to "results.json", data[:repos], format: :json
+
+    if data[:next_page_url]
+      request_to :parse, url: absolute_url(data[:next_page_url], base: url)
+    end
+  end
+end
+
+GithubSpider.crawl!
+```
+
+Selectors are cached in `github_spider_ai.json` after the first AI call — all subsequent requests use pure Ruby extraction.
 
 ## Features
+* **AI-powered data extraction**: Use [Nukitori](https://github.com/vifreefly/nukitori) to extract structured data without writing XPath/CSS selectors — just describe what you want, and AI figures out how to extract it
 * Scrape JavaScript rendered websites out of the box
 * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
 * Write spider code once, and use it with any supported engine later
@@ -218,6 +334,8 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
 
 ## Table of Contents
 * [Kimurai](#kimurai)
+* [Traditional Mode](#traditional-mode)
+* [AI Extraction — Configuration](#ai-extraction--configuration)
 * [Features](#features)
 * [Table of Contents](#table-of-contents)
 * [Installation](#installation)
@@ -229,6 +347,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
   * [browser object](#browser-object)
   * [request_to method](#request_to-method)
   * [save_to helper](#save_to-helper)
+  * [AI-powered extraction with extract](#ai-powered-extraction-with-extract)
   * [Skip duplicates](#skip-duplicates)
     * [Automatically skip all duplicate request urls](#automatically-skip-all-duplicate-request-urls)
   * [Storage object](#storage-object)
@@ -262,7 +381,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
 
 
 ## Installation
-Kimurai requires Ruby version `>= 3.
+Kimurai requires Ruby version `>= 3.2.0`. Officially supported platforms: `Linux` and `macOS`.
 
 1) If your system doesn't have the appropriate Ruby version, install it:
 
@@ -312,7 +431,7 @@ gem update --system
 
 ```bash
 # Install basic tools
-sudo apt install -q -y unzip wget tar openssl
+sudo apt install -q -y unzip wget tar openssl lsof
 
 # Install xvfb (for virtual_display headless mode, in addition to native)
 sudo apt install -q -y xvfb
@@ -409,8 +528,8 @@ CLI arguments:
 Kimurai has support for the following engines and can mostly switch between them without the need to rewrite any code:
 
 * `:mechanize` – [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is it. It can only parse the original HTML code of a page. Because of it, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. It's recommended to use mechanize when possible; if the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize is trying to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
-* `:selenium_chrome` – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
-* `:selenium_firefox` – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.
+* `:chrome` (`:selenium_chrome` alias) – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
+* `:firefox` (`:selenium_firefox` alias) – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.
 
 **Tip:** prepend a `HEADLESS=false` environment variable on the command line (i.e. `$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.
 
@@ -423,7 +542,7 @@ require 'kimurai'
 
 class SimpleSpider < Kimurai::Base
   @name = "simple_spider"
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def parse(response, url:, data: {})
@@ -434,8 +553,8 @@ SimpleSpider.crawl!
 ```
 
 Where:
-* `@name` – a name for the spider
-* `@engine` – engine to use for the spider
+* `@name` – a name for the spider (optional)
+* `@engine` – engine to use for the spider (optional, default is `:selenium_chrome`)
 * `@start_urls` – array of urls to process one-by-one inside the `parse` method
 * The `parse` method is the entry point, and should always be present in a spider class
 
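Since both attributes are now optional, the smallest working spider drops them entirely. A sketch (the class and URL are hypothetical):

```ruby
require 'kimurai'

class MinimalSpider < Kimurai::Base
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    puts response.title
  end
end

MinimalSpider.name    #=> "minimal_spider" (derived from the class name)
MinimalSpider.engine  #=> :selenium_chrome (the default)
```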
@@ -458,7 +577,7 @@ Imagine that there is a product page that doesn't contain a category name. The c
 
 ```ruby
 class ProductsSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example-shop.com/example-product-category"]
 
   def parse(response, url:, data: {})
@@ -497,8 +616,7 @@ But, if you need to interact with a page (like filling form fields, clicking ele
 
 ```ruby
 class GoogleSpider < Kimurai::Base
-  @name = "google_spider"
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://www.google.com/"]
 
   def parse(response, url:, data: {})
@@ -529,7 +647,7 @@ For making requests to a particular method, there is `request_to`. It requires a
 
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def parse(response, url:, data: {})
@@ -565,7 +683,7 @@ The `request_to` helper method makes things simpler. We could also do something
 
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def parse(response, url:, data: {})
@@ -588,7 +706,7 @@ Sometimes all you need is to simply save scraped data to a file. You can use the
 
 ```ruby
 class ProductsSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example-shop.com/"]
 
   # ...
@@ -607,12 +725,12 @@ end
 ```
 
 Supported formats:
-* `:json` – JSON
-* `:pretty_json` – Pretty JSON
+* `:json` – JSON (`JSON.pretty_generate`)
+* `:compact_json` – JSON
 * `:jsonlines` – [JSON Lines](http://jsonlines.org/)
 * `:csv` – CSV
 
-Note: `save_to` requires the data (item) to save to be a Hash.
+Note: `save_to` requires the data (item) to save to be a Hash or Array of Hashes.
 
 By default, `save_to` will add a position key to an item hash. You can disable it like so: `save_to "scraped_products.json", item, format: :json, position: false`
 
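For reference, the difference between the two JSON formats comes down to `JSON.generate` versus `JSON.pretty_generate` (see `saver.rb` further down); a sketch:

```ruby
require 'json'

item = { name: 'example' }
JSON.generate([item])             # :compact_json => [{"name":"example"}]
puts JSON.pretty_generate([item]) # :json =>
# [
#   {
#     "name": "example"
#   }
# ]
```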
@@ -622,13 +740,91 @@ While the spider is running, each new item will be appended to the output file.
 
 > If you don't want the file to be cleared before each run, pass `append: true` like so: `save_to "scraped_products.json", item, format: :json, append: true`
 
+### AI-powered extraction with `extract`
+
+Writing and maintaining XPath/CSS selectors is tedious and error-prone. The `extract` method uses AI to generate selectors automatically — you just describe the data structure you want.
+
+**Configuration:**
+
+First, configure an LLM provider in your application:
+
+```ruby
+Kimurai.configure do |config|
+  config.default_model = 'gemini-3-flash-preview'
+  config.gemini_api_key = ENV['GEMINI_API_KEY']
+
+  # Or use OpenAI
+  # config.default_model = 'gpt-5.2'
+  # config.openai_api_key = ENV['OPENAI_API_KEY']
+
+  # Or Anthropic
+  # config.default_model = 'claude-sonnet-4-5'
+  # config.anthropic_api_key = ENV['ANTHROPIC_API_KEY']
+end
+```
+
+**Usage:**
+
+```ruby
+def parse(response, url:, data: {})
+  data = extract(response) do
+    string :title
+    string :price
+    string :description
+    array :features, of: :string
+  end
+
+  save_to "products.json", data, format: :json
+end
+```
+
+**Schema DSL:**
+
+- `string :field_name` — extracts text
+- `integer :field_name` — extracts integer
+- `number :field_name` — extracts float/decimal
+- `array :items do ... end` — extracts list of objects
+- `array :tags, of: :string` — extracts list of strings
+- `object do ... end` — nested structure
+- `description: '...'` — hint for AI about what to look for
+
+**How it works:**
+
+1. On first run, `extract` sends the HTML and your schema to an LLM
+2. The LLM returns XPath rules for each field
+3. These rules are cached in `SpiderName.json` alongside your spider file
+4. All subsequent extractions use cached XPath — fast and free, no more AI calls
+5. Each method gets its own prefix in the schema file, so different parse methods can have different schemas
+
+**Automatic pagination:**
+
+Include a next page field in your schema:
+
+```ruby
+data = extract(response) do
+  string :next_page_url, description: 'Next page link'
+  array :products do
+    object do
+      string :name
+      string :price
+    end
+  end
+end
+
+if data[:next_page_url]
+  request_to :parse, url: absolute_url(data[:next_page_url], base: url)
+end
+```
+
+When the last page has no "Next" link, the extracted value is `nil` and pagination stops naturally.
+
 ### Skip duplicates
 
 It's pretty common for websites to have duplicate pages. For example, when an e-commerce site has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:
 
 ```ruby
 class ProductsSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example-shop.com/"]
 
   def parse(response, url:, data: {})
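Note that the `extract` helper (see `base_helper.rb` further down) also accepts an optional `model:` keyword, so a single call can override the globally configured `default_model`; a sketch (the model id is hypothetical):

```ruby
data = extract(response, model: 'gpt-4o-mini') do # hypothetical model id
  string :title
end
```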
@@ -842,8 +1038,7 @@ The `run_info` method is available from the `open_spider` and `close_spider` class methods
 
 ```ruby
 class ExampleSpider < Kimurai::Base
-  @name = "example_spider"
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def self.close_spider
@@ -895,7 +1090,7 @@ You can also use the additional methods `completed?` or `failed?`
 
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def self.close_spider
@@ -933,7 +1128,7 @@ Kimurai supports environments. The default is `development`. To provide a custom
 Usage example:
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def self.close_spider
@@ -956,7 +1151,6 @@ Kimurai can process web pages concurrently: `in_parallel(:parse_product, urls, t
 require 'kimurai'
 
 class AmazonSpider < Kimurai::Base
-  @name = "amazon_spider"
   @engine = :mechanize
   @start_urls = ["https://www.amazon.com/"]
 
@@ -1068,7 +1262,7 @@ vic@Vics-MacBook-Air single %
 
 * `data:` – pass custom data like so: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
 * `delay:` – set delay between requests like so: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, the delay (in seconds) will be set randomly for each request: `rand (2..5) # => 3`
-* `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :selenium_chrome)`
+* `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :chrome)`
 * `config:` – set custom [config](#spider-config) options
 
 ### Active Support included
@@ -1170,7 +1364,7 @@ Kimurai.configure do |config|
 
   # Custom time zone (for logs):
   # config.time_zone = "UTC"
-  # config.time_zone = "Europe/Moscow"
+  # config.time_zone = "Europe/Berlin"
 
   # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
   # config.selenium_chrome_path = "/usr/bin/chromium-browser"
@@ -1286,7 +1480,7 @@ class Spider < Kimurai::Base
   USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
   PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]
 
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
   @config = {
     headers: { "custom_header" => "custom_value" },
@@ -1328,7 +1522,7 @@ end
     # Custom User Agent – string or lambda
     #
     # Use lambda if you want to rotate user agents before each run:
-    #
+    # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
     #
     # Works for all engines
     user_agent: "Mozilla/5.0 Firefox/61.0",
@@ -1340,10 +1534,10 @@ end
     cookies: [],
 
     # Proxy – string or lambda. Format for a proxy string: "ip:port:protocol:user:password"
-    #
+    # `protocol` can be http or socks5. User and password are optional.
     #
     # Use lambda if you want to rotate proxies before each run:
-    #
+    # proxy: -> { ARRAY_OF_PROXIES.sample }
     #
     # Works for all engines, but keep in mind that Selenium drivers don't support proxies
     # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http).
@@ -1387,10 +1581,10 @@ end
     # and if the url already exists in this scope, the request will be skipped.
     #
     # You can configure this setting by providing additional options as hash:
-    #
-    #
-    #
-    #
+    # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
+    # `scope:` – use a custom scope other than `:requests_urls`
+    # `check_only:` – if true, the url will not be added to the scope
+    #
     # Works for all drivers
     skip_duplicate_requests: true,
 
@@ -1421,8 +1615,8 @@ end
     # Handle page encoding while parsing html response using Nokogiri
     #
     # There are two ways to use this option:
-    #
-    #
+    # encoding: :auto # auto-detect from <meta http-equiv="Content-Type"> or <meta charset> tags
+    # encoding: "GB2312" # set encoding manually
     #
     # This option is not set by default
     encoding: nil,
@@ -1649,7 +1843,7 @@ end
 spiders/application_spider.rb
 ```ruby
 class ApplicationSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
 
   # Define pipelines (by order) for all spiders:
   @pipelines = [:validator, :saver]
@@ -1726,7 +1920,7 @@ spiders/github_spider.rb
 ```ruby
 class GithubSpider < Kimurai::Base
   @name = "github_spider"
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
   @config = {
     before_request: { delay: 3..5 }
data/Rakefile
CHANGED
@@ -1,10 +1,6 @@
 require 'bundler/gem_tasks'
-require 'rake/testtask'
+require 'rspec/core/rake_task'
 
-Rake::TestTask.new(:test) do |t|
-  t.libs << 'test'
-  t.libs << 'lib'
-  t.test_files = FileList['test/**/*_test.rb']
-end
+RSpec::Core::RakeTask.new(:spec)
 
-task default: :test
+task default: :spec
data/kimurai.gemspec
CHANGED
@@ -20,7 +20,7 @@ Gem::Specification.new do |spec|
   spec.bindir = 'exe'
   spec.executables = 'kimurai'
   spec.require_paths = ['lib']
-  spec.required_ruby_version = '>= 3.
+  spec.required_ruby_version = '>= 3.2.0'
 
   spec.add_dependency 'activesupport'
   spec.add_dependency 'cliver'
@@ -46,4 +46,8 @@ Gem::Specification.new do |spec|
 
   spec.add_dependency 'pry'
   spec.add_dependency 'rbcat', '~> 1.0'
+  spec.add_dependency 'nukitori'
+
+  spec.add_development_dependency 'rake', '~> 13.0'
+  spec.add_development_dependency 'rspec', '~> 3.13'
 end
data/lib/kimurai/base/saver.rb
CHANGED
@@ -7,10 +7,11 @@ module Kimurai
     attr_reader :format, :path, :position, :append
 
     def initialize(path, format:, position: true, append: false)
-      raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json jsonlines csv].include?(format)
+      raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json compact_json jsonlines csv].include?(format)
 
      @path = path
      @format = format
+      @format = :json if format == :pretty_json # :pretty_json is now an alias for :json
      @position = position
      @index = 0
      @append = append
@@ -19,44 +20,57 @@ module Kimurai
 
    def save(item)
      @mutex.synchronize do
-        @index += 1
-        item[:position] = @index if position
-
-        case format
-        when :json
-          save_to_json(item)
-        when :pretty_json
-          save_to_pretty_json(item)
-        when :jsonlines
-          save_to_jsonlines(item)
-        when :csv
-          save_to_csv(item)
+        if item.is_a?(Array)
+          item.each do |it|
+            @index += 1
+            it[:position] = @index if position
+
+            save_item(it)
+          end
+        else
+          @index += 1
+          item[:position] = @index if position
+
+          save_item(item)
        end
      end
    end
 
    private
 
+    def save_item(item)
+      case format
+      when :json
+        save_to_json(item)
+      when :compact_json
+        save_to_compact_json(item)
+      when :jsonlines
+        save_to_jsonlines(item)
+      when :csv
+        save_to_csv(item)
+      end
+    end
+
    def save_to_json(item)
-      data = JSON.generate([item])
+      data = JSON.pretty_generate([item])
 
      if @index > 1 || append && File.exist?(path)
-        file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
+        file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
        File.open(path, 'w') do |f|
-          f.write(file_content + data.sub(/\A\[/, ''))
+          f.write(file_content + data.sub(/\A\[\n/, ''))
        end
      else
        File.open(path, 'w') { |f| f.write(data) }
      end
    end
 
-    def save_to_pretty_json(item)
-      data = JSON.pretty_generate([item])
+    def save_to_compact_json(item)
+      data = JSON.generate([item])
 
      if @index > 1 || append && File.exist?(path)
-        file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
+        file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
        File.open(path, 'w') do |f|
-          f.write(file_content + data.sub(/\A\[\n/, ''))
+          f.write(file_content + data.sub(/\A\[/, ''))
        end
      else
        File.open(path, 'w') { |f| f.write(data) }
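A sketch of what the reworked saver produces, assuming the class is reachable as `Kimurai::Base::Saver` (the constant path and file name here are assumptions):

```ruby
require 'kimurai'

# Arrays are now accepted; each element gets its own position index.
saver = Kimurai::Base::Saver.new('items.json', format: :json)
saver.save([{ name: 'one' }, { name: 'two' }])

puts File.read('items.json')
# [
#   {
#     "name": "one",
#     "position": 1
#   },
#   {
#     "name": "two",
#     "position": 2
#   }
# ]
```

On append, the closing `}\n]` of the existing file is rewritten to `},\n` and the new item is written without its opening `[\n`, which keeps the file one valid pretty-printed JSON array.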
data/lib/kimurai/base.rb
CHANGED
@@ -64,12 +64,12 @@ module Kimurai
 
     ###
 
-    @engine = :mechanize
+    @engine = :selenium_chrome
     @pipelines = []
     @config = {}
 
     def self.name
-      @name
+      @name || to_s.underscore
     end
 
     def self.engine
@@ -84,11 +84,22 @@ module Kimurai
       @start_urls
     end
 
+    def self.delay
+      @delay ||= superclass.respond_to?(:delay) ? superclass.delay : nil
+    end
+
     def self.config
-      if superclass.equal?(::Object)
-        @config
+      base_config = if superclass.equal?(::Object)
+        @config
+      else
+        superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+      end
+
+      # Merge @delay shortcut into config if set
+      if delay
+        base_config.deep_merge_excl({ before_request: { delay: delay } }, DMERGE_EXCLUDE)
       else
-        superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+        base_config
       end
     end
 
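The effect of the new class-level accessors, as a sketch (the spider class is hypothetical):

```ruby
class ExampleSpider < Kimurai::Base
  @delay = 2..5
end

ExampleSpider.name   #=> "example_spider" (@name falls back to to_s.underscore)
ExampleSpider.config #=> merges { before_request: { delay: 2..5 } } into the inherited config
```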
data/lib/kimurai/base_helper.rb
CHANGED
@@ -1,5 +1,15 @@
 module Kimurai
   module BaseHelper
+    def extract(response, model: nil, &block)
+      caller_info = caller_locations(1, 1).first
+      method_name = caller_info.base_label
+      spider_dir = File.dirname(caller_info.path)
+      schema_path = File.join(spider_dir, "#{self.class.name}.json")
+
+      data = Nukitori(response, schema_path, prefix: method_name, model:, &block)
+      data.deep_symbolize_keys
+    end
+
     private
 
     def absolute_url(url, base:)
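To make the caching concrete: for a hypothetical spider defined in `/app/spiders/shop_spider.rb`, a call to `extract` inside `parse` resolves as sketched below (path, class, and fields are assumptions for illustration):

```ruby
# /app/spiders/shop_spider.rb (hypothetical path)
require 'kimurai'

class ShopSpider < Kimurai::Base
  @start_urls = ['https://example-shop.com/']

  def parse(response, url:, data: {})
    # caller_locations points here, so:
    #   method_name = "parse"                         (the schema prefix)
    #   schema_path = "/app/spiders/shop_spider.json" (self.class.name => "shop_spider")
    item = extract(response) do
      string :title
    end

    save_to 'items.json', item, format: :json
  end
end
```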
data/lib/kimurai/browser_builder.rb
CHANGED

@@ -1,6 +1,13 @@
 module Kimurai
   module BrowserBuilder
+    ENGINE_ALIASES = {
+      chrome: :selenium_chrome,
+      firefox: :selenium_firefox
+    }.freeze
+
     def self.build(engine, config = {}, spider:)
+      engine = ENGINE_ALIASES.fetch(engine, engine)
+
       begin
         require "kimurai/browser_builder/#{engine}_builder"
       rescue LoadError
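Alias resolution is a plain hash lookup with passthrough for non-alias names; a sketch:

```ruby
aliases = { chrome: :selenium_chrome, firefox: :selenium_firefox }.freeze

aliases.fetch(:chrome, :chrome)       #=> :selenium_chrome
aliases.fetch(:mechanize, :mechanize) #=> :mechanize (unknown keys pass through unchanged)
```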
data/lib/kimurai/capybara_ext/session.rb
CHANGED

@@ -10,7 +10,6 @@ module Capybara
     alias original_visit visit
     def visit(visit_uri, delay: config.before_request[:delay], skip_request_options: false, max_retries: 3)
       if spider
-        process_delay(delay) if delay
         retries = 0
         sleep_interval = 0
 
@@ -20,6 +19,9 @@ module Capybara
         spider.class.update(:visits, :requests) if spider.with_info
 
         original_visit(visit_uri)
+
+        logger.info "Browser: finished get request to: #{visit_uri}"
+        process_delay(delay) if delay
       rescue StandardError => e
         if match_error?(e, type: :to_skip)
           logger.error "Browser: skip request error: #{e.inspect}, url: #{visit_uri}"
@@ -40,7 +42,7 @@ module Capybara
           raise e
         end
       else
-          driver.responses += 1
+        driver.responses += 1
         spider.class.update(:visits, :responses) if spider.with_info
         driver.visited = true unless driver.visited
         true
@@ -170,7 +172,7 @@ module Capybara
 
     def process_delay(delay)
       interval = (delay.instance_of?(Range) ? rand(delay) : delay)
-      logger.debug "Browser:
+      logger.debug "Browser: delay #{interval.round(2)} #{'second'.pluralize(interval)}..."
       sleep interval
     end
 

data/lib/kimurai/cli/generator.rb
CHANGED

@@ -35,7 +35,7 @@ module Kimurai
 
       return if in_project
 
-      insert_into_file spider_path, "  @engine = :selenium_chrome\n", after: "@name = \"#{spider_name}\"\n"
+      insert_into_file spider_path, "  @engine = :chrome\n", after: "@name = \"#{spider_name}\"\n"
       prepend_to_file spider_path, "require 'kimurai'\n\n"
       append_to_file spider_path, "\n#{spider_class}.crawl!"
     end
data/lib/kimurai/version.rb
CHANGED
data/lib/kimurai.rb
CHANGED
@@ -6,6 +6,7 @@ require 'uri'
 require 'active_support'
 require 'active_support/core_ext'
 require 'rbcat'
+require 'nukitori'
 
 require_relative 'kimurai/version'
 
@@ -20,6 +21,33 @@ require_relative 'kimurai/pipeline'
 require_relative 'kimurai/base'
 
 module Kimurai
+  # Settings that will be forwarded to Nukitori configuration
+  NUKITORI_SETTINGS = %i[
+    openai_api_key
+    anthropic_api_key
+    gemini_api_key
+    vertexai_project_id
+    vertexai_location
+    deepseek_api_key
+    mistral_api_key
+    perplexity_api_key
+    openrouter_api_key
+    gpustack_api_key
+    openai_api_base
+    gemini_api_base
+    ollama_api_base
+    gpustack_api_base
+    openai_organization_id
+    openai_project_id
+    openai_use_system_role
+    bedrock_api_key
+    bedrock_secret_key
+    bedrock_region
+    bedrock_session_token
+    default_model
+    model_registry_file
+  ].freeze
+
   class << self
     def configuration
       @configuration ||= OpenStruct.new
@@ -27,6 +55,22 @@ module Kimurai
 
     def configure
       yield(configuration)
+      apply_nukitori_configuration
+    end
+
+    def apply_nukitori_configuration
+      nukitori_settings = NUKITORI_SETTINGS.filter_map do |setting|
+        value = configuration[setting]
+        [setting, value] if value
+      end.to_h
+
+      return if nukitori_settings.empty?
+
+      Nukitori.configure do |config|
+        nukitori_settings.each do |setting, value|
+          config.public_send("#{setting}=", value)
+        end
+      end
     end
 
     def env
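Only the keys named in `NUKITORI_SETTINGS` are forwarded; everything else stays Kimurai-local. A sketch:

```ruby
Kimurai.configure do |config|
  config.gemini_api_key = ENV['GEMINI_API_KEY'] # forwarded to Nukitori.configure
  config.default_model  = 'gemini-3-flash-preview'
  config.time_zone      = 'UTC'                 # not in NUKITORI_SETTINGS, stays local
end
```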
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: kimurai
 version: !ruby/object:Gem::Version
-  version: 2.0.1
+  version: 2.2.0
 platform: ruby
 authors:
 - Victor Afanasev
@@ -261,6 +261,48 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '1.0'
+- !ruby/object:Gem::Dependency
+  name: nukitori
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '13.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '13.0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.13'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.13'
 email:
 - vicfreefly@gmail.com
 executables:
@@ -269,6 +311,7 @@ extensions: []
 extra_rdoc_files: []
 files:
 - ".gitignore"
+- ".rspec"
 - ".rubocop.yml"
 - CHANGELOG.md
 - Gemfile
@@ -329,7 +372,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: 3.
+      version: 3.2.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="