kimurai 2.0.1 → 2.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/.rspec +3 -0
- data/.rubocop.yml +2 -0
- data/CHANGELOG.md +14 -0
- data/LICENSE.txt +1 -1
- data/README.md +183 -39
- data/Rakefile +3 -7
- data/kimurai.gemspec +5 -1
- data/lib/kimurai/base/saver.rb +34 -20
- data/lib/kimurai/base.rb +15 -4
- data/lib/kimurai/base_helper.rb +10 -0
- data/lib/kimurai/browser_builder.rb +7 -0
- data/lib/kimurai/capybara_ext/session.rb +5 -3
- data/lib/kimurai/version.rb +1 -1
- data/lib/kimurai.rb +44 -0
- metadata +45 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b0f990c2292eebb911b6036b7515fdbe4b844f75dc20ec032c1da352de740c80
+  data.tar.gz: 13f35756781bb2a0c8f14fe246edc87e428cd5de6bc46c269128d764b9f2763c
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b2f236b701e505bba6e03083fc5e4308b125ba09203db04d41c1e72b0ccfbcb6957c35e6cc05e8f10ae2d3d96aeb7cdb1e7f24e7770406a8894f6020ec53c5c4
+  data.tar.gz: d47288711341145af98b0ad547583dae4ce26f38d6b024b4994a40eb9fc13f802238b0bf93eb0cd8b06d91ff098eceb2b837b50ef9e84ef3d0a98a25a70d0ce8
data/.gitignore
CHANGED
data/.rspec
ADDED
data/.rubocop.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,4 +1,18 @@
 # CHANGELOG
+## 2.1.0
+### New
+* Min. required Ruby version is 3.2.0
+* **AI-powered data extraction with `extract` method** — Powered by [Nukitori](https://github.com/vifreefly/nukitori). Describe the data structure you want and let AI generate XPath selectors automatically. Selectors are cached for reuse, so AI is only called once per page type
+* **Configure Nukitori via Kimurai** — Set LLM provider settings (OpenAI, Anthropic, Gemini, etc.) directly in `Kimurai.configure` block
+* **Engine aliases** — Use shorter engine names: `:chrome` (alias for `:selenium_chrome`), `:firefox` (alias for `:selenium_firefox`)
+* **Top-level `@delay` option** — Set request delay directly as `@delay = 2..5` instead of nested `@config = { before_request: { delay: 2..5 } }`
+* **Auto spider name** — If `@name` is not provided, it's automatically derived from the class name
+* **Save array of items** — `save_to` helper now accepts an array of items to save at once
+
+### Improvements
+* `save_to` helper now uses pretty JSON by default for `:json` format (use `format: :compact_json` for compact output)
+* Request delay is now applied before the response is passed to the callback
+
 ## 2.0.1
 ### Fixes
 * Remove xpath as default Capybara selector type (fixes https://github.com/vifreefly/kimuraframework/issues/28)
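
To make the 2.1.0 conveniences above concrete, here is a minimal illustrative spider exercising the new options (the URL and XPath are placeholders, not from the package):

```ruby
require 'kimurai'

class ProductsSpider < Kimurai::Base
  # No @name needed: it is derived from the class name => "products_spider"
  @engine = :chrome                               # new alias for :selenium_chrome
  @start_urls = ["https://example.com/products"]  # placeholder URL
  @delay = 2..5                                   # top-level delay, no nested @config needed

  def parse(response, url:, data: {})
    items = response.xpath("//div[@class='product']").map do |node|
      { name: node.text.strip }
    end

    # save_to now accepts a whole array of items; :json output is pretty-printed
    save_to "products.json", items, format: :json
  end
end

ProductsSpider.crawl!
```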
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
@@ -1,20 +1,21 @@
-
+<div align="center">
+  <a href="https://github.com/vifreefly/kimuraframework">
+    <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
+  </a>
 
-Kimurai
+  <h1>Kimurai</h1>
+</div>
 
-Kimurai is
+Kimurai is a modern Ruby web scraping framework designed to scrape and interact with JavaScript-rendered websites using headless antidetect Chromium, Firefox, or simple HTTP requests — right out of the box:
 
 ```ruby
 # github_spider.rb
 require 'kimurai'
 
 class GithubSpider < Kimurai::Base
-  @
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
-  @
-  before_request: { delay: 3..5 }
-  }
+  @delay = 3..5
 
   def parse(response, url:, data: {})
    response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
@@ -149,8 +150,7 @@ Okay, that was easy. How about JavaScript rendered websites with dynamic HTML?
 require 'kimurai'
 
 class InfiniteScrollSpider < Kimurai::Base
-  @
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://infinite-scroll.com/demo/full-page/"]
 
   def parse(response, url:, data: {})
@@ -194,14 +194,82 @@ I, [2025-12-16 12:47:15] INFO -- infinite_scroll_spider: > Continue scrolling,
 I, [2025-12-16 12:47:17] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
 I, [2025-12-16 12:47:19] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
 I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > Pagination is done
-I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page:
+I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page:
+
+1a - Infinite Scroll full page demo;
+1b - RGB Schemes logo in Computer Arts;
+2a - RGB Schemes logo;
+2b - Masonry gets horizontalOrder;
+2c - Every vector 2016;
+3a - Logo Pizza delivered;
+3b - Some CodePens;
+3c - 365daysofmusic.com;
+3d - Holograms;
+4a - Huebee: 1-click color picker;
+4b - Word is Flickity is good;
+Flickity v2 released: groupCells, adaptiveHeight, parallax;
+New tech gets chatter; Isotope v3 released: stagger in, IE8 out;
+Packery v2 released
+
 I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
 I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spider_name: "infinite_scroll_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 12:47:05.372053 +0300, stop_time: 2025-12-16 12:47:21.505078 +0300, running_time: "16s", visits: {requests: 1, responses: 1}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
 ```
-</details
+</details>
 
+## AI-Powered Extraction
+
+What if you could just describe the data you want and let AI figure out how to extract it? With the built-in `extract` method powered by [Nukitori](https://github.com/vifreefly/nukitori), you can:
+
+```ruby
+# github_spider_ai.rb
+require 'kimurai'
+
+Kimurai.configure do |config|
+  config.default_model = "gemini-3-flash-preview" # OpenAI, Anthropic, Gemini, local LLMs, etc.
+  config.gemini_api_key = ENV["GEMINI_API_KEY"]
+end
+
+class GithubSpider < Kimurai::Base
+  @engine = :chrome
+  @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
+  @delay = 3..5
+
+  def parse(response, url:, data: {})
+    data = extract(response) do
+      string :next_page_url, description: "Next page path url"
+      array :repos do
+        object do
+          string :name
+          string :url
+          string :description
+          string :stars
+          string :language
+          array :tags, of: :string
+        end
+      end
+    end
+
+    save_to "results.json", data[:repos], format: :json
+
+    if data[:next_page_url]
+      request_to :parse, url: absolute_url(data[:next_page_url], base: url)
+    end
+  end
+end
+
+GithubSpider.crawl!
+```
+
+**How it works:**
+1. On the first page, `extract` sends the HTML to an LLM which generates XPath rules for your schema
+2. These rules are cached in a JSON file alongside your spider
+3. **All subsequent pages use the cached XPath — no more AI calls, pure fast extraction**
+4. When there's no "Next" link on the last page, the extracted value is `nil` and pagination stops
+
+Zero manual selectors. The AI figured out where everything lives, and that knowledge is reused for the entire crawl.
 
 ## Features
+* **AI-powered data extraction**: Use [Nukitori](https://github.com/vifreefly/nukitori) to extract structured data without writing XPath/CSS selectors — just describe what you want, and AI figures out how to extract it
 * Scrape JavaScript rendered websites out of the box
 * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
 * Write spider code once, and use it with any supported engine later
@@ -229,6 +297,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
   * [browser object](#browser-object)
   * [request_to method](#request_to-method)
   * [save_to helper](#save_to-helper)
+  * [AI-powered extraction with extract](#ai-powered-extraction-with-extract)
   * [Skip duplicates](#skip-duplicates)
     * [Automatically skip all duplicate request urls](#automatically-skip-all-duplicate-request-urls)
   * [Storage object](#storage-object)
@@ -262,7 +331,7 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
 
 
 ## Installation
-Kimurai requires Ruby version `>= 3.
+Kimurai requires Ruby version `>= 3.2.0`. Officially supported platforms: `Linux` and `macOS`.
 
 1) If your system doesn't have the appropriate Ruby version, install it:
 
@@ -312,7 +381,7 @@ gem update --system
 
 ```bash
 # Install basic tools
-sudo apt install -q -y unzip wget tar openssl
+sudo apt install -q -y unzip wget tar openssl lsof
 
 # Install xvfb (for virtual_display headless mode, in addition to native)
 sudo apt install -q -y xvfb
@@ -409,8 +478,8 @@ CLI arguments:
 Kimurai has support for the following engines and can mostly switch between them without the need to rewrite any code:
 
 * `:mechanize` – [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is. It can only parse the original HTML code of a page. Because of that, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. It's recommended to use mechanize when possible, i.e. if the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize tries to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
-* `:selenium_chrome` – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
-* `:selenium_firefox` – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.
+* `:chrome` (`:selenium_chrome` alias) – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
+* `:firefox` (`:selenium_firefox` alias) – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.
 
 **Tip:** prepend a `HEADLESS=false` environment variable on the command line (i.e. `$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.
 
@@ -423,7 +492,7 @@ require 'kimurai'
 
 class SimpleSpider < Kimurai::Base
   @name = "simple_spider"
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def parse(response, url:, data: {})
@@ -434,8 +503,8 @@ SimpleSpider.crawl!
 ```
 
 Where:
-* `@name` – a name for the spider
-* `@engine` – engine to use for the spider
+* `@name` – a name for the spider (optional)
+* `@engine` – engine to use for the spider (optional, default is `:mechanize`)
 * `@start_urls` – array of urls to process one-by-one inside the `parse` method
 * The `parse` method is the entry point, and should always be present in a spider class
 
@@ -458,7 +527,7 @@ Imagine that there is a product page that doesn't contain a category name. The c
 
 ```ruby
 class ProductsSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example-shop.com/example-product-category"]
 
   def parse(response, url:, data: {})
@@ -497,8 +566,7 @@ But, if you need to interact with a page (like filling form fields, clicking ele
 
 ```ruby
 class GoogleSpider < Kimurai::Base
-  @
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://www.google.com/"]
 
   def parse(response, url:, data: {})
@@ -529,7 +597,7 @@ For making requests to a particular method, there is `request_to`. It requires a
 
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def parse(response, url:, data: {})
@@ -565,7 +633,7 @@ The `request_to` helper method makes things simpler. We could also do something
 
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def parse(response, url:, data: {})
@@ -588,7 +656,7 @@ Sometimes all you need is to simply save scraped data to a file. You can use the
 
 ```ruby
 class ProductsSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example-shop.com/"]
 
   # ...
@@ -607,12 +675,12 @@ end
 ```
 
 Supported formats:
-* `:json` – JSON
-* `:
+* `:json` – JSON (`JSON.pretty_generate`)
+* `:compact_json` – JSON
 * `:jsonlines` – [JSON Lines](http://jsonlines.org/)
 * `:csv` – CSV
 
-Note: `save_to` requires the data (item) to save to be a
+Note: `save_to` requires the data (item) to save to be a Hash or Array of Hashes.
 
 By default, `save_to` will add a position key to an item hash. You can disable it like so: `save_to "scraped_products.json", item, format: :json, position: false`
 
@@ -622,13 +690,91 @@ While the spider is running, each new item will be appended to the output file.
 
 > If you don't want the file to be cleared before each run, pass `append: true` like so: `save_to "scraped_products.json", item, format: :json, append: true`
 
+### AI-powered extraction with `extract`
+
+Writing and maintaining XPath/CSS selectors is tedious and error-prone. The `extract` method uses AI to generate selectors automatically — you just describe the data structure you want.
+
+**Configuration:**
+
+First, configure an LLM provider in your application:
+
+```ruby
+Kimurai.configure do |config|
+  config.default_model = 'gemini-3-flash-preview'
+  config.gemini_api_key = ENV['GEMINI_API_KEY']
+
+  # Or use OpenAI
+  # config.default_model = 'gpt-5.2'
+  # config.openai_api_key = ENV['OPENAI_API_KEY']
+
+  # Or Anthropic
+  # config.default_model = 'claude-sonnet-4-5'
+  # config.anthropic_api_key = ENV['ANTHROPIC_API_KEY']
+end
+```
+
+**Usage:**
+
+```ruby
+def parse(response, url:, data: {})
+  data = extract(response) do
+    string :title
+    string :price
+    string :description
+    array :features, of: :string
+  end
+
+  save_to "products.json", data, format: :json
+end
+```
+
+**Schema DSL:**
+
+- `string :field_name` — extracts text
+- `integer :field_name` — extracts integer
+- `number :field_name` — extracts float/decimal
+- `array :items do ... end` — extracts list of objects
+- `array :tags, of: :string` — extracts list of strings
+- `object do ... end` — nested structure
+- `description: '...'` — hint for AI about what to look for
+
+**How it works:**
+
+1. On first run, `extract` sends the HTML and your schema to an LLM
+2. The LLM returns XPath rules for each field
+3. These rules are cached in `SpiderName.json` alongside your spider file
+4. All subsequent extractions use cached XPath — fast and free, no more AI calls
+5. Each method gets its own prefix in the schema file, so different parse methods can have different schemas
+
+**Automatic pagination:**
+
+Include a next page field in your schema:
+
+```ruby
+data = extract(response) do
+  string :next_page_url, description: 'Next page link'
+  array :products do
+    object do
+      string :name
+      string :price
+    end
+  end
+end
+
+if data[:next_page_url]
+  request_to :parse, url: absolute_url(data[:next_page_url], base: url)
end
+```
+
+When the last page has no "Next" link, the extracted value is `nil` and pagination stops naturally.
+
 ### Skip duplicates
 
 It's pretty common for websites to have duplicate pages. For example, when an e-commerce site has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:
 
 ```ruby
 class ProductsSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example-shop.com/"]
 
   def parse(response, url:, data: {})
@@ -842,8 +988,7 @@ The `run_info` method is available from the `open_spider` and `close_spider` cla
 
 ```ruby
 class ExampleSpider < Kimurai::Base
-  @
-  @engine = :selenium_chrome
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def self.close_spider
@@ -895,7 +1040,7 @@ You can also use the additional methods `completed?` or `failed?`
 
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def self.close_spider
@@ -933,7 +1078,7 @@ Kimurai supports environments. The default is `development`. To provide a custom
 Usage example:
 ```ruby
 class Spider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
 
   def self.close_spider
@@ -956,7 +1101,6 @@ Kimurai can process web pages concurrently: `in_parallel(:parse_product, urls, t
 require 'kimurai'
 
 class AmazonSpider < Kimurai::Base
-  @name = "amazon_spider"
   @engine = :mechanize
   @start_urls = ["https://www.amazon.com/"]
 
@@ -1068,7 +1212,7 @@ vic@Vics-MacBook-Air single %
 
 * `data:` – pass custom data like so: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
 * `delay:` – set delay between requests like so: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, the delay (in seconds) will be set randomly for each request: `rand (2..5) # => 3`
-* `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :
+* `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :chrome)`
 * `config:` – set custom [config](#spider-config) options
 
 ### Active Support included
@@ -1170,7 +1314,7 @@ Kimurai.configure do |config|
 
   # Custom time zone (for logs):
   # config.time_zone = "UTC"
-  # config.time_zone = "Europe/
+  # config.time_zone = "Europe/Berlin"
 
   # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
   # config.selenium_chrome_path = "/usr/bin/chromium-browser"
@@ -1286,7 +1430,7 @@ class Spider < Kimurai::Base
   USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
   PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]
 
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://example.com/"]
   @config = {
     headers: { "custom_header" => "custom_value" },
@@ -1649,7 +1793,7 @@ end
 spiders/application_spider.rb
 ```ruby
 class ApplicationSpider < Kimurai::Base
-  @engine = :
+  @engine = :chrome
 
   # Define pipelines (by order) for all spiders:
   @pipelines = [:validator, :saver]
@@ -1726,7 +1870,7 @@ spiders/github_spider.rb
 ```ruby
 class GithubSpider < Kimurai::Base
   @name = "github_spider"
-  @engine = :
+  @engine = :chrome
   @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
   @config = {
     before_request: { delay: 3..5 }
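
As a reference for the "Supported formats" change above, a sketch of what the two JSON formats write to disk (the item fields are made up):

```ruby
require 'json'

item = { name: "Phone", price: 42 }

# format: :json — pretty-printed via JSON.pretty_generate:
puts JSON.pretty_generate([item])
# [
#   {
#     "name": "Phone",
#     "price": 42
#   }
# ]

# format: :compact_json — single line via JSON.generate:
puts JSON.generate([item])
# [{"name":"Phone","price":42}]
```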
data/Rakefile
CHANGED
@@ -1,10 +1,6 @@
 require 'bundler/gem_tasks'
-require '
+require 'rspec/core/rake_task'
 
-
-  t.libs << 'test'
-  t.libs << 'lib'
-  t.test_files = FileList['test/**/*_test.rb']
-end
+RSpec::Core::RakeTask.new(:spec)
 
-task default: :
+task default: :spec
data/kimurai.gemspec
CHANGED
@@ -20,7 +20,7 @@ Gem::Specification.new do |spec|
   spec.bindir = 'exe'
   spec.executables = 'kimurai'
   spec.require_paths = ['lib']
-  spec.required_ruby_version = '>= 3.
+  spec.required_ruby_version = '>= 3.2.0'
 
   spec.add_dependency 'activesupport'
   spec.add_dependency 'cliver'
@@ -46,4 +46,8 @@ Gem::Specification.new do |spec|
 
   spec.add_dependency 'pry'
   spec.add_dependency 'rbcat', '~> 1.0'
+  spec.add_dependency 'nukitori'
+
+  spec.add_development_dependency 'rake', '~> 13.0'
+  spec.add_development_dependency 'rspec', '~> 3.13'
 end
data/lib/kimurai/base/saver.rb
CHANGED
@@ -7,10 +7,11 @@ module Kimurai
     attr_reader :format, :path, :position, :append
 
     def initialize(path, format:, position: true, append: false)
-      raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json jsonlines csv].include?(format)
+      raise "SimpleSaver: wrong type of format: #{format}" unless %i[json pretty_json compact_json jsonlines csv].include?(format)
 
       @path = path
       @format = format
+      @format = :json if format == :pretty_json # :pretty_json is now an alias for :json
       @position = position
       @index = 0
       @append = append
@@ -19,44 +20,57 @@ module Kimurai
 
     def save(item)
       @mutex.synchronize do
-
-
-
-
-
-
-
-
-
-
-
-
+        if item.is_a?(Array)
+          item.each do |it|
+            @index += 1
+            it[:position] = @index if position
+
+            save_item(it)
+          end
+        else
+          @index += 1
+          item[:position] = @index if position
+
+          save_item(item)
         end
       end
     end
 
     private
 
+    def save_item(item)
+      case format
+      when :json
+        save_to_json(item)
+      when :compact_json
+        save_to_compact_json(item)
+      when :jsonlines
+        save_to_jsonlines(item)
+      when :csv
+        save_to_csv(item)
+      end
+    end
+
     def save_to_json(item)
-      data = JSON.
+      data = JSON.pretty_generate([item])
 
      if @index > 1 || append && File.exist?(path)
-        file_content = File.read(path).sub(/\}\]\Z/, "\}
+        file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
        File.open(path, 'w') do |f|
-          f.write(file_content + data.sub(/\A\[/, ''))
+          f.write(file_content + data.sub(/\A\[\n/, ''))
        end
      else
        File.open(path, 'w') { |f| f.write(data) }
      end
    end

-    def
-      data = JSON.
+    def save_to_compact_json(item)
+      data = JSON.generate([item])

      if @index > 1 || append && File.exist?(path)
-        file_content = File.read(path).sub(/\}\
+        file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
        File.open(path, 'w') do |f|
-          f.write(file_content + data.sub(/\A\[
+          f.write(file_content + data.sub(/\A\[/, ''))
        end
      else
        File.open(path, 'w') { |f| f.write(data) }
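
The regex changes above follow from the switch to pretty output: a pretty-printed array ends in `}\n]` rather than `}]`, so appending a new item means snipping the closing bracket, adding a comma, and gluing on the next element with its leading `[` removed. A standalone sketch of that stitching for the pretty `:json` case (the file path is illustrative):

```ruby
require 'json'

path = "items.json"
File.write(path, JSON.pretty_generate([{ id: 1 }]))

# Same transformation the saver applies on each subsequent save:
next_data = JSON.pretty_generate([{ id: 2 }])
stitched  = File.read(path).sub(/\}\n\]\Z/, "},\n") + # drop closing "]", add comma
            next_data.sub(/\A\[\n/, '')               # append next element sans "["
File.write(path, stitched)

puts JSON.parse(File.read(path)).size # => 2, still one valid pretty-printed JSON array
```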
data/lib/kimurai/base.rb
CHANGED
@@ -69,7 +69,7 @@ module Kimurai
     @config = {}
 
     def self.name
-      @name
+      @name || to_s.underscore
     end
 
     def self.engine
@@ -84,11 +84,22 @@ module Kimurai
       @start_urls
     end
 
+    def self.delay
+      @delay ||= superclass.respond_to?(:delay) ? superclass.delay : nil
+    end
+
     def self.config
-      if superclass.equal?(::Object)
-
+      base_config = if superclass.equal?(::Object)
+        @config
+      else
+        superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+      end
+
+      # Merge @delay shortcut into config if set
+      if delay
+        base_config.deep_merge_excl({ before_request: { delay: delay } }, DMERGE_EXCLUDE)
       else
-
+        base_config
       end
     end
 
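
A quick sketch of what the two `Base` additions above do in practice (the class and values are illustrative): the spider name falls back to the underscored class name, and a top-level `@delay` is folded into `config[:before_request][:delay]`:

```ruby
class BooksToScrapeSpider < Kimurai::Base
  @engine = :mechanize
  @delay = 2..5
end

BooksToScrapeSpider.name                    # => "books_to_scrape_spider" (no @name set)
BooksToScrapeSpider.config[:before_request] # => { delay: 2..5 }
```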
data/lib/kimurai/base_helper.rb
CHANGED
@@ -1,5 +1,15 @@
 module Kimurai
   module BaseHelper
+    def extract(response, model: nil, &block)
+      caller_info = caller_locations(1, 1).first
+      method_name = caller_info.base_label
+      spider_dir = File.dirname(caller_info.path)
+      schema_path = File.join(spider_dir, "#{self.class.name}.json")
+
+      data = Nukitori(response, schema_path, prefix: method_name, model:, &block)
+      data.deep_symbolize_keys
+    end
+
     private
 
     def absolute_url(url, base:)
data/lib/kimurai/browser_builder.rb
CHANGED
@@ -1,6 +1,13 @@
 module Kimurai
   module BrowserBuilder
+    ENGINE_ALIASES = {
+      chrome: :selenium_chrome,
+      firefox: :selenium_firefox
+    }.freeze
+
     def self.build(engine, config = {}, spider:)
+      engine = ENGINE_ALIASES.fetch(engine, engine)
+
       begin
         require "kimurai/browser_builder/#{engine}_builder"
       rescue LoadError
data/lib/kimurai/capybara_ext/session.rb
CHANGED
@@ -10,7 +10,6 @@ module Capybara
     alias original_visit visit
     def visit(visit_uri, delay: config.before_request[:delay], skip_request_options: false, max_retries: 3)
       if spider
-        process_delay(delay) if delay
         retries = 0
         sleep_interval = 0
 
@@ -20,6 +19,9 @@ module Capybara
         spider.class.update(:visits, :requests) if spider.with_info
 
         original_visit(visit_uri)
+
+        logger.info "Browser: finished get request to: #{visit_uri}"
+        process_delay(delay) if delay
       rescue StandardError => e
         if match_error?(e, type: :to_skip)
           logger.error "Browser: skip request error: #{e.inspect}, url: #{visit_uri}"
@@ -40,7 +42,7 @@ module Capybara
           raise e
         end
       else
-        driver.responses += 1
+        driver.responses += 1
         spider.class.update(:visits, :responses) if spider.with_info
         driver.visited = true unless driver.visited
         true
@@ -170,7 +172,7 @@ module Capybara
 
     def process_delay(delay)
       interval = (delay.instance_of?(Range) ? rand(delay) : delay)
-      logger.debug "Browser:
+      logger.debug "Browser: delay #{interval.round(2)} #{'second'.pluralize(interval)}..."
       sleep interval
     end
 
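
Putting the three library changes above together (the spider path is illustrative): `extract` keys its cached selectors by spider file and calling method, the engine alias is resolved once in `BrowserBuilder.build`, and the per-request delay now runs after a successful visit rather than before it:

```ruby
# spiders/github_spider.rb — illustrative location
class GithubSpider < Kimurai::Base
  @engine = :chrome # resolved to :selenium_chrome via ENGINE_ALIASES at build time

  def parse(response, url:, data: {})
    # Called from #parse, so the generated XPath rules are cached under the
    # "parse" prefix in spiders/github_spider.json (self.class.name => "github_spider")
    extract(response) do
      string :title
    end
    # => { title: "..." } with symbolized keys
  end
end
```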
data/lib/kimurai/version.rb
CHANGED
data/lib/kimurai.rb
CHANGED
@@ -6,6 +6,7 @@ require 'uri'
 require 'active_support'
 require 'active_support/core_ext'
 require 'rbcat'
+require 'nukitori'
 
 require_relative 'kimurai/version'
 
@@ -20,6 +21,33 @@ require_relative 'kimurai/pipeline'
 require_relative 'kimurai/base'
 
 module Kimurai
+  # Settings that will be forwarded to Nukitori configuration
+  NUKITORI_SETTINGS = %i[
+    openai_api_key
+    anthropic_api_key
+    gemini_api_key
+    vertexai_project_id
+    vertexai_location
+    deepseek_api_key
+    mistral_api_key
+    perplexity_api_key
+    openrouter_api_key
+    gpustack_api_key
+    openai_api_base
+    gemini_api_base
+    ollama_api_base
+    gpustack_api_base
+    openai_organization_id
+    openai_project_id
+    openai_use_system_role
+    bedrock_api_key
+    bedrock_secret_key
+    bedrock_region
+    bedrock_session_token
+    default_model
+    model_registry_file
+  ].freeze
+
   class << self
     def configuration
       @configuration ||= OpenStruct.new
@@ -27,6 +55,22 @@
 
     def configure
       yield(configuration)
+      apply_nukitori_configuration
+    end
+
+    def apply_nukitori_configuration
+      nukitori_settings = NUKITORI_SETTINGS.filter_map do |setting|
+        value = configuration[setting]
+        [setting, value] if value
+      end.to_h
+
+      return if nukitori_settings.empty?
+
+      Nukitori.configure do |config|
+        nukitori_settings.each do |setting, value|
+          config.public_send("#{setting}=", value)
+        end
+      end
     end
 
     def env
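
The effect of the forwarding above: every `NUKITORI_SETTINGS` entry assigned in `Kimurai.configure` is copied onto Nukitori's configuration; unset (nil) entries are skipped, and nothing is forwarded if none are set. A usage sketch:

```ruby
Kimurai.configure do |config|
  config.default_model  = "gemini-3-flash-preview"
  config.gemini_api_key = ENV["GEMINI_API_KEY"]
  config.time_zone      = "UTC" # not in NUKITORI_SETTINGS, stays local to Kimurai
end

# Internally equivalent to:
# Nukitori.configure do |c|
#   c.default_model  = "gemini-3-flash-preview"
#   c.gemini_api_key = ENV["GEMINI_API_KEY"]
# end
```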
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: kimurai
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 2.0
|
|
4
|
+
version: 2.1.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Victor Afanasev
|
|
@@ -261,6 +261,48 @@ dependencies:
|
|
|
261
261
|
- - "~>"
|
|
262
262
|
- !ruby/object:Gem::Version
|
|
263
263
|
version: '1.0'
|
|
264
|
+
- !ruby/object:Gem::Dependency
|
|
265
|
+
name: nukitori
|
|
266
|
+
requirement: !ruby/object:Gem::Requirement
|
|
267
|
+
requirements:
|
|
268
|
+
- - ">="
|
|
269
|
+
- !ruby/object:Gem::Version
|
|
270
|
+
version: '0'
|
|
271
|
+
type: :runtime
|
|
272
|
+
prerelease: false
|
|
273
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
274
|
+
requirements:
|
|
275
|
+
- - ">="
|
|
276
|
+
- !ruby/object:Gem::Version
|
|
277
|
+
version: '0'
|
|
278
|
+
- !ruby/object:Gem::Dependency
|
|
279
|
+
name: rake
|
|
280
|
+
requirement: !ruby/object:Gem::Requirement
|
|
281
|
+
requirements:
|
|
282
|
+
- - "~>"
|
|
283
|
+
- !ruby/object:Gem::Version
|
|
284
|
+
version: '13.0'
|
|
285
|
+
type: :development
|
|
286
|
+
prerelease: false
|
|
287
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
288
|
+
requirements:
|
|
289
|
+
- - "~>"
|
|
290
|
+
- !ruby/object:Gem::Version
|
|
291
|
+
version: '13.0'
|
|
292
|
+
- !ruby/object:Gem::Dependency
|
|
293
|
+
name: rspec
|
|
294
|
+
requirement: !ruby/object:Gem::Requirement
|
|
295
|
+
requirements:
|
|
296
|
+
- - "~>"
|
|
297
|
+
- !ruby/object:Gem::Version
|
|
298
|
+
version: '3.13'
|
|
299
|
+
type: :development
|
|
300
|
+
prerelease: false
|
|
301
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
302
|
+
requirements:
|
|
303
|
+
- - "~>"
|
|
304
|
+
- !ruby/object:Gem::Version
|
|
305
|
+
version: '3.13'
|
|
264
306
|
email:
|
|
265
307
|
- vicfreefly@gmail.com
|
|
266
308
|
executables:
|
|
@@ -269,6 +311,7 @@ extensions: []
|
|
|
269
311
|
extra_rdoc_files: []
|
|
270
312
|
files:
|
|
271
313
|
- ".gitignore"
|
|
314
|
+
- ".rspec"
|
|
272
315
|
- ".rubocop.yml"
|
|
273
316
|
- CHANGELOG.md
|
|
274
317
|
- Gemfile
|
|
@@ -329,7 +372,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
|
329
372
|
requirements:
|
|
330
373
|
- - ">="
|
|
331
374
|
- !ruby/object:Gem::Version
|
|
332
|
-
version: 3.
|
|
375
|
+
version: 3.2.0
|
|
333
376
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
|
334
377
|
requirements:
|
|
335
378
|
- - ">="
|