kimurai 2.1.0 → 2.2.0

This diff shows the changes between publicly available package versions as released to their respective public registries, and is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: b0f990c2292eebb911b6036b7515fdbe4b844f75dc20ec032c1da352de740c80
- data.tar.gz: 13f35756781bb2a0c8f14fe246edc87e428cd5de6bc46c269128d764b9f2763c
+ metadata.gz: eeeca8fc2ae390e6c557f435478ee4ea8273920e3ab7c590800c338574f364d0
+ data.tar.gz: d7d2d799a97c51c0e1837080c249316651beedb58b312df3df9bc69fefabac31
  SHA512:
- metadata.gz: b2f236b701e505bba6e03083fc5e4308b125ba09203db04d41c1e72b0ccfbcb6957c35e6cc05e8f10ae2d3d96aeb7cdb1e7f24e7770406a8894f6020ec53c5c4
- data.tar.gz: d47288711341145af98b0ad547583dae4ce26f38d6b024b4994a40eb9fc13f802238b0bf93eb0cd8b06d91ff098eceb2b837b50ef9e84ef3d0a98a25a70d0ce8
+ metadata.gz: 80f449d68068d238da99dbbb83b710e8071e7fd4ded76bc17f096fac1446a307e841d87300c461299c6c01fe9abaecbcc5c91d9a2fa9878d5af5d1cac888349a
+ data.tar.gz: 793f3301353e135484ad0283973cd4c3c181fa43a9510339f06a61152a557f758f1d9e32a2fec12954ce5b9e3a409c554efea13485f7d66c67ba8654e4f6baeb
data/CHANGELOG.md CHANGED
@@ -1,4 +1,9 @@
  # CHANGELOG
+
+ ## 2.2.0
+ ### New
+ * Default engine is now `:chrome` (was `:mechanize`)
+
  ## 2.1.0
  ### New
  * Min. required Ruby version is 3.2.0
data/README.md CHANGED
@@ -3,10 +3,64 @@
  <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
  </a>

- <h1>Kimurai</h1>
+ <h1>Kimurai: AI-First Web Scraping Framework for Ruby</h1>
  </div>

- Kimurai is a modern Ruby web scraping framework designed to scrape and interact with JavaScript-rendered websites using headless antidetect Chromium, Firefox, or simple HTTP requests right out of the box:
+ Write web scrapers in Ruby using a clean, AI-assisted DSL. Kimurai uses AI to figure out where the data lives, then caches the selectors and scrapes with pure Ruby. Get the intelligence of an LLM without the per-request latency or token costs:
+
+ ```ruby
+ # google_spider.rb
+ require 'kimurai'
+
+ class GoogleSpider < Kimurai::Base
+   @start_urls = ['https://www.google.com/search?q=web+scraping+ai']
+   @delay = 1
+
+   def parse(response, url:, data: {})
+     results = extract(response) do
+       array :organic_results do
+         object do
+           string :title
+           string :snippet
+           string :url
+         end
+       end
+
+       array :sponsored_results do
+         object do
+           string :title
+           string :snippet
+           string :url
+         end
+       end
+
+       array :people_also_search_for, of: :string
+
+       string :next_page_link
+       number :current_page_number
+     end
+
+     save_to 'google_results.json', results, format: :json
+
+     if results[:next_page_link] && results[:current_page_number] < 3
+       request_to :parse, url: absolute_url(results[:next_page_link], base: url)
+     end
+   end
+ end
+
+ GoogleSpider.crawl!
+ ```
+
+ **How it works:**
+ 1. On the first request, `extract` sends the HTML + your schema to an LLM
+ 2. The LLM generates XPath selectors and caches them in `google_spider.json`
+ 3. **All subsequent requests use cached XPath — zero AI calls, pure fast Ruby extraction**
+ 4. Supports OpenAI, Anthropic, Gemini, or local LLMs via [Nukitori](https://github.com/vifreefly/nukitori)
+
+
+ ## Traditional Mode
+
+ Prefer writing your own selectors? Kimurai works great as a traditional scraper too — with headless antidetect Chromium, Firefox, or simple HTTP requests:

  ```ruby
  # github_spider.rb
@@ -216,9 +270,9 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
  ```
  </details>

- ## AI-Powered Extraction
+ ## AI Extraction — Configuration

- What if you could just describe the data you want and let AI figure out how to extract it? With the built-in `extract` method powered by [Nukitori](https://github.com/vifreefly/nukitori), you can:
+ Configure your LLM provider to start using AI extraction. The `extract` method is powered by [Nukitori](https://github.com/vifreefly/nukitori):

  ```ruby
  # github_spider_ai.rb
@@ -260,13 +314,7 @@ end
  GithubSpider.crawl!
  ```

- **How it works:**
- 1. On the first page, `extract` sends the HTML to an LLM which generates XPath rules for your schema
- 2. These rules are cached in a JSON file alongside your spider
- 3. **All subsequent pages use the cached XPath — no more AI calls, pure fast extraction**
- 4. When there's no "Next" link on the last page, the extracted value is `nil` and pagination stops
-
- Zero manual selectors. The AI figured out where everything lives, and that knowledge is reused for the entire crawl.
+ Selectors are cached in `github_spider_ai.json` after the first AI call — all subsequent requests use pure Ruby extraction.

  ## Features
  * **AI-powered data extraction**: Use [Nukitori](https://github.com/vifreefly/nukitori) to extract structured data without writing XPath/CSS selectors — just describe what you want, and AI figures out how to extract it
@@ -286,6 +334,8 @@ Zero manual selectors. The AI figured out where everything lives, and that knowl

  ## Table of Contents
  * [Kimurai](#kimurai)
+ * [Traditional Mode](#traditional-mode)
+ * [AI Extraction — Configuration](#ai-extraction--configuration)
  * [Features](#features)
  * [Table of Contents](#table-of-contents)
  * [Installation](#installation)
@@ -504,7 +554,7 @@ SimpleSpider.crawl!

  Where:
  * `@name` – a name for the spider (optional)
- * `@engine` – engine to use for the spider (optional, default is `:mechanize`)
+ * `@engine` – engine to use for the spider (optional, default is `:selenium_chrome`)
  * `@start_urls` – array of urls to process one-by-one inside the `parse` method
  * The `parse` method is the entry point, and should always be present in a spider class
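Since 2.2.0 flips the default engine, spiders that relied on the old `:mechanize` default can pin it explicitly. A minimal sketch, assuming `:mechanize` remains a supported engine (the proxy notes later in this README still reference Mechanize):

```ruby
require 'kimurai'

class PinnedEngineSpider < Kimurai::Base
  @engine = :mechanize   # keep the pre-2.2.0 default instead of :selenium_chrome
  @start_urls = ['https://example.com/']

  def parse(response, url:, data: {})
    puts response.title   # response is a parsed Nokogiri document
  end
end

PinnedEngineSpider.crawl!
```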
@@ -1472,7 +1522,7 @@ end
  # Custom User Agent – string or lambda
  #
  # Use lambda if you want to rotate user agents before each run:
- # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
+ # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
  #
  # Works for all engines
  user_agent: "Mozilla/5.0 Firefox/61.0",
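The lambda form in the comment above is evaluated before each run, so rotation is just sampling from a list. A brief sketch in spider context (the user-agent strings are placeholders):

```ruby
# Rotate user agents using the lambda form from the config comments above.
USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Version/17.0 Safari/605.1.15"
].freeze

class RotatingUaSpider < Kimurai::Base
  @start_urls = ['https://example.com/']
  @config = {
    user_agent: -> { USER_AGENTS.sample }  # re-evaluated before each run
  }

  def parse(response, url:, data: {}); end
end
```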
@@ -1484,10 +1534,10 @@ end
  cookies: [],

  # Proxy – string or lambda. Format for a proxy string: "ip:port:protocol:user:password"
- # `protocol` can be http or socks5. User and password are optional.
+ # `protocol` can be http or socks5. User and password are optional.
  #
  # Use lambda if you want to rotate proxies before each run:
- # proxy: -> { ARRAY_OF_PROXIES.sample }
+ # proxy: -> { ARRAY_OF_PROXIES.sample }
  #
  # Works for all engines, but keep in mind that Selenium drivers don't support proxies
  # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http).
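Following the `"ip:port:protocol:user:password"` format documented above, proxy rotation looks much the same (the addresses below are placeholders):

```ruby
# Rotate proxies using the documented "ip:port:protocol:user:password" format.
PROXIES = [
  "203.0.113.10:8080:http:proxyuser:proxypass",
  "203.0.113.11:1080:socks5"   # user and password are optional
].freeze

class RotatingProxySpider < Kimurai::Base
  @start_urls = ['https://example.com/']
  @config = {
    proxy: -> { PROXIES.sample }  # picked before each run
  }

  def parse(response, url:, data: {}); end
end
```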
@@ -1531,10 +1581,10 @@ end
  # and if the url already exists in this scope, the request will be skipped.
  #
  # You can configure this setting by providing additional options as hash:
- # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
- # `scope:` – use a custom scope other than `:requests_urls`
- # `check_only:` – if true, the url will not be added to the scope
- #
+ # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
+ # `scope:` – use a custom scope other than `:requests_urls`
+ # `check_only:` – if true, the url will not be added to the scope
+ #
  # Works for all drivers
  skip_duplicate_requests: true,
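The hash form documented above slots into `@config` like so (the scope name is illustrative):

```ruby
class DedupSpider < Kimurai::Base
  @start_urls = ['https://example.com/']
  @config = {
    # Track visited urls in a custom scope; check_only leaves the scope
    # untouched, so the url is checked but not recorded.
    skip_duplicate_requests: { scope: :product_urls, check_only: true }
  }

  def parse(response, url:, data: {}); end
end
```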
@@ -1565,8 +1615,8 @@ end
  # Handle page encoding while parsing html response using Nokogiri
  #
  # There are two ways to use this option:
- # encoding: :auto # auto-detect from <meta http-equiv="Content-Type"> or <meta charset> tags
- # encoding: "GB2312" # set encoding manually
+ # encoding: :auto # auto-detect from <meta http-equiv="Content-Type"> or <meta charset> tags
+ # encoding: "GB2312" # set encoding manually
  #
  # This option is not set by default
  encoding: nil,
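And the encoding option from the comments above, in spider context: `:auto` reads the charset from the page's `<meta>` tags, while a string forces it manually.

```ruby
class EncodingAwareSpider < Kimurai::Base
  @start_urls = ['https://example.com/']
  @config = {
    encoding: :auto   # or e.g. "GB2312" to force a specific charset
  }

  def parse(response, url:, data: {}); end
end
```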
data/lib/kimurai/base.rb CHANGED
@@ -64,7 +64,7 @@ module Kimurai

  ###

- @engine = :mechanize
+ @engine = :selenium_chrome
  @pipelines = []
  @config = {}

data/lib/kimurai/cli/generator.rb CHANGED
@@ -35,7 +35,7 @@ module Kimurai

  return if in_project

- insert_into_file spider_path, " @engine = :mechanize\n", after: "@name = \"#{spider_name}\"\n"
+ insert_into_file spider_path, " @engine = :chrome\n", after: "@name = \"#{spider_name}\"\n"
  prepend_to_file spider_path, "require 'kimurai'\n\n"
  append_to_file spider_path, "\n#{spider_class}.crawl!"
  end
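For context on the hunk above: the generator prepends the require, inserts the engine line right after `@name`, and appends the crawl call, so a freshly generated spider plausibly looks like the sketch below (the class body beyond those three edits is an assumption, not the actual template):

```ruby
require 'kimurai'

class SampleSpider < Kimurai::Base
  @name = "sample_spider"
  @engine = :chrome   # inserted by the generator; was :mechanize before 2.2.0

  def parse(response, url:, data: {})
    # scraping logic goes here
  end
end

SampleSpider.crawl!
```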
data/lib/kimurai/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Kimurai
- VERSION = '2.1.0'.freeze
+ VERSION = '2.2.0'.freeze
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: kimurai
  version: !ruby/object:Gem::Version
- version: 2.1.0
+ version: 2.2.0
  platform: ruby
  authors:
  - Victor Afanasev