kimurai 2.1.0 → 2.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +5 -0
- data/README.md +71 -21
- data/lib/kimurai/base.rb +1 -1
- data/lib/kimurai/cli/generator.rb +1 -1
- data/lib/kimurai/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: eeeca8fc2ae390e6c557f435478ee4ea8273920e3ab7c590800c338574f364d0
+  data.tar.gz: d7d2d799a97c51c0e1837080c249316651beedb58b312df3df9bc69fefabac31
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 80f449d68068d238da99dbbb83b710e8071e7fd4ded76bc17f096fac1446a307e841d87300c461299c6c01fe9abaecbcc5c91d9a2fa9878d5af5d1cac888349a
+  data.tar.gz: 793f3301353e135484ad0283973cd4c3c181fa43a9510339f06a61152a557f758f1d9e32a2fec12954ce5b9e3a409c554efea13485f7d66c67ba8654e4f6baeb
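These checksums are digests of the `metadata.gz` and `data.tar.gz` entries packed inside the released `.gem` archive. As a hedged sketch (the local file name `kimurai-2.2.0.gem` is an assumption), they can be recomputed like this:

```ruby
require 'digest'
require 'rubygems/package'

# A .gem file is a tar archive; its metadata.gz and data.tar.gz entries are
# what the SHA256/SHA512 values above describe.
File.open('kimurai-2.2.0.gem', 'rb') do |io|
  Gem::Package::TarReader.new(io) do |tar|
    tar.each do |entry|
      next unless %w[metadata.gz data.tar.gz].include?(entry.full_name)
      puts "#{entry.full_name}: #{Digest::SHA256.hexdigest(entry.read)}"
    end
  end
end
```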
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED

@@ -3,10 +3,64 @@
   <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
 </a>
 
-<h1>Kimurai</h1>
+<h1>Kimurai: AI-First Web Scraping Framework for Ruby</h1>
 </div>
 
-
+Write web scrapers in Ruby using a clean, AI-assisted DSL. Kimurai uses AI to figure out where the data lives, then caches the selectors and scrapes with pure Ruby. Get the intelligence of an LLM without the per-request latency or token costs:
+
+```ruby
+# google_spider.rb
+require 'kimurai'
+
+class GoogleSpider < Kimurai::Base
+  @start_urls = ['https://www.google.com/search?q=web+scraping+ai']
+  @delay = 1
+
+  def parse(response, url:, data: {})
+    results = extract(response) do
+      array :organic_results do
+        object do
+          string :title
+          string :snippet
+          string :url
+        end
+      end
+
+      array :sponsored_results do
+        object do
+          string :title
+          string :snippet
+          string :url
+        end
+      end
+
+      array :people_also_search_for, of: :string
+
+      string :next_page_link
+      number :current_page_number
+    end
+
+    save_to 'google_results.json', results, format: :json
+
+    if results[:next_page_link] && results[:current_page_number] < 3
+      request_to :parse, url: absolute_url(results[:next_page_link], base: url)
+    end
+  end
+end
+
+GoogleSpider.crawl!
+```
+
+**How it works:**
+1. On the first request, `extract` sends the HTML + your schema to an LLM
+2. The LLM generates XPath selectors and caches them in `google_spider.json`
+3. **All subsequent requests use cached XPath — zero AI calls, pure fast Ruby extraction**
+4. Supports OpenAI, Anthropic, Gemini, or local LLMs via [Nukitori](https://github.com/vifreefly/nukitori)
+
+
+## Traditional Mode
+
+Prefer writing your own selectors? Kimurai works great as a traditional scraper too — with headless antidetect Chromium, Firefox, or simple HTTP requests:
 
 ```ruby
 # github_spider.rb
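The "How it works" list in the hunk above describes a cache-then-reuse flow: one LLM call produces selectors, and every later page is parsed with plain XPath. Purely as an illustrative sketch of that idea (not Kimurai's or Nukitori's actual internals; the selector, file names, and HTML input are made up), the pattern looks like this:

```ruby
require 'json'
require 'nokogiri'

# First run: ask the LLM for selectors and persist them.
# Later runs: reuse the cached file and never call the LLM again.
def cached_selectors(cache_path)
  return JSON.parse(File.read(cache_path)) if File.exist?(cache_path)

  selectors = yield # the expensive LLM call happens only here
  File.write(cache_path, JSON.pretty_generate(selectors))
  selectors
end

selectors = cached_selectors('google_spider.json') do
  { 'organic_results' => '//div[@id="search"]//a/h3' } # hypothetical LLM output
end

html = File.read('search_results.html') # any previously fetched page
doc = Nokogiri::HTML(html)
titles = doc.xpath(selectors['organic_results']).map { |node| node.text.strip }
puts titles
```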
@@ -216,9 +270,9 @@ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spid
 ```
 </details>
 
-## AI
+## AI Extraction — Configuration
 
-
+Configure your LLM provider to start using AI extraction. The `extract` method is powered by [Nukitori](https://github.com/vifreefly/nukitori):
 
 ```ruby
 # github_spider_ai.rb
@@ -260,13 +314,7 @@ end
 GithubSpider.crawl!
 ```
 
-
-1. On the first page, `extract` sends the HTML to an LLM which generates XPath rules for your schema
-2. These rules are cached in a JSON file alongside your spider
-3. **All subsequent pages use the cached XPath — no more AI calls, pure fast extraction**
-4. When there's no "Next" link on the last page, the extracted value is `nil` and pagination stops
-
-Zero manual selectors. The AI figured out where everything lives, and that knowledge is reused for the entire crawl.
+Selectors are cached in `github_spider_ai.json` after the first AI call — all subsequent requests use pure Ruby extraction.
 
 ## Features
 * **AI-powered data extraction**: Use [Nukitori](https://github.com/vifreefly/nukitori) to extract structured data without writing XPath/CSS selectors — just describe what you want, and AI figures out how to extract it
@@ -286,6 +334,8 @@ Zero manual selectors. The AI figured out where everything lives, and that knowl
 
 ## Table of Contents
 * [Kimurai](#kimurai)
+* [Traditional Mode](#traditional-mode)
+* [AI Extraction — Configuration](#ai-extraction--configuration)
 * [Features](#features)
 * [Table of Contents](#table-of-contents)
 * [Installation](#installation)
@@ -504,7 +554,7 @@ SimpleSpider.crawl!
 
 Where:
 * `@name` – a name for the spider (optional)
-* `@engine` – engine to use for the spider (optional, default is `:
+* `@engine` – engine to use for the spider (optional, default is `:selenium_chrome`)
 * `@start_urls` – array of urls to process one-by-one inside the `parse` method
 * The `parse` method is the entry point, and should always be present in a spider class
 
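As a hedged illustration of the class-level settings listed in the hunk above (the spider name, class name, and URL are placeholders, not taken from the gem), a minimal spider tying them together might look like:

```ruby
require 'kimurai'

class ExampleSpider < Kimurai::Base
  @name = "example_spider"                 # optional spider name
  @engine = :selenium_chrome               # optional; the README names this as the default
  @start_urls = ["https://example.com/"]   # each url is passed to #parse in turn

  # parse is the required entry point; response is the parsed Nokogiri document
  def parse(response, url:, data: {})
    puts response.title
  end
end

ExampleSpider.crawl!
```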
@@ -1472,7 +1522,7 @@ end
 # Custom User Agent – string or lambda
 #
 # Use lambda if you want to rotate user agents before each run:
-#
+# user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
 #
 # Works for all engines
 user_agent: "Mozilla/5.0 Firefox/61.0",
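The comment added above documents the lambda form for rotating user agents. A hedged sketch of wiring that into a spider's `@config` (the pool contents, spider name, and URL are placeholders):

```ruby
require 'kimurai'

# Placeholder pool; in practice this would hold real browser user-agent strings.
USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
]

class RotatingUaSpider < Kimurai::Base
  @engine = :mechanize
  @start_urls = ["https://example.com/"]
  @config = {
    user_agent: -> { USER_AGENTS.sample } # the lambda is re-evaluated before each run
  }

  def parse(response, url:, data: {})
  end
end
```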
@@ -1484,10 +1534,10 @@ end
 cookies: [],
 
 # Proxy – string or lambda. Format for a proxy string: "ip:port:protocol:user:password"
-#
+# `protocol` can be http or socks5. User and password are optional.
 #
 # Use lambda if you want to rotate proxies before each run:
-#
+# proxy: -> { ARRAY_OF_PROXIES.sample }
 #
 # Works for all engines, but keep in mind that Selenium drivers don't support proxies
 # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http).
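A hedged illustration of the proxy string format documented above (addresses and credentials are placeholders):

```ruby
# Placeholder proxies in the "ip:port:protocol:user:password" format described above.
PROXIES = [
  "10.0.0.1:3128:http:proxy_user:proxy_pass", # http proxy with auth
  "10.0.0.2:1080:socks5"                      # socks5 proxy, no auth (user/password are optional)
]

# Inside a spider class, using the lambda form shown in the config comment:
# @config = { proxy: -> { PROXIES.sample } }
```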
@@ -1531,10 +1581,10 @@ end
 # and if the url already exists in this scope, the request will be skipped.
 #
 # You can configure this setting by providing additional options as hash:
-#
-#
-#
-#
+# `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
+# `scope:` – use a custom scope other than `:requests_urls`
+# `check_only:` – if true, the url will not be added to the scope
+#
 # Works for all drivers
 skip_duplicate_requests: true,
 
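A hedged sketch of the hash form documented above, used from a spider's `@config` (the spider name, URL, and scope name are placeholders):

```ruby
require 'kimurai'

class NoDuplicatesSpider < Kimurai::Base
  @engine = :mechanize
  @start_urls = ["https://example.com/catalog"]
  @config = {
    # Check urls against a custom scope; with check_only: true the url is only
    # checked, not automatically added to the scope.
    skip_duplicate_requests: { scope: :product_pages, check_only: true }
  }

  def parse(response, url:, data: {})
  end
end
```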
@@ -1565,8 +1615,8 @@ end
 # Handle page encoding while parsing html response using Nokogiri
 #
 # There are two ways to use this option:
-#
-#
+# encoding: :auto # auto-detect from <meta http-equiv="Content-Type"> or <meta charset> tags
+# encoding: "GB2312" # set encoding manually
 #
 # This option is not set by default
 encoding: nil,
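A hedged illustration of the two encoding forms named above inside a spider's `@config` (spider name and URL are placeholders):

```ruby
require 'kimurai'

class EncodedPagesSpider < Kimurai::Base
  @engine = :mechanize
  @start_urls = ["https://example.com/gb2312-page"]
  @config = {
    encoding: :auto      # or encoding: "GB2312" to force a specific charset
  }

  def parse(response, url:, data: {})
  end
end
```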
data/lib/kimurai/base.rb
CHANGED

@@ -35,7 +35,7 @@ module Kimurai
 
       return if in_project
 
-      insert_into_file spider_path, " @engine = :
+      insert_into_file spider_path, " @engine = :chrome\n", after: "@name = \"#{spider_name}\"\n"
       prepend_to_file spider_path, "require 'kimurai'\n\n"
       append_to_file spider_path, "\n#{spider_class}.crawl!"
     end
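Read together, the three calls in the hunk above prepend the `require`, inject `@engine = :chrome` right after the `@name` line, and append the crawl call. A hedged sketch of the resulting generated spider file (the spider name, class name, and the rest of the template are assumptions, not taken from the gem):

```ruby
# example_spider.rb — a rough sketch of the file the generator could produce
require 'kimurai'

class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :chrome          # inserted right after the @name line by insert_into_file
  @start_urls = []

  def parse(response, url:, data: {})
  end
end

ExampleSpider.crawl!
```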
data/lib/kimurai/version.rb
CHANGED