kimurai 1.3.2 → 2.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.rubocop.yml +9 -0
- data/CHANGELOG.md +29 -0
- data/Gemfile +2 -2
- data/README.md +478 -649
- data/Rakefile +6 -6
- data/bin/console +3 -4
- data/exe/kimurai +0 -1
- data/kimurai.gemspec +38 -37
- data/lib/kimurai/base/saver.rb +15 -19
- data/lib/kimurai/base/storage.rb +1 -1
- data/lib/kimurai/base.rb +42 -38
- data/lib/kimurai/base_helper.rb +5 -4
- data/lib/kimurai/browser_builder/mechanize_builder.rb +44 -38
- data/lib/kimurai/browser_builder/selenium_chrome_builder.rb +63 -51
- data/lib/kimurai/browser_builder/selenium_firefox_builder.rb +61 -55
- data/lib/kimurai/browser_builder.rb +7 -31
- data/lib/kimurai/capybara_configuration.rb +1 -1
- data/lib/kimurai/capybara_ext/driver/base.rb +50 -46
- data/lib/kimurai/capybara_ext/mechanize/driver.rb +51 -50
- data/lib/kimurai/capybara_ext/selenium/driver.rb +33 -29
- data/lib/kimurai/capybara_ext/session/config.rb +1 -1
- data/lib/kimurai/capybara_ext/session.rb +40 -38
- data/lib/kimurai/cli/generator.rb +15 -15
- data/lib/kimurai/cli.rb +52 -85
- data/lib/kimurai/core_ext/array.rb +2 -2
- data/lib/kimurai/core_ext/hash.rb +1 -1
- data/lib/kimurai/core_ext/numeric.rb +4 -4
- data/lib/kimurai/pipeline.rb +2 -1
- data/lib/kimurai/runner.rb +6 -6
- data/lib/kimurai/template/Gemfile +2 -2
- data/lib/kimurai/template/config/boot.rb +4 -4
- data/lib/kimurai/template/config/schedule.rb +15 -15
- data/lib/kimurai/template/spiders/application_spider.rb +14 -14
- data/lib/kimurai/version.rb +1 -1
- data/lib/kimurai.rb +7 -3
- metadata +58 -65
- data/.travis.yml +0 -5
- data/lib/kimurai/automation/deploy.yml +0 -54
- data/lib/kimurai/automation/setup/chromium_chromedriver.yml +0 -26
- data/lib/kimurai/automation/setup/firefox_geckodriver.yml +0 -20
- data/lib/kimurai/automation/setup/phantomjs.yml +0 -33
- data/lib/kimurai/automation/setup/ruby_environment.yml +0 -124
- data/lib/kimurai/automation/setup.yml +0 -44
- data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb +0 -171
- data/lib/kimurai/capybara_ext/poltergeist/driver.rb +0 -13
- data/lib/kimurai/cli/ansible_command_builder.rb +0 -71
- data/lib/kimurai/template/config/automation.yml +0 -13
data/README.md
CHANGED
@@ -1,28 +1,8 @@
-
- <a href="https://github.com/vifreefly/kimuraframework">
-   <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
- </a>
+ # Kimurai

-
- </div>
+ Kimurai is a modern web scraping framework written in Ruby which **works out of the box with Headless Chromium/Firefox** or simple HTTP requests and **allows you to scrape and interact with JavaScript rendered websites.**

-
- > * The code was massively refactored for a [support](#using-kimurai-inside-existing-ruby-application) to run spiders multiple times from inside a single process. Now it's possible to run Kimurai spiders using background jobs like Sidekiq.
- > * `require 'kimurai'` doesn't require any gems except Active Support. Only when a particular spider [starts](#crawl-method), Capybara will be required with a specific driver.
- > * Although Kimurai [extends](lib/kimurai/capybara_ext) Capybara (all the magic happens inside [extended](lib/kimurai/capybara_ext/session.rb) `Capybara::Session#visit` method), session instances which were created manually will behave normally.
- > * No spaghetti code with `case/when/end` blocks anymore. All drivers [were extended](lib/kimurai/capybara_ext) to support unified methods for cookies, proxies, headers, etc.
- > * `selenium_url_to_set_cookies` @config option don't need anymore if you're use Selenium-like engine with custom cookies setting.
- > * Small changes in design (check the readme again to see what was changed)
- > * Stats database with a web dashboard were removed
- > * Again, massive refactor. Code now looks much better than it was before.
-
- <br>
-
- > Note: this readme is for `1.3.2` gem version. CHANGELOG [here](CHANGELOG.md).
-
- Kimurai is a modern web scraping framework written in Ruby which **works out of box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
-
- Kimurai based on well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Lets see:
+ Kimurai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's try an example:

  ```ruby
  # github_spider.rb
@@ -31,18 +11,17 @@ require 'kimurai'
  class GithubSpider < Kimurai::Base
    @name = "github_spider"
    @engine = :selenium_chrome
-   @start_urls = ["https://github.com/search?q=
+   @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
    @config = {
-
-     before_request: { delay: 4..7 }
+     before_request: { delay: 3..5 }
    }

    def parse(response, url:, data: {})
-     response.xpath("//
+     response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
        request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
      end

-     if next_page = response.at_xpath("//a[@
+     if next_page = response.at_xpath("//a[@rel='next']")
        request_to :parse, url: absolute_url(next_page[:href], base: url)
      end
    end
@@ -50,15 +29,15 @@ class GithubSpider < Kimurai::Base
    def parse_repo_page(response, url:, data: {})
      item = {}

-     item[:owner] = response.xpath("//
-     item[:repo_name] = response.xpath("//
+     item[:owner] = response.xpath("//a[@rel='author']").text.squish
+     item[:repo_name] = response.xpath("//strong[@itemprop='name']").text.squish
      item[:repo_url] = url
-     item[:description] = response.xpath("//
-     item[:tags] = response.xpath("//div[@
-     item[:watch_count] = response.xpath("//
-     item[:star_count] = response.xpath("//
-     item[:fork_count] = response.xpath("//
-     item[:last_commit] = response.xpath("//
+     item[:description] = response.xpath("//div[h2[text()='About']]/p").text.squish
+     item[:tags] = response.xpath("//div/a[contains(@title, 'Topic')]").map { |a| a.text.squish }
+     item[:watch_count] = response.xpath("//div/h3[text()='Watchers']/following-sibling::div[1]/a/strong").text.squish
+     item[:star_count] = response.xpath("//div/h3[text()='Stars']/following-sibling::div[1]/a/strong").text.squish
+     item[:fork_count] = response.xpath("//div/h3[text()='Forks']/following-sibling::div[1]/a/strong").text.squish
+     item[:last_commit] = response.xpath("//div[@data-testid='latest-commit-details']//relative-time/text()").text.squish

      save_to "results.json", item, format: :pretty_json
    end
@@ -71,33 +50,25 @@ GithubSpider.crawl!
  <summary>Run: <code>$ ruby github_spider.rb</code></summary>

  ```
-
-
-
-
-
-
- I, [
- I, [
- I, [
-
-
- I, [
- I, [
- I, [
-
-
- I, [
+ $ ruby github_spider.rb
+
+ I, [2025-12-16 12:15:48] INFO -- github_spider: Spider: started: github_spider
+ I, [2025-12-16 12:15:48] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
+ I, [2025-12-16 12:16:01] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
+ I, [2025-12-16 12:16:01] INFO -- github_spider: Info: visits: requests: 1, responses: 1
+ I, [2025-12-16 12:16:01] INFO -- github_spider: Browser: started get request to: https://github.com/sparklemotion/mechanize
+ I, [2025-12-16 12:16:06] INFO -- github_spider: Browser: finished get request to: https://github.com/sparklemotion/mechanize
+ I, [2025-12-16 12:16:06] INFO -- github_spider: Info: visits: requests: 2, responses: 2
+ I, [2025-12-16 12:16:06] INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector
+ I, [2025-12-16 12:16:11] INFO -- github_spider: Browser: finished get request to: https://github.com/jaimeiniesta/metainspector
+ I, [2025-12-16 12:16:11] INFO -- github_spider: Info: visits: requests: 3, responses: 3
+ I, [2025-12-16 12:16:11] INFO -- github_spider: Browser: started get request to: https://github.com/Germey/AwesomeWebScraping
+ I, [2025-12-16 12:16:13] INFO -- github_spider: Browser: finished get request to: https://github.com/Germey/AwesomeWebScraping
+ I, [2025-12-16 12:16:13] INFO -- github_spider: Info: visits: requests: 4, responses: 4
+ I, [2025-12-16 12:16:13] INFO -- github_spider: Browser: started get request to: https://github.com/vifreefly/kimuraframework
+ I, [2025-12-16 12:16:17] INFO -- github_spider: Browser: finished get request to: https://github.com/vifreefly/kimuraframework

  ...
-
- I, [2018-08-22 13:23:07 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 140, responses: 140
- D, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 204198
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
-
- I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:08:03 +0400, :stop_time=>2018-08-22 13:23:08 +0400, :running_time=>"15m, 5s", :visits=>{:requests=>140, :responses=>140}, :error=>nil}
  ```
  </details>

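The hunks above update the example spider for GitHub's current markup and shorten the request delay. Assembled in one place, the new class-level pieces look roughly like the sketch below; the URLs and XPath selectors are copied from the added lines, while the overall skeleton is illustrative rather than the released file.

```ruby
# Illustrative sketch assembled from the "+" lines above; not the released file.
require 'kimurai'

class GithubSpider < Kimurai::Base
  @name = "github_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
  @config = {
    before_request: { delay: 3..5 } # delay range taken from the diff above
  }

  def parse(response, url:, data: {})
    # Follow every search result, then the rel="next" pagination link
    response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
      request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
    end

    if next_page = response.at_xpath("//a[@rel='next']")
      request_to :parse, url: absolute_url(next_page[:href], base: url)
    end
  end

  def parse_repo_page(response, url:, data: {})
    # Only two of the fields from the diff are repeated here, to keep the sketch short
    item = { repo_url: url, owner: response.xpath("//a[@rel='author']").text.squish }
    save_to "results.json", item, format: :pretty_json
  end
end

GithubSpider.crawl!
```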
@@ -107,48 +78,71 @@ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider:
  ```json
  [
    {
-     "owner": "
-     "repo_name": "
-     "repo_url": "https://github.com/
-     "description": "
-     "tags": [
-
-
-
-
-       "python",
-       "javascript",
-       "php",
-       "ruby"
-     ],
-     "watch_count": "159",
-     "star_count": "2,423",
-     "fork_count": "358",
-     "last_commit": "4 days ago",
+     "owner": "sparklemotion",
+     "repo_name": "mechanize",
+     "repo_url": "https://github.com/sparklemotion/mechanize",
+     "description": "Mechanize is a ruby library that makes automated web interaction easy.",
+     "tags": ["ruby", "web", "scraping"],
+     "watch_count": "79",
+     "star_count": "4.4k",
+     "fork_count": "480",
+     "last_commit": "Sep 30, 2025",
      "position": 1
    },
-
-   ...
-
    {
-     "owner": "
-     "repo_name": "
-     "repo_url": "https://github.com/
-     "description": "
-     "tags": [
-
-
-     "
-     "
+     "owner": "jaimeiniesta",
+     "repo_name": "metainspector",
+     "repo_url": "https://github.com/jaimeiniesta/metainspector",
+     "description": "Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...",
+     "tags": [],
+     "watch_count": "20",
+     "star_count": "1k",
+     "fork_count": "166",
+     "last_commit": "Oct 8, 2025",
+     "position": 2
+   },
+   {
+     "owner": "Germey",
+     "repo_name": "AwesomeWebScraping",
+     "repo_url": "https://github.com/Germey/AwesomeWebScraping",
+     "description": "List of libraries, tools and APIs for web scraping and data processing.",
+     "tags": ["javascript", "ruby", "python", "golang", "php", "awesome", "captcha", "proxy", "web-scraping", "aswsome-list"],
+     "watch_count": "5",
+     "star_count": "253",
+     "fork_count": "33",
+     "last_commit": "Apr 5, 2024",
+     "position": 3
+   },
+   {
+     "owner": "vifreefly",
+     "repo_name": "kimuraframework",
+     "repo_url": "https://github.com/vifreefly/kimuraframework",
+     "description": "Kimurai is a modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites",
+     "tags": ["crawler", "scraper", "scrapy", "headless-chrome", "kimurai"],
+     "watch_count": "28",
+     "star_count": "1k",
+     "fork_count": "158",
+     "last_commit": "Dec 12, 2025",
+     "position": 4
+   },
+   // ...
+   {
+     "owner": "citixenken",
+     "repo_name": "web_scraping_with_ruby",
+     "repo_url": "https://github.com/citixenken/web_scraping_with_ruby",
+     "description": "",
+     "tags": [],
+     "watch_count": "1",
+     "star_count": "0",
      "fork_count": "0",
-     "last_commit": "
-     "position":
+     "last_commit": "Aug 29, 2022",
+     "position": 118
    }
  ]
  ```
  </details><br>

- Okay, that was easy. How about
+ Okay, that was easy. How about JavaScript rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:

  ```ruby
  # infinite_scroll_spider.rb
@@ -172,7 +166,7 @@ class InfiniteScrollSpider < Kimurai::Base
          logger.info "> Pagination is done" and break
        else
          count = new_count
-         logger.info "> Continue scrolling, current count is #{count}..."
+         logger.info "> Continue scrolling, current posts count is #{count}..."
        end
      end

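Only the logging branch of the scroll loop is visible in this hunk; for orientation, the surrounding (elided) loop has roughly the shape sketched below. This is a hedged reconstruction, not the package source: the `//article/h2` selector and the scroll script are assumptions for illustration.

```ruby
# Hedged sketch of the scroll-until-stable loop around the changed logger line.
def parse(response, url:, data: {})
  posts_headers_path = "//article/h2" # assumed selector, for illustration only
  count = response.xpath(posts_headers_path).count

  loop do
    browser.execute_script("window.scrollBy(0, 10000)") # scroll down
    sleep 2                                             # give new posts time to load

    new_count = browser.current_response.xpath(posts_headers_path).count
    if count == new_count
      logger.info "> Pagination is done" and break
    else
      count = new_count
      logger.info "> Continue scrolling, current posts count is #{count}..."
    end
  end

  posts_headers = browser.current_response.xpath(posts_headers_path).map { |n| n.text.squish }
  logger.info "> All posts from page: #{posts_headers.join('; ')}"
end
```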
@@ -188,49 +182,46 @@ InfiniteScrollSpider.crawl!
  <summary>Run: <code>$ ruby infinite_scroll_spider.rb</code></summary>

  ```
-
- D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): created browser instance
- D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
- I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
- I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
- I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
- D, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: Browser: driver.current_memory: 95463
- I, [2018-08-22 13:33:05 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 5...
- I, [2018-08-22 13:33:18 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 9...
- I, [2018-08-22 13:33:20 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 11...
- I, [2018-08-22 13:33:26 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 13...
- I, [2018-08-22 13:33:28 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 15...
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Pagination is done
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
- I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Spider: stopped: {:spider_name=>"infinite_scroll_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:32:57 +0400, :stop_time=>2018-08-22 13:33:30 +0400, :running_time=>"33s", :visits=>{:requests=>1, :responses=>1}, :error=>nil}
+ $ ruby infinite_scroll_spider.rb

+ I, [2025-12-16 12:47:05] INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
+ I, [2025-12-16 12:47:05] INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
+ I, [2025-12-16 12:47:09] INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
+ I, [2025-12-16 12:47:09] INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
+ I, [2025-12-16 12:47:11] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 5...
+ I, [2025-12-16 12:47:13] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 9...
+ I, [2025-12-16 12:47:15] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 11...
+ I, [2025-12-16 12:47:17] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
+ I, [2025-12-16 12:47:19] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > Pagination is done
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
+ I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: Spider: stopped: {spider_name: "infinite_scroll_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 12:47:05.372053 +0300, stop_time: 2025-12-16 12:47:21.505078 +0300, running_time: "16s", visits: {requests: 1, responses: 1}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
  ```
  </details><br>


  ## Features
- * Scrape
- * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode)
+ * Scrape JavaScript rendered websites out of the box
+ * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
  * Write spider code once, and use it with any supported engine later
  * All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
  * Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
- * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates
+ * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates) to skip duplicates
  * Automatically [handle requests errors](#handle-request-errors)
  * Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
  * Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
  * [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel`
  * **Two modes:** use single file for a simple spider, or [generate](#project-mode) Scrapy-like **project**
  * Convenient development mode with [console](#interactive-console), colorized logger and debugger ([Pry](https://github.com/pry/pry), [Byebug](https://github.com/deivid-rodriguez/byebug))
- *
- * Command-line [runner](#runner) to run all project spiders one by one or in parallel
+ * Command-line [runner](#runner) to run all project spiders one-by-one or in parallel

  ## Table of Contents
  * [Kimurai](#kimurai)
  * [Features](#features)
  * [Table of Contents](#table-of-contents)
  * [Installation](#installation)
- * [Getting to
+ * [Getting to know Kimurai](#getting-to-know-kimurai)
  * [Interactive console](#interactive-console)
  * [Available engines](#available-engines)
  * [Minimum required spider structure](#minimum-required-spider-structure)
@@ -239,9 +230,9 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
  * [request_to method](#request_to-method)
  * [save_to helper](#save_to-helper)
  * [Skip duplicates](#skip-duplicates)
- * [Automatically skip all
+ * [Automatically skip all duplicate request urls](#automatically-skip-all-duplicate-request-urls)
  * [Storage object](#storage-object)
- * [
+ * [Handling request errors](#handling-request-errors)
  * [skip_request_errors](#skip_request_errors)
  * [retry_request_errors](#retry_request_errors)
  * [Logging custom events](#logging-custom-events)
@@ -251,13 +242,10 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
  * [Active Support included](#active-support-included)
  * [Schedule spiders using Cron](#schedule-spiders-using-cron)
  * [Configuration options](#configuration-options)
- * [Using Kimurai inside existing Ruby
+ * [Using Kimurai inside existing Ruby applications](#using-kimurai-inside-existing-ruby-applications)
  * [crawl! method](#crawl-method)
  * [parse! method](#parsemethod_name-url-method)
  * [Kimurai.list and Kimurai.find_by_name](#kimurailist-and-kimuraifind_by_name)
- * [Automated sever setup and deployment](#automated-sever-setup-and-deployment)
- * [Setup](#setup)
- * [Deploy](#deploy)
  * [Spider @config](#spider-config)
  * [All available @config options](#all-available-config-options)
  * [@config settings inheritance](#config-settings-inheritance)
@@ -274,187 +262,111 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol


  ## Installation
- Kimurai requires Ruby version `>=
+ Kimurai requires Ruby version `>= 3.1.0`. Officially supported platforms: `Linux` and `macOS`.

- 1) If your system doesn't have appropriate Ruby version, install it:
+ 1) If your system doesn't have the appropriate Ruby version, install it:

  <details/>
- <summary>Ubuntu
+ <summary>Ubuntu 24.04</summary>

  ```bash
- # Install required packages
+ # Install required system packages
  sudo apt update
- sudo apt install
-
- # Install rbenv and ruby-build
- cd && git clone https://github.com/rbenv/rbenv.git ~/.rbenv
- echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
- echo 'eval "$(rbenv init -)"' >> ~/.bashrc
- exec $SHELL
+ sudo apt install build-essential rustc libssl-dev libyaml-dev zlib1g-dev libgmp-dev

-
-
-
+ # Install mise version manager
+ curl https://mise.run | sh
+ echo 'eval "$(~/.local/bin/mise activate)"' >> ~/.bashrc
+ source ~/.bashrc

  # Install latest Ruby
-
-
-
- gem install bundler
+ mise use --global ruby@3
+ gem update --system
  ```
  </details>

  <details/>
- <summary>
+ <summary>macOS</summary>

  ```bash
- # Install
-
- brew install rbenv ruby-build
+ # Install Homebrew if you don't have it https://brew.sh/
+ brew install openssl@3 libyaml gmp rust

- #
-
-
+ # Install mise version manager
+ curl https://mise.run | sh
+ echo 'eval "$(~/.local/bin/mise activate)"' >> ~/.zshrc
+ source ~/.zshrc

  # Install latest Ruby
-
-
-
- gem install bundler
+ mise use --global ruby@3
+ gem update --system
  ```
  </details>

  2) Install Kimurai gem: `$ gem install kimurai`

- 3) Install browsers
+ 3) Install browsers:

  <details/>
- <summary>Ubuntu
-
- Note: for Ubuntu 16.04-18.04 there is available automatic installation using `setup` command:
- ```bash
- $ kimurai setup localhost --local --ask-sudo
- ```
- It works using [Ansible](https://github.com/ansible/ansible) so you need to install it first: `$ sudo apt install ansible`. You can check using playbooks [here](lib/kimurai/automation).
-
- If you chose automatic installation, you can skip following and go to "Getting To Know" part. In case if you want to install everything manually:
+ <summary>Ubuntu 24.04</summary>

  ```bash
  # Install basic tools
  sudo apt install -q -y unzip wget tar openssl

- # Install xvfb (for virtual_display headless mode, in
+ # Install xvfb (for virtual_display headless mode, in addition to native)
  sudo apt install -q -y xvfb
-
- # Install chromium-browser and firefox
- sudo apt install -q -y chromium-browser firefox
-
- # Instal chromedriver (2.44 version)
- # All versions located here https://sites.google.com/a/chromium.org/chromedriver/downloads
- cd /tmp && wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
- sudo unzip chromedriver_linux64.zip -d /usr/local/bin
- rm -f chromedriver_linux64.zip
-
- # Install geckodriver (0.23.0 version)
- # All versions located here https://github.com/mozilla/geckodriver/releases/
- cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
- sudo tar -xvzf geckodriver-v0.23.0-linux64.tar.gz -C /usr/local/bin
- rm -f geckodriver-v0.23.0-linux64.tar.gz
-
- # Install PhantomJS (2.1.1)
- # All versions located here http://phantomjs.org/download.html
- sudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
- cd /tmp && wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
- tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
- sudo mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib
- sudo ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin
- rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2
  ```

-
-
- <details/>
- <summary>Mac OS X</summary>
+ The latest automatically installed Selenium drivers don't work well with the Ubuntu Snap versions of Chrome and Firefox, so we need to install the classic .deb versions and make sure they take priority over the Snap versions:

  ```bash
- # Install chrome
-
-
- # Install chromedriver (latest)
- brew cask install chromedriver
-
- # Install geckodriver (latest)
- brew install geckodriver
-
- # Install PhantomJS (latest)
- brew install phantomjs
+ # Install google chrome
+ wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
+ sudo apt-get install -y ./google-chrome-stable_current_amd64.deb
  ```
- </details><br>

-
+ ```bash
+ # Install firefox (only if you intend to use Firefox as a browser, using selenium_firefox engine)
+ # See https://www.omgubuntu.co.uk/2022/04/how-to-install-firefox-deb-apt-ubuntu-22-04
+ sudo snap remove firefox

-
-
+ sudo install -d -m 0755 /etc/apt/keyrings
+ wget -q https://packages.mozilla.org/apt/repo-signing-key.gpg -O- | sudo tee /etc/apt/keyrings/packages.mozilla.org.asc > /dev/null

-
+ echo "deb [signed-by=/etc/apt/keyrings/packages.mozilla.org.asc] https://packages.mozilla.org/apt mozilla main" | sudo tee -a /etc/apt/sources.list.d/mozilla.list > /dev/null

-
-
-
-
+ echo '
+ Package: *
+ Pin: origin packages.mozilla.org
+ Pin-Priority: 1000

-
-
+ Package: firefox*
+ Pin: release o=Ubuntu
+ Pin-Priority: -1' | sudo tee /etc/apt/preferences.d/mozilla

-
- sudo apt install
- ```
-
- But if you want to save items to a local database, database server required as well:
- ```bash
- # Install MySQL client and server
- sudo apt -q -y install mysql-server mysql-client libmysqlclient-dev
-
- # Install Postgres client and server
- sudo apt install -q -y postgresql postgresql-contrib libpq-dev
-
- # Install MongoDB client and server
- # version 4.0 (check here https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/)
- sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
- # for 16.04:
- # echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
- # for 18.04:
- echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
- sudo apt update
- sudo apt install -q -y mongodb-org
- sudo service mongod start
+ sudo apt update && sudo apt remove firefox
+ sudo apt install firefox
  ```
  </details>

  <details/>
- <summary>
-
- SQlite: `$ brew install sqlite3`
+ <summary>macOS</summary>

  ```bash
- # Install
- brew install
- # Start server if you need it: brew services start mysql
-
- # Install Postgres client and server
- brew install postgresql
- # Start server if you need it: brew services start postgresql
-
- # Install MongoDB client and server
- brew install mongodb
- # Start server if you need it: brew services start mongodb
+ # Install google chrome
+ brew install google-chrome
  ```
- </details>

+ ```bash
+ # Install firefox (only if you intend to use Firefox as a browser, using selenium_firefox engine)
+ brew install firefox
+ ```
+ </details><br>

- ## Getting to
+ ## Getting to know Kimurai
  ### Interactive console
- Before you get to know all Kimurai features, there is `$ kimurai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
+ Before you get to know all of Kimurai's features, there is a `$ kimurai console` command which is an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).

  ```bash
  $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
@@ -466,76 +378,45 @@ $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/ki
  ```
  $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework

- D, [
-
- I, [
-
-
-
-
-
-
-
-
-
- [
- => "
-
- [
-
-
-
-
- [3] pry(#<Kimurai::Base>)> ls response
- Nokogiri::XML::PP::Node#methods: inspect pretty_print
- Nokogiri::XML::Searchable#methods: % / at at_css at_xpath css search xpath
- Enumerable#methods:
- all? collect drop each_with_index find_all grep_v lazy member? none? reject slice_when take_while without
- any? collect_concat drop_while each_with_object find_index group_by many? min one? reverse_each sort to_a zip
- as_json count each_cons entries first include? map min_by partition select sort_by to_h
- chunk cycle each_entry exclude? flat_map index_by max minmax pluck slice_after sum to_set
- chunk_while detect each_slice find grep inject max_by minmax_by reduce slice_before take uniq
- Nokogiri::XML::Node#methods:
- <=> append_class classes document? has_attribute? matches? node_name= processing_instruction? to_str
- == attr comment? each html? name= node_type read_only? to_xhtml
- > attribute content elem? inner_html namespace= parent= remove traverse
- [] attribute_nodes content= element? inner_html= namespace_scopes parse remove_attribute unlink
- []= attribute_with_ns create_external_subset element_children inner_text namespaced_key? path remove_class values
- accept before create_internal_subset elements internal_subset native_content= pointer_id replace write_html_to
- add_class blank? css_path encode_special_chars key? next prepend_child set_attribute write_to
- add_next_sibling cdata? decorate! external_subset keys next= previous text write_xhtml_to
- add_previous_sibling child delete first_element_child lang next_element previous= text? write_xml_to
- after children description fragment? lang= next_sibling previous_element to_html xml?
- ancestors children= do_xinclude get_attribute last_element_child node_name previous_sibling to_s
- Nokogiri::XML::Document#methods:
- << canonicalize collect_namespaces create_comment create_entity decorate document encoding errors name remove_namespaces! root= to_java url version
- add_child clone create_cdata create_element create_text_node decorators dup encoding= errors= namespaces root slop! to_xml validate
- Nokogiri::HTML::Document#methods: fragment meta_encoding meta_encoding= serialize title title= type
- instance variables: @decorators @errors @node_cache
-
- [4] pry(#<Kimurai::Base>)> exit
- I, [2018-08-22 13:43:47 +0400#26079] [M: 47461994677760] INFO -- : Browser: driver selenium_chrome has been destroyed
- $
+ D, [2025-12-16 13:08:41 +0300#37718] [M: 1208] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
+ I, [2025-12-16 13:08:41 +0300#37718] [M: 1208] INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework
+ I, [2025-12-16 13:08:43 +0300#37718] [M: 1208] INFO -- : Browser: finished get request to: https://github.com/vifreefly/kimuraframework
+
+ From: /Users/vic/code/spiders/kimuraframework/lib/kimurai/base.rb:208 Kimurai::Base#console:
+
+ 207: def console(response = nil, url: nil, data: {})
+ => 208:   binding.pry
+ 209: end
+
+ [1] pry(#<Kimurai::Base>)> response.css('title').text
+ => "GitHub - vifreefly/kimuraframework: Kimurai is a modern Ruby web scraping framework that supports scraping with antidetect Chrome/Firefox as well as HTTP requests"
+ [2] pry(#<Kimurai::Base>)> browser.current_url
+ => "https://github.com/vifreefly/kimuraframework"
+ [3] pry(#<Kimurai::Base>)> browser.visit('https://google.com')
+ I, [2025-12-16 13:09:24 +0300#37718] [M: 1208] INFO -- : Browser: started get request to: https://google.com
+ I, [2025-12-16 13:09:26 +0300#37718] [M: 1208] INFO -- : Browser: finished get request to: https://google.com
+ => true
+ [4] pry(#<Kimurai::Base>)> browser.current_response.title
+ => "Google"
  ```
  </details><br>

- CLI
+ CLI arguments:
  * `--engine` (optional) [engine](#available-drivers) to use. Default is `mechanize`
- * `--url` (optional) url to process. If url omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).
+ * `--url` (optional) url to process. If url is omitted, `response` and `url` objects inside the console will be `nil` (use [browser](#browser-object) object to navigate to any webpage).

  ### Available engines
- Kimurai has support for following engines and mostly
+ Kimurai has support for the following engines and can mostly switch between them without the need to rewrite any code:

- * `:mechanize`
- * `:
- * `:
- * `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but sometimes can be useful.
+ * `:mechanize` – [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is. It can only parse the original HTML code of a page. Because of that, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. It's recommended to use mechanize when possible, i.e. if the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize is trying to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
+ * `:selenium_chrome` – Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
+ * `:selenium_firefox` – Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but can sometimes be useful.

- **Tip:**
+ **Tip:** prepend a `HEADLESS=false` environment variable on the command line (i.e. `$ HEADLESS=false ruby spider.rb`) to launch an interactive browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.


  ### Minimum required spider structure
- > You can manually create a spider file, or use
+ > You can manually create a spider file, or use the generate command: `$ kimurai generate spider simple_spider`

  ```ruby
  require 'kimurai'
@@ -553,10 +434,10 @@ SimpleSpider.crawl!
  ```

  Where:
- * `@name`
- * `@engine` engine for
- * `@start_urls` array of
- *
+ * `@name` – a name for the spider
+ * `@engine` – engine to use for the spider
+ * `@start_urls` – array of urls to process one-by-one inside the `parse` method
+ * The `parse` method is the entry point, and should always be present in a spider class


  ### Method arguments `response`, `url` and `data`
@@ -566,14 +447,14 @@ def parse(response, url:, data: {})
  end
  ```

- * `response`
- * `url`
- * `data`
+ * `response` – [Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object – contains parsed HTML code of a processed webpage
+ * `url` – String – url of a processed webpage
+ * `data` – Hash – used to pass data between requests

  <details/>
- <summary><strong>
+ <summary><strong>An example of how to use <code>data</code></strong></summary>

- Imagine that there is a product page
+ Imagine that there is a product page that doesn't contain a category name. The category name is only present on category pages with pagination. This is a case where we can use `data` to pass a category name from `parse` to `parse_product`:

  ```ruby
  class ProductsSpider < Kimurai::Base
@@ -583,7 +464,7 @@ class ProductsSpider < Kimurai::Base
    def parse(response, url:, data: {})
      category_name = response.xpath("//path/to/category/name").text
      response.xpath("//path/to/products/urls").each do |product_url|
-       # Merge category_name with current data hash and pass it
+       # Merge category_name with current data hash and pass it to parse_product
        request_to(:parse_product, url: product_url[:href], data: data.merge(category_name: category_name))
      end

@@ -592,7 +473,7 @@ class ProductsSpider < Kimurai::Base

    def parse_product(response, url:, data: {})
      item = {}
-     # Assign item's category_name from data[:category_name]
+     # Assign an item's category_name from data[:category_name]
      item[:category_name] = data[:category_name]

      # ...
@@ -603,16 +484,16 @@ end
  </details><br>

  **You can query `response` using [XPath or CSS selectors](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Searchable)**. Check Nokogiri tutorials to understand how to work with `response`:
- * [Parsing HTML with Nokogiri](http://ruby.bastardsbook.com/chapters/html-parsing/)
- * [HOWTO parse HTML with Ruby & Nokogiri](https://readysteadycode.com/howto-parse-html-with-ruby-and-nokogiri)
- * [Class: Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) (documentation)
+ * [Parsing HTML with Nokogiri](http://ruby.bastardsbook.com/chapters/html-parsing/) – ruby.bastardsbook.com
+ * [HOWTO parse HTML with Ruby & Nokogiri](https://readysteadycode.com/howto-parse-html-with-ruby-and-nokogiri) – readysteadycode.com
+ * [Class: Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) (documentation) – rubydoc.info


  ### `browser` object

-
+ A `browser` object is available from any spider instance method. It is a [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object which the spider uses to process requests and get page responses (`current_response` method). Usually, you don't need to touch it directly because `response` (see above) contains the page response after it was loaded.

- But if you need to interact with a page (like filling form fields, clicking elements, checkboxes, etc) `browser` is ready for you:
+ But, if you need to interact with a page (like filling form fields, clicking elements, checkboxes, etc) a `browser` is ready for you:

  ```ruby
  class GoogleSpider < Kimurai::Base
@@ -624,7 +505,7 @@ class GoogleSpider < Kimurai::Base
      browser.fill_in "q", with: "Kimurai web scraping framework"
      browser.click_button "Google Search"

-     # Update response
+     # Update response with current_response after interaction with a browser
      response = browser.current_response

      # Collect results
@@ -638,13 +519,13 @@ end
  ```

  Check out **Capybara cheat sheets** where you can see all available methods **to interact with browser**:
- * [UI Testing with RSpec and Capybara [cheat sheet]](http://cheatrags.com/capybara)
- * [Capybara Cheatsheet PDF](https://thoughtbot.com/upcase/test-driven-rails-resources/capybara.pdf)
- * [Class: Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) (documentation)
+ * [UI Testing with RSpec and Capybara [cheat sheet]](http://cheatrags.com/capybara) – cheatrags.com
+ * [Capybara Cheatsheet PDF](https://thoughtbot.com/upcase/test-driven-rails-resources/capybara.pdf) – thoughtbot.com
+ * [Class: Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) (documentation) – rubydoc.info

  ### `request_to` method

- For making requests to a particular method there is `request_to`. It requires
+ For making requests to a particular method, there is `request_to`. It requires at least two arguments: `:method_name` and `url:`. And, optionally `data:` (see above). Example:

  ```ruby
  class Spider < Kimurai::Base
@@ -662,7 +543,7 @@ class Spider < Kimurai::Base
  end
  ```

- Under the hood `request_to` simply
+ Under the hood, `request_to` simply calls [#visit](https://www.rubydoc.info/github/jnicklas/capybara/Capybara%2FSession:visit) (`browser.visit(url)`), and the provided method with arguments:

  <details/>
  <summary>request_to</summary>
@@ -677,10 +558,10 @@ end
  ```
  </details><br>

- `request_to`
+ The `request_to` helper method makes things simpler. We could also do something like:

  <details/>
- <summary>
+ <summary>See the code</summary>

  ```ruby
  class Spider < Kimurai::Base
@@ -703,7 +584,7 @@ end

  ### `save_to` helper

- Sometimes all
+ Sometimes all you need is to simply save scraped data to a file. You can use the `save_to` helper method like so:

  ```ruby
  class ProductsSpider < Kimurai::Base
@@ -719,31 +600,31 @@ class ProductsSpider < Kimurai::Base
      item[:description] = response.xpath("//desc/path").text.squish
      item[:price] = response.xpath("//price/path").text[/\d+/]&.to_f

-     #
+     # Append each new item to the `scraped_products.json` file:
      save_to "scraped_products.json", item, format: :json
    end
  end
  ```

  Supported formats:
- * `:json` JSON
- * `:pretty_json` "pretty" JSON (`JSON.pretty_generate`)
- * `:jsonlines` [JSON Lines](http://jsonlines.org/)
- * `:csv` CSV
+ * `:json` – JSON
+ * `:pretty_json` – "pretty" JSON (`JSON.pretty_generate`)
+ * `:jsonlines` – [JSON Lines](http://jsonlines.org/)
+ * `:csv` – CSV

- Note: `save_to` requires data (item to save
+ Note: `save_to` requires the data (item) to save to be a `Hash`.

- By default `save_to` add position key to an item hash. You can disable it
+ By default, `save_to` will add a position key to an item hash. You can disable it like so: `save_to "scraped_products.json", item, format: :json, position: false`

  **How helper works:**

-
+ While the spider is running, each new item will be appended to the output file. On the next run, this helper will clear the contents of the output file, then start appending items to it.

- > If you don't want file to be cleared before each run,
+ > If you don't want the file to be cleared before each run, pass `append: true` like so: `save_to "scraped_products.json", item, format: :json, append: true`

  ### Skip duplicates

- It's pretty common
+ It's pretty common for websites to have duplicate pages. For example, when an e-commerce site has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:

  ```ruby
  class ProductsSpider < Kimurai::Base
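As an aside before the duplicate-skipping example continues in the next hunk: the `save_to` options documented above combine as in the sketch below. The file names, fields and values are placeholders, not part of the released README.

```ruby
# Illustrative combinations of the save_to options documented above
# (inside a spider's parse method; file names and fields are placeholders).
def parse(response, url:, data: {})
  item = { title: "Example product", price: 9.99 }

  save_to "products.json",      item, format: :json                  # JSON; file is cleared on a new run
  save_to "products.jsonl",     item, format: :jsonlines             # JSON Lines
  save_to "products.csv",       item, format: :csv, position: false  # CSV without the position key
  save_to "all_products.json",  item, format: :json, append: true    # keep items from previous runs
end
```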
@@ -766,11 +647,11 @@ class ProductsSpider < Kimurai::Base
      end
    end

-   #
+   # And/or check products for uniqueness using product sku inside of parse_product:
    def parse_product(response, url:, data: {})
      item = {}
      item[:sku] = response.xpath("//product/sku/path").text.strip.upcase
-     # Don't save product
+     # Don't save the product if there is already an item with the same sku:
      return unless unique?(:sku, item[:sku])

      # ...
@@ -779,14 +660,14 @@ class ProductsSpider < Kimurai::Base
    end
  ```

- `unique?` helper works
+ The `unique?` helper works quite simply:

  ```ruby
- # Check
+ # Check for "http://example.com" in `url` scope for the first time:
  unique?(:url, "http://example.com")
  # => true

- #
+ # Next time:
  unique?(:url, "http://example.com")
  # => false
  ```
@@ -804,44 +685,44 @@ unique?(:id, 324234232)
|
|
|
804
685
|
unique?(:custom, "Lorem Ipsum")
|
|
805
686
|
```
|
|
806
687
|
|
|
807
|
-
#### Automatically skip all
|
|
688
|
+
#### Automatically skip all duplicate request urls
|
|
808
689
|
|
|
809
|
-
It
|
|
690
|
+
It's possible to automatically skip any previously visited urls when calling the `request_to` method using the `skip_duplicate_requests: true` config option. See [@config](#all-available-config-options) for additional options.
|
|
810
691
|
|
|
811
692
|
#### `storage` object
|
|
812
693
|
|
|
813
|
-
`unique?` method
|
|
694
|
+
The `unique?` method is just an alias for `storage#unique?`. Storage has several methods:
|
|
814
695
|
|
|
815
|
-
* `#all`
|
|
816
|
-
* `#
|
|
817
|
-
* `#
|
|
818
|
-
* `#unique?(scope, value)`
|
|
819
|
-
* `#clear!`
|
|
696
|
+
* `#all` – return all scopes
|
|
697
|
+
* `#add(scope, value)` – add a value to the scope
|
|
698
|
+
* `#include?(scope, value)` – returns `true` if the value exists in the scope, or `false` if it doesn't
|
|
699
|
+
* `#unique?(scope, value)` – returns `false` if the value exists in the scope, otherwise adds the value to the scope and returns `true`
|
|
700
|
+
* `#clear!` – deletes all values from all scopes
|
|
820
701
|
|
|
821
702
|
|
|
822
|
-
###
|
|
823
|
-
It
|
|
703
|
+
### Handling request errors
|
|
704
|
+
It's common while crawling web pages to get response codes other than `200 OK`. In such cases, the `request_to` method (or `browser.visit`) can raise an exception. Kimurai provides the `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
|
|
824
705
|
|
|
825
706
|
#### skip_request_errors
|
|
826
|
-
|
|
707
|
+
Kimurai can automatically skip certain errors while performing requests using the `skip_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, the error will be caught, and the request will be skipped. It's a good idea to skip errors like `404 Not Found`, etc.
|
|
827
708
|
|
|
828
|
-
|
|
709
|
+
`skip_request_errors` is an array of error classes and/or hashes. You can use a _hash_ for more flexibility like so:
|
|
829
710
|
|
|
830
711
|
```
|
|
831
712
|
@config = {
|
|
832
|
-
skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }]
|
|
713
|
+
skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }, { error: TimeoutError }]
|
|
833
714
|
}
|
|
834
715
|
```
|
|
835
|
-
In this case, provided `message:` will be compared with a full error message using `String#include?`.
|
|
716
|
+
In this case, the provided `message:` will be compared with a full error message using `String#include?`. You can also use regex like so: `{ error: RuntimeError, message: /404|403/ }`.
|
|
836
717
|
|
|
837
718
|
#### retry_request_errors
|
|
838
|
-
|
|
719
|
+
Kimurai can automatically retry requests several times after certain errors with the `retry_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, the error will be caught, and the request will be processed again with progressive delay.
|
|
839
720
|
|
|
840
|
-
There are 3 attempts
|
|
721
|
+
There are 3 attempts with _15 sec_, _30 sec_, and _45 sec_ delays, respectively. If after 3 attempts there is still an exception, then the exception will be raised. It's a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
|
|
841
722
|
|
|
842
|
-
|
|
723
|
+
The format for `retry_request_errors` is the same as for `skip_request_errors`.
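
For example, a config mixing a bare error class with the hash form might look like this (the particular classes and message are illustrative, not required values):

```ruby
@config = {
  retry_request_errors: [
    Net::ReadTimeout,
    { error: RuntimeError, message: "502 => Net::HTTPBadGateway" }
  ]
}
```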
|
|
843
724
|
|
|
844
|
-
If you would like to skip (not raise) error after
|
|
725
|
+
If you would like to skip (not raise) the error after the 3 retries, you can specify `skip_on_failure: true` like so:
|
|
845
726
|
|
|
846
727
|
```ruby
|
|
847
728
|
@config = {
|
|
@@ -851,7 +732,7 @@ If you would like to skip (not raise) error after all retries gone, you can spec
|
|
|
851
732
|
|
|
852
733
|
### Logging custom events
|
|
853
734
|
|
|
854
|
-
It
|
|
735
|
+
It's possible to save custom messages to the [run_info](#open_spider-and-close_spider-callbacks) hash using the `add_event('Some message')` method. This feature helps you to keep track of important events during crawling without checking the whole spider log (in case you're logging these messages using `logger`). For example:
|
|
855
736
|
|
|
856
737
|
```ruby
|
|
857
738
|
def parse_product(response, url:, data: {})
|
|
@@ -872,7 +753,7 @@ I, [2018-11-28 22:20:19 +0400#7402] [M: 47156576560640] INFO -- example_spider:
|
|
|
872
753
|
|
|
873
754
|
### `open_spider` and `close_spider` callbacks
|
|
874
755
|
|
|
875
|
-
You can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action before
|
|
756
|
+
You can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action(s) before or after the spider runs:
|
|
876
757
|
|
|
877
758
|
```ruby
|
|
878
759
|
require 'kimurai'
|
|
@@ -917,7 +798,7 @@ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider:
|
|
|
917
798
|
```
|
|
918
799
|
</details><br>
|
|
919
800
|
|
|
920
|
-
|
|
801
|
+
The `run_info` method is available from the `open_spider` and `close_spider` class methods. It contains useful information about the spider state:
|
|
921
802
|
|
|
922
803
|
```ruby
|
|
923
804
|
11: def self.open_spider
|
|
@@ -937,7 +818,7 @@ Inside `open_spider` and `close_spider` class methods there is available `run_in
|
|
|
937
818
|
}
|
|
938
819
|
```
|
|
939
820
|
|
|
940
|
-
|
|
821
|
+
Inside `close_spider`, `run_info` will already be updated:
|
|
941
822
|
|
|
942
823
|
```ruby
|
|
943
824
|
15: def self.close_spider
|
|
@@ -957,7 +838,7 @@ Inside `close_spider`, `run_info` will be updated:
|
|
|
957
838
|
}
|
|
958
839
|
```
|
|
959
840
|
|
|
960
|
-
`run_info[:status]` helps to determine if spider
|
|
841
|
+
`run_info[:status]` helps to determine if the spider finished successfully or failed (possible values: `:completed`, `:failed`):
|
|
961
842
|
|
|
962
843
|
```ruby
|
|
963
844
|
class ExampleSpider < Kimurai::Base
|
|
@@ -1005,12 +886,12 @@ example_spider.rb:15:in `parse': undefined method `strip' for nil:NilClass (NoMe
|
|
|
1005
886
|
```
|
|
1006
887
|
</details><br>
|
|
1007
888
|
|
|
1008
|
-
**Usage example:** if spider finished successfully, send JSON file with scraped items to a remote FTP location, otherwise (if spider failed), skip incompleted results and send email/notification to
|
|
889
|
+
**Usage example:** if the spider finished successfully, send a JSON file with scraped items to a remote FTP location, otherwise (if the spider failed), skip incomplete results and send an email/notification to Slack about it:
|
|
1009
890
|
|
|
1010
891
|
<details/>
|
|
1011
892
|
<summary>Example</summary>
|
|
1012
893
|
|
|
1013
|
-
|
|
894
|
+
You can also use the additional methods `completed?` or `failed?`:
|
|
1014
895
|
|
|
1015
896
|
```ruby
|
|
1016
897
|
class Spider < Kimurai::Base
|
|
@@ -1047,7 +928,7 @@ end
|
|
|
1047
928
|
|
|
1048
929
|
|
|
1049
930
|
### `KIMURAI_ENV`
|
|
1050
|
-
Kimurai
|
|
931
|
+
Kimurai supports environments. The default is `development`. To set a custom environment, provide the `KIMURAI_ENV` environment variable like so: `$ KIMURAI_ENV=production ruby spider.rb`. To access the current environment, use the `Kimurai.env` method.
|
|
1051
932
|
|
|
1052
933
|
Usage example:
|
|
1053
934
|
```ruby
|
|
@@ -1068,7 +949,7 @@ end
|
|
|
1068
949
|
```
|
|
1069
950
|
|
|
1070
951
|
### Parallel crawling using `in_parallel`
|
|
1071
|
-
Kimurai can process web pages concurrently
|
|
952
|
+
Kimurai can process web pages concurrently: `in_parallel(:parse_product, urls, threads: 3)`, where `:parse_product` is the method to process each url, `urls` is an array of urls to crawl, and `threads:` is the number of threads:
|
|
1072
953
|
|
|
1073
954
|
```ruby
|
|
1074
955
|
# amazon_spider.rb
|
|
@@ -1083,7 +964,7 @@ class AmazonSpider < Kimurai::Base
|
|
|
1083
964
|
browser.fill_in "field-keywords", with: "Web Scraping Books"
|
|
1084
965
|
browser.click_on "Go"
|
|
1085
966
|
|
|
1086
|
-
# Walk through pagination and collect
|
|
967
|
+
# Walk through pagination and collect product urls:
|
|
1087
968
|
urls = []
|
|
1088
969
|
loop do
|
|
1089
970
|
response = browser.current_response
|
|
@@ -1094,7 +975,7 @@ class AmazonSpider < Kimurai::Base
|
|
|
1094
975
|
browser.find(:xpath, "//a[@id='pagnNextLink']", wait: 1).click rescue break
|
|
1095
976
|
end
|
|
1096
977
|
|
|
1097
|
-
# Process all collected urls concurrently
|
|
978
|
+
# Process all collected urls concurrently using 3 threads:
|
|
1098
979
|
in_parallel(:parse_book_page, urls, threads: 3)
|
|
1099
980
|
end
|
|
1100
981
|
|
|
@@ -1117,50 +998,22 @@ AmazonSpider.crawl!
|
|
|
1117
998
|
<summary>Run: <code>$ ruby amazon_spider.rb</code></summary>
|
|
1118
999
|
|
|
1119
1000
|
```
|
|
1120
|
-
|
|
1121
|
-
D, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
|
|
1122
|
-
I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
|
|
1123
|
-
I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
|
|
1124
|
-
I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Info: visits: requests: 1, responses: 1
|
|
1125
|
-
|
|
1126
|
-
I, [2018-08-22 14:48:43 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: starting processing 52 urls within 3 threads
|
|
1127
|
-
D, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
|
|
1128
|
-
I, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
|
|
1129
|
-
D, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
|
|
1130
|
-
I, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
|
|
1131
|
-
D, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
|
|
1132
|
-
I, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
|
|
1133
|
-
I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
|
|
1134
|
-
I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Info: visits: requests: 4, responses: 2
|
|
1135
|
-
I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/
|
|
1136
|
-
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
|
|
1137
|
-
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 5, responses: 3
|
|
1138
|
-
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/
|
|
1139
|
-
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
|
|
1140
|
-
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 6, responses: 4
|
|
1141
|
-
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Excel-Effective-Scrapes-ebook/dp/B01CMMJGZ8/
|
|
1001
|
+
$ ruby amazon_spider.rb
|
|
1142
1002
|
|
|
1143
1003
|
...
|
|
1144
1004
|
|
|
1145
|
-
I, [
|
|
1146
|
-
I, [
|
|
1147
|
-
I, [
|
|
1148
|
-
I, [
|
|
1149
|
-
I, [
|
|
1150
|
-
I, [
|
|
1151
|
-
I, [
|
|
1152
|
-
I, [
|
|
1153
|
-
I, [
|
|
1154
|
-
I, [
|
|
1155
|
-
I, [
|
|
1156
|
-
|
|
1157
|
-
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 53, responses: 53
|
|
1158
|
-
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
|
|
1159
|
-
|
|
1160
|
-
I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: stopped processing 52 urls within 3 threads, total time: 29s
|
|
1161
|
-
I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
|
|
1162
|
-
|
|
1163
|
-
I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 14:48:37 +0400, :stop_time=>2018-08-22 14:49:12 +0400, :running_time=>"35s", :visits=>{:requests=>53, :responses=>53}, :error=>nil}
|
|
1005
|
+
I, [2025-12-16 13:48:19 +0300#39167] [C: 1624] INFO -- amazon_spider: Info: visits: requests: 305, responses: 305
|
|
1006
|
+
I, [2025-12-16 13:48:19 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Real-World-Python-Hackers-Solving-Problems/dp/1718500629/
|
|
1007
|
+
I, [2025-12-16 13:48:22 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Real-World-Python-Hackers-Solving-Problems/dp/1718500629/
|
|
1008
|
+
I, [2025-12-16 13:48:22 +0300#39167] [C: 1624] INFO -- amazon_spider: Info: visits: requests: 306, responses: 306
|
|
1009
|
+
I, [2025-12-16 13:48:22 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Introduction-Important-efficient-collection-scraping-ebook/dp/B0D2MLXFT6/
|
|
1010
|
+
I, [2025-12-16 13:48:23 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Introduction-Important-efficient-collection-scraping-ebook/dp/B0D2MLXFT6/
|
|
1011
|
+
I, [2025-12-16 13:48:23 +0300#39167] [C: 1624] INFO -- amazon_spider: Info: visits: requests: 307, responses: 307
|
|
1012
|
+
I, [2025-12-16 13:48:23 +0300#39167] [C: 1624] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
|
|
1013
|
+
I, [2025-12-16 13:48:23 +0300#39167] [M: 1152] INFO -- amazon_spider: Spider: in_parallel: stopped processing 306 urls within 3 threads, total time: 2m, 37s
|
|
1014
|
+
I, [2025-12-16 13:48:23 +0300#39167] [M: 1152] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
|
|
1015
|
+
I, [2025-12-16 13:48:23 +0300#39167] [M: 1152] INFO -- amazon_spider: Spider: stopped: {spider_name: "amazon_spider", status: :completed, error: nil, environment: "development", start_time: 2025-12-16 13:45:12.5338 +0300, stop_time: 2025-12-16 13:48:23.526221 +0300, running_time: "3m, 10s", visits: {requests: 307, responses: 307}, items: {sent: 0, processed: 0}, events: {requests_errors: {}, drop_items_errors: {}, custom: {}}}
|
|
1016
|
+
|
|
1164
1017
|
|
|
1165
1018
|
```
|
|
1166
1019
|
</details>
|
|
@@ -1171,35 +1024,39 @@ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider:
|
|
|
1171
1024
|
```json
|
|
1172
1025
|
[
|
|
1173
1026
|
{
|
|
1174
|
-
"title": "Web Scraping with Python:
|
|
1175
|
-
"url": "https://www.amazon.com/Web-Scraping-Python-
|
|
1176
|
-
"price": "$
|
|
1177
|
-
"
|
|
1027
|
+
"title": "Web Scraping with Python: Data Extraction from the Modern Web 3rd Edition",
|
|
1028
|
+
"url": "https://www.amazon.com/Web-Scraping-Python-Extraction-Modern/dp/1098145356/",
|
|
1029
|
+
"price": "$27.00",
|
|
1030
|
+
"author": "Ryan Mitchell",
|
|
1031
|
+
"publication_date": "March 26, 2024",
|
|
1178
1032
|
"position": 1
|
|
1179
1033
|
},
|
|
1180
1034
|
{
|
|
1181
|
-
"title": "
|
|
1182
|
-
"url": "https://www.amazon.com/
|
|
1183
|
-
"price": "$
|
|
1184
|
-
"
|
|
1035
|
+
"title": "Web Scraping with Python: Collecting More Data from the Modern Web 2nd Edition",
|
|
1036
|
+
"url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/",
|
|
1037
|
+
"price": "$13.20 - $38.15",
|
|
1038
|
+
"author": "Ryan Mitchell",
|
|
1039
|
+
"publication_date": "May 8, 2018",
|
|
1185
1040
|
"position": 2
|
|
1186
1041
|
},
|
|
1187
1042
|
{
|
|
1188
|
-
"title": "
|
|
1189
|
-
"url": "https://www.amazon.com/
|
|
1190
|
-
"price": "$
|
|
1191
|
-
"
|
|
1043
|
+
"title": "Scripting: Automation with Bash, PowerShell, and Python—Automate Everyday IT Tasks from Backups to Web Scraping in Just a Few Lines of Code (Rheinwerk Computing) First Edition",
|
|
1044
|
+
"url": "https://www.amazon.com/Scripting-Automation-Bash-PowerShell-Python/dp/1493225561/",
|
|
1045
|
+
"price": "$47.02",
|
|
1046
|
+
"author": "Michael Kofler",
|
|
1047
|
+
"publication_date": "February 25, 2024",
|
|
1192
1048
|
"position": 3
|
|
1193
1049
|
},
|
|
1194
1050
|
|
|
1195
|
-
...
|
|
1196
|
-
|
|
1051
|
+
// ...
|
|
1052
|
+
|
|
1197
1053
|
{
|
|
1198
|
-
"title": "
|
|
1199
|
-
"url": "https://www.amazon.com/
|
|
1200
|
-
"price": "$
|
|
1201
|
-
"
|
|
1202
|
-
"
|
|
1054
|
+
"title": "Introduction to Python Important points for efficient data collection with scraping (Japanese Edition) Kindle Edition",
|
|
1055
|
+
"url": "https://www.amazon.com/Introduction-Important-efficient-collection-scraping-ebook/dp/B0D2MLXFT6/",
|
|
1056
|
+
"price": "$0.00",
|
|
1057
|
+
"author": "r",
|
|
1058
|
+
"publication_date": "April 24, 2024",
|
|
1059
|
+
"position": 306
|
|
1203
1060
|
}
|
|
1204
1061
|
]
|
|
1205
1062
|
```
|
|
@@ -1207,11 +1064,12 @@ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider:
|
|
|
1207
1064
|
|
|
1208
1065
|
> Note that [save_to](#save_to-helper) and [unique?](#skip-duplicates-unique-helper) helpers are thread-safe (protected by [Mutex](https://ruby-doc.org/core-2.5.1/Mutex.html)) and can be freely used inside threads.
|
|
1209
1066
|
|
|
1210
|
-
`in_parallel` can take additional
|
|
1211
|
-
|
|
1212
|
-
* `
|
|
1213
|
-
* `
|
|
1214
|
-
* `
|
|
1067
|
+
`in_parallel` can take additional parameters (they can be combined, as the sketch after this list shows):
|
|
1068
|
+
|
|
1069
|
+
* `data:` – pass custom data like so: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
|
|
1070
|
+
* `delay:` – set delay between requests like so: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a Range, the delay (in seconds) will be set randomly for each request: `rand(2..5) # => 3`
|
|
1071
|
+
* `engine:` – set custom engine like so: `in_parallel(:method, urls, threads: 3, engine: :selenium_chrome)`
|
|
1072
|
+
* `config:` – set custom [config](#spider-config) options
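
A combined call might look like this (the method name, urls and option values are placeholders):

```ruby
in_parallel(
  :parse_product,                  # spider method to call for each url
  urls,                            # array of urls collected earlier
  threads: 3,                      # number of threads (each gets its own browser instance)
  data: { category: "Scraping" },  # passed through to the method's data: argument
  delay: 2..5,                     # random 2..5 second pause before each request
  engine: :mechanize,              # use a different engine just for these requests
  config: { user_agent: "Mozilla/5.0 Firefox/61.0" } # per-call config overrides
)
```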
|
|
1215
1073
|
|
|
1216
1074
|
### Active Support included
|
|
1217
1075
|
|
|
@@ -1219,7 +1077,7 @@ You can use all the power of familiar [Rails core-ext methods](https://guides.ru
|
|
|
1219
1077
|
|
|
1220
1078
|
### Schedule spiders using Cron
|
|
1221
1079
|
|
|
1222
|
-
1) Inside spider directory generate [Whenever](https://github.com/javan/whenever)
|
|
1080
|
+
1) Inside the spider directory generate a [Whenever](https://github.com/javan/whenever) schedule configuration like so: `$ kimurai generate schedule`.
|
|
1223
1081
|
|
|
1224
1082
|
<details/>
|
|
1225
1083
|
<summary><code>schedule.rb</code></summary>
|
|
@@ -1228,7 +1086,7 @@ You can use all the power of familiar [Rails core-ext methods](https://guides.ru
|
|
|
1228
1086
|
### Settings ###
|
|
1229
1087
|
require 'tzinfo'
|
|
1230
1088
|
|
|
1231
|
-
# Export current PATH
|
|
1089
|
+
# Export current PATH for cron
|
|
1232
1090
|
env :PATH, ENV["PATH"]
|
|
1233
1091
|
|
|
1234
1092
|
# Use 24 hour format when using `at:` option
|
|
@@ -1236,8 +1094,8 @@ set :chronic_options, hours24: true
|
|
|
1236
1094
|
|
|
1237
1095
|
# Use local_to_utc helper to setup execution time using your local timezone instead
|
|
1238
1096
|
# of server's timezone (which is probably and should be UTC, to check run `$ timedatectl`).
|
|
1239
|
-
#
|
|
1240
|
-
#
|
|
1097
|
+
# You should also set the same timezone in kimurai (use `Kimurai.configuration.time_zone =` for that).
|
|
1098
|
+
#
|
|
1241
1099
|
# Example usage of helper:
|
|
1242
1100
|
# every 1.day, at: local_to_utc("7:00", zone: "Europe/Moscow") do
|
|
1243
1101
|
# crawl "google_spider.com", output: "log/google_spider.com.log"
|
|
@@ -1248,7 +1106,7 @@ end
|
|
|
1248
1106
|
|
|
1249
1107
|
# Note: by default Whenever exports cron commands with :environment == "production".
|
|
1250
1108
|
# Note: Whenever can only append log data to a log file (>>). If you want
|
|
1251
|
-
# to overwrite (>) log file before each run,
|
|
1109
|
+
# to overwrite (>) a log file before each run, use lambda notation:
|
|
1252
1110
|
# crawl "google_spider.com", output: -> { "> log/google_spider.com.log 2>&1" }
|
|
1253
1111
|
|
|
1254
1112
|
# Project job types
|
|
@@ -1261,31 +1119,29 @@ job_type :single, "cd :path && KIMURAI_ENV=:environment ruby :task :output"
|
|
|
1261
1119
|
job_type :single_bundle, "cd :path && KIMURAI_ENV=:environment bundle exec ruby :task :output"
|
|
1262
1120
|
|
|
1263
1121
|
### Schedule ###
|
|
1264
|
-
# Usage (
|
|
1122
|
+
# Usage (see examples here https://github.com/javan/whenever#example-schedulerb-file):
|
|
1265
1123
|
# every 1.day do
|
|
1266
1124
|
# Example to schedule a single spider in the project:
|
|
1267
1125
|
# crawl "google_spider.com", output: "log/google_spider.com.log"
|
|
1268
1126
|
|
|
1269
1127
|
# Example to schedule all spiders in the project using runner. Each spider will write
|
|
1270
|
-
#
|
|
1271
|
-
# Runner output will be written to log/runner.log
|
|
1272
|
-
# Argument number it's a count of concurrent jobs:
|
|
1273
|
-
# runner 3, output:"log/runner.log"
|
|
1128
|
+
# its own output to the `log/spider_name.log` file (handled by runner itself).
|
|
1129
|
+
# Runner output will be written to log/runner.log
|
|
1274
1130
|
|
|
1275
|
-
# Example to schedule single spider (without project):
|
|
1131
|
+
# Example to schedule single spider (without a project):
|
|
1276
1132
|
# single "single_spider.rb", output: "single_spider.log"
|
|
1277
1133
|
# end
|
|
1278
1134
|
|
|
1279
|
-
### How to set a cron schedule ###
|
|
1135
|
+
### How to set up a cron schedule ###
|
|
1280
1136
|
# Run: `$ whenever --update-crontab --load-file config/schedule.rb`.
|
|
1281
|
-
# If you don't have whenever command, install the gem: `$ gem install whenever`.
|
|
1137
|
+
# If you don't have the whenever command, install the gem like so: `$ gem install whenever`.
|
|
1282
1138
|
|
|
1283
1139
|
### How to cancel a schedule ###
|
|
1284
1140
|
# Run: `$ whenever --clear-crontab --load-file config/schedule.rb`.
|
|
1285
1141
|
```
|
|
1286
1142
|
</details><br>
|
|
1287
1143
|
|
|
1288
|
-
2)
|
|
1144
|
+
2) At the bottom of `schedule.rb`, add the following code:
|
|
1289
1145
|
|
|
1290
1146
|
```ruby
|
|
1291
1147
|
every 1.day, at: "7:00" do
|
|
@@ -1295,14 +1151,14 @@ end
|
|
|
1295
1151
|
|
|
1296
1152
|
3) Run: `$ whenever --update-crontab --load-file schedule.rb`. Done!
|
|
1297
1153
|
|
|
1298
|
-
You can
|
|
1154
|
+
You can see some [Whenever](https://github.com/javan/whenever) examples [here](https://github.com/javan/whenever#example-schedulerb-file). To cancel a schedule, run: `$ whenever --clear-crontab --load-file schedule.rb`.
|
|
1299
1155
|
|
|
1300
1156
|
### Configuration options
|
|
1301
|
-
You can configure several options
|
|
1157
|
+
You can configure several options inside the `configure` block:
|
|
1302
1158
|
|
|
1303
1159
|
```ruby
|
|
1304
1160
|
Kimurai.configure do |config|
|
|
1305
|
-
#
|
|
1161
|
+
# The default logger has colorized mode enabled in development.
|
|
1306
1162
|
# If you would like to disable it, set `colorize_logger` to false.
|
|
1307
1163
|
# config.colorize_logger = false
|
|
1308
1164
|
|
|
@@ -1323,13 +1179,13 @@ Kimurai.configure do |config|
|
|
|
1323
1179
|
end
|
|
1324
1180
|
```
|
|
1325
1181
|
|
|
1326
|
-
### Using Kimurai inside existing Ruby
|
|
1182
|
+
### Using Kimurai inside existing Ruby applications
|
|
1327
1183
|
|
|
1328
|
-
You can integrate Kimurai spiders (which are just Ruby classes)
|
|
1184
|
+
You can integrate Kimurai spiders (which are just Ruby classes) into an existing Ruby application like Rails or Sinatra, and run them using background jobs, for example. See the following sections to understand the process of running spiders:
|
|
1329
1185
|
|
|
1330
1186
|
#### `.crawl!` method
|
|
1331
1187
|
|
|
1332
|
-
`.crawl!` (class method) performs a _full run_ of a particular spider. This method will return run_info if
|
|
1188
|
+
`.crawl!` (class method) performs a _full run_ of a particular spider. This method will return run_info if it was successful, or an exception if something went wrong.
|
|
1333
1189
|
|
|
1334
1190
|
```ruby
|
|
1335
1191
|
class ExampleSpider < Kimurai::Base
|
|
@@ -1346,7 +1202,7 @@ ExampleSpider.crawl!
|
|
|
1346
1202
|
# => { :spider_name => "example_spider", :status => :completed, :environment => "development", :start_time => 2018-08-22 18:20:16 +0400, :stop_time => 2018-08-22 18:20:17 +0400, :running_time => 1.216, :visits => { :requests => 1, :responses => 1 }, :items => { :sent => 0, :processed => 0 }, :error => nil }
|
|
1347
1203
|
```
|
|
1348
1204
|
|
|
1349
|
-
You can't `.crawl!` spider in different thread if it still running (because spider instances store some shared data in the `@run_info` class variable while `crawl`ing):
|
|
1205
|
+
You can't `.crawl!` a spider in a different thread if it's still running (because spider instances store some shared data in the `@run_info` class variable while `crawl`ing):
|
|
1350
1206
|
|
|
1351
1207
|
```ruby
|
|
1352
1208
|
2.times do |i|
|
|
@@ -1360,11 +1216,11 @@ end # =>
|
|
|
1360
1216
|
# {:spider_name=>"example_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 18:49:22 +0400, :stop_time=>2018-08-22 18:49:23 +0400, :running_time=>0.801, :visits=>{:requests=>1, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :error=>nil}
|
|
1361
1217
|
```
|
|
1362
1218
|
|
|
1363
|
-
So what if you
|
|
1219
|
+
So, what if you don't care about stats and just want to process a request with a particular spider method and get the return value from this method? Use `.parse!` instead:
|
|
1364
1220
|
|
|
1365
1221
|
#### `.parse!(:method_name, url:)` method
|
|
1366
1222
|
|
|
1367
|
-
`.parse!` (class method) creates a new spider instance and performs a request
|
|
1223
|
+
The `.parse!` (class method) creates a new spider instance and performs a request with the provided method and url. The value from the method will be returned:
|
|
1368
1224
|
|
|
1369
1225
|
```ruby
|
|
1370
1226
|
class ExampleSpider < Kimurai::Base
|
|
@@ -1381,7 +1237,7 @@ ExampleSpider.parse!(:parse, url: "https://example.com/")
|
|
|
1381
1237
|
# => "Example Domain"
|
|
1382
1238
|
```
|
|
1383
1239
|
|
|
1384
|
-
Like `.crawl!`, `.parse!` method
|
|
1240
|
+
Like `.crawl!`, the `.parse!` method creates a browser instance and destroys it (`browser.destroy_driver!`) before returning the value. Unlike `.crawl!`, the `.parse!` method can be called from different threads at the same time:
|
|
1385
1241
|
|
|
1386
1242
|
```ruby
|
|
1387
1243
|
urls = ["https://www.google.com/", "https://www.reddit.com/", "https://en.wikipedia.org/"]
|
|
@@ -1395,7 +1251,7 @@ end # =>
|
|
|
1395
1251
|
# "reddit: the front page of the internetHotHot"
|
|
1396
1252
|
```
|
|
1397
1253
|
|
|
1398
|
-
Keep in mind, that [save_to](#save_to-helper) and [unique?](#skip-duplicates) helpers are not thread-safe while using `.parse!` method.
|
|
1254
|
+
Keep in mind that [save_to](#save_to-helper) and [unique?](#skip-duplicates) helpers are not thread-safe while using the `.parse!` method.
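
If you still need to collect or save results from several `.parse!` threads, one simple workaround is to guard the shared part with your own Mutex (a sketch reusing the `ExampleSpider` and urls from the example above):

```ruby
mutex   = Mutex.new
results = []

urls = ["https://www.google.com/", "https://www.reddit.com/", "https://en.wikipedia.org/"]

urls.map do |url|
  Thread.new do
    title = ExampleSpider.parse!(:parse, url: url)
    # Serialize access to the shared array (or to a save_to/unique? call) yourself:
    mutex.synchronize { results << { url: url, title: title } }
  end
end.each(&:join)

results # => one hash per url, collected from all threads
```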
|
|
1399
1255
|
|
|
1400
1256
|
#### `Kimurai.list` and `Kimurai.find_by_name()`
|
|
1401
1257
|
|
|
@@ -1416,64 +1272,21 @@ end
|
|
|
1416
1272
|
Kimurai.list
|
|
1417
1273
|
# => {"google_spider"=>GoogleSpider, "reddit_spider"=>RedditSpider, "wikipedia_spider"=>WikipediaSpider}
|
|
1418
1274
|
|
|
1419
|
-
# To find a particular spider class by
|
|
1275
|
+
# To find a particular spider class by its name:
|
|
1420
1276
|
Kimurai.find_by_name("reddit_spider")
|
|
1421
1277
|
# => RedditSpider
|
|
1422
1278
|
```
|
|
1423
1279
|
|
|
1424
|
-
|
|
1425
|
-
### Automated sever setup and deployment
|
|
1426
|
-
> **EXPERIMENTAL**
|
|
1427
|
-
|
|
1428
|
-
#### Setup
|
|
1429
|
-
You can automatically setup [required environment](#installation) for Kimurai on the remote server (currently there is only Ubuntu Server 18.04 support) using `$ kimurai setup` command. `setup` will perform installation of: latest Ruby with Rbenv, browsers with webdrivers and in additional databases clients (only clients) for MySQL, Postgres and MongoDB (so you can connect to a remote database from ruby).
|
|
1430
|
-
|
|
1431
|
-
> To perform remote server setup, [Ansible](https://github.com/ansible/ansible) is required **on the desktop** machine (to install: Ubuntu: `$ sudo apt install ansible`, Mac OS X: `$ brew install ansible`)
|
|
1432
|
-
|
|
1433
|
-
> It's recommended to use regular user to setup the server, not `root`. To create a new user, login to the server `$ ssh root@your_server_ip`, type `$ adduser username` to create a user, and `$ gpasswd -a username sudo` to add new user to a sudo group.
|
|
1434
|
-
|
|
1435
|
-
Example:
|
|
1436
|
-
|
|
1437
|
-
```bash
|
|
1438
|
-
$ kimurai setup deploy@123.123.123.123 --ask-sudo --ssh-key-path path/to/private_key
|
|
1439
|
-
```
|
|
1440
|
-
|
|
1441
|
-
CLI options:
|
|
1442
|
-
* `--ask-sudo` pass this option to ask sudo (user) password for system-wide installation of packages (`apt install`)
|
|
1443
|
-
* `--ssh-key-path path/to/private_key` authorization on the server using private ssh key. You can omit it if required key already [added to keychain](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/#adding-your-ssh-key-to-the-ssh-agent) on your desktop (Ansible uses [SSH agent forwarding](https://developer.github.com/v3/guides/using-ssh-agent-forwarding/))
|
|
1444
|
-
* `--ask-auth-pass` authorization on the server using user password, alternative option to `--ssh-key-path`.
|
|
1445
|
-
* `-p port_number` custom port for ssh connection (`-p 2222`)
|
|
1446
|
-
|
|
1447
|
-
> You can check setup playbook [here](lib/kimurai/automation/setup.yml)
|
|
1448
|
-
|
|
1449
|
-
#### Deploy
|
|
1450
|
-
|
|
1451
|
-
After successful `setup` you can deploy a spider to the remote server using `$ kimurai deploy` command. On each deploy there are performing several tasks: 1) pull repo from a remote origin to `~/repo_name` user directory 2) run `bundle install` 3) Update crontab `whenever --update-crontab` (to update spider schedule from schedule.rb file).
|
|
1452
|
-
|
|
1453
|
-
Before `deploy` make sure that inside spider directory you have: 1) git repository with remote origin (bitbucket, github, etc.) 2) `Gemfile` 3) schedule.rb inside subfolder `config` (`config/schedule.rb`).
|
|
1454
|
-
|
|
1455
|
-
Example:
|
|
1456
|
-
|
|
1457
|
-
```bash
|
|
1458
|
-
$ kimurai deploy deploy@123.123.123.123 --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key
|
|
1459
|
-
```
|
|
1460
|
-
|
|
1461
|
-
CLI options: _same like for [setup](#setup) command_ (except `--ask-sudo`), plus
|
|
1462
|
-
* `--repo-url` provide custom repo url (`--repo-url git@bitbucket.org:username/repo_name.git`), otherwise current `origin/master` will be taken (output from `$ git remote get-url origin`)
|
|
1463
|
-
* `--repo-key-path` if git repository is private, authorization is required to pull the code on the remote server. Use this option to provide a private repository SSH key. You can omit it if required key already added to keychain on your desktop (same like with `--ssh-key-path` option)
|
|
1464
|
-
|
|
1465
|
-
> You can check deploy playbook [here](lib/kimurai/automation/deploy.yml)
|
|
1466
|
-
|
|
1467
1280
|
## Spider `@config`
|
|
1468
1281
|
|
|
1469
|
-
Using `@config` you can set several options for a spider
|
|
1282
|
+
Using `@config` you can set several options for a spider, such as proxy, user-agent, default cookies/headers, delay between requests, browser **memory control**, and so on:
|
|
1470
1283
|
|
|
1471
1284
|
```ruby
|
|
1472
1285
|
class Spider < Kimurai::Base
|
|
1473
1286
|
USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
|
|
1474
1287
|
PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]
|
|
1475
1288
|
|
|
1476
|
-
@engine = :
|
|
1289
|
+
@engine = :selenium_chrome
|
|
1477
1290
|
@start_urls = ["https://example.com/"]
|
|
1478
1291
|
@config = {
|
|
1479
1292
|
headers: { "custom_header" => "custom_value" },
|
|
@@ -1493,7 +1306,7 @@ class Spider < Kimurai::Base
|
|
|
1493
1306
|
change_proxy: true,
|
|
1494
1307
|
# Clear all cookies and set default cookies (if provided) before each request:
|
|
1495
1308
|
clear_and_set_cookies: true,
|
|
1496
|
-
#
|
|
1309
|
+
# Set a delay before each request:
|
|
1497
1310
|
delay: 1..3
|
|
1498
1311
|
}
|
|
1499
1312
|
}
|
|
@@ -1508,94 +1321,116 @@ end
|
|
|
1508
1321
|
|
|
1509
1322
|
```ruby
|
|
1510
1323
|
@config = {
|
|
1511
|
-
# Custom headers
|
|
1512
|
-
# Works
|
|
1324
|
+
# Custom headers hash. Example: { "some header" => "some value", "another header" => "another value" }
|
|
1325
|
+
# Works for :mechanize. Selenium doesn't support setting headers.
|
|
1513
1326
|
headers: {},
|
|
1514
1327
|
|
|
1515
|
-
# Custom User Agent
|
|
1328
|
+
# Custom User Agent – string or lambda
|
|
1329
|
+
#
|
|
1516
1330
|
# Use lambda if you want to rotate user agents before each run:
|
|
1517
|
-
#
|
|
1331
|
+
# user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
|
|
1332
|
+
#
|
|
1518
1333
|
# Works for all engines
|
|
1519
1334
|
user_agent: "Mozilla/5.0 Firefox/61.0",
|
|
1520
1335
|
|
|
1521
|
-
# Custom cookies
|
|
1336
|
+
# Custom cookies – an array of hashes
|
|
1522
1337
|
# Format for a single cookie: { name: "cookie name", value: "cookie value", domain: ".example.com" }
|
|
1338
|
+
#
|
|
1523
1339
|
# Works for all engines
|
|
1524
1340
|
cookies: [],
|
|
1525
1341
|
|
|
1526
|
-
# Proxy
|
|
1527
|
-
#
|
|
1342
|
+
# Proxy – string or lambda. Format for a proxy string: "ip:port:protocol:user:password"
|
|
1343
|
+
# `protocol` can be http or socks5. User and password are optional.
|
|
1344
|
+
#
|
|
1528
1345
|
# Use lambda if you want to rotate proxies before each run:
|
|
1529
|
-
#
|
|
1530
|
-
#
|
|
1531
|
-
#
|
|
1346
|
+
# proxy: -> { ARRAY_OF_PROXIES.sample }
|
|
1347
|
+
#
|
|
1348
|
+
# Works for all engines, but keep in mind that Selenium drivers don't support proxies
|
|
1349
|
+
# with authorization. Also, Mechanize doesn't support socks5 proxy format (only http).
|
|
1532
1350
|
proxy: "3.4.5.6:3128:http:user:pass",
|
|
1533
1351
|
|
|
1534
1352
|
# If enabled, browser will ignore any https errors. It's handy while using a proxy
|
|
1535
|
-
# with self-signed SSL cert (for example Crawlera or Mitmproxy)
|
|
1536
|
-
#
|
|
1353
|
+
# with a self-signed SSL cert (for example Crawlera or Mitmproxy). It will allow you to
|
|
1354
|
+
# visit web pages with expired SSL certificates.
|
|
1355
|
+
#
|
|
1537
1356
|
# Works for all engines
|
|
1538
1357
|
ignore_ssl_errors: true,
|
|
1539
1358
|
|
|
1540
1359
|
# Custom window size, works for all engines
|
|
1541
1360
|
window_size: [1366, 768],
|
|
1542
1361
|
|
|
1543
|
-
# Skip images
|
|
1362
|
+
# Skip loading images if true, works for all engines. Speeds up page load time.
|
|
1544
1363
|
disable_images: true,
|
|
1545
1364
|
|
|
1546
|
-
# Selenium engines only: headless mode, `:native` or `:virtual_display` (default is :native)
|
|
1547
|
-
# Although native mode has
|
|
1548
|
-
# sometimes
|
|
1549
|
-
# headless chrome, so you can use virtual_display mode instead
|
|
1365
|
+
# For Selenium engines only: headless mode, `:native` or `:virtual_display` (default is :native)
|
|
1366
|
+
# Although native mode has better performance, virtual display mode
|
|
1367
|
+
# can sometimes be useful. For example, some websites can detect (and block)
|
|
1368
|
+
# headless chrome, so you can use virtual_display mode instead.
|
|
1550
1369
|
headless_mode: :native,
|
|
1551
1370
|
|
|
1552
1371
|
# This option tells the browser not to use a proxy for the provided list of domains or IP addresses.
|
|
1553
|
-
# Format: array of strings. Works only for :selenium_firefox and selenium_chrome
|
|
1372
|
+
# Format: array of strings. Works only for :selenium_firefox and selenium_chrome.
|
|
1554
1373
|
proxy_bypass_list: [],
|
|
1555
1374
|
|
|
1556
|
-
# Option to provide custom SSL certificate. Works only for :
|
|
1375
|
+
# Option to provide custom SSL certificate. Works only for :mechanize.
|
|
1557
1376
|
ssl_cert_path: "path/to/ssl_cert",
|
|
1558
1377
|
|
|
1559
|
-
# Inject some JavaScript code
|
|
1560
|
-
# Format: array of strings, where each string is a path to JS file
|
|
1561
|
-
#
|
|
1378
|
+
# Inject some JavaScript code into the browser.
|
|
1379
|
+
# Format: array of strings, where each string is a path to a JS file or extension directory
|
|
1380
|
+
# Selenium doesn't support JS code injection.
|
|
1562
1381
|
extensions: ["lib/code_to_inject.js"],
|
|
1563
1382
|
|
|
1564
|
-
# Automatically skip
|
|
1565
|
-
#
|
|
1566
|
-
#
|
|
1567
|
-
#
|
|
1383
|
+
# Automatically skip already visited urls when using `request_to` method
|
|
1384
|
+
#
|
|
1385
|
+
# Possible values: `true` or a hash with options
|
|
1386
|
+
# In case of `true`, all visited urls will be added to the storage scope `:requests_urls`
|
|
1387
|
+
# and if the url already exists in this scope, the request will be skipped.
|
|
1388
|
+
#
|
|
1568
1389
|
# You can configure this setting by providing additional options as hash:
|
|
1569
|
-
#
|
|
1570
|
-
#
|
|
1571
|
-
#
|
|
1572
|
-
#
|
|
1573
|
-
#
|
|
1390
|
+
# `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
|
|
1391
|
+
# `scope:` – use a custom scope other than `:requests_urls`
|
|
1392
|
+
# `check_only:` – if true, the url will not be added to the scope
|
|
1393
|
+
#
|
|
1394
|
+
# Works for all drivers
|
|
1574
1395
|
skip_duplicate_requests: true,
|
|
1575
1396
|
|
|
1576
|
-
# Automatically skip provided errors while requesting a page
|
|
1577
|
-
#
|
|
1578
|
-
#
|
|
1579
|
-
# It
|
|
1580
|
-
#
|
|
1397
|
+
# Automatically skip provided errors while requesting a page
|
|
1398
|
+
#
|
|
1399
|
+
# If a raised error matches one of the errors in the list, then the error will be caught,
|
|
1400
|
+
# and the request will be skipped. It's a good idea to skip errors like 404 Not Found, etc.
|
|
1401
|
+
#
|
|
1402
|
+
# Format: array where elements are error classes and/or hashes. You can use a hash
|
|
1581
1403
|
# for more flexibility: `{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }`.
|
|
1582
|
-
#
|
|
1583
|
-
#
|
|
1404
|
+
#
|
|
1405
|
+
# The provided `message:` will be compared with a full error message using `String#include?`.
|
|
1406
|
+
# You can also use regex: `{ error: "RuntimeError", message: /404|403/ }`.
|
|
1584
1407
|
skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
|
|
1585
|
-
|
|
1586
|
-
# Automatically retry
|
|
1587
|
-
#
|
|
1588
|
-
#
|
|
1589
|
-
#
|
|
1590
|
-
#
|
|
1591
|
-
#
|
|
1592
|
-
#
|
|
1408
|
+
|
|
1409
|
+
# Automatically retry requests several times after certain errors
|
|
1410
|
+
#
|
|
1411
|
+
# If a raised error matches one of the errors in the list, the error will be caught,
|
|
1412
|
+
# and the request will be processed again with progressive delay.
|
|
1413
|
+
#
|
|
1414
|
+
# There are 3 attempts with _15 sec_, _30 sec_, and _45 sec_ delays, respectively. If after 3
|
|
1415
|
+
# attempts there is still an exception, then the exception will be raised. It's a good idea to
|
|
1416
|
+
# retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
|
|
1417
|
+
#
|
|
1418
|
+
# The format for `retry_request_errors` is the same as for `skip_request_errors`.
|
|
1593
1419
|
retry_request_errors: [Net::ReadTimeout],
|
|
1594
1420
|
|
|
1421
|
+
# Handle page encoding while parsing html response using Nokogiri
|
|
1422
|
+
#
|
|
1423
|
+
# There are two ways to use this option:
|
|
1424
|
+
# encoding: :auto # auto-detect from <meta http-equiv="Content-Type"> or <meta charset> tags
|
|
1425
|
+
# encoding: "GB2312" # set encoding manually
|
|
1426
|
+
#
|
|
1427
|
+
# This option is not set by default
|
|
1428
|
+
encoding: nil,
|
|
1429
|
+
|
|
1595
1430
|
# Restart browser if one of the options is true:
|
|
1596
1431
|
restart_if: {
|
|
1597
1432
|
# Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
|
|
1598
|
-
memory_limit:
|
|
1433
|
+
memory_limit: 1_500_000,
|
|
1599
1434
|
|
|
1600
1435
|
# Restart browser if provided requests limit is exceeded (works for all engines)
|
|
1601
1436
|
requests_limit: 100
|
|
@@ -1603,26 +1438,25 @@ end
|
|
|
1603
1438
|
|
|
1604
1439
|
# Perform several actions before each request:
|
|
1605
1440
|
before_request: {
|
|
1606
|
-
# Change proxy before each request. The `proxy:` option above should be
|
|
1607
|
-
#
|
|
1608
|
-
# (Selenium doesn't support proxy rotation).
|
|
1441
|
+
# Change proxy before each request. The `proxy:` option above should be set with lambda notation.
|
|
1442
|
+
# Works for :mechanize engine. Selenium doesn't support proxy rotation.
|
|
1609
1443
|
change_proxy: true,
|
|
1610
1444
|
|
|
1611
|
-
# Change user agent before each request. The `user_agent:` option above should
|
|
1612
|
-
#
|
|
1613
|
-
# (selenium doesn't support to get/set headers).
|
|
1445
|
+
# Change user agent before each request. The `user_agent:` option above should be set with lambda
|
|
1446
|
+
# notation. Works for :mechanize engine. Selenium doesn't support setting headers.
|
|
1614
1447
|
change_user_agent: true,
|
|
1615
1448
|
|
|
1616
|
-
# Clear all cookies before each request
|
|
1449
|
+
# Clear all cookies before each request. Works for all engines.
|
|
1617
1450
|
clear_cookies: true,
|
|
1618
1451
|
|
|
1619
|
-
# If you want to clear all cookies
|
|
1620
|
-
#
|
|
1452
|
+
# If you want to clear all cookies and set custom cookies, the `cookies:` option above should be set.
|
|
1453
|
+
# Use this option instead of clear_cookies. Works for all engines.
|
|
1621
1454
|
clear_and_set_cookies: true,
|
|
1622
1455
|
|
|
1623
|
-
# Global option to set delay between requests
|
|
1456
|
+
# Global option to set delay between requests
|
|
1457
|
+
#
|
|
1624
1458
|
# Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a range,
|
|
1625
|
-
# delay
|
|
1459
|
+
# the delay (in seconds) will be set randomly for each request: `rand(2..5) # => 3`
|
|
1626
1460
|
delay: 1..3
|
|
1627
1461
|
}
|
|
1628
1462
|
}
|
|
@@ -1635,11 +1469,11 @@ Settings can be inherited:
|
|
|
1635
1469
|
|
|
1636
1470
|
```ruby
|
|
1637
1471
|
class ApplicationSpider < Kimurai::Base
|
|
1638
|
-
@engine = :
|
|
1472
|
+
@engine = :selenium_chrome
|
|
1639
1473
|
@config = {
|
|
1640
|
-
user_agent: "
|
|
1474
|
+
user_agent: "Chrome",
|
|
1641
1475
|
disable_images: true,
|
|
1642
|
-
restart_if: { memory_limit:
|
|
1476
|
+
restart_if: { memory_limit: 1_500_000 },
|
|
1643
1477
|
before_request: { delay: 1..2 }
|
|
1644
1478
|
}
|
|
1645
1479
|
end
|
|
@@ -1657,11 +1491,11 @@ class CustomSpider < ApplicationSpider
|
|
|
1657
1491
|
end
|
|
1658
1492
|
```
|
|
1659
1493
|
|
|
1660
|
-
Here, `@config` of `CustomSpider` will be _[deep merged](https://apidock.com/rails/Hash/deep_merge)_ with `ApplicationSpider` config,
|
|
1494
|
+
Here, `@config` of `CustomSpider` will be _[deep merged](https://apidock.com/rails/Hash/deep_merge)_ with `ApplicationSpider`'s config. In this example, `CustomSpider` will keep all inherited options with only the `delay` being updated.
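
To make the merge concrete, here is roughly what Active Support's `Hash#deep_merge` does with the nested `before_request:` key (a standalone sketch; the child's delay value is an assumed override, since only `delay` changes in this example):

```ruby
require "active_support/core_ext/hash/deep_merge"

parent = { user_agent: "Chrome", disable_images: true, before_request: { delay: 1..2 } }
child  = { before_request: { delay: 2..4 } }

parent.deep_merge(child)
# => { user_agent: "Chrome", disable_images: true, before_request: { delay: 2..4 } }
```

A plain `Hash#merge` would have replaced the whole `before_request:` hash instead of merging it key by key.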
|
|
1661
1495
|
|
|
1662
1496
|
## Project mode
|
|
1663
1497
|
|
|
1664
|
-
Kimurai can work in project mode
|
|
1498
|
+
Kimurai can work in project mode. To generate a new project, run: `$ kimurai new web_spiders` (where `web_spiders` is the name for the project).
|
|
1665
1499
|
|
|
1666
1500
|
Structure of the project:
|
|
1667
1501
|
|
|
@@ -1670,7 +1504,6 @@ Structure of the project:
|
|
|
1670
1504
|
├── config/
|
|
1671
1505
|
│ ├── initializers/
|
|
1672
1506
|
│ ├── application.rb
|
|
1673
|
-
│ ├── automation.yml
|
|
1674
1507
|
│ ├── boot.rb
|
|
1675
1508
|
│ └── schedule.rb
|
|
1676
1509
|
├── spiders/
|
|
@@ -1693,26 +1526,25 @@ Structure of the project:
|
|
|
1693
1526
|
<details/>
|
|
1694
1527
|
<summary>Description</summary>
|
|
1695
1528
|
|
|
1696
|
-
* `config/`
|
|
1697
|
-
* `config/initializers` [Rails-like initializers](https://guides.rubyonrails.org/configuring.html#using-initializer-files) to load custom code
|
|
1698
|
-
* `config/application.rb` configuration settings for Kimurai (`Kimurai.configure do` block)
|
|
1699
|
-
* `config/
|
|
1700
|
-
* `config/
|
|
1701
|
-
|
|
1702
|
-
* `spiders
|
|
1703
|
-
|
|
1704
|
-
* `
|
|
1705
|
-
* `helpers
|
|
1706
|
-
|
|
1707
|
-
* `
|
|
1708
|
-
* `
|
|
1709
|
-
* `pipelines
|
|
1710
|
-
* `pipelines/
|
|
1711
|
-
|
|
1712
|
-
* `
|
|
1713
|
-
*
|
|
1714
|
-
* `
|
|
1715
|
-
* `Readme.md` example project readme
|
|
1529
|
+
* `config/` – directory for configuration files
|
|
1530
|
+
* `config/initializers` – [Rails-like initializers](https://guides.rubyonrails.org/configuring.html#using-initializer-files) to load custom code when the framework initializes
|
|
1531
|
+
* `config/application.rb` – configuration settings for Kimurai (`Kimurai.configure do` block)
|
|
1532
|
+
* `config/boot.rb` – loads the framework and project
|
|
1533
|
+
* `config/schedule.rb` – Cron [schedule for spiders](#schedule-spiders-using-cron)
|
|
1534
|
+
* `spiders/` – directory for spiders
|
|
1535
|
+
* `spiders/application_spider.rb` – base parent class for all spiders
|
|
1536
|
+
* `db/` – directory for database files (`sqlite`, `json`, `csv`, etc.)
|
|
1537
|
+
* `helpers/` – Rails-like helpers for spiders
|
|
1538
|
+
* `helpers/application_helper.rb` – all methods inside the ApplicationHelper module will be available for all spiders
|
|
1539
|
+
* `lib/` – custom Ruby code
|
|
1540
|
+
* `log/` – directory for logs
|
|
1541
|
+
* `pipelines/` – directory for [Scrapy-like](https://doc.scrapy.org/en/latest/topics/item-pipeline.html) pipelines (one file per pipeline)
|
|
1542
|
+
* `pipelines/validator.rb` – example pipeline to validate an item
|
|
1543
|
+
* `pipelines/saver.rb` – example pipeline to save an item
|
|
1544
|
+
* `tmp/` – folder for temp files
|
|
1545
|
+
* `.env` – file to store environment variables for a project and load them using [Dotenv](https://github.com/bkeepers/dotenv)
|
|
1546
|
+
* `Gemfile` – dependency file
|
|
1547
|
+
* `Readme.md` – example project readme
|
|
1716
1548
|
</details>
|
|
1717
1549
|
|
|
1718
1550
|
|
|
@@ -1740,8 +1572,6 @@ end
|
|
|
1740
1572
|
### Crawl
|
|
1741
1573
|
To run a particular spider in the project, run: `$ bundle exec kimurai crawl example_spider`. Don't forget to add `bundle exec` before the command to load the required environment.
|
|
1742
1574
|
|
|
1743
|
-
You can provide an additional option `--continue` to use [persistence storage database](#persistence-database-for-the-storage) feature.
|
|
1744
|
-
|
|
1745
1575
|
### List
|
|
1746
1576
|
To list all project spiders, run: `$ bundle exec kimurai list`
|
|
1747
1577
|
|
|
@@ -1769,7 +1599,7 @@ class Validator < Kimurai::Pipeline
|
|
|
1769
1599
|
# Here you can validate item and raise `DropItemError`
|
|
1770
1600
|
# if one of the validations failed. Examples:
|
|
1771
1601
|
|
|
1772
|
-
# Drop item if
|
|
1602
|
+
# Drop item if its category is not "shoe":
|
|
1773
1603
|
if item[:category] != "shoe"
|
|
1774
1604
|
raise DropItemError, "Wrong item category"
|
|
1775
1605
|
end
|
|
@@ -1820,6 +1650,7 @@ spiders/application_spider.rb
|
|
|
1820
1650
|
```ruby
|
|
1821
1651
|
class ApplicationSpider < Kimurai::Base
|
|
1822
1652
|
@engine = :selenium_chrome
|
|
1653
|
+
|
|
1823
1654
|
# Define pipelines (by order) for all spiders:
|
|
1824
1655
|
@pipelines = [:validator, :saver]
|
|
1825
1656
|
end
|
|
@@ -1893,22 +1724,20 @@ end
|
|
|
1893
1724
|
|
|
1894
1725
|
spiders/github_spider.rb
|
|
1895
1726
|
```ruby
|
|
1896
|
-
class GithubSpider <
|
|
1727
|
+
class GithubSpider < Kimurai::Base
|
|
1897
1728
|
@name = "github_spider"
|
|
1898
1729
|
@engine = :selenium_chrome
|
|
1899
|
-
@
|
|
1900
|
-
@start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
|
|
1730
|
+
@start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
|
|
1901
1731
|
@config = {
|
|
1902
|
-
|
|
1903
|
-
before_request: { delay: 4..7 }
|
|
1732
|
+
before_request: { delay: 3..5 }
|
|
1904
1733
|
}
|
|
1905
1734
|
|
|
1906
1735
|
def parse(response, url:, data: {})
|
|
1907
|
-
response.xpath("//
|
|
1736
|
+
response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
|
|
1908
1737
|
request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
|
|
1909
1738
|
end
|
|
1910
1739
|
|
|
1911
|
-
if next_page = response.at_xpath("//a[@
|
|
1740
|
+
if next_page = response.at_xpath("//a[@rel='next']")
|
|
1912
1741
|
request_to :parse, url: absolute_url(next_page[:href], base: url)
|
|
1913
1742
|
end
|
|
1914
1743
|
end
|
|
@@ -1916,17 +1745,17 @@ class GithubSpider < ApplicationSpider
|
|
|
1916
1745
|
def parse_repo_page(response, url:, data: {})
|
|
1917
1746
|
item = {}
|
|
1918
1747
|
|
|
1919
|
-
item[:owner] = response.xpath("//
|
|
1920
|
-
item[:repo_name] = response.xpath("//
|
|
1748
|
+
item[:owner] = response.xpath("//a[@rel='author']").text.squish
|
|
1749
|
+
item[:repo_name] = response.xpath("//strong[@itemprop='name']").text.squish
|
|
1921
1750
|
item[:repo_url] = url
|
|
1922
|
-
item[:description] = response.xpath("//
|
|
1923
|
-
item[:tags] = response.xpath("//div[@
|
|
1924
|
-
item[:watch_count] = response.xpath("//
|
|
1925
|
-
item[:star_count] = response.xpath("//
|
|
1926
|
-
item[:fork_count] = response.xpath("//
|
|
1927
|
-
item[:last_commit] = response.xpath("//
|
|
1751
|
+
item[:description] = response.xpath("//div[h2[text()='About']]/p").text.squish
|
|
1752
|
+
item[:tags] = response.xpath("//div/a[contains(@title, 'Topic')]").map { |a| a.text.squish }
|
|
1753
|
+
item[:watch_count] = response.xpath("//div/h3[text()='Watchers']/following-sibling::div[1]/a/strong").text.squish
|
|
1754
|
+
item[:star_count] = response.xpath("//div/h3[text()='Stars']/following-sibling::div[1]/a/strong").text.squish
|
|
1755
|
+
item[:fork_count] = response.xpath("//div/h3[text()='Forks']/following-sibling::div[1]/a/strong").text.squish
|
|
1756
|
+
item[:last_commit] = response.xpath("//div[@data-testid='latest-commit-details']//relative-time/text()").text.squish
|
|
1928
1757
|
|
|
1929
|
-
|
|
1758
|
+
save_to "results.json", item, format: :pretty_json
|
|
1930
1759
|
end
|
|
1931
1760
|
end
|
|
1932
1761
|
```
|
|
@@ -1934,41 +1763,41 @@ end
|
|
|
1934
1763
|
```
|
|
1935
1764
|
$ bundle exec kimurai crawl github_spider
|
|
1936
1765
|
|
|
1937
|
-
I, [2018-08-22 15:56:35 +0400#1358]
|
|
1938
|
-
D, [2018-08-22 15:56:35 +0400#1358]
|
|
1939
|
-
I, [2018-08-22 15:56:40 +0400#1358]
|
|
1940
|
-
I, [2018-08-22 15:56:44 +0400#1358]
|
|
1941
|
-
I, [2018-08-22 15:56:44 +0400#1358]
|
|
1942
|
-
D, [2018-08-22 15:56:44 +0400#1358]
|
|
1943
|
-
D, [2018-08-22 15:56:44 +0400#1358]
|
|
1944
|
-
|
|
1945
|
-
I, [2018-08-22 15:56:49 +0400#1358]
|
|
1946
|
-
I, [2018-08-22 15:56:50 +0400#1358]
|
|
1947
|
-
I, [2018-08-22 15:56:50 +0400#1358]
|
|
1948
|
-
D, [2018-08-22 15:56:50 +0400#1358]
|
|
1949
|
-
D, [2018-08-22 15:56:50 +0400#1358]
|
|
1950
|
-
I, [2018-08-22 15:56:50 +0400#1358]
|
|
1951
|
-
I, [2018-08-22 15:56:50 +0400#1358]
|
|
1952
|
-
D, [2018-08-22 15:56:50 +0400#1358]
|
|
1766
|
+
I, [2018-08-22 15:56:35 +0400#1358] INFO -- github_spider: Spider: started: github_spider
|
|
1767
|
+
D, [2018-08-22 15:56:35 +0400#1358] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
|
|
1768
|
+
I, [2018-08-22 15:56:40 +0400#1358] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
|
|
1769
|
+
I, [2018-08-22 15:56:44 +0400#1358] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
|
|
1770
|
+
I, [2018-08-22 15:56:44 +0400#1358] INFO -- github_spider: Info: visits: requests: 1, responses: 1
|
|
1771
|
+
D, [2018-08-22 15:56:44 +0400#1358] DEBUG -- github_spider: Browser: driver.current_memory: 116182
|
|
1772
|
+
D, [2018-08-22 15:56:44 +0400#1358] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
|
|
1773
|
+
|
|
1774
|
+
I, [2018-08-22 15:56:49 +0400#1358] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
|
|
1775
|
+
I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
|
|
1776
|
+
I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Info: visits: requests: 2, responses: 2
|
|
1777
|
+
D, [2018-08-22 15:56:50 +0400#1358] DEBUG -- github_spider: Browser: driver.current_memory: 217432
|
|
1778
|
+
D, [2018-08-22 15:56:50 +0400#1358] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
|
|
1779
|
+
I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Pipeline: processed: {"owner":"lorien","repo_name":"awesome-web-scraping","repo_url":"https://github.com/lorien/awesome-web-scraping","description":"List of libraries, tools and APIs for web scraping and data processing.","tags":["awesome","awesome-list","web-scraping","data-processing","python","javascript","php","ruby"],"watch_count":159,"star_count":2423,"fork_count":358,"last_commit":"4 days ago"}
|
|
1780
|
+
I, [2018-08-22 15:56:50 +0400#1358] INFO -- github_spider: Info: items: sent: 1, processed: 1
|
|
1781
|
+
D, [2018-08-22 15:56:50 +0400#1358] DEBUG -- github_spider: Browser: sleep 6 seconds before request...
|
|
1953
1782
|
|
|
1954
1783
|
...
|
|
1955
1784
|
|
|
1956
|
-
I, [2018-08-22 16:11:50 +0400#1358]
|
|
1957
|
-
I, [2018-08-22 16:11:51 +0400#1358]
|
|
1958
|
-
I, [2018-08-22 16:11:51 +0400#1358]
|
|
1959
|
-
D, [2018-08-22 16:11:51 +0400#1358]
|
|
1785
|
+
I, [2018-08-22 16:11:50 +0400#1358] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
|
|
1786
|
+
I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
|
|
1787
|
+
I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Info: visits: requests: 140, responses: 140
|
|
1788
|
+
D, [2018-08-22 16:11:51 +0400#1358] DEBUG -- github_spider: Browser: driver.current_memory: 211713
|
|
1960
1789
|
|
|
1961
|
-
D, [2018-08-22 16:11:51 +0400#1358]
|
|
1962
|
-
E, [2018-08-22 16:11:51 +0400#1358]
|
|
1790
|
+
D, [2018-08-22 16:11:51 +0400#1358] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
|
|
1791
|
+
E, [2018-08-22 16:11:51 +0400#1358] ERROR -- github_spider: Pipeline: dropped: #<Kimurai::Pipeline::DropItemError: Repository doesn't have enough stars>, item: {:owner=>"preston", :repo_name=>"idclight", :repo_url=>"https://github.com/preston/idclight", :description=>"A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.", :tags=>[], :watch_count=>6, :star_count=>1, :fork_count=>0, :last_commit=>"on Apr 12, 2012"}
|
|
1963
1792
|
|
|
1964
|
-
I, [2018-08-22 16:11:51 +0400#1358]
|
|
1793
|
+
I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Info: items: sent: 127, processed: 12
|
|
1965
1794
|
|
|
1966
|
-
I, [2018-08-22 16:11:51 +0400#1358]
|
|
1967
|
-
I, [2018-08-22 16:11:51 +0400#1358]
|
|
1795
|
+
I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
|
|
1796
|
+
I, [2018-08-22 16:11:51 +0400#1358] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 15:56:35 +0400, :stop_time=>2018-08-22 16:11:51 +0400, :running_time=>"15m, 16s", :visits=>{:requests=>140, :responses=>140}, :items=>{:sent=>127, :processed=>12}, :error=>nil}
|
|
1968
1797
|
```
|
|
1969
1798
|
</details><br>
|
|
1970
1799
|
|
|
1971
|
-
|
|
1800
|
+
You can also pass custom options to a pipeline from a particular spider if you want to change the pipeline behavior for this spider:
|
|
1972
1801
|
|
|
1973
1802
|
<details>
|
|
1974
1803
|
<summary>Example</summary>
|
|
@@ -2028,7 +1857,7 @@ $ bundle exec kimurai runner -j 3
|
|
|
2028
1857
|
<<< Runner: stopped: {:id=>1533727423, :status=>:completed, :start_time=>2018-08-08 15:23:43 +0400, :stop_time=>2018-08-08 15:25:11 +0400, :environment=>"development", :concurrent_jobs=>3, :spiders=>["custom_spider", "github_spider", "example_spider"]}
|
|
2029
1858
|
```
|
|
2030
1859
|
|
|
2031
|
-
Each spider runs in a separate process.
|
|
1860
|
+
Each spider runs in a separate process. Spider logs are available in the `log/` directory. Use the `-j` argument to specify how many spiders should be processed at the same time (default is 1).
|
|
2032
1861
|
|
|
2033
1862
|
You can provide additional arguments like `--include` or `--exclude` to specify which spiders to run:
|
|
2034
1863
|
|
|
@@ -2046,7 +1875,7 @@ You can perform custom actions before runner starts and after runner stops using
|
|
|
2046
1875
|
|
|
2047
1876
|
|
|
2048
1877
|
## Chat Support and Feedback
|
|
2049
|
-
|
|
1878
|
+
Submit an issue on GitHub and we'll try to address it in a timely manner.
|
|
2050
1879
|
|
|
2051
1880
|
## License
|
|
2052
|
-
|
|
1881
|
+
This gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
|