kimurai_dynamic 1.4.1

Files changed (62)
  1. checksums.yaml +7 -0
  2. data/.gitignore +11 -0
  3. data/.travis.yml +5 -0
  4. data/CHANGELOG.md +111 -0
  5. data/Gemfile +6 -0
  6. data/LICENSE.txt +21 -0
  7. data/README.md +2038 -0
  8. data/Rakefile +10 -0
  9. data/bin/console +14 -0
  10. data/bin/setup +8 -0
  11. data/exe/kimurai +6 -0
  12. data/kimurai.gemspec +48 -0
  13. data/lib/kimurai/automation/deploy.yml +54 -0
  14. data/lib/kimurai/automation/setup/chromium_chromedriver.yml +26 -0
  15. data/lib/kimurai/automation/setup/firefox_geckodriver.yml +20 -0
  16. data/lib/kimurai/automation/setup/phantomjs.yml +33 -0
  17. data/lib/kimurai/automation/setup/ruby_environment.yml +124 -0
  18. data/lib/kimurai/automation/setup.yml +45 -0
  19. data/lib/kimurai/base/saver.rb +106 -0
  20. data/lib/kimurai/base/storage.rb +54 -0
  21. data/lib/kimurai/base.rb +330 -0
  22. data/lib/kimurai/base_helper.rb +22 -0
  23. data/lib/kimurai/browser_builder/mechanize_builder.rb +154 -0
  24. data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb +175 -0
  25. data/lib/kimurai/browser_builder/selenium_chrome_builder.rb +199 -0
  26. data/lib/kimurai/browser_builder/selenium_firefox_builder.rb +204 -0
  27. data/lib/kimurai/browser_builder.rb +20 -0
  28. data/lib/kimurai/capybara_configuration.rb +10 -0
  29. data/lib/kimurai/capybara_ext/driver/base.rb +62 -0
  30. data/lib/kimurai/capybara_ext/mechanize/driver.rb +71 -0
  31. data/lib/kimurai/capybara_ext/poltergeist/driver.rb +13 -0
  32. data/lib/kimurai/capybara_ext/selenium/driver.rb +34 -0
  33. data/lib/kimurai/capybara_ext/session/config.rb +22 -0
  34. data/lib/kimurai/capybara_ext/session.rb +249 -0
  35. data/lib/kimurai/cli/ansible_command_builder.rb +71 -0
  36. data/lib/kimurai/cli/generator.rb +57 -0
  37. data/lib/kimurai/cli.rb +183 -0
  38. data/lib/kimurai/core_ext/array.rb +14 -0
  39. data/lib/kimurai/core_ext/hash.rb +5 -0
  40. data/lib/kimurai/core_ext/numeric.rb +19 -0
  41. data/lib/kimurai/core_ext/string.rb +7 -0
  42. data/lib/kimurai/pipeline.rb +33 -0
  43. data/lib/kimurai/runner.rb +60 -0
  44. data/lib/kimurai/template/.gitignore +18 -0
  45. data/lib/kimurai/template/Gemfile +28 -0
  46. data/lib/kimurai/template/README.md +3 -0
  47. data/lib/kimurai/template/config/application.rb +37 -0
  48. data/lib/kimurai/template/config/automation.yml +13 -0
  49. data/lib/kimurai/template/config/boot.rb +22 -0
  50. data/lib/kimurai/template/config/initializers/.keep +0 -0
  51. data/lib/kimurai/template/config/schedule.rb +57 -0
  52. data/lib/kimurai/template/db/.keep +0 -0
  53. data/lib/kimurai/template/helpers/application_helper.rb +3 -0
  54. data/lib/kimurai/template/lib/.keep +0 -0
  55. data/lib/kimurai/template/log/.keep +0 -0
  56. data/lib/kimurai/template/pipelines/saver.rb +11 -0
  57. data/lib/kimurai/template/pipelines/validator.rb +24 -0
  58. data/lib/kimurai/template/spiders/application_spider.rb +143 -0
  59. data/lib/kimurai/template/tmp/.keep +0 -0
  60. data/lib/kimurai/version.rb +3 -0
  61. data/lib/kimurai.rb +54 -0
  62. metadata +349 -0
data/README.md ADDED
@@ -0,0 +1,2038 @@
1
+ # Kimurai
2
+
3
+ > UPD: I will soon have time to work on issues for the current 1.4 version, and I also plan to release a new 2.0 version with the https://github.com/twalpole/apparition engine.
4
+
5
+ Kimurai is a modern web scraping framework written in Ruby which **works out of the box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows you to scrape and interact with JavaScript-rendered websites.**
6
+
7
+ Kimurai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's see:
8
+
9
+ ```ruby
10
+ # github_spider.rb
11
+ require 'kimurai'
12
+
13
+ class GithubSpider < Kimurai::Base
14
+ @name = "github_spider"
15
+ @engine = :selenium_chrome
16
+ @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
17
+ @config = {
18
+ user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
19
+ before_request: { delay: 4..7 }
20
+ }
21
+
22
+ def parse(response, url:, data: {})
23
+ response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
24
+ request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
25
+ end
26
+
27
+ if next_page = response.at_xpath("//a[@class='next_page']")
28
+ request_to :parse, url: absolute_url(next_page[:href], base: url)
29
+ end
30
+ end
31
+
32
+ def parse_repo_page(response, url:, data: {})
33
+ item = {}
34
+
35
+ item[:owner] = response.xpath("//h1//a[@rel='author']").text
36
+ item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
37
+ item[:repo_url] = url
38
+ item[:description] = response.xpath("//span[@itemprop='about']").text.squish
39
+ item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
40
+ item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish
41
+ item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish
42
+ item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish
43
+ item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text
44
+
45
+ save_to "results.json", item, format: :pretty_json
46
+ end
47
+ end
48
+
49
+ GithubSpider.crawl!
50
+ ```
51
+
52
+ <details/>
53
+ <summary>Run: <code>$ ruby github_spider.rb</code></summary>
54
+
55
+ ```
56
+ I, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] INFO -- github_spider: Spider: started: github_spider
57
+ D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
58
+ D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled `browser before_request delay`
59
+ D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 7 seconds before request...
60
+ D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled custom user-agent
61
+ D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
62
+ I, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
63
+ I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
64
+ I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 1, responses: 1
65
+ D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 107968
66
+ D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
67
+ I, [2018-08-22 13:08:32 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
68
+ I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
69
+ I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 2, responses: 2
70
+ D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 212542
71
+ D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 4 seconds before request...
72
+ I, [2018-08-22 13:08:37 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector
73
+
74
+ ...
75
+
76
+ I, [2018-08-22 13:23:07 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
77
+ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
78
+ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 140, responses: 140
79
+ D, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 204198
80
+ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
81
+
82
+ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:08:03 +0400, :stop_time=>2018-08-22 13:23:08 +0400, :running_time=>"15m, 5s", :visits=>{:requests=>140, :responses=>140}, :error=>nil}
83
+ ```
84
+ </details>
85
+
86
+ <details/>
87
+ <summary>results.json</summary>
88
+
89
+ ```json
90
+ [
91
+ {
92
+ "owner": "lorien",
93
+ "repo_name": "awesome-web-scraping",
94
+ "repo_url": "https://github.com/lorien/awesome-web-scraping",
95
+ "description": "List of libraries, tools and APIs for web scraping and data processing.",
96
+ "tags": [
97
+ "awesome",
98
+ "awesome-list",
99
+ "web-scraping",
100
+ "data-processing",
101
+ "python",
102
+ "javascript",
103
+ "php",
104
+ "ruby"
105
+ ],
106
+ "watch_count": "159",
107
+ "star_count": "2,423",
108
+ "fork_count": "358",
109
+ "last_commit": "4 days ago",
110
+ "position": 1
111
+ },
112
+
113
+ ...
114
+
115
+ {
116
+ "owner": "preston",
117
+ "repo_name": "idclight",
118
+ "repo_url": "https://github.com/preston/idclight",
119
+ "description": "A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.",
120
+ "tags": [
121
+
122
+ ],
123
+ "watch_count": "6",
124
+ "star_count": "1",
125
+ "fork_count": "0",
126
+ "last_commit": "on Apr 12, 2012",
127
+ "position": 127
128
+ }
129
+ ]
130
+ ```
131
+ </details><br>
132
+
133
+ Okay, that was easy. How about JavaScript-rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:
134
+
135
+ ```ruby
136
+ # infinite_scroll_spider.rb
137
+ require 'kimurai'
138
+
139
+ class InfiniteScrollSpider < Kimurai::Base
140
+ @name = "infinite_scroll_spider"
141
+ @engine = :selenium_chrome
142
+ @start_urls = ["https://infinite-scroll.com/demo/full-page/"]
143
+
144
+ def parse(response, url:, data: {})
145
+ posts_headers_path = "//article/h2"
146
+ count = response.xpath(posts_headers_path).count
147
+
148
+ loop do
149
+ browser.execute_script("window.scrollBy(0,10000)") ; sleep 2
150
+ response = browser.current_response
151
+
152
+ new_count = response.xpath(posts_headers_path).count
153
+ if count == new_count
154
+ logger.info "> Pagination is done" and break
155
+ else
156
+ count = new_count
157
+ logger.info "> Continue scrolling, current count is #{count}..."
158
+ end
159
+ end
160
+
161
+ posts_headers = response.xpath(posts_headers_path).map(&:text)
162
+ logger.info "> All posts from page: #{posts_headers.join('; ')}"
163
+ end
164
+ end
165
+
166
+ InfiniteScrollSpider.crawl!
167
+ ```
168
+
169
+ <details/>
170
+ <summary>Run: <code>$ ruby infinite_scroll_spider.rb</code></summary>
171
+
172
+ ```
173
+ I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
174
+ D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): created browser instance
175
+ D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
176
+ I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
177
+ I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
178
+ I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
179
+ D, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: Browser: driver.current_memory: 95463
180
+ I, [2018-08-22 13:33:05 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 5...
181
+ I, [2018-08-22 13:33:18 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 9...
182
+ I, [2018-08-22 13:33:20 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 11...
183
+ I, [2018-08-22 13:33:26 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 13...
184
+ I, [2018-08-22 13:33:28 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 15...
185
+ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Pagination is done
186
+ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
187
+ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
188
+ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Spider: stopped: {:spider_name=>"infinite_scroll_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:32:57 +0400, :stop_time=>2018-08-22 13:33:30 +0400, :running_time=>"33s", :visits=>{:requests=>1, :responses=>1}, :error=>nil}
189
+
190
+ ```
191
+ </details><br>
192
+
193
+
194
+ ## Features
195
+ * Scrape JavaScript-rendered websites out of the box
196
+ * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [PhantomJS](https://github.com/ariya/phantomjs) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
197
+ * Write spider code once, and use it with any supported engine later
198
+ * All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
199
+ * Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
200
+ * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates-unique-helper) to skip duplicates
201
+ * Automatically [handle request errors](#handle-request-errors)
202
+ * Automatically restart browsers when reaching a memory limit [**(memory control)**](#spider-config) or a request limit
203
+ * Easily [schedule spiders](#schedule-spiders-using-cron) with cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
204
+ * [Parallel scraping](#parallel-crawling-using-in_parallel) using the simple `in_parallel` method
205
+ * **Two modes:** use a single file for a simple spider, or [generate](#project-mode) a Scrapy-like **project**
206
+ * Convenient development mode with [console](#interactive-console), colorized logger and debugger ([Pry](https://github.com/pry/pry), [Byebug](https://github.com/deivid-rodriguez/byebug))
207
+ * Automated [server environment setup](#setup) (for Ubuntu 18.04) and [deploy](#deploy) using the commands `kimurai setup` and `kimurai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)
208
+ * Command-line [runner](#runner) to run all project spiders one by one or in parallel
209
+
210
+ ## Table of Contents
211
+ * [Kimurai](#kimurai)
212
+ * [Features](#features)
213
+ * [Table of Contents](#table-of-contents)
214
+ * [Installation](#installation)
215
+ * [Getting to Know](#getting-to-know)
216
+ * [Interactive console](#interactive-console)
217
+ * [Available engines](#available-engines)
218
+ * [Minimum required spider structure](#minimum-required-spider-structure)
219
+ * [Method arguments response, url and data](#method-arguments-response-url-and-data)
220
+ * [browser object](#browser-object)
221
+ * [request_to method](#request_to-method)
222
+ * [save_to helper](#save_to-helper)
223
+ * [Skip duplicates](#skip-duplicates)
224
+ * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
225
+ * [Storage object](#storage-object)
226
+ * [Handle request errors](#handle-request-errors)
227
+ * [skip_request_errors](#skip_request_errors)
228
+ * [retry_request_errors](#retry_request_errors)
229
+ * [Logging custom events](#logging-custom-events)
230
+ * [open_spider and close_spider callbacks](#open_spider-and-close_spider-callbacks)
231
+ * [KIMURAI_ENV](#kimurai_env)
232
+ * [Parallel crawling using in_parallel](#parallel-crawling-using-in_parallel)
233
+ * [Active Support included](#active-support-included)
234
+ * [Schedule spiders using Cron](#schedule-spiders-using-cron)
235
+ * [Configuration options](#configuration-options)
236
+ * [Using Kimurai inside existing Ruby application](#using-kimurai-inside-existing-ruby-application)
237
+ * [crawl! method](#crawl-method)
238
+ * [parse! method](#parsemethod_name-url-method)
239
+ * [Kimurai.list and Kimurai.find_by_name](#kimurailist-and-kimuraifind_by_name)
240
+ * [Automated server setup and deployment](#automated-sever-setup-and-deployment)
241
+ * [Setup](#setup)
242
+ * [Deploy](#deploy)
243
+ * [Spider @config](#spider-config)
244
+ * [All available @config options](#all-available-config-options)
245
+ * [@config settings inheritance](#config-settings-inheritance)
246
+ * [Project mode](#project-mode)
247
+ * [Generate new spider](#generate-new-spider)
248
+ * [Crawl](#crawl)
249
+ * [List](#list)
250
+ * [Parse](#parse)
251
+ * [Pipelines, send_item method](#pipelines-send_item-method)
252
+ * [Runner](#runner)
253
+ * [Runner callbacks](#runner-callbacks)
254
+ * [Chat Support and Feedback](#chat-support-and-feedback)
255
+ * [License](#license)
256
+
257
+
258
+ ## Installation
259
+ Kimurai requires Ruby version `>= 2.5.0`. Supported platforms: `Linux` and `Mac OS X`.
260
+
261
+ 1) If your system doesn't have an appropriate Ruby version, install it:
262
+
263
+ <details/>
264
+ <summary>Ubuntu 18.04</summary>
265
+
266
+ ```bash
267
+ # Install required packages for ruby-build
268
+ sudo apt update
269
+ sudo apt install git-core curl zlib1g-dev build-essential libssl-dev libreadline-dev libreadline6-dev libyaml-dev libxml2-dev libxslt1-dev libcurl4-openssl-dev libffi-dev
270
+
271
+ # Install rbenv and ruby-build
272
+ cd && git clone https://github.com/rbenv/rbenv.git ~/.rbenv
273
+ echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
274
+ echo 'eval "$(rbenv init -)"' >> ~/.bashrc
275
+ exec $SHELL
276
+
277
+ git clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build
278
+ echo 'export PATH="$HOME/.rbenv/plugins/ruby-build/bin:$PATH"' >> ~/.bashrc
279
+ exec $SHELL
280
+
281
+ # Install latest Ruby
282
+ rbenv install 2.5.3
283
+ rbenv global 2.5.3
284
+
285
+ gem install bundler
286
+ ```
287
+ </details>
288
+
289
+ <details/>
290
+ <summary>Mac OS X</summary>
291
+
292
+ ```bash
293
+ # Install homebrew if you don't have it https://brew.sh/
294
+ # Install rbenv and ruby-build:
295
+ brew install rbenv ruby-build
296
+
297
+ # Add rbenv to bash so that it loads every time you open a terminal
298
+ echo 'if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi' >> ~/.bash_profile
299
+ source ~/.bash_profile
300
+
301
+ # Install latest Ruby
302
+ rbenv install 2.5.3
303
+ rbenv global 2.5.3
304
+
305
+ gem install bundler
306
+ ```
307
+ </details>
308
+
309
+ 2) Install Kimurai gem: `$ gem install kimurai`
310
+
311
+ 3) Install browsers with webdrivers:
312
+
313
+ <details/>
314
+ <summary>Ubuntu 18.04</summary>
315
+
316
+ Note: for Ubuntu 16.04-18.04, automatic installation is available using the `setup` command:
317
+ ```bash
318
+ $ kimurai setup localhost --local --ask-sudo
319
+ ```
320
+ It works using [Ansible](https://github.com/ansible/ansible), so you need to install it first: `$ sudo apt install ansible`. You can check the playbooks it uses [here](lib/kimurai/automation).
321
+
322
+ If you chose automatic installation, you can skip the following and go to the "Getting to Know" part. In case you want to install everything manually:
323
+
324
+ ```bash
325
+ # Install basic tools
326
+ sudo apt install -q -y unzip wget tar openssl
327
+
328
+ # Install xvfb (for virtual_display headless mode, in addition to native)
329
+ sudo apt install -q -y xvfb
330
+
331
+ # Install chromium-browser and firefox
332
+ sudo apt install -q -y chromium-browser firefox
333
+
334
+ # Install chromedriver (version 2.44)
335
+ # All versions located here https://sites.google.com/a/chromium.org/chromedriver/downloads
336
+ cd /tmp && wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
337
+ sudo unzip chromedriver_linux64.zip -d /usr/local/bin
338
+ rm -f chromedriver_linux64.zip
339
+
340
+ # Install geckodriver (0.23.0 version)
341
+ # All versions located here https://github.com/mozilla/geckodriver/releases/
342
+ cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
343
+ sudo tar -xvzf geckodriver-v0.23.0-linux64.tar.gz -C /usr/local/bin
344
+ rm -f geckodriver-v0.23.0-linux64.tar.gz
345
+
346
+ # Install PhantomJS (2.1.1)
347
+ # All versions located here http://phantomjs.org/download.html
348
+ sudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
349
+ cd /tmp && wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
350
+ tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
351
+ sudo mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib
352
+ sudo ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin
353
+ rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2
354
+ ```
355
+
356
+ </details>
357
+
358
+ <details/>
359
+ <summary>Mac OS X</summary>
360
+
361
+ ```bash
362
+ # Install chrome and firefox
363
+ brew cask install google-chrome firefox
364
+
365
+ # Install chromedriver (latest)
366
+ brew cask install chromedriver
367
+
368
+ # Install geckodriver (latest)
369
+ brew install geckodriver
370
+
371
+ # Install PhantomJS (latest)
372
+ brew install phantomjs
373
+ ```
374
+ </details><br>
375
+
376
+ Also, if you want to save scraped items to a database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install the database clients/servers:
377
+
378
+ <details/>
379
+ <summary>Ubuntu 18.04</summary>
380
+
381
+ SQLite: `$ sudo apt -q -y install libsqlite3-dev sqlite3`.
382
+
383
+ If you want to connect to a remote database, you don't need a database server on the local machine (only a client):
384
+ ```bash
385
+ # Install MySQL client
386
+ sudo apt -q -y install mysql-client libmysqlclient-dev
387
+
388
+ # Install Postgres client
389
+ sudo apt install -q -y postgresql-client libpq-dev
390
+
391
+ # Install MongoDB client
392
+ sudo apt install -q -y mongodb-clients
393
+ ```
394
+
395
+ But if you want to save items to a local database, a database server is required as well:
396
+ ```bash
397
+ # Install MySQL client and server
398
+ sudo apt -q -y install mysql-server mysql-client libmysqlclient-dev
399
+
400
+ # Install Postgres client and server
401
+ sudo apt install -q -y postgresql postgresql-contrib libpq-dev
402
+
403
+ # Install MongoDB client and server
404
+ # version 4.0 (check here https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/)
405
+ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
406
+ # for 16.04:
407
+ # echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
408
+ # for 18.04:
409
+ echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
410
+ sudo apt update
411
+ sudo apt install -q -y mongodb-org
412
+ sudo service mongod start
413
+ ```
414
+ </details>
415
+
416
+ <details/>
417
+ <summary>Mac OS X</summary>
418
+
419
+ SQLite: `$ brew install sqlite3`
420
+
421
+ ```bash
422
+ # Install MySQL client and server
423
+ brew install mysql
424
+ # Start server if you need it: brew services start mysql
425
+
426
+ # Install Postgres client and server
427
+ brew install postgresql
428
+ # Start server if you need it: brew services start postgresql
429
+
430
+ # Install MongoDB client and server
431
+ brew install mongodb
432
+ # Start server if you need it: brew services start mongodb
433
+ ```
434
+ </details>
435
+
436
+
437
+ ## Getting to Know
438
+ ### Interactive console
439
+ Before you get to know all Kimurai features, there is the `$ kimurai console` command: an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like the [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
440
+
441
+ ```bash
442
+ $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
443
+ ```
444
+
445
+ <details/>
446
+ <summary>Show output</summary>
447
+
448
+ ```
449
+ $ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
450
+
451
+ D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
452
+ D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): enabled native headless_mode
453
+ I, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework
454
+ I, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] INFO -- : Browser: finished get request to: https://github.com/vifreefly/kimuraframework
455
+ D, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] DEBUG -- : Browser: driver.current_memory: 201701
456
+
457
+ From: /home/victor/code/kimurai/lib/kimurai/base.rb @ line 189 Kimurai::Base#console:
458
+
459
+ 188: def console(response = nil, url: nil, data: {})
460
+ => 189: binding.pry
461
+ 190: end
462
+
463
+ [1] pry(#<Kimurai::Base>)> response.xpath("//title").text
464
+ => "GitHub - vifreefly/kimuraframework: Modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites"
465
+
466
+ [2] pry(#<Kimurai::Base>)> ls
467
+ Kimurai::Base#methods: browser console logger request_to save_to unique?
468
+ instance variables: @browser @config @engine @logger @pipelines
469
+ locals: _ __ _dir_ _ex_ _file_ _in_ _out_ _pry_ data response url
470
+
471
+ [3] pry(#<Kimurai::Base>)> ls response
472
+ Nokogiri::XML::PP::Node#methods: inspect pretty_print
473
+ Nokogiri::XML::Searchable#methods: % / at at_css at_xpath css search xpath
474
+ Enumerable#methods:
475
+ all? collect drop each_with_index find_all grep_v lazy member? none? reject slice_when take_while without
476
+ any? collect_concat drop_while each_with_object find_index group_by many? min one? reverse_each sort to_a zip
477
+ as_json count each_cons entries first include? map min_by partition select sort_by to_h
478
+ chunk cycle each_entry exclude? flat_map index_by max minmax pluck slice_after sum to_set
479
+ chunk_while detect each_slice find grep inject max_by minmax_by reduce slice_before take uniq
480
+ Nokogiri::XML::Node#methods:
481
+ <=> append_class classes document? has_attribute? matches? node_name= processing_instruction? to_str
482
+ == attr comment? each html? name= node_type read_only? to_xhtml
483
+ > attribute content elem? inner_html namespace= parent= remove traverse
484
+ [] attribute_nodes content= element? inner_html= namespace_scopes parse remove_attribute unlink
485
+ []= attribute_with_ns create_external_subset element_children inner_text namespaced_key? path remove_class values
486
+ accept before create_internal_subset elements internal_subset native_content= pointer_id replace write_html_to
487
+ add_class blank? css_path encode_special_chars key? next prepend_child set_attribute write_to
488
+ add_next_sibling cdata? decorate! external_subset keys next= previous text write_xhtml_to
489
+ add_previous_sibling child delete first_element_child lang next_element previous= text? write_xml_to
490
+ after children description fragment? lang= next_sibling previous_element to_html xml?
491
+ ancestors children= do_xinclude get_attribute last_element_child node_name previous_sibling to_s
492
+ Nokogiri::XML::Document#methods:
493
+ << canonicalize collect_namespaces create_comment create_entity decorate document encoding errors name remove_namespaces! root= to_java url version
494
+ add_child clone create_cdata create_element create_text_node decorators dup encoding= errors= namespaces root slop! to_xml validate
495
+ Nokogiri::HTML::Document#methods: fragment meta_encoding meta_encoding= serialize title title= type
496
+ instance variables: @decorators @errors @node_cache
497
+
498
+ [4] pry(#<Kimurai::Base>)> exit
499
+ I, [2018-08-22 13:43:47 +0400#26079] [M: 47461994677760] INFO -- : Browser: driver selenium_chrome has been destroyed
500
+ $
501
+ ```
502
+ </details><br>
503
+
504
+ CLI options:
505
+ * `--engine` (optional) [engine](#available-engines) to use. Default is `mechanize`
506
+ * `--url` (optional) url to process. If the url is omitted, the `response` and `url` objects inside the console will be `nil` (use the [browser](#browser-object) object to navigate to any webpage).
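+
+ For example, if `--url` was omitted, you can navigate from inside the console yourself (a minimal sketch; the url is just an example):
+
+ ```ruby
+ # Inside the console (pry) session:
+ browser.visit("https://github.com/vifreefly/kimuraframework")
+ response = browser.current_response
+ response.xpath("//title").text
+ ```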
507
+
508
+ ### Available engines
509
+ Kimurai supports the following engines and can mostly switch between them without the need to rewrite any code (a short sketch follows the list):
510
+
511
+ * `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is; it can only parse the original HTML code of a page. Because of that, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. Use mechanize when you can, i.e. when the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize tries to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc.).
512
+ * `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render JavaScript. In general, PhantomJS is still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leaks, but Kimurai has a [memory control feature](#spider-config), so you shouldn't consider it a problem. Also, some websites can recognize PhantomJS and block access. Like mechanize (and unlike the selenium engines), `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see the [config section](#all-available-config-options)).
513
+ * `:selenium_chrome` Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
514
+ * `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than the other drivers, but can sometimes be useful.
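+
+ A short sketch of engine switching (the spider name and urls are illustrative): the spider body stays the same, only `@engine` changes.
+
+ ```ruby
+ class NewsSpider < Kimurai::Base
+   @name = "news_spider"
+   # Use :mechanize for plain HTML pages; switch to :selenium_chrome,
+   # :selenium_firefox or :poltergeist_phantomjs when JavaScript rendering is needed:
+   @engine = :mechanize
+   @start_urls = ["https://example.com/news"]
+
+   def parse(response, url:, data: {})
+     response.xpath("//article/h2").each { |title| puts title.text }
+   end
+ end
+ ```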
515
+
516
+ **Tip:** add the `HEADLESS=false` ENV variable before the command (`$ HEADLESS=false ruby spider.rb`) to run the browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.
517
+
518
+
519
+ ### Minimum required spider structure
520
+ > You can manually create a spider file, or use the generator instead: `$ kimurai generate spider simple_spider`
521
+
522
+ ```ruby
523
+ require 'kimurai'
524
+
525
+ class SimpleSpider < Kimurai::Base
526
+ @name = "simple_spider"
527
+ @engine = :selenium_chrome
528
+ @start_urls = ["https://example.com/"]
529
+
530
+ def parse(response, url:, data: {})
531
+ end
532
+ end
533
+
534
+ SimpleSpider.crawl!
535
+ ```
536
+
537
+ Where:
538
+ * `@name` name of the spider. You can omit the name if you use a single-file spider
539
+ * `@engine` engine for a spider
540
+ * `@start_urls` array of start urls to process one by one inside `parse` method
541
+ * The `parse` method is the start method and should always be present in the spider class
542
+
543
+
544
+ ### Method arguments `response`, `url` and `data`
545
+
546
+ ```ruby
547
+ def parse(response, url:, data: {})
548
+ end
549
+ ```
550
+
551
+ * `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object) Contains parsed HTML code of a processed webpage
552
+ * `url` (String) url of a processed webpage
553
+ * `data` (Hash) used to pass data between requests
554
+
555
+ <details/>
556
+ <summary><strong>Example how to use <code>data</code></strong></summary>
557
+
558
+ Imagine that there is a product page which doesn't contain the product category. The category name is present only on the category page with pagination. This is a case where we can use `data` to pass the category name from the `parse` method to the `parse_product` method:
559
+
560
+ ```ruby
561
+ class ProductsSpider < Kimurai::Base
562
+ @engine = :selenium_chrome
563
+ @start_urls = ["https://example-shop.com/example-product-category"]
564
+
565
+ def parse(response, url:, data: {})
566
+ category_name = response.xpath("//path/to/category/name").text
567
+ response.xpath("//path/to/products/urls").each do |product_url|
568
+ # Merge category_name with current data hash and pass it next to parse_product method
569
+ request_to(:parse_product, url: product_url[:href], data: data.merge(category_name: category_name))
570
+ end
571
+
572
+ # ...
573
+ end
574
+
575
+ def parse_product(response, url:, data: {})
576
+ item = {}
577
+ # Assign item's category_name from data[:category_name]
578
+ item[:category_name] = data[:category_name]
579
+
580
+ # ...
581
+ end
582
+ end
583
+
584
+ ```
585
+ </details><br>
586
+
587
+ **You can query `response` using [XPath or CSS selectors](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Searchable)**. Check Nokogiri tutorials to understand how to work with `response`:
588
+ * [Parsing HTML with Nokogiri](http://ruby.bastardsbook.com/chapters/html-parsing/) - ruby.bastardsbook.com
589
+ * [HOWTO parse HTML with Ruby & Nokogiri](https://readysteadycode.com/howto-parse-html-with-ruby-and-nokogiri) - readysteadycode.com
590
+ * [Class: Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) (documentation) - rubydoc.info
591
+
592
+
593
+ ### `browser` object
594
+
595
+ From any spider instance method, the `browser` object is available. It is a [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and is used to process requests and get the page response (`current_response` method). Usually you don't need to touch it directly, because there is `response` (see above) which contains the page response after it was loaded.
596
+
597
+ But if you need to interact with a page (like filling in form fields, clicking elements, checkboxes, etc.), `browser` is ready for you:
598
+
599
+ ```ruby
600
+ class GoogleSpider < Kimurai::Base
601
+ @name = "google_spider"
602
+ @engine = :selenium_chrome
603
+ @start_urls = ["https://www.google.com/"]
604
+
605
+ def parse(response, url:, data: {})
606
+ browser.fill_in "q", with: "Kimurai web scraping framework"
607
+ browser.click_button "Google Search"
608
+
609
+ # Update response to current response after interaction with a browser
610
+ response = browser.current_response
611
+
612
+ # Collect results
613
+ results = response.xpath("//div[@class='g']//h3/a").map do |a|
614
+ { title: a.text, url: a[:href] }
615
+ end
616
+
617
+ # ...
618
+ end
619
+ end
620
+ ```
621
+
622
+ Check out the **Capybara cheat sheets** where you can see all available methods **to interact with the browser**:
623
+ * [UI Testing with RSpec and Capybara [cheat sheet]](http://cheatrags.com/capybara) - cheatrags.com
624
+ * [Capybara Cheatsheet PDF](https://thoughtbot.com/upcase/test-driven-rails-resources/capybara.pdf) - thoughtbot.com
625
+ * [Class: Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) (documentation) - rubydoc.info
626
+
627
+ ### `request_to` method
628
+
629
+ To make a request and pass the response to a particular method, there is `request_to`. It requires at least two arguments: `:method_name` and `url:`. An optional argument is `data:` (see above what it is for). Example:
630
+
631
+ ```ruby
632
+ class Spider < Kimurai::Base
633
+ @engine = :selenium_chrome
634
+ @start_urls = ["https://example.com/"]
635
+
636
+ def parse(response, url:, data: {})
637
+ # Process request to `parse_product` method with `https://example.com/some_product` url:
638
+ request_to :parse_product, url: "https://example.com/some_product"
639
+ end
640
+
641
+ def parse_product(response, url:, data: {})
642
+ puts "From page https://example.com/some_product !"
643
+ end
644
+ end
645
+ ```
646
+
647
+ Under the hood, `request_to` simply calls [#visit](https://www.rubydoc.info/github/jnicklas/capybara/Capybara%2FSession:visit) (`browser.visit(url)`) and then the required method with arguments:
648
+
649
+ <details/>
650
+ <summary>request_to</summary>
651
+
652
+ ```ruby
653
+ def request_to(handler, url:, data: {})
654
+ request_data = { url: url, data: data }
655
+
656
+ browser.visit(url)
657
+ public_send(handler, browser.current_response, request_data)
658
+ end
659
+ ```
660
+ </details><br>
661
+
662
+ `request_to` just makes things simpler, and without it we could do something like:
663
+
664
+ <details/>
665
+ <summary>Check the code</summary>
666
+
667
+ ```ruby
668
+ class Spider < Kimurai::Base
669
+ @engine = :selenium_chrome
670
+ @start_urls = ["https://example.com/"]
671
+
672
+ def parse(response, url:, data: {})
673
+ url_to_process = "https://example.com/some_product"
674
+
675
+ browser.visit(url_to_process)
676
+ parse_product(browser.current_response, url: url_to_process)
677
+ end
678
+
679
+ def parse_product(response, url:, data: {})
680
+ puts "From page https://example.com/some_product !"
681
+ end
682
+ end
683
+ ```
684
+ </details>
685
+
686
+ ### `save_to` helper
687
+
688
+ Sometimes all you need is to simply save scraped data to a file format, like JSON or CSV. You can use `save_to` for that:
689
+
690
+ ```ruby
691
+ class ProductsSpider < Kimurai::Base
692
+ @engine = :selenium_chrome
693
+ @start_urls = ["https://example-shop.com/"]
694
+
695
+ # ...
696
+
697
+ def parse_product(response, url:, data: {})
698
+ item = {}
699
+
700
+ item[:title] = response.xpath("//title/path").text
701
+ item[:description] = response.xpath("//desc/path").text.squish
702
+ item[:price] = response.xpath("//price/path").text[/\d+/]&.to_f
703
+
704
+ # Add each new item to the `scraped_products.json` file:
705
+ save_to "scraped_products.json", item, format: :json
706
+ end
707
+ end
708
+ ```
709
+
710
+ Supported formats:
711
+ * `:json` JSON
712
+ * `:pretty_json` "pretty" JSON (`JSON.pretty_generate`)
713
+ * `:jsonlines` [JSON Lines](http://jsonlines.org/)
714
+ * `:csv` CSV
715
+
716
+ Note: `save_to` requires data (item to save) to be a `Hash`.
717
+
718
+ By default, `save_to` adds a position key to the item hash. You can disable it with `position: false`: `save_to "scraped_products.json", item, format: :json, position: false`.
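+
+ A tiny illustration (the item and file name are made up):
+
+ ```ruby
+ item = { title: "Example" }
+ save_to "products.json", item, format: :json
+ # The saved record will also contain a position field, e.g. {"title":"Example","position":1}
+ ```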
719
+
720
+ **How helper works:**
721
+
722
+ While the spider is running, each new item is appended to the file. At the next run, the helper will clear the content of the file first, and then start appending items to it again.
723
+
724
+ > If you don't want the file to be cleared before each run, add the option `append: true`: `save_to "scraped_products.json", item, format: :json, append: true`
725
+
726
+ ### Skip duplicates
727
+
728
+ It's pretty common for websites to have duplicated pages, for example when an e-commerce shop has the same products in different categories. To skip duplicates, there is a simple `unique?` helper:
729
+
730
+ ```ruby
731
+ class ProductsSpider < Kimurai::Base
732
+ @engine = :selenium_chrome
733
+ @start_urls = ["https://example-shop.com/"]
734
+
735
+ def parse(response, url:, data: {})
736
+ response.xpath("//categories/path").each do |category|
737
+ request_to :parse_category, url: category[:href]
738
+ end
739
+ end
740
+
741
+ # Check products for uniqueness using product url inside of parse_category:
742
+ def parse_category(response, url:, data: {})
743
+ response.xpath("//products/path").each do |product|
744
+ # Skip url if it's not unique:
745
+ next unless unique?(:product_url, product[:href])
746
+ # Otherwise process it:
747
+ request_to :parse_product, url: product[:href]
748
+ end
749
+ end
750
+
751
+ # Or/and check products for uniqueness using product sku inside of parse_product:
752
+ def parse_product(response, url:, data: {})
753
+ item = {}
754
+ item[:sku] = response.xpath("//product/sku/path").text.strip.upcase
755
+ # Don't save product and return from method if there is already saved item with the same sku:
756
+ return unless unique?(:sku, item[:sku])
757
+
758
+ # ...
759
+ save_to "results.json", item, format: :json
760
+ end
761
+ end
762
+ ```
763
+
764
+ The `unique?` helper works quite simply:
765
+
766
+ ```ruby
767
+ # Check the string "http://example.com" in the `url` scope for the first time:
768
+ unique?(:url, "http://example.com")
769
+ # => true
770
+
771
+ # Try again:
772
+ unique?(:url, "http://example.com")
773
+ # => false
774
+ ```
775
+
776
+ To check something for uniqueness, you need to provide a scope:
777
+
778
+ ```ruby
779
+ # `product_url` scope
780
+ unique?(:product_url, "http://example.com/product_1")
781
+
782
+ # `id` scope
783
+ unique?(:id, 324234232)
784
+
785
+ # `custom` scope
786
+ unique?(:custom, "Lorem Ipsum")
787
+ ```
788
+
789
+ #### Automatically skip all duplicated requests urls
790
+
791
+ It is possible to automatically skip all already visited urls when calling the `request_to` method, using the [@config](#all-available-config-options) option `skip_duplicate_requests: true`. With this option, all already visited urls will be automatically skipped. Also check the [@config](#all-available-config-options) for additional options of this setting.
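+
+ A minimal sketch enabling the option (the class name, urls and selectors are illustrative):
+
+ ```ruby
+ class CatalogSpider < Kimurai::Base
+   @engine = :mechanize
+   @start_urls = ["https://example.com/catalog"]
+   @config = {
+     # Urls already visited via request_to will be skipped automatically:
+     skip_duplicate_requests: true
+   }
+
+   def parse(response, url:, data: {})
+     response.xpath("//a[@class='product']").each do |a|
+       # Even if the same product link appears several times, it is requested only once:
+       request_to :parse_product, url: absolute_url(a[:href], base: url)
+     end
+   end
+
+   def parse_product(response, url:, data: {})
+     # ...
+   end
+ end
+ ```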
792
+
793
+ #### `storage` object
794
+
795
+ The `unique?` method is just an alias for `storage#unique?`. Storage has several methods (a short usage sketch follows the list):
796
+
797
+ * `#all` - returns the storage hash, where keys are the existing scopes.
798
+ * `#include?(scope, value)` - returns `true` if the value exists in the given scope, and `false` if not
799
+ * `#add(scope, value)` - adds the value to the scope
800
+ * `#unique?(scope, value)` - the method described above; returns `false` if the value already exists in the scope, otherwise adds the value to the scope and returns `true`.
801
+ * `#clear!` - resets the whole storage by deleting all values from all scopes.
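+
+ A short usage sketch (the scope and values are illustrative; `storage` is accessible from spider instance methods, just like `unique?`):
+
+ ```ruby
+ def parse_product(response, url:, data: {})
+   sku = response.xpath("//span[@itemprop='sku']").text.strip
+
+   storage.add(:sku, sku)      # remember the sku in the `sku` scope
+   storage.include?(:sku, sku) # => true, it is stored now
+   storage.all                 # => hash of all scopes with their stored values
+   # storage.clear!            # would wipe all scopes
+ end
+ ```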
802
+
803
+
804
+ ### Handle request errors
805
+ It is quite common that some pages of a crawled website return a response code other than `200 OK`. In such cases, the `request_to` method (or `browser.visit`) can raise an exception. Kimurai provides the `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
806
+
807
+ #### skip_request_errors
808
+ You can automatically skip some errors while requesting a page using the `skip_request_errors` [config](#spider-config) option. If the raised error matches one of the errors in the list, the error will be caught and the request will be skipped. It is a good idea to skip errors like NotFound (404), etc.
809
+
810
+ The format for the option is an array whose elements are error classes and/or hashes. You can use the _hash_ format for more flexibility:
811
+
812
+ ```ruby
813
+ @config = {
814
+ skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }]
815
+ }
816
+ ```
817
+ In this case, the provided `message:` will be compared with the full error message using `String#include?`. You can also use a regex instead: `{ error: RuntimeError, message: /404|403/ }`.
818
+
819
+ #### retry_request_errors
820
+ You can automatically retry some errors with a few attempts while requesting a page using the `retry_request_errors` [config](#spider-config) option. If the raised error matches one of the errors in the list, the error will be caught and the request will be processed again after a delay.
821
+
822
+ There are 3 attempts: the first with a delay of _15 sec_, the second with a delay of _30 sec_, and the third with a delay of _45 sec_. If after 3 attempts there is still an exception, the exception will be raised. It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
823
+
824
+ The format for the option is the same as for the `skip_request_errors` option.
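+
+ For instance, a minimal sketch (the exact error classes to list depend on your engine and the website):
+
+ ```ruby
+ @config = {
+   # Retry on read timeouts; also retry RuntimeError, but only when the message mentions a 502:
+   retry_request_errors: [Net::ReadTimeout, { error: RuntimeError, message: "502" }]
+ }
+ ```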
825
+
826
+ If you would like to skip (not raise) the error after all retries have been exhausted, you can specify the `skip_on_failure: true` option:
827
+
828
+ ```ruby
829
+ @config = {
830
+ retry_request_errors: [{ error: RuntimeError, skip_on_failure: true }]
831
+ }
832
+ ```
833
+
834
+ ### Logging custom events
835
+
836
+ It is possible to save custom messages to the [run_info](#open_spider-and-close_spider-callbacks) hash using the `add_event('Some message')` method. This feature helps you keep track of important things which happened during crawling without checking the whole spider log (in case you're logging these messages using `logger`). Example:
837
+
838
+ ```ruby
839
+ def parse_product(response, url:, data: {})
840
+ unless response.at_xpath("//path/to/add_to_cart_button")
841
+ add_event("Product is sold") and return
842
+ end
843
+
844
+ # ...
845
+ end
846
+ ```
847
+
848
+ ```
849
+ ...
850
+ I, [2018-11-28 22:20:19 +0400#7402] [M: 47156576560640] INFO -- example_spider: Spider: new event (scope: custom): Product is sold
851
+ ...
852
+ I, [2018-11-28 22:20:19 +0400#7402] [M: 47156576560640] INFO -- example_spider: Spider: stopped: {:events=>{:custom=>{"Product is sold"=>1}}}
853
+ ```
854
+
855
+ ### `open_spider` and `close_spider` callbacks
856
+
857
+ You can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action before the spider starts or after the spider has been stopped:
858
+
859
+ ```ruby
860
+ require 'kimurai'
861
+
862
+ class ExampleSpider < Kimurai::Base
863
+ @name = "example_spider"
864
+ @engine = :selenium_chrome
865
+ @start_urls = ["https://example.com/"]
866
+
867
+ def self.open_spider
868
+ logger.info "> Starting..."
869
+ end
870
+
871
+ def self.close_spider
872
+ logger.info "> Stopped!"
873
+ end
874
+
875
+ def parse(response, url:, data: {})
876
+ logger.info "> Scraping..."
877
+ end
878
+ end
879
+
880
+ ExampleSpider.crawl!
881
+ ```
882
+
883
+ <details/>
884
+ <summary>Output</summary>
885
+
886
+ ```
887
+ I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] INFO -- example_spider: Spider: started: example_spider
888
+ I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] INFO -- example_spider: > Starting...
889
+ D, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): created browser instance
890
+ D, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
891
+ I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] INFO -- example_spider: Browser: started get request to: https://example.com/
892
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Browser: finished get request to: https://example.com/
893
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Info: visits: requests: 1, responses: 1
894
+ D, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: Browser: driver.current_memory: 82415
895
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: > Scraping...
896
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Browser: driver selenium_chrome has been destroyed
897
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: > Stopped!
898
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Spider: stopped: {:spider_name=>"example_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 14:26:32 +0400, :stop_time=>2018-08-22 14:26:34 +0400, :running_time=>"1s", :visits=>{:requests=>1, :responses=>1}, :error=>nil}
899
+ ```
900
+ </details><br>
901
+
902
+ Inside the `open_spider` and `close_spider` class methods, the `run_info` method is available, which contains useful information about the spider state:
903
+
904
+ ```ruby
905
+ 11: def self.open_spider
906
+ => 12: binding.pry
907
+ 13: end
908
+
909
+ [1] pry(example_spider)> run_info
910
+ => {
911
+ :spider_name=>"example_spider",
912
+ :status=>:running,
913
+ :environment=>"development",
914
+ :start_time=>2018-08-05 23:32:00 +0400,
915
+ :stop_time=>nil,
916
+ :running_time=>nil,
917
+ :visits=>{:requests=>0, :responses=>0},
918
+ :error=>nil
919
+ }
920
+ ```
921
+
922
+ Inside `close_spider`, `run_info` will be updated:
923
+
924
+ ```ruby
925
+ 15: def self.close_spider
926
+ => 16: binding.pry
927
+ 17: end
928
+
929
+ [1] pry(example_spider)> run_info
930
+ => {
931
+ :spider_name=>"example_spider",
932
+ :status=>:completed,
933
+ :environment=>"development",
934
+ :start_time=>2018-08-05 23:32:00 +0400,
935
+ :stop_time=>2018-08-05 23:32:06 +0400,
936
+ :running_time=>6.214,
937
+ :visits=>{:requests=>1, :responses=>1},
938
+ :error=>nil
939
+ }
940
+ ```
941
+
942
+ `run_info[:status]` helps to determine whether the spider finished successfully or failed (possible values: `:completed`, `:failed`):
943
+
944
+ ```ruby
945
+ class ExampleSpider < Kimurai::Base
946
+ @name = "example_spider"
947
+ @engine = :selenium_chrome
948
+ @start_urls = ["https://example.com/"]
949
+
950
+ def self.close_spider
951
+ puts ">>> run info: #{run_info}"
952
+ end
953
+
954
+ def parse(response, url:, data: {})
955
+ logger.info "> Scraping..."
956
+ # Let's try to strip nil:
957
+ nil.strip
958
+ end
959
+ end
960
+ ```
961
+
962
+ <details/>
963
+ <summary>Output</summary>
964
+
965
+ ```
966
+ I, [2018-08-22 14:34:24 +0400#8459] [M: 47020523644400] INFO -- example_spider: Spider: started: example_spider
967
+ D, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): created browser instance
968
+ D, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
969
+ I, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] INFO -- example_spider: Browser: started get request to: https://example.com/
970
+ I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: Browser: finished get request to: https://example.com/
971
+ I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: Info: visits: requests: 1, responses: 1
972
+ D, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: Browser: driver.current_memory: 83351
973
+ I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: > Scraping...
974
+ I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: Browser: driver selenium_chrome has been destroyed
975
+
976
+ >>> run info: {:spider_name=>"example_spider", :status=>:failed, :environment=>"development", :start_time=>2018-08-22 14:34:24 +0400, :stop_time=>2018-08-22 14:34:26 +0400, :running_time=>2.01, :visits=>{:requests=>1, :responses=>1}, :error=>"#<NoMethodError: undefined method `strip' for nil:NilClass>"}
977
+
978
+ F, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] FATAL -- example_spider: Spider: stopped: {:spider_name=>"example_spider", :status=>:failed, :environment=>"development", :start_time=>2018-08-22 14:34:24 +0400, :stop_time=>2018-08-22 14:34:26 +0400, :running_time=>"2s", :visits=>{:requests=>1, :responses=>1}, :error=>"#<NoMethodError: undefined method `strip' for nil:NilClass>"}
979
+ Traceback (most recent call last):
980
+ 6: from example_spider.rb:19:in `<main>'
981
+ 5: from /home/victor/code/kimurai/lib/kimurai/base.rb:127:in `crawl!'
982
+ 4: from /home/victor/code/kimurai/lib/kimurai/base.rb:127:in `each'
983
+ 3: from /home/victor/code/kimurai/lib/kimurai/base.rb:128:in `block in crawl!'
984
+ 2: from /home/victor/code/kimurai/lib/kimurai/base.rb:185:in `request_to'
985
+ 1: from /home/victor/code/kimurai/lib/kimurai/base.rb:185:in `public_send'
986
+ example_spider.rb:15:in `parse': undefined method `strip' for nil:NilClass (NoMethodError)
987
+ ```
988
+ </details><br>
989
+
990
+ **Usage example:** if the spider finished successfully, send a JSON file with scraped items to a remote FTP location; otherwise (if the spider failed), skip the incomplete results and send an email/Slack notification about it:
991
+
992
+ <details/>
993
+ <summary>Example</summary>
994
+
995
+ You can also use the additional methods `completed?` or `failed?`:
996
+
997
+ ```ruby
998
+ class Spider < Kimurai::Base
999
+ @engine = :selenium_chrome
1000
+ @start_urls = ["https://example.com/"]
1001
+
1002
+ def self.close_spider
1003
+ if completed?
1004
+ send_file_to_ftp("results.json")
1005
+ else
1006
+ send_error_notification(run_info[:error])
1007
+ end
1008
+ end
1009
+
1010
+ def self.send_file_to_ftp(file_path)
1011
+ # ...
1012
+ end
1013
+
1014
+ def self.send_error_notification(error)
1015
+ # ...
1016
+ end
1017
+
1018
+ # ...
1019
+
1020
+ def parse_item(response, url:, data: {})
1021
+ item = {}
1022
+ # ...
1023
+
1024
+ save_to "results.json", item, format: :json
1025
+ end
1026
+ end
1027
+ ```
1028
+ </details>
1029
+
1030
+
1031
+ ### `KIMURAI_ENV`
1032
+ Kimurai has environments; the default is `development`. To provide a custom environment, pass the `KIMURAI_ENV` ENV variable before the command: `$ KIMURAI_ENV=production ruby spider.rb`. To access the current environment, there is the `Kimurai.env` method.
1033
+
1034
+ Usage example:
1035
+ ```ruby
1036
+ class Spider < Kimurai::Base
1037
+ @engine = :selenium_chrome
1038
+ @start_urls = ["https://example.com/"]
1039
+
1040
+ def self.close_spider
1041
+ if failed? && Kimurai.env == "production"
1042
+ send_error_notification(run_info[:error])
1043
+ else
1044
+ # Do nothing
1045
+ end
1046
+ end
1047
+
1048
+ # ...
1049
+ end
1050
+ ```
1051
+
1052
+ ### Parallel crawling using `in_parallel`
1053
+ Kimurai can process web pages concurrently in a single line: `in_parallel(:parse_product, urls, threads: 3)`, where `:parse_product` is the method to process, `urls` is an array of urls to crawl, and `threads:` is the number of threads:
1054
+
1055
+ ```ruby
1056
+ # amazon_spider.rb
1057
+ require 'kimurai'
1058
+
1059
+ class AmazonSpider < Kimurai::Base
1060
+ @name = "amazon_spider"
1061
+ @engine = :mechanize
1062
+ @start_urls = ["https://www.amazon.com/"]
1063
+
1064
+ def parse(response, url:, data: {})
1065
+ browser.fill_in "field-keywords", with: "Web Scraping Books"
1066
+ browser.click_on "Go"
1067
+
1068
+ # Walk through pagination and collect products urls:
1069
+ urls = []
1070
+ loop do
1071
+ response = browser.current_response
1072
+ response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
1073
+ urls << a[:href].sub(/ref=.+/, "")
1074
+ end
1075
+
1076
+ browser.find(:xpath, "//a[@id='pagnNextLink']", wait: 1).click rescue break
1077
+ end
1078
+
1079
+ # Process all collected urls concurrently within 3 threads:
1080
+ in_parallel(:parse_book_page, urls, threads: 3)
1081
+ end
1082
+
1083
+ def parse_book_page(response, url:, data: {})
1084
+ item = {}
1085
+
1086
+ item[:title] = response.xpath("//h1/span[@id]").text.squish
1087
+ item[:url] = url
1088
+ item[:price] = response.xpath("(//span[contains(@class, 'a-color-price')])[1]").text.squish.presence
1089
+ item[:publisher] = response.xpath("//h2[text()='Product details']/following::b[text()='Publisher:']/following-sibling::text()[1]").text.squish.presence
1090
+
1091
+ save_to "books.json", item, format: :pretty_json
1092
+ end
1093
+ end
1094
+
1095
+ AmazonSpider.crawl!
1096
+ ```
1097
+
1098
+ <details/>
1099
+ <summary>Run: <code>$ ruby amazon_spider.rb</code></summary>
1100
+
1101
+ ```
1102
+ I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: started: amazon_spider
1103
+ D, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1104
+ I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
1105
+ I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
1106
+ I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Info: visits: requests: 1, responses: 1
1107
+
1108
+ I, [2018-08-22 14:48:43 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: starting processing 52 urls within 3 threads
1109
+ D, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1110
+ I, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
1111
+ D, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1112
+ I, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
1113
+ D, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1114
+ I, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
1115
+ I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
1116
+ I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Info: visits: requests: 4, responses: 2
1117
+ I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/
1118
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
1119
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 5, responses: 3
1120
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/
1121
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
1122
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 6, responses: 4
1123
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Excel-Effective-Scrapes-ebook/dp/B01CMMJGZ8/
1124
+
1125
+ ...
1126
+
1127
+ I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Info: visits: requests: 51, responses: 49
1128
+ I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1129
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Ice-Life-Bill-Rayburn-ebook/dp/B00C0NF1L8/
1130
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 51, responses: 50
1131
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
1132
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Php-architects-Guide-Scraping-Author/dp/B010DTKYY4/
1133
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 52, responses: 51
1134
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
1135
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
1136
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 53, responses: 52
1137
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1138
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
1139
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 53, responses: 53
1140
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1141
+
1142
+ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: stopped processing 52 urls within 3 threads, total time: 29s
1143
+ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1144
+
1145
+ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 14:48:37 +0400, :stop_time=>2018-08-22 14:49:12 +0400, :running_time=>"35s", :visits=>{:requests=>53, :responses=>53}, :error=>nil}
1146
+
1147
+ ```
1148
+ </details>
1149
+
1150
+ <details/>
1151
+ <summary>books.json</summary>
1152
+
1153
+ ```json
1154
+ [
1155
+ {
1156
+ "title": "Web Scraping with Python: Collecting More Data from the Modern Web2nd Edition",
1157
+ "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/",
1158
+ "price": "$26.94",
1159
+ "publisher": "O'Reilly Media; 2 edition (April 14, 2018)",
1160
+ "position": 1
1161
+ },
1162
+ {
1163
+ "title": "Python Web Scraping Cookbook: Over 90 proven recipes to get you scraping with Python, micro services, Docker and AWS",
1164
+ "url": "https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/",
1165
+ "price": "$39.99",
1166
+ "publisher": "Packt Publishing - ebooks Account (February 9, 2018)",
1167
+ "position": 2
1168
+ },
1169
+ {
1170
+ "title": "Web Scraping with Python: Collecting Data from the Modern Web1st Edition",
1171
+ "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/",
1172
+ "price": "$15.75",
1173
+ "publisher": "O'Reilly Media; 1 edition (July 24, 2015)",
1174
+ "position": 3
1175
+ },
1176
+
1177
+ ...
1178
+
1179
+ {
1180
+ "title": "Instant Web Scraping with Java by Ryan Mitchell (2013-08-26)",
1181
+ "url": "https://www.amazon.com/Instant-Scraping-Java-Mitchell-2013-08-26/dp/B01FEM76X2/",
1182
+ "price": "$35.82",
1183
+ "publisher": "Packt Publishing (2013-08-26) (1896)",
1184
+ "position": 52
1185
+ }
1186
+ ]
1187
+ ```
1188
+ </details><br>
1189
+
1190
+ > Note that [save_to](#save_to-helper) and [unique?](#skip-duplicates-unique-helper) helpers are thread-safe (protected by [Mutex](https://ruby-doc.org/core-2.5.1/Mutex.html)) and can be freely used inside threads.
1191
+
1192
+ `in_parallel` can take additional options:
1193
+ * `data:` pass a custom data hash along with the urls: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
1194
+ * `delay:` set a delay between requests: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be an `Integer`, a `Float` or a `Range` (`2..5`). In case of a range, the delay will be chosen randomly for each request: `rand(2..5) # => 3`
1195
+ * `engine:` set a custom engine instead of the default one: `in_parallel(:method, urls, threads: 3, engine: :poltergeist_phantomjs)`
1196
+ * `config:` pass custom options to the config (see [config section](#spider-config))
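+
+ For example, a minimal sketch (with a hypothetical spider and selectors) combining several of these options:
+
+ ```ruby
+ class CategorySpider < Kimurai::Base
+   @engine = :mechanize
+   @start_urls = ["https://example.com/categories"]
+
+   def parse(response, url:, data: {})
+     urls = response.xpath("//a[@class='category']").map { |a| a[:href] }
+
+     # 2 threads, a random 2..5 seconds delay before each request, a custom engine
+     # for the child requests, and a custom data hash passed to every call:
+     in_parallel(:parse_category, urls,
+       threads: 2, delay: 2..5, engine: :poltergeist_phantomjs, data: { source: "categories" })
+   end
+
+   def parse_category(response, url:, data: {})
+     # data[:source] # => "categories"
+   end
+ end
+ ```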
1197
+
1198
+ ### Active Support included
1199
+
1200
+ You can use all the power of familiar [Rails core-ext methods](https://guides.rubyonrails.org/active_support_core_extensions.html#loading-all-core-extensions) for scraping inside Kimurai. Especially take a look at [squish](https://apidock.com/rails/String/squish), [truncate_words](https://apidock.com/rails/String/truncate_words), [titleize](https://apidock.com/rails/String/titleize), [remove](https://apidock.com/rails/String/remove), [present?](https://guides.rubyonrails.org/active_support_core_extensions.html#blank-questionmark-and-present-questionmark) and [presence](https://guides.rubyonrails.org/active_support_core_extensions.html#presence).
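+
+ A quick sketch (the selectors are hypothetical) of how these helpers typically appear in a parse method:
+
+ ```ruby
+ def parse(response, url:, data: {})
+   item = {}
+   item[:title]   = response.xpath("//h1").text.squish                      # collapse whitespace
+   item[:brand]   = response.xpath("//span[@class='brand']").text.squish.presence  # nil if blank
+   item[:summary] = response.xpath("//p[@class='summary']").text.squish.truncate_words(30)
+   item
+ end
+ ```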
1201
+
1202
+ ### Schedule spiders using Cron
1203
+
1204
+ 1) Inside the spider directory, generate a [Whenever](https://github.com/javan/whenever) config: `$ kimurai generate schedule`.
1205
+
1206
+ <details/>
1207
+ <summary><code>schedule.rb</code></summary>
1208
+
1209
+ ```ruby
1210
+ ### Settings ###
1211
+ require 'tzinfo'
1212
+
1213
+ # Export current PATH to the cron
1214
+ env :PATH, ENV["PATH"]
1215
+
1216
+ # Use 24 hour format when using `at:` option
1217
+ set :chronic_options, hours24: true
1218
+
1219
+ # Use local_to_utc helper to setup execution time using your local timezone instead
1220
+ # of server's timezone (which is probably and should be UTC, to check run `$ timedatectl`).
1221
+ # Also maybe you'll want to set same timezone in kimurai as well (use `Kimurai.configuration.time_zone =` for that),
1222
+ # to have spiders logs in a specific time zone format.
1223
+ # Example usage of helper:
1224
+ # every 1.day, at: local_to_utc("7:00", zone: "Europe/Moscow") do
1225
+ # crawl "google_spider.com", output: "log/google_spider.com.log"
1226
+ # end
1227
+ def local_to_utc(time_string, zone:)
1228
+ TZInfo::Timezone.get(zone).local_to_utc(Time.parse(time_string))
1229
+ end
1230
+
1231
+ # Note: by default Whenever exports cron commands with :environment == "production".
1232
+ # Note: Whenever can only append log data to a log file (>>). If you want
1233
+ # to overwrite (>) log file before each run, pass lambda:
1234
+ # crawl "google_spider.com", output: -> { "> log/google_spider.com.log 2>&1" }
1235
+
1236
+ # Project job types
1237
+ job_type :crawl, "cd :path && KIMURAI_ENV=:environment bundle exec kimurai crawl :task :output"
1238
+ job_type :runner, "cd :path && KIMURAI_ENV=:environment bundle exec kimurai runner --jobs :task :output"
1239
+
1240
+ # Single file job type
1241
+ job_type :single, "cd :path && KIMURAI_ENV=:environment ruby :task :output"
1242
+ # Single with bundle exec
1243
+ job_type :single_bundle, "cd :path && KIMURAI_ENV=:environment bundle exec ruby :task :output"
1244
+
1245
+ ### Schedule ###
1246
+ # Usage (check examples here https://github.com/javan/whenever#example-schedulerb-file):
1247
+ # every 1.day do
1248
+ # Example to schedule a single spider in the project:
1249
+ # crawl "google_spider.com", output: "log/google_spider.com.log"
1250
+
1251
+ # Example to schedule all spiders in the project using runner. Each spider will write
1252
+ # its own output to the `log/spider_name.log` file (handled by the runner itself).
1253
+ # Runner output will be written to log/runner.log file.
1254
+ # The number argument is the count of concurrent jobs:
1255
+ # runner 3, output:"log/runner.log"
1256
+
1257
+ # Example to schedule single spider (without project):
1258
+ # single "single_spider.rb", output: "single_spider.log"
1259
+ # end
1260
+
1261
+ ### How to set a cron schedule ###
1262
+ # Run: `$ whenever --update-crontab --load-file config/schedule.rb`.
1263
+ # If you don't have whenever command, install the gem: `$ gem install whenever`.
1264
+
1265
+ ### How to cancel a schedule ###
1266
+ # Run: `$ whenever --clear-crontab --load-file config/schedule.rb`.
1267
+ ```
1268
+ </details><br>
1269
+
1270
+ 2) Add the following code at the bottom of `schedule.rb`:
1271
+
1272
+ ```ruby
1273
+ every 1.day, at: "7:00" do
1274
+ single "example_spider.rb", output: "example_spider.log"
1275
+ end
1276
+ ```
1277
+
1278
+ 3) Run: `$ whenever --update-crontab --load-file schedule.rb`. Done!
1279
+
1280
+ You can check Whenever examples [here](https://github.com/javan/whenever#example-schedulerb-file). To cancel the schedule, run: `$ whenever --clear-crontab --load-file schedule.rb`.
1281
+
1282
+ ### Configuration options
1283
+ You can configure several options using the `configure` block:
1284
+
1285
+ ```ruby
1286
+ Kimurai.configure do |config|
1287
+ # Default logger has colored mode in development.
1288
+ # If you would like to disable it, set `colorize_logger` to false.
1289
+ # config.colorize_logger = false
1290
+
1291
+ # Logger level for default logger:
1292
+ # config.log_level = :info
1293
+
1294
+ # Custom logger:
1295
+ # config.logger = Logger.new(STDOUT)
1296
+
1297
+ # Custom time zone (for logs):
1298
+ # config.time_zone = "UTC"
1299
+ # config.time_zone = "Europe/Moscow"
1300
+
1301
+ # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
1302
+ # config.selenium_chrome_path = "/usr/bin/chromium-browser"
1303
+ # Provide custom selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
1304
+ # config.chromedriver_path = "~/.local/bin/chromedriver"
1305
+ end
1306
+ ```
1307
+
1308
+ ### Using Kimurai inside existing Ruby application
1309
+
1310
+ You can integrate Kimurai spiders (which are just Ruby classes) into an existing Ruby application like Rails or Sinatra, and run them using background jobs, for example. Check the following info to understand how spiders run:
1311
+
1312
+ #### `.crawl!` method
1313
+
1314
+ `.crawl!` (class method) performs a _full run_ of a particular spider. This method returns `run_info` if the run was successful, or an exception if something went wrong.
1315
+
1316
+ ```ruby
1317
+ class ExampleSpider < Kimurai::Base
1318
+ @name = "example_spider"
1319
+ @engine = :mechanize
1320
+ @start_urls = ["https://example.com/"]
1321
+
1322
+ def parse(response, url:, data: {})
1323
+ title = response.xpath("//title").text.squish
1324
+ end
1325
+ end
1326
+
1327
+ ExampleSpider.crawl!
1328
+ # => { :spider_name => "example_spider", :status => :completed, :environment => "development", :start_time => 2018-08-22 18:20:16 +0400, :stop_time => 2018-08-22 18:20:17 +0400, :running_time => 1.216, :visits => { :requests => 1, :responses => 1 }, :items => { :sent => 0, :processed => 0 }, :error => nil }
1329
+ ```
1330
+
1331
+ You can't `.crawl!` a spider in a different thread while it's still running (because spider instances store shared data in the `@run_info` class variable while crawling):
1332
+
1333
+ ```ruby
1334
+ 2.times do |i|
1335
+ Thread.new { p i, ExampleSpider.crawl! }
1336
+ end # =>
1337
+
1338
+ # 1
1339
+ # false
1340
+
1341
+ # 0
1342
+ # {:spider_name=>"example_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 18:49:22 +0400, :stop_time=>2018-08-22 18:49:23 +0400, :running_time=>0.801, :visits=>{:requests=>1, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :error=>nil}
1343
+ ```
1344
+
1345
+ So what if you don't care about stats and just want to process a request to a particular spider method and get that method's return value? Use `.parse!` instead:
1346
+
1347
+ #### `.parse!(:method_name, url:)` method
1348
+
1349
+ `.parse!` (class method) creates a new spider instance and performs a request to the given method with the given url. The method's return value is returned:
1350
+
1351
+ ```ruby
1352
+ class ExampleSpider < Kimurai::Base
1353
+ @name = "example_spider"
1354
+ @engine = :mechanize
1355
+ @start_urls = ["https://example.com/"]
1356
+
1357
+ def parse(response, url:, data: {})
1358
+ title = response.xpath("//title").text.squish
1359
+ end
1360
+ end
1361
+
1362
+ ExampleSpider.parse!(:parse, url: "https://example.com/")
1363
+ # => "Example Domain"
1364
+ ```
1365
+
1366
+ Like `.crawl!`, the `.parse!` method takes care of the browser instance and kills it (`browser.destroy_driver!`) before returning the value. Unlike `.crawl!`, `.parse!` can be called from different threads at the same time:
1367
+
1368
+ ```ruby
1369
+ urls = ["https://www.google.com/", "https://www.reddit.com/", "https://en.wikipedia.org/"]
1370
+
1371
+ urls.each do |url|
1372
+ Thread.new { p ExampleSpider.parse!(:parse, url: url) }
1373
+ end # =>
1374
+
1375
+ # "Google"
1376
+ # "Wikipedia, the free encyclopedia"
1377
+ # "reddit: the front page of the internetHotHot"
1378
+ ```
1379
+
1380
+ Keep in mind that the [save_to](#save_to-helper) and [unique?](#skip-duplicates) helpers are not thread-safe when using the `.parse!` method.
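+
+ Putting it together, here is a hedged sketch of running a spider from a background job (it assumes Rails with ActiveJob and the `ExampleSpider` class above; `ScrapeExampleJob` is illustrative and not part of Kimurai):
+
+ ```ruby
+ class ScrapeExampleJob < ApplicationJob
+   queue_as :default
+
+   def perform
+     # .crawl! returns the run_info hash when the run was successful (see above):
+     run_info = ExampleSpider.crawl!
+     Rails.logger.info "example_spider finished: #{run_info.inspect}"
+   end
+ end
+ ```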
1381
+
1382
+ #### `Kimurai.list` and `Kimurai.find_by_name()`
1383
+
1384
+ ```ruby
1385
+ class GoogleSpider < Kimurai::Base
1386
+ @name = "google_spider"
1387
+ end
1388
+
1389
+ class RedditSpider < Kimurai::Base
1390
+ @name = "reddit_spider"
1391
+ end
1392
+
1393
+ class WikipediaSpider < Kimurai::Base
1394
+ @name = "wikipedia_spider"
1395
+ end
1396
+
1397
+ # To get the list of all available spider classes:
1398
+ Kimurai.list
1399
+ # => {"google_spider"=>GoogleSpider, "reddit_spider"=>RedditSpider, "wikipedia_spider"=>WikipediaSpider}
1400
+
1401
+ # To find a particular spider class by its name:
1402
+ Kimurai.find_by_name("reddit_spider")
1403
+ # => RedditSpider
1404
+ ```
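+
+ These lookups combine naturally with the class methods above. For example (illustrative usage, assuming the spiders define start urls and a `parse` method):
+
+ ```ruby
+ # Crawl a spider found by its name:
+ Kimurai.find_by_name("reddit_spider").crawl!
+
+ # Or run every registered spider one by one:
+ Kimurai.list.each_value(&:crawl!)
+ ```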
1405
+
1406
+
1407
+ ### Automated server setup and deployment
1408
+ > **EXPERIMENTAL**
1409
+
1410
+ #### Setup
1411
+ You can automatically set up the [required environment](#installation) for Kimurai on a remote server (currently only Ubuntu Server 18.04 is supported) using the `$ kimurai setup` command. `setup` will install the latest Ruby with Rbenv, browsers with webdrivers and, additionally, database clients (clients only) for MySQL, Postgres and MongoDB (so you can connect to a remote database from Ruby).
1412
+
1413
+ > To perform remote server setup, [Ansible](https://github.com/ansible/ansible) is required **on the desktop** machine (to install: Ubuntu: `$ sudo apt install ansible`, Mac OS X: `$ brew install ansible`)
1414
+
1415
+ > It's recommended to use a regular user to set up the server, not `root`. To create a new user, log in to the server (`$ ssh root@your_server_ip`), type `$ adduser username` to create the user, and `$ gpasswd -a username sudo` to add the new user to the sudo group.
1416
+
1417
+ Example:
1418
+
1419
+ ```bash
1420
+ $ kimurai setup deploy@123.123.123.123 --ask-sudo --ssh-key-path path/to/private_key
1421
+ ```
1422
+
1423
+ CLI options:
1424
+ * `--ask-sudo` pass this option to be asked for the sudo (user) password, needed for the system-wide installation of packages (`apt install`)
1425
+ * `--ssh-key-path path/to/private_key` authorize on the server using a private SSH key. You can omit it if the required key is already [added to the keychain](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/#adding-your-ssh-key-to-the-ssh-agent) on your desktop (Ansible uses [SSH agent forwarding](https://developer.github.com/v3/guides/using-ssh-agent-forwarding/))
1426
+ * `--ask-auth-pass` authorize on the server using the user password; an alternative to `--ssh-key-path`.
1427
+ * `-p port_number` custom port for ssh connection (`-p 2222`)
1428
+
1429
+ > You can check setup playbook [here](lib/kimurai/automation/setup.yml)
1430
+
1431
+ #### Deploy
1432
+
1433
+ After a successful `setup` you can deploy a spider to the remote server using the `$ kimurai deploy` command. Each deploy performs several tasks: 1) pull the repo from the remote origin into the `~/repo_name` user directory, 2) run `bundle install`, 3) update the crontab with `whenever --update-crontab` (to refresh the spider schedule from the schedule.rb file).
1434
+
1435
+ Before `deploy`, make sure that inside the spider directory you have: 1) a git repository with a remote origin (Bitbucket, GitHub, etc.), 2) a `Gemfile`, 3) schedule.rb inside the `config` subfolder (`config/schedule.rb`).
1436
+
1437
+ Example:
1438
+
1439
+ ```bash
1440
+ $ kimurai deploy deploy@123.123.123.123 --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key
1441
+ ```
1442
+
1443
+ CLI options: _same as for the [setup](#setup) command_ (except `--ask-sudo`), plus
1444
+ * `--repo-url` provide a custom repo url (`--repo-url git@bitbucket.org:username/repo_name.git`); otherwise the current `origin/master` will be used (the output of `$ git remote get-url origin`)
1445
+ * `--repo-key-path` if the git repository is private, authorization is required to pull the code on the remote server. Use this option to provide the repository's private SSH key. You can omit it if the required key is already added to the keychain on your desktop (same as with the `--ssh-key-path` option)
1446
+
1447
+ > You can check deploy playbook [here](lib/kimurai/automation/deploy.yml)
1448
+
1449
+ ## Spider `@config`
1450
+
1451
+ Using `@config` you can set several options for a spider, like proxy, user-agent, default cookies/headers, delay between requests, browser **memory control** and so on:
1452
+
1453
+ ```ruby
1454
+ class Spider < Kimurai::Base
1455
+ USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
1456
+ PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]
1457
+
1458
+ @engine = :poltergeist_phantomjs
1459
+ @start_urls = ["https://example.com/"]
1460
+ @config = {
1461
+ headers: { "custom_header" => "custom_value" },
1462
+ cookies: [{ name: "cookie_name", value: "cookie_value", domain: ".example.com" }],
1463
+ user_agent: -> { USER_AGENTS.sample },
1464
+ proxy: -> { PROXIES.sample },
1465
+ window_size: [1366, 768],
1466
+ disable_images: true,
1467
+ restart_if: {
1468
+ # Restart browser if provided memory limit (in kilobytes) is exceeded:
1469
+ memory_limit: 350_000
1470
+ },
1471
+ before_request: {
1472
+ # Change user agent before each request:
1473
+ change_user_agent: true,
1474
+ # Change proxy before each request:
1475
+ change_proxy: true,
1476
+ # Clear all cookies and set default cookies (if provided) before each request:
1477
+ clear_and_set_cookies: true,
1478
+ # Process delay before each request:
1479
+ delay: 1..3
1480
+ }
1481
+ }
1482
+
1483
+ def parse(response, url:, data: {})
1484
+ # ...
1485
+ end
1486
+ end
1487
+ ```
1488
+
1489
+ ### All available `@config` options
1490
+
1491
+ ```ruby
1492
+ @config = {
1493
+ # Custom headers, format: hash. Example: { "some header" => "some value", "another header" => "another value" }
1494
+ # Works only for :mechanize and :poltergeist_phantomjs engines (Selenium doesn't allow setting/getting headers)
1495
+ headers: {},
1496
+
1497
+ # Custom User Agent, format: string or lambda.
1498
+ # Use lambda if you want to rotate user agents before each run:
1499
+ # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
1500
+ # Works for all engines
1501
+ user_agent: "Mozilla/5.0 Firefox/61.0",
1502
+
1503
+ # Custom cookies, format: array of hashes.
1504
+ # Format for a single cookie: { name: "cookie name", value: "cookie value", domain: ".example.com" }
1505
+ # Works for all engines
1506
+ cookies: [],
1507
+
1508
+ # Proxy, format: string or lambda. Format of a proxy string: "ip:port:protocol:user:password"
1509
+ # `protocol` can be http or socks5. User and password are optional.
1510
+ # Use lambda if you want to rotate proxies before each run:
1511
+ # proxy: -> { ARRAY_OF_PROXIES.sample }
1512
+ # Works for all engines, but keep in mind that Selenium drivers don't support proxies
1513
+ # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http)
1514
+ proxy: "3.4.5.6:3128:http:user:pass",
1515
+
1516
+ # If enabled, browser will ignore any https errors. It's handy while using a proxy
1517
+ # with self-signed SSL cert (for example Crawlera or Mitmproxy)
1518
+ # Also, it allows visiting webpages with an expired SSL certificate.
1519
+ # Works for all engines
1520
+ ignore_ssl_errors: true,
1521
+
1522
+ # Custom window size, works for all engines
1523
+ window_size: [1366, 768],
1524
+
1525
+ # Skip images downloading if true, works for all engines
1526
+ disable_images: true,
1527
+
1528
+ # Selenium engines only: headless mode, `:native` or `:virtual_display` (default is :native)
1529
+ # Although native mode has a better performance, virtual display mode
1530
+ # sometimes can be useful. For example, some websites can detect (and block)
1531
+ # headless chrome, so you can use virtual_display mode instead
1532
+ headless_mode: :native,
1533
+
1534
+ # This option tells the browser not to use a proxy for the provided list of domains or IP addresses.
1535
+ # Format: array of strings. Works only for :selenium_firefox and selenium_chrome
1536
+ proxy_bypass_list: [],
1537
+
1538
+ # Option to provide custom SSL certificate. Works only for :poltergeist_phantomjs and :mechanize
1539
+ ssl_cert_path: "path/to/ssl_cert",
1540
+
1541
+ # Inject some JavaScript code to the browser.
1542
+ # Format: array of strings, where each string is a path to JS file.
1543
+ # Works only for poltergeist_phantomjs engine (Selenium doesn't support JS code injection)
1544
+ extensions: ["lib/code_to_inject.js"],
1545
+
1546
+ # Automatically skip duplicated (already visited) urls when using `request_to` method.
1547
+ # Possible values: `true` or `hash` with options.
1548
+ # In case of `true`, all visited urls will be added to the storage's scope `:requests_urls`
1549
+ # and if a url is already contained in this scope, the request will be skipped.
1550
+ # You can configure this setting by providing additional options as hash:
1551
+ # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
1552
+ # `scope:` - use a custom scope instead of `:requests_urls`
1553
+ # `check_only:` - if true, the scope will only be checked for the url; the url will not
1554
+ # be added to the scope if the scope doesn't contain it.
1555
+ # Works for all engines
1556
+ skip_duplicate_requests: true,
1557
+
1558
+ # Automatically skip provided errors while requesting a page.
1559
+ # If raised error matches one of the errors in the list, then this error will be caught,
1560
+ # and request will be skipped.
1561
+ # It is a good idea to skip errors like NotFound(404), etc.
1562
+ # Format: array where elements are error classes or/and hashes. You can use hash format
1563
+ # for more flexibility: `{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }`.
1564
+ # Provided `message:` will be compared with a full error message using `String#include?`. Also
1565
+ # you can use regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
1566
+ skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
1567
+
1568
+ # Automatically retry provided errors with a few attempts while requesting a page.
1569
+ # If raised error matches one of the errors in the list, then this error will be caught
1570
+ # and the request will be processed again within a delay. There are 3 attempts:
1571
+ # first: delay 15 sec, second: delay 30 sec, third: delay 45 sec.
1572
+ # If after 3 attempts there is still an exception, then the exception will be raised.
1573
+ # It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
1574
+ # Format: same like for `skip_request_errors` option.
1575
+ retry_request_errors: [Net::ReadTimeout],
1576
+
1577
+ # Handle page encoding while parsing html response using Nokogiri. There are two modes:
1578
+ # Auto (`:auto`): try to fetch the correct encoding from <meta http-equiv="Content-Type"> or <meta charset> tags
1579
+ # Manual: set the required encoding explicitly, example: `encoding: "GB2312"`
1580
+ # By default this option is unset.
1581
+ encoding: nil,
1582
+
1583
+ # Restart browser if one of the options is true:
1584
+ restart_if: {
1585
+ # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
1586
+ memory_limit: 350_000,
1587
+
1588
+ # Restart browser if provided requests limit is exceeded (works for all engines)
1589
+ requests_limit: 100
1590
+ },
1591
+
1592
+ # Perform several actions before each request:
1593
+ before_request: {
1594
+ # Change proxy before each request. The `proxy:` option above must be present
1595
+ # and be in lambda format. Works only for poltergeist and mechanize engines
1596
+ # (Selenium doesn't support proxy rotation).
1597
+ change_proxy: true,
1598
+
1599
+ # Change user agent before each request. The `user_agent:` option above must be present
1600
+ # and be in lambda format. Works only for poltergeist and mechanize engines
1601
+ # (Selenium doesn't support getting/setting headers).
1602
+ change_user_agent: true,
1603
+
1604
+ # Clear all cookies before each request, works for all engines
1605
+ clear_cookies: true,
1606
+
1607
+ # If you want to clear all cookies + set custom cookies (the `cookies:` option above must be present),
1608
+ # use this option instead (works for all engines)
1609
+ clear_and_set_cookies: true,
1610
+
1611
+ # Global option to set delay between requests.
1612
+ # Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a range,
1613
+ # delay number will be chosen randomly for each request: `rand (2..5) # => 3`
1614
+ delay: 1..3
1615
+ }
1616
+ }
1617
+ ```
1618
+
1619
+ As you can see, most of the options are universal for any engine.
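+
+ For example, a minimal sketch (the spider itself is hypothetical) combining the error-handling and restart options described above:
+
+ ```ruby
+ class ResilientSpider < Kimurai::Base
+   @name = "resilient_spider"
+   @engine = :mechanize
+   @start_urls = ["https://example.com/"]
+   @config = {
+     # Skip pages which respond with 404:
+     skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
+     # Retry flaky network errors a few times before giving up:
+     retry_request_errors: [Net::ReadTimeout, { error: RuntimeError, message: /502|503/ }],
+     # Restart the browser after every 100 requests and skip already visited urls:
+     restart_if: { requests_limit: 100 },
+     skip_duplicate_requests: true
+   }
+
+   def parse(response, url:, data: {})
+     # ...
+   end
+ end
+ ```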
1620
+
1621
+ ### `@config` settings inheritance
1622
+ Settings can be inherited:
1623
+
1624
+ ```ruby
1625
+ class ApplicationSpider < Kimurai::Base
1626
+ @engine = :poltergeist_phantomjs
1627
+ @config = {
1628
+ user_agent: "Firefox",
1629
+ disable_images: true,
1630
+ restart_if: { memory_limit: 350_000 },
1631
+ before_request: { delay: 1..2 }
1632
+ }
1633
+ end
1634
+
1635
+ class CustomSpider < ApplicationSpider
1636
+ @name = "custom_spider"
1637
+ @start_urls = ["https://example.com/"]
1638
+ @config = {
1639
+ before_request: { delay: 4..6 }
1640
+ }
1641
+
1642
+ def parse(response, url:, data: {})
1643
+ # ...
1644
+ end
1645
+ end
1646
+ ```
1647
+
1648
+ Here, `@config` of `CustomSpider` will be _[deep merged](https://apidock.com/rails/Hash/deep_merge)_ with `ApplicationSpider` config, so `CustomSpider` will keep all inherited options with only `delay` updated.
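+
+ Conceptually (a sketch of the merge result, not actual API output), the effective config of `CustomSpider` is:
+
+ ```ruby
+ {
+   user_agent: "Firefox",
+   disable_images: true,
+   restart_if: { memory_limit: 350_000 },
+   before_request: { delay: 4..6 } # only the nested `delay` value is overridden
+ }
+ ```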
1649
+
1650
+ ## Project mode
1651
+
1652
+ Kimurai can work in project mode ([like Scrapy](https://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project)). To generate a new project, run: `$ kimurai generate project web_spiders` (where `web_spiders` is the name of the project).
1653
+
1654
+ Structure of the project:
1655
+
1656
+ ```bash
1657
+ .
1658
+ ├── config/
1659
+ │   ├── initializers/
1660
+ │   ├── application.rb
1661
+ │   ├── automation.yml
1662
+ │   ├── boot.rb
1663
+ │   └── schedule.rb
1664
+ ├── spiders/
1665
+ │   └── application_spider.rb
1666
+ ├── db/
1667
+ ├── helpers/
1668
+ │   └── application_helper.rb
1669
+ ├── lib/
1670
+ ├── log/
1671
+ ├── pipelines/
1672
+ │   ├── validator.rb
1673
+ │   └── saver.rb
1674
+ ├── tmp/
1675
+ ├── .env
1676
+ ├── Gemfile
1677
+ ├── Gemfile.lock
1678
+ └── README.md
1679
+ ```
1680
+
1681
+ <details/>
1682
+ <summary>Description</summary>
1683
+
1684
+ * `config/` folder for configuration files
1685
+ * `config/initializers` [Rails-like initializers](https://guides.rubyonrails.org/configuring.html#using-initializer-files) to load custom code at the start of the framework
1686
+ * `config/application.rb` configuration settings for Kimurai (`Kimurai.configure do` block)
1687
+ * `config/automation.yml` specifies settings for [setup and deploy](#automated-server-setup-and-deployment)
1688
+ * `config/boot.rb` loads framework and project
1689
+ * `config/schedule.rb` Cron [schedule for spiders](#schedule-spiders-using-cron)
1690
+ * `spiders/` folder for spiders
1691
+ * `spiders/application_spider.rb` Base parent class for all spiders
1692
+ * `db/` store all database files here (`sqlite`, `json`, `csv`, etc.)
1693
+ * `helpers/` Rails-like helpers for spiders
1694
+ * `helpers/application_helper.rb` all methods inside the ApplicationHelper module are available to all spiders
1695
+ * `lib/` put custom Ruby code here
1696
+ * `log/` folder for logs
1697
+ * `pipelines/` folder for [Scrapy-like](https://doc.scrapy.org/en/latest/topics/item-pipeline.html) pipelines. One file = one pipeline
1698
+ * `pipelines/validator.rb` example pipeline to validate item
1699
+ * `pipelines/saver.rb` example pipeline to save item
1700
+ * `tmp/` folder for temp. files
1701
+ * `.env` file to store ENV variables for project and load them using [Dotenv](https://github.com/bkeepers/dotenv)
1702
+ * `Gemfile` dependency file
1703
+ * `README.md` example project readme
1704
+ </details>
1705
+
1706
+
1707
+ ### Generate new spider
1708
+ To generate a new spider in the project, run:
1709
+
1710
+ ```bash
1711
+ $ kimurai generate spider example_spider
1712
+ create spiders/example_spider.rb
1713
+ ```
1714
+
1715
+ The command will generate a new spider class inherited from `ApplicationSpider`:
1716
+
1717
+ ```ruby
1718
+ class ExampleSpider < ApplicationSpider
1719
+ @name = "example_spider"
1720
+ @start_urls = []
1721
+ @config = {}
1722
+
1723
+ def parse(response, url:, data: {})
1724
+ end
1725
+ end
1726
+ ```
1727
+
1728
+ ### Crawl
1729
+ To run a particular spider in the project, run: `$ bundle exec kimurai crawl example_spider`. Don't forget to add `bundle exec` before the command to load the required environment.
1730
+
1731
+ ### List
1732
+ To list all project spiders, run: `$ bundle exec kimurai list`
1733
+
1734
+ ### Parse
1735
+ For project spiders you can use the `$ kimurai parse` command, which helps to debug spiders:
1736
+
1737
+ ```bash
1738
+ $ bundle exec kimurai parse example_spider parse_product --url https://example-shop.com/product-1
1739
+ ```
1740
+
1741
+ where `example_spider` is the spider to run, `parse_product` is the spider method to process, and `--url` is the url to open inside the processing method.
1742
+
1743
+ ### Pipelines, `send_item` method
1744
+ You can use item pipelines to organize and store item-processing logic for all project spiders in one place (see also Scrapy's [description of pipelines](https://doc.scrapy.org/en/latest/topics/item-pipeline.html#item-pipeline)).
1745
+
1746
+ Imagine you have three spiders, each crawling a different e-commerce shop and saving only shoe listings. For each spider, you want to save only items with the "shoe" category, a unique sku, a valid title/price, and existing images. To avoid code duplication between spiders, use pipelines:
1747
+
1748
+ <details/>
1749
+ <summary>Example</summary>
1750
+
1751
+ pipelines/validator.rb
1752
+ ```ruby
1753
+ class Validator < Kimurai::Pipeline
1754
+ def process_item(item, options: {})
1755
+ # Here you can validate item and raise `DropItemError`
1756
+ # if one of the validations failed. Examples:
1757
+
1758
+ # Drop the item if its category is not "shoe":
1759
+ if item[:category] != "shoe"
1760
+ raise DropItemError, "Wrong item category"
1761
+ end
1762
+
1763
+ # Check the item sku for uniqueness using the built-in unique? helper:
1764
+ unless unique?(:sku, item[:sku])
1765
+ raise DropItemError, "Item sku is not unique"
1766
+ end
1767
+
1768
+ # Drop the item if the title is shorter than 5 characters:
1769
+ if item[:title].size < 5
1770
+ raise DropItemError, "Item title is short"
1771
+ end
1772
+
1773
+ # Drop item if price is not present
1774
+ unless item[:price].present?
1775
+ raise DropItemError, "item price is not present"
1776
+ end
1777
+
1778
+ # Drop the item if it doesn't contain any images:
1779
+ unless item[:images].present?
1780
+ raise DropItemError, "Item images are not present"
1781
+ end
1782
+
1783
+ # Pass item to the next pipeline (if it wasn't dropped):
1784
+ item
1785
+ end
1786
+ end
1787
+
1788
+ ```
1789
+
1790
+ pipelines/saver.rb
1791
+ ```ruby
1792
+ class Saver < Kimurai::Pipeline
1793
+ def process_item(item, options: {})
1794
+ # Here you can save item to the database, send it to a remote API or
1795
+ # simply save item to a file format using `save_to` helper:
1796
+
1797
+ # To get the name of current spider: `spider.class.name`
1798
+ save_to "db/#{spider.class.name}.json", item, format: :json
1799
+
1800
+ item
1801
+ end
1802
+ end
1803
+ ```
1804
+
1805
+ spiders/application_spider.rb
1806
+ ```ruby
1807
+ class ApplicationSpider < Kimurai::Base
1808
+ @engine = :selenium_chrome
1809
+ # Define pipelines (by order) for all spiders:
1810
+ @pipelines = [:validator, :saver]
1811
+ end
1812
+ ```
1813
+
1814
+ spiders/shop_spider_1.rb
1815
+ ```ruby
1816
+ class ShopSpiderOne < ApplicationSpider
1817
+ @name = "shop_spider_1"
1818
+ @start_urls = ["https://shop-1.com"]
1819
+
1820
+ # ...
1821
+
1822
+ def parse_product(response, url:, data: {})
1823
+ # ...
1824
+
1825
+ # Send item to pipelines:
1826
+ send_item item
1827
+ end
1828
+ end
1829
+ ```
1830
+
1831
+ spiders/shop_spider_2.rb
1832
+ ```ruby
1833
+ class ShopSpiderTwo < ApplicationSpider
1834
+ @name = "shop_spider_2"
1835
+ @start_urls = ["https://shop-2.com"]
1836
+
1837
+ def parse_product(response, url:, data: {})
1838
+ # ...
1839
+
1840
+ # Send item to pipelines:
1841
+ send_item item
1842
+ end
1843
+ end
1844
+ ```
1845
+
1846
+ spiders/shop_spider_3.rb
1847
+ ```ruby
1848
+ class ShopSpiderThree < ApplicationSpider
1849
+ @name = "shop_spider_3"
1850
+ @start_urls = ["https://shop-3.com"]
1851
+
1852
+ def parse_product(response, url:, data: {})
1853
+ # ...
1854
+
1855
+ # Send item to pipelines:
1856
+ send_item item
1857
+ end
1858
+ end
1859
+ ```
1860
+ </details><br>
1861
+
1862
+ When you start using pipelines, stats for items appear:
1863
+
1864
+ <details>
1865
+ <summary>Example</summary>
1866
+
1867
+ pipelines/validator.rb
1868
+ ```ruby
1869
+ class Validator < Kimurai::Pipeline
1870
+ def process_item(item, options: {})
1871
+ if item[:star_count] < 10
1872
+ raise DropItemError, "Repository doesn't have enough stars"
1873
+ end
1874
+
1875
+ item
1876
+ end
1877
+ end
1878
+ ```
1879
+
1880
+ spiders/github_spider.rb
1881
+ ```ruby
1882
+ class GithubSpider < ApplicationSpider
1883
+ @name = "github_spider"
1884
+ @engine = :selenium_chrome
1885
+ @pipelines = [:validator]
1886
+ @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
1887
+ @config = {
1888
+ user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
1889
+ before_request: { delay: 4..7 }
1890
+ }
1891
+
1892
+ def parse(response, url:, data: {})
1893
+ response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
1894
+ request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
1895
+ end
1896
+
1897
+ if next_page = response.at_xpath("//a[@class='next_page']")
1898
+ request_to :parse, url: absolute_url(next_page[:href], base: url)
1899
+ end
1900
+ end
1901
+
1902
+ def parse_repo_page(response, url:, data: {})
1903
+ item = {}
1904
+
1905
+ item[:owner] = response.xpath("//h1//a[@rel='author']").text
1906
+ item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
1907
+ item[:repo_url] = url
1908
+ item[:description] = response.xpath("//span[@itemprop='about']").text.squish
1909
+ item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
1910
+ item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish.delete(",").to_i
1911
+ item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish.delete(",").to_i
1912
+ item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish.delete(",").to_i
1913
+ item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text
1914
+
1915
+ send_item item
1916
+ end
1917
+ end
1918
+ ```
1919
+
1920
+ ```
1921
+ $ bundle exec kimurai crawl github_spider
1922
+
1923
+ I, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] INFO -- github_spider: Spider: started: github_spider
1924
+ D, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
1925
+ I, [2018-08-22 15:56:40 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1926
+ I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1927
+ I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 1, responses: 1
1928
+ D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 116182
1929
+ D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
1930
+
1931
+ I, [2018-08-22 15:56:49 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
1932
+ I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
1933
+ I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 2, responses: 2
1934
+ D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 217432
1935
+ D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1936
+ I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Pipeline: processed: {"owner":"lorien","repo_name":"awesome-web-scraping","repo_url":"https://github.com/lorien/awesome-web-scraping","description":"List of libraries, tools and APIs for web scraping and data processing.","tags":["awesome","awesome-list","web-scraping","data-processing","python","javascript","php","ruby"],"watch_count":159,"star_count":2423,"fork_count":358,"last_commit":"4 days ago"}
1937
+ I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: items: sent: 1, processed: 1
1938
+ D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 6 seconds before request...
1939
+
1940
+ ...
1941
+
1942
+ I, [2018-08-22 16:11:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
1943
+ I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
1944
+ I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 140, responses: 140
1945
+ D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 211713
1946
+
1947
+ D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1948
+ E, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] ERROR -- github_spider: Pipeline: dropped: #<Kimurai::Pipeline::DropItemError: Repository doesn't have enough stars>, item: {:owner=>"preston", :repo_name=>"idclight", :repo_url=>"https://github.com/preston/idclight", :description=>"A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.", :tags=>[], :watch_count=>6, :star_count=>1, :fork_count=>0, :last_commit=>"on Apr 12, 2012"}
1949
+
1950
+ I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: items: sent: 127, processed: 12
1951
+
1952
+ I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
1953
+ I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 15:56:35 +0400, :stop_time=>2018-08-22 16:11:51 +0400, :running_time=>"15m, 16s", :visits=>{:requests=>140, :responses=>140}, :items=>{:sent=>127, :processed=>12}, :error=>nil}
1954
+ ```
1955
+ </details><br>
1956
+
1957
+ Also, you can pass custom options to a pipeline from a particular spider if you want to change the pipeline's behavior for that spider:
1958
+
1959
+ <details>
1960
+ <summary>Example</summary>
1961
+
1962
+ spiders/custom_spider.rb
1963
+ ```ruby
1964
+ class CustomSpider < ApplicationSpider
1965
+ @name = "custom_spider"
1966
+ @start_urls = ["https://example.com"]
1967
+ @pipelines = [:validator]
1968
+
1969
+ # ...
1970
+
1971
+ def parse_item(response, url:, data: {})
1972
+ # ...
1973
+
1974
+ # Pass custom option `skip_uniq_checking` for Validator pipeline:
1975
+ send_item item, validator: { skip_uniq_checking: true }
1976
+ end
1977
+ end
1978
+
1979
+ ```
1980
+
1981
+ pipelines/validator.rb
1982
+ ```ruby
1983
+ class Validator < Kimurai::Pipeline
1984
+ def process_item(item, options: {})
1985
+
1986
+ # Do not check item sku for uniqueness if options[:skip_uniq_checking] is true
1987
+ if options[:skip_uniq_checking] != true
1988
+ raise DropItemError, "Item sku is not unique" unless unique?(:sku, item[:sku])
1989
+ end
+
+ # Pass the item further (if it wasn't dropped):
+ item
1990
+ end
1991
+ end
1992
+ ```
1993
+ </details>
1994
+
1995
+
1996
+ ### Runner
1997
+
1998
+ You can run project spiders one by one or in parallel using the `$ kimurai runner` command:
1999
+
2000
+ ```
2001
+ $ bundle exec kimurai list
2002
+ custom_spider
2003
+ example_spider
2004
+ github_spider
2005
+
2006
+ $ bundle exec kimurai runner -j 3
2007
+ >>> Runner: started: {:id=>1533727423, :status=>:processing, :start_time=>2018-08-08 15:23:43 +0400, :stop_time=>nil, :environment=>"development", :concurrent_jobs=>3, :spiders=>["custom_spider", "github_spider", "example_spider"]}
2008
+ > Runner: started spider: custom_spider, index: 0
2009
+ > Runner: started spider: github_spider, index: 1
2010
+ > Runner: started spider: example_spider, index: 2
2011
+ < Runner: stopped spider: custom_spider, index: 0
2012
+ < Runner: stopped spider: example_spider, index: 2
2013
+ < Runner: stopped spider: github_spider, index: 1
2014
+ <<< Runner: stopped: {:id=>1533727423, :status=>:completed, :start_time=>2018-08-08 15:23:43 +0400, :stop_time=>2018-08-08 15:25:11 +0400, :environment=>"development", :concurrent_jobs=>3, :spiders=>["custom_spider", "github_spider", "example_spider"]}
2015
+ ```
2016
+
2017
+ Each spider runs in a separate process. Spider logs are available in the `log/` folder. Pass the `-j` option to specify how many spiders should be processed at the same time (the default is 1).
2018
+
2019
+ You can provide additional arguments like `--include` or `--exclude` to specify which spiders to run:
2020
+
2021
+ ```bash
2022
+ # Run only custom_spider and example_spider:
2023
+ $ bundle exec kimurai runner --include custom_spider example_spider
2024
+
2025
+ # Run all except github_spider:
2026
+ $ bundle exec kimurai runner --exclude github_spider
2027
+ ```
2028
+
2029
+ #### Runner callbacks
2030
+
2031
+ You can perform custom actions before the runner starts and after it stops using `config.runner_at_start_callback` and `config.runner_at_stop_callback`. Check [config/application.rb](lib/kimurai/template/config/application.rb) for an example.
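+
+ A hedged sketch of what this can look like (the lambda argument and its usage are assumptions; check the template linked above for the exact callback signature):
+
+ ```ruby
+ Kimurai.configure do |config|
+   # Illustrative only: report when the runner starts and stops.
+   config.runner_at_start_callback = lambda do |session_info|
+     puts "Runner started: #{session_info.inspect}"
+   end
+
+   config.runner_at_stop_callback = lambda do |session_info|
+     puts "Runner stopped: #{session_info.inspect}"
+   end
+ end
+ ```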
2032
+
2033
+
2034
+ ## Chat Support and Feedback
2035
+ Will be updated
2036
+
2037
+ ## License
2038
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).