kimurai 1.0.0

Files changed (62)
  1. checksums.yaml +7 -0
  2. data/.gitignore +11 -0
  3. data/.travis.yml +5 -0
  4. data/CODE_OF_CONDUCT.md +74 -0
  5. data/Gemfile +6 -0
  6. data/LICENSE.txt +21 -0
  7. data/README.md +1923 -0
  8. data/Rakefile +10 -0
  9. data/bin/console +14 -0
  10. data/bin/setup +8 -0
  11. data/exe/kimurai +6 -0
  12. data/kimurai.gemspec +48 -0
  13. data/lib/kimurai.rb +53 -0
  14. data/lib/kimurai/automation/deploy.yml +54 -0
  15. data/lib/kimurai/automation/setup.yml +44 -0
  16. data/lib/kimurai/automation/setup/chromium_chromedriver.yml +26 -0
  17. data/lib/kimurai/automation/setup/firefox_geckodriver.yml +20 -0
  18. data/lib/kimurai/automation/setup/phantomjs.yml +33 -0
  19. data/lib/kimurai/automation/setup/ruby_environment.yml +124 -0
  20. data/lib/kimurai/base.rb +249 -0
  21. data/lib/kimurai/base/simple_saver.rb +98 -0
  22. data/lib/kimurai/base/uniq_checker.rb +22 -0
  23. data/lib/kimurai/base_helper.rb +22 -0
  24. data/lib/kimurai/browser_builder.rb +32 -0
  25. data/lib/kimurai/browser_builder/mechanize_builder.rb +140 -0
  26. data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb +156 -0
  27. data/lib/kimurai/browser_builder/selenium_chrome_builder.rb +178 -0
  28. data/lib/kimurai/browser_builder/selenium_firefox_builder.rb +185 -0
  29. data/lib/kimurai/capybara_configuration.rb +10 -0
  30. data/lib/kimurai/capybara_ext/driver/base.rb +62 -0
  31. data/lib/kimurai/capybara_ext/mechanize/driver.rb +55 -0
  32. data/lib/kimurai/capybara_ext/poltergeist/driver.rb +13 -0
  33. data/lib/kimurai/capybara_ext/selenium/driver.rb +24 -0
  34. data/lib/kimurai/capybara_ext/session.rb +150 -0
  35. data/lib/kimurai/capybara_ext/session/config.rb +18 -0
  36. data/lib/kimurai/cli.rb +157 -0
  37. data/lib/kimurai/cli/ansible_command_builder.rb +71 -0
  38. data/lib/kimurai/cli/generator.rb +57 -0
  39. data/lib/kimurai/core_ext/array.rb +14 -0
  40. data/lib/kimurai/core_ext/numeric.rb +19 -0
  41. data/lib/kimurai/core_ext/string.rb +7 -0
  42. data/lib/kimurai/pipeline.rb +25 -0
  43. data/lib/kimurai/runner.rb +72 -0
  44. data/lib/kimurai/template/.gitignore +18 -0
  45. data/lib/kimurai/template/.ruby-version +1 -0
  46. data/lib/kimurai/template/Gemfile +20 -0
  47. data/lib/kimurai/template/README.md +3 -0
  48. data/lib/kimurai/template/config/application.rb +32 -0
  49. data/lib/kimurai/template/config/automation.yml +13 -0
  50. data/lib/kimurai/template/config/boot.rb +22 -0
  51. data/lib/kimurai/template/config/initializers/.keep +0 -0
  52. data/lib/kimurai/template/config/schedule.rb +57 -0
  53. data/lib/kimurai/template/db/.keep +0 -0
  54. data/lib/kimurai/template/helpers/application_helper.rb +3 -0
  55. data/lib/kimurai/template/lib/.keep +0 -0
  56. data/lib/kimurai/template/log/.keep +0 -0
  57. data/lib/kimurai/template/pipelines/saver.rb +11 -0
  58. data/lib/kimurai/template/pipelines/validator.rb +24 -0
  59. data/lib/kimurai/template/spiders/application_spider.rb +104 -0
  60. data/lib/kimurai/template/tmp/.keep +0 -0
  61. data/lib/kimurai/version.rb +3 -0
  62. metadata +349 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: d108c41e5da08b22c21cc6c71cc3ac7056ddd1af32054c22a22f0c59658bfcb4
4
+ data.tar.gz: 8a8d32b7b8646eb50bd9f71d8986edc2ac78efc0e2e6a437b3280cff4418c5dd
5
+ SHA512:
6
+ metadata.gz: 4c82647cbe276980ef0a246693c7e68c08651351a549f99fbc6618bc9836c4a4ba83b4d09e1e29d06abcfa0d4f70443fb88682f57c544c0218b22940834a48b1
7
+ data.tar.gz: 845f04c77fbb5e53b24d048e60f23e2c0f9fdeb4d2fde7dcaaa04bebfebc4454777ade03cae895e444583aafb6c8e56038d0d722589fde10076091903646fdf7
data/.gitignore ADDED
@@ -0,0 +1,11 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /_yardoc/
4
+ /coverage/
5
+ /doc/
6
+ /pkg/
7
+ /spec/reports/
8
+ /tmp/
9
+ Gemfile.lock
10
+
11
+ *.retry
data/.travis.yml ADDED
@@ -0,0 +1,5 @@
1
+ sudo: false
2
+ language: ruby
3
+ rvm:
4
+ - 2.5.1
5
+ before_install: gem install bundler -v 1.16.2
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,74 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, gender identity and expression, level of experience,
9
+ nationality, personal appearance, race, religion, or sexual identity and
10
+ orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at vicfreefly@gmail.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at [http://contributor-covenant.org/version/1/4][version]
72
+
73
+ [homepage]: http://contributor-covenant.org
74
+ [version]: http://contributor-covenant.org/version/1/4/
data/Gemfile ADDED
@@ -0,0 +1,6 @@
1
+ source "https://rubygems.org"
2
+
3
+ git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }
4
+
5
+ # Specify your gem's dependencies in kimurai.gemspec
6
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2018 Victor Afanasev
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,1923 @@
1
+ <div align="center">
2
+ <a href="https://github.com/vfreefly/kimurai">
3
+ <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
4
+ </a>
5
+
6
+ <h1>Kimura Framework</h1>
7
+ </div>
8
+
9
+ Kimurai is a modern web scraping framework written in Ruby which **works out of box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
10
+
11
+ Kimurai based on well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Lets see:
12
+
13
+ ```ruby
14
+ # github_spider.rb
15
+ require 'kimurai'
16
+
17
+ class GithubSpider < Kimurai::Base
18
+ @name = "github_spider"
19
+ @engine = :selenium_chrome
20
+ @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
21
+ @config = {
22
+ user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
23
+ browser: {
24
+ before_request: { delay: 4..7 }
25
+ }
26
+ }
27
+
28
+ def parse(response, url:, data: {})
29
+ response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
30
+ request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
31
+ end
32
+
33
+ if next_page = response.at_xpath("//a[@class='next_page']")
34
+ request_to :parse, url: absolute_url(next_page[:href], base: url)
35
+ end
36
+ end
37
+
38
+ def parse_repo_page(response, url:, data: {})
39
+ item = {}
40
+
41
+ item[:owner] = response.xpath("//h1//a[@rel='author']").text
42
+ item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
43
+ item[:repo_url] = url
44
+ item[:description] = response.xpath("//span[@itemprop='about']").text.squish
45
+ item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
46
+ item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish
47
+ item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish
48
+ item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish
49
+ item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text
50
+
51
+ save_to "results.json", item, format: :pretty_json
52
+ end
53
+ end
54
+
55
+ GithubSpider.crawl!
56
+ ```
57
+
58
+ <details/>
59
+ <summary>Run: <code>$ ruby github_spider.rb</code></summary>
60
+
61
+ ```
62
+ I, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] INFO -- github_spider: Spider: started: github_spider
63
+ D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
64
+ D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled `browser before_request delay`
65
+ D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 7 seconds before request...
66
+ D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled custom user-agent
67
+ D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
68
+ I, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
69
+ I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
70
+ I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 1, responses: 1
71
+ D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 107968
72
+ D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
73
+ I, [2018-08-22 13:08:32 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
74
+ I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
75
+ I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 2, responses: 2
76
+ D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 212542
77
+ D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 4 seconds before request...
78
+ I, [2018-08-22 13:08:37 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector
79
+
80
+ ...
81
+
82
+ I, [2018-08-22 13:23:07 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
83
+ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
84
+ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 140, responses: 140
85
+ D, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 204198
86
+ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
87
+
88
+ I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:08:03 +0400, :stop_time=>2018-08-22 13:23:08 +0400, :running_time=>"15m, 5s", :visits=>{:requests=>140, :responses=>140}, :error=>nil}
89
+ ```
90
+ </details>
91
+
92
+ <details/>
93
+ <summary>results.json</summary>
94
+
95
+ ```json
96
+ [
97
+ {
98
+ "owner": "lorien",
99
+ "repo_name": "awesome-web-scraping",
100
+ "repo_url": "https://github.com/lorien/awesome-web-scraping",
101
+ "description": "List of libraries, tools and APIs for web scraping and data processing.",
102
+ "tags": [
103
+ "awesome",
104
+ "awesome-list",
105
+ "web-scraping",
106
+ "data-processing",
107
+ "python",
108
+ "javascript",
109
+ "php",
110
+ "ruby"
111
+ ],
112
+ "watch_count": "159",
113
+ "star_count": "2,423",
114
+ "fork_count": "358",
115
+ "last_commit": "4 days ago",
116
+ "position": 1
117
+ },
118
+
119
+ ...
120
+
121
+ {
122
+ "owner": "preston",
123
+ "repo_name": "idclight",
124
+ "repo_url": "https://github.com/preston/idclight",
125
+ "description": "A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.",
126
+ "tags": [
127
+
128
+ ],
129
+ "watch_count": "6",
130
+ "star_count": "1",
131
+ "fork_count": "0",
132
+ "last_commit": "on Apr 12, 2012",
133
+ "position": 127
134
+ }
135
+ ]
136
+ ```
137
+ </details><br>
138
+
139
+ Okay, that was easy. How about javascript rendered websites with dynamic HTML? Lets scrape a page with infinite scroll:
140
+
141
+ ```ruby
142
+ # infinite_scroll_spider.rb
143
+ require 'kimurai'
144
+
145
+ class InfiniteScrollSpider < Kimurai::Base
146
+ @name = "infinite_scroll_spider"
147
+ @engine = :selenium_chrome
148
+ @start_urls = ["https://infinite-scroll.com/demo/full-page/"]
149
+
150
+ def parse(response, url:, data: {})
151
+ posts_headers_path = "//article/h2"
152
+ count = response.xpath(posts_headers_path).count
153
+
154
+ loop do
155
+ browser.execute_script("window.scrollBy(0,10000)") ; sleep 2
156
+ response = browser.current_response
157
+
158
+ new_count = response.xpath(posts_headers_path).count
159
+ if count == new_count
160
+ logger.info "> Pagination is done" and break
161
+ else
162
+ count = new_count
163
+ logger.info "> Continue scrolling, current count is #{count}..."
164
+ end
165
+ end
166
+
167
+ posts_headers = response.xpath(posts_headers_path).map(&:text)
168
+ logger.info "> All posts from page: #{posts_headers.join('; ')}"
169
+ end
170
+ end
171
+
172
+ InfiniteScrollSpider.crawl!
173
+ ```
174
+
175
+ <details/>
176
+ <summary>Run: <code>$ ruby infinite_scroll_spider.rb</code></summary>
177
+
178
+ ```
179
+ I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
180
+ D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): created browser instance
181
+ D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
182
+ I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
183
+ I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
184
+ I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
185
+ D, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: Browser: driver.current_memory: 95463
186
+ I, [2018-08-22 13:33:05 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 5...
187
+ I, [2018-08-22 13:33:18 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 9...
188
+ I, [2018-08-22 13:33:20 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 11...
189
+ I, [2018-08-22 13:33:26 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 13...
190
+ I, [2018-08-22 13:33:28 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 15...
191
+ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Pagination is done
192
+ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
193
+ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
194
+ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Spider: stopped: {:spider_name=>"infinite_scroll_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:32:57 +0400, :stop_time=>2018-08-22 13:33:30 +0400, :running_time=>"33s", :visits=>{:requests=>1, :responses=>1}, :error=>nil}
195
+
196
+ ```
197
+ </details><br>
198
+
199
+
200
+ ## Features
201
+ * Scrape javascript rendered websites out of box
202
+ * Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [PhantomJS](https://github.com/ariya/phantomjs) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
203
+ * Write spider code once, and use it with any supported engine later
204
+ * All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
205
+ * Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
206
+ * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates-unique-helper) to skip duplicates
207
+ * Automatically [retry failed requests](#configuration-options) with delay
208
+ * Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
209
+ * Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
210
+ * [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel`
211
+ * **Two modes:** use single file for a simple spider, or [generate](#project-mode) Scrapy-like **project**
212
+ * Convenient development mode with [console](#interactive-console), colorized logger and debugger ([Pry](https://github.com/pry/pry), [Byebug](https://github.com/deivid-rodriguez/byebug))
213
+ * Automated [server environment setup](#setup) (for ubuntu 18.04) and [deploy](#deploy) using commands `kimurai setup` and `kimurai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)
214
+ * Command-line [runner](#runner) to run all project spiders one by one or in parallel
215
+
216
+ ## Table of Contents
217
+ * [Kimurai](#kimurai)
218
+ * [Features](#features)
219
+ * [Table of Contents](#table-of-contents)
220
+ * [Note about v1.0.0 version](#note-about-v1-0-0-version)
221
+ * [Installation](#installation)
222
+ * [Getting to Know](#getting-to-know)
223
+ * [Interactive console](#interactive-console)
224
+ * [Available engines](#available-engines)
225
+ * [Minimum required spider structure](#minimum-required-spider-structure)
226
+ * [Method arguments response, url and data](#method-arguments-response-url-and-data)
227
+ * [browser object](#browser-object)
228
+ * [request_to method](#request_to-method)
229
+ * [save_to helper](#save_to-helper)
230
+ * [Skip duplicates, unique? helper](#skip-duplicates-unique-helper)
231
+ * [open_spider and close_spider callbacks](#open_spider-and-close_spider-callbacks)
232
+ * [KIMURAI_ENV](#kimurai_env)
233
+ * [Parallel crawling using in_parallel](#parallel-crawling-using-in_parallel)
234
+ * [Active Support included](#active-support-included)
235
+ * [Schedule spiders using Cron](#schedule-spiders-using-cron)
236
+ * [Configuration options](#configuration-options)
237
+ * [Using Kimurai inside existing Ruby application](#using-kimurai-inside-existing-ruby-application)
238
+ * [crawl! method](#crawl-method)
239
+ * [parse! method](#parsemethod_name-url-method)
240
+ * [Kimurai.list and Kimurai.find_by_name](#kimurailist-and-kimuraifind_by_name)
241
+ * [Automated sever setup and deployment](#automated-sever-setup-and-deployment)
242
+ * [Setup](#setup)
243
+ * [Deploy](#deploy)
244
+ * [Spider @config](#spider-config)
245
+ * [All available @config options](#all-available-config-options)
246
+ * [@config settings inheritance](#config-settings-inheritance)
247
+ * [Project mode](#project-mode)
248
+ * [Generate new spider](#generate-new-spider)
249
+ * [Crawl](#crawl)
250
+ * [List](#list)
251
+ * [Parse](#parse)
252
+ * [Pipelines, send_item method](#pipelines-send_item-method)
253
+ * [Runner](#runner)
254
+ * [Runner callbacks](#runner-callbacks)
255
+ * [Chat Support and Feedback](#chat-support-and-feedback)
256
+ * [License](#license)
257
+
258
+ ## Note about v1.0.0 version
259
+ * The code was massively refactored for a [support](#using-kimurai-inside-existing-ruby-application) to run spiders multiple times from inside a single process. Now it's possible to run Kimurai spiders using background jobs like Sidekiq.
260
+ * `require 'kimurai'` doesn't require any gems except Active Support. Only when a particular spider [starts](#crawl-method), Capybara will be required with a specific driver.
261
+ * Although Kimurai [extends](lib/kimurai/capybara_ext) Capybara (all the magic happens inside [extended](lib/kimurai/capybara_ext/session.rb) `Capybara::Session#visit` method), session instances which were created manually will behave normally.
262
+ * Small changes in design (check the readme again to see what was changed)
263
+ * Again, massive refactor. Code now looks much better than it was before.
264
+
265
+ ## Installation
266
+ Kimurai requires Ruby version `>= 2.5.0`. Supported platforms: `Linux` and `Mac OS X`.
267
+
268
+ 1) If your system doesn't have appropriate Ruby version, install it:
269
+
270
+ <details/>
271
+ <summary>Ubuntu 18.04</summary>
272
+
273
+ ```bash
274
+ # Install required packages for ruby-build
275
+ sudo apt update
276
+ sudo apt install git-core curl zlib1g-dev build-essential libssl-dev libreadline-dev libreadline6-dev libyaml-dev libxml2-dev libxslt1-dev libcurl4-openssl-dev libffi-dev
277
+
278
+ # Install rbenv and ruby-build
279
+ cd && git clone https://github.com/rbenv/rbenv.git ~/.rbenv
280
+ echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
281
+ echo 'eval "$(rbenv init -)"' >> ~/.bashrc
282
+ exec $SHELL
283
+
284
+ git clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build
285
+ echo 'export PATH="$HOME/.rbenv/plugins/ruby-build/bin:$PATH"' >> ~/.bashrc
286
+ exec $SHELL
287
+
288
+ # Install latest Ruby
289
+ rbenv install 2.5.1
290
+ rbenv global 2.5.1
291
+
292
+ gem install bundler
293
+ ```
294
+ </details>
295
+
296
+ <details/>
297
+ <summary>Mac OS X</summary>
298
+
299
+ ```bash
300
+ # Install homebrew if you don't have it https://brew.sh/
301
+ # Install rbenv and ruby-build:
302
+ brew install rbenv ruby-build
303
+
304
+ # Add rbenv to bash so that it loads every time you open a terminal
305
+ echo 'if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi' >> ~/.bash_profile
306
+ source ~/.bash_profile
307
+
308
+ # Install latest Ruby
309
+ rbenv install 2.5.1
310
+ rbenv global 2.5.1
311
+
312
+ gem install bundler
313
+ ```
314
+ </details>
315
+
316
+ 2) Install Kimurai gem: `$ gem install kimurai`
317
+
318
+ 3) Install browsers with webdrivers:
319
+
320
+ <details/>
321
+ <summary>Ubuntu 18.04</summary>
322
+
323
+ Note: for Ubuntu 16.04-18.04, automatic installation is available using the `setup` command:
324
+ ```bash
325
+ $ kimurai setup localhost --local --ask-sudo
326
+ ```
327
+ It works using [Ansible](https://github.com/ansible/ansible), so you need to install it first: `$ sudo apt install ansible`. You can check the playbooks it uses [here](lib/kimurai/automation).
328
+
329
+ If you chose automatic installation, you can skip the following and go to the "Getting to Know" part. In case you want to install everything manually:
330
+
331
+ ```bash
332
+ # Install basic tools
333
+ sudo apt install -q -y unzip wget tar openssl
334
+
335
+ # Install xvfb (for virtual_display headless mode, in addition to native)
336
+ sudo apt install -q -y xvfb
337
+
338
+ # Install chromium-browser and firefox
339
+ sudo apt install -q -y chromium-browser firefox
340
+
341
+ # Install chromedriver (version 2.39)
342
+ # All versions located here https://sites.google.com/a/chromium.org/chromedriver/downloads
343
+ cd /tmp && wget https://chromedriver.storage.googleapis.com/2.39/chromedriver_linux64.zip
344
+ sudo unzip chromedriver_linux64.zip -d /usr/local/bin
345
+ rm -f chromedriver_linux64.zip
346
+
347
+ # Install geckodriver (0.21.0 version)
348
+ # All versions located here https://github.com/mozilla/geckodriver/releases/
349
+ cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.21.0/geckodriver-v0.21.0-linux64.tar.gz
350
+ sudo tar -xvzf geckodriver-v0.21.0-linux64.tar.gz -C /usr/local/bin
351
+ rm -f geckodriver-v0.21.0-linux64.tar.gz
352
+
353
+ # Install PhantomJS (2.1.1)
354
+ # All versions located here http://phantomjs.org/download.html
355
+ sudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
356
+ cd /tmp && wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
357
+ tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
358
+ sudo mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib
359
+ sudo ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin
360
+ rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2
361
+ ```
362
+
363
+ </details>
364
+
365
+ <details/>
366
+ <summary>Mac OS X</summary>
367
+
368
+ ```bash
369
+ # Install chrome and firefox
370
+ brew cask install google-chrome firefox
371
+
372
+ # Install chromedriver (latest)
373
+ brew cask install chromedriver
374
+
375
+ # Install geckodriver (latest)
376
+ brew install geckodriver
377
+
378
+ # Install PhantomJS (latest)
379
+ brew install phantomjs
380
+ ```
381
+ </details><br>
382
+
383
+ Also, if you want to save scraped items to a database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install the database clients/servers (a short saving sketch follows the install commands below):
384
+
385
+ <details/>
386
+ <summary>Ubuntu 18.04</summary>
387
+
388
+ SQLite: `$ sudo apt -q -y install libsqlite3-dev sqlite3`.
389
+
390
+ If you want to connect to a remote database, you don't need a database server on the local machine (only a client):
391
+ ```bash
392
+ # Install MySQL client
393
+ sudo apt -q -y install mysql-client libmysqlclient-dev
394
+
395
+ # Install Postgres client
396
+ sudo apt install -q -y postgresql-client libpq-dev
397
+
398
+ # Install MongoDB client
399
+ sudo apt install -q -y mongodb-clients
400
+ ```
401
+
402
+ But if you want to save items to a local database, a database server is required as well:
403
+ ```bash
404
+ # Install MySQL client and server
405
+ sudo apt -q -y install mysql-server mysql-client libmysqlclient-dev
406
+
407
+ # Install Postgres client and server
408
+ sudo apt install -q -y postgresql postgresql-contrib libpq-dev
409
+
410
+ # Install MongoDB client and server
411
+ # version 4.0 (check here https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/)
412
+ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
413
+ # for 16.04:
414
+ # echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
415
+ # for 18.04:
416
+ echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
417
+ sudo apt update
418
+ sudo apt install -q -y mongodb-org
419
+ sudo service mongod start
420
+ ```
421
+ </details>
422
+
423
+ <details/>
424
+ <summary>Mac OS X</summary>
425
+
426
+ SQLite: `$ brew install sqlite3`
427
+
428
+ ```bash
429
+ # Install MySQL client and server
430
+ brew install mysql
431
+ # Start server if you need it: brew services start mysql
432
+
433
+ # Install Postgres client and server
434
+ brew install postgresql
435
+ # Start server if you need it: brew services start postgresql
436
+
437
+ # Install MongoDB client and server
438
+ brew install mongodb
439
+ # Start server if you need it: brew services start mongodb
440
+ ```
441
+ </details>
442
+
443
+
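+ For illustration only, saving scraped items to a database from a spider method could look like the following minimal sketch with [Sequel](https://github.com/jeremyevans/sequel) and SQLite (the database file, table and column names are hypothetical; this is not part of Kimurai itself):
+
+ ```ruby
+ require 'sequel'
+
+ # Connect once (e.g. in an initializer) and create the table if it doesn't exist yet:
+ DB = Sequel.sqlite("results.sqlite3")
+ DB.create_table?(:products) do
+   primary_key :id
+   String :title
+   Float :price
+ end
+
+ # Then, inside a spider's parse method, insert the scraped item:
+ # DB[:products].insert(title: item[:title], price: item[:price])
+ ```
+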
444
+ ## Getting to Know
445
+ ### Interactive console
446
+ Before you get to know all Kimurai features, there is the `$ kimurai console` command, an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).
447
+
448
+ ```bash
449
+ $ kimurai console --engine selenium_chrome --url https://github.com/vfreefly/kimurai
450
+ ```
451
+
452
+ <details/>
453
+ <summary>Show output</summary>
454
+
455
+ ```
456
+ $ kimurai console --engine selenium_chrome --url https://github.com/vfreefly/kimurai
457
+
458
+ D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
459
+ D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): enabled native headless_mode
460
+ I, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] INFO -- : Browser: started get request to: https://github.com/vfreefly/kimurai
461
+ I, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] INFO -- : Browser: finished get request to: https://github.com/vfreefly/kimurai
462
+ D, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] DEBUG -- : Browser: driver.current_memory: 201701
463
+
464
+ From: /home/victor/code/kimurai/lib/kimurai/base.rb @ line 189 Kimurai::Base#console:
465
+
466
+ 188: def console(response = nil, url: nil, data: {})
467
+ => 189: binding.pry
468
+ 190: end
469
+
470
+ [1] pry(#<Kimurai::Base>)> response.xpath("//title").text
471
+ => "GitHub - vfreefly/kimurai: Kimurai is a modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites"
472
+
473
+ [2] pry(#<Kimurai::Base>)> ls
474
+ Kimurai::Base#methods: browser console logger request_to save_to unique?
475
+ instance variables: @browser @config @engine @logger @pipelines
476
+ locals: _ __ _dir_ _ex_ _file_ _in_ _out_ _pry_ data response url
477
+
478
+ [3] pry(#<Kimurai::Base>)> ls response
479
+ Nokogiri::XML::PP::Node#methods: inspect pretty_print
480
+ Nokogiri::XML::Searchable#methods: % / at at_css at_xpath css search xpath
481
+ Enumerable#methods:
482
+ all? collect drop each_with_index find_all grep_v lazy member? none? reject slice_when take_while without
483
+ any? collect_concat drop_while each_with_object find_index group_by many? min one? reverse_each sort to_a zip
484
+ as_json count each_cons entries first include? map min_by partition select sort_by to_h
485
+ chunk cycle each_entry exclude? flat_map index_by max minmax pluck slice_after sum to_set
486
+ chunk_while detect each_slice find grep inject max_by minmax_by reduce slice_before take uniq
487
+ Nokogiri::XML::Node#methods:
488
+ <=> append_class classes document? has_attribute? matches? node_name= processing_instruction? to_str
489
+ == attr comment? each html? name= node_type read_only? to_xhtml
490
+ > attribute content elem? inner_html namespace= parent= remove traverse
491
+ [] attribute_nodes content= element? inner_html= namespace_scopes parse remove_attribute unlink
492
+ []= attribute_with_ns create_external_subset element_children inner_text namespaced_key? path remove_class values
493
+ accept before create_internal_subset elements internal_subset native_content= pointer_id replace write_html_to
494
+ add_class blank? css_path encode_special_chars key? next prepend_child set_attribute write_to
495
+ add_next_sibling cdata? decorate! external_subset keys next= previous text write_xhtml_to
496
+ add_previous_sibling child delete first_element_child lang next_element previous= text? write_xml_to
497
+ after children description fragment? lang= next_sibling previous_element to_html xml?
498
+ ancestors children= do_xinclude get_attribute last_element_child node_name previous_sibling to_s
499
+ Nokogiri::XML::Document#methods:
500
+ << canonicalize collect_namespaces create_comment create_entity decorate document encoding errors name remove_namespaces! root= to_java url version
501
+ add_child clone create_cdata create_element create_text_node decorators dup encoding= errors= namespaces root slop! to_xml validate
502
+ Nokogiri::HTML::Document#methods: fragment meta_encoding meta_encoding= serialize title title= type
503
+ instance variables: @decorators @errors @node_cache
504
+
505
+ [4] pry(#<Kimurai::Base>)> exit
506
+ I, [2018-08-22 13:43:47 +0400#26079] [M: 47461994677760] INFO -- : Browser: driver selenium_chrome has been destroyed
507
+ $
508
+ ```
509
+ </details><br>
510
+
511
+ CLI options:
512
+ * `--engine` (optional) [engine](#available-drivers) to use. Default is `mechanize`
513
+ * `--url` (optional) url to process. If the url is omitted, the `response` and `url` objects inside the console will be `nil` (use the [browser](#browser-object) object to navigate to any webpage; see the sketch below).
514
+
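+ For example, a minimal sketch of a console session started without `--url` (using `example.com` as a stand-in site): navigate with `browser`, then grab the parsed page yourself:
+
+ ```ruby
+ # Inside `$ kimurai console` (no --url given), `response` and `url` are nil,
+ # so navigate manually first:
+ browser.visit("https://example.com/")
+
+ # Get the parsed Nokogiri document for the current page and query it:
+ response = browser.current_response
+ response.xpath("//title").text
+ ```
+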
515
+ ### Available engines
516
+ Kimurai supports the following engines, and in most cases you can switch between them without needing to rewrite any code (see the sketch after the list):
517
+
518
+ * `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render javascript and doesn't know what the DOM is; it can only parse the original HTML code of a page. Because of that, mechanize is much faster, takes much less memory, and is in general much more stable than any real browser. Use mechanize when you can, i.e. when the website doesn't use javascript to render any meaningful parts of its structure. Still, because mechanize tries to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling forms, clicking buttons, checkboxes, etc).
519
+ * `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render javascript. In general, PhantomJS is still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leaks, but Kimurai has a [memory control feature](#crawler-config), so you shouldn't consider it a problem. Also, some websites can recognize PhantomJS and block access. Like mechanize (and unlike the selenium engines), `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see the [config section](#all-available-config-options)).
520
+ * `:selenium_chrome` Chrome in headless mode driven by selenium. Modern headless browser solution with proper javascript rendering.
521
+ * `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than other drivers, but sometimes can be useful.
522
+
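+ As a minimal sketch, switching engines is just a matter of changing `@engine` — the rest of the spider stays the same (the class name and site here are hypothetical):
+
+ ```ruby
+ require 'kimurai'
+
+ class ExampleSpider < Kimurai::Base
+   @name = "example_spider"
+   # Swap this single line to run the same spider with another engine,
+   # e.g. :mechanize, :poltergeist_phantomjs, :selenium_chrome or :selenium_firefox
+   @engine = :mechanize
+   @start_urls = ["https://example.com/"]
+
+   def parse(response, url:, data: {})
+     logger.info "> Page title: #{response.xpath('//title').text}"
+   end
+ end
+
+ ExampleSpider.crawl!
+ ```
+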
523
+ **Tip:** add the `HEADLESS=false` ENV variable before the command (`$ HEADLESS=false ruby spider.rb`) to run the browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the [console](#interactive-console) command as well.
524
+
525
+
526
+ ### Minimum required spider structure
527
+ > You can manually create a spider file, or use the generator instead: `$ kimurai generate spider simple_spider`
528
+
529
+ ```ruby
530
+ require 'kimurai'
531
+
532
+ class SimpleSpider < Kimurai::Base
533
+ @name = "simple_spider"
534
+ @engine = :selenium_chrome
535
+ @start_urls = ["https://example.com/"]
536
+
537
+ def parse(response, url:, data: {})
538
+ end
539
+ end
540
+
541
+ SimpleSpider.crawl!
542
+ ```
543
+
544
+ Where:
545
+ * `@name` name of the spider. You can omit the name if you use a single-file spider
546
+ * `@engine` engine for a spider
547
+ * `@start_urls` array of start urls to process one by one inside the `parse` method
548
+ * Method `parse` is the start method and should always be present in a spider class
549
+
550
+
551
+ ### Method arguments `response`, `url` and `data`
552
+
553
+ ```ruby
554
+ def parse(response, url:, data: {})
555
+ end
556
+ ```
557
+
558
+ * `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object) Contains parsed HTML code of a processed webpage
559
+ * `url` (String) url of a processed webpage
560
+ * `data` (Hash) used to pass data between requests
561
+
562
+ <details/>
563
+ <summary><strong>Example how to use <code>data</code></strong></summary>
564
+
565
+ Imagine that there is a product page which doesn't contain the product category. The category name is present only on the category page with pagination. This is a case where we can use `data` to pass the category name from the `parse` method to the `parse_product` method:
566
+
567
+ ```ruby
568
+ class ProductsSpider < Kimurai::Base
569
+ @engine = :selenium_chrome
570
+ @start_urls = ["https://example-shop.com/example-product-category"]
571
+
572
+ def parse(response, url:, data: {})
573
+ category_name = response.xpath("//path/to/category/name").text
574
+ response.xpath("//path/to/products/urls").each do |product_url|
575
+ # Merge category_name with current data hash and pass it next to parse_product method
576
+ request_to(:parse_product, url: product_url[:href], data: data.merge(category_name: category_name))
577
+ end
578
+
579
+ # ...
580
+ end
581
+
582
+ def parse_product(response, url:, data: {})
583
+ item = {}
584
+ # Assign item's category_name from data[:category_name]
585
+ item[:category_name] = data[:category_name]
586
+
587
+ # ...
588
+ end
589
+ end
590
+
591
+ ```
592
+ </details><br>
593
+
594
+ **You can query `response` using [XPath or CSS selectors](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Searchable)**. Check Nokogiri tutorials to understand how to work with `response`:
595
+ * [Parsing HTML with Nokogiri](http://ruby.bastardsbook.com/chapters/html-parsing/) - ruby.bastardsbook.com
596
+ * [HOWTO parse HTML with Ruby & Nokogiri](https://readysteadycode.com/howto-parse-html-with-ruby-and-nokogiri) - readysteadycode.com
597
+ * [Class: Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) (documentation) - rubydoc.info
598
+
599
+
600
+ ### `browser` object
601
+
602
+ From any spider instance method the `browser` object is available. It is a [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and is used to process requests and get the page response (`current_response` method). Usually you don't need to touch it directly, because there is `response` (see above) which contains the page response after it was loaded.
603
+
604
+ But if you need to interact with a page (like filling in form fields, clicking elements, checkboxes, etc.), `browser` is ready for you:
605
+
606
+ ```ruby
607
+ class GoogleSpider < Kimurai::Base
608
+ @name = "google_spider"
609
+ @engine = :selenium_chrome
610
+ @start_urls = ["https://www.google.com/"]
611
+
612
+ def parse(response, url:, data: {})
613
+ browser.fill_in "q", with: "Kimurai web scraping framework"
614
+ browser.click_button "Google Search"
615
+
616
+ # Update response to current response after interaction with a browser
617
+ response = browser.current_response
618
+
619
+ # Collect results
620
+ results = response.xpath("//div[@class='g']//h3/a").map do |a|
621
+ { title: a.text, url: a[:href] }
622
+ end
623
+
624
+ # ...
625
+ end
626
+ end
627
+ ```
628
+
629
+ Check out **Capybara cheat sheets** where you can see all available methods **to interact with browser**:
630
+ * [UI Testing with RSpec and Capybara [cheat sheet]](http://cheatrags.com/capybara) - cheatrags.com
631
+ * [Capybara Cheatsheet PDF](https://thoughtbot.com/upcase/test-driven-rails-resources/capybara.pdf) - thoughtbot.com
632
+ * [Class: Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) (documentation) - rubydoc.info
633
+
634
+ ### `request_to` method
635
+
636
+ For making requests to a particular method there is `request_to`. It requires a minimum of two arguments: `:method_name` and `url:`. An optional argument is `data:` (see above for what it is for). Example:
637
+
638
+ ```ruby
639
+ class Spider < Kimurai::Base
640
+ @engine = :selenium_chrome
641
+ @start_urls = ["https://example.com/"]
642
+
643
+ def parse(response, url:, data: {})
644
+ # Process request to `parse_product` method with `https://example.com/some_product` url:
645
+ request_to :parse_product, url: "https://example.com/some_product"
646
+ end
647
+
648
+ def parse_product(response, url:, data: {})
649
+ puts "From page https://example.com/some_product !"
650
+ end
651
+ end
652
+ ```
653
+
654
+ Under the hood `request_to` simply calls [#visit](https://www.rubydoc.info/github/jnicklas/capybara/Capybara%2FSession:visit) (`browser.visit(url)`) and then the required method with arguments:
655
+
656
+ <details/>
657
+ <summary>request_to</summary>
658
+
659
+ ```ruby
660
+ def request_to(handler, url:, data: {})
661
+ request_data = { url: url, data: data }
662
+
663
+ browser.visit(url)
664
+ public_send(handler, browser.current_response, request_data)
665
+ end
666
+ ```
667
+ </details><br>
668
+
669
+ `request_to` just makes things simpler, and without it we could do something like:
670
+
671
+ <details/>
672
+ <summary>Check the code</summary>
673
+
674
+ ```ruby
675
+ class Spider < Kimurai::Base
676
+ @engine = :selenium_chrome
677
+ @start_urls = ["https://example.com/"]
678
+
679
+ def parse(response, url:, data: {})
680
+ url_to_process = "https://example.com/some_product"
681
+
682
+ browser.visit(url_to_process)
683
+ parse_product(browser.current_response, url: url_to_process)
684
+ end
685
+
686
+ def parse_product(response, url:, data: {})
687
+ puts "From page https://example.com/some_product !"
688
+ end
689
+ end
690
+ ```
691
+ </details>
692
+
693
+ ### `save_to` helper
694
+
695
+ Sometimes all you need is to simply save scraped data in a file format, like JSON or CSV. You can use `save_to` for that:
696
+
697
+ ```ruby
698
+ class ProductsSpider < Kimurai::Base
699
+ @engine = :selenium_chrome
700
+ @start_urls = ["https://example-shop.com/"]
701
+
702
+ # ...
703
+
704
+ def parse_product(response, url:, data: {})
705
+ item = {}
706
+
707
+ item[:title] = response.xpath("//title/path").text
708
+ item[:description] = response.xpath("//desc/path").text.squish
709
+ item[:price] = response.xpath("//price/path").text[/\d+/]&.to_f
710
+
711
+ # Add each new item to the `scraped_products.json` file:
712
+ save_to "scraped_products.json", item, format: :json
713
+ end
714
+ end
715
+ ```
716
+
717
+ Supported formats:
718
+ * `:json` JSON
719
+ * `:pretty_json` "pretty" JSON (`JSON.pretty_generate`)
720
+ * `:jsonlines` [JSON Lines](http://jsonlines.org/)
721
+ * `:csv` CSV
722
+
723
+ Note: `save_to` requires data (item to save) to be a `Hash`.
724
+
725
+ By default `save_to` adds a position key to the item hash. You can disable it with `position: false`: `save_to "scraped_products.json", item, format: :json, position: false`.
726
+
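+ For example, a minimal sketch combining a format option with `position: false` (the file name and item fields are hypothetical):
+
+ ```ruby
+ item = { sku: "ABC-1", title: "Example product", price: 9.99 }
+
+ # Append the item as a row of `products.csv`, without the extra position column:
+ save_to "products.csv", item, format: :csv, position: false
+ ```
+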
727
+ **How helper works:**
728
+
729
+ While the spider is running, each new item is appended to the file. On the next run, the helper clears the file's contents first and then starts appending items to it again.
730
+
731
+ ### Skip duplicates, `unique?` helper
732
+
733
+ It's pretty common for websites to have duplicated pages, for example when an e-commerce shop lists the same products in different categories. To skip duplicates, there is the `unique?` helper:
734
+
735
+ ```ruby
736
+ class ProductsSpider < Kimurai::Base
737
+ @engine = :selenium_chrome
738
+ @start_urls = ["https://example-shop.com/"]
739
+
740
+ def parse(response, url:, data: {})
741
+ response.xpath("//categories/path").each do |category|
742
+ request_to :parse_category, url: category[:href]
743
+ end
744
+ end
745
+
746
+ # Check products for uniqueness using product url inside of parse_category:
747
+ def parse_category(response, url:, data: {})
748
+ response.xpath("//products/path").each do |product|
749
+ # Skip url if it's not unique:
750
+ next unless unique?(:product_url, product[:href])
751
+ # Otherwise process it:
752
+ request_to :parse_product, url: product[:href]
753
+ end
754
+ end
755
+
756
+ # Or/and check products for uniqueness using product sku inside of parse_product:
757
+ def parse_product(response, url:, data: {})
758
+ item = {}
759
+ item[:sku] = response.xpath("//product/sku/path").text.strip.upcase
760
+ # Don't save product and return from method if there is already saved item with the same sku:
761
+ return unless unique?(:sku, item[:sku])
762
+
763
+ # ...
764
+ save_to "results.json", item, format: :json
765
+ end
766
+ end
767
+ ```
768
+
769
+ The `unique?` helper works pretty simply:
770
+
771
+ ```ruby
772
+ # Check string "http://example.com" in scope `url` for a first time:
773
+ unique?(:url, "http://example.com")
774
+ # => true
775
+
776
+ # Try again:
777
+ unique?(:url, "http://example.com")
778
+ # => false
779
+ ```
780
+
781
+ To check something for uniqueness, you need to provide a scope:
782
+
783
+ ```ruby
784
+ # `product_url` scope
785
+ unique?(:product_url, "http://example.com/product_1")
786
+
787
+ # `id` scope
788
+ unique?(:id, 324234232)
789
+
790
+ # `custom` scope
791
+ unique?(:custom, "Lorem Ipsum")
792
+ ```
793
+
794
+ ### `open_spider` and `close_spider` callbacks
795
+
796
+ You can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action before the spider starts or after the spider has stopped:
797
+
798
+ ```ruby
799
+ require 'kimurai'
800
+
801
+ class ExampleSpider < Kimurai::Base
802
+ @name = "example_spider"
803
+ @engine = :selenium_chrome
804
+ @start_urls = ["https://example.com/"]
805
+
806
+ def self.open_spider
807
+ logger.info "> Starting..."
808
+ end
809
+
810
+ def self.close_spider
811
+ logger.info "> Stopped!"
812
+ end
813
+
814
+ def parse(response, url:, data: {})
815
+ logger.info "> Scraping..."
816
+ end
817
+ end
818
+
819
+ ExampleSpider.crawl!
820
+ ```
821
+
822
+ <details/>
823
+ <summary>Output</summary>
824
+
825
+ ```
826
+ I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] INFO -- example_spider: Spider: started: example_spider
827
+ I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] INFO -- example_spider: > Starting...
828
+ D, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): created browser instance
829
+ D, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
830
+ I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] INFO -- example_spider: Browser: started get request to: https://example.com/
831
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Browser: finished get request to: https://example.com/
832
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Info: visits: requests: 1, responses: 1
833
+ D, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: Browser: driver.current_memory: 82415
834
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: > Scraping...
835
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Browser: driver selenium_chrome has been destroyed
836
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: > Stopped!
837
+ I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Spider: stopped: {:spider_name=>"example_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 14:26:32 +0400, :stop_time=>2018-08-22 14:26:34 +0400, :running_time=>"1s", :visits=>{:requests=>1, :responses=>1}, :error=>nil}
838
+ ```
839
+ </details><br>
840
+
841
+ Inside the `open_spider` and `close_spider` class methods, the `run_info` method is available; it contains useful information about the spider's state:
842
+
843
+ ```ruby
844
+ 11: def self.open_spider
845
+ => 12: binding.pry
846
+ 13: end
847
+
848
+ [1] pry(example_spider)> run_info
849
+ => {
850
+ :spider_name=>"example_spider",
851
+ :status=>:running,
852
+ :environment=>"development",
853
+ :start_time=>2018-08-05 23:32:00 +0400,
854
+ :stop_time=>nil,
855
+ :running_time=>nil,
856
+ :visits=>{:requests=>0, :responses=>0},
857
+ :error=>nil
858
+ }
859
+ ```
860
+
861
+ Inside `close_spider`, `run_info` will be updated:
862
+
863
+ ```ruby
864
+ 15: def self.close_spider
865
+ => 16: binding.pry
866
+ 17: end
867
+
868
+ [1] pry(example_spider)> run_info
869
+ => {
870
+ :spider_name=>"example_spider",
871
+ :status=>:completed,
872
+ :environment=>"development",
873
+ :start_time=>2018-08-05 23:32:00 +0400,
874
+ :stop_time=>2018-08-05 23:32:06 +0400,
875
+ :running_time=>6.214,
876
+ :visits=>{:requests=>1, :responses=>1},
877
+ :error=>nil
878
+ }
879
+ ```
880
+
881
+ `run_info[:status]` helps to determine whether the spider finished successfully or failed (possible values: `:completed`, `:failed`):
882
+
883
+ ```ruby
884
+ class ExampleSpider < Kimurai::Base
885
+ @name = "example_spider"
886
+ @engine = :selenium_chrome
887
+ @start_urls = ["https://example.com/"]
888
+
889
+ def self.close_spider
890
+ puts ">>> run info: #{run_info}"
891
+ end
892
+
893
+ def parse(response, url:, data: {})
894
+ logger.info "> Scraping..."
895
+ # Let's try to strip nil:
896
+ nil.strip
897
+ end
898
+ end
899
+ ```
900
+
901
+ <details/>
902
+ <summary>Output</summary>
903
+
904
+ ```
905
+ I, [2018-08-22 14:34:24 +0400#8459] [M: 47020523644400] INFO -- example_spider: Spider: started: example_spider
906
+ D, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): created browser instance
907
+ D, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
908
+ I, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] INFO -- example_spider: Browser: started get request to: https://example.com/
909
+ I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: Browser: finished get request to: https://example.com/
910
+ I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: Info: visits: requests: 1, responses: 1
911
+ D, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: Browser: driver.current_memory: 83351
912
+ I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: > Scraping...
913
+ I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: Browser: driver selenium_chrome has been destroyed
914
+
915
+ >>> run info: {:spider_name=>"example_spider", :status=>:failed, :environment=>"development", :start_time=>2018-08-22 14:34:24 +0400, :stop_time=>2018-08-22 14:34:26 +0400, :running_time=>2.01, :visits=>{:requests=>1, :responses=>1}, :error=>"#<NoMethodError: undefined method `strip' for nil:NilClass>"}
916
+
917
+ F, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] FATAL -- example_spider: Spider: stopped: {:spider_name=>"example_spider", :status=>:failed, :environment=>"development", :start_time=>2018-08-22 14:34:24 +0400, :stop_time=>2018-08-22 14:34:26 +0400, :running_time=>"2s", :visits=>{:requests=>1, :responses=>1}, :error=>"#<NoMethodError: undefined method `strip' for nil:NilClass>"}
918
+ Traceback (most recent call last):
919
+ 6: from example_spider.rb:19:in `<main>'
920
+ 5: from /home/victor/code/kimurai/lib/kimurai/base.rb:127:in `crawl!'
921
+ 4: from /home/victor/code/kimurai/lib/kimurai/base.rb:127:in `each'
922
+ 3: from /home/victor/code/kimurai/lib/kimurai/base.rb:128:in `block in crawl!'
923
+ 2: from /home/victor/code/kimurai/lib/kimurai/base.rb:185:in `request_to'
924
+ 1: from /home/victor/code/kimurai/lib/kimurai/base.rb:185:in `public_send'
925
+ example_spider.rb:15:in `parse': undefined method `strip' for nil:NilClass (NoMethodError)
926
+ ```
927
+ </details><br>
928
+
929
+ **Usage example:** if the spider finished successfully, send a JSON file with the scraped items to a remote FTP location; otherwise (if the spider failed), skip the incomplete results and send an email/Slack notification about it:
930
+
931
+ <details/>
932
+ <summary>Example</summary>
933
+
934
+ You can also use the additional helper methods `completed?` and `failed?`:
935
+
936
+ ```ruby
937
+ class Spider < Kimurai::Base
938
+ @engine = :selenium_chrome
939
+ @start_urls = ["https://example.com/"]
940
+
941
+ def self.close_spider
942
+ if completed?
943
+ send_file_to_ftp("results.json")
944
+ else
945
+ send_error_notification(run_info[:error])
946
+ end
947
+ end
948
+
949
+ def self.send_file_to_ftp(file_path)
950
+ # ...
951
+ end
952
+
953
+ def self.send_error_notification(error)
954
+ # ...
955
+ end
956
+
957
+ # ...
958
+
959
+ def parse_item(response, url:, data: {})
960
+ item = {}
961
+ # ...
962
+
963
+ save_to "results.json", item, format: :json
964
+ end
965
+ end
966
+ ```
967
+ </details>
968
+
969
+
970
+ ### `KIMURAI_ENV`
971
+ Kimurai has environments; the default is `development`. To provide a custom environment, pass the `KIMURAI_ENV` ENV variable before the command: `$ KIMURAI_ENV=production ruby spider.rb`. To access the current environment, use the `Kimurai.env` method.
972
+
973
+ Usage example:
974
+ ```ruby
975
+ class Spider < Kimurai::Base
976
+ @engine = :selenium_chrome
977
+ @start_urls = ["https://example.com/"]
978
+
979
+ def self.close_spider
980
+ if failed? && Kimurai.env == "production"
981
+ send_error_notification(run_info[:error])
982
+ else
983
+ # Do nothing
984
+ end
985
+ end
986
+
987
+ # ...
988
+ end
989
+ ```
990
+
991
+ ### Parallel crawling using `in_parallel`
992
+ Kimurai can process web pages concurrently in a single line: `in_parallel(:parse_product, urls, threads: 3)`, where `:parse_product` is the method to process, `urls` is an array of urls to crawl and `threads:` is the number of threads:
993
+
994
+ ```ruby
995
+ # amazon_spider.rb
996
+ require 'kimurai'
997
+
998
+ class AmazonSpider < Kimurai::Base
999
+ @name = "amazon_spider"
1000
+ @engine = :mechanize
1001
+ @start_urls = ["https://www.amazon.com/"]
1002
+
1003
+ def parse(response, url:, data: {})
1004
+ browser.fill_in "field-keywords", with: "Web Scraping Books"
1005
+ browser.click_on "Go"
1006
+
1007
+ # Walk through pagination and collect products urls:
1008
+ urls = []
1009
+ loop do
1010
+ response = browser.current_response
1011
+ response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
1012
+ urls << a[:href].sub(/ref=.+/, "")
1013
+ end
1014
+
1015
+ browser.find(:xpath, "//a[@id='pagnNextLink']", wait: 1).click rescue break
1016
+ end
1017
+
1018
+ # Process all collected urls concurrently within 3 threads:
1019
+ in_parallel(:parse_book_page, urls, threads: 3)
1020
+ end
1021
+
1022
+ def parse_book_page(response, url:, data: {})
1023
+ item = {}
1024
+
1025
+ item[:title] = response.xpath("//h1/span[@id]").text.squish
1026
+ item[:url] = url
1027
+ item[:price] = response.xpath("(//span[contains(@class, 'a-color-price')])[1]").text.squish.presence
1028
+ item[:publisher] = response.xpath("//h2[text()='Product details']/following::b[text()='Publisher:']/following-sibling::text()[1]").text.squish.presence
1029
+
1030
+ save_to "books.json", item, format: :pretty_json
1031
+ end
1032
+ end
1033
+
1034
+ AmazonSpider.crawl!
1035
+ ```
1036
+
1037
+ <details/>
1038
+ <summary>Run: <code>$ ruby amazon_spider.rb</code></summary>
1039
+
1040
+ ```
1041
+ I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: started: amazon_spider
1042
+ D, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1043
+ I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
1044
+ I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
1045
+ I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Info: visits: requests: 1, responses: 1
1046
+
1047
+ I, [2018-08-22 14:48:43 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: starting processing 52 urls within 3 threads
1048
+ D, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1049
+ I, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
1050
+ D, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1051
+ I, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
1052
+ D, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
1053
+ I, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
1054
+ I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
1055
+ I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Info: visits: requests: 4, responses: 2
1056
+ I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/
1057
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
1058
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 5, responses: 3
1059
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/
1060
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
1061
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 6, responses: 4
1062
+ I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Excel-Effective-Scrapes-ebook/dp/B01CMMJGZ8/
1063
+
1064
+ ...
1065
+
1066
+ I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Info: visits: requests: 51, responses: 49
1067
+ I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1068
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Ice-Life-Bill-Rayburn-ebook/dp/B00C0NF1L8/
1069
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 51, responses: 50
1070
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
1071
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Php-architects-Guide-Scraping-Author/dp/B010DTKYY4/
1072
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 52, responses: 51
1073
+ I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
1074
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
1075
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 53, responses: 52
1076
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1077
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
1078
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 53, responses: 53
1079
+ I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1080
+
1081
+ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: stopped processing 52 urls within 3 threads, total time: 29s
1082
+ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
1083
+
1084
+ I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 14:48:37 +0400, :stop_time=>2018-08-22 14:49:12 +0400, :running_time=>"35s", :visits=>{:requests=>53, :responses=>53}, :error=>nil}
1085
+
1086
+ ```
1087
+ </details>
1088
+
1089
+ <details/>
1090
+ <summary>books.json</summary>
1091
+
1092
+ ```json
1093
+ [
1094
+ {
1095
+ "title": "Web Scraping with Python: Collecting More Data from the Modern Web2nd Edition",
1096
+ "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/",
1097
+ "price": "$26.94",
1098
+ "publisher": "O'Reilly Media; 2 edition (April 14, 2018)",
1099
+ "position": 1
1100
+ },
1101
+ {
1102
+ "title": "Python Web Scraping Cookbook: Over 90 proven recipes to get you scraping with Python, micro services, Docker and AWS",
1103
+ "url": "https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/",
1104
+ "price": "$39.99",
1105
+ "publisher": "Packt Publishing - ebooks Account (February 9, 2018)",
1106
+ "position": 2
1107
+ },
1108
+ {
1109
+ "title": "Web Scraping with Python: Collecting Data from the Modern Web1st Edition",
1110
+ "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/",
1111
+ "price": "$15.75",
1112
+ "publisher": "O'Reilly Media; 1 edition (July 24, 2015)",
1113
+ "position": 3
1114
+ },
1115
+
1116
+ ...
1117
+
1118
+ {
1119
+ "title": "Instant Web Scraping with Java by Ryan Mitchell (2013-08-26)",
1120
+ "url": "https://www.amazon.com/Instant-Scraping-Java-Mitchell-2013-08-26/dp/B01FEM76X2/",
1121
+ "price": "$35.82",
1122
+ "publisher": "Packt Publishing (2013-08-26) (1896)",
1123
+ "position": 52
1124
+ }
1125
+ ]
1126
+ ```
1127
+ </details><br>
1128
+
1129
+ > Note that [save_to](#save_to-helper) and [unique?](#skip-duplicates-unique-helper) helpers are thread-safe (protected by [Mutex](https://ruby-doc.org/core-2.5.1/Mutex.html)) and can be freely used inside threads.
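+
+ For example, a minimal sketch (the `:book_url` scope name is illustrative) of combining both helpers inside a method that runs in parallel threads:
+
+ ```ruby
+ def parse_book_page(response, url:, data: {})
+   # Skip this page if the url was already processed by another thread:
+   return unless unique?(:book_url, url)
+
+   item = {}
+   item[:title] = response.xpath("//h1/span[@id]").text.squish
+   item[:url] = url
+
+   # save_to is protected by a Mutex, so concurrent writes are safe:
+   save_to "books.json", item, format: :json
+ end
+ ```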
1130
+
1131
+ `in_parallel` can take additional options (see the combined sketch after this list):
1132
+ * `data:` pass a custom data hash along with the urls: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
1133
+ * `delay:` set a delay between requests: `in_parallel(:method, urls, threads: 3, delay: 2)`. The delay can be an `Integer`, `Float` or `Range` (`2..5`). In case of a range, the delay will be chosen randomly for each request: `rand(2..5) # => 3`
1134
+ * `engine:` set a custom engine instead of the default one: `in_parallel(:method, urls, threads: 3, engine: :poltergeist_phantomjs)`
1135
+ * `config:` pass custom options to config (see [config section](#crawler-config))
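+
+ Combining these options, a minimal sketch (the method name and data key are illustrative):
+
+ ```ruby
+ # Process category pages within 3 threads, with a random 2..5 second delay
+ # before each request, passing the category name along with every url
+ # (inside parse_category it will be available as data[:category]):
+ in_parallel(:parse_category, urls,
+   threads: 3,
+   delay: 2..5,
+   data: { category: "Web Scraping" },
+   engine: :poltergeist_phantomjs
+ )
+ ```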
1136
+
1137
+ ### Active Support included
1138
+
1139
+ You can use all the power of familiar [Rails core-ext methods](https://guides.rubyonrails.org/active_support_core_extensions.html#loading-all-core-extensions) for scraping inside Kimurai. Especially take a look at [squish](https://apidock.com/rails/String/squish), [truncate_words](https://apidock.com/rails/String/truncate_words), [titleize](https://apidock.com/rails/String/titleize), [remove](https://apidock.com/rails/String/remove), [present?](https://guides.rubyonrails.org/active_support_core_extensions.html#blank-questionmark-and-present-questionmark) and [presence](https://guides.rubyonrails.org/active_support_core_extensions.html#presence).
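+
+ For example, a few of these helpers in action (the strings are illustrative):
+
+ ```ruby
+ "  Web   Scraping \n Handbook ".squish               # => "Web Scraping Handbook"
+ "A very long product description".truncate_words(3)  # => "A very long..."
+ "web scraping handbook".titleize                     # => "Web Scraping Handbook"
+ "Price: $26.94".remove("Price: ")                    # => "$26.94"
+ "".present?                                          # => false
+ "  ".presence || "N/A"                               # => "N/A"
+ ```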
1140
+
1141
+ ### Schedule spiders using Cron
1142
+
1143
+ 1) Inside the spider directory, generate a [Whenever](https://github.com/javan/whenever) config: `$ kimurai generate schedule`.
1144
+
1145
+ <details/>
1146
+ <summary><code>schedule.rb</code></summary>
1147
+
1148
+ ```ruby
1149
+ ### Settings ###
1150
+ require 'tzinfo'
1151
+
1152
+ # Export current PATH to the cron
1153
+ env :PATH, ENV["PATH"]
1154
+
1155
+ # Use 24 hour format when using `at:` option
1156
+ set :chronic_options, hours24: true
1157
+
1158
+ # Use the local_to_utc helper to set up the execution time using your local timezone instead
1159
+ # of the server's timezone (which probably is, and should be, UTC; to check, run `$ timedatectl`).
1160
+ # You may also want to set the same timezone in Kimurai (use `Kimurai.configuration.time_zone =` for that),
1161
+ # to have spider logs in a specific time zone.
1162
+ # Example usage of helper:
1163
+ # every 1.day, at: local_to_utc("7:00", zone: "Europe/Moscow") do
1164
+ # crawl "google_spider.com", output: "log/google_spider.com.log"
1165
+ # end
1166
+ def local_to_utc(time_string, zone:)
1167
+ TZInfo::Timezone.get(zone).local_to_utc(Time.parse(time_string))
1168
+ end
1169
+
1170
+ # Note: by default Whenever exports cron commands with :environment == "production".
1171
+ # Note: Whenever can only append log data to a log file (>>). If you want
1172
+ # to overwrite (>) the log file before each run, pass a lambda:
1173
+ # crawl "google_spider.com", output: -> { "> log/google_spider.com.log 2>&1" }
1174
+
1175
+ # Project job types
1176
+ job_type :crawl, "cd :path && KIMURAI_ENV=:environment bundle exec kimurai crawl :task :output"
1177
+ job_type :runner, "cd :path && KIMURAI_ENV=:environment bundle exec kimurai runner --jobs :task :output"
1178
+
1179
+ # Single file job type
1180
+ job_type :single, "cd :path && KIMURAI_ENV=:environment ruby :task :output"
1181
+ # Single with bundle exec
1182
+ job_type :single_bundle, "cd :path && KIMURAI_ENV=:environment bundle exec ruby :task :output"
1183
+
1184
+ ### Schedule ###
1185
+ # Usage (check examples here https://github.com/javan/whenever#example-schedulerb-file):
1186
+ # every 1.day do
1187
+ # Example to schedule a single spider in the project:
1188
+ # crawl "google_spider.com", output: "log/google_spider.com.log"
1189
+
1190
+ # Example to schedule all spiders in the project using runner. Each spider will write
1191
+ # its own output to the `log/spider_name.log` file (handled by the runner itself).
1192
+ # Runner output will be written to log/runner.log file.
1193
+ # The numeric argument is the number of concurrent jobs:
1194
+ # runner 3, output:"log/runner.log"
1195
+
1196
+ # Example to schedule single spider (without project):
1197
+ # single "single_spider.rb", output: "single_spider.log"
1198
+ # end
1199
+
1200
+ ### How to set a cron schedule ###
1201
+ # Run: `$ whenever --update-crontab --load-file config/schedule.rb`.
1202
+ # If you don't have whenever command, install the gem: `$ gem install whenever`.
1203
+
1204
+ ### How to cancel a schedule ###
1205
+ # Run: `$ whenever --clear-crontab --load-file config/schedule.rb`.
1206
+ ```
1207
+ </details><br>
1208
+
1209
+ 2) Add the following code at the bottom of `schedule.rb`:
1210
+
1211
+ ```ruby
1212
+ every 1.day, at: "7:00" do
1213
+ single "example_spider.rb", output: "example_spider.log"
1214
+ end
1215
+ ```
1216
+
1217
+ 3) Run: `$ whenever --update-crontab --load-file schedule.rb`. Done!
1218
+
1219
+ You can check Whenever examples [here](https://github.com/javan/whenever#example-schedulerb-file). To cancel the schedule, run: `$ whenever --clear-crontab --load-file schedule.rb`.
1220
+
1221
+ ### Configuration options
1222
+ You can configure several options using the `configure` block:
1223
+
1224
+ ```ruby
1225
+ Kimurai.configure do |config|
1226
+ # Default logger has colored mode in development.
1227
+ # If you would like to disable it, set `colorize_logger` to false.
1228
+ # config.colorize_logger = false
1229
+
1230
+ # Logger level for default logger:
1231
+ # config.log_level = :info
1232
+
1233
+ # Custom logger:
1234
+ # config.logger = Logger.new(STDOUT)
1235
+
1236
+ # Custom time zone (for logs):
1237
+ # config.time_zone = "UTC"
1238
+ # config.time_zone = "Europe/Moscow"
1239
+ end
1240
+ ```
1241
+
1242
+ ### Using Kimurai inside existing Ruby application
1243
+
1244
+ You can integrate Kimurai spiders (which are just Ruby classes) into an existing Ruby application like Rails or Sinatra, and run them using background jobs, for example. Check the following info to understand how spiders run.
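+
+ As a quick illustration, a minimal sketch of running a spider from a Rails background job, assuming ActiveJob and a spider class like the `ExampleSpider` below (the job class and queue name are hypothetical):
+
+ ```ruby
+ # app/jobs/example_spider_job.rb (hypothetical)
+ class ExampleSpiderJob < ApplicationJob
+   queue_as :scraping
+
+   def perform
+     # .crawl! performs a full run of the spider (see below):
+     ExampleSpider.crawl!
+   end
+ end
+
+ # Enqueue it somewhere in the application:
+ # ExampleSpiderJob.perform_later
+ ```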
1245
+
1246
+ #### `.crawl!` method
1247
+
1248
+ `.crawl!` (class method) performs a _full run_ of a particular spider. The method returns `run_info` if the run was successful, or raises an exception if something went wrong.
1249
+
1250
+ ```ruby
1251
+ class ExampleSpider < Kimurai::Base
1252
+ @name = "example_spider"
1253
+ @engine = :mechanize
1254
+ @start_urls = ["https://example.com/"]
1255
+
1256
+ def parse(response, url:, data: {})
1257
+ title = response.xpath("//title").text.squish
1258
+ end
1259
+ end
1260
+
1261
+ ExampleSpider.crawl!
1262
+ # => { :spider_name => "example_spider", :status => :completed, :environment => "development", :start_time => 2018-08-22 18:20:16 +0400, :stop_time => 2018-08-22 18:20:17 +0400, :running_time => 1.216, :visits => { :requests => 1, :responses => 1 }, :items => { :sent => 0, :processed => 0 }, :error => nil }
1263
+ ```
1264
+
1265
+ You can't `.crawl!` a spider in a different thread while it's still running (because spider instances store some shared data in the `@run_info` class variable while crawling); in that case `.crawl!` returns `false`:
1266
+
1267
+ ```ruby
1268
+ 2.times do |i|
1269
+ Thread.new { p i, ExampleSpider.crawl! }
1270
+ end # =>
1271
+
1272
+ # 1
1273
+ # false
1274
+
1275
+ # 0
1276
+ # {:spider_name=>"example_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 18:49:22 +0400, :stop_time=>2018-08-22 18:49:23 +0400, :running_time=>0.801, :visits=>{:requests=>1, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :error=>nil}
1277
+ ```
1278
+
1279
+ So, what if you don't care about stats and just want to send a request to a particular spider method and get its return value? Use `.parse!` instead:
1280
+
1281
+ #### `.parse!(:method_name, url:)` method
1282
+
1283
+ `.parse!` (class method) creates a new spider instance and performs a request to the given method with the given url. The method's return value is passed back:
1284
+
1285
+ ```ruby
1286
+ class ExampleSpider < Kimurai::Base
1287
+ @name = "example_spider"
1288
+ @engine = :mechanize
1289
+ @start_urls = ["https://example.com/"]
1290
+
1291
+ def parse(response, url:, data: {})
1292
+ title = response.xpath("//title").text.squish
1293
+ end
1294
+ end
1295
+
1296
+ ExampleSpider.parse!(:parse, url: "https://example.com/")
1297
+ # => "Example Domain"
1298
+ ```
1299
+
1300
+ Like `.crawl!`, the `.parse!` method takes care of the browser instance and kills it (`browser.destroy_driver!`) before returning the value. Unlike `.crawl!`, the `.parse!` method can be called from different threads at the same time:
1301
+
1302
+ ```ruby
1303
+ urls = ["https://www.google.com/", "https://www.reddit.com/", "https://en.wikipedia.org/"]
1304
+
1305
+ urls.each do |url|
1306
+ Thread.new { p ExampleSpider.parse!(:parse, url: url) }
1307
+ end # =>
1308
+
1309
+ # "Google"
1310
+ # "Wikipedia, the free encyclopedia"
1311
+ # "reddit: the front page of the internetHotHot"
1312
+ ```
1313
+
1314
+ #### `Kimurai.list` and `Kimurai.find_by_name()`
1315
+
1316
+ ```ruby
1317
+ class GoogleSpider < Kimurai::Base
1318
+ @name = "google_spider"
1319
+ end
1320
+
1321
+ class RedditSpider < Kimurai::Base
1322
+ @name = "reddit_spider"
1323
+ end
1324
+
1325
+ class WikipediaSpider < Kimurai::Base
1326
+ @name = "wikipedia_spider"
1327
+ end
1328
+
1329
+ # To get the list of all available spider classes:
1330
+ Kimurai.list
1331
+ # => {"google_spider"=>GoogleSpider, "reddit_spider"=>RedditSpider, "wikipedia_spider"=>WikipediaSpider}
1332
+
1333
+ # To find a particular spider class by its name:
1334
+ Kimurai.find_by_name("reddit_spider")
1335
+ # => RedditSpider
1336
+ ```
1337
+
1338
+
1339
+ ### Automated server setup and deployment
1340
+ > **EXPERIMENTAL**
1341
+
1342
+ #### Setup
1343
+ You can automatically set up the [required environment](#installation) for Kimurai on a remote server (currently only Ubuntu Server 18.04 is supported) using the `$ kimurai setup` command. `setup` will install: the latest Ruby with Rbenv, browsers with webdrivers and, in addition, database clients (clients only) for MySQL, Postgres and MongoDB (so you can connect to a remote database from Ruby).
1344
+
1345
+ > To perform remote server setup, [Ansible](https://github.com/ansible/ansible) is required **on the desktop** machine (to install: Ubuntu: `$ sudo apt install ansible`, Mac OS X: `$ brew install ansible`)
1346
+
1347
+ Example:
1348
+
1349
+ ```bash
1350
+ $ kimurai setup deploy@123.123.123.123 --ask-sudo --ssh-key-path path/to/private_key
1351
+ ```
1352
+
1353
+ CLI options:
1354
+ * `--ask-sudo` pass this option to be asked for the sudo (user) password for system-wide installation of packages (`apt install`)
1355
+ * `--ssh-key-path path/to/private_key` authorize on the server using a private ssh key. You can omit it if the required key is already [added to the keychain](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/#adding-your-ssh-key-to-the-ssh-agent) on your desktop (Ansible uses [SSH agent forwarding](https://developer.github.com/v3/guides/using-ssh-agent-forwarding/))
1356
+ * `--ask-auth-pass` authorize on the server using a user password; an alternative to `--ssh-key-path`.
1357
+ * `-p port_number` custom port for ssh connection (`-p 2222`)
1358
+
1359
+ > You can check setup playbook [here](lib/kimurai/automation/setup.yml)
1360
+
1361
+ #### Deploy
1362
+
1363
+ After a successful `setup` you can deploy a spider to the remote server using the `$ kimurai deploy` command. Each deploy performs several tasks: 1) pull the repo from the remote origin to the `~/repo_name` user directory 2) run `bundle install` 3) update the crontab with `whenever --update-crontab` (to update the spider schedule from the schedule.rb file).
1364
+
1365
+ Before `deploy`, make sure that inside the spider directory you have: 1) a git repository with a remote origin (Bitbucket, GitHub, etc.) 2) a `Gemfile` 3) schedule.rb inside the `config` subfolder (`config/schedule.rb`).
1366
+
1367
+ Example:
1368
+
1369
+ ```bash
1370
+ $ kimurai deploy deploy@123.123.123.123 --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key
1371
+ ```
1372
+
1373
+ CLI options: _same as for the [setup](#setup) command_ (except `--ask-sudo`), plus
1374
+ * `--repo-url` provide a custom repo url (`--repo-url git@bitbucket.org:username/repo_name.git`), otherwise the current `origin/master` will be used (the output of `$ git remote get-url origin`)
1375
+ * `--repo-key-path` if the git repository is private, authorization is required to pull the code on the remote server. Use this option to provide a private repository SSH key. You can omit it if the required key is already added to the keychain on your desktop (same as with the `--ssh-key-path` option)
1376
+
1377
+ > You can check deploy playbook [here](lib/kimurai/automation/deploy.yml)
1378
+
1379
+ ## Spider `@config`
1380
+
1381
+ Using `@config` you can set several options for a spider, like proxy, user-agent, default cookies/headers, delay between requests, browser **memory control** and so on:
1382
+
1383
+ ```ruby
1384
+ class Spider < Kimurai::Base
1385
+ USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
1386
+ PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]
1387
+
1388
+ @engine = :poltergeist_phantomjs
1389
+ @start_urls = ["https://example.com/"]
1390
+ @config = {
1391
+ headers: { "custom_header" => "custom_value" },
1392
+ cookies: [{ name: "cookie_name", value: "cookie_value", domain: ".example.com" }],
1393
+ user_agent: -> { USER_AGENTS.sample },
1394
+ proxy: -> { PROXIES.sample },
1395
+ window_size: [1366, 768],
1396
+ disable_images: true,
1397
+ browser: {
1398
+ restart_if: {
1399
+ # Restart browser if provided memory limit (in kilobytes) is exceeded:
1400
+ memory_limit: 350_000
1401
+ },
1402
+ before_request: {
1403
+ # Change user agent before each request:
1404
+ change_user_agent: true,
1405
+ # Change proxy before each request:
1406
+ change_proxy: true,
1407
+ # Clear all cookies and set default cookies (if provided) before each request:
1408
+ clear_and_set_cookies: true,
1409
+ # Process delay before each request:
1410
+ delay: 1..3
1411
+ }
1412
+ }
1413
+ }
1414
+
1415
+ def parse(response, url:, data: {})
1416
+ # ...
1417
+ end
1418
+ end
1419
+ ```
1420
+
1421
+ ### All available `@config` options
1422
+
1423
+ ```ruby
1424
+ @config = {
1425
+ # Custom headers, format: hash. Example: { "some header" => "some value", "another header" => "another value" }
1426
+ # Works only for :mechanize and :poltergeist_phantomjs engines (Selenium doesn't allow setting/getting headers)
1427
+ headers: {},
1428
+
1429
+ # Custom User Agent, format: string or lambda.
1430
+ # Use lambda if you want to rotate user agents before each run:
1431
+ # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
1432
+ # Works for all engines
1433
+ user_agent: "Mozilla/5.0 Firefox/61.0",
1434
+
1435
+ # Custom cookies, format: array of hashes.
1436
+ # Format for a single cookie: { name: "cookie name", value: "cookie value", domain: ".example.com" }
1437
+ # Works for all engines
1438
+ cookies: [],
1439
+
1440
+ # Proxy, format: string or lambda. Format of a proxy string: "ip:port:protocol:user:password"
1441
+ # `protocol` can be http or socks5. User and password are optional.
1442
+ # Use lambda if you want to rotate proxies before each run:
1443
+ # proxy: -> { ARRAY_OF_PROXIES.sample }
1444
+ # Works for all engines, but keep in mind that Selenium drivers don't support proxies
1445
+ # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http)
1446
+ proxy: "3.4.5.6:3128:http:user:pass",
1447
+
1448
+ # If enabled, the browser will ignore any https errors. It's handy when using a proxy
1449
+ # with self-signed SSL cert (for example Crawlera or Mitmproxy)
1450
+ # It also allows visiting webpages with expired SSL certificates.
1451
+ # Works for all engines
1452
+ ignore_ssl_errors: true,
1453
+
1454
+ # Custom window size, works for all engines
1455
+ window_size: [1366, 768],
1456
+
1457
+ # Skip images downloading if true, works for all engines
1458
+ disable_images: true,
1459
+
1460
+ # Selenium engines only: headless mode, `:native` or `:virtual_display` (default is :native)
1461
+ # Although native mode has a better performance, virtual display mode
1462
+ # sometimes can be useful. For example, some websites can detect (and block)
1463
+ # headless chrome, so you can use virtual_display mode instead
1464
+ headless_mode: :native,
1465
+
1466
+ # This option tells the browser not to use a proxy for the provided list of domains or IP addresses.
1467
+ # Format: array of strings. Works only for :selenium_firefox and :selenium_chrome
1468
+ proxy_bypass_list: [],
1469
+
1470
+ # Option to provide custom SSL certificate. Works only for :poltergeist_phantomjs and :mechanize
1471
+ ssl_cert_path: "path/to/ssl_cert",
1472
+
1473
+ # Browser (Capybara session instance) options:
1474
+ browser: {
1475
+ # Array of errors to retry while processing a request
1476
+ retry_request_errors: [Net::ReadTimeout],
1477
+ # Restart browser if one of the options is true:
1478
+ restart_if: {
1479
+ # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
1480
+ memory_limit: 350_000,
1481
+
1482
+ # Restart browser if provided requests limit is exceeded (works for all engines)
1483
+ requests_limit: 100
1484
+ },
1485
+ before_request: {
1486
+ # Change proxy before each request. The `proxy:` option above should be present
1487
+ # and have lambda format. Works only for poltergeist and mechanize engines
1488
+ # (Selenium doesn't support proxy rotation).
1489
+ change_proxy: true,
1490
+
1491
+ # Change user agent before each request. The `user_agent:` option above should be present
1492
+ # and have lambda format. Works only for poltergeist and mechanize engines
1493
+ # (Selenium doesn't support getting/setting headers).
1494
+ change_user_agent: true,
1495
+
1496
+ # Clear all cookies before each request, works for all engines
1497
+ clear_cookies: true,
1498
+
1499
+ # If you want to clear all cookies + set custom cookies (`cookies:` option above should be present)
1500
+ # use this option instead (works for all engines)
1501
+ clear_and_set_cookies: true,
1502
+
1503
+ # Global option to set delay between requests.
1504
+ # Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a range,
1505
+ # delay number will be chosen randomly for each request: `rand (2..5) # => 3`
1506
+ delay: 1..3
1507
+ }
1508
+ }
1509
+ }
1510
+ ```
1511
+
1512
+ As you can see, most of the options are universal for any engine.
1513
+
1514
+ ### `@config` settings inheritance
1515
+ Settings can be inherited:
1516
+
1517
+ ```ruby
1518
+ class ApplicationSpider < Kimurai::Base
1519
+ @engine = :poltergeist_phantomjs
1520
+ @config = {
1521
+ user_agent: "Firefox",
1522
+ disable_images: true,
1523
+ browser: {
1524
+ restart_if: { memory_limit: 350_000 },
1525
+ before_request: { delay: 1..2 }
1526
+ }
1527
+ }
1528
+ end
1529
+
1530
+ class CustomSpider < ApplicationSpider
1531
+ @name = "custom_spider"
1532
+ @start_urls = ["https://example.com/"]
1533
+ @config = {
1534
+ browser: { before_request: { delay: 4..6 }}
1535
+ }
1536
+
1537
+ def parse(response, url:, data: {})
1538
+ # ...
1539
+ end
1540
+ end
1541
+ ```
1542
+
1543
+ Here, the `@config` of `CustomSpider` will be _[deep merged](https://apidock.com/rails/Hash/deep_merge)_ with the `ApplicationSpider` config, so `CustomSpider` keeps all inherited options, with only `delay` updated.
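+
+ The effective config for `CustomSpider` would therefore look roughly like this (illustrative):
+
+ ```ruby
+ {
+   user_agent: "Firefox",
+   disable_images: true,
+   browser: {
+     restart_if: { memory_limit: 350_000 },
+     before_request: { delay: 4..6 }
+   }
+ }
+ ```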
1544
+
1545
+ ## Project mode
1546
+
1547
+ Kimurai can work in project mode ([like Scrapy](https://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project)). To generate a new project, run: `$ kimurai generate project web_spiders` (where `web_spiders` is the name of the project).
1548
+
1549
+ Structure of the project:
1550
+
1551
+ ```bash
1552
+ .
1553
+ ├── config/
1554
+ │   ├── initializers/
1555
+ │   ├── application.rb
1556
+ │   ├── automation.yml
1557
+ │   ├── boot.rb
1558
+ │   └── schedule.rb
1559
+ ├── spiders/
1560
+ │   └── application_spider.rb
1561
+ ├── db/
1562
+ ├── helpers/
1563
+ │   └── application_helper.rb
1564
+ ├── lib/
1565
+ ├── log/
1566
+ ├── pipelines/
1567
+ │   ├── validator.rb
1568
+ │   └── saver.rb
1569
+ ├── tmp/
1570
+ ├── .env
1571
+ ├── Gemfile
1572
+ ├── Gemfile.lock
1573
+ └── README.md
1574
+ ```
1575
+
1576
+ <details/>
1577
+ <summary>Description</summary>
1578
+
1579
+ * `config/` folder for configuration files
1580
+ * `config/initializers` [Rails-like initializers](https://guides.rubyonrails.org/configuring.html#using-initializer-files) to load custom code at framework start
1581
+ * `config/application.rb` configuration settings for Kimurai (`Kimurai.configure do` block)
1582
+ * `config/automation.yml` specifies settings for [setup and deploy](#automated-server-setup-and-deployment)
1583
+ * `config/boot.rb` loads framework and project
1584
+ * `config/schedule.rb` Cron [schedule for spiders](#schedule-spiders-using-cron)
1585
+ * `spiders/` folder for spiders
1586
+ * `spiders/application_spider.rb` Base parent class for all spiders
1587
+ * `db/` store all database files here (`sqlite`, `json`, `csv`, etc.)
1588
+ * `helpers/` Rails-like helpers for spiders
1589
+ * `helpers/application_helper.rb` all methods inside ApplicationHelper module will be available for all spiders
1590
+ * `lib/` put custom Ruby code here
1591
+ * `log/` folder for logs
1592
+ * `pipelines/` folder for [Scrapy-like](https://doc.scrapy.org/en/latest/topics/item-pipeline.html) pipelines. One file = one pipeline
1593
+ * `pipelines/validator.rb` example pipeline to validate item
1594
+ * `pipelines/saver.rb` example pipeline to save item
1595
+ * `tmp/` folder for temporary files
1596
+ * `.env` file to store ENV variables for the project and load them using [Dotenv](https://github.com/bkeepers/dotenv)
1597
+ * `Gemfile` dependency file
1598
+ * `README.md` example project readme
1599
+ </details>
1600
+
1601
+
1602
+ ### Generate new spider
1603
+ To generate a new spider in the project, run:
1604
+
1605
+ ```bash
1606
+ $ kimurai generate spider example_spider
1607
+ create spiders/example_spider.rb
1608
+ ```
1609
+
1610
+ The command will generate a new spider class inherited from `ApplicationSpider`:
1611
+
1612
+ ```ruby
1613
+ class ExampleSpider < ApplicationSpider
1614
+ @name = "example_spider"
1615
+ @start_urls = []
1616
+ @config = {}
1617
+
1618
+ def parse(response, url:, data: {})
1619
+ end
1620
+ end
1621
+ ```
1622
+
1623
+ ### Crawl
1624
+ To run a particular spider in the project, run: `$ bundle exec kimurai crawl example_spider`. Don't forget to add `bundle exec` before the command to load the required environment.
1625
+
1626
+ ### List
1627
+ To list all project spiders, run: `$ bundle exec kimurai list`
1628
+
1629
+ ### Parse
1630
+ For project spiders you can use the `$ kimurai parse` command, which helps to debug spiders:
1631
+
1632
+ ```bash
1633
+ $ bundle exec kimurai parse example_spider parse_product --url https://example-shop.com/product-1
1634
+ ```
1635
+
1636
+ where `example_spider` is the spider to run, `parse_product` is the spider method to process and `--url` is the url to open inside the processing method.
1637
+
1638
+ ### Pipelines, `send_item` method
1639
+ You can use item pipelines to organize and store item-processing logic for all project spiders in one place (also check Scrapy's [description of pipelines](https://doc.scrapy.org/en/latest/topics/item-pipeline.html#item-pipeline)).
1640
+
1641
+ Imagine you have three spiders, each crawling a different e-commerce shop and saving only shoe positions. For each spider, you want to save only items with the "shoe" category, a unique sku, a valid title/price and existing images. To avoid code duplication between spiders, use pipelines:
1642
+
1643
+ <details/>
1644
+ <summary>Example</summary>
1645
+
1646
+ pipelines/validator.rb
1647
+ ```ruby
1648
+ class Validator < Kimurai::Pipeline
1649
+ def process_item(item, options: {})
1650
+ # Here you can validate item and raise `DropItemError`
1651
+ # if one of the validations failed. Examples:
1652
+
1653
+ # Drop item if its category is not "shoe":
1654
+ if item[:category] != "shoe"
1655
+ raise DropItemError, "Wrong item category"
1656
+ end
1657
+
1658
+ # Check item sku for uniqueness using the built-in unique? helper:
1659
+ unless unique?(:sku, item[:sku])
1660
+ raise DropItemError, "Item sku is not unique"
1661
+ end
1662
+
1663
+ # Drop item if the title is shorter than 5 characters:
1664
+ if item[:title].size < 5
1665
+ raise DropItemError, "Item title is short"
1666
+ end
1667
+
1668
+ # Drop item if price is not present
1669
+ unless item[:price].present?
1670
+ raise DropItemError, "item price is not present"
1671
+ end
1672
+
1673
+ # Drop item if it doesn't contain any images:
1674
+ unless item[:images].present?
1675
+ raise DropItemError, "Item images are not present"
1676
+ end
1677
+
1678
+ # Pass item to the next pipeline (if it wasn't dropped):
1679
+ item
1680
+ end
1681
+ end
1682
+
1683
+ ```
1684
+
1685
+ pipelines/saver.rb
1686
+ ```ruby
1687
+ class Saver < Kimurai::Pipeline
1688
+ def process_item(item, options: {})
1689
+ # Here you can save item to the database, send it to a remote API or
1690
+ # simply save item to a file format using `save_to` helper:
1691
+
1692
+ # To get the name of current spider: `spider.class.name`
1693
+ save_to "db/#{spider.class.name}.json", item, format: :json
1694
+
1695
+ item
1696
+ end
1697
+ end
1698
+ ```
1699
+
1700
+ spiders/application_spider.rb
1701
+ ```ruby
1702
+ class ApplicationSpider < Kimurai::Base
1703
+ @engine = :selenium_chrome
1704
+ # Define pipelines (in order) for all spiders:
1705
+ @pipelines = [:validator, :saver]
1706
+ end
1707
+ ```
1708
+
1709
+ spiders/shop_spider_1.rb
1710
+ ```ruby
1711
+ class ShopSpiderOne < ApplicationSpider
1712
+ @name = "shop_spider_1"
1713
+ @start_urls = ["https://shop-1.com"]
1714
+
1715
+ # ...
1716
+
1717
+ def parse_product(response, url:, data: {})
1718
+ # ...
1719
+
1720
+ # Send item to pipelines:
1721
+ send_item item
1722
+ end
1723
+ end
1724
+ ```
1725
+
1726
+ spiders/shop_spider_2.rb
1727
+ ```ruby
1728
+ class ShopSpiderTwo < ApplicationSpider
1729
+ @name = "shop_spider_2"
1730
+ @start_urls = ["https://shop-2.com"]
1731
+
1732
+ def parse_product(response, url:, data: {})
1733
+ # ...
1734
+
1735
+ # Send item to pipelines:
1736
+ send_item item
1737
+ end
1738
+ end
1739
+ ```
1740
+
1741
+ spiders/shop_spider_3.rb
1742
+ ```ruby
1743
+ class ShopSpiderThree < ApplicationSpider
1744
+ @name = "shop_spider_3"
1745
+ @start_urls = ["https://shop-3.com"]
1746
+
1747
+ def parse_product(response, url:, data: {})
1748
+ # ...
1749
+
1750
+ # Send item to pipelines:
1751
+ send_item item
1752
+ end
1753
+ end
1754
+ ```
1755
+ </details><br>
1756
+
1757
+ When you start using pipelines, item stats appear in the logs:
1758
+
1759
+ <details>
1760
+ <summary>Example</summary>
1761
+
1762
+ pipelines/validator.rb
1763
+ ```ruby
1764
+ class Validator < Kimurai::Pipeline
1765
+ def process_item(item, options: {})
1766
+ if item[:star_count] < 10
1767
+ raise DropItemError, "Repository doesn't have enough stars"
1768
+ end
1769
+
1770
+ item
1771
+ end
1772
+ end
1773
+ ```
1774
+
1775
+ spiders/github_spider.rb
1776
+ ```ruby
1777
+ class GithubSpider < ApplicationSpider
1778
+ @name = "github_spider"
1779
+ @engine = :selenium_chrome
1780
+ @pipelines = [:validator]
1781
+ @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
1782
+ @config = {
1783
+ user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
1784
+ browser: { before_request: { delay: 4..7 } }
1785
+ }
1786
+
1787
+ def parse(response, url:, data: {})
1788
+ response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
1789
+ request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
1790
+ end
1791
+
1792
+ if next_page = response.at_xpath("//a[@class='next_page']")
1793
+ request_to :parse, url: absolute_url(next_page[:href], base: url)
1794
+ end
1795
+ end
1796
+
1797
+ def parse_repo_page(response, url:, data: {})
1798
+ item = {}
1799
+
1800
+ item[:owner] = response.xpath("//h1//a[@rel='author']").text
1801
+ item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
1802
+ item[:repo_url] = url
1803
+ item[:description] = response.xpath("//span[@itemprop='about']").text.squish
1804
+ item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
1805
+ item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish.delete(",").to_i
1806
+ item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish.delete(",").to_i
1807
+ item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish.delete(",").to_i
1808
+ item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text
1809
+
1810
+ send_item item
1811
+ end
1812
+ end
1813
+ ```
1814
+
1815
+ ```
1816
+ $ bundle exec kimurai crawl github_spider
1817
+
1818
+ I, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] INFO -- github_spider: Spider: started: github_spider
1819
+ D, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
1820
+ I, [2018-08-22 15:56:40 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1821
+ I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
1822
+ I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 1, responses: 1
1823
+ D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 116182
1824
+ D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
1825
+
1826
+ I, [2018-08-22 15:56:49 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
1827
+ I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
1828
+ I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 2, responses: 2
1829
+ D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 217432
1830
+ D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1831
+ I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Pipeline: processed: {"owner":"lorien","repo_name":"awesome-web-scraping","repo_url":"https://github.com/lorien/awesome-web-scraping","description":"List of libraries, tools and APIs for web scraping and data processing.","tags":["awesome","awesome-list","web-scraping","data-processing","python","javascript","php","ruby"],"watch_count":159,"star_count":2423,"fork_count":358,"last_commit":"4 days ago"}
1832
+ I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: items: sent: 1, processed: 1
1833
+ D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 6 seconds before request...
1834
+
1835
+ ...
1836
+
1837
+ I, [2018-08-22 16:11:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
1838
+ I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
1839
+ I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 140, responses: 140
1840
+ D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 211713
1841
+
1842
+ D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
1843
+ E, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] ERROR -- github_spider: Pipeline: dropped: #<Kimurai::Pipeline::DropItemError: Repository doesn't have enough stars>, item: {:owner=>"preston", :repo_name=>"idclight", :repo_url=>"https://github.com/preston/idclight", :description=>"A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.", :tags=>[], :watch_count=>6, :star_count=>1, :fork_count=>0, :last_commit=>"on Apr 12, 2012"}
1844
+
1845
+ I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: items: sent: 127, processed: 12
1846
+
1847
+ I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
1848
+ I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 15:56:35 +0400, :stop_time=>2018-08-22 16:11:51 +0400, :running_time=>"15m, 16s", :visits=>{:requests=>140, :responses=>140}, :items=>{:sent=>127, :processed=>12}, :error=>nil}
1849
+ ```
1850
+ </details><br>
1851
+
1852
+ Also, you can pass custom options to a pipeline from a particular spider if you want to change the pipeline behavior for that spider:
1853
+
1854
+ <details>
1855
+ <summary>Example</summary>
1856
+
1857
+ spiders/custom_spider.rb
1858
+ ```ruby
1859
+ class CustomSpider < ApplicationSpider
1860
+ @name = "custom_spider"
1861
+ @start_urls = ["https://example.com"]
1862
+ @pipelines = [:validator]
1863
+
1864
+ # ...
1865
+
1866
+ def parse_item(response, url:, data: {})
1867
+ # ...
1868
+
1869
+ # Pass custom option `skip_uniq_checking` for Validator pipeline:
1870
+ send_item item, validator: { skip_uniq_checking: true }
1871
+ end
1872
+ end
1873
+
1874
+ ```
1875
+
1876
+ pipelines/validator.rb
1877
+ ```ruby
1878
+ class Validator < Kimurai::Pipeline
1879
+ def process_item(item, options: {})
1880
+
1881
+ # Do not check item sku for uniqueness if options[:skip_uniq_checking] is true
1882
+ if options[:skip_uniq_checking] != true
1883
+ raise DropItemError, "Item sku is not unique" unless unique?(:sku, item[:sku])
1884
+ end
1885
+ end
1886
+ end
1887
+ ```
1888
+ </details>
1889
+
1890
+
1891
+ ### Runner
1892
+
1893
+ You can run project spiders one by one or in parallel using the `$ kimurai runner` command:
1894
+
1895
+ ```
1896
+ $ bundle exec kimurai list
1897
+ custom_spider
1898
+ example_spider
1899
+ github_spider
1900
+
1901
+ $ bundle exec kimurai runner -j 3
1902
+ >>> Runner: started: {:id=>1533727423, :status=>:processing, :start_time=>2018-08-08 15:23:43 +0400, :stop_time=>nil, :environment=>"development", :concurrent_jobs=>3, :spiders=>["custom_spider", "github_spider", "example_spider"]}
1903
+ > Runner: started spider: custom_spider, index: 0
1904
+ > Runner: started spider: github_spider, index: 1
1905
+ > Runner: started spider: example_spider, index: 2
1906
+ < Runner: stopped spider: custom_spider, index: 0
1907
+ < Runner: stopped spider: example_spider, index: 2
1908
+ < Runner: stopped spider: github_spider, index: 1
1909
+ <<< Runner: stopped: {:id=>1533727423, :status=>:completed, :start_time=>2018-08-08 15:23:43 +0400, :stop_time=>2018-08-08 15:25:11 +0400, :environment=>"development", :concurrent_jobs=>3, :spiders=>["custom_spider", "github_spider", "example_spider"]}
1910
+ ```
1911
+
1912
+ Each spider runs in a separate process. Spider logs are available in the `log/` folder. Pass the `-j` option to specify how many spiders should be processed at the same time (default is 1).
1913
+
1914
+ #### Runner callbacks
1915
+
1916
+ You can perform custom actions before the runner starts and after it stops using `config.runner_at_start_callback` and `config.runner_at_stop_callback`. Check [config/application.rb](lib/kimurai/template/config/application.rb) to see an example.
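+
+ A minimal sketch, assuming the callbacks accept a lambda and receive the runner's info hash (this is an assumption; check the template linked above for the exact signature):
+
+ ```ruby
+ Kimurai.configure do |config|
+   # Assumption: the callback is called with the runner's info hash
+   config.runner_at_start_callback = lambda do |info|
+     puts "Runner started: #{info}"
+   end
+
+   config.runner_at_stop_callback = lambda do |info|
+     puts "Runner stopped: #{info}"
+   end
+ end
+ ```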
1917
+
1918
+
1919
+ ## Chat Support and Feedback
1920
+ Will be updated
1921
+
1922
+ ## License
1923
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).