kimurai 1.0.0
- checksums.yaml +7 -0
- data/.gitignore +11 -0
- data/.travis.yml +5 -0
- data/CODE_OF_CONDUCT.md +74 -0
- data/Gemfile +6 -0
- data/LICENSE.txt +21 -0
- data/README.md +1923 -0
- data/Rakefile +10 -0
- data/bin/console +14 -0
- data/bin/setup +8 -0
- data/exe/kimurai +6 -0
- data/kimurai.gemspec +48 -0
- data/lib/kimurai.rb +53 -0
- data/lib/kimurai/automation/deploy.yml +54 -0
- data/lib/kimurai/automation/setup.yml +44 -0
- data/lib/kimurai/automation/setup/chromium_chromedriver.yml +26 -0
- data/lib/kimurai/automation/setup/firefox_geckodriver.yml +20 -0
- data/lib/kimurai/automation/setup/phantomjs.yml +33 -0
- data/lib/kimurai/automation/setup/ruby_environment.yml +124 -0
- data/lib/kimurai/base.rb +249 -0
- data/lib/kimurai/base/simple_saver.rb +98 -0
- data/lib/kimurai/base/uniq_checker.rb +22 -0
- data/lib/kimurai/base_helper.rb +22 -0
- data/lib/kimurai/browser_builder.rb +32 -0
- data/lib/kimurai/browser_builder/mechanize_builder.rb +140 -0
- data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb +156 -0
- data/lib/kimurai/browser_builder/selenium_chrome_builder.rb +178 -0
- data/lib/kimurai/browser_builder/selenium_firefox_builder.rb +185 -0
- data/lib/kimurai/capybara_configuration.rb +10 -0
- data/lib/kimurai/capybara_ext/driver/base.rb +62 -0
- data/lib/kimurai/capybara_ext/mechanize/driver.rb +55 -0
- data/lib/kimurai/capybara_ext/poltergeist/driver.rb +13 -0
- data/lib/kimurai/capybara_ext/selenium/driver.rb +24 -0
- data/lib/kimurai/capybara_ext/session.rb +150 -0
- data/lib/kimurai/capybara_ext/session/config.rb +18 -0
- data/lib/kimurai/cli.rb +157 -0
- data/lib/kimurai/cli/ansible_command_builder.rb +71 -0
- data/lib/kimurai/cli/generator.rb +57 -0
- data/lib/kimurai/core_ext/array.rb +14 -0
- data/lib/kimurai/core_ext/numeric.rb +19 -0
- data/lib/kimurai/core_ext/string.rb +7 -0
- data/lib/kimurai/pipeline.rb +25 -0
- data/lib/kimurai/runner.rb +72 -0
- data/lib/kimurai/template/.gitignore +18 -0
- data/lib/kimurai/template/.ruby-version +1 -0
- data/lib/kimurai/template/Gemfile +20 -0
- data/lib/kimurai/template/README.md +3 -0
- data/lib/kimurai/template/config/application.rb +32 -0
- data/lib/kimurai/template/config/automation.yml +13 -0
- data/lib/kimurai/template/config/boot.rb +22 -0
- data/lib/kimurai/template/config/initializers/.keep +0 -0
- data/lib/kimurai/template/config/schedule.rb +57 -0
- data/lib/kimurai/template/db/.keep +0 -0
- data/lib/kimurai/template/helpers/application_helper.rb +3 -0
- data/lib/kimurai/template/lib/.keep +0 -0
- data/lib/kimurai/template/log/.keep +0 -0
- data/lib/kimurai/template/pipelines/saver.rb +11 -0
- data/lib/kimurai/template/pipelines/validator.rb +24 -0
- data/lib/kimurai/template/spiders/application_spider.rb +104 -0
- data/lib/kimurai/template/tmp/.keep +0 -0
- data/lib/kimurai/version.rb +3 -0
- metadata +349 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA256:
  metadata.gz: d108c41e5da08b22c21cc6c71cc3ac7056ddd1af32054c22a22f0c59658bfcb4
  data.tar.gz: 8a8d32b7b8646eb50bd9f71d8986edc2ac78efc0e2e6a437b3280cff4418c5dd
SHA512:
  metadata.gz: 4c82647cbe276980ef0a246693c7e68c08651351a549f99fbc6618bc9836c4a4ba83b4d09e1e29d06abcfa0d4f70443fb88682f57c544c0218b22940834a48b1
  data.tar.gz: 845f04c77fbb5e53b24d048e60f23e2c0f9fdeb4d2fde7dcaaa04bebfebc4454777ade03cae895e444583aafb6c8e56038d0d722589fde10076091903646fdf7
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/CODE_OF_CONDUCT.md
ADDED
@@ -0,0 +1,74 @@
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, gender identity and expression, level of experience,
nationality, personal appearance, race, religion, or sexual identity and
orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at vicfreefly@gmail.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at [http://contributor-covenant.org/version/1/4][version]

[homepage]: http://contributor-covenant.org
[version]: http://contributor-covenant.org/version/1/4/
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2018 Victor Afanasev

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,1923 @@
<div align="center">
  <a href="https://github.com/vfreefly/kimurai">
    <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png">
  </a>

  <h1>Kimura Framework</h1>
</div>

Kimurai is a modern web scraping framework written in Ruby which **works out of the box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests, and **allows you to scrape and interact with JavaScript-rendered websites.**

Kimurai is based on the well-known [Capybara](https://github.com/teamcapybara/capybara) and [Nokogiri](https://github.com/sparklemotion/nokogiri) gems, so you don't have to learn anything new. Let's see:

```ruby
# github_spider.rb
require 'kimurai'

class GithubSpider < Kimurai::Base
  @name = "github_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
  @config = {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
    browser: {
      before_request: { delay: 4..7 }
    }
  }

  def parse(response, url:, data: {})
    response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
      request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
    end

    if next_page = response.at_xpath("//a[@class='next_page']")
      request_to :parse, url: absolute_url(next_page[:href], base: url)
    end
  end

  def parse_repo_page(response, url:, data: {})
    item = {}

    item[:owner] = response.xpath("//h1//a[@rel='author']").text
    item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
    item[:repo_url] = url
    item[:description] = response.xpath("//span[@itemprop='about']").text.squish
    item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
    item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish
    item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish
    item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish
    item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text

    save_to "results.json", item, format: :pretty_json
  end
end

GithubSpider.crawl!
```

<details/>
<summary>Run: <code>$ ruby github_spider.rb</code></summary>

```
I, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] INFO -- github_spider: Spider: started: github_spider
D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled `browser before_request delay`
D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 7 seconds before request...
D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled custom user-agent
D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 1, responses: 1
D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 107968
D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
I, [2018-08-22 13:08:32 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 2, responses: 2
D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 212542
D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 4 seconds before request...
I, [2018-08-22 13:08:37 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector

...

I, [2018-08-22 13:23:07 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Info: visits: requests: 140, responses: 140
D, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 204198
I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed

I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:08:03 +0400, :stop_time=>2018-08-22 13:23:08 +0400, :running_time=>"15m, 5s", :visits=>{:requests=>140, :responses=>140}, :error=>nil}
```
</details>

<details/>
<summary>results.json</summary>

```json
[
  {
    "owner": "lorien",
    "repo_name": "awesome-web-scraping",
    "repo_url": "https://github.com/lorien/awesome-web-scraping",
    "description": "List of libraries, tools and APIs for web scraping and data processing.",
    "tags": [
      "awesome",
      "awesome-list",
      "web-scraping",
      "data-processing",
      "python",
      "javascript",
      "php",
      "ruby"
    ],
    "watch_count": "159",
    "star_count": "2,423",
    "fork_count": "358",
    "last_commit": "4 days ago",
    "position": 1
  },

  ...

  {
    "owner": "preston",
    "repo_name": "idclight",
    "repo_url": "https://github.com/preston/idclight",
    "description": "A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.",
    "tags": [

    ],
    "watch_count": "6",
    "star_count": "1",
    "fork_count": "0",
    "last_commit": "on Apr 12, 2012",
    "position": 127
  }
]
```
</details><br>

Okay, that was easy. How about JavaScript-rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:

```ruby
# infinite_scroll_spider.rb
require 'kimurai'

class InfiniteScrollSpider < Kimurai::Base
  @name = "infinite_scroll_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://infinite-scroll.com/demo/full-page/"]

  def parse(response, url:, data: {})
    posts_headers_path = "//article/h2"
    count = response.xpath(posts_headers_path).count

    loop do
      browser.execute_script("window.scrollBy(0,10000)") ; sleep 2
      response = browser.current_response

      new_count = response.xpath(posts_headers_path).count
      if count == new_count
        logger.info "> Pagination is done" and break
      else
        count = new_count
        logger.info "> Continue scrolling, current count is #{count}..."
      end
    end

    posts_headers = response.xpath(posts_headers_path).map(&:text)
    logger.info "> All posts from page: #{posts_headers.join('; ')}"
  end
end

InfiniteScrollSpider.crawl!
```

<details/>
<summary>Run: <code>$ ruby infinite_scroll_spider.rb</code></summary>

```
I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): created browser instance
D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
D, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: Browser: driver.current_memory: 95463
I, [2018-08-22 13:33:05 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 5...
I, [2018-08-22 13:33:18 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 9...
I, [2018-08-22 13:33:20 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 11...
I, [2018-08-22 13:33:26 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 13...
I, [2018-08-22 13:33:28 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Continue scrolling, current count is 15...
I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > Pagination is done
I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scroll_spider: Spider: stopped: {:spider_name=>"infinite_scroll_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:32:57 +0400, :stop_time=>2018-08-22 13:33:30 +0400, :running_time=>"33s", :visits=>{:requests=>1, :responses=>1}, :error=>nil}
```
</details><br>

## Features
* Scrape JavaScript-rendered websites out of the box
* Supported engines: [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome), [Headless Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Headless_mode), [PhantomJS](https://github.com/ariya/phantomjs) or simple HTTP requests ([mechanize](https://github.com/sparklemotion/mechanize) gem)
* Write spider code once, and use it with any supported engine later
* All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
* Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agent rotation**
* Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON Lines, or CSV formats) or [unique?](#skip-duplicates-unique-helper) to skip duplicates
* Automatically [retry failed requests](#configuration-options) with a delay
* Automatically restart browsers when reaching a memory limit [**(memory control)**](#spider-config) or a requests limit
* Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
* [Parallel scraping](#parallel-crawling-using-in_parallel) using the simple method `in_parallel`
* **Two modes:** use a single file for a simple spider, or [generate](#project-mode) a Scrapy-like **project**
* Convenient development mode with [console](#interactive-console), colorized logger and debugger ([Pry](https://github.com/pry/pry), [Byebug](https://github.com/deivid-rodriguez/byebug))
* Automated [server environment setup](#setup) (for Ubuntu 18.04) and [deploy](#deploy) using the commands `kimurai setup` and `kimurai deploy` ([Ansible](https://github.com/ansible/ansible) under the hood)
* Command-line [runner](#runner) to run all project spiders one by one or in parallel

## Table of Contents
* [Kimurai](#kimurai)
* [Features](#features)
* [Table of Contents](#table-of-contents)
* [Note about v1.0.0 version](#note-about-v1-0-0-version)
* [Installation](#installation)
* [Getting to Know](#getting-to-know)
* [Interactive console](#interactive-console)
* [Available engines](#available-engines)
* [Minimum required spider structure](#minimum-required-spider-structure)
* [Method arguments response, url and data](#method-arguments-response-url-and-data)
* [browser object](#browser-object)
* [request_to method](#request_to-method)
* [save_to helper](#save_to-helper)
* [Skip duplicates, unique? helper](#skip-duplicates-unique-helper)
* [open_spider and close_spider callbacks](#open_spider-and-close_spider-callbacks)
* [KIMURAI_ENV](#kimurai_env)
* [Parallel crawling using in_parallel](#parallel-crawling-using-in_parallel)
* [Active Support included](#active-support-included)
* [Schedule spiders using Cron](#schedule-spiders-using-cron)
* [Configuration options](#configuration-options)
* [Using Kimurai inside existing Ruby application](#using-kimurai-inside-existing-ruby-application)
* [crawl! method](#crawl-method)
* [parse! method](#parsemethod_name-url-method)
* [Kimurai.list and Kimurai.find_by_name](#kimurailist-and-kimuraifind_by_name)
* [Automated server setup and deployment](#automated-sever-setup-and-deployment)
* [Setup](#setup)
* [Deploy](#deploy)
* [Spider @config](#spider-config)
* [All available @config options](#all-available-config-options)
* [@config settings inheritance](#config-settings-inheritance)
* [Project mode](#project-mode)
* [Generate new spider](#generate-new-spider)
* [Crawl](#crawl)
* [List](#list)
* [Parse](#parse)
* [Pipelines, send_item method](#pipelines-send_item-method)
* [Runner](#runner)
* [Runner callbacks](#runner-callbacks)
* [Chat Support and Feedback](#chat-support-and-feedback)
* [License](#license)

## Note about v1.0.0 version
* The code was massively refactored to [support](#using-kimurai-inside-existing-ruby-application) running spiders multiple times from inside a single process. It's now possible to run Kimurai spiders using background jobs like Sidekiq (see the sketch below).
* `require 'kimurai'` doesn't require any gems except Active Support. Capybara (with a specific driver) is required only when a particular spider [starts](#crawl-method).
* Although Kimurai [extends](lib/kimurai/capybara_ext) Capybara (all the magic happens inside the [extended](lib/kimurai/capybara_ext/session.rb) `Capybara::Session#visit` method), session instances created manually will behave normally.
* Small changes in design (check the readme again to see what was changed).
* Again, a massive refactor. The code now looks much better than it did before.

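For instance, a spider can be kicked off from a worker. A minimal sketch, assuming Sidekiq is already set up in your application and the `GithubSpider` class from the example above is loaded (the worker name here is hypothetical, not part of Kimurai):

```ruby
require 'sidekiq'
require 'kimurai'

# Hypothetical worker that runs the GithubSpider defined earlier in this README.
class GithubSpiderWorker
  include Sidekiq::Worker

  def perform
    # crawl! processes @start_urls and runs the spider to completion
    GithubSpider.crawl!
  end
end

# Enqueue it like any other Sidekiq job:
# GithubSpiderWorker.perform_async
```
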
## Installation
Kimurai requires Ruby version `>= 2.5.0`. Supported platforms: `Linux` and `Mac OS X`.

1) If your system doesn't have an appropriate Ruby version, install it:

<details/>
<summary>Ubuntu 18.04</summary>

```bash
# Install required packages for ruby-build
sudo apt update
sudo apt install git-core curl zlib1g-dev build-essential libssl-dev libreadline-dev libreadline6-dev libyaml-dev libxml2-dev libxslt1-dev libcurl4-openssl-dev libffi-dev

# Install rbenv and ruby-build
cd && git clone https://github.com/rbenv/rbenv.git ~/.rbenv
echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(rbenv init -)"' >> ~/.bashrc
exec $SHELL

git clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build
echo 'export PATH="$HOME/.rbenv/plugins/ruby-build/bin:$PATH"' >> ~/.bashrc
exec $SHELL

# Install latest Ruby
rbenv install 2.5.1
rbenv global 2.5.1

gem install bundler
```
</details>

<details/>
<summary>Mac OS X</summary>

```bash
# Install homebrew if you don't have it https://brew.sh/
# Install rbenv and ruby-build:
brew install rbenv ruby-build

# Add rbenv to bash so that it loads every time you open a terminal
echo 'if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi' >> ~/.bash_profile
source ~/.bash_profile

# Install latest Ruby
rbenv install 2.5.1
rbenv global 2.5.1

gem install bundler
```
</details>

2) Install Kimurai gem: `$ gem install kimurai`

3) Install browsers with webdrivers:

<details/>
<summary>Ubuntu 18.04</summary>

Note: for Ubuntu 16.04-18.04, automatic installation is available via the `setup` command:
```bash
$ kimurai setup localhost --local --ask-sudo
```
It works using [Ansible](https://github.com/ansible/ansible), so you need to install it first: `$ sudo apt install ansible`. You can check the playbooks it uses [here](lib/kimurai/automation).

If you chose automatic installation, you can skip the rest of this step and go to the "Getting to Know" section. If you want to install everything manually:

```bash
# Install basic tools
sudo apt install -q -y unzip wget tar openssl

# Install xvfb (for virtual_display headless mode, in addition to native)
sudo apt install -q -y xvfb

# Install chromium-browser and firefox
sudo apt install -q -y chromium-browser firefox

# Install chromedriver (version 2.39)
# All versions are located here: https://sites.google.com/a/chromium.org/chromedriver/downloads
cd /tmp && wget https://chromedriver.storage.googleapis.com/2.39/chromedriver_linux64.zip
sudo unzip chromedriver_linux64.zip -d /usr/local/bin
rm -f chromedriver_linux64.zip

# Install geckodriver (version 0.21.0)
# All versions are located here: https://github.com/mozilla/geckodriver/releases/
cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.21.0/geckodriver-v0.21.0-linux64.tar.gz
sudo tar -xvzf geckodriver-v0.21.0-linux64.tar.gz -C /usr/local/bin
rm -f geckodriver-v0.21.0-linux64.tar.gz

# Install PhantomJS (2.1.1)
# All versions are located here: http://phantomjs.org/download.html
sudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
cd /tmp && wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
sudo mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib
sudo ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin
rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2
```

</details>

<details/>
<summary>Mac OS X</summary>

```bash
# Install chrome and firefox
brew cask install google-chrome firefox

# Install chromedriver (latest)
brew cask install chromedriver

# Install geckodriver (latest)
brew install geckodriver

# Install PhantomJS (latest)
brew install phantomjs
```
</details><br>

Also, if you want to save scraped items to a database (using [ActiveRecord](https://github.com/rails/rails/tree/master/activerecord), [Sequel](https://github.com/jeremyevans/sequel) or [MongoDB Ruby Driver](https://github.com/mongodb/mongo-ruby-driver)/[Mongoid](https://github.com/mongodb/mongoid)), you need to install the database clients/servers:

<details/>
<summary>Ubuntu 18.04</summary>

SQlite: `$ sudo apt -q -y install libsqlite3-dev sqlite3`.

If you want to connect to a remote database, you don't need a database server on the local machine (only the client):
```bash
# Install MySQL client
sudo apt -q -y install mysql-client libmysqlclient-dev

# Install Postgres client
sudo apt install -q -y postgresql-client libpq-dev

# Install MongoDB client
sudo apt install -q -y mongodb-clients
```

But if you want to save items to a local database, the database server is required as well:
```bash
# Install MySQL client and server
sudo apt -q -y install mysql-server mysql-client libmysqlclient-dev

# Install Postgres client and server
sudo apt install -q -y postgresql postgresql-contrib libpq-dev

# Install MongoDB client and server
# version 4.0 (check here https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/)
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
# for 16.04:
# echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
# for 18.04:
echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
sudo apt update
sudo apt install -q -y mongodb-org
sudo service mongod start
```
</details>

<details/>
<summary>Mac OS X</summary>

SQlite: `$ brew install sqlite3`

```bash
# Install MySQL client and server
brew install mysql
# Start server if you need it: brew services start mysql

# Install Postgres client and server
brew install postgresql
# Start server if you need it: brew services start postgresql

# Install MongoDB client and server
brew install mongodb
# Start server if you need it: brew services start mongodb
```
</details>

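Kimurai itself doesn't mandate a particular persistence layer; you simply call your ORM of choice from a spider method or a pipeline. A rough sketch using Sequel with SQLite (the database file, table and columns here are hypothetical, not part of Kimurai):

```ruby
require 'sequel'

# Open (or create) a local SQLite database and a table for scraped items
DB = Sequel.sqlite("scraped.db")
DB.create_table?(:products) do
  primary_key :id
  String :title
  Float  :price
end

# Inside a spider method, after building an item hash, you could persist it:
# DB[:products].insert(title: item[:title], price: item[:price])
```
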
## Getting to Know
### Interactive console
Before you get to know all of Kimurai's features, there is the `$ kimurai console` command: an interactive console where you can try out and debug your scraping code very quickly, without having to run a spider (yes, it's like [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)).

```bash
$ kimurai console --engine selenium_chrome --url https://github.com/vfreefly/kimurai
```

<details/>
<summary>Show output</summary>

```
$ kimurai console --engine selenium_chrome --url https://github.com/vfreefly/kimurai

D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] INFO -- : Browser: started get request to: https://github.com/vfreefly/kimurai
I, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] INFO -- : Browser: finished get request to: https://github.com/vfreefly/kimurai
D, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] DEBUG -- : Browser: driver.current_memory: 201701

From: /home/victor/code/kimurai/lib/kimurai/base.rb @ line 189 Kimurai::Base#console:

    188: def console(response = nil, url: nil, data: {})
 => 189:   binding.pry
    190: end

[1] pry(#<Kimurai::Base>)> response.xpath("//title").text
=> "GitHub - vfreefly/kimurai: Kimurai is a modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites"

[2] pry(#<Kimurai::Base>)> ls
Kimurai::Base#methods: browser  console  logger  request_to  save_to  unique?
instance variables: @browser  @config  @engine  @logger  @pipelines
locals: _  __  _dir_  _ex_  _file_  _in_  _out_  _pry_  data  response  url

[3] pry(#<Kimurai::Base>)> ls response
Nokogiri::XML::PP::Node#methods: inspect  pretty_print
Nokogiri::XML::Searchable#methods: %  /  at  at_css  at_xpath  css  search  xpath
Enumerable#methods:
  all?  collect  drop  each_with_index  find_all  grep_v  lazy  member?  none?  reject  slice_when  take_while  without
  any?  collect_concat  drop_while  each_with_object  find_index  group_by  many?  min  one?  reverse_each  sort  to_a  zip
  as_json  count  each_cons  entries  first  include?  map  min_by  partition  select  sort_by  to_h
  chunk  cycle  each_entry  exclude?  flat_map  index_by  max  minmax  pluck  slice_after  sum  to_set
  chunk_while  detect  each_slice  find  grep  inject  max_by  minmax_by  reduce  slice_before  take  uniq
Nokogiri::XML::Node#methods:
  <=>  append_class  classes  document?  has_attribute?  matches?  node_name=  processing_instruction?  to_str
  ==  attr  comment?  each  html?  name=  node_type  read_only?  to_xhtml
  >  attribute  content  elem?  inner_html  namespace=  parent=  remove  traverse
  []  attribute_nodes  content=  element?  inner_html=  namespace_scopes  parse  remove_attribute  unlink
  []=  attribute_with_ns  create_external_subset  element_children  inner_text  namespaced_key?  path  remove_class  values
  accept  before  create_internal_subset  elements  internal_subset  native_content=  pointer_id  replace  write_html_to
  add_class  blank?  css_path  encode_special_chars  key?  next  prepend_child  set_attribute  write_to
  add_next_sibling  cdata?  decorate!  external_subset  keys  next=  previous  text  write_xhtml_to
  add_previous_sibling  child  delete  first_element_child  lang  next_element  previous=  text?  write_xml_to
  after  children  description  fragment?  lang=  next_sibling  previous_element  to_html  xml?
  ancestors  children=  do_xinclude  get_attribute  last_element_child  node_name  previous_sibling  to_s
Nokogiri::XML::Document#methods:
  <<  canonicalize  collect_namespaces  create_comment  create_entity  decorate  document  encoding  errors  name  remove_namespaces!  root=  to_java  url  version
  add_child  clone  create_cdata  create_element  create_text_node  decorators  dup  encoding=  errors=  namespaces  root  slop!  to_xml  validate
Nokogiri::HTML::Document#methods: fragment  meta_encoding  meta_encoding=  serialize  title  title=  type
instance variables: @decorators  @errors  @node_cache

[4] pry(#<Kimurai::Base>)> exit
I, [2018-08-22 13:43:47 +0400#26079] [M: 47461994677760] INFO -- : Browser: driver selenium_chrome has been destroyed
$
```
</details><br>

CLI options:
* `--engine` (optional) [engine](#available-engines) to use. Default is `mechanize`
* `--url` (optional) url to process. If the url is omitted, the `response` and `url` objects inside the console will be `nil` (use the [browser](#browser-object) object to navigate to any webpage, as sketched below).

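For example, when `--url` is omitted you can drive the session yourself from inside the console. A minimal sketch of such a session (the target page and selector are arbitrary):

```ruby
# Inside `$ kimurai console` (no --url given), `response` and `url` are nil,
# but `browser` is ready to use:
browser.visit("https://example.com/")
response = browser.current_response

# `response` is now a Nokogiri::HTML::Document for the visited page:
response.xpath("//title").text # => "Example Domain"
```
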
### Available engines
Kimurai supports the following engines, and in most cases you can switch between them without rewriting any code (see the sketch after this list):

* `:mechanize` - [pure Ruby fake http browser](https://github.com/sparklemotion/mechanize). Mechanize can't render JavaScript and doesn't know what the DOM is; it can only parse the original HTML code of a page. Because of that, mechanize is much faster, takes much less memory and is in general much more stable than any real browser. Use mechanize whenever you can, i.e. when the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize tries to mimic a real browser, it supports almost all of Capybara's [methods to interact with a web page](http://cheatrags.com/capybara) (filling in forms, clicking buttons, checkboxes, etc.).
* `:poltergeist_phantomjs` - [PhantomJS headless browser](https://github.com/ariya/phantomjs), can render JavaScript. In general, PhantomJS is still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leaks, but Kimurai has a [memory control feature](#crawler-config) so you shouldn't consider that a problem. Also, some websites can recognize PhantomJS and block access. Like mechanize (and unlike the selenium engines), `:poltergeist_phantomjs` can freely rotate proxies and change headers _on the fly_ (see [config section](#all-available-config-options)).
* `:selenium_chrome` Chrome in headless mode driven by selenium. A modern headless browser solution with proper JavaScript rendering.
* `:selenium_firefox` Firefox in headless mode driven by selenium. Usually takes more memory than the other drivers, but sometimes can be useful.

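A minimal sketch of what "switch engines without rewriting code" means in practice (the site and selector are arbitrary): the parse code stays the same, only `@engine` changes.

```ruby
require 'kimurai'

class QuotesSpider < Kimurai::Base
  @name = "quotes_spider"
  # Swap :mechanize for :selenium_chrome, :selenium_firefox or
  # :poltergeist_phantomjs — the rest of the spider doesn't change:
  @engine = :mechanize
  @start_urls = ["http://quotes.toscrape.com/"]

  def parse(response, url:, data: {})
    response.xpath("//span[@class='text']").each do |quote|
      logger.info quote.text
    end
  end
end

QuotesSpider.crawl!
```
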
**Tip:** add the `HEADLESS=false` ENV variable before the command (`$ HEADLESS=false ruby spider.rb`) to run the browser in normal (not headless) mode and see its window (selenium-like engines only). It works for the [console](#interactive-console) command as well.


### Minimum required spider structure
> You can create a spider file manually, or use the generator instead: `$ kimurai generate spider simple_spider`

```ruby
require 'kimurai'

class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
  end
end

SimpleSpider.crawl!
```

Where:
* `@name` name of the spider. You can omit the name when using a single-file spider
* `@engine` engine for the spider
* `@start_urls` array of start urls, processed one by one inside the `parse` method
* The `parse` method is the starting method and must always be present in a spider class


### Method arguments `response`, `url` and `data`

```ruby
def parse(response, url:, data: {})
end
```

* `response` ([Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) object) Contains the parsed HTML code of the processed webpage
* `url` (String) url of the processed webpage
* `data` (Hash) used to pass data between requests

<details/>
<summary><strong>Example of how to use <code>data</code></strong></summary>

Imagine there is a product page which doesn't contain the product category. The category name is present only on the category page with pagination. This is a case where we can use `data` to pass the category name from `parse` to the `parse_product` method:

```ruby
class ProductsSpider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example-shop.com/example-product-category"]

  def parse(response, url:, data: {})
    category_name = response.xpath("//path/to/category/name").text
    response.xpath("//path/to/products/urls").each do |product_url|
      # Merge category_name into the current data hash and pass it on to the parse_product method
      request_to(:parse_product, url: product_url[:href], data: data.merge(category_name: category_name))
    end

    # ...
  end

  def parse_product(response, url:, data: {})
    item = {}
    # Assign the item's category_name from data[:category_name]
    item[:category_name] = data[:category_name]

    # ...
  end
end
```
</details><br>

**You can query `response` using [XPath or CSS selectors](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Searchable)** (a short example follows the links below). Check the Nokogiri tutorials to understand how to work with `response`:
* [Parsing HTML with Nokogiri](http://ruby.bastardsbook.com/chapters/html-parsing/) - ruby.bastardsbook.com
* [HOWTO parse HTML with Ruby & Nokogiri](https://readysteadycode.com/howto-parse-html-with-ruby-and-nokogiri) - readysteadycode.com
* [Class: Nokogiri::HTML::Document](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document) (documentation) - rubydoc.info

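A few illustrative queries against `response` (the selectors here are hypothetical, not tied to any particular site):

```ruby
# `response` is a Nokogiri::HTML::Document, so the usual Nokogiri API applies:
page_title    = response.xpath("//title").text
all_links     = response.css("a").map { |a| a[:href] }
first_heading = response.at_css("h1")&.text
prices        = response.xpath("//span[@class='price']").map { |node| node.text.strip }
```
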
### `browser` object

From any spider instance method, the `browser` object is available. It is a [Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) object and is used to process requests and get the page response (the `current_response` method). Usually you don't need to touch it directly, because there is `response` (see above) which contains the page response after it was loaded.

But if you need to interact with a page (like filling in form fields, clicking elements, checkboxes, etc.), `browser` is ready for you:

```ruby
class GoogleSpider < Kimurai::Base
  @name = "google_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://www.google.com/"]

  def parse(response, url:, data: {})
    browser.fill_in "q", with: "Kimurai web scraping framework"
    browser.click_button "Google Search"

    # Update response to the current response after interacting with the browser
    response = browser.current_response

    # Collect results
    results = response.xpath("//div[@class='g']//h3/a").map do |a|
      { title: a.text, url: a[:href] }
    end

    # ...
  end
end
```

Check out the **Capybara cheat sheets** where you can see all of the available methods **to interact with the browser**:
* [UI Testing with RSpec and Capybara [cheat sheet]](http://cheatrags.com/capybara) - cheatrags.com
* [Capybara Cheatsheet PDF](https://thoughtbot.com/upcase/test-driven-rails-resources/capybara.pdf) - thoughtbot.com
* [Class: Capybara::Session](https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Session) (documentation) - rubydoc.info

### `request_to` method

To make a request to a particular method, there is `request_to`. It requires at least two arguments: `:method_name` and `url:`. An optional argument is `data:` (see above what it's for). Example:

```ruby
class Spider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    # Process a request to the `parse_product` method with the `https://example.com/some_product` url:
    request_to :parse_product, url: "https://example.com/some_product"
  end

  def parse_product(response, url:, data: {})
    puts "From page https://example.com/some_product !"
  end
end
```

Under the hood, `request_to` simply calls [#visit](https://www.rubydoc.info/github/jnicklas/capybara/Capybara%2FSession:visit) (`browser.visit(url)`) and then the required method with arguments:

<details/>
<summary>request_to</summary>

```ruby
def request_to(handler, url:, data: {})
  request_data = { url: url, data: data }

  browser.visit(url)
  public_send(handler, browser.current_response, request_data)
end
```
</details><br>

`request_to` just makes things simpler; without it we could do something like:

<details/>
<summary>Check the code</summary>

```ruby
class Spider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    url_to_process = "https://example.com/some_product"

    browser.visit(url_to_process)
    parse_product(browser.current_response, url: url_to_process)
  end

  def parse_product(response, url:, data: {})
    puts "From page https://example.com/some_product !"
  end
end
```
</details>

### `save_to` helper

Sometimes all you need is to simply save scraped data to a file format, like JSON or CSV. You can use `save_to` for that:

```ruby
class ProductsSpider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example-shop.com/"]

  # ...

  def parse_product(response, url:, data: {})
    item = {}

    item[:title] = response.xpath("//title/path").text
    item[:description] = response.xpath("//desc/path").text.squish
    item[:price] = response.xpath("//price/path").text[/\d+/]&.to_f

    # Add each new item to the `scraped_products.json` file:
    save_to "scraped_products.json", item, format: :json
  end
end
```

Supported formats:
* `:json` JSON
* `:pretty_json` "pretty" JSON (`JSON.pretty_generate`)
* `:jsonlines` [JSON Lines](http://jsonlines.org/)
* `:csv` CSV

Note: `save_to` requires the data (the item to save) to be a `Hash`.

By default, `save_to` adds a position key to the item hash. You can disable it with `position: false`: `save_to "scraped_products.json", item, format: :json, position: false`.

**How the helper works:**

Until the spider stops, each new item is appended to the file. At the next run, the helper clears the content of the file first, and then starts appending items to it again.

### Skip duplicates, `unique?` helper

It's pretty common for websites to have duplicated pages, for example when an e-commerce shop lists the same products in different categories. To skip duplicates, there is the `unique?` helper:

```ruby
class ProductsSpider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example-shop.com/"]

  def parse(response, url:, data: {})
    response.xpath("//categories/path").each do |category|
      request_to :parse_category, url: category[:href]
    end
  end

  # Check products for uniqueness using the product url inside parse_category:
  def parse_category(response, url:, data: {})
    response.xpath("//products/path").each do |product|
      # Skip the url if it's not unique:
      next unless unique?(:product_url, product[:href])
      # Otherwise process it:
      request_to :parse_product, url: product[:href]
    end
  end

  # Or/and check products for uniqueness using the product sku inside parse_product:
  def parse_product(response, url:, data: {})
    item = {}
    item[:sku] = response.xpath("//product/sku/path").text.strip.upcase
    # Don't save the product and return from the method if there is already a saved item with the same sku:
    return unless unique?(:sku, item[:sku])

    # ...
    save_to "results.json", item, format: :json
  end
end
```

The `unique?` helper works quite simply:

```ruby
# Check the string "http://example.com" in scope `url` for the first time:
unique?(:url, "http://example.com")
# => true

# Try again:
unique?(:url, "http://example.com")
# => false
```

To check something for uniqueness, you need to provide a scope:

```ruby
# `product_url` scope
unique?(:product_url, "http://example.com/product_1")

# `id` scope
unique?(:id, 324234232)

# `custom` scope
unique?(:custom, "Lorem Ipsum")
```

### `open_spider` and `close_spider` callbacks

You can define `.open_spider` and `.close_spider` callbacks (class methods) to perform some action before the spider starts or after the spider has been stopped:

```ruby
require 'kimurai'

class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def self.open_spider
    logger.info "> Starting..."
  end

  def self.close_spider
    logger.info "> Stopped!"
  end

  def parse(response, url:, data: {})
    logger.info "> Scraping..."
  end
end

ExampleSpider.crawl!
```

<details/>
<summary>Output</summary>

```
I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] INFO -- example_spider: Spider: started: example_spider
I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] INFO -- example_spider: > Starting...
D, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): created browser instance
D, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] INFO -- example_spider: Browser: started get request to: https://example.com/
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Browser: finished get request to: https://example.com/
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Info: visits: requests: 1, responses: 1
D, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: Browser: driver.current_memory: 82415
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: > Scraping...
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Browser: driver selenium_chrome has been destroyed
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: > Stopped!
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] INFO -- example_spider: Spider: stopped: {:spider_name=>"example_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 14:26:32 +0400, :stop_time=>2018-08-22 14:26:34 +0400, :running_time=>"1s", :visits=>{:requests=>1, :responses=>1}, :error=>nil}
```
</details><br>

Inside the `open_spider` and `close_spider` class methods, the `run_info` method is available; it contains useful information about the spider's state:

```ruby
    11: def self.open_spider
 => 12:   binding.pry
    13: end

[1] pry(example_spider)> run_info
=> {
  :spider_name=>"example_spider",
  :status=>:running,
  :environment=>"development",
  :start_time=>2018-08-05 23:32:00 +0400,
  :stop_time=>nil,
  :running_time=>nil,
  :visits=>{:requests=>0, :responses=>0},
  :error=>nil
}
```

Inside `close_spider`, `run_info` will be updated:

```ruby
    15: def self.close_spider
 => 16:   binding.pry
    17: end

[1] pry(example_spider)> run_info
=> {
  :spider_name=>"example_spider",
  :status=>:completed,
  :environment=>"development",
  :start_time=>2018-08-05 23:32:00 +0400,
  :stop_time=>2018-08-05 23:32:06 +0400,
  :running_time=>6.214,
  :visits=>{:requests=>1, :responses=>1},
  :error=>nil
}
```

`run_info[:status]` helps to determine whether the spider finished successfully or failed (possible values: `:completed`, `:failed`):

```ruby
class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def self.close_spider
    puts ">>> run info: #{run_info}"
  end

  def parse(response, url:, data: {})
    logger.info "> Scraping..."
    # Let's try to call strip on nil:
    nil.strip
  end
end
```

<details/>
<summary>Output</summary>

```
I, [2018-08-22 14:34:24 +0400#8459] [M: 47020523644400] INFO -- example_spider: Spider: started: example_spider
D, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): created browser instance
D, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] INFO -- example_spider: Browser: started get request to: https://example.com/
I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: Browser: finished get request to: https://example.com/
I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: Info: visits: requests: 1, responses: 1
D, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: Browser: driver.current_memory: 83351
I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: > Scraping...
I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] INFO -- example_spider: Browser: driver selenium_chrome has been destroyed

>>> run info: {:spider_name=>"example_spider", :status=>:failed, :environment=>"development", :start_time=>2018-08-22 14:34:24 +0400, :stop_time=>2018-08-22 14:34:26 +0400, :running_time=>2.01, :visits=>{:requests=>1, :responses=>1}, :error=>"#<NoMethodError: undefined method `strip' for nil:NilClass>"}

F, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] FATAL -- example_spider: Spider: stopped: {:spider_name=>"example_spider", :status=>:failed, :environment=>"development", :start_time=>2018-08-22 14:34:24 +0400, :stop_time=>2018-08-22 14:34:26 +0400, :running_time=>"2s", :visits=>{:requests=>1, :responses=>1}, :error=>"#<NoMethodError: undefined method `strip' for nil:NilClass>"}
Traceback (most recent call last):
        6: from example_spider.rb:19:in `<main>'
        5: from /home/victor/code/kimurai/lib/kimurai/base.rb:127:in `crawl!'
        4: from /home/victor/code/kimurai/lib/kimurai/base.rb:127:in `each'
        3: from /home/victor/code/kimurai/lib/kimurai/base.rb:128:in `block in crawl!'
        2: from /home/victor/code/kimurai/lib/kimurai/base.rb:185:in `request_to'
        1: from /home/victor/code/kimurai/lib/kimurai/base.rb:185:in `public_send'
example_spider.rb:15:in `parse': undefined method `strip' for nil:NilClass (NoMethodError)
```
</details><br>

**Usage example:** if the spider finished successfully, send a JSON file with the scraped items to a remote FTP location; otherwise (if the spider failed), skip the incomplete results and send an email/notification to Slack about it:

**Usage example:** if the spider finished successfully, send a JSON file with the scraped items to a remote FTP location; otherwise (if the spider failed), skip the incomplete results and send an email/Slack notification about it:

<details/>
<summary>Example</summary>

You can also use the additional methods `completed?` and `failed?`:

```ruby
class Spider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def self.close_spider
    if completed?
      send_file_to_ftp("results.json")
    else
      send_error_notification(run_info[:error])
    end
  end

  def self.send_file_to_ftp(file_path)
    # ...
  end

  def self.send_error_notification(error)
    # ...
  end

  # ...

  def parse_item(response, url:, data: {})
    item = {}
    # ...

    save_to "results.json", item, format: :json
  end
end
```
</details>


### `KIMURAI_ENV`
Kimurai has environments; the default is `development`. To provide a custom environment, pass the `KIMURAI_ENV` ENV variable before the command: `$ KIMURAI_ENV=production ruby spider.rb`. To access the current environment, there is the `Kimurai.env` method.

Usage example:
```ruby
class Spider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def self.close_spider
    if failed? && Kimurai.env == "production"
      send_error_notification(run_info[:error])
    else
      # Do nothing
    end
  end

  # ...
end
```

### Parallel crawling using `in_parallel`
Kimurai can process web pages concurrently in a single line: `in_parallel(:parse_product, urls, threads: 3)`, where `:parse_product` is the method to process, `urls` is an array of urls to crawl and `threads:` is the number of threads:

```ruby
# amazon_spider.rb
require 'kimurai'

class AmazonSpider < Kimurai::Base
  @name = "amazon_spider"
  @engine = :mechanize
  @start_urls = ["https://www.amazon.com/"]

  def parse(response, url:, data: {})
    browser.fill_in "field-keywords", with: "Web Scraping Books"
    browser.click_on "Go"

    # Walk through pagination and collect products urls:
    urls = []
    loop do
      response = browser.current_response
      response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
        urls << a[:href].sub(/ref=.+/, "")
      end

      browser.find(:xpath, "//a[@id='pagnNextLink']", wait: 1).click rescue break
    end

    # Process all collected urls concurrently within 3 threads:
    in_parallel(:parse_book_page, urls, threads: 3)
  end

  def parse_book_page(response, url:, data: {})
    item = {}

    item[:title] = response.xpath("//h1/span[@id]").text.squish
    item[:url] = url
    item[:price] = response.xpath("(//span[contains(@class, 'a-color-price')])[1]").text.squish.presence
    item[:publisher] = response.xpath("//h2[text()='Product details']/following::b[text()='Publisher:']/following-sibling::text()[1]").text.squish.presence

    save_to "books.json", item, format: :pretty_json
  end
end

AmazonSpider.crawl!
```

<details/>
<summary>Run: <code>$ ruby amazon_spider.rb</code></summary>

```
I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: started: amazon_spider
D, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Info: visits: requests: 1, responses: 1

I, [2018-08-22 14:48:43 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: starting processing 52 urls within 3 threads
D, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
D, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
D, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Info: visits: requests: 4, responses: 2
I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 5, responses: 3
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 6, responses: 4
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Excel-Effective-Scrapes-ebook/dp/B01CMMJGZ8/

...

I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Info: visits: requests: 51, responses: 49
I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Ice-Life-Bill-Rayburn-ebook/dp/B00C0NF1L8/
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 51, responses: 50
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Php-architects-Guide-Scraping-Author/dp/B010DTKYY4/
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 52, responses: 51
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Info: visits: requests: 53, responses: 52
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640] INFO -- amazon_spider: Browser: driver mechanize has been destroyed
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Info: visits: requests: 53, responses: 53
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320] INFO -- amazon_spider: Browser: driver mechanize has been destroyed

I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: in_parallel: stopped processing 52 urls within 3 threads, total time: 29s
I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Browser: driver mechanize has been destroyed

I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840] INFO -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 14:48:37 +0400, :stop_time=>2018-08-22 14:49:12 +0400, :running_time=>"35s", :visits=>{:requests=>53, :responses=>53}, :error=>nil}
```
</details>

<details/>
<summary>books.json</summary>

```json
[
  {
    "title": "Web Scraping with Python: Collecting More Data from the Modern Web2nd Edition",
    "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/",
    "price": "$26.94",
    "publisher": "O'Reilly Media; 2 edition (April 14, 2018)",
    "position": 1
  },
  {
    "title": "Python Web Scraping Cookbook: Over 90 proven recipes to get you scraping with Python, micro services, Docker and AWS",
    "url": "https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/",
    "price": "$39.99",
    "publisher": "Packt Publishing - ebooks Account (February 9, 2018)",
    "position": 2
  },
  {
    "title": "Web Scraping with Python: Collecting Data from the Modern Web1st Edition",
    "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/",
    "price": "$15.75",
    "publisher": "O'Reilly Media; 1 edition (July 24, 2015)",
    "position": 3
  },

  ...

  {
    "title": "Instant Web Scraping with Java by Ryan Mitchell (2013-08-26)",
    "url": "https://www.amazon.com/Instant-Scraping-Java-Mitchell-2013-08-26/dp/B01FEM76X2/",
    "price": "$35.82",
    "publisher": "Packt Publishing (2013-08-26) (1896)",
    "position": 52
  }
]
```
</details><br>

> Note that the [save_to](#save_to-helper) and [unique?](#skip-duplicates-unique-helper) helpers are thread-safe (protected by a [Mutex](https://ruby-doc.org/core-2.5.1/Mutex.html)) and can be freely used inside threads.
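
For instance, a method processed through `in_parallel` can call both helpers directly, without any extra locking (a minimal sketch; the `:url` scope and the file name are illustrative):

```ruby
def parse_book_page(response, url:, data: {})
  item = { url: url, title: response.xpath("//h1/span[@id]").text.squish }

  # unique? is protected by a Mutex, so calling it from several threads is safe:
  return unless unique?(:url, item[:url])

  # save_to is also synchronized, so parallel threads can append to the same file:
  save_to "books.json", item, format: :pretty_json
end
```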

`in_parallel` can take additional options (combined in the sketch after this list):
* `data:` pass a custom data hash along with the urls: `in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })`
* `delay:` set a delay between requests: `in_parallel(:method, urls, threads: 3, delay: 2)`. Delay can be an `Integer`, `Float` or `Range` (`2..5`). In case of a Range, the delay number will be chosen randomly for each request: `rand (2..5) # => 3`
* `engine:` set a custom engine instead of the default one: `in_parallel(:method, urls, threads: 3, engine: :poltergeist_phantomjs)`
* `config:` pass custom options to config (see [config section](#crawler-config))
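
A sketch combining these options in a single call (the urls, method name and values are illustrative):

```ruby
in_parallel(:parse_product, urls,
  threads: 3,                                        # number of threads
  data: { category: "Scraping" },                    # passed to the method as `data:`
  delay: 2..5,                                       # random delay before each request
  engine: :poltergeist_phantomjs,                    # override the default engine
  config: { user_agent: "Mozilla/5.0 Firefox/61.0" } # per-call config options
)
```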

### Active Support included

You can use all the power of familiar [Rails core-ext methods](https://guides.rubyonrails.org/active_support_core_extensions.html#loading-all-core-extensions) for scraping inside Kimurai. Especially take a look at [squish](https://apidock.com/rails/String/squish), [truncate_words](https://apidock.com/rails/String/truncate_words), [titleize](https://apidock.com/rails/String/titleize), [remove](https://apidock.com/rails/String/remove), [present?](https://guides.rubyonrails.org/active_support_core_extensions.html#blank-questionmark-and-present-questionmark) and [presence](https://guides.rubyonrails.org/active_support_core_extensions.html#presence).
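
For example, inside any parse method (the selectors and file name here are illustrative):

```ruby
def parse_product(response, url:, data: {})
  item = {}
  item[:title]   = response.xpath("//h1").text.squish                             # collapse whitespace
  item[:price]   = response.xpath("//span[@class='price']").text.squish.presence  # nil if blank
  item[:summary] = response.xpath("//div[@id='description']").text.squish.truncate_words(30)

  save_to "products.json", item, format: :json
end
```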

### Schedule spiders using Cron

1) Inside the spider directory, generate a [Whenever](https://github.com/javan/whenever) config: `$ kimurai generate schedule`.

<details/>
<summary><code>schedule.rb</code></summary>

```ruby
### Settings ###
require 'tzinfo'

# Export current PATH to the cron
env :PATH, ENV["PATH"]

# Use 24 hour format when using `at:` option
set :chronic_options, hours24: true

# Use the local_to_utc helper to set the execution time using your local timezone instead
# of the server's timezone (which probably is, and should be, UTC; to check, run `$ timedatectl`).
# Also, you may want to set the same timezone in Kimurai as well (use `Kimurai.configuration.time_zone =` for that),
# to have spider logs in a specific time zone format.
# Example usage of the helper:
# every 1.day, at: local_to_utc("7:00", zone: "Europe/Moscow") do
#   crawl "google_spider.com", output: "log/google_spider.com.log"
# end
def local_to_utc(time_string, zone:)
  TZInfo::Timezone.get(zone).local_to_utc(Time.parse(time_string))
end

# Note: by default, Whenever exports cron commands with :environment == "production".
# Note: Whenever can only append log data to a log file (>>). If you want
# to overwrite (>) the log file before each run, pass a lambda:
# crawl "google_spider.com", output: -> { "> log/google_spider.com.log 2>&1" }

# Project job types
job_type :crawl,  "cd :path && KIMURAI_ENV=:environment bundle exec kimurai crawl :task :output"
job_type :runner, "cd :path && KIMURAI_ENV=:environment bundle exec kimurai runner --jobs :task :output"

# Single file job type
job_type :single, "cd :path && KIMURAI_ENV=:environment ruby :task :output"
# Single with bundle exec
job_type :single_bundle, "cd :path && KIMURAI_ENV=:environment bundle exec ruby :task :output"

### Schedule ###
# Usage (check examples here https://github.com/javan/whenever#example-schedulerb-file):
# every 1.day do
#   Example to schedule a single spider in the project:
#   crawl "google_spider.com", output: "log/google_spider.com.log"

#   Example to schedule all spiders in the project using runner. Each spider will write
#   its own output to the `log/spider_name.log` file (handled by the runner itself).
#   Runner output will be written to the log/runner.log file.
#   The number argument is the count of concurrent jobs:
#   runner 3, output: "log/runner.log"

#   Example to schedule a single spider (without a project):
#   single "single_spider.rb", output: "single_spider.log"
# end

### How to set a cron schedule ###
# Run: `$ whenever --update-crontab --load-file config/schedule.rb`.
# If you don't have the whenever command, install the gem: `$ gem install whenever`.

### How to cancel a schedule ###
# Run: `$ whenever --clear-crontab --load-file config/schedule.rb`.
```
</details><br>

2) Add the following code at the bottom of `schedule.rb`:

```ruby
every 1.day, at: "7:00" do
  single "example_spider.rb", output: "example_spider.log"
end
```

3) Run: `$ whenever --update-crontab --load-file schedule.rb`. Done!

You can check Whenever examples [here](https://github.com/javan/whenever#example-schedulerb-file). To cancel the schedule, run: `$ whenever --clear-crontab --load-file schedule.rb`.

### Configuration options
You can configure several options using the `configure` block:

```ruby
Kimurai.configure do |config|
  # Default logger has colored mode in development.
  # If you would like to disable it, set `colorize_logger` to false.
  # config.colorize_logger = false

  # Logger level for default logger:
  # config.log_level = :info

  # Custom logger:
  # config.logger = Logger.new(STDOUT)

  # Custom time zone (for logs):
  # config.time_zone = "UTC"
  # config.time_zone = "Europe/Moscow"
end
```

### Using Kimurai inside existing Ruby application

You can integrate Kimurai spiders (which are just Ruby classes) into an existing Ruby application like Rails or Sinatra, and run them using background jobs (for example). Check the following info to understand the running process of spiders:

#### `.crawl!` method

`.crawl!` (class method) performs a _full run_ of a particular spider. This method will return `run_info` if the run was successful, or an exception if something went wrong.

```ruby
class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :mechanize
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    title = response.xpath("//title").text.squish
  end
end

ExampleSpider.crawl!
# => { :spider_name => "example_spider", :status => :completed, :environment => "development", :start_time => 2018-08-22 18:20:16 +0400, :stop_time => 2018-08-22 18:20:17 +0400, :running_time => 1.216, :visits => { :requests => 1, :responses => 1 }, :items => { :sent => 0, :processed => 0 }, :error => nil }
```
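
For instance, in a Rails application you could wrap a full run into a background job (a minimal sketch assuming ActiveJob; the job class name and queue are hypothetical):

```ruby
# app/jobs/example_spider_job.rb (hypothetical file)
class ExampleSpiderJob < ApplicationJob
  queue_as :default

  def perform
    # Performs a full run of the spider; returns run_info on success
    ExampleSpider.crawl!
  end
end

# Enqueue it from anywhere in the application:
# ExampleSpiderJob.perform_later
```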

You can't `.crawl!` a spider in a different thread if it is still running (because spider instances store some shared data in the `@run_info` class variable while crawling):

```ruby
2.times do |i|
  Thread.new { p i, ExampleSpider.crawl! }
end # =>

# 1
# false

# 0
# {:spider_name=>"example_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 18:49:22 +0400, :stop_time=>2018-08-22 18:49:23 +0400, :running_time=>0.801, :visits=>{:requests=>1, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :error=>nil}
```

So what if you don't care about stats and just want to process a request to a particular spider method and get the return value from that method? Use `.parse!` instead:

#### `.parse!(:method_name, url:)` method

`.parse!` (class method) creates a new spider instance and performs a request to the given method with the given url. The value from the method will be returned back:

```ruby
class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :mechanize
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    title = response.xpath("//title").text.squish
  end
end

ExampleSpider.parse!(:parse, url: "https://example.com/")
# => "Example Domain"
```

Like `.crawl!`, the `.parse!` method takes care of the browser instance and kills it (`browser.destroy_driver!`) before returning the value. Unlike `.crawl!`, the `.parse!` method can be called from different threads at the same time:

```ruby
urls = ["https://www.google.com/", "https://www.reddit.com/", "https://en.wikipedia.org/"]

urls.each do |url|
  Thread.new { p ExampleSpider.parse!(:parse, url: url) }
end # =>

# "Google"
# "Wikipedia, the free encyclopedia"
# "reddit: the front page of the internetHotHot"
```

#### `Kimurai.list` and `Kimurai.find_by_name()`

```ruby
class GoogleSpider < Kimurai::Base
  @name = "google_spider"
end

class RedditSpider < Kimurai::Base
  @name = "reddit_spider"
end

class WikipediaSpider < Kimurai::Base
  @name = "wikipedia_spider"
end

# To get the list of all available spider classes:
Kimurai.list
# => {"google_spider"=>GoogleSpider, "reddit_spider"=>RedditSpider, "wikipedia_spider"=>WikipediaSpider}

# To find a particular spider class by its name:
Kimurai.find_by_name("reddit_spider")
# => RedditSpider
```


### Automated sever setup and deployment
> **EXPERIMENTAL**

#### Setup
You can automatically set up the [required environment](#installation) for Kimurai on a remote server (currently only Ubuntu Server 18.04 is supported) using the `$ kimurai setup` command. `setup` will install: the latest Ruby with Rbenv, browsers with webdrivers and, in addition, database clients (only clients) for MySQL, Postgres and MongoDB (so you can connect to a remote database from Ruby).

> To perform remote server setup, [Ansible](https://github.com/ansible/ansible) is required **on the desktop** machine (to install: Ubuntu: `$ sudo apt install ansible`, Mac OS X: `$ brew install ansible`)

Example:

```bash
$ kimurai setup deploy@123.123.123.123 --ask-sudo --ssh-key-path path/to/private_key
```

CLI options:
* `--ask-sudo` pass this option to ask for the sudo (user) password for system-wide installation of packages (`apt install`)
* `--ssh-key-path path/to/private_key` authorization on the server using a private ssh key. You can omit it if the required key is already [added to the keychain](https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/#adding-your-ssh-key-to-the-ssh-agent) on your desktop (Ansible uses [SSH agent forwarding](https://developer.github.com/v3/guides/using-ssh-agent-forwarding/))
* `--ask-auth-pass` authorization on the server using a user password, an alternative option to `--ssh-key-path`.
* `-p port_number` custom port for the ssh connection (`-p 2222`)

> You can check the setup playbook [here](lib/kimurai/automation/setup.yml)

#### Deploy

After a successful `setup` you can deploy a spider to the remote server using the `$ kimurai deploy` command. Each deploy performs several tasks: 1) pull the repo from the remote origin to the `~/repo_name` user directory, 2) run `bundle install`, 3) update the crontab with `whenever --update-crontab` (to update the spider schedule from the schedule.rb file).

Before `deploy`, make sure that inside the spider directory you have: 1) a git repository with a remote origin (Bitbucket, GitHub, etc.), 2) a `Gemfile`, 3) schedule.rb inside the `config` subfolder (`config/schedule.rb`).

Example:

```bash
$ kimurai deploy deploy@123.123.123.123 --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key
```

CLI options: _same as for the [setup](#setup) command_ (except `--ask-sudo`), plus
* `--repo-url` provide a custom repo url (`--repo-url git@bitbucket.org:username/repo_name.git`), otherwise the current `origin/master` will be taken (output from `$ git remote get-url origin`)
* `--repo-key-path` if the git repository is private, authorization is required to pull the code on the remote server. Use this option to provide a private repository SSH key. You can omit it if the required key is already added to the keychain on your desktop (same as with the `--ssh-key-path` option)

> You can check the deploy playbook [here](lib/kimurai/automation/deploy.yml)

## Spider `@config`

Using `@config` you can set several options for a spider, like proxy, user-agent, default cookies/headers, delay between requests, browser **memory control** and so on:

```ruby
class Spider < Kimurai::Base
  USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
  PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]

  @engine = :poltergeist_phantomjs
  @start_urls = ["https://example.com/"]
  @config = {
    headers: { "custom_header" => "custom_value" },
    cookies: [{ name: "cookie_name", value: "cookie_value", domain: ".example.com" }],
    user_agent: -> { USER_AGENTS.sample },
    proxy: -> { PROXIES.sample },
    window_size: [1366, 768],
    disable_images: true,
    browser: {
      restart_if: {
        # Restart browser if provided memory limit (in kilobytes) is exceeded:
        memory_limit: 350_000
      },
      before_request: {
        # Change user agent before each request:
        change_user_agent: true,
        # Change proxy before each request:
        change_proxy: true,
        # Clear all cookies and set default cookies (if provided) before each request:
        clear_and_set_cookies: true,
        # Process delay before each request:
        delay: 1..3
      }
    }
  }

  def parse(response, url:, data: {})
    # ...
  end
end
```

### All available `@config` options

```ruby
@config = {
  # Custom headers, format: hash. Example: { "some header" => "some value", "another header" => "another value" }
  # Works only for :mechanize and :poltergeist_phantomjs engines (Selenium doesn't allow to set/get headers)
  headers: {},

  # Custom User Agent, format: string or lambda.
  # Use lambda if you want to rotate user agents before each run:
  # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
  # Works for all engines
  user_agent: "Mozilla/5.0 Firefox/61.0",

  # Custom cookies, format: array of hashes.
  # Format for a single cookie: { name: "cookie name", value: "cookie value", domain: ".example.com" }
  # Works for all engines
  cookies: [],

  # Proxy, format: string or lambda. Format of a proxy string: "ip:port:protocol:user:password"
  # `protocol` can be http or socks5. User and password are optional.
  # Use lambda if you want to rotate proxies before each run:
  # proxy: -> { ARRAY_OF_PROXIES.sample }
  # Works for all engines, but keep in mind that Selenium drivers don't support proxies
  # with authorization. Also, Mechanize doesn't support the socks5 proxy format (only http)
  proxy: "3.4.5.6:3128:http:user:pass",

  # If enabled, browser will ignore any https errors. It's handy while using a proxy
  # with a self-signed SSL cert (for example Crawlera or Mitmproxy).
  # Also, it will allow visiting webpages with an expired SSL certificate.
  # Works for all engines
  ignore_ssl_errors: true,

  # Custom window size, works for all engines
  window_size: [1366, 768],

  # Skip images downloading if true, works for all engines
  disable_images: true,

  # Selenium engines only: headless mode, `:native` or `:virtual_display` (default is :native)
  # Although native mode has a better performance, virtual display mode
  # sometimes can be useful. For example, some websites can detect (and block)
  # headless chrome, so you can use virtual_display mode instead
  headless_mode: :native,

  # This option tells the browser not to use a proxy for the provided list of domains or IP addresses.
  # Format: array of strings. Works only for :selenium_firefox and :selenium_chrome
  proxy_bypass_list: [],

  # Option to provide a custom SSL certificate. Works only for :poltergeist_phantomjs and :mechanize
  ssl_cert_path: "path/to/ssl_cert",

  # Browser (Capybara session instance) options:
  browser: {
    # Array of errors to retry while processing a request
    retry_request_errors: [Net::ReadTimeout],
    # Restart browser if one of the options is true:
    restart_if: {
      # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
      memory_limit: 350_000,

      # Restart browser if provided requests limit is exceeded (works for all engines)
      requests_limit: 100
    },
    before_request: {
      # Change proxy before each request. The `proxy:` option above should be present
      # and be a lambda. Works only for poltergeist and mechanize engines
      # (Selenium doesn't support proxy rotation).
      change_proxy: true,

      # Change user agent before each request. The `user_agent:` option above should be present
      # and be a lambda. Works only for poltergeist and mechanize engines
      # (Selenium doesn't support getting/setting headers).
      change_user_agent: true,

      # Clear all cookies before each request, works for all engines
      clear_cookies: true,

      # If you want to clear all cookies + set custom cookies (the `cookies:` option above should be present),
      # use this option instead (works for all engines)
      clear_and_set_cookies: true,

      # Global option to set delay between requests.
      # Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a range,
      # the delay number will be chosen randomly for each request: `rand (2..5) # => 3`
      delay: 1..3
    }
  }
}
```

As you can see, most of the options are universal for any engine.

### `@config` settings inheritance
Settings can be inherited:

```ruby
class ApplicationSpider < Kimurai::Base
  @engine = :poltergeist_phantomjs
  @config = {
    user_agent: "Firefox",
    disable_images: true,
    browser: {
      restart_if: { memory_limit: 350_000 },
      before_request: { delay: 1..2 }
    }
  }
end

class CustomSpider < ApplicationSpider
  @name = "custom_spider"
  @start_urls = ["https://example.com/"]
  @config = {
    browser: { before_request: { delay: 4..6 }}
  }

  def parse(response, url:, data: {})
    # ...
  end
end
```

Here, the `@config` of `CustomSpider` will be _[deep merged](https://apidock.com/rails/Hash/deep_merge)_ with the `ApplicationSpider` config, so `CustomSpider` will keep all inherited options with only `delay` updated.
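
In other words, the effective config for `CustomSpider` from the example above would be:

```ruby
# Result of the deep merge for CustomSpider:
{
  user_agent: "Firefox",
  disable_images: true,
  browser: {
    restart_if: { memory_limit: 350_000 },
    before_request: { delay: 4..6 } # the only key that was overridden
  }
}
```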

## Project mode

Kimurai can work in project mode ([like Scrapy](https://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project)). To generate a new project, run: `$ kimurai generate project web_spiders` (where `web_spiders` is the name of the project).

Structure of the project:

```bash
.
├── config/
│   ├── initializers/
│   ├── application.rb
│   ├── automation.yml
│   ├── boot.rb
│   └── schedule.rb
├── spiders/
│   └── application_spider.rb
├── db/
├── helpers/
│   └── application_helper.rb
├── lib/
├── log/
├── pipelines/
│   ├── validator.rb
│   └── saver.rb
├── tmp/
├── .env
├── Gemfile
├── Gemfile.lock
└── README.md
```

<details/>
<summary>Description</summary>

* `config/` folder for configuration files
* `config/initializers` [Rails-like initializers](https://guides.rubyonrails.org/configuring.html#using-initializer-files) to load custom code at the start of the framework
* `config/application.rb` configuration settings for Kimurai (`Kimurai.configure do` block)
* `config/automation.yml` specify some settings for [setup and deploy](#automated-sever-setup-and-deployment)
* `config/boot.rb` loads the framework and project
* `config/schedule.rb` Cron [schedule for spiders](#schedule-spiders-using-cron)
* `spiders/` folder for spiders
* `spiders/application_spider.rb` base parent class for all spiders
* `db/` store here all database files (`sqlite`, `json`, `csv`, etc.)
* `helpers/` Rails-like helpers for spiders
* `helpers/application_helper.rb` all methods inside the ApplicationHelper module will be available for all spiders
* `lib/` put here custom Ruby code
* `log/` folder for logs
* `pipelines/` folder for [Scrapy-like](https://doc.scrapy.org/en/latest/topics/item-pipeline.html) pipelines. One file = one pipeline
* `pipelines/validator.rb` example pipeline to validate an item
* `pipelines/saver.rb` example pipeline to save an item
* `tmp/` folder for temp. files
* `.env` file to store ENV variables for the project and load them using [Dotenv](https://github.com/bkeepers/dotenv)
* `Gemfile` dependency file
* `Readme.md` example project readme
</details>


### Generate new spider
To generate a new spider in the project, run:

```bash
$ kimurai generate spider example_spider
      create  crawlers/example_spider.rb
```

The command will generate a new spider class inherited from `ApplicationSpider`:

```ruby
class ExampleSpider < ApplicationSpider
  @name = "example_spider"
  @start_urls = []
  @config = {}

  def parse(response, url:, data: {})
  end
end
```

### Crawl
To run a particular spider in the project, run: `$ bundle exec kimurai crawl example_spider`. Don't forget to add `bundle exec` before the command to load the required environment.

### List
To list all project spiders, run: `$ bundle exec kimurai list`

### Parse
For project spiders you can use the `$ kimurai parse` command, which helps to debug spiders:

```bash
$ bundle exec kimurai parse example_spider parse_product --url https://example-shop.com/product-1
```

where `example_spider` is the spider to run, `parse_product` is the spider method to process and `--url` is the url to open inside the processing method.

### Pipelines, `send_item` method
You can use item pipelines to organize and keep item-processing logic for all project spiders in one place (also check Scrapy's [description of pipelines](https://doc.scrapy.org/en/latest/topics/item-pipeline.html#item-pipeline)).

Imagine you have three spiders, each crawling a different e-commerce shop and saving only shoe positions. For each spider, you want to save only items with the "shoe" category, a unique sku, a valid title/price and existing images. To avoid code duplication between spiders, use pipelines:

<details/>
<summary>Example</summary>

pipelines/validator.rb
```ruby
class Validator < Kimurai::Pipeline
  def process_item(item, options: {})
    # Here you can validate the item and raise `DropItemError`
    # if one of the validations failed. Examples:

    # Drop item if its category is not "shoe":
    if item[:category] != "shoe"
      raise DropItemError, "Wrong item category"
    end

    # Check item sku for uniqueness using the built-in unique? helper:
    unless unique?(:sku, item[:sku])
      raise DropItemError, "Item sku is not unique"
    end

    # Drop item if the title is shorter than 5 symbols:
    if item[:title].size < 5
      raise DropItemError, "Item title is short"
    end

    # Drop item if the price is not present:
    unless item[:price].present?
      raise DropItemError, "Item price is not present"
    end

    # Drop item if it doesn't contain any images:
    unless item[:images].present?
      raise DropItemError, "Item images are not present"
    end

    # Pass the item to the next pipeline (if it wasn't dropped):
    item
  end
end
```

pipelines/saver.rb
```ruby
class Saver < Kimurai::Pipeline
  def process_item(item, options: {})
    # Here you can save the item to the database, send it to a remote API or
    # simply save the item to a file format using the `save_to` helper:

    # To get the name of the current spider: `spider.class.name`
    save_to "db/#{spider.class.name}.json", item, format: :json

    item
  end
end
```

spiders/application_spider.rb
```ruby
class ApplicationSpider < Kimurai::Base
  @engine = :selenium_chrome
  # Define pipelines (by order) for all spiders:
  @pipelines = [:validator, :saver]
end
```

spiders/shop_spider_1.rb
```ruby
class ShopSpiderOne < ApplicationSpider
  @name = "shop_spider_1"
  @start_urls = ["https://shop-1.com"]

  # ...

  def parse_product(response, url:, data: {})
    # ...

    # Send item to pipelines:
    send_item item
  end
end
```

spiders/shop_spider_2.rb
```ruby
class ShopSpiderTwo < ApplicationSpider
  @name = "shop_spider_2"
  @start_urls = ["https://shop-2.com"]

  def parse_product(response, url:, data: {})
    # ...

    # Send item to pipelines:
    send_item item
  end
end
```

spiders/shop_spider_3.rb
```ruby
class ShopSpiderThree < ApplicationSpider
  @name = "shop_spider_3"
  @start_urls = ["https://shop-3.com"]

  def parse_product(response, url:, data: {})
    # ...

    # Send item to pipelines:
    send_item item
  end
end
```
</details><br>

When you start using pipelines, stats for items appear:

<details>
<summary>Example</summary>

pipelines/validator.rb
```ruby
class Validator < Kimurai::Pipeline
  def process_item(item, options: {})
    if item[:star_count] < 10
      raise DropItemError, "Repository doesn't have enough stars"
    end

    item
  end
end
```

spiders/github_spider.rb
```ruby
class GithubSpider < ApplicationSpider
  @name = "github_spider"
  @engine = :selenium_chrome
  @pipelines = [:validator]
  @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
  @config = {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
    browser: { before_request: { delay: 4..7 } }
  }

  def parse(response, url:, data: {})
    response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
      request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
    end

    if next_page = response.at_xpath("//a[@class='next_page']")
      request_to :parse, url: absolute_url(next_page[:href], base: url)
    end
  end

  def parse_repo_page(response, url:, data: {})
    item = {}

    item[:owner] = response.xpath("//h1//a[@rel='author']").text
    item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
    item[:repo_url] = url
    item[:description] = response.xpath("//span[@itemprop='about']").text.squish
    item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
    item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish.delete(",").to_i
    item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish.delete(",").to_i
    item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish.delete(",").to_i
    item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text

    send_item item
  end
end
```

```
$ bundle exec kimurai crawl github_spider

I, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] INFO -- github_spider: Spider: started: github_spider
D, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
I, [2018-08-22 15:56:40 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 1, responses: 1
D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 116182
D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 5 seconds before request...

I, [2018-08-22 15:56:49 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 2, responses: 2
D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 217432
D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Pipeline: processed: {"owner":"lorien","repo_name":"awesome-web-scraping","repo_url":"https://github.com/lorien/awesome-web-scraping","description":"List of libraries, tools and APIs for web scraping and data processing.","tags":["awesome","awesome-list","web-scraping","data-processing","python","javascript","php","ruby"],"watch_count":159,"star_count":2423,"fork_count":358,"last_commit":"4 days ago"}
I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: items: sent: 1, processed: 1
D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 6 seconds before request...

...

I, [2018-08-22 16:11:50 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: visits: requests: 140, responses: 140
D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 211713

D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
E, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] ERROR -- github_spider: Pipeline: dropped: #<Kimurai::Pipeline::DropItemError: Repository doesn't have enough stars>, item: {:owner=>"preston", :repo_name=>"idclight", :repo_url=>"https://github.com/preston/idclight", :description=>"A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.", :tags=>[], :watch_count=>6, :star_count=>1, :fork_count=>0, :last_commit=>"on Apr 12, 2012"}

I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Info: items: sent: 127, processed: 12

I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 15:56:35 +0400, :stop_time=>2018-08-22 16:11:51 +0400, :running_time=>"15m, 16s", :visits=>{:requests=>140, :responses=>140}, :items=>{:sent=>127, :processed=>12}, :error=>nil}
```
</details><br>

Also, you can pass custom options to a pipeline from a particular spider if you want to change the pipeline behavior for that spider:

<details>
<summary>Example</summary>

spiders/custom_spider.rb
```ruby
class CustomSpider < ApplicationSpider
  @name = "custom_spider"
  @start_urls = ["https://example.com"]
  @pipelines = [:validator]

  # ...

  def parse_item(response, url:, data: {})
    # ...

    # Pass the custom option `skip_uniq_checking` to the Validator pipeline:
    send_item item, validator: { skip_uniq_checking: true }
  end
end
```

pipelines/validator.rb
```ruby
class Validator < Kimurai::Pipeline
  def process_item(item, options: {})

    # Do not check item sku for uniqueness if options[:skip_uniq_checking] is true:
    if options[:skip_uniq_checking] != true
      raise DropItemError, "Item sku is not unique" unless unique?(:sku, item[:sku])
    end
  end
end
```
</details>


### Runner

You can run project spiders one by one or in parallel using the `$ kimurai runner` command:

```
$ bundle exec kimurai list
custom_spider
example_spider
github_spider

$ bundle exec kimurai runner -j 3
>>> Runner: started: {:id=>1533727423, :status=>:processing, :start_time=>2018-08-08 15:23:43 +0400, :stop_time=>nil, :environment=>"development", :concurrent_jobs=>3, :spiders=>["custom_spider", "github_spider", "example_spider"]}
> Runner: started spider: custom_spider, index: 0
> Runner: started spider: github_spider, index: 1
> Runner: started spider: example_spider, index: 2
< Runner: stopped spider: custom_spider, index: 0
< Runner: stopped spider: example_spider, index: 2
< Runner: stopped spider: github_spider, index: 1
<<< Runner: stopped: {:id=>1533727423, :status=>:completed, :start_time=>2018-08-08 15:23:43 +0400, :stop_time=>2018-08-08 15:25:11 +0400, :environment=>"development", :concurrent_jobs=>3, :spiders=>["custom_spider", "github_spider", "example_spider"]}
```

Each spider runs in a separate process. Spider logs are available in the `log/` folder. Pass the `-j` option to specify how many spiders should be processed at the same time (default is 1).

#### Runner callbacks

You can perform custom actions before the runner starts and after the runner stops using `config.runner_at_start_callback` and `config.runner_at_stop_callback`. Check [config/application.rb](lib/kimurai/template/config/application.rb) to see an example.
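
A minimal sketch of what such a configuration might look like, assuming each callback is a lambda that receives a hash of runner information (check the template file above for the exact signature):

```ruby
Kimurai.configure do |config|
  # Assumption: the callback receives a hash with runner info (id, spiders, start_time, etc.)
  config.runner_at_start_callback = lambda do |info|
    puts "Runner started: #{info}"
  end

  config.runner_at_stop_callback = lambda do |info|
    puts "Runner stopped: #{info}"
  end
end
```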

## Chat Support and Feedback
Will be updated

## License
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).