kimurai 1.1.0 → 1.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -0
- data/README.md +54 -3
- data/lib/kimurai.rb +1 -0
- data/lib/kimurai/base.rb +30 -14
- data/lib/kimurai/base/saver.rb +4 -4
- data/lib/kimurai/base/storage.rb +12 -2
- data/lib/kimurai/browser_builder/mechanize_builder.rb +8 -2
- data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb +8 -2
- data/lib/kimurai/browser_builder/selenium_chrome_builder.rb +19 -5
- data/lib/kimurai/browser_builder/selenium_firefox_builder.rb +8 -2
- data/lib/kimurai/capybara_ext/session.rb +46 -10
- data/lib/kimurai/capybara_ext/session/config.rb +5 -1
- data/lib/kimurai/cli.rb +14 -1
- data/lib/kimurai/core_ext/hash.rb +5 -0
- data/lib/kimurai/runner.rb +37 -38
- data/lib/kimurai/template/config/application.rb +5 -0
- data/lib/kimurai/template/spiders/application_spider.rb +19 -1
- data/lib/kimurai/version.rb +1 -1
- metadata +3 -3
- data/CODE_OF_CONDUCT.md +0 -74
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2ebb62d0916ee55eb8ec05f34b5f5909b99aae8fa2236dd93169f6e5e1221805
+  data.tar.gz: da86497a6c4f61f2ff1ffe3e8eeee92acb203305e712b9bd4baaa5794fdcf5f5
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9d54d6074928a5bc0aa0a9d7e64308942e21ad5ae17788b2fa36adfffe88f67df0c533558ced412f2e22481a1664661c547419d09451d1260dd6ebd14ca4d915
+  data.tar.gz: d21f499a2a292dd672480d15da71742cfa82dec054b6a3c3a3c756e6cd2c98e58ac7d3b7fa3fe0ce6ae2e956d46417269242e742f8d83b644176ccef0822de75
data/CHANGELOG.md CHANGED
@@ -1,4 +1,20 @@
 # CHANGELOG
+## 1.2.0
+### New
+* Add possibility to add an array of values to the storage (`Base::Storage#add`)
+* Add `exception_on_fail` option to `Base.crawl!`
+* Add possibility to pass a request hash to `start_urls` (you can use an array of hashes as well, like: `@start_urls = [{ url: "https://example.com/cat?id=1", data: { category: "First Category" } }]`)
+* Implement `skip_request_errors` config feature. Added [Handle request errors](https://github.com/vifreefly/kimuraframework#handle-request-errors) chapter to the README.
+* Add option to choose the response type for `Session#current_response` (`:html` default, or `:json`)
+* Add option to provide custom chrome and chromedriver paths
+
+### Improvements
+* Refactor `Runner`
+
+### Fixes
+* Fix `Base::Saver` (automatically create the file if it doesn't exist, in case of a persistence database)
+* Do not deep merge config's `headers:` option
+
 ## 1.1.0
 ### Breaking changes 1.1.0
 `browser` config option deprecated. Now all sub-options inside `browser` should be placed right into the `@config` hash, without the `browser` parent key. Example:
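The new `start_urls` behavior (plain URL strings or full request hashes, mixed freely) can be sketched with a small, self-contained normalizer. This is an illustration of the documented semantics, not the gem's code — in 1.2.0 the dispatch happens inside `Base.crawl!`:

```ruby
# Hypothetical helper showing how 1.2.0 accepts both forms of a
# @start_urls element: a plain URL string, or a request hash with a
# :url key and optional :data payload.
def normalize_start_url(start_url)
  start_url.is_a?(Hash) ? start_url : { url: start_url }
end
```

Either form then feeds `request_to(:parse, ...)` with a uniform request hash, and the `data:` payload becomes available inside the `parse` method.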
data/README.md CHANGED
@@ -18,7 +18,7 @@
 
 <br>
 
-> Note: this readme is for `1.
+> Note: this readme is for `1.2.0` gem version. CHANGELOG [here](CHANGELOG.md).
 
 Kimurai is a modern web scraping framework written in Ruby which **works out of box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
 
@@ -216,7 +216,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
 * All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
 * Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
 * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates-unique-helper) to skip duplicates
-* Automatically [
+* Automatically [handle request errors](#handle-request-errors)
 * Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
 * Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
 * [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel`
@@ -242,6 +242,9 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
 * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
 * [Storage object](#storage-object)
   * [Persistence database for the storage](#persistence-database-for-the-storage)
+* [Handle request errors](#handle-request-errors)
+  * [skip_request_errors](#skip_request_errors)
+  * [retry_request_errors](#retry_request_errors)
 * [open_spider and close_spider callbacks](#open_spider-and-close_spider-callbacks)
 * [KIMURAI_ENV](#kimurai_env)
 * [Parallel crawling using in_parallel](#parallel-crawling-using-in_parallel)
@@ -860,6 +863,9 @@ ProductsSpider.crawl!(continue: true)
 
 Second approach is to automatically skip already processed items urls using `@config` `skip_duplicate_requests:` option:
 
+<details/>
+<summary>Check the code</summary>
+
 ```ruby
 class ProductsSpider < Kimurai::Base
   @start_urls = ["https://example-shop.com/"]
@@ -893,7 +899,29 @@ end
 # Run the spider with persistence database option:
 ProductsSpider.crawl!(continue: true)
 ```
+</details>
+
+### Handle request errors
+It is quite common that some pages of a website being crawled return a response code other than `200 OK`. In such cases, the method `request_to` (or `browser.visit`) can raise an exception. Kimurai provides `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
+
+#### skip_request_errors
+You can automatically skip some errors while requesting a page using the `skip_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, it will be caught and the request will be skipped. It is a good idea to skip errors like NotFound (404), etc.
+
+Format for the option: an array whose elements are error classes and/or hashes. You can use the _hash_ format for more flexibility:
+
+```ruby
+@config = {
+  skip_request_errors: [{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }]
+}
+```
+In this case, the provided `message:` will be compared with the full error message using `String#include?`. You can also use a regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
 
+#### retry_request_errors
+You can automatically retry some errors, with a few attempts, while requesting a page using the `retry_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, it will be caught and the request will be processed again after a delay.
+
+There are 3 attempts: first: delay _15 sec_, second: delay _30 sec_, third: delay _45 sec_. If after 3 attempts there is still an exception, the exception will be raised. It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
+
+Format for the option: the same as for the `skip_request_errors` option.
 
 ### `open_spider` and `close_spider` callbacks
 
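The matching rule described above (exact class match, plus `String#include?` or a `Regexp` test on the message) can be sketched as a standalone predicate. This mirrors the documented semantics but is not the gem's internal helper:

```ruby
# Checks an exception against one skip/retry list entry. An entry is either
# an error class, or a hash like { error: SomeError, message: "..." } where
# :message may be a String (matched with include?) or a Regexp.
def error_entry_matches?(exception, entry)
  if entry.is_a?(Hash)
    message_match =
      if entry[:message].is_a?(Regexp)
        exception.message.match?(entry[:message])
      else
        exception.message.include?(entry[:message])
      end
    exception.class == entry[:error] && message_match
  else
    exception.class == entry
  end
end
```

Note the class comparison is exact, not `is_a?`, so listing `StandardError` would not catch a `RuntimeError` — list the concrete error classes you expect.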
@@ -1340,6 +1368,11 @@ Kimurai.configure do |config|
   # Custom time zone (for logs):
   # config.time_zone = "UTC"
   # config.time_zone = "Europe/Moscow"
+
+  # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
+  # config.selenium_chrome_path = "/usr/bin/chromium-browser"
+  # Provide custom selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
+  # config.chromedriver_path = "~/.local/bin/chromedriver"
 end
 ```
 
@@ -1591,7 +1624,23 @@ end
   # works for all drivers
   skip_duplicate_requests: true,
 
-  #
+  # Automatically skip provided errors while requesting a page.
+  # If raised error matches one of the errors in the list, then this error will be caught,
+  # and request will be skipped.
+  # It is a good idea to skip errors like NotFound(404), etc.
+  # Format: array where elements are error classes or/and hashes. You can use hash format
+  # for more flexibility: `{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }`.
+  # Provided `message:` will be compared with a full error message using `String#include?`. Also
+  # you can use regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
+  skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
+
+  # Automatically retry provided errors with a few attempts while requesting a page.
+  # If raised error matches one of the errors in the list, then this error will be caught
+  # and the request will be processed again within a delay. There are 3 attempts:
+  # first: delay 15 sec, second: delay 30 sec, third: delay 45 sec.
+  # If after 3 attempts there is still an exception, then the exception will be raised.
+  # It is a good idea to try to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
+  # Format: same as for `skip_request_errors` option.
   retry_request_errors: [Net::ReadTimeout],
 
   # Restart browser if one of the options is true:
@@ -1602,6 +1651,8 @@ end
     # Restart browser if provided requests limit is exceeded (works for all engines)
     requests_limit: 100
   },
+
+  # Perform several actions before each request:
   before_request: {
     # Change proxy before each request. The `proxy:` option above should be presented
     # and has lambda format. Works only for poltergeist and mechanize engines
data/lib/kimurai.rb CHANGED
@@ -10,6 +10,7 @@ require_relative 'kimurai/version'
 require_relative 'kimurai/core_ext/numeric'
 require_relative 'kimurai/core_ext/string'
 require_relative 'kimurai/core_ext/array'
+require_relative 'kimurai/core_ext/hash'
 
 require_relative 'kimurai/browser_builder'
 require_relative 'kimurai/base_helper'
data/lib/kimurai/base.rb CHANGED
@@ -3,6 +3,9 @@ require_relative 'base/storage'
 
 module Kimurai
   class Base
+    # don't deep merge config's headers hash option
+    DMERGE_EXCLUDE = [:headers]
+
     LoggerFormatter = proc do |severity, datetime, progname, msg|
       current_thread_id = Thread.current.object_id
       thread_type = Thread.main == Thread.current ? "M" : "C"
@@ -77,7 +80,11 @@ module Kimurai
     end
 
     def self.config
-      superclass.equal?(::Object)
+      if superclass.equal?(::Object)
+        @config
+      else
+        superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+      end
     end
 
     ###
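`deep_merge_excl` comes from the new `core_ext/hash.rb` (added to the gem in this release). A plausible standalone implementation matching the documented behavior — deep merge everywhere except the excluded keys, whose values are overwritten wholesale — might look like this; a sketch, not necessarily the gem's exact code:

```ruby
# Deep-merges +other+ into +base+, except that values under keys listed in
# +exclude+ (e.g. :headers) are replaced outright instead of merged.
def deep_merge_excl(base, other, exclude)
  base.merge(other) do |key, old_val, new_val|
    if !exclude.include?(key) && old_val.is_a?(Hash) && new_val.is_a?(Hash)
      deep_merge_excl(old_val, new_val, exclude)
    else
      new_val
    end
  end
end
```

This is what the "Do not deep merge config's `headers:` option" fix is about: a child spider's `headers:` hash replaces the parent's instead of being combined with it, so stale default headers don't leak through.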
@@ -90,7 +97,7 @@ module Kimurai
       end
     end
 
-    def self.crawl!(continue: false)
+    def self.crawl!(continue: false, exception_on_fail: true)
       logger.error "Spider: already running: #{name}" and return false if running?
 
       storage_path =
@@ -118,19 +125,23 @@ module Kimurai
       spider.with_info = true
       if start_urls
         start_urls.each do |start_url|
-
+          if start_url.class == Hash
+            spider.request_to(:parse, start_url)
+          else
+            spider.request_to(:parse, url: start_url)
+          end
         end
       else
         spider.parse
       end
     rescue StandardError, SignalException, SystemExit => e
       @run_info.merge!(status: :failed, error: e.inspect)
-      raise e
+      exception_on_fail ? raise(e) : [@run_info, e]
     else
       @run_info.merge!(status: :completed)
     ensure
       if spider
-        spider.browser.destroy_driver!
+        spider.browser.destroy_driver! if spider.instance_variable_get("@browser")
 
         stop_time = Time.now
         total_time = (stop_time - @run_info[:start_time]).round(3)
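The `exception_on_fail:` flag added to `crawl!` above changes only what happens after the rescue: re-raise, or return the run info together with the exception. A minimal standalone sketch of that control flow (method and hash names are illustrative, not the gem's):

```ruby
# Runs a block while recording a status hash; on failure either re-raises
# or returns [info, exception], mirroring crawl!'s rescue clause.
def run_with_info(exception_on_fail: true)
  info = { status: :processing }
  yield
  info[:status] = :completed
  info
rescue StandardError => e
  info.merge!(status: :failed, error: e.inspect)
  exception_on_fail ? raise(e) : [info, e]
end
```

With `exception_on_fail: false` a failed run becomes inspectable data instead of an exception, which is what lets the refactored `Runner` keep going after one spider crashes.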
@@ -168,7 +179,7 @@ module Kimurai
 
     def initialize(engine = self.class.engine, config: {})
       @engine = engine
-      @config = self.class.config.
+      @config = self.class.config.deep_merge_excl(config, DMERGE_EXCLUDE)
       @pipelines = self.class.pipelines.map do |pipeline_name|
         klass = Pipeline.descendants.find { |kl| kl.name == pipeline_name }
         instance = klass.new
@@ -184,15 +195,16 @@ module Kimurai
       @browser ||= BrowserBuilder.build(@engine, @config, spider: self)
     end
 
-    def request_to(handler, delay = nil, url:, data: {})
+    def request_to(handler, delay = nil, url:, data: {}, response_type: :html)
       if @config[:skip_duplicate_requests] && !unique_request?(url)
         add_event(:duplicate_requests) if self.with_info
-        logger.warn "Spider: request_to:
+        logger.warn "Spider: request_to: not unique url: #{url}, skipped" and return
       end
 
-
-
-
+      visited = delay ? browser.visit(url, delay: delay) : browser.visit(url)
+      return unless visited
+
+      public_send(handler, browser.current_response(response_type), { url: url, data: data })
     end
 
     def console(response = nil, url: nil, data: {})
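`request_to`'s new `response_type:` parameter simply selects how `current_response` parses the body. A self-contained sketch of the dispatch follows; the real `:html` branch uses `Nokogiri::HTML`, stubbed out here so the example stays dependency-free:

```ruby
require 'json'

# Returns the parsed response body: :json uses JSON.parse, while the
# :html branch here is a stand-in for Nokogiri::HTML(body) in the gem.
def parse_body(body, response_type = :html)
  case response_type
  when :json then JSON.parse(body)
  when :html then body
  end
end
```

So `request_to(:parse_product, url: url, response_type: :json)` hands the handler a parsed JSON structure instead of a Nokogiri document, which is convenient for scraping JSON API endpoints.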
@@ -285,18 +297,22 @@ module Kimurai
         all << Thread.new(part) do |part|
           Thread.current.abort_on_exception = true
 
-          spider = self.class.new(engine, config: config)
+          spider = self.class.new(engine, config: @config.deep_merge_excl(config, DMERGE_EXCLUDE))
           spider.with_info = true if self.with_info
 
           part.each do |url_data|
             if url_data.class == Hash
-
+              if url_data[:url].present? && url_data[:data].present?
+                spider.request_to(handler, delay, url_data)
+              else
+                spider.public_send(handler, url_data)
+              end
             else
               spider.request_to(handler, delay, url: url_data, data: data)
             end
           end
         ensure
-          spider.browser.destroy_driver!
+          spider.browser.destroy_driver! if spider.instance_variable_get("@browser")
         end
 
         sleep 0.5
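The hunk above lets `in_parallel` accept mixed element types. The routing decision can be sketched on its own (the symbols returned are just labels for illustration, and ActiveSupport's `.present?` is approximated by plain truthiness):

```ruby
# Decides how in_parallel would route one element: a hash carrying both
# :url and :data goes through request_to as a full request hash; any other
# hash is passed to the handler directly; a plain value is wrapped as a URL.
def route_element(url_data)
  if url_data.is_a?(Hash)
    if url_data[:url] && url_data[:data]
      [:request_to, url_data]
    else
      [:handler_direct, url_data]
    end
  else
    [:request_to, { url: url_data }]
  end
end
```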
data/lib/kimurai/base/saver.rb CHANGED
@@ -42,7 +42,7 @@ module Kimurai
     def save_to_json(item)
       data = JSON.generate([item])
 
-      if
+      if @index > 1 || append && File.exists?(path)
         file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
         File.open(path, "w") do |f|
           f.write(file_content + data.sub(/\A\[/, ""))
@@ -55,7 +55,7 @@ module Kimurai
     def save_to_pretty_json(item)
       data = JSON.pretty_generate([item])
 
-      if
+      if @index > 1 || append && File.exists?(path)
         file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
         File.open(path, "w") do |f|
           f.write(file_content + data.sub(/\A\[\n/, ""))
@@ -68,7 +68,7 @@ module Kimurai
     def save_to_jsonlines(item)
       data = JSON.generate(item)
 
-      if
+      if @index > 1 || append && File.exists?(path)
         File.open(path, "a") { |file| file.write("\n" + data) }
       else
         File.open(path, "w") { |file| file.write(data) }
@@ -78,7 +78,7 @@ module Kimurai
     def save_to_csv(item)
       data = flatten_hash(item)
 
-      if
+      if @index > 1 || append && File.exists?(path)
         CSV.open(path, "a+", force_quotes: true) do |csv|
           csv << data.values
         end
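The four hunks above share one pattern: on the first write, dump the item; on later writes, splice it into the already-serialized output. The JSON variant's splice can be shown standalone as a string-level sketch of what `save_to_json` does to the file content:

```ruby
require 'json'

# Appends an item to an already-serialized JSON array string by turning the
# trailing "}]" into "}," and dropping the new chunk's leading "[" -- the
# same regex trick the saver applies to the file content.
def append_to_json_array(file_content, item)
  data = JSON.generate([item])
  file_content.sub(/\}\]\Z/, "\}\,") + data.sub(/\A\[/, "")
end
```

The fix in this release is about the `append && File.exists?(path)` branch: when `continue: true` resumes a run against a persistence database but the output file was deleted, the saver now falls through to the "create fresh file" branch instead of failing on a missing file.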
data/lib/kimurai/base/storage.rb CHANGED
@@ -40,11 +40,21 @@ module Kimurai
       if path
         database.transaction do
           database[scope] ||= []
-
+          if value.class == Array
+            database[scope] += value
+            database[scope].uniq!
+          else
+            database[scope].push(value) unless database[scope].include?(value)
+          end
         end
       else
         database[scope] ||= []
-
+        if value.class == Array
+          database[scope] += value
+          database[scope].uniq!
+        else
+          database[scope].push(value) unless database[scope].include?(value)
+        end
       end
     end
   end
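The new array branch of `Storage#add` reduces to: concatenate and de-duplicate for arrays, push-if-absent for scalars. A standalone sketch over a plain hash (the gem additionally wraps the same logic in a transaction when a persistence path is set):

```ruby
# Adds a value (or an array of values) to database[scope], keeping entries
# unique, mirroring the two branches of the 1.2.0 Storage#add.
def storage_add(database, scope, value)
  database[scope] ||= []
  if value.is_a?(Array)
    database[scope] += value
    database[scope].uniq!
  else
    database[scope].push(value) unless database[scope].include?(value)
  end
  database
end
```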
data/lib/kimurai/browser_builder/mechanize_builder.rb CHANGED
@@ -80,9 +80,15 @@ module Kimurai
       end
 
       # Browser instance options
+      # skip_request_errors
+      if skip_errors = @config[:skip_request_errors].presence
+        @browser.config.skip_request_errors = skip_errors
+        logger.debug "BrowserBuilder (mechanize): enabled skip_request_errors"
+      end
+
       # retry_request_errors
-      if
-        @browser.config.retry_request_errors =
+      if retry_errors = @config[:retry_request_errors].presence
+        @browser.config.retry_request_errors = retry_errors
         logger.debug "BrowserBuilder (mechanize): enabled retry_request_errors"
       end
 
data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb CHANGED
@@ -91,9 +91,15 @@ module Kimurai
       end
 
       # Browser instance options
+      # skip_request_errors
+      if skip_errors = @config[:skip_request_errors].presence
+        @browser.config.skip_request_errors = skip_errors
+        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled skip_request_errors"
+      end
+
       # retry_request_errors
-      if
-        @browser.config.retry_request_errors =
+      if retry_errors = @config[:retry_request_errors].presence
+        @browser.config.retry_request_errors = retry_errors
         logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled retry_request_errors"
       end
 
data/lib/kimurai/browser_builder/selenium_chrome_builder.rb CHANGED
@@ -23,8 +23,15 @@ module Kimurai
       # Register driver
       Capybara.register_driver :selenium_chrome do |app|
         # Create driver options
-
-
+        opts = { args: %w[--disable-gpu --no-sandbox --disable-translate] }
+
+        # Provide custom chrome browser path:
+        if chrome_path = Kimurai.configuration.selenium_chrome_path
+          opts.merge!(binary: chrome_path)
+        end
+
+        # See all options here: https://seleniumhq.github.io/selenium/docs/api/rb/Selenium/WebDriver/Chrome/Options.html
+        driver_options = Selenium::WebDriver::Chrome::Options.new(opts)
 
         # Window size
         if size = @config[:window_size].presence
@@ -99,7 +106,8 @@ module Kimurai
         end
       end
 
-
+        chromedriver_path = Kimurai.configuration.chromedriver_path || "/usr/local/bin/chromedriver"
+        Capybara::Selenium::Driver.new(app, browser: :chrome, options: driver_options, driver_path: chromedriver_path)
       end
 
       # Create browser instance (Capybara session)
@@ -118,9 +126,15 @@ module Kimurai
       end
 
       # Browser instance options
+      # skip_request_errors
+      if skip_errors = @config[:skip_request_errors].presence
+        @browser.config.skip_request_errors = skip_errors
+        logger.debug "BrowserBuilder (selenium_chrome): enabled skip_request_errors"
+      end
+
       # retry_request_errors
-      if
-        @browser.config.retry_request_errors =
+      if retry_errors = @config[:retry_request_errors].presence
+        @browser.config.retry_request_errors = retry_errors
         logger.debug "BrowserBuilder (selenium_chrome): enabled retry_request_errors"
      end
 
data/lib/kimurai/browser_builder/selenium_firefox_builder.rb CHANGED
@@ -131,9 +131,15 @@ module Kimurai
       end
 
       # Browser instance options
+      # skip_request_errors
+      if skip_errors = @config[:skip_request_errors].presence
+        @browser.config.skip_request_errors = skip_errors
+        logger.debug "BrowserBuilder (selenium_firefox): enabled skip_request_errors"
+      end
+
       # retry_request_errors
-      if
-        @browser.config.retry_request_errors =
+      if retry_errors = @config[:retry_request_errors].presence
+        @browser.config.retry_request_errors = retry_errors
         logger.debug "BrowserBuilder (selenium_firefox): enabled retry_request_errors"
       end
 
data/lib/kimurai/capybara_ext/session.rb CHANGED
@@ -1,5 +1,6 @@
 require 'capybara'
 require 'nokogiri'
+require 'json'
 require_relative 'session/config'
 
 module Capybara
@@ -18,21 +19,30 @@ module Capybara
       spider.class.update(:visits, :requests) if spider.with_info
 
       original_visit(visit_uri)
-    rescue
-
-
-
-
-
-
+    rescue => e
+      if match_error?(e, type: :to_skip)
+        logger.error "Browser: skip request error: #{e.inspect}, url: #{visit_uri}"
+        spider.add_event(:requests_errors, e.inspect) if spider.with_info
+        false
+      elsif match_error?(e, type: :to_retry)
+        logger.error "Browser: retry request error: #{e.inspect}, url: #{visit_uri}"
+        spider.add_event(:requests_errors, e.inspect) if spider.with_info
+
+        if (retries += 1) <= max_retries
+          logger.info "Browser: sleep #{(sleep_interval += 15)} seconds and process retry № #{retries} to the url: #{visit_uri}"
+          sleep sleep_interval and retry
+        else
+          logger.error "Browser: all retries (#{retries - 1}) to the url #{visit_uri} are gone"
+          raise e
+        end
       else
-        logger.error "Browser: all retries (#{retries - 1}) to the url `#{visit_uri}` are gone"
         raise e
       end
     else
       driver.responses += 1 and logger.info "Browser: finished get request to: #{visit_uri}"
       spider.class.update(:visits, :responses) if spider.with_info
       driver.visited = true unless driver.visited
+      true
     ensure
       if spider.with_info
         logger.info "Info: visits: requests: #{spider.class.visits[:requests]}, responses: #{spider.class.visits[:responses]}"
@@ -75,8 +85,13 @@ module Capybara
       logger.info "Browser: driver has been restarted: name: #{mode}, pid: #{driver.pid}, port: #{driver.port}"
     end
 
-    def current_response
-
+    def current_response(response_type = :html)
+      case response_type
+      when :html
+        Nokogiri::HTML(body)
+      when :json
+        JSON.parse(body)
+      end
     end
 
     ###
@@ -114,6 +129,27 @@ module Capybara
 
     private
 
+    def match_error?(e, type:)
+      errors = (type == :to_retry ? config.retry_request_errors : config.skip_request_errors)
+      if errors.present?
+        errors.any? do |error|
+          if error.class == Hash
+            match = if error[:message].class == Regexp
+              e.message&.match?(error[:message])
+            else
+              e.message&.include?(error[:message])
+            end
+
+            e.class == error[:error] && match
+          else
+            e.class == error
+          end
+        end
+      else
+        false
+      end
+    end
+
     def process_delay(delay)
       interval = (delay.class == Range ? rand(delay) : delay)
       logger.debug "Browser: sleep #{interval.round(2)} #{'second'.pluralize(interval)} before request..."
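The retry branch in `visit` above grows `sleep_interval` by 15 on each attempt, producing the documented 15/30/45-second schedule. A dependency-free sketch of that loop — the sleep is recorded instead of performed so the example runs instantly:

```ruby
# Retries a failing block up to max_retries times, recording the growing
# delay each attempt would sleep; re-raises once retries are exhausted.
def visit_with_retries(max_retries: 3, delays: [])
  retries = 0
  sleep_interval = 0
  begin
    yield
  rescue StandardError => e
    if (retries += 1) <= max_retries
      delays << (sleep_interval += 15) # the gem calls sleep(sleep_interval) here
      retry
    else
      raise e
    end
  end
end
```

One attempt plus three retries means a permanently failing URL costs up to 90 seconds of sleeping before the exception finally propagates, which is worth keeping in mind when listing slow errors like `Net::ReadTimeout`.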
data/lib/kimurai/capybara_ext/session/config.rb CHANGED
@@ -1,12 +1,16 @@
 module Capybara
   class SessionConfig
     attr_accessor :cookies, :proxy, :user_agent
-    attr_writer :retry_request_errors
+    attr_writer :retry_request_errors, :skip_request_errors
 
     def retry_request_errors
       @retry_request_errors ||= []
     end
 
+    def skip_request_errors
+      @skip_request_errors ||= []
+    end
+
     def restart_if
       @restart_if ||= {}
     end
data/lib/kimurai/cli.rb CHANGED
@@ -146,10 +146,23 @@ module Kimurai
       puts VERSION
     end
 
+    desc "dashboard", "Run dashboard"
+    def dashboard
+      raise "Can't find Kimurai project" unless inside_project?
+
+      require './config/boot'
+      if Object.const_defined?("Kimurai::Dashboard")
+        require 'kimurai/dashboard/app'
+        Kimurai::Dashboard::App.run!
+      else
+        raise "Kimurai::Dashboard is not defined"
+      end
+    end
+
     private
 
     def inside_project?
-      Dir.exists?
+      Dir.exists?("spiders") && File.exists?("./config/boot.rb")
     end
   end
 end
data/lib/kimurai/runner.rb CHANGED
@@ -2,61 +2,43 @@ require 'pmap'
 
 module Kimurai
   class Runner
-    attr_reader :jobs, :spiders
+    attr_reader :jobs, :spiders, :session_info
 
     def initialize(parallel_jobs:)
       @jobs = parallel_jobs
       @spiders = Kimurai.list
+      @start_time = Time.now
 
-
-
-    end
-  end
-
-  def run!
-    start_time = Time.now
-    run_id = start_time.to_i
-    running_pids = []
-
-    ENV.store("RBCAT_COLORIZER", "false")
-
-    run_info = {
-      id: run_id,
+      @session_info = {
+        id: @start_time.to_i,
         status: :processing,
-      start_time: start_time,
+        start_time: @start_time,
         stop_time: nil,
         environment: Kimurai.env,
-      concurrent_jobs: jobs,
-      spiders: spiders.keys
+        concurrent_jobs: @jobs,
+        spiders: @spiders.keys
       }
 
-
-
-
-      # Kill currently running spiders
-      running_pids.each { |pid| Process.kill("INT", pid) }
-
-      error = $!
-      stop_time = Time.now
+      if time_zone = Kimurai.configuration.time_zone
+        Kimurai.time_zone = time_zone
+      end
 
-
-
-
-      run_info.merge!(status: :failed, error: error.inspect, stop_time: stop_time)
-    end
+      ENV.store("SESSION_ID", @start_time.to_i.to_s)
+      ENV.store("RBCAT_COLORIZER", "false")
+    end
 
-
-
-    end
-    puts "<<< Runner: stopped: #{run_info}"
-  end
+    def run!(exception_on_fail: true)
+      running_pids = []
 
-    puts ">>> Runner: started: #{
+      puts ">>> Runner: started: #{session_info}"
       if at_start_callback = Kimurai.configuration.runner_at_start_callback
-        at_start_callback.call(
+        at_start_callback.call(session_info)
       end
 
+      running = true
       spiders.peach_with_index(jobs) do |spider, i|
+        next unless running
+
         spider_name = spider[0]
         puts "> Runner: started spider: #{spider_name}, index: #{i}"
 
@@ -67,6 +49,23 @@ module Kimurai
         running_pids.delete(pid)
         puts "< Runner: stopped spider: #{spider_name}, index: #{i}"
       end
+    rescue StandardError, SignalException, SystemExit => e
+      running = false
+      session_info.merge!(status: :failed, error: e.inspect, stop_time: Time.now)
+      exception_on_fail ? raise(e) : [session_info, e]
+    else
+      session_info.merge!(status: :completed, stop_time: Time.now)
+    ensure
+      running = false
+      Thread.list.each { |t| t.kill if t != Thread.main }
+
+      # Kill currently running spiders (if any, in case of fail)
+      running_pids.each { |pid| Process.kill("INT", pid) }
+
+      if at_stop_callback = Kimurai.configuration.runner_at_stop_callback
+        at_stop_callback.call(session_info)
+      end
+      puts "<<< Runner: stopped: #{session_info}"
     end
   end
 end
data/lib/kimurai/template/config/application.rb CHANGED
@@ -29,4 +29,9 @@ Kimurai.configure do |config|
   #   json = JSON.pretty_generate(info)
   #   Sender.send_notification("Stopped session: #{json}")
   # end
+
+  # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
+  # config.selenium_chrome_path = "/usr/bin/chromium-browser"
+  # Provide custom selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
+  # config.chromedriver_path = "/usr/local/bin/chromedriver"
 end
data/lib/kimurai/template/spiders/application_spider.rb CHANGED
@@ -81,7 +81,23 @@ class ApplicationSpider < Kimurai::Base
     # works for all drivers
     # skip_duplicate_requests: true,
 
-    #
+    # Automatically skip provided errors while requesting a page.
+    # If raised error matches one of the errors in the list, then this error will be caught,
+    # and request will be skipped.
+    # It is a good idea to skip errors like NotFound(404), etc.
+    # Format: array where elements are error classes or/and hashes. You can use hash format
+    # for more flexibility: `{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }`.
+    # Provided `message:` will be compared with a full error message using `String#include?`. Also
+    # you can use regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
+    # skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
+
+    # Automatically retry provided errors with a few attempts while requesting a page.
+    # If raised error matches one of the errors in the list, then this error will be caught
+    # and the request will be processed again within a delay. There are 3 attempts:
+    # first: delay 15 sec, second: delay 30 sec, third: delay 45 sec.
+    # If after 3 attempts there is still an exception, then the exception will be raised.
+    # It is a good idea to try to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
+    # Format: same as for `skip_request_errors` option.
     # retry_request_errors: [Net::ReadTimeout],
 
     # Restart browser if one of the options is true:
@@ -92,6 +108,8 @@ class ApplicationSpider < Kimurai::Base
     # Restart browser if provided requests limit is exceeded (works for all engines)
     # requests_limit: 100
   },
+
+  # Perform several actions before each request:
   before_request: {
     # Change proxy before each request. The `proxy:` option above should be presented
     # and has lambda format. Works only for poltergeist and mechanize engines
data/lib/kimurai/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kimurai
 version: !ruby/object:Gem::Version
-  version: 1.1.0
+  version: 1.2.0
 platform: ruby
 authors:
 - Victor Afanasev
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2018-09-
+date: 2018-09-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: thor
@@ -265,7 +265,6 @@ files:
 - ".gitignore"
 - ".travis.yml"
 - CHANGELOG.md
-- CODE_OF_CONDUCT.md
 - Gemfile
 - LICENSE.txt
 - README.md
@@ -301,6 +300,7 @@ files:
 - lib/kimurai/cli/ansible_command_builder.rb
 - lib/kimurai/cli/generator.rb
 - lib/kimurai/core_ext/array.rb
+- lib/kimurai/core_ext/hash.rb
 - lib/kimurai/core_ext/numeric.rb
 - lib/kimurai/core_ext/string.rb
 - lib/kimurai/pipeline.rb
data/CODE_OF_CONDUCT.md
DELETED
@@ -1,74 +0,0 @@
-# Contributor Covenant Code of Conduct
-
-## Our Pledge
-
-In the interest of fostering an open and welcoming environment, we as
-contributors and maintainers pledge to making participation in our project and
-our community a harassment-free experience for everyone, regardless of age, body
-size, disability, ethnicity, gender identity and expression, level of experience,
-nationality, personal appearance, race, religion, or sexual identity and
-orientation.
-
-## Our Standards
-
-Examples of behavior that contributes to creating a positive environment
-include:
-
-* Using welcoming and inclusive language
-* Being respectful of differing viewpoints and experiences
-* Gracefully accepting constructive criticism
-* Focusing on what is best for the community
-* Showing empathy towards other community members
-
-Examples of unacceptable behavior by participants include:
-
-* The use of sexualized language or imagery and unwelcome sexual attention or
-advances
-* Trolling, insulting/derogatory comments, and personal or political attacks
-* Public or private harassment
-* Publishing others' private information, such as a physical or electronic
-address, without explicit permission
-* Other conduct which could reasonably be considered inappropriate in a
-professional setting
-
-## Our Responsibilities
-
-Project maintainers are responsible for clarifying the standards of acceptable
-behavior and are expected to take appropriate and fair corrective action in
-response to any instances of unacceptable behavior.
-
-Project maintainers have the right and responsibility to remove, edit, or
-reject comments, commits, code, wiki edits, issues, and other contributions
-that are not aligned to this Code of Conduct, or to ban temporarily or
-permanently any contributor for other behaviors that they deem inappropriate,
-threatening, offensive, or harmful.
-
-## Scope
-
-This Code of Conduct applies both within project spaces and in public spaces
-when an individual is representing the project or its community. Examples of
-representing a project or community include using an official project e-mail
-address, posting via an official social media account, or acting as an appointed
-representative at an online or offline event. Representation of a project may be
-further defined and clarified by project maintainers.
-
-## Enforcement
-
-Instances of abusive, harassing, or otherwise unacceptable behavior may be
-reported by contacting the project team at vicfreefly@gmail.com. All
-complaints will be reviewed and investigated and will result in a response that
-is deemed necessary and appropriate to the circumstances. The project team is
-obligated to maintain confidentiality with regard to the reporter of an incident.
-Further details of specific enforcement policies may be posted separately.
-
-Project maintainers who do not follow or enforce the Code of Conduct in good
-faith may face temporary or permanent repercussions as determined by other
-members of the project's leadership.
-
-## Attribution
-
-This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
-available at [http://contributor-covenant.org/version/1/4][version]
-
-[homepage]: http://contributor-covenant.org
-[version]: http://contributor-covenant.org/version/1/4/