kimurai 1.1.0 → 1.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -0
- data/README.md +54 -3
- data/lib/kimurai.rb +1 -0
- data/lib/kimurai/base.rb +30 -14
- data/lib/kimurai/base/saver.rb +4 -4
- data/lib/kimurai/base/storage.rb +12 -2
- data/lib/kimurai/browser_builder/mechanize_builder.rb +8 -2
- data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb +8 -2
- data/lib/kimurai/browser_builder/selenium_chrome_builder.rb +19 -5
- data/lib/kimurai/browser_builder/selenium_firefox_builder.rb +8 -2
- data/lib/kimurai/capybara_ext/session.rb +46 -10
- data/lib/kimurai/capybara_ext/session/config.rb +5 -1
- data/lib/kimurai/cli.rb +14 -1
- data/lib/kimurai/core_ext/hash.rb +5 -0
- data/lib/kimurai/runner.rb +37 -38
- data/lib/kimurai/template/config/application.rb +5 -0
- data/lib/kimurai/template/spiders/application_spider.rb +19 -1
- data/lib/kimurai/version.rb +1 -1
- metadata +3 -3
- data/CODE_OF_CONDUCT.md +0 -74
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2ebb62d0916ee55eb8ec05f34b5f5909b99aae8fa2236dd93169f6e5e1221805
+  data.tar.gz: da86497a6c4f61f2ff1ffe3e8eeee92acb203305e712b9bd4baaa5794fdcf5f5
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9d54d6074928a5bc0aa0a9d7e64308942e21ad5ae17788b2fa36adfffe88f67df0c533558ced412f2e22481a1664661c547419d09451d1260dd6ebd14ca4d915
+  data.tar.gz: d21f499a2a292dd672480d15da71742cfa82dec054b6a3c3a3c756e6cd2c98e58ac7d3b7fa3fe0ce6ae2e956d46417269242e742f8d83b644176ccef0822de75
data/CHANGELOG.md
CHANGED
@@ -1,4 +1,20 @@
 # CHANGELOG
+## 1.2.0
+### New
+* Add possibility to add array of values to the storage (`Base::Storage#add`)
+* Add `exception_on_fail` option to `Base.crawl!`
+* Add possibility to pass request hash to the `start_urls` (You can use array of hashes as well, like: `@start_urls = [{ url: "https://example.com/cat?id=1", data: { category: "First Category" } }]`)
+* Implement `skip_request_errors` config feature. Added [Handle request errors](https://github.com/vifreefly/kimuraframework#handle-request-errors) chapter to the README.
+* Add option to choose response type for `Session#current_response` (`:html` default, or `:json`)
+* Add option to provide custom chrome and chromedriver paths
+
+### Improvements
+* Refactor `Runner`
+
+### Fixes
+* Fix `Base#Saver` (automatically create file if it doesn't exists in case of persistence database)
+* Do not deep merge config's `headers:` option
+
 ## 1.1.0
 ### Breaking changes 1.1.0
 `browser` config option depricated. Now all sub-options inside `browser` should be placed right into `@config` hash, without `browser` parent key. Example:
data/README.md
CHANGED
@@ -18,7 +18,7 @@
 
 <br>
 
-> Note: this readme is for `1.
+> Note: this readme is for `1.2.0` gem version. CHANGELOG [here](CHANGELOG.md).
 
 Kimurai is a modern web scraping framework written in Ruby which **works out of box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
 
@@ -216,7 +216,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
 * All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
 * Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
 * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates-unique-helper) to skip duplicates
-* Automatically [
+* Automatically [handle requests errors](#handle-request-errors)
 * Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
 * Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
 * [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel`
@@ -242,6 +242,9 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
 * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
 * [Storage object](#storage-object)
 * [Persistence database for the storage](#persistence-database-for-the-storage)
+* [Handle request errors](#handle-request-errors)
+* [skip_request_errors](#skip_request_errors)
+* [retry_request_errors](#retry_request_errors)
 * [open_spider and close_spider callbacks](#open_spider-and-close_spider-callbacks)
 * [KIMURAI_ENV](#kimurai_env)
 * [Parallel crawling using in_parallel](#parallel-crawling-using-in_parallel)
@@ -860,6 +863,9 @@ ProductsSpider.crawl!(continue: true)
 
 Second approach is to automatically skip already processed items urls using `@config` `skip_duplicate_requests:` option:
 
+<details/>
+<summary>Check the code</summary>
+
 ```ruby
 class ProductsSpider < Kimurai::Base
   @start_urls = ["https://example-shop.com/"]
@@ -893,7 +899,29 @@ end
 # Run the spider with persistence database option:
 ProductsSpider.crawl!(continue: true)
 ```
+</details>
+
+### Handle request errors
+It is quite common that some pages of crawling website can return different response code than `200 ok`. In such cases, method `request_to` (or `browser.visit`) can raise an exception. Kimurai provides `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
+
+#### skip_request_errors
+You can automatically skip some of errors while requesting a page using `skip_request_errors` [config](#spider-config) option. If raised error matches one of the errors in the list, then this error will be caught, and request will be skipped. It is a good idea to skip errors like NotFound(404), etc.
+
+Format for the option: array where elements are error classes or/and hashes. You can use _hash_ format for more flexibility:
+
+```ruby
+@config = {
+  skip_request_errors: [{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }]
+}
+```
+In this case, provided `message:` will be compared with a full error message using `String#include?`. Also you can use regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
 
+#### retry_request_errors
+You can automatically retry some of errors with a few attempts while requesting a page using `retry_request_errors` [config](#spider-config) option. If raised error matches one of the errors in the list, then this error will be caught and the request will be processed again within a delay.
+
+There are 3 attempts: first: delay _15 sec_, second: delay _30 sec_, third: delay _45 sec_. If after 3 attempts there is still an exception, then the exception will be raised. It is a good idea to try to retry errros like `ReadTimeout`, `HTTPBadGateway`, etc.
+
+Format for the option: same like for `skip_request_errors` option.
 
 ### `open_spider` and `close_spider` callbacks
 
@@ -1340,6 +1368,11 @@ Kimurai.configure do |config|
   # Custom time zone (for logs):
   # config.time_zone = "UTC"
   # config.time_zone = "Europe/Moscow"
+
+  # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
+  # config.selenium_chrome_path = "/usr/bin/chromium-browser"
+  # Provide custom selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
+  # config.chromedriver_path = "~/.local/bin/chromedriver"
 end
 ```
 
@@ -1591,7 +1624,23 @@ end
     # works for all drivers
     skip_duplicate_requests: true,
 
-    #
+    # Automatically skip provided errors while requesting a page.
+    # If raised error matches one of the errors in the list, then this error will be caught,
+    # and request will be skipped.
+    # It is a good idea to skip errors like NotFound(404), etc.
+    # Format: array where elements are error classes or/and hashes. You can use hash format
+    # for more flexibility: `{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }`.
+    # Provided `message:` will be compared with a full error message using `String#include?`. Also
+    # you can use regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
+    skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
+
+    # Automatically retry provided errors with a few attempts while requesting a page.
+    # If raised error matches one of the errors in the list, then this error will be caught
+    # and the request will be processed again within a delay. There are 3 attempts:
+    # first: delay 15 sec, second: delay 30 sec, third: delay 45 sec.
+    # If after 3 attempts there is still an exception, then the exception will be raised.
+    # It is a good idea to try to retry errros like `ReadTimeout`, `HTTPBadGateway`, etc.
+    # Format: same like for `skip_request_errors` option.
     retry_request_errors: [Net::ReadTimeout],
 
     # Restart browser if one of the options is true:
@@ -1602,6 +1651,8 @@ end
       # Restart browser if provided requests limit is exceeded (works for all engines)
       requests_limit: 100
     },
+
+    # Perform several actions before each request:
     before_request: {
       # Change proxy before each request. The `proxy:` option above should be presented
       # and has lambda format. Works only for poltergeist and mechanize engines
data/lib/kimurai.rb
CHANGED
@@ -10,6 +10,7 @@ require_relative 'kimurai/version'
 require_relative 'kimurai/core_ext/numeric'
 require_relative 'kimurai/core_ext/string'
 require_relative 'kimurai/core_ext/array'
+require_relative 'kimurai/core_ext/hash'
 
 require_relative 'kimurai/browser_builder'
 require_relative 'kimurai/base_helper'
data/lib/kimurai/base.rb
CHANGED
@@ -3,6 +3,9 @@ require_relative 'base/storage'
 
 module Kimurai
   class Base
+    # don't deep merge config's headers hash option
+    DMERGE_EXCLUDE = [:headers]
+
     LoggerFormatter = proc do |severity, datetime, progname, msg|
       current_thread_id = Thread.current.object_id
       thread_type = Thread.main == Thread.current ? "M" : "C"
@@ -77,7 +80,11 @@ module Kimurai
     end
 
     def self.config
-      superclass.equal?(::Object)
+      if superclass.equal?(::Object)
+        @config
+      else
+        superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+      end
     end
 
     ###
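Reviewer note: `deep_merge_excl` is defined in the new `core_ext/hash.rb`, whose body does not appear in this diff. The sketch below is only an assumption about its behaviour, inferred from the changelog entry "Do not deep merge config's `headers:` option". The method names `deep_merge_sketch` and `deep_merge_excl_sketch` are hypothetical and are not the gem's API.

```ruby
# Hypothetical stand-in for Hash#deep_merge_excl; NOT the gem's implementation.
# Nested hashes are merged recursively, except for keys listed in `exclude`,
# whose values from `other` replace the base value wholesale.
def deep_merge_sketch(base, other)
  base.merge(other) do |_key, old_val, new_val|
    old_val.is_a?(Hash) && new_val.is_a?(Hash) ? deep_merge_sketch(old_val, new_val) : new_val
  end
end

def deep_merge_excl_sketch(base, other, exclude)
  excluded = other.select { |key, _| exclude.include?(key) }
  rest     = other.reject { |key, _| exclude.include?(key) }
  deep_merge_sketch(base, rest).merge(excluded)
end

parent = { headers: { "Referer" => "https://example.com" }, restart_if: { requests_limit: 100 } }
child  = { headers: { "User-Agent" => "sketch" }, restart_if: { memory_limit: 350_000 } }

merged = deep_merge_excl_sketch(parent, child, [:headers])
puts merged.inspect
```

Under this assumption a child spider's `headers:` hash replaces the parent's instead of being combined with it, while options such as `restart_if:` are still merged key by key.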
@@ -90,7 +97,7 @@ module Kimurai
       end
     end
 
-    def self.crawl!(continue: false)
+    def self.crawl!(continue: false, exception_on_fail: true)
       logger.error "Spider: already running: #{name}" and return false if running?
 
       storage_path =
@@ -118,19 +125,23 @@ module Kimurai
       spider.with_info = true
       if start_urls
         start_urls.each do |start_url|
-
+          if start_url.class == Hash
+            spider.request_to(:parse, start_url)
+          else
+            spider.request_to(:parse, url: start_url)
+          end
         end
       else
         spider.parse
       end
     rescue StandardError, SignalException, SystemExit => e
       @run_info.merge!(status: :failed, error: e.inspect)
-      raise e
+      exception_on_fail ? raise(e) : [@run_info, e]
     else
       @run_info.merge!(status: :completed)
     ensure
       if spider
-        spider.browser.destroy_driver!
+        spider.browser.destroy_driver! if spider.instance_variable_get("@browser")
 
         stop_time = Time.now
         total_time = (stop_time - @run_info[:start_time]).round(3)
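Reviewer note: the hunk above lets `start_urls` mix plain strings with request hashes. The sketch below mimics that dispatch with stand-in methods. `request_args_for` and `request_to_sketch` are hypothetical names, and the explicit `**` splat is an addition of this sketch: the diff passes the hash positionally, which relies on Ruby 2.x converting a trailing hash to keyword arguments and would not bind to `url:` on Ruby 3.

```ruby
# Normalise one start URL entry into the keyword arguments request_to expects.
def request_args_for(start_url)
  start_url.is_a?(Hash) ? start_url : { url: start_url }
end

# Stand-in with the same parameter shape as Base#request_to in the diff
# (response_type omitted); it only echoes what it received.
def request_to_sketch(handler, delay = nil, url:, data: {})
  { handler: handler, delay: delay, url: url, data: data }
end

start_urls = [
  "https://example.com/",
  { url: "https://example.com/cat?id=1", data: { category: "First Category" } }
]

calls = start_urls.map { |entry| request_to_sketch(:parse, **request_args_for(entry)) }
calls.each { |call| puts call.inspect }
```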
@@ -168,7 +179,7 @@ module Kimurai
 
     def initialize(engine = self.class.engine, config: {})
       @engine = engine
-      @config = self.class.config.
+      @config = self.class.config.deep_merge_excl(config, DMERGE_EXCLUDE)
       @pipelines = self.class.pipelines.map do |pipeline_name|
         klass = Pipeline.descendants.find { |kl| kl.name == pipeline_name }
         instance = klass.new
@@ -184,15 +195,16 @@ module Kimurai
       @browser ||= BrowserBuilder.build(@engine, @config, spider: self)
     end
 
-    def request_to(handler, delay = nil, url:, data: {})
+    def request_to(handler, delay = nil, url:, data: {}, response_type: :html)
       if @config[:skip_duplicate_requests] && !unique_request?(url)
         add_event(:duplicate_requests) if self.with_info
-        logger.warn "Spider: request_to:
+        logger.warn "Spider: request_to: not unique url: #{url}, skipped" and return
       end
 
-
-
-
+      visited = delay ? browser.visit(url, delay: delay) : browser.visit(url)
+      return unless visited
+
+      public_send(handler, browser.current_response(response_type), { url: url, data: data })
     end
 
     def console(response = nil, url: nil, data: {})
@@ -285,18 +297,22 @@ module Kimurai
         all << Thread.new(part) do |part|
           Thread.current.abort_on_exception = true
 
-          spider = self.class.new(engine, config: config)
+          spider = self.class.new(engine, config: @config.deep_merge_excl(config, DMERGE_EXCLUDE))
           spider.with_info = true if self.with_info
 
           part.each do |url_data|
             if url_data.class == Hash
-
+              if url_data[:url].present? && url_data[:data].present?
+                spider.request_to(handler, delay, url_data)
+              else
+                spider.public_send(handler, url_data)
+              end
             else
               spider.request_to(handler, delay, url: url_data, data: data)
             end
           end
         ensure
-          spider.browser.destroy_driver!
+          spider.browser.destroy_driver! if spider.instance_variable_get("@browser")
         end
 
         sleep 0.5
data/lib/kimurai/base/saver.rb
CHANGED
@@ -42,7 +42,7 @@ module Kimurai
       def save_to_json(item)
         data = JSON.generate([item])
 
-        if
+        if @index > 1 || append && File.exists?(path)
           file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
           File.open(path, "w") do |f|
             f.write(file_content + data.sub(/\A\[/, ""))
@@ -55,7 +55,7 @@ module Kimurai
       def save_to_pretty_json(item)
         data = JSON.pretty_generate([item])
 
-        if
+        if @index > 1 || append && File.exists?(path)
           file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
           File.open(path, "w") do |f|
             f.write(file_content + data.sub(/\A\[\n/, ""))
@@ -68,7 +68,7 @@ module Kimurai
       def save_to_jsonlines(item)
         data = JSON.generate(item)
 
-        if
+        if @index > 1 || append && File.exists?(path)
           File.open(path, "a") { |file| file.write("\n" + data) }
         else
           File.open(path, "w") { |file| file.write(data) }
@@ -78,7 +78,7 @@ module Kimurai
       def save_to_csv(item)
         data = flatten_hash(item)
 
-        if
+        if @index > 1 || append && File.exists?(path)
           CSV.open(path, "a+", force_quotes: true) do |csv|
             csv << data.values
           end
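Reviewer note: the new condition mixes `||` and `&&` without parentheses. Since `&&` binds tighter than `||` in Ruby, it reads as `@index > 1 || (append && File.exists?(path))`. The sketch below checks that grouping over every input combination; the method names are hypothetical, and the file check is replaced by a boolean so nothing touches the disk. Separately, `File.exists?` and `Dir.exists?` were removed in Ruby 3.2, where `File.exist?` and `Dir.exist?` are the spellings to use.

```ruby
# Condition as written in the diff, with the file check passed in as a boolean.
def append_branch?(index, append, file_exists)
  index > 1 || append && file_exists
end

# The same condition with the implicit grouping made explicit.
def append_branch_explicit?(index, append, file_exists)
  index > 1 || (append && file_exists)
end

combinations = [1, 2].product([true, false], [true, false])
same = combinations.all? do |index, append, exists|
  append_branch?(index, append, exists) == append_branch_explicit?(index, append, exists)
end
puts "groupings agree: #{same}"
```

For the first item with `append` set and no file on disk, the condition is false and the saver takes the branch that creates the file, which appears to be the case the changelog fix refers to.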
data/lib/kimurai/base/storage.rb
CHANGED
@@ -40,11 +40,21 @@ module Kimurai
         if path
           database.transaction do
             database[scope] ||= []
-
+            if value.class == Array
+              database[scope] += value
+              database[scope].uniq!
+            else
+              database[scope].push(value) unless database[scope].include?(value)
+            end
           end
         else
           database[scope] ||= []
-
+          if value.class == Array
+            database[scope] += value
+            database[scope].uniq!
+          else
+            database[scope].push(value) unless database[scope].include?(value)
+          end
         end
       end
     end
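Reviewer note: both branches of `Storage#add` now apply the same rule: an array is concatenated and de-duplicated, a single value is pushed only if absent. A minimal sketch of that rule on a plain Hash follows; `add_to_scope` is a hypothetical helper, and a Hash stands in for the storage database.

```ruby
# Mirrors the add logic from the diff on a plain Hash.
def add_to_scope(database, scope, value)
  database[scope] ||= []
  if value.is_a?(Array)
    database[scope] += value
    database[scope].uniq!   # uniq! returns nil when nothing was removed
  else
    database[scope].push(value) unless database[scope].include?(value)
  end
  database[scope]           # return the list explicitly, never uniq!'s nil
end

db = {}
add_to_scope(db, :urls, "a")
add_to_scope(db, :urls, ["a", "b", "b", "c"])
add_to_scope(db, :urls, "c")
puts db.inspect
```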
data/lib/kimurai/browser_builder/mechanize_builder.rb
CHANGED
@@ -80,9 +80,15 @@ module Kimurai
         end
 
         # Browser instance options
+        # skip_request_errors
+        if skip_errors = @config[:skip_request_errors].presence
+          @browser.config.skip_request_errors = skip_errors
+          logger.debug "BrowserBuilder (mechanize): enabled skip_request_errors"
+        end
+
         # retry_request_errors
-        if
-          @browser.config.retry_request_errors =
+        if retry_errors = @config[:retry_request_errors].presence
+          @browser.config.retry_request_errors = retry_errors
           logger.debug "BrowserBuilder (mechanize): enabled retry_request_errors"
         end
 
data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb
CHANGED
@@ -91,9 +91,15 @@ module Kimurai
         end
 
         # Browser instance options
+        # skip_request_errors
+        if skip_errors = @config[:skip_request_errors].presence
+          @browser.config.skip_request_errors = skip_errors
+          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled skip_request_errors"
+        end
+
         # retry_request_errors
-        if
-          @browser.config.retry_request_errors =
+        if retry_errors = @config[:retry_request_errors].presence
+          @browser.config.retry_request_errors = retry_errors
           logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled retry_request_errors"
         end
 
data/lib/kimurai/browser_builder/selenium_chrome_builder.rb
CHANGED
@@ -23,8 +23,15 @@ module Kimurai
         # Register driver
         Capybara.register_driver :selenium_chrome do |app|
           # Create driver options
-
-
+          opts = { args: %w[--disable-gpu --no-sandbox --disable-translate] }
+
+          # Provide custom chrome browser path:
+          if chrome_path = Kimurai.configuration.selenium_chrome_path
+            opts.merge!(binary: chrome_path)
+          end
+
+          # See all options here: https://seleniumhq.github.io/selenium/docs/api/rb/Selenium/WebDriver/Chrome/Options.html
+          driver_options = Selenium::WebDriver::Chrome::Options.new(opts)
 
           # Window size
           if size = @config[:window_size].presence
@@ -99,7 +106,8 @@ module Kimurai
             end
           end
 
-
+          chromedriver_path = Kimurai.configuration.chromedriver_path || "/usr/local/bin/chromedriver"
+          Capybara::Selenium::Driver.new(app, browser: :chrome, options: driver_options, driver_path: chromedriver_path)
         end
 
         # Create browser instance (Capybara session)
@@ -118,9 +126,15 @@ module Kimurai
         end
 
         # Browser instance options
+        # skip_request_errors
+        if skip_errors = @config[:skip_request_errors].presence
+          @browser.config.skip_request_errors = skip_errors
+          logger.debug "BrowserBuilder (selenium_chrome): enabled skip_request_errors"
+        end
+
         # retry_request_errors
-        if
-          @browser.config.retry_request_errors =
+        if retry_errors = @config[:retry_request_errors].presence
+          @browser.config.retry_request_errors = retry_errors
           logger.debug "BrowserBuilder (selenium_chrome): enabled retry_request_errors"
         end
 
data/lib/kimurai/browser_builder/selenium_firefox_builder.rb
CHANGED
@@ -131,9 +131,15 @@ module Kimurai
         end
 
         # Browser instance options
+        # skip_request_errors
+        if skip_errors = @config[:skip_request_errors].presence
+          @browser.config.skip_request_errors = skip_errors
+          logger.debug "BrowserBuilder (selenium_firefox): enabled skip_request_errors"
+        end
+
         # retry_request_errors
-        if
-          @browser.config.retry_request_errors =
+        if retry_errors = @config[:retry_request_errors].presence
+          @browser.config.retry_request_errors = retry_errors
           logger.debug "BrowserBuilder (selenium_firefox): enabled retry_request_errors"
         end
 
data/lib/kimurai/capybara_ext/session.rb
CHANGED
@@ -1,5 +1,6 @@
 require 'capybara'
 require 'nokogiri'
+require 'json'
 require_relative 'session/config'
 
 module Capybara
@@ -18,21 +19,30 @@ module Capybara
       spider.class.update(:visits, :requests) if spider.with_info
 
       original_visit(visit_uri)
-    rescue
-
-
-
-
-
-
+    rescue => e
+      if match_error?(e, type: :to_skip)
+        logger.error "Browser: skip request error: #{e.inspect}, url: #{visit_uri}"
+        spider.add_event(:requests_errors, e.inspect) if spider.with_info
+        false
+      elsif match_error?(e, type: :to_retry)
+        logger.error "Browser: retry request error: #{e.inspect}, url: #{visit_uri}"
+        spider.add_event(:requests_errors, e.inspect) if spider.with_info
+
+        if (retries += 1) <= max_retries
+          logger.info "Browser: sleep #{(sleep_interval += 15)} seconds and process retry № #{retries} to the url: #{visit_uri}"
+          sleep sleep_interval and retry
+        else
+          logger.error "Browser: all retries (#{retries - 1}) to the url #{visit_uri} are gone"
+          raise e
+        end
       else
-        logger.error "Browser: all retries (#{retries - 1}) to the url `#{visit_uri}` are gone"
         raise e
       end
     else
       driver.responses += 1 and logger.info "Browser: finished get request to: #{visit_uri}"
       spider.class.update(:visits, :responses) if spider.with_info
       driver.visited = true unless driver.visited
+      true
     ensure
       if spider.with_info
         logger.info "Info: visits: requests: #{spider.class.visits[:requests]}, responses: #{spider.class.visits[:responses]}"
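Reviewer note: the retry branch above grows the delay by 15 seconds per attempt and re-raises once the retry budget is spent. The sketch below isolates that pattern. `visit_with_retries` and the injected `sleeper` are hypothetical, and the starting values (`retries = 0`, `sleep_interval = 0`, `max_retries = 3`) are assumptions taken from the README text, since their initialisation is outside this hunk.

```ruby
# Retry a block with a linearly growing delay (15, 30, 45, ...).
# `sleeper` is injectable so the example does not actually wait.
def visit_with_retries(max_retries: 3, sleeper: ->(seconds) { sleep(seconds) })
  retries = 0
  sleep_interval = 0
  begin
    yield
  rescue => e
    if (retries += 1) <= max_retries
      sleeper.call(sleep_interval += 15)
      retry
    else
      raise e
    end
  end
end

delays = []
attempts = 0
result = visit_with_retries(sleeper: ->(seconds) { delays << seconds }) do
  attempts += 1
  raise IOError, "timeout" if attempts < 4
  :ok
end
puts "result=#{result.inspect} attempts=#{attempts} delays=#{delays.inspect}"
```

With these assumed values a request is tried at most four times in total: the initial attempt plus three retries.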
@@ -75,8 +85,13 @@ module Capybara
       logger.info "Browser: driver has been restarted: name: #{mode}, pid: #{driver.pid}, port: #{driver.port}"
     end
 
-    def current_response
-
+    def current_response(response_type = :html)
+      case response_type
+      when :html
+        Nokogiri::HTML(body)
+      when :json
+        JSON.parse(body)
+      end
     end
 
     ###
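Reviewer note: `current_response` now switches on `response_type`, and the `case` has no `else`, so any other value yields `nil`. The sketch below shows that shape using only the standard library. `parse_response` is a hypothetical stand-in, and its `:html` branch returns the raw body instead of calling `Nokogiri::HTML`, to avoid the gem dependency.

```ruby
require "json"

# Stand-in for Session#current_response; the :html branch is simplified.
def parse_response(body, response_type = :html)
  case response_type
  when :html then body              # the gem calls Nokogiri::HTML(body) here
  when :json then JSON.parse(body)  # raises JSON::ParserError on invalid JSON
  end
end

puts parse_response('{"id": 1, "tags": ["a"]}', :json).inspect
puts parse_response("<p>hi</p>").inspect
puts parse_response("<p>hi</p>", :xml).inspect
```

A handler that requests `response_type: :json` therefore receives a Hash or Array rather than a Nokogiri document.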
@@ -114,6 +129,27 @@ module Capybara
 
     private
 
+    def match_error?(e, type:)
+      errors = (type == :to_retry ? config.retry_request_errors : config.skip_request_errors)
+      if errors.present?
+        errors.any? do |error|
+          if error.class == Hash
+            match = if error[:message].class == Regexp
+              e.message&.match?(error[:message])
+            else
+              e.message&.include?(error[:message])
+            end
+
+            e.class == error[:error] && match
+          else
+            e.class == error
+          end
+        end
+      else
+        false
+      end
+    end
+
     def process_delay(delay)
       interval = (delay.class == Range ? rand(delay) : delay)
       logger.debug "Browser: sleep #{interval.round(2)} #{'second'.pluralize(interval)} before request..."
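Reviewer note: `match_error?` compares classes with `==`, so going by this hunk alone an entry matches only the exact error class, and `error:` has to be the class constant. The README prose in this release writes `error: "RuntimeError"` as a string, while the config examples use the constant `RuntimeError`; with the code as shown, the string form would not match. The sketch below reproduces the logic with a hypothetical `match_error_sketch?` and an illustrative error message.

```ruby
# Same matching rules as the diff, taking the error list as an argument.
def match_error_sketch?(e, errors)
  errors.any? do |error|
    if error.is_a?(Hash)
      matched = if error[:message].is_a?(Regexp)
        e.message&.match?(error[:message])
      else
        e.message&.include?(error[:message])
      end
      e.class == error[:error] && matched
    else
      e.class == error
    end
  end
end

not_found = RuntimeError.new("404 => Net::HTTPNotFound for https://example.com/missing")

puts match_error_sketch?(not_found, [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }])
puts match_error_sketch?(not_found, [{ error: RuntimeError, message: /404|403/ }])
puts match_error_sketch?(not_found, [{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }])
puts match_error_sketch?(Class.new(RuntimeError).new("404"), [RuntimeError])
```

The last call also shows that subclasses of a listed error class are not matched, because the comparison is by equality rather than `is_a?`.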
data/lib/kimurai/capybara_ext/session/config.rb
CHANGED
@@ -1,12 +1,16 @@
 module Capybara
   class SessionConfig
     attr_accessor :cookies, :proxy, :user_agent
-    attr_writer :retry_request_errors
+    attr_writer :retry_request_errors, :skip_request_errors
 
     def retry_request_errors
       @retry_request_errors ||= []
     end
 
+    def skip_request_errors
+      @skip_request_errors ||= []
+    end
+
     def restart_if
       @restart_if ||= {}
     end
data/lib/kimurai/cli.rb
CHANGED
@@ -146,10 +146,23 @@ module Kimurai
       puts VERSION
     end
 
+    desc "dashboard", "Run dashboard"
+    def dashboard
+      raise "Can't find Kimurai project" unless inside_project?
+
+      require './config/boot'
+      if Object.const_defined?("Kimurai::Dashboard")
+        require 'kimurai/dashboard/app'
+        Kimurai::Dashboard::App.run!
+      else
+        raise "Kimurai::Dashboard is not defined"
+      end
+    end
+
     private
 
     def inside_project?
-      Dir.exists?
+      Dir.exists?("spiders") && File.exists?("./config/boot.rb")
     end
   end
 end
data/lib/kimurai/runner.rb
CHANGED
@@ -2,61 +2,43 @@ require 'pmap'
 
 module Kimurai
   class Runner
-    attr_reader :jobs, :spiders
+    attr_reader :jobs, :spiders, :session_info
 
     def initialize(parallel_jobs:)
       @jobs = parallel_jobs
       @spiders = Kimurai.list
+      @start_time = Time.now
 
-
-
-      end
-    end
-
-    def run!
-      start_time = Time.now
-      run_id = start_time.to_i
-      running_pids = []
-
-      ENV.store("RBCAT_COLORIZER", "false")
-
-      run_info = {
-        id: run_id,
+      @session_info = {
+        id: @start_time.to_i,
         status: :processing,
-        start_time: start_time,
+        start_time: @start_time,
         stop_time: nil,
         environment: Kimurai.env,
-        concurrent_jobs: jobs,
-        spiders: spiders.keys
+        concurrent_jobs: @jobs,
+        spiders: @spiders.keys
       }
 
-
-
-
-        # Kill currently running spiders
-        running_pids.each { |pid| Process.kill("INT", pid) }
-
-        error = $!
-        stop_time = Time.now
+      if time_zone = Kimurai.configuration.time_zone
+        Kimurai.time_zone = time_zone
+      end
 
-
-
-
-          run_info.merge!(status: :failed, error: error.inspect, stop_time: stop_time)
-        end
+      ENV.store("SESSION_ID", @start_time.to_i.to_s)
+      ENV.store("RBCAT_COLORIZER", "false")
+    end
 
-
-
-        end
-        puts "<<< Runner: stopped: #{run_info}"
-      end
+    def run!(exception_on_fail: true)
+      running_pids = []
 
-      puts ">>> Runner: started: #{
+      puts ">>> Runner: started: #{session_info}"
       if at_start_callback = Kimurai.configuration.runner_at_start_callback
-        at_start_callback.call(
+        at_start_callback.call(session_info)
       end
 
+      running = true
       spiders.peach_with_index(jobs) do |spider, i|
+        next unless running
+
         spider_name = spider[0]
         puts "> Runner: started spider: #{spider_name}, index: #{i}"
 
@@ -67,6 +49,23 @@ module Kimurai
         running_pids.delete(pid)
         puts "< Runner: stopped spider: #{spider_name}, index: #{i}"
       end
+    rescue StandardError, SignalException, SystemExit => e
+      running = false
+      session_info.merge!(status: :failed, error: e.inspect, stop_time: Time.now)
+      exception_on_fail ? raise(e) : [session_info, e]
+    else
+      session_info.merge!(status: :completed, stop_time: Time.now)
+    ensure
+      running = false
+      Thread.list.each { |t| t.kill if t != Thread.main }
+
+      # Kill currently running spiders (if any, in case of fail)
+      running_pids.each { |pid| Process.kill("INT", pid) }
+
+      if at_stop_callback = Kimurai.configuration.runner_at_stop_callback
+        at_stop_callback.call(session_info)
+      end
+      puts "<<< Runner: stopped: #{session_info}"
     end
   end
 end
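Reviewer note: `Runner#run!` and `Base.crawl!` now share a `rescue` / `else` / `ensure` shape controlled by `exception_on_fail`. The sketch below shows what a caller gets back in each case. `run_sketch` and its `log` argument are hypothetical, and only `StandardError` is rescued here, whereas the diff also rescues `SignalException` and `SystemExit`.

```ruby
# Mirrors the control flow of run!/crawl!: the method's value comes from the
# rescue or else clause; the ensure clause always runs but its value is discarded.
def run_sketch(log, exception_on_fail: true)
  info = { status: :processing }
  begin
    yield
  rescue StandardError => e
    info.merge!(status: :failed, error: e.inspect)
    exception_on_fail ? raise(e) : [info, e]
  else
    info.merge!(status: :completed)
  ensure
    log << :cleanup
  end
end

log = []
completed = run_sketch(log) { :work }
info, error = run_sketch(log, exception_on_fail: false) { raise "boom" }

puts completed.inspect
puts info.inspect
puts error.message
puts log.inspect
```

One consequence visible in the sketch: a successful run returns the info hash, while a failed run with `exception_on_fail: false` returns a two-element array, so callers need to handle both shapes.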
data/lib/kimurai/template/config/application.rb
CHANGED
@@ -29,4 +29,9 @@ Kimurai.configure do |config|
   # json = JSON.pretty_generate(info)
   # Sender.send_notification("Stopped session: #{json}")
   # end
+
+  # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
+  # config.selenium_chrome_path = "/usr/bin/chromium-browser"
+  # Provide custom selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
+  # config.chromedriver_path = "/usr/local/bin/chromedriver"
 end
data/lib/kimurai/template/spiders/application_spider.rb
CHANGED
@@ -81,7 +81,23 @@ class ApplicationSpider < Kimurai::Base
     # works for all drivers
     # skip_duplicate_requests: true,
 
-    #
+    # Automatically skip provided errors while requesting a page.
+    # If raised error matches one of the errors in the list, then this error will be caught,
+    # and request will be skipped.
+    # It is a good idea to skip errors like NotFound(404), etc.
+    # Format: array where elements are error classes or/and hashes. You can use hash format
+    # for more flexibility: `{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }`.
+    # Provided `message:` will be compared with a full error message using `String#include?`. Also
+    # you can use regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
+    # skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
+
+    # Automatically retry provided errors with a few attempts while requesting a page.
+    # If raised error matches one of the errors in the list, then this error will be caught
+    # and the request will be processed again within a delay. There are 3 attempts:
+    # first: delay 15 sec, second: delay 30 sec, third: delay 45 sec.
+    # If after 3 attempts there is still an exception, then the exception will be raised.
+    # It is a good idea to try to retry errros like `ReadTimeout`, `HTTPBadGateway`, etc.
+    # Format: same like for `skip_request_errors` option.
     # retry_request_errors: [Net::ReadTimeout],
 
     # Restart browser if one of the options is true:
@@ -92,6 +108,8 @@ class ApplicationSpider < Kimurai::Base
       # Restart browser if provided requests limit is exceeded (works for all engines)
       # requests_limit: 100
     },
+
+    # Perform several actions before each request:
     before_request: {
       # Change proxy before each request. The `proxy:` option above should be presented
       # and has lambda format. Works only for poltergeist and mechanize engines
data/lib/kimurai/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: kimurai
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.2.0
 platform: ruby
 authors:
 - Victor Afanasev
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2018-09-
+date: 2018-09-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: thor
@@ -265,7 +265,6 @@ files:
 - ".gitignore"
 - ".travis.yml"
 - CHANGELOG.md
-- CODE_OF_CONDUCT.md
 - Gemfile
 - LICENSE.txt
 - README.md
@@ -301,6 +300,7 @@ files:
 - lib/kimurai/cli/ansible_command_builder.rb
 - lib/kimurai/cli/generator.rb
 - lib/kimurai/core_ext/array.rb
+- lib/kimurai/core_ext/hash.rb
 - lib/kimurai/core_ext/numeric.rb
 - lib/kimurai/core_ext/string.rb
 - lib/kimurai/pipeline.rb
data/CODE_OF_CONDUCT.md
DELETED
@@ -1,74 +0,0 @@
|
|
1
|
-
# Contributor Covenant Code of Conduct
|
2
|
-
|
3
|
-
## Our Pledge
|
4
|
-
|
5
|
-
In the interest of fostering an open and welcoming environment, we as
|
6
|
-
contributors and maintainers pledge to making participation in our project and
|
7
|
-
our community a harassment-free experience for everyone, regardless of age, body
|
8
|
-
size, disability, ethnicity, gender identity and expression, level of experience,
|
9
|
-
nationality, personal appearance, race, religion, or sexual identity and
|
10
|
-
orientation.
|
11
|
-
|
12
|
-
## Our Standards
|
13
|
-
|
14
|
-
Examples of behavior that contributes to creating a positive environment
|
15
|
-
include:
|
16
|
-
|
17
|
-
* Using welcoming and inclusive language
|
18
|
-
* Being respectful of differing viewpoints and experiences
|
19
|
-
* Gracefully accepting constructive criticism
|
20
|
-
* Focusing on what is best for the community
|
21
|
-
* Showing empathy towards other community members
|
22
|
-
|
23
|
-
Examples of unacceptable behavior by participants include:
|
24
|
-
|
25
|
-
* The use of sexualized language or imagery and unwelcome sexual attention or
|
26
|
-
advances
|
27
|
-
* Trolling, insulting/derogatory comments, and personal or political attacks
|
28
|
-
* Public or private harassment
|
29
|
-
* Publishing others' private information, such as a physical or electronic
|
30
|
-
address, without explicit permission
|
31
|
-
* Other conduct which could reasonably be considered inappropriate in a
|
32
|
-
professional setting
|
33
|
-
|
34
|
-
## Our Responsibilities
|
35
|
-
|
36
|
-
Project maintainers are responsible for clarifying the standards of acceptable
|
37
|
-
behavior and are expected to take appropriate and fair corrective action in
|
38
|
-
response to any instances of unacceptable behavior.
|
39
|
-
|
40
|
-
Project maintainers have the right and responsibility to remove, edit, or
|
41
|
-
reject comments, commits, code, wiki edits, issues, and other contributions
|
42
|
-
that are not aligned to this Code of Conduct, or to ban temporarily or
|
43
|
-
permanently any contributor for other behaviors that they deem inappropriate,
|
44
|
-
threatening, offensive, or harmful.
|
45
|
-
|
46
|
-
## Scope
|
47
|
-
|
48
|
-
This Code of Conduct applies both within project spaces and in public spaces
|
49
|
-
when an individual is representing the project or its community. Examples of
|
50
|
-
representing a project or community include using an official project e-mail
|
51
|
-
address, posting via an official social media account, or acting as an appointed
|
52
|
-
representative at an online or offline event. Representation of a project may be
|
53
|
-
further defined and clarified by project maintainers.
|
54
|
-
|
55
|
-
## Enforcement
|
56
|
-
|
57
|
-
Instances of abusive, harassing, or otherwise unacceptable behavior may be
|
58
|
-
reported by contacting the project team at vicfreefly@gmail.com. All
|
59
|
-
complaints will be reviewed and investigated and will result in a response that
|
60
|
-
is deemed necessary and appropriate to the circumstances. The project team is
|
61
|
-
obligated to maintain confidentiality with regard to the reporter of an incident.
|
62
|
-
Further details of specific enforcement policies may be posted separately.
|
63
|
-
|
64
|
-
Project maintainers who do not follow or enforce the Code of Conduct in good
|
65
|
-
faith may face temporary or permanent repercussions as determined by other
|
66
|
-
members of the project's leadership.
|
67
|
-
|
68
|
-
## Attribution
|
69
|
-
|
70
|
-
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
|
71
|
-
available at [http://contributor-covenant.org/version/1/4][version]
|
72
|
-
|
73
|
-
[homepage]: http://contributor-covenant.org
|
74
|
-
[version]: http://contributor-covenant.org/version/1/4/
|