kimurai 1.1.0 → 1.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 67ee49692e64813bc980eb7562b711d5b5d2c47b50a995acb4759709703da0f9
-   data.tar.gz: baba361bc5039d303ae4a6c9a1dd2109368f8e4c7a641d0a782cfc6a7776ade4
+   metadata.gz: 2ebb62d0916ee55eb8ec05f34b5f5909b99aae8fa2236dd93169f6e5e1221805
+   data.tar.gz: da86497a6c4f61f2ff1ffe3e8eeee92acb203305e712b9bd4baaa5794fdcf5f5
  SHA512:
-   metadata.gz: 0173d3859b5f8776371fad454ff5575fdc453aa3c6038d8a8399651c46c5eaae789273772227ea014b6ce39b13586e6805bd7f69156eafeacf653804f954003c
-   data.tar.gz: b05889c0cb030aed06fe1df5cc5411154d24019667e1f00f9f4248d598fc93990f86a4aae78430af3140f3dc3989e856cfc3e2316f455984c898442fccad15db
+   metadata.gz: 9d54d6074928a5bc0aa0a9d7e64308942e21ad5ae17788b2fa36adfffe88f67df0c533558ced412f2e22481a1664661c547419d09451d1260dd6ebd14ca4d915
+   data.tar.gz: d21f499a2a292dd672480d15da71742cfa82dec054b6a3c3a3c756e6cd2c98e58ac7d3b7fa3fe0ce6ae2e956d46417269242e742f8d83b644176ccef0822de75
CHANGELOG.md CHANGED
@@ -1,4 +1,20 @@
  # CHANGELOG
+ ## 1.2.0
+ ### New
+ * Add the ability to add an array of values to the storage (`Base::Storage#add`)
+ * Add `exception_on_fail` option to `Base.crawl!`
+ * Add the ability to pass a request hash to `start_urls` (an array of hashes works as well, e.g. `@start_urls = [{ url: "https://example.com/cat?id=1", data: { category: "First Category" } }]`)
+ * Implement `skip_request_errors` config feature. Added [Handle request errors](https://github.com/vifreefly/kimuraframework#handle-request-errors) chapter to the README.
+ * Add an option to choose the response type for `Session#current_response` (`:html` by default, or `:json`)
+ * Add an option to provide custom Chrome and chromedriver paths
+
+ ### Improvements
+ * Refactor `Runner`
+
+ ### Fixes
+ * Fix `Base::Saver` (automatically create the file if it doesn't exist when using a persistence database)
+ * Do not deep merge the config's `headers:` option
+
  ## 1.1.0
  ### Breaking changes 1.1.0
  The `browser` config option is deprecated. All sub-options that were inside `browser` should now be placed directly into the `@config` hash, without the `browser` parent key. Example:
data/README.md CHANGED
@@ -18,7 +18,7 @@
 
  <br>
 
- > Note: this readme is for `1.1.0` gem version. CHANGELOG [here](CHANGELOG.md).
+ > Note: this readme is for the `1.2.0` gem version. CHANGELOG [here](CHANGELOG.md).
 
  Kimurai is a modern web scraping framework written in Ruby which **works out of the box with Headless Chromium/Firefox, PhantomJS**, or plain HTTP requests, and **allows you to scrape and interact with JavaScript-rendered websites.**
 
@@ -216,7 +216,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
  * All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
  * Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
  * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates-unique-helper) to skip duplicates
- * Automatically [retry failed requests](#configuration-options) with delay
+ * Automatically [handle request errors](#handle-request-errors)
  * Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
  * Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
  * [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel`
@@ -242,6 +242,9 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
  * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
  * [Storage object](#storage-object)
  * [Persistence database for the storage](#persistence-database-for-the-storage)
+ * [Handle request errors](#handle-request-errors)
+   * [skip_request_errors](#skip_request_errors)
+   * [retry_request_errors](#retry_request_errors)
  * [open_spider and close_spider callbacks](#open_spider-and-close_spider-callbacks)
  * [KIMURAI_ENV](#kimurai_env)
  * [Parallel crawling using in_parallel](#parallel-crawling-using-in_parallel)
@@ -860,6 +863,9 @@ ProductsSpider.crawl!(continue: true)
 
  The second approach is to automatically skip already-processed item urls using the `@config` option `skip_duplicate_requests:`:
 
+ <details>
+ <summary>Check the code</summary>
+
  ```ruby
  class ProductsSpider < Kimurai::Base
    @start_urls = ["https://example-shop.com/"]
@@ -893,7 +899,29 @@ end
  # Run the spider with persistence database option:
  ProductsSpider.crawl!(continue: true)
  ```
+ </details>
+
+ ### Handle request errors
+ It is quite common for some pages of a website being crawled to return a response code other than `200 OK`. In such cases the `request_to` method (or `browser.visit`) can raise an exception. Kimurai provides the `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
+
+ #### skip_request_errors
+ You can automatically skip certain errors raised while requesting a page using the `skip_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, the error will be caught and the request skipped. It is a good idea to skip errors like `404 Not Found`.
+
+ The option format is an array whose elements are error classes and/or hashes. You can use the _hash_ format for more flexibility:
+
+ ```ruby
+ @config = {
+   skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }]
+ }
+ ```
+ In this case, the provided `message:` is compared against the full error message using `String#include?`. You can also use a regex instead: `{ error: RuntimeError, message: /404|403/ }`.
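
A minimal spider using this option might look like the following sketch (the spider name, engine, and URL are hypothetical, not from the source):

```ruby
class ProductsSpider < Kimurai::Base
  @name = "products_spider"
  @engine = :mechanize
  @start_urls = ["https://example-shop.com/"] # hypothetical URL

  @config = {
    # Skip 404 responses instead of failing the whole crawl:
    skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }]
  }

  def parse(response, url:, data: {})
    # 404 pages are skipped; this handler only sees successful responses
  end
end
```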
 
+ #### retry_request_errors
+ You can automatically retry certain errors a few times while requesting a page using the `retry_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, the error will be caught and the request retried after a delay.
+
+ There are 3 attempts: the first with a _15 sec_ delay, the second _30 sec_, and the third _45 sec_. If an exception is still raised after 3 attempts, it will be re-raised. It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
+
+ Format for the option: the same as for `skip_request_errors`.
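
The retry schedule described above can be sketched as a standalone snippet (a simplified illustration, not Kimurai's actual implementation):

```ruby
# Delays grow by 15 seconds per attempt: 15, 30, 45.
def retry_delays(max_retries = 3, step = 15)
  (1..max_retries).map { |attempt| attempt * step }
end

retry_delays # => [15, 30, 45]
```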
 
  ### `open_spider` and `close_spider` callbacks
 
@@ -1340,6 +1368,11 @@ Kimurai.configure do |config|
    # Custom time zone (for logs):
    # config.time_zone = "UTC"
    # config.time_zone = "Europe/Moscow"
+
+   # Provide a custom Chrome binary path (default is any available chrome/chromium in the PATH):
+   # config.selenium_chrome_path = "/usr/bin/chromium-browser"
+   # Provide a custom Selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
+   # config.chromedriver_path = "~/.local/bin/chromedriver"
  end
  ```
 
@@ -1591,7 +1624,23 @@ end
    # works for all drivers
    skip_duplicate_requests: true,
 
-   # Array of possible errors to retry while processing a request:
+   # Automatically skip provided errors while requesting a page.
+   # If a raised error matches one of the errors in the list, the error will be caught
+   # and the request skipped.
+   # It is a good idea to skip errors like 404 Not Found.
+   # Format: an array whose elements are error classes and/or hashes. You can use the hash
+   # format for more flexibility: `{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }`.
+   # The provided `message:` is compared against the full error message using `String#include?`.
+   # You can also use a regex instead: `{ error: RuntimeError, message: /404|403/ }`.
+   skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
+
+   # Automatically retry provided errors with a few attempts while requesting a page.
+   # If a raised error matches one of the errors in the list, the error will be caught
+   # and the request retried after a delay. There are 3 attempts:
+   # the first with a 15 sec delay, the second 30 sec, the third 45 sec.
+   # If an exception is still raised after 3 attempts, it will be re-raised.
+   # It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
+   # Format: the same as for the `skip_request_errors` option.
    retry_request_errors: [Net::ReadTimeout],
 
    # Restart browser if one of the options is true:
@@ -1602,6 +1651,8 @@ end
    # Restart browser if provided requests limit is exceeded (works for all engines)
    requests_limit: 100
  },
+
+ # Perform several actions before each request:
  before_request: {
    # Change proxy before each request. The `proxy:` option above should be present
    # and in lambda format. Works only for poltergeist and mechanize engines
@@ -10,6 +10,7 @@ require_relative 'kimurai/version'
  require_relative 'kimurai/core_ext/numeric'
  require_relative 'kimurai/core_ext/string'
  require_relative 'kimurai/core_ext/array'
+ require_relative 'kimurai/core_ext/hash'
 
  require_relative 'kimurai/browser_builder'
  require_relative 'kimurai/base_helper'
@@ -3,6 +3,9 @@ require_relative 'base/storage'
 
  module Kimurai
    class Base
+     # Don't deep merge the config's headers hash option
+     DMERGE_EXCLUDE = [:headers]
+
      LoggerFormatter = proc do |severity, datetime, progname, msg|
        current_thread_id = Thread.current.object_id
        thread_type = Thread.main == Thread.current ? "M" : "C"
@@ -77,7 +80,11 @@ module Kimurai
      end
 
      def self.config
-       superclass.equal?(::Object) ? @config : superclass.config.deep_merge(@config || {})
+       if superclass.equal?(::Object)
+         @config
+       else
+         superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+       end
      end
 
      ###
@@ -90,7 +97,7 @@ module Kimurai
        end
      end
 
-     def self.crawl!(continue: false)
+     def self.crawl!(continue: false, exception_on_fail: true)
        logger.error "Spider: already running: #{name}" and return false if running?
 
        storage_path =
@@ -118,19 +125,23 @@ module Kimurai
        spider.with_info = true
        if start_urls
          start_urls.each do |start_url|
-           spider.request_to(:parse, url: start_url)
+           if start_url.class == Hash
+             spider.request_to(:parse, start_url)
+           else
+             spider.request_to(:parse, url: start_url)
+           end
          end
        else
          spider.parse
        end
      rescue StandardError, SignalException, SystemExit => e
        @run_info.merge!(status: :failed, error: e.inspect)
-       raise e
+       exception_on_fail ? raise(e) : [@run_info, e]
      else
        @run_info.merge!(status: :completed)
      ensure
        if spider
-         spider.browser.destroy_driver!
+         spider.browser.destroy_driver! if spider.instance_variable_get("@browser")
 
          stop_time = Time.now
          total_time = (stop_time - @run_info[:start_time]).round(3)
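
The new `exception_on_fail:` behaviour in the hunk above can be illustrated with a self-contained sketch of the rescue pattern (simplified; the real method works with `@run_info` and the spider instance):

```ruby
# Simplified version of crawl!'s failure handling: raise on failure by
# default, or return [run_info, error] when exception_on_fail: false.
def run_spider(exception_on_fail: true)
  run_info = { status: :processing }
  begin
    raise "simulated spider failure"
  rescue StandardError => e
    run_info[:status] = :failed
    return exception_on_fail ? raise(e) : [run_info, e]
  end
end

info, error = run_spider(exception_on_fail: false)
info[:status]  # => :failed
error.message  # => "simulated spider failure"
```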
@@ -168,7 +179,7 @@ module Kimurai
 
      def initialize(engine = self.class.engine, config: {})
        @engine = engine
-       @config = self.class.config.deep_merge(config)
+       @config = self.class.config.deep_merge_excl(config, DMERGE_EXCLUDE)
        @pipelines = self.class.pipelines.map do |pipeline_name|
          klass = Pipeline.descendants.find { |kl| kl.name == pipeline_name }
          instance = klass.new
@@ -184,15 +195,16 @@ module Kimurai
        @browser ||= BrowserBuilder.build(@engine, @config, spider: self)
      end
 
-     def request_to(handler, delay = nil, url:, data: {})
+     def request_to(handler, delay = nil, url:, data: {}, response_type: :html)
        if @config[:skip_duplicate_requests] && !unique_request?(url)
          add_event(:duplicate_requests) if self.with_info
-         logger.warn "Spider: request_to: url is not unique: #{url}, skipped" and return
+         logger.warn "Spider: request_to: not unique url: #{url}, skipped" and return
        end
 
-       request_data = { url: url, data: data }
-       delay ? browser.visit(url, delay: delay) : browser.visit(url)
-       public_send(handler, browser.current_response, request_data)
+       visited = delay ? browser.visit(url, delay: delay) : browser.visit(url)
+       return unless visited
+
+       public_send(handler, browser.current_response(response_type), { url: url, data: data })
      end
 
      def console(response = nil, url: nil, data: {})
@@ -285,18 +297,22 @@ module Kimurai
          all << Thread.new(part) do |part|
            Thread.current.abort_on_exception = true
 
-           spider = self.class.new(engine, config: config)
+           spider = self.class.new(engine, config: @config.deep_merge_excl(config, DMERGE_EXCLUDE))
            spider.with_info = true if self.with_info
 
            part.each do |url_data|
              if url_data.class == Hash
-               spider.request_to(handler, delay, url_data)
+               if url_data[:url].present? && url_data[:data].present?
+                 spider.request_to(handler, delay, url_data)
+               else
+                 spider.public_send(handler, url_data)
+               end
              else
                spider.request_to(handler, delay, url: url_data, data: data)
              end
            end
          ensure
-           spider.browser.destroy_driver!
+           spider.browser.destroy_driver! if spider.instance_variable_get("@browser")
          end
 
          sleep 0.5
@@ -42,7 +42,7 @@ module Kimurai
      def save_to_json(item)
        data = JSON.generate([item])
 
-       if append || @index > 1
+       if @index > 1 || append && File.exists?(path)
          file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
          File.open(path, "w") do |f|
            f.write(file_content + data.sub(/\A\[/, ""))
@@ -55,7 +55,7 @@ module Kimurai
      def save_to_pretty_json(item)
        data = JSON.pretty_generate([item])
 
-       if append || @index > 1
+       if @index > 1 || append && File.exists?(path)
          file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
          File.open(path, "w") do |f|
            f.write(file_content + data.sub(/\A\[\n/, ""))
@@ -68,7 +68,7 @@ module Kimurai
      def save_to_jsonlines(item)
        data = JSON.generate(item)
 
-       if append || @index > 1
+       if @index > 1 || append && File.exists?(path)
          File.open(path, "a") { |file| file.write("\n" + data) }
        else
          File.open(path, "w") { |file| file.write(data) }
@@ -78,7 +78,7 @@ module Kimurai
      def save_to_csv(item)
        data = flatten_hash(item)
 
-       if append || @index > 1
+       if @index > 1 || append && File.exists?(path)
          CSV.open(path, "a+", force_quotes: true) do |csv|
            csv << data.values
          end
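
The append branch above relies on a small trick: strip the closing `}]` from the existing JSON array, then append the new item with its leading `[` removed. A self-contained sketch (operating on strings rather than files, with a hypothetical helper name):

```ruby
require 'json'

# Append an item to an existing JSON-array string, mirroring save_to_json.
def append_item(file_content, item)
  data = JSON.generate([item])
  file_content.sub(/\}\]\Z/, "\},") + data.sub(/\A\[/, "")
end

content = JSON.generate([{ id: 1 }])       # '[{"id":1}]'
content = append_item(content, { id: 2 })
JSON.parse(content) # => [{"id"=>1}, {"id"=>2}]
```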
@@ -40,11 +40,21 @@ module Kimurai
        if path
          database.transaction do
            database[scope] ||= []
-           database[scope].push(value) unless database[scope].include?(value)
+           if value.class == Array
+             database[scope] += value
+             database[scope].uniq!
+           else
+             database[scope].push(value) unless database[scope].include?(value)
+           end
          end
        else
          database[scope] ||= []
-         database[scope].push(value) unless database[scope].include?(value)
+         if value.class == Array
+           database[scope] += value
+           database[scope].uniq!
+         else
+           database[scope].push(value) unless database[scope].include?(value)
+         end
        end
      end
    end
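
The new array handling in `Storage#add` can be sketched as a standalone, in-memory version (simplified; names here are illustrative, and the real method also persists through a database transaction):

```ruby
# Add a single value or an array of values to a scope,
# keeping entries in the scope unique.
def storage_add(database, scope, value)
  database[scope] ||= []
  if value.is_a?(Array)
    database[scope] |= value # array union keeps entries unique
  else
    database[scope].push(value) unless database[scope].include?(value)
  end
  database
end

db = {}
storage_add(db, :urls, "https://example.com/1")
storage_add(db, :urls, ["https://example.com/1", "https://example.com/2"])
db[:urls] # => ["https://example.com/1", "https://example.com/2"]
```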
@@ -80,9 +80,15 @@ module Kimurai
      end
 
      # Browser instance options
+     # skip_request_errors
+     if skip_errors = @config[:skip_request_errors].presence
+       @browser.config.skip_request_errors = skip_errors
+       logger.debug "BrowserBuilder (mechanize): enabled skip_request_errors"
+     end
+
      # retry_request_errors
-     if errors = @config[:retry_request_errors].presence
-       @browser.config.retry_request_errors = errors
+     if retry_errors = @config[:retry_request_errors].presence
+       @browser.config.retry_request_errors = retry_errors
        logger.debug "BrowserBuilder (mechanize): enabled retry_request_errors"
      end
 
@@ -91,9 +91,15 @@ module Kimurai
      end
 
      # Browser instance options
+     # skip_request_errors
+     if skip_errors = @config[:skip_request_errors].presence
+       @browser.config.skip_request_errors = skip_errors
+       logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled skip_request_errors"
+     end
+
      # retry_request_errors
-     if errors = @config[:retry_request_errors].presence
-       @browser.config.retry_request_errors = errors
+     if retry_errors = @config[:retry_request_errors].presence
+       @browser.config.retry_request_errors = retry_errors
        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled retry_request_errors"
      end
 
@@ -23,8 +23,15 @@ module Kimurai
      # Register driver
      Capybara.register_driver :selenium_chrome do |app|
        # Create driver options
-       default_args = %w[--disable-gpu --no-sandbox --disable-translate]
-       driver_options = Selenium::WebDriver::Chrome::Options.new(args: default_args)
+       opts = { args: %w[--disable-gpu --no-sandbox --disable-translate] }
+
+       # Provide a custom chrome browser path:
+       if chrome_path = Kimurai.configuration.selenium_chrome_path
+         opts.merge!(binary: chrome_path)
+       end
+
+       # See all options here: https://seleniumhq.github.io/selenium/docs/api/rb/Selenium/WebDriver/Chrome/Options.html
+       driver_options = Selenium::WebDriver::Chrome::Options.new(opts)
 
        # Window size
        if size = @config[:window_size].presence
@@ -99,7 +106,8 @@ module Kimurai
          end
        end
 
-       Capybara::Selenium::Driver.new(app, browser: :chrome, options: driver_options, driver_path: "/usr/local/bin/chromedriver")
+       chromedriver_path = Kimurai.configuration.chromedriver_path || "/usr/local/bin/chromedriver"
+       Capybara::Selenium::Driver.new(app, browser: :chrome, options: driver_options, driver_path: chromedriver_path)
      end
 
      # Create browser instance (Capybara session)
@@ -118,9 +126,15 @@ module Kimurai
      end
 
      # Browser instance options
+     # skip_request_errors
+     if skip_errors = @config[:skip_request_errors].presence
+       @browser.config.skip_request_errors = skip_errors
+       logger.debug "BrowserBuilder (selenium_chrome): enabled skip_request_errors"
+     end
+
      # retry_request_errors
-     if errors = @config[:retry_request_errors].presence
-       @browser.config.retry_request_errors = errors
+     if retry_errors = @config[:retry_request_errors].presence
+       @browser.config.retry_request_errors = retry_errors
        logger.debug "BrowserBuilder (selenium_chrome): enabled retry_request_errors"
      end
 
@@ -131,9 +131,15 @@ module Kimurai
      end
 
      # Browser instance options
+     # skip_request_errors
+     if skip_errors = @config[:skip_request_errors].presence
+       @browser.config.skip_request_errors = skip_errors
+       logger.debug "BrowserBuilder (selenium_firefox): enabled skip_request_errors"
+     end
+
      # retry_request_errors
-     if errors = @config[:retry_request_errors].presence
-       @browser.config.retry_request_errors = errors
+     if retry_errors = @config[:retry_request_errors].presence
+       @browser.config.retry_request_errors = retry_errors
        logger.debug "BrowserBuilder (selenium_firefox): enabled retry_request_errors"
      end
 
@@ -1,5 +1,6 @@
  require 'capybara'
  require 'nokogiri'
+ require 'json'
  require_relative 'session/config'
 
  module Capybara
@@ -18,21 +19,30 @@ module Capybara
        spider.class.update(:visits, :requests) if spider.with_info
 
        original_visit(visit_uri)
-     rescue *config.retry_request_errors => e
-       logger.error "Browser: request visit error: #{e.inspect}, url: #{visit_uri}"
-       spider.add_event(:requests_errors, e.inspect) if spider.with_info
-
-       if (retries += 1) <= max_retries
-         logger.info "Browser: sleep #{(sleep_interval += 15)} seconds and process retry № #{retries} to the url: #{visit_uri}"
-         sleep sleep_interval and retry
+     rescue => e
+       if match_error?(e, type: :to_skip)
+         logger.error "Browser: skip request error: #{e.inspect}, url: #{visit_uri}"
+         spider.add_event(:requests_errors, e.inspect) if spider.with_info
+         false
+       elsif match_error?(e, type: :to_retry)
+         logger.error "Browser: retry request error: #{e.inspect}, url: #{visit_uri}"
+         spider.add_event(:requests_errors, e.inspect) if spider.with_info
+
+         if (retries += 1) <= max_retries
+           logger.info "Browser: sleep #{(sleep_interval += 15)} seconds and process retry № #{retries} to the url: #{visit_uri}"
+           sleep sleep_interval and retry
+         else
+           logger.error "Browser: all retries (#{retries - 1}) to the url #{visit_uri} are gone"
+           raise e
+         end
        else
-         logger.error "Browser: all retries (#{retries - 1}) to the url `#{visit_uri}` are gone"
          raise e
        end
      else
        driver.responses += 1 and logger.info "Browser: finished get request to: #{visit_uri}"
        spider.class.update(:visits, :responses) if spider.with_info
        driver.visited = true unless driver.visited
+       true
      ensure
        if spider.with_info
          logger.info "Info: visits: requests: #{spider.class.visits[:requests]}, responses: #{spider.class.visits[:responses]}"
@@ -75,8 +85,13 @@ module Capybara
        logger.info "Browser: driver has been restarted: name: #{mode}, pid: #{driver.pid}, port: #{driver.port}"
      end
 
-     def current_response
-       Nokogiri::HTML(body)
+     def current_response(response_type = :html)
+       case response_type
+       when :html
+         Nokogiri::HTML(body)
+       when :json
+         JSON.parse(body)
+       end
      end
 
      ###
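
The `response_type` dispatch above can be sketched standalone (a simplified illustration with a hypothetical helper name; the `:html` branch, which really uses `Nokogiri::HTML`, is stubbed here to keep the sketch dependency-free):

```ruby
require 'json'

# Parse a raw page body according to response_type.
def parse_response(body, response_type = :html)
  case response_type
  when :html
    body # stand-in for Nokogiri::HTML(body)
  when :json
    JSON.parse(body)
  end
end

parse_response('{"id": 1}', :json) # => {"id"=>1}
```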
@@ -114,6 +129,27 @@ module Capybara
 
      private
 
+     def match_error?(e, type:)
+       errors = (type == :to_retry ? config.retry_request_errors : config.skip_request_errors)
+       if errors.present?
+         errors.any? do |error|
+           if error.class == Hash
+             match = if error[:message].class == Regexp
+               e.message&.match?(error[:message])
+             else
+               e.message&.include?(error[:message])
+             end
+
+             e.class == error[:error] && match
+           else
+             e.class == error
+           end
+         end
+       else
+         false
+       end
+     end
+
      def process_delay(delay)
        interval = (delay.class == Range ? rand(delay) : delay)
        logger.debug "Browser: sleep #{interval.round(2)} #{'second'.pluralize(interval)} before request..."
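
The matching rule in `match_error?` can be extracted into a self-contained sketch (simplified, with a hypothetical name; the real method reads its list from the session config): an entry is either an error class, or a hash whose `:message` is a substring or a `Regexp`:

```ruby
# True if exception e matches an entry in errors.
def error_matches?(e, errors)
  errors.any? do |entry|
    if entry.is_a?(Hash)
      message_match =
        if entry[:message].is_a?(Regexp)
          e.message.match?(entry[:message])
        else
          e.message.include?(entry[:message])
        end
      e.class == entry[:error] && message_match
    else
      e.class == entry
    end
  end
end

error = RuntimeError.new("404 => Net::HTTPNotFound")
error_matches?(error, [{ error: RuntimeError, message: "404" }])  # => true
error_matches?(error, [{ error: RuntimeError, message: /500/ }])  # => false
```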
@@ -1,12 +1,16 @@
  module Capybara
    class SessionConfig
      attr_accessor :cookies, :proxy, :user_agent
-     attr_writer :retry_request_errors
+     attr_writer :retry_request_errors, :skip_request_errors
 
      def retry_request_errors
        @retry_request_errors ||= []
      end
 
+     def skip_request_errors
+       @skip_request_errors ||= []
+     end
+
      def restart_if
        @restart_if ||= {}
      end
@@ -146,10 +146,23 @@ module Kimurai
        puts VERSION
      end
 
+     desc "dashboard", "Run dashboard"
+     def dashboard
+       raise "Can't find Kimurai project" unless inside_project?
+
+       require './config/boot'
+       if Object.const_defined?("Kimurai::Dashboard")
+         require 'kimurai/dashboard/app'
+         Kimurai::Dashboard::App.run!
+       else
+         raise "Kimurai::Dashboard is not defined"
+       end
+     end
+
      private
 
      def inside_project?
-       Dir.exists? "spiders"
+       Dir.exists?("spiders") && File.exists?("./config/boot.rb")
      end
    end
  end
@@ -0,0 +1,5 @@
+ class Hash
+   def deep_merge_excl(second, exclude)
+     self.merge(second.slice(*exclude)).deep_merge(second.except(*exclude))
+   end
+ end
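
The semantics of `deep_merge_excl` — excluded keys are replaced wholesale while everything else is deep merged — can be sketched in plain Ruby (a hand-rolled `deep_merge` stands in for ActiveSupport's, and the standalone function names are illustrative):

```ruby
# Minimal recursive deep_merge for illustration.
def deep_merge(first, second)
  first.merge(second) do |_key, a, b|
    a.is_a?(Hash) && b.is_a?(Hash) ? deep_merge(a, b) : b
  end
end

# Deep merge everything except the excluded keys, which are replaced wholesale.
def deep_merge_excl(first, second, exclude)
  excluded = second.select { |k, _| exclude.include?(k) }
  rest     = second.reject { |k, _| exclude.include?(k) }
  deep_merge(first, rest).merge(excluded)
end

base  = { headers: { "A" => "1" }, browser: { timeout: 10 } }
child = { headers: { "B" => "2" }, browser: { retries: 2 } }

deep_merge_excl(base, child, [:headers])
# => { headers: { "B" => "2" }, browser: { timeout: 10, retries: 2 } }
```

This is exactly why a child spider's `headers:` override the parent's completely instead of being mixed together.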
@@ -2,61 +2,43 @@ require 'pmap'
 
  module Kimurai
    class Runner
-     attr_reader :jobs, :spiders
+     attr_reader :jobs, :spiders, :session_info
 
      def initialize(parallel_jobs:)
        @jobs = parallel_jobs
        @spiders = Kimurai.list
+       @start_time = Time.now
 
-       if time_zone = Kimurai.configuration.time_zone
-         Kimurai.time_zone = time_zone
-       end
-     end
-
-     def run!
-       start_time = Time.now
-       run_id = start_time.to_i
-       running_pids = []
-
-       ENV.store("RBCAT_COLORIZER", "false")
-
-       run_info = {
-         id: run_id,
+       @session_info = {
+         id: @start_time.to_i,
          status: :processing,
-         start_time: start_time,
+         start_time: @start_time,
          stop_time: nil,
          environment: Kimurai.env,
-         concurrent_jobs: jobs,
-         spiders: spiders.keys
+         concurrent_jobs: @jobs,
+         spiders: @spiders.keys
        }
 
-       at_exit do
-         # Prevent queue to process new intems while executing at_exit body
-         Thread.list.each { |t| t.kill if t != Thread.main }
-         # Kill currently running spiders
-         running_pids.each { |pid| Process.kill("INT", pid) }
-
-         error = $!
-         stop_time = Time.now
+       if time_zone = Kimurai.configuration.time_zone
+         Kimurai.time_zone = time_zone
+       end
 
-         if error.nil?
-           run_info.merge!(status: :completed, stop_time: stop_time)
-         else
-           run_info.merge!(status: :failed, error: error.inspect, stop_time: stop_time)
-         end
+       ENV.store("SESSION_ID", @start_time.to_i.to_s)
+       ENV.store("RBCAT_COLORIZER", "false")
+     end
 
-         if at_stop_callback = Kimurai.configuration.runner_at_stop_callback
-           at_stop_callback.call(run_info)
-         end
-         puts "<<< Runner: stopped: #{run_info}"
-       end
+     def run!(exception_on_fail: true)
+       running_pids = []
 
-       puts ">>> Runner: started: #{run_info}"
+       puts ">>> Runner: started: #{session_info}"
        if at_start_callback = Kimurai.configuration.runner_at_start_callback
-         at_start_callback.call(run_info)
+         at_start_callback.call(session_info)
        end
 
+       running = true
        spiders.peach_with_index(jobs) do |spider, i|
+         next unless running
+
          spider_name = spider[0]
          puts "> Runner: started spider: #{spider_name}, index: #{i}"
@@ -67,6 +49,23 @@ module Kimurai
          running_pids.delete(pid)
          puts "< Runner: stopped spider: #{spider_name}, index: #{i}"
        end
+     rescue StandardError, SignalException, SystemExit => e
+       running = false
+       session_info.merge!(status: :failed, error: e.inspect, stop_time: Time.now)
+       exception_on_fail ? raise(e) : [session_info, e]
+     else
+       session_info.merge!(status: :completed, stop_time: Time.now)
+     ensure
+       running = false
+       Thread.list.each { |t| t.kill if t != Thread.main }
+
+       # Kill currently running spiders (if any, in case of fail)
+       running_pids.each { |pid| Process.kill("INT", pid) }
+
+       if at_stop_callback = Kimurai.configuration.runner_at_stop_callback
+         at_stop_callback.call(session_info)
+       end
+       puts "<<< Runner: stopped: #{session_info}"
      end
    end
  end
@@ -29,4 +29,9 @@ Kimurai.configure do |config|
    # json = JSON.pretty_generate(info)
    # Sender.send_notification("Stopped session: #{json}")
    # end
+
+   # Provide a custom Chrome binary path (default is any available chrome/chromium in the PATH):
+   # config.selenium_chrome_path = "/usr/bin/chromium-browser"
+   # Provide a custom Selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
+   # config.chromedriver_path = "/usr/local/bin/chromedriver"
  end
@@ -81,7 +81,23 @@ class ApplicationSpider < Kimurai::Base
      # works for all drivers
      # skip_duplicate_requests: true,
 
-     # Array of errors to retry while processing a request
+     # Automatically skip provided errors while requesting a page.
+     # If a raised error matches one of the errors in the list, the error will be caught
+     # and the request skipped.
+     # It is a good idea to skip errors like 404 Not Found.
+     # Format: an array whose elements are error classes and/or hashes. You can use the hash
+     # format for more flexibility: `{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }`.
+     # The provided `message:` is compared against the full error message using `String#include?`.
+     # You can also use a regex instead: `{ error: RuntimeError, message: /404|403/ }`.
+     # skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
+
+     # Automatically retry provided errors with a few attempts while requesting a page.
+     # If a raised error matches one of the errors in the list, the error will be caught
+     # and the request retried after a delay. There are 3 attempts:
+     # the first with a 15 sec delay, the second 30 sec, the third 45 sec.
+     # If an exception is still raised after 3 attempts, it will be re-raised.
+     # It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
+     # Format: the same as for the `skip_request_errors` option.
      # retry_request_errors: [Net::ReadTimeout],
 
      # Restart browser if one of the options is true:
@@ -92,6 +108,8 @@ class ApplicationSpider < Kimurai::Base
      # Restart browser if provided requests limit is exceeded (works for all engines)
      # requests_limit: 100
    # },
+
+   # Perform several actions before each request:
    before_request: {
      # Change proxy before each request. The `proxy:` option above should be present
      # and in lambda format. Works only for poltergeist and mechanize engines
@@ -1,3 +1,3 @@
  module Kimurai
-   VERSION = "1.1.0"
+   VERSION = "1.2.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: kimurai
  version: !ruby/object:Gem::Version
-   version: 1.1.0
+   version: 1.2.0
  platform: ruby
  authors:
  - Victor Afanasev
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2018-09-12 00:00:00.000000000 Z
+ date: 2018-09-20 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: thor
@@ -265,7 +265,6 @@ files:
  - ".gitignore"
  - ".travis.yml"
  - CHANGELOG.md
- - CODE_OF_CONDUCT.md
  - Gemfile
  - LICENSE.txt
  - README.md
@@ -301,6 +300,7 @@ files:
  - lib/kimurai/cli/ansible_command_builder.rb
  - lib/kimurai/cli/generator.rb
  - lib/kimurai/core_ext/array.rb
+ - lib/kimurai/core_ext/hash.rb
  - lib/kimurai/core_ext/numeric.rb
  - lib/kimurai/core_ext/string.rb
  - lib/kimurai/pipeline.rb
@@ -1,74 +0,0 @@
- # Contributor Covenant Code of Conduct
-
- ## Our Pledge
-
- In the interest of fostering an open and welcoming environment, we as
- contributors and maintainers pledge to making participation in our project and
- our community a harassment-free experience for everyone, regardless of age, body
- size, disability, ethnicity, gender identity and expression, level of experience,
- nationality, personal appearance, race, religion, or sexual identity and
- orientation.
-
- ## Our Standards
-
- Examples of behavior that contributes to creating a positive environment
- include:
-
- * Using welcoming and inclusive language
- * Being respectful of differing viewpoints and experiences
- * Gracefully accepting constructive criticism
- * Focusing on what is best for the community
- * Showing empathy towards other community members
-
- Examples of unacceptable behavior by participants include:
-
- * The use of sexualized language or imagery and unwelcome sexual attention or
- advances
- * Trolling, insulting/derogatory comments, and personal or political attacks
- * Public or private harassment
- * Publishing others' private information, such as a physical or electronic
- address, without explicit permission
- * Other conduct which could reasonably be considered inappropriate in a
- professional setting
-
- ## Our Responsibilities
-
- Project maintainers are responsible for clarifying the standards of acceptable
- behavior and are expected to take appropriate and fair corrective action in
- response to any instances of unacceptable behavior.
-
- Project maintainers have the right and responsibility to remove, edit, or
- reject comments, commits, code, wiki edits, issues, and other contributions
- that are not aligned to this Code of Conduct, or to ban temporarily or
- permanently any contributor for other behaviors that they deem inappropriate,
- threatening, offensive, or harmful.
-
- ## Scope
-
- This Code of Conduct applies both within project spaces and in public spaces
- when an individual is representing the project or its community. Examples of
- representing a project or community include using an official project e-mail
- address, posting via an official social media account, or acting as an appointed
- representative at an online or offline event. Representation of a project may be
- further defined and clarified by project maintainers.
-
- ## Enforcement
-
- Instances of abusive, harassing, or otherwise unacceptable behavior may be
- reported by contacting the project team at vicfreefly@gmail.com. All
- complaints will be reviewed and investigated and will result in a response that
- is deemed necessary and appropriate to the circumstances. The project team is
- obligated to maintain confidentiality with regard to the reporter of an incident.
- Further details of specific enforcement policies may be posted separately.
-
- Project maintainers who do not follow or enforce the Code of Conduct in good
- faith may face temporary or permanent repercussions as determined by other
- members of the project's leadership.
-
- ## Attribution
-
- This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
- available at [http://contributor-covenant.org/version/1/4][version]
-
- [homepage]: http://contributor-covenant.org
- [version]: http://contributor-covenant.org/version/1/4/