kimurai 1.1.0 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 67ee49692e64813bc980eb7562b711d5b5d2c47b50a995acb4759709703da0f9
- data.tar.gz: baba361bc5039d303ae4a6c9a1dd2109368f8e4c7a641d0a782cfc6a7776ade4
+ metadata.gz: 2ebb62d0916ee55eb8ec05f34b5f5909b99aae8fa2236dd93169f6e5e1221805
+ data.tar.gz: da86497a6c4f61f2ff1ffe3e8eeee92acb203305e712b9bd4baaa5794fdcf5f5
  SHA512:
- metadata.gz: 0173d3859b5f8776371fad454ff5575fdc453aa3c6038d8a8399651c46c5eaae789273772227ea014b6ce39b13586e6805bd7f69156eafeacf653804f954003c
- data.tar.gz: b05889c0cb030aed06fe1df5cc5411154d24019667e1f00f9f4248d598fc93990f86a4aae78430af3140f3dc3989e856cfc3e2316f455984c898442fccad15db
+ metadata.gz: 9d54d6074928a5bc0aa0a9d7e64308942e21ad5ae17788b2fa36adfffe88f67df0c533558ced412f2e22481a1664661c547419d09451d1260dd6ebd14ca4d915
+ data.tar.gz: d21f499a2a292dd672480d15da71742cfa82dec054b6a3c3a3c756e6cd2c98e58ac7d3b7fa3fe0ce6ae2e956d46417269242e742f8d83b644176ccef0822de75
data/CHANGELOG.md CHANGED
@@ -1,4 +1,20 @@
  # CHANGELOG
+ ## 1.2.0
+ ### New
+ * Add possibility to add an array of values to the storage (`Base::Storage#add`)
+ * Add `exception_on_fail` option to `Base.crawl!`
+ * Add possibility to pass a request hash to `start_urls` (an array of hashes works as well, e.g. `@start_urls = [{ url: "https://example.com/cat?id=1", data: { category: "First Category" } }]`)
+ * Implement `skip_request_errors` config feature. Added a [Handle request errors](https://github.com/vifreefly/kimuraframework#handle-request-errors) chapter to the README.
+ * Add option to choose the response type for `Session#current_response` (`:html` by default, or `:json`)
+ * Add option to provide custom chrome and chromedriver paths
+
+ ### Improvements
+ * Refactor `Runner`
+
+ ### Fixes
+ * Fix `Base#Saver` (automatically create the file if it doesn't exist when using a persistence database)
+ * Do not deep merge config's `headers:` option
+
  ## 1.1.0
  ### Breaking changes 1.1.0
  `browser` config option deprecated. Now all sub-options inside `browser` should be placed right into the `@config` hash, without the `browser` parent key. Example:
data/README.md CHANGED
@@ -18,7 +18,7 @@

  <br>

- > Note: this readme is for `1.1.0` gem version. CHANGELOG [here](CHANGELOG.md).
+ > Note: this readme is for `1.2.0` gem version. CHANGELOG [here](CHANGELOG.md).

  Kimurai is a modern web scraping framework written in Ruby which **works out of the box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows you to scrape and interact with JavaScript-rendered websites.**

@@ -216,7 +216,7 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
  * All the power of [Capybara](https://github.com/teamcapybara/capybara): use methods like `click_on`, `fill_in`, `select`, `choose`, `set`, `go_back`, etc. to interact with web pages
  * Rich [configuration](#spider-config): **set default headers, cookies, delay between requests, enable proxy/user-agents rotation**
  * Built-in helpers to make scraping easy, like [save_to](#save_to-helper) (save items to JSON, JSON lines, or CSV formats) or [unique?](#skip-duplicates-unique-helper) to skip duplicates
- * Automatically [retry failed requests](#configuration-options) with delay
+ * Automatically [handle request errors](#handle-request-errors)
  * Automatically restart browsers when reaching memory limit [**(memory control)**](#spider-config) or requests limit
  * Easily [schedule spiders](#schedule-spiders-using-cron) within cron using [Whenever](https://github.com/javan/whenever) (no need to know cron syntax)
  * [Parallel scraping](#parallel-crawling-using-in_parallel) using simple method `in_parallel`
@@ -242,6 +242,9 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
  * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
  * [Storage object](#storage-object)
  * [Persistence database for the storage](#persistence-database-for-the-storage)
+ * [Handle request errors](#handle-request-errors)
+ * [skip_request_errors](#skip_request_errors)
+ * [retry_request_errors](#retry_request_errors)
  * [open_spider and close_spider callbacks](#open_spider-and-close_spider-callbacks)
  * [KIMURAI_ENV](#kimurai_env)
  * [Parallel crawling using in_parallel](#parallel-crawling-using-in_parallel)
@@ -860,6 +863,9 @@ ProductsSpider.crawl!(continue: true)

  The second approach is to automatically skip already processed item urls using the `@config` `skip_duplicate_requests:` option:

+ <details>
+ <summary>Check the code</summary>
+
  ```ruby
  class ProductsSpider < Kimurai::Base
  @start_urls = ["https://example-shop.com/"]
@@ -893,7 +899,29 @@ end
  # Run the spider with persistence database option:
  ProductsSpider.crawl!(continue: true)
  ```
+ </details>
+
+ ### Handle request errors
+ It is quite common that some pages of a website being crawled return a response code other than `200 OK`. In such cases, the method `request_to` (or `browser.visit`) can raise an exception. Kimurai provides the `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
+
+ #### skip_request_errors
+ You can automatically skip some errors while requesting a page using the `skip_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, the error will be caught and the request will be skipped. It is a good idea to skip errors like NotFound (404), etc.
+
+ Format for the option: an array whose elements are error classes or/and hashes. You can use the _hash_ format for more flexibility:
+
+ ```ruby
+ @config = {
+   skip_request_errors: [{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }]
+ }
+ ```
+ In this case, the provided `message:` will be compared with the full error message using `String#include?`. You can also use a regex instead: `{ error: "RuntimeError", message: /404|403/ }`.

+ #### retry_request_errors
+ You can automatically retry some errors with a few attempts while requesting a page using the `retry_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, the error will be caught and the request will be processed again after a delay.
+
+ There are 3 attempts: the first with a delay of _15 sec_, the second _30 sec_, and the third _45 sec_. If after 3 attempts there is still an exception, the exception will be raised. It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
+
+ Format for the option: same as for the `skip_request_errors` option.

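The matching rules described above can be sketched in plain Ruby. This is a simplified, standalone version of the gem's internal matcher (see the `match_error?` method added to the session in this diff); the top-level `match_error?` function here is illustrative, not the gem's public API:

```ruby
# Simplified matcher: an entry is either an error class, or a hash with
# an :error class plus a :message (a String compared via include?, or a Regexp).
def match_error?(error, entry)
  if entry.is_a?(Hash)
    message_match =
      if entry[:message].is_a?(Regexp)
        error.message&.match?(entry[:message])
      else
        error.message&.include?(entry[:message])
      end
    error.class == entry[:error] && message_match
  else
    error.class == entry
  end
end

e = RuntimeError.new("404 => Net::HTTPNotFound")
match_error?(e, { error: RuntimeError, message: "404 => Net::HTTPNotFound" }) # => true
match_error?(e, { error: RuntimeError, message: /404|403/ })                  # => true
match_error?(e, ArgumentError)                                                # => false
```

Note that the class comparison is exact (`error.class ==`), so listing a parent class will not match subclass errors.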
  ### `open_spider` and `close_spider` callbacks
899
927
 
@@ -1340,6 +1368,11 @@ Kimurai.configure do |config|
  # Custom time zone (for logs):
  # config.time_zone = "UTC"
  # config.time_zone = "Europe/Moscow"
+
+ # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
+ # config.selenium_chrome_path = "/usr/bin/chromium-browser"
+ # Provide custom selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
+ # config.chromedriver_path = "~/.local/bin/chromedriver"
  end
  ```

@@ -1591,7 +1624,23 @@ end
  # works for all drivers
  skip_duplicate_requests: true,

- # Array of possible errors to retry while processing a request:
+ # Automatically skip provided errors while requesting a page.
+ # If a raised error matches one of the errors in the list, the error will be caught
+ # and the request will be skipped.
+ # It is a good idea to skip errors like NotFound (404), etc.
+ # Format: an array whose elements are error classes or/and hashes. You can use the hash
+ # format for more flexibility: `{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }`.
+ # The provided `message:` will be compared with the full error message using `String#include?`.
+ # You can also use a regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
+ skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
+
+ # Automatically retry provided errors with a few attempts while requesting a page.
+ # If a raised error matches one of the errors in the list, the error will be caught
+ # and the request will be processed again after a delay. There are 3 attempts:
+ # first: delay 15 sec, second: delay 30 sec, third: delay 45 sec.
+ # If after 3 attempts there is still an exception, the exception will be raised.
+ # It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
+ # Format: same as for the `skip_request_errors` option.
  retry_request_errors: [Net::ReadTimeout],

  # Restart browser if one of the options is true:
@@ -1602,6 +1651,8 @@ end
  # Restart browser if provided requests limit is exceeded (works for all engines)
  requests_limit: 100
  },
+
+ # Perform several actions before each request:
  before_request: {
  # Change proxy before each request. The `proxy:` option above should be presented
  # and has lambda format. Works only for poltergeist and mechanize engines
@@ -10,6 +10,7 @@ require_relative 'kimurai/version'
  require_relative 'kimurai/core_ext/numeric'
  require_relative 'kimurai/core_ext/string'
  require_relative 'kimurai/core_ext/array'
+ require_relative 'kimurai/core_ext/hash'

  require_relative 'kimurai/browser_builder'
  require_relative 'kimurai/base_helper'
@@ -3,6 +3,9 @@ require_relative 'base/storage'

  module Kimurai
  class Base
+ # don't deep merge config's headers hash option
+ DMERGE_EXCLUDE = [:headers]
+
  LoggerFormatter = proc do |severity, datetime, progname, msg|
  current_thread_id = Thread.current.object_id
  thread_type = Thread.main == Thread.current ? "M" : "C"
@@ -77,7 +80,11 @@ module Kimurai
  end

  def self.config
- superclass.equal?(::Object) ? @config : superclass.config.deep_merge(@config || {})
+ if superclass.equal?(::Object)
+ @config
+ else
+ superclass.config.deep_merge_excl(@config || {}, DMERGE_EXCLUDE)
+ end
  end

  ###
@@ -90,7 +97,7 @@ module Kimurai
  end
  end

- def self.crawl!(continue: false)
+ def self.crawl!(continue: false, exception_on_fail: true)
  logger.error "Spider: already running: #{name}" and return false if running?

  storage_path =
@@ -118,19 +125,23 @@ module Kimurai
  spider.with_info = true
  if start_urls
  start_urls.each do |start_url|
- spider.request_to(:parse, url: start_url)
+ if start_url.class == Hash
+ spider.request_to(:parse, start_url)
+ else
+ spider.request_to(:parse, url: start_url)
+ end
  end
  else
  spider.parse
  end
  rescue StandardError, SignalException, SystemExit => e
  @run_info.merge!(status: :failed, error: e.inspect)
- raise e
+ exception_on_fail ? raise(e) : [@run_info, e]
  else
  @run_info.merge!(status: :completed)
  ensure
  if spider
- spider.browser.destroy_driver!
+ spider.browser.destroy_driver! if spider.instance_variable_get("@browser")

  stop_time = Time.now
  total_time = (stop_time - @run_info[:start_time]).round(3)
@@ -168,7 +179,7 @@ module Kimurai

  def initialize(engine = self.class.engine, config: {})
  @engine = engine
- @config = self.class.config.deep_merge(config)
+ @config = self.class.config.deep_merge_excl(config, DMERGE_EXCLUDE)
  @pipelines = self.class.pipelines.map do |pipeline_name|
  klass = Pipeline.descendants.find { |kl| kl.name == pipeline_name }
  instance = klass.new
@@ -184,15 +195,16 @@ module Kimurai
  @browser ||= BrowserBuilder.build(@engine, @config, spider: self)
  end

- def request_to(handler, delay = nil, url:, data: {})
+ def request_to(handler, delay = nil, url:, data: {}, response_type: :html)
  if @config[:skip_duplicate_requests] && !unique_request?(url)
  add_event(:duplicate_requests) if self.with_info
- logger.warn "Spider: request_to: url is not unique: #{url}, skipped" and return
+ logger.warn "Spider: request_to: not unique url: #{url}, skipped" and return
  end

- request_data = { url: url, data: data }
- delay ? browser.visit(url, delay: delay) : browser.visit(url)
- public_send(handler, browser.current_response, request_data)
+ visited = delay ? browser.visit(url, delay: delay) : browser.visit(url)
+ return unless visited
+
+ public_send(handler, browser.current_response(response_type), { url: url, data: data })
  end

  def console(response = nil, url: nil, data: {})
@@ -285,18 +297,22 @@ module Kimurai
  all << Thread.new(part) do |part|
  Thread.current.abort_on_exception = true

- spider = self.class.new(engine, config: config)
+ spider = self.class.new(engine, config: @config.deep_merge_excl(config, DMERGE_EXCLUDE))
  spider.with_info = true if self.with_info

  part.each do |url_data|
  if url_data.class == Hash
- spider.request_to(handler, delay, url_data)
+ if url_data[:url].present? && url_data[:data].present?
+ spider.request_to(handler, delay, url_data)
+ else
+ spider.public_send(handler, url_data)
+ end
  else
  spider.request_to(handler, delay, url: url_data, data: data)
  end
  end
  ensure
- spider.browser.destroy_driver!
+ spider.browser.destroy_driver! if spider.instance_variable_get("@browser")
  end

  sleep 0.5
@@ -42,7 +42,7 @@ module Kimurai
  def save_to_json(item)
  data = JSON.generate([item])

- if append || @index > 1
+ if @index > 1 || append && File.exists?(path)
  file_content = File.read(path).sub(/\}\]\Z/, "\}\,")
  File.open(path, "w") do |f|
  f.write(file_content + data.sub(/\A\[/, ""))
@@ -55,7 +55,7 @@ module Kimurai
  def save_to_pretty_json(item)
  data = JSON.pretty_generate([item])

- if append || @index > 1
+ if @index > 1 || append && File.exists?(path)
  file_content = File.read(path).sub(/\}\n\]\Z/, "\}\,\n")
  File.open(path, "w") do |f|
  f.write(file_content + data.sub(/\A\[\n/, ""))
@@ -68,7 +68,7 @@ module Kimurai
  def save_to_jsonlines(item)
  data = JSON.generate(item)

- if append || @index > 1
+ if @index > 1 || append && File.exists?(path)
  File.open(path, "a") { |file| file.write("\n" + data) }
  else
  File.open(path, "w") { |file| file.write(data) }
@@ -78,7 +78,7 @@ module Kimurai
  def save_to_csv(item)
  data = flatten_hash(item)

- if append || @index > 1
+ if @index > 1 || append && File.exists?(path)
  CSV.open(path, "a+", force_quotes: true) do |csv|
  csv << data.values
  end
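A note on the rewritten saver condition: Ruby's `&&` binds tighter than `||`, so `@index > 1 || append && File.exists?(path)` parses as `@index > 1 || (append && File.exists?(path))` — append mode now falls back to creating the file when it doesn't exist yet. A quick standalone check (the `append_mode?` helper is ours, for illustration only):

```ruby
# Mirrors the saver's condition: reuse the file when this is not the
# first item, or when append was requested and the file actually exists;
# otherwise (re)create the file.
def append_mode?(index, append, file_exists)
  index > 1 || append && file_exists
end

append_mode?(1, true, false)  # => false: append requested, but file missing, so create it
append_mode?(1, true, true)   # => true
append_mode?(2, false, true)  # => true: subsequent items always append
```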
@@ -40,11 +40,21 @@ module Kimurai
  if path
  database.transaction do
  database[scope] ||= []
- database[scope].push(value) unless database[scope].include?(value)
+ if value.class == Array
+ database[scope] += value
+ database[scope].uniq!
+ else
+ database[scope].push(value) unless database[scope].include?(value)
+ end
  end
  else
  database[scope] ||= []
- database[scope].push(value) unless database[scope].include?(value)
+ if value.class == Array
+ database[scope] += value
+ database[scope].uniq!
+ else
+ database[scope].push(value) unless database[scope].include?(value)
+ end
  end
  end
  end
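The new array branch of `Storage#add` can be exercised with a minimal in-memory stand-in for the storage (a plain Hash instead of the persistence-database-backed one; the names `database`, `scope`, and the `add` lambda are illustrative):

```ruby
# Sketch of Storage#add's behavior: an Array value is concatenated and
# de-duplicated; a scalar is pushed only if not already present.
database = {}
scope = :product_urls

add = lambda do |value|
  database[scope] ||= []
  if value.is_a?(Array)
    database[scope] += value
    database[scope].uniq!
  else
    database[scope].push(value) unless database[scope].include?(value)
  end
end

add.call("https://example.com/p/1")
add.call(["https://example.com/p/1", "https://example.com/p/2"])
database[scope] # => ["https://example.com/p/1", "https://example.com/p/2"]
```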
@@ -80,9 +80,15 @@ module Kimurai
  end

  # Browser instance options
+ # skip_request_errors
+ if skip_errors = @config[:skip_request_errors].presence
+ @browser.config.skip_request_errors = skip_errors
+ logger.debug "BrowserBuilder (mechanize): enabled skip_request_errors"
+ end
+
  # retry_request_errors
- if errors = @config[:retry_request_errors].presence
- @browser.config.retry_request_errors = errors
+ if retry_errors = @config[:retry_request_errors].presence
+ @browser.config.retry_request_errors = retry_errors
  logger.debug "BrowserBuilder (mechanize): enabled retry_request_errors"
  end

@@ -91,9 +91,15 @@ module Kimurai
  end

  # Browser instance options
+ # skip_request_errors
+ if skip_errors = @config[:skip_request_errors].presence
+ @browser.config.skip_request_errors = skip_errors
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled skip_request_errors"
+ end
+
  # retry_request_errors
- if errors = @config[:retry_request_errors].presence
- @browser.config.retry_request_errors = errors
+ if retry_errors = @config[:retry_request_errors].presence
+ @browser.config.retry_request_errors = retry_errors
  logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled retry_request_errors"
  end

@@ -23,8 +23,15 @@ module Kimurai
  # Register driver
  Capybara.register_driver :selenium_chrome do |app|
  # Create driver options
- default_args = %w[--disable-gpu --no-sandbox --disable-translate]
- driver_options = Selenium::WebDriver::Chrome::Options.new(args: default_args)
+ opts = { args: %w[--disable-gpu --no-sandbox --disable-translate] }
+
+ # Provide custom chrome browser path:
+ if chrome_path = Kimurai.configuration.selenium_chrome_path
+ opts.merge!(binary: chrome_path)
+ end
+
+ # See all options here: https://seleniumhq.github.io/selenium/docs/api/rb/Selenium/WebDriver/Chrome/Options.html
+ driver_options = Selenium::WebDriver::Chrome::Options.new(opts)

  # Window size
  if size = @config[:window_size].presence
@@ -99,7 +106,8 @@ module Kimurai
  end
  end

- Capybara::Selenium::Driver.new(app, browser: :chrome, options: driver_options, driver_path: "/usr/local/bin/chromedriver")
+ chromedriver_path = Kimurai.configuration.chromedriver_path || "/usr/local/bin/chromedriver"
+ Capybara::Selenium::Driver.new(app, browser: :chrome, options: driver_options, driver_path: chromedriver_path)
  end

  # Create browser instance (Capybara session)
@@ -118,9 +126,15 @@ module Kimurai
  end

  # Browser instance options
+ # skip_request_errors
+ if skip_errors = @config[:skip_request_errors].presence
+ @browser.config.skip_request_errors = skip_errors
+ logger.debug "BrowserBuilder (selenium_chrome): enabled skip_request_errors"
+ end
+
  # retry_request_errors
- if errors = @config[:retry_request_errors].presence
- @browser.config.retry_request_errors = errors
+ if retry_errors = @config[:retry_request_errors].presence
+ @browser.config.retry_request_errors = retry_errors
  logger.debug "BrowserBuilder (selenium_chrome): enabled retry_request_errors"
  end

@@ -131,9 +131,15 @@ module Kimurai
  end

  # Browser instance options
+ # skip_request_errors
+ if skip_errors = @config[:skip_request_errors].presence
+ @browser.config.skip_request_errors = skip_errors
+ logger.debug "BrowserBuilder (selenium_firefox): enabled skip_request_errors"
+ end
+
  # retry_request_errors
- if errors = @config[:retry_request_errors].presence
- @browser.config.retry_request_errors = errors
+ if retry_errors = @config[:retry_request_errors].presence
+ @browser.config.retry_request_errors = retry_errors
  logger.debug "BrowserBuilder (selenium_firefox): enabled retry_request_errors"
  end

@@ -1,5 +1,6 @@
  require 'capybara'
  require 'nokogiri'
+ require 'json'
  require_relative 'session/config'

  module Capybara
@@ -18,21 +19,30 @@ module Capybara
  spider.class.update(:visits, :requests) if spider.with_info

  original_visit(visit_uri)
- rescue *config.retry_request_errors => e
- logger.error "Browser: request visit error: #{e.inspect}, url: #{visit_uri}"
- spider.add_event(:requests_errors, e.inspect) if spider.with_info
-
- if (retries += 1) <= max_retries
- logger.info "Browser: sleep #{(sleep_interval += 15)} seconds and process retry № #{retries} to the url: #{visit_uri}"
- sleep sleep_interval and retry
+ rescue => e
+ if match_error?(e, type: :to_skip)
+ logger.error "Browser: skip request error: #{e.inspect}, url: #{visit_uri}"
+ spider.add_event(:requests_errors, e.inspect) if spider.with_info
+ false
+ elsif match_error?(e, type: :to_retry)
+ logger.error "Browser: retry request error: #{e.inspect}, url: #{visit_uri}"
+ spider.add_event(:requests_errors, e.inspect) if spider.with_info
+
+ if (retries += 1) <= max_retries
+ logger.info "Browser: sleep #{(sleep_interval += 15)} seconds and process retry № #{retries} to the url: #{visit_uri}"
+ sleep sleep_interval and retry
+ else
+ logger.error "Browser: all retries (#{retries - 1}) to the url #{visit_uri} are gone"
+ raise e
+ end
  else
- logger.error "Browser: all retries (#{retries - 1}) to the url `#{visit_uri}` are gone"
  raise e
  end
  else
  driver.responses += 1 and logger.info "Browser: finished get request to: #{visit_uri}"
  spider.class.update(:visits, :responses) if spider.with_info
  driver.visited = true unless driver.visited
+ true
  ensure
  if spider.with_info
  logger.info "Info: visits: requests: #{spider.class.visits[:requests]}, responses: #{spider.class.visits[:responses]}"
@@ -75,8 +85,13 @@ module Capybara
  logger.info "Browser: driver has been restarted: name: #{mode}, pid: #{driver.pid}, port: #{driver.port}"
  end

- def current_response
- Nokogiri::HTML(body)
+ def current_response(response_type = :html)
+ case response_type
+ when :html
+ Nokogiri::HTML(body)
+ when :json
+ JSON.parse(body)
+ end
  end

  ###
@@ -114,6 +129,27 @@ module Capybara

  private

+ def match_error?(e, type:)
+ errors = (type == :to_retry ? config.retry_request_errors : config.skip_request_errors)
+ if errors.present?
+ errors.any? do |error|
+ if error.class == Hash
+ match = if error[:message].class == Regexp
+ e.message&.match?(error[:message])
+ else
+ e.message&.include?(error[:message])
+ end
+
+ e.class == error[:error] && match
+ else
+ e.class == error
+ end
+ end
+ else
+ false
+ end
+ end
+
  def process_delay(delay)
  interval = (delay.class == Range ? rand(delay) : delay)
  logger.debug "Browser: sleep #{interval.round(2)} #{'second'.pluralize(interval)} before request..."
@@ -1,12 +1,16 @@
  module Capybara
  class SessionConfig
  attr_accessor :cookies, :proxy, :user_agent
- attr_writer :retry_request_errors
+ attr_writer :retry_request_errors, :skip_request_errors

  def retry_request_errors
  @retry_request_errors ||= []
  end

+ def skip_request_errors
+ @skip_request_errors ||= []
+ end
+
  def restart_if
  @restart_if ||= {}
  end
@@ -146,10 +146,23 @@ module Kimurai
  puts VERSION
  end

+ desc "dashboard", "Run dashboard"
+ def dashboard
+ raise "Can't find Kimurai project" unless inside_project?
+
+ require './config/boot'
+ if Object.const_defined?("Kimurai::Dashboard")
+ require 'kimurai/dashboard/app'
+ Kimurai::Dashboard::App.run!
+ else
+ raise "Kimurai::Dashboard is not defined"
+ end
+ end
+
  private

  def inside_project?
- Dir.exists? "spiders"
+ Dir.exists?("spiders") && File.exists?("./config/boot.rb")
  end
  end
  end
@@ -0,0 +1,5 @@
+ class Hash
+ def deep_merge_excl(second, exclude)
+ self.merge(second.slice(*exclude)).deep_merge(second.except(*exclude))
+ end
+ end
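What `deep_merge_excl` achieves — keys listed in `exclude` are replaced wholesale while everything else is deep-merged — can be shown with a self-contained sketch. The gem itself relies on ActiveSupport's `Hash#deep_merge`, `#slice`, and `#except`; the `_sketch` methods below are our own stand-ins so the example runs without ActiveSupport:

```ruby
class Hash
  # Minimal deep merge: nested hashes are merged recursively,
  # everything else is overwritten by the second hash's value.
  def deep_merge_sketch(other)
    merge(other) do |_key, old, new|
      old.is_a?(Hash) && new.is_a?(Hash) ? old.deep_merge_sketch(new) : new
    end
  end

  # Keys in `exclude` are shallow-merged (replaced wholesale);
  # the remaining keys are deep-merged.
  def deep_merge_excl_sketch(other, exclude)
    excluded = other.select { |k, _| exclude.include?(k) }
    rest     = other.reject { |k, _| exclude.include?(k) }
    merge(excluded).deep_merge_sketch(rest)
  end
end

base     = { headers: { "A" => "1", "B" => "2" }, browser: { window: { w: 1 } } }
override = { headers: { "C" => "3" }, browser: { window: { h: 2 } } }

merged = base.deep_merge_excl_sketch(override, [:headers])
merged[:headers]          # => { "C" => "3" } (replaced, not deep-merged)
merged[:browser][:window] # => { w: 1, h: 2 } (deep-merged as before)
```

This is exactly why the "Do not deep merge config's `headers:` option" fix matters: a child spider's `headers:` now fully replaces the parent's instead of accumulating stale header keys.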
@@ -2,61 +2,43 @@ require 'pmap'

  module Kimurai
  class Runner
- attr_reader :jobs, :spiders
+ attr_reader :jobs, :spiders, :session_info

  def initialize(parallel_jobs:)
  @jobs = parallel_jobs
  @spiders = Kimurai.list
+ @start_time = Time.now

- if time_zone = Kimurai.configuration.time_zone
- Kimurai.time_zone = time_zone
- end
- end
-
- def run!
- start_time = Time.now
- run_id = start_time.to_i
- running_pids = []
-
- ENV.store("RBCAT_COLORIZER", "false")
-
- run_info = {
- id: run_id,
+ @session_info = {
+ id: @start_time.to_i,
  status: :processing,
- start_time: start_time,
+ start_time: @start_time,
  stop_time: nil,
  environment: Kimurai.env,
- concurrent_jobs: jobs,
- spiders: spiders.keys
+ concurrent_jobs: @jobs,
+ spiders: @spiders.keys
  }

- at_exit do
- # Prevent queue to process new intems while executing at_exit body
- Thread.list.each { |t| t.kill if t != Thread.main }
- # Kill currently running spiders
- running_pids.each { |pid| Process.kill("INT", pid) }
-
- error = $!
- stop_time = Time.now
+ if time_zone = Kimurai.configuration.time_zone
+ Kimurai.time_zone = time_zone
+ end

- if error.nil?
- run_info.merge!(status: :completed, stop_time: stop_time)
- else
- run_info.merge!(status: :failed, error: error.inspect, stop_time: stop_time)
- end
+ ENV.store("SESSION_ID", @start_time.to_i.to_s)
+ ENV.store("RBCAT_COLORIZER", "false")
+ end

- if at_stop_callback = Kimurai.configuration.runner_at_stop_callback
- at_stop_callback.call(run_info)
- end
- puts "<<< Runner: stopped: #{run_info}"
- end
+ def run!(exception_on_fail: true)
+ running_pids = []

- puts ">>> Runner: started: #{run_info}"
+ puts ">>> Runner: started: #{session_info}"
  if at_start_callback = Kimurai.configuration.runner_at_start_callback
- at_start_callback.call(run_info)
+ at_start_callback.call(session_info)
  end

+ running = true
  spiders.peach_with_index(jobs) do |spider, i|
+ next unless running
+
  spider_name = spider[0]
  puts "> Runner: started spider: #{spider_name}, index: #{i}"
@@ -67,6 +49,23 @@ module Kimurai
  running_pids.delete(pid)
  puts "< Runner: stopped spider: #{spider_name}, index: #{i}"
  end
+ rescue StandardError, SignalException, SystemExit => e
+ running = false
+ session_info.merge!(status: :failed, error: e.inspect, stop_time: Time.now)
+ exception_on_fail ? raise(e) : [session_info, e]
+ else
+ session_info.merge!(status: :completed, stop_time: Time.now)
+ ensure
+ running = false
+ Thread.list.each { |t| t.kill if t != Thread.main }
+
+ # Kill currently running spiders (if any, in case of fail)
+ running_pids.each { |pid| Process.kill("INT", pid) }
+
+ if at_stop_callback = Kimurai.configuration.runner_at_stop_callback
+ at_stop_callback.call(session_info)
+ end
+ puts "<<< Runner: stopped: #{session_info}"
  end
  end
  end
@@ -29,4 +29,9 @@ Kimurai.configure do |config|
  # json = JSON.pretty_generate(info)
  # Sender.send_notification("Stopped session: #{json}")
  # end
+
+ # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
+ # config.selenium_chrome_path = "/usr/bin/chromium-browser"
+ # Provide custom selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
+ # config.chromedriver_path = "/usr/local/bin/chromedriver"
  end
@@ -81,7 +81,23 @@ class ApplicationSpider < Kimurai::Base
  # works for all drivers
  # skip_duplicate_requests: true,

- # Array of errors to retry while processing a request
+ # Automatically skip provided errors while requesting a page.
+ # If a raised error matches one of the errors in the list, the error will be caught
+ # and the request will be skipped.
+ # It is a good idea to skip errors like NotFound (404), etc.
+ # Format: an array whose elements are error classes or/and hashes. You can use the hash
+ # format for more flexibility: `{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }`.
+ # The provided `message:` will be compared with the full error message using `String#include?`.
+ # You can also use a regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
+ # skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
+
+ # Automatically retry provided errors with a few attempts while requesting a page.
+ # If a raised error matches one of the errors in the list, the error will be caught
+ # and the request will be processed again after a delay. There are 3 attempts:
+ # first: delay 15 sec, second: delay 30 sec, third: delay 45 sec.
+ # If after 3 attempts there is still an exception, the exception will be raised.
+ # It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
+ # Format: same as for the `skip_request_errors` option.
  # retry_request_errors: [Net::ReadTimeout],

  # Restart browser if one of the options is true:
@@ -92,6 +108,8 @@ class ApplicationSpider < Kimurai::Base
  # Restart browser if provided requests limit is exceeded (works for all engines)
  # requests_limit: 100
  },
+
+ # Perform several actions before each request:
  before_request: {
  # Change proxy before each request. The `proxy:` option above should be presented
  # and has lambda format. Works only for poltergeist and mechanize engines
@@ -1,3 +1,3 @@
  module Kimurai
- VERSION = "1.1.0"
+ VERSION = "1.2.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: kimurai
  version: !ruby/object:Gem::Version
- version: 1.1.0
+ version: 1.2.0
  platform: ruby
  authors:
  - Victor Afanasev
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2018-09-12 00:00:00.000000000 Z
+ date: 2018-09-20 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: thor
@@ -265,7 +265,6 @@ files:
  - ".gitignore"
  - ".travis.yml"
  - CHANGELOG.md
- - CODE_OF_CONDUCT.md
  - Gemfile
  - LICENSE.txt
  - README.md
@@ -301,6 +300,7 @@ files:
  - lib/kimurai/cli/ansible_command_builder.rb
  - lib/kimurai/cli/generator.rb
  - lib/kimurai/core_ext/array.rb
+ - lib/kimurai/core_ext/hash.rb
  - lib/kimurai/core_ext/numeric.rb
  - lib/kimurai/core_ext/string.rb
  - lib/kimurai/pipeline.rb
@@ -1,74 +0,0 @@
- # Contributor Covenant Code of Conduct
-
- ## Our Pledge
-
- In the interest of fostering an open and welcoming environment, we as
- contributors and maintainers pledge to making participation in our project and
- our community a harassment-free experience for everyone, regardless of age, body
- size, disability, ethnicity, gender identity and expression, level of experience,
- nationality, personal appearance, race, religion, or sexual identity and
- orientation.
-
- ## Our Standards
-
- Examples of behavior that contributes to creating a positive environment
- include:
-
- * Using welcoming and inclusive language
- * Being respectful of differing viewpoints and experiences
- * Gracefully accepting constructive criticism
- * Focusing on what is best for the community
- * Showing empathy towards other community members
-
- Examples of unacceptable behavior by participants include:
-
- * The use of sexualized language or imagery and unwelcome sexual attention or
- advances
- * Trolling, insulting/derogatory comments, and personal or political attacks
- * Public or private harassment
- * Publishing others' private information, such as a physical or electronic
- address, without explicit permission
- * Other conduct which could reasonably be considered inappropriate in a
- professional setting
-
- ## Our Responsibilities
-
- Project maintainers are responsible for clarifying the standards of acceptable
- behavior and are expected to take appropriate and fair corrective action in
- response to any instances of unacceptable behavior.
-
- Project maintainers have the right and responsibility to remove, edit, or
- reject comments, commits, code, wiki edits, issues, and other contributions
- that are not aligned to this Code of Conduct, or to ban temporarily or
- permanently any contributor for other behaviors that they deem inappropriate,
- threatening, offensive, or harmful.
-
- ## Scope
-
- This Code of Conduct applies both within project spaces and in public spaces
- when an individual is representing the project or its community. Examples of
- representing a project or community include using an official project e-mail
- address, posting via an official social media account, or acting as an appointed
- representative at an online or offline event. Representation of a project may be
- further defined and clarified by project maintainers.
-
- ## Enforcement
-
- Instances of abusive, harassing, or otherwise unacceptable behavior may be
- reported by contacting the project team at vicfreefly@gmail.com. All
- complaints will be reviewed and investigated and will result in a response that
- is deemed necessary and appropriate to the circumstances. The project team is
- obligated to maintain confidentiality with regard to the reporter of an incident.
- Further details of specific enforcement policies may be posted separately.
-
- Project maintainers who do not follow or enforce the Code of Conduct in good
- faith may face temporary or permanent repercussions as determined by other
- members of the project's leadership.
-
- ## Attribution
-
- This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
- available at [http://contributor-covenant.org/version/1/4][version]
-
- [homepage]: http://contributor-covenant.org
- [version]: http://contributor-covenant.org/version/1/4/