kimurai 1.2.0 → 1.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 2ebb62d0916ee55eb8ec05f34b5f5909b99aae8fa2236dd93169f6e5e1221805
- data.tar.gz: da86497a6c4f61f2ff1ffe3e8eeee92acb203305e712b9bd4baaa5794fdcf5f5
+ metadata.gz: 9fb75ab8965adb212964eee4b319642c3b83bbb4e3b957b739b6eba28b64546d
+ data.tar.gz: b03efc58a57222fdbb51e5ab6aa389f352de1f94fbd02358ccb613d7a15a5774
  SHA512:
- metadata.gz: 9d54d6074928a5bc0aa0a9d7e64308942e21ad5ae17788b2fa36adfffe88f67df0c533558ced412f2e22481a1664661c547419d09451d1260dd6ebd14ca4d915
- data.tar.gz: d21f499a2a292dd672480d15da71742cfa82dec054b6a3c3a3c756e6cd2c98e58ac7d3b7fa3fe0ce6ae2e956d46417269242e742f8d83b644176ccef0822de75
+ metadata.gz: 737f467deef60eb2983d86286c040d1b899732ff3e52fdab50ce07e8893070c0cd9f3f3020fe57e32ab1d9140299a9112db698e8d72112934b4fd90ee10a7644
+ data.tar.gz: de89bc1ef85f7f29986b7033d883a49c804592a21f78b409bf325f2e0e578dc78bbf67cdca4ca719f34e43f305dd99f8f227cbd8b545e42f06ca7f49d24e1e35
data/CHANGELOG.md CHANGED
@@ -1,4 +1,21 @@
  # CHANGELOG
+ ## 1.3.0
+ ### Breaking changes 1.3.0
+ * Remove persistence database feature (because it's slow and makes things complicated)
+
+ ### New
+ * Add `--include` and `--exclude` options to CLI#runner
+ * Add Base `#create_browser` method to easily create additional browser instances
+ * Add Capybara::Session `#scroll_to_bottom`
+ * Add skip_on_failure feature to `retry_request_errors` config option
+ * Add info about `add_event` method to the README
+
+ ### Fixes and improvements
+ * Improve Runner
+ * Fix time helper in schedule.rb
+ * Add proxy validation to browser builders
+ * Allow passing different arguments to the `Base.parse` method
+
  ## 1.2.0
  ### New
  * Add possibility to add array of values to the storage (`Base::Storage#add`)
data/README.md CHANGED
@@ -18,7 +18,7 @@
 
  <br>
 
- > Note: this readme is for `1.2.0` gem version. CHANGELOG [here](CHANGELOG.md).
+ > Note: this readme is for `1.3.0` gem version. CHANGELOG [here](CHANGELOG.md).
 
  Kimurai is a modern web scraping framework written in Ruby which **works out of box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
 
@@ -241,10 +241,10 @@ I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320] INFO -- infinite_scrol
  * [Skip duplicates](#skip-duplicates)
  * [Automatically skip all duplicated requests urls](#automatically-skip-all-duplicated-requests-urls)
  * [Storage object](#storage-object)
- * [Persistence database for the storage](#persistence-database-for-the-storage)
  * [Handle request errors](#handle-request-errors)
  * [skip_request_errors](#skip_request_errors)
  * [retry_request_errors](#retry_request_errors)
+ * [Logging custom events](#logging-custom-events)
  * [open_spider and close_spider callbacks](#open_spider-and-close_spider-callbacks)
  * [KIMURAI_ENV](#kimurai_env)
  * [Parallel crawling using in_parallel](#parallel-crawling-using-in_parallel)
@@ -297,8 +297,8 @@ echo 'export PATH="$HOME/.rbenv/plugins/ruby-build/bin:$PATH"' >> ~/.bashrc
  exec $SHELL
 
  # Install latest Ruby
- rbenv install 2.5.1
- rbenv global 2.5.1
+ rbenv install 2.5.3
+ rbenv global 2.5.3
 
  gem install bundler
  ```
@@ -317,8 +317,8 @@ echo 'if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi' >> ~/.bash_pr
  source ~/.bash_profile
 
  # Install latest Ruby
- rbenv install 2.5.1
- rbenv global 2.5.1
+ rbenv install 2.5.3
+ rbenv global 2.5.3
 
  gem install bundler
  ```
@@ -349,17 +349,17 @@ sudo apt install -q -y xvfb
  # Install chromium-browser and firefox
  sudo apt install -q -y chromium-browser firefox
 
- # Instal chromedriver (2.39 version)
+ # Install chromedriver (version 2.44)
  # All versions located here https://sites.google.com/a/chromium.org/chromedriver/downloads
- cd /tmp && wget https://chromedriver.storage.googleapis.com/2.39/chromedriver_linux64.zip
+ cd /tmp && wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
  sudo unzip chromedriver_linux64.zip -d /usr/local/bin
  rm -f chromedriver_linux64.zip
 
- # Install geckodriver (0.21.0 version)
+ # Install geckodriver (version 0.23.0)
  # All versions located here https://github.com/mozilla/geckodriver/releases/
- cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.21.0/geckodriver-v0.21.0-linux64.tar.gz
- sudo tar -xvzf geckodriver-v0.21.0-linux64.tar.gz -C /usr/local/bin
- rm -f geckodriver-v0.21.0-linux64.tar.gz
+ cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
+ sudo tar -xvzf geckodriver-v0.23.0-linux64.tar.gz -C /usr/local/bin
+ rm -f geckodriver-v0.23.0-linux64.tar.gz
 
  # Install PhantomJS (2.1.1)
  # All versions located here http://phantomjs.org/download.html
@@ -815,113 +815,60 @@ It is possible to automatically skip all already visited urls while calling `req
  * `#all` - display storage hash where keys are existing scopes.
  * `#include?(scope, value)` - return `true` if value in the scope exists, and `false` if not
  * `#add(scope, value)` - add value to the scope
- * `unique?(scope, value)` - method already described above, will return `false` if value in the scope exists, or return `true` + add value to the scope if value in the scope not exists.
- * `clear!` - reset the whole storage by deleting all values from all scopes.
+ * `#unique?(scope, value)` - method already described above: returns `false` if the value already exists in the scope, or returns `true` and adds the value to the scope if it does not.
+ * `#clear!` - reset the whole storage by deleting all values from all scopes.
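For reference, a brief usage sketch of the storage API listed above (the scope name and urls are illustrative, not from the gem's docs):

```ruby
# Inside a spider method, `storage` is available as a helper:
storage.add(:product_urls, "https://example.com/p/1")
storage.include?(:product_urls, "https://example.com/p/1") # => true

# unique? saves the value and returns true only the first time it sees it:
storage.unique?(:product_urls, "https://example.com/p/2") # => true
storage.unique?(:product_urls, "https://example.com/p/2") # => false

storage.all(:product_urls) # => ["https://example.com/p/1", "https://example.com/p/2"]
storage.clear! # wipe all values from all scopes
```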
 
- #### Persistence database for the storage
 
- It's pretty common that spider can fail (IP blocking, etc.) while crawling a huge website with +5k listings. In this case, it's not convenient to start everything over again.
-
- Kimurai can use persistence database for a `storage` using Ruby built-in [PStore](https://ruby-doc.org/stdlib-2.5.1/libdoc/pstore/rdoc/PStore.html) database. With this option, you can automatically skip already visited urls in the next run _if previous run was failed_, otherwise _(if run was successful)_ storage database will be removed before spider stops.
-
- Also, with persistence storage enabled, [save_to](#save_to-helper) method will keep adding items to an existing file (it will not be cleared before each run).
+ ### Handle request errors
+ It is quite common that some pages of a crawled website return a response code other than `200 OK`. In such cases, the `request_to` method (or `browser.visit`) can raise an exception. Kimurai provides the `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
 
- To use persistence storage, provide `continue: true` option to the `.crawl!` method: `SomeSpider.crawl!(continue: true)`.
+ #### skip_request_errors
+ You can automatically skip certain errors while requesting a page using the `skip_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, it will be caught and the request will be skipped. It is a good idea to skip errors like NotFound (404), etc.
 
- There are two approaches how to use persistence storage and skip already processed items pages. First, is to manually add required urls to the storage:
+ Format for the option: an array whose elements are error classes and/or hashes. You can use the _hash_ format for more flexibility:
 
- ```ruby
- class ProductsSpider < Kimurai::Base
-   @start_urls = ["https://example-shop.com/"]
+ ```ruby
+ @config = {
+   skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }]
+ }
+ ```
+ In this case, the provided `message:` will be compared with the full error message using `String#include?`. You can also use a regex instead: `{ error: RuntimeError, message: /404|403/ }`.
 
-   def parse(response, url:, data: {})
-     response.xpath("//categories/path").each do |category|
-       request_to :parse_category, url: category[:href]
-     end
-   end
+ #### retry_request_errors
+ You can automatically retry certain errors with a few attempts while requesting a page using the `retry_request_errors` [config](#spider-config) option. If a raised error matches one of the errors in the list, it will be caught and the request will be processed again after a delay.
 
-   def parse_category(response, url:, data: {})
-     response.xpath("//products/path").each do |product|
-       # check if product url already contains in the scope `:product_urls`, if so, skip the request:
-       next if storage.contains?(:product_urls, product[:href])
-       # Otherwise process it:
-       request_to :parse_product, url: product[:href]
-     end
-   end
+ There are 3 attempts: the first with a delay of _15 sec_, the second _30 sec_, and the third _45 sec_. If there is still an exception after 3 attempts, it will be raised. It is a good idea to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
 
-   def parse_product(response, url:, data: {})
-     # Add visited item to the storage:
-     storage.add(:product_urls, url)
+ Format for the option: the same as for the `skip_request_errors` option.
 
-     # ...
-   end
- end
+ If you would like to skip (not raise) the error after all retries are exhausted, specify the `skip_on_failure: true` option:
 
- # Run the spider with persistence database option:
- ProductsSpider.crawl!(continue: true)
+ ```ruby
+ @config = {
+   retry_request_errors: [{ error: RuntimeError, skip_on_failure: true }]
+ }
  ```
 
- Second approach is to automatically skip already processed items urls using `@config` `skip_duplicate_requests:` option:
+ ### Logging custom events
 
- <details/>
- <summary>Check the code</summary>
+ It is possible to save custom messages to the [run_info](#open_spider-and-close_spider-callbacks) hash using the `add_event('Some message')` method. This feature helps you keep track of important things that happened during crawling without reading the whole spider log (in case you're also logging these messages using `logger`). Example:
 
  ```ruby
- class ProductsSpider < Kimurai::Base
-   @start_urls = ["https://example-shop.com/"]
-   @config = {
-     # Configure skip_duplicate_requests option:
-     skip_duplicate_requests: { scope: :product_urls, check_only: true }
-   }
-
-   def parse(response, url:, data: {})
-     response.xpath("//categories/path").each do |category|
-       request_to :parse_category, url: category[:href]
-     end
+ def parse_product(response, url:, data: {})
+   unless response.at_xpath("//path/to/add_to_cart_button")
+     add_event("Product is sold") and return
    end
 
-   def parse_category(response, url:, data: {})
-     response.xpath("//products/path").each do |product|
-       # Before visiting the url, `request_to` will check if it already contains
-       # in the storage scope `:product_urls`, if so, request will be skipped:
-       request_to :parse_product, url: product[:href]
-     end
-   end
-
-   def parse_product(response, url:, data: {})
-     # Add visited item url to the storage scope `:product_urls`:
-     storage.add(:product_urls, url)
-
-     # ...
-   end
+   # ...
  end
-
- # Run the spider with persistence database option:
- ProductsSpider.crawl!(continue: true)
  ```
- </details>
-
- ### Handle request errors
- It is quite common that some pages of crawling website can return different response code than `200 ok`. In such cases, method `request_to` (or `browser.visit`) can raise an exception. Kimurai provides `skip_request_errors` and `retry_request_errors` [config](#spider-config) options to handle such errors:
-
- #### skip_request_errors
- You can automatically skip some of errors while requesting a page using `skip_request_errors` [config](#spider-config) option. If raised error matches one of the errors in the list, then this error will be caught, and request will be skipped. It is a good idea to skip errors like NotFound(404), etc.
 
- Format for the option: array where elements are error classes or/and hashes. You can use _hash_ format for more flexibility:
-
- ```ruby
- @config = {
-   skip_request_errors: [{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }]
- }
  ```
- In this case, provided `message:` will be compared with a full error message using `String#include?`. Also you can use regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
-
- #### retry_request_errors
- You can automatically retry some of errors with a few attempts while requesting a page using `retry_request_errors` [config](#spider-config) option. If raised error matches one of the errors in the list, then this error will be caught and the request will be processed again within a delay.
-
- There are 3 attempts: first: delay _15 sec_, second: delay _30 sec_, third: delay _45 sec_. If after 3 attempts there is still an exception, then the exception will be raised. It is a good idea to try to retry errros like `ReadTimeout`, `HTTPBadGateway`, etc.
-
- Format for the option: same like for `skip_request_errors` option.
+ ...
+ I, [2018-11-28 22:20:19 +0400#7402] [M: 47156576560640] INFO -- example_spider: Spider: new event (scope: custom): Product is sold
+ ...
+ I, [2018-11-28 22:20:19 +0400#7402] [M: 47156576560640] INFO -- example_spider: Spider: stopped: {:events=>{:custom=>{"Product is sold"=>1}}}
+ ```
 
  ### `open_spider` and `close_spider` callbacks
 
@@ -1296,7 +1243,7 @@ set :chronic_options, hours24: true
  # crawl "google_spider.com", output: "log/google_spider.com.log"
  # end
  def local_to_utc(time_string, zone:)
-   TZInfo::Timezone.get(zone).local_to_utc(Time.parse(time))
+   TZInfo::Timezone.get(zone).local_to_utc(Time.parse(time_string))
  end
 
  # Note: by default Whenever exports cron commands with :environment == "production".
@@ -1483,6 +1430,8 @@ You can automatically setup [required environment](#installation) for Kimurai on
 
  > To perform remote server setup, [Ansible](https://github.com/ansible/ansible) is required **on the desktop** machine (to install: Ubuntu: `$ sudo apt install ansible`, Mac OS X: `$ brew install ansible`)
 
+ > It's recommended to use a regular user, not `root`, to set up the server. To create a new user, log in to the server (`$ ssh root@your_server_ip`), run `$ adduser username` to create the user, and `$ gpasswd -a username sudo` to add it to the sudo group.
+
  Example:
 
  ```bash
@@ -2081,6 +2030,16 @@ $ bundle exec kimurai runner -j 3
 
  Each spider runs in a separate process. Spiders logs available at `log/` folder. Pass `-j` option to specify how many spiders should be processed at the same time (default is 1).
 
+ You can provide additional arguments like `--include` or `--exclude` to specify which spiders to run:
+
+ ```bash
+ # Run only custom_spider and example_spider:
+ $ bundle exec kimurai runner --include custom_spider example_spider
+
+ # Run all except github_spider:
+ $ bundle exec kimurai runner --exclude github_spider
+ ```
+
  #### Runner callbacks
 
  You can perform custom actions before runner starts and after runner stops using `config.runner_at_start_callback` and `config.runner_at_stop_callback`. Check [config/application.rb](lib/kimurai/template/config/application.rb) to see example.
data/kimurai.gemspec CHANGED
@@ -19,7 +19,7 @@ Gem::Specification.new do |spec|
      `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
    end
    spec.bindir = "exe"
-   spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+   spec.executables = "kimurai"
    spec.require_paths = ["lib"]
    spec.required_ruby_version = ">= 2.5.0"
 
data/lib/kimurai/automation/setup.yml CHANGED
@@ -1,16 +1,16 @@
  ---
  - hosts: all
    vars:
-     ruby: 2.5.1
+     ruby: 2.5.3
      rbenv_root_path: /home/{{ ansible_user_id }}/.rbenv
      rbenv_shims_path: "{{ rbenv_root_path }}/shims"
      ruby_versions_path: "{{ rbenv_root_path }}/versions"
      # check latest here http://phantomjs.org/download.html
      phantomjs: 2.1.1
      # check latest here https://github.com/mozilla/geckodriver/releases/
-     geckodriver: 0.21.0
+     geckodriver: 0.23.0
      # check latest here https://sites.google.com/a/chromium.org/chromedriver/downloads
-     chromedriver: 2.39
+     chromedriver: 2.44
 
    tasks:
    - name: Update apt cache
data/lib/kimurai/base.rb CHANGED
@@ -97,15 +97,10 @@ module Kimurai
      end
    end
 
-   def self.crawl!(continue: false, exception_on_fail: true)
+   def self.crawl!(exception_on_fail: true)
      logger.error "Spider: already running: #{name}" and return false if running?
 
-     storage_path =
-       if continue
-         Dir.exists?("tmp") ? "tmp/#{name}.pstore" : "#{name}.pstore"
-       end
-
-     @storage = Storage.new(storage_path)
+     @storage = Storage.new
      @savers = {}
      @update_mutex = Mutex.new
 
@@ -149,15 +144,6 @@ module Kimurai
 
      close_spider if self.respond_to? :close_spider
 
-     if @storage.path
-       if completed?
-         @storage.delete!
-         logger.info "Spider: storage: persistence database #{@storage.path} was removed (successful run)"
-       else
-         logger.info "Spider: storage: persistence database #{@storage.path} wasn't removed (failed run)"
-       end
-     end
-
      message = "Spider: stopped: #{@run_info.merge(running_time: @run_info[:running_time]&.duration)}"
      failed? ? logger.fatal(message) : logger.info(message)
 
@@ -165,9 +151,16 @@ module Kimurai
      end
    end
 
-   def self.parse!(handler, engine = nil, url: nil, data: {})
-     spider = engine ? self.new(engine) : self.new
-     url.present? ? spider.request_to(handler, url: url, data: data) : spider.public_send(handler)
+   def self.parse!(handler, *args, **request)
+     spider = self.new
+
+     if args.present?
+       spider.public_send(handler, *args)
+     elsif request.present?
+       spider.request_to(handler, request)
+     else
+       spider.public_send(handler)
+     end
    ensure
      spider.browser.destroy_driver! if spider.instance_variable_get("@browser")
    end
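The reworked `parse!` above accepts either request keywords or plain positional arguments; a minimal sketch of the call styles it permits (the spider and handler names are hypothetical):

```ruby
# Keyword arguments are routed through request_to, so the url is visited first:
ExampleSpider.parse!(:parse_product, url: "https://example.com/p/1")

# Positional arguments are passed straight to the handler via public_send:
ExampleSpider.parse!(:process_feed, "electronics", 2018)

# With no extra arguments, the handler is simply invoked:
ExampleSpider.parse!(:refresh_listing)
```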
@@ -244,12 +237,17 @@ module Kimurai
    end
 
    self.class.add_event(scope, event)
+   logger.info "Spider: new event (scope: #{scope}): #{event}" if scope == :custom
  end
 
  ###
 
  private
 
+ def create_browser(engine, config = {})
+   Kimurai::BrowserBuilder.build(engine, config, spider: self)
+ end
+
  def unique_request?(url)
    options = @config[:skip_duplicate_requests]
    if options.class == Hash
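A sketch of how the new `#create_browser` helper might be used; since it is private, it is called from instance methods of a spider (the engine choice and urls are illustrative):

```ruby
class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    # The default browser keeps crawling, while a second, independent
    # instance (plain HTTP via mechanize) fetches an auxiliary page:
    http = create_browser(:mechanize)
    http.visit("https://example.com/robots.txt")
    # ...
  end
end
```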
data/lib/kimurai/base/storage.rb CHANGED
@@ -1,60 +1,34 @@
- require 'pstore'
-
  module Kimurai
    class Base
      class Storage
-       attr_reader :database, :path
+       attr_reader :database
 
-       def initialize(path = nil)
-         @path = path
+       def initialize
          @mutex = Mutex.new
-         @database = path ? PStore.new(path) : {}
+         @database = {}
        end
 
        def all(scope = nil)
          @mutex.synchronize do
-           if path
-             database.transaction { scope ? database.fetch(scope, []) : database }
-           else
-             scope ? database.fetch(scope, []) : database
-           end
+           scope ? database.fetch(scope, []) : database
          end
        end
 
        def include?(scope, value)
          @mutex.synchronize do
-           if path
-             database.transaction do
-               database[scope] ||= []
-               database[scope].include?(value)
-             end
-           else
-             database[scope] ||= []
-             database[scope].include?(value)
-           end
+           database[scope] ||= []
+           database[scope].include?(value)
          end
        end
 
        def add(scope, value)
          @mutex.synchronize do
-           if path
-             database.transaction do
-               database[scope] ||= []
-               if value.class == Array
-                 database[scope] += value
-                 database[scope].uniq!
-               else
-                 database[scope].push(value) unless database[scope].include?(value)
-               end
-             end
+           database[scope] ||= []
+           if value.kind_of?(Array)
+             database[scope] += value
+             database[scope].uniq!
            else
-             database[scope] ||= []
-             if value.class == Array
-               database[scope] += value
-               database[scope].uniq!
-             else
-               database[scope].push(value) unless database[scope].include?(value)
-             end
+             database[scope].push(value) unless database[scope].include?(value)
            end
          end
        end
@@ -63,15 +37,8 @@ module Kimurai
 
        def unique?(scope, value)
          @mutex.synchronize do
-           if path
-             database.transaction do
-               database[scope] ||= []
-               database[scope].include?(value) ? false : database[scope].push(value) and true
-             end
-           else
-             database[scope] ||= []
-             database[scope].include?(value) ? false : database[scope].push(value) and true
-           end
+           database[scope] ||= []
+           database[scope].include?(value) ? false : database[scope].push(value) and true
          end
        end
 
@@ -79,21 +46,7 @@ module Kimurai
 
        def clear!
          @mutex.synchronize do
-           if path
-             database.transaction do
-               database.roots.each { |key| database.delete key }
-             end
-           else
-             database = {}
-           end
-         end
-       end
-
-       def delete!
-         @mutex.synchronize do
-           if path
-             File.delete path if File.exists? path
-           end
+           @database = {}
          end
        end
      end
data/lib/kimurai/browser_builder/mechanize_builder.rb CHANGED
@@ -38,11 +38,11 @@ module Kimurai
      proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
      ip, port, type = proxy_string.split(":")
 
-     if type == "socks5"
-       logger.error "BrowserBuilder (mechanize): can't set socks5 proxy (not supported), skipped"
-     else
+     if type == "http"
        @browser.driver.set_proxy(*proxy_string.split(":"))
-       logger.debug "BrowserBuilder (mechanize): enabled #{type} proxy, ip: #{ip}, port: #{port}"
+       logger.debug "BrowserBuilder (mechanize): enabled http proxy, ip: #{ip}, port: #{port}"
+     else
+       logger.error "BrowserBuilder (mechanize): can't set #{type} proxy (not supported), skipped"
      end
    end
 
data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb CHANGED
@@ -64,8 +64,12 @@ module Kimurai
      proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
      ip, port, type = proxy_string.split(":")
 
-     @browser.driver.set_proxy(*proxy_string.split(":"))
-     logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled #{type} proxy, ip: #{ip}, port: #{port}"
+     if %w(http socks5).include?(type)
+       @browser.driver.set_proxy(*proxy_string.split(":"))
+       logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled #{type} proxy, ip: #{ip}, port: #{port}"
+     else
+       logger.error "BrowserBuilder (poltergeist_phantomjs): wrong type of proxy: #{type}, skipped"
+     end
 
      # Headers
data/lib/kimurai/browser_builder/selenium_chrome_builder.rb CHANGED
@@ -44,11 +44,15 @@ module Kimurai
      proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
      ip, port, type, user, password = proxy_string.split(":")
 
-     if user.nil? && password.nil?
-       driver_options.args << "--proxy-server=#{type}://#{ip}:#{port}"
-       logger.debug "BrowserBuilder (selenium_chrome): enabled #{type} proxy, ip: #{ip}, port: #{port}"
+     if %w(http socks5).include?(type)
+       if user.nil? && password.nil?
+         driver_options.args << "--proxy-server=#{type}://#{ip}:#{port}"
+         logger.debug "BrowserBuilder (selenium_chrome): enabled #{type} proxy, ip: #{ip}, port: #{port}"
+       else
+         logger.error "BrowserBuilder (selenium_chrome): proxy with authentication doesn't supported by selenium, skipped"
+       end
      else
-       logger.error "BrowserBuilder (selenium_chrome): proxy with authentication doesn't supported by selenium, skipped"
+       logger.error "BrowserBuilder (selenium_chrome): wrong type of proxy: #{type}, skipped"
      end
 
data/lib/kimurai/browser_builder/selenium_firefox_builder.rb CHANGED
@@ -40,13 +40,18 @@ module Kimurai
        driver_options.profile["network.proxy.http_port"] = port.to_i
        driver_options.profile["network.proxy.ssl"] = ip
        driver_options.profile["network.proxy.ssl_port"] = port.to_i
+
+       logger.debug "BrowserBuilder (selenium_firefox): enabled http proxy, ip: #{ip}, port: #{port}"
      elsif type == "socks5"
        driver_options.profile["network.proxy.socks"] = ip
        driver_options.profile["network.proxy.socks_port"] = port.to_i
        driver_options.profile["network.proxy.socks_version"] = 5
        driver_options.profile["network.proxy.socks_remote_dns"] = true
+
+       logger.debug "BrowserBuilder (selenium_firefox): enabled socks5 proxy, ip: #{ip}, port: #{port}"
+     else
+       logger.error "BrowserBuilder (selenium_firefox): wrong type of proxy: #{type}, skipped"
      end
-     logger.debug "BrowserBuilder (selenium_firefox): enabled #{type} proxy, ip: #{ip}, port: #{port}"
    else
      logger.error "BrowserBuilder (selenium_firefox): proxy with authentication doesn't supported by selenium, skipped"
    end
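All four builders above derive `ip`, `port` and `type` from `proxy_string.split(":")`, so the expected setting format appears to be `ip:port:type`, with some builders also reading optional `:user:password` fields. A hedged config sketch (the address is fake):

```ruby
@config = {
  # `type` must now pass the builder's validation, e.g. "http" or "socks5":
  proxy: "3.4.5.6:3128:http"
  # A Proc is also accepted, since the builders call it before splitting:
  # proxy: -> { PROXY_LIST.sample }  # PROXY_LIST is a hypothetical constant
}
```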
data/lib/kimurai/capybara_ext/mechanize/driver.rb CHANGED
@@ -10,6 +10,8 @@ class Capybara::Mechanize::Driver
      browser.agent.set_proxy(ip, port, user, password)
    end
 
+   ###
+
    def headers
      browser.agent.request_headers
    end
@@ -22,6 +24,12 @@ class Capybara::Mechanize::Driver
      browser.agent.request_headers[name] = value
    end
 
+   ###
+
+   def get_cookies
+     browser.agent.cookies
+   end
+
    def set_cookie(name, value, options = {})
      options[:name] ||= name
      options[:value] ||= value
@@ -30,10 +38,18 @@ class Capybara::Mechanize::Driver
      browser.agent.cookie_jar << cookie
    end
 
+   def set_cookies(cookies)
+     cookies.each do |cookie|
+       set_cookie(cookie[:name], cookie[:value], cookie)
+     end
+   end
+
    def clear_cookies
      browser.agent.cookie_jar.clear!
    end
 
+   ###
+
    def quit
      browser.agent.shutdown
    end
data/lib/kimurai/capybara_ext/selenium/driver.rb CHANGED
@@ -1,6 +1,10 @@
  require_relative '../driver/base'
 
  class Capybara::Selenium::Driver
+   def get_cookies
+     browser.manage.all_cookies
+   end
+
    def set_cookie(name, value, options = {})
      options[:name] ||= name
      options[:value] ||= value
@@ -8,6 +12,12 @@ class Capybara::Selenium::Driver
      browser.manage.add_cookie(options)
    end
 
+   def set_cookies(cookies)
+     cookies.each do |cookie|
+       set_cookie(cookie[:name], cookie[:value], cookie)
+     end
+   end
+
    def clear_cookies
      browser.manage.delete_all_cookies
    end
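The symmetric `get_cookies`/`set_cookies` pair added to both drivers suggests a session-transfer workflow like the following sketch (the flow is assumed, not documented in this diff):

```ruby
# After logging in with the default (selenium) browser, hand its
# cookies to a cheaper HTTP-only browser instance:
cookies = browser.driver.get_cookies

http = create_browser(:mechanize)
http.driver.set_cookies(cookies)
http.visit("https://example.com/account")
```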
data/lib/kimurai/capybara_ext/session.rb CHANGED
@@ -33,7 +33,7 @@ module Capybara
          sleep sleep_interval and retry
        else
          logger.error "Browser: all retries (#{retries - 1}) to the url #{visit_uri} are gone"
-         raise e
+         raise e unless skip_error_on_failure?(e)
        end
      else
        raise e
@@ -127,26 +127,40 @@ module Capybara
 
      ###
 
+     def scroll_to_bottom
+       execute_script("window.scrollBy(0,10000)")
+     end
+
      private
 
+     def skip_error_on_failure?(e)
+       config.retry_request_errors.any? do |error|
+         error[:skip_on_failure] && e.class.ancestors.include?(error[:error]) if error.kind_of?(Hash)
+       end
+     end
+
      def match_error?(e, type:)
-       errors = (type == :to_retry ? config.retry_request_errors : config.skip_request_errors)
-       if errors.present?
-         errors.any? do |error|
-           if error.class == Hash
-             match = if error[:message].class == Regexp
+       errors =
+         case type
+         when :to_retry then config.retry_request_errors
+         when :to_skip then config.skip_request_errors
+         end
+
+       errors.any? do |error|
+         if error.kind_of?(Hash)
+           match_class = e.class.ancestors.include?(error[:error])
+           if error[:message].present?
+             if error[:message].kind_of?(Regexp)
                e.message&.match?(error[:message])
              else
                e.message&.include?(error[:message])
-             end
-
-           e.class == error[:error] && match
+             end && match_class
            else
-             e.class == error
+             match_class
            end
+         else
+           e.class.ancestors.include?(error)
          end
-       else
-         false
        end
      end
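The new `scroll_to_bottom` helper above simply scrolls the window down by 10,000 pixels, which is handy for infinite-scroll pages; a usage sketch (the selector and loop count are illustrative):

```ruby
def parse(response, url:, data: {})
  # Trigger a few rounds of lazy loading before collecting items:
  5.times do
    browser.scroll_to_bottom
    sleep 2 # give the page time to append new results
  end

  browser.current_response.xpath("//div[@class='listing']").each do |listing|
    # ...
  end
end
```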
 
data/lib/kimurai/cli.rb CHANGED
@@ -66,7 +66,6 @@ module Kimurai
  ###
 
  desc "crawl", "Run a particular spider by it's name"
- option :continue, aliases: :c, type: :boolean, default: false, banner: "Continue previous crawling"
  def crawl(spider_name)
    raise "Can't find Kimurai project" unless inside_project?
    require './config/boot'
@@ -81,7 +80,7 @@ module Kimurai
      Kimurai.time_zone = time_zone
    end
 
-   klass.crawl!(continue: options["continue"])
+   klass.crawl!
  end
 
  desc "parse", "Parse url in the particular spider method"
@@ -129,6 +128,8 @@ module Kimurai
  end
 
  desc "runner", "Run all spiders in the project in queue"
+ option :include, type: :array, default: [], banner: "List of spiders to run"
+ option :exclude, type: :array, default: [], banner: "List of spiders to exclude from run"
  option :jobs, aliases: :j, type: :numeric, default: 1, banner: "The number of concurrent jobs"
  def runner
    raise "Can't find Kimurai project" unless inside_project?
@@ -138,7 +139,11 @@ module Kimurai
 
    require './config/boot'
    require 'kimurai/runner'
-   Runner.new(parallel_jobs: jobs).run!
+
+   spiders = options["include"].presence || Kimurai.list.keys
+   spiders -= options["exclude"]
+
+   Runner.new(spiders, jobs).run!
  end
 
  desc "--version, -v", "Print the version"
data/lib/kimurai/runner.rb CHANGED
@@ -4,9 +4,9 @@ module Kimurai
  class Runner
    attr_reader :jobs, :spiders, :session_info
 
-   def initialize(parallel_jobs:)
+   def initialize(spiders, parallel_jobs)
      @jobs = parallel_jobs
-     @spiders = Kimurai.list
+     @spiders = spiders
      @start_time = Time.now
 
      @session_info = {
@@ -16,7 +16,7 @@ module Kimurai
        stop_time: nil,
        environment: Kimurai.env,
        concurrent_jobs: @jobs,
-       spiders: @spiders.keys
+       spiders: @spiders
      }
 
      if time_zone = Kimurai.configuration.time_zone
@@ -28,8 +28,6 @@ module Kimurai
    end
 
    def run!(exception_on_fail: true)
-     running_pids = []
-
      puts ">>> Runner: started: #{session_info}"
      if at_start_callback = Kimurai.configuration.runner_at_start_callback
        at_start_callback.call(session_info)
@@ -39,29 +37,20 @@ module Kimurai
      spiders.peach_with_index(jobs) do |spider, i|
        next unless running
 
-       spider_name = spider[0]
-       puts "> Runner: started spider: #{spider_name}, index: #{i}"
-
-       pid = spawn("bundle", "exec", "kimurai", "crawl", spider_name, [:out, :err] => "log/#{spider_name}.log")
-       running_pids << pid
+       puts "> Runner: started spider: #{spider}, index: #{i}"
+       pid = spawn("bundle", "exec", "kimurai", "crawl", spider, [:out, :err] => "log/#{spider}.log")
        Process.wait pid
 
-       running_pids.delete(pid)
-       puts "< Runner: stopped spider: #{spider_name}, index: #{i}"
+       puts "< Runner: stopped spider: #{spider}, index: #{i}"
      end
    rescue StandardError, SignalException, SystemExit => e
      running = false
+
      session_info.merge!(status: :failed, error: e.inspect, stop_time: Time.now)
      exception_on_fail ? raise(e) : [session_info, e]
    else
      session_info.merge!(status: :completed, stop_time: Time.now)
    ensure
-     running = false
-     Thread.list.each { |t| t.kill if t != Thread.main }
-
-     # Kill currently running spiders (if any, in case of fail)
-     running_pids.each { |pid| Process.kill("INT", pid) }
-
      if at_stop_callback = Kimurai.configuration.runner_at_stop_callback
        at_stop_callback.call(session_info)
      end
data/lib/kimurai/template/config/schedule.rb CHANGED
@@ -16,7 +16,7 @@ set :chronic_options, hours24: true
  # crawl "google_spider.com", output: "log/google_spider.com.log"
  # end
  def local_to_utc(time_string, zone:)
-   TZInfo::Timezone.get(zone).local_to_utc(Time.parse(time))
+   TZInfo::Timezone.get(zone).local_to_utc(Time.parse(time_string))
  end
 
  # Note: by default Whenever exports cron commands with :environment == "production".
data/lib/kimurai/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Kimurai
-   VERSION = "1.2.0"
+   VERSION = "1.3.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: kimurai
  version: !ruby/object:Gem::Version
-   version: 1.2.0
+   version: 1.3.0
  platform: ruby
  authors:
  - Victor Afanasev
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2018-09-20 00:00:00.000000000 Z
+ date: 2018-11-28 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: thor