kimurai 1.3.2 → 1.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 71ac47a69324914eaa098231d72b4af453e9e222d4fc7b19f0b2ef88449c7c18
-  data.tar.gz: e5ac63a13897f92118b3bbe33bc55f8f46e4970333ffeedf88c3f64e2df791e8
+  metadata.gz: 9efd6a3590e9df217a803d9e7c806b052d1c3c1cd9887294d7c736ee19fc8e7b
+  data.tar.gz: 14d14793e808f5122158ae315da53bbb7484772274cc332e96064300016a3974
 SHA512:
-  metadata.gz: 11555b24a707d857c37de7952654cc2d5cdc5d5c188cd15aa364b02126e3abeff71b65a86ddfae276fc8d2136f7b2e0b3b25afe7b798e9af1674bfb0b6796c03
-  data.tar.gz: '094fb4c7cac6a326c76758c07e3ca9e09ab36d68e8da3d80465493c93a6fa5a664e332ec55ea3d5b05f1d52dd3327bd732c43149db95b04e59bf2fe85f36ec15'
+  metadata.gz: 6224c7b4bccdbe92b610cd47fa04ed01dc9366fc7b12fb35edac218d71cdd24b7644fd7e5ee5b195d389395f6d147bc7ec8716c33e2d9227fde1cc1204b4a1e4
+  data.tar.gz: 2c6789e7c62dfe999b641cae172f28c0eb609860de4e123fd0305934ef251fedfa2e6a299d3705f395bdbd483d3cd1004ec1804a98a5517ad218a789d98702f5
@@ -1,4 +1,12 @@
 # CHANGELOG
+## 1.4.0
+### New
+* Add `encoding` config option (see [All available config options](https://github.com/vifreefly/kimuraframework#all-available-config-options))
+* Validate url before processing a request (Base#request_to)
+
+### Fixes
+* Fix console command bug (see [issue 21](https://github.com/vifreefly/kimuraframework/issues/21))
+
 ## 1.3.2
 ### Fixes
 * In the project template, set Ruby version as >= 2.5 (before was hard-coded to 2.5.1)
data/README.md CHANGED
@@ -10,15 +10,12 @@
 > * The code was massively refactored for a [support](#using-kimurai-inside-existing-ruby-application) to run spiders multiple times from inside a single process. Now it's possible to run Kimurai spiders using background jobs like Sidekiq.
 > * `require 'kimurai'` doesn't require any gems except Active Support. Only when a particular spider [starts](#crawl-method), Capybara will be required with a specific driver.
 > * Although Kimurai [extends](lib/kimurai/capybara_ext) Capybara (all the magic happens inside [extended](lib/kimurai/capybara_ext/session.rb) `Capybara::Session#visit` method), session instances which were created manually will behave normally.
-> * No spaghetti code with `case/when/end` blocks anymore. All drivers [were extended](lib/kimurai/capybara_ext) to support unified methods for cookies, proxies, headers, etc.
-> * `selenium_url_to_set_cookies` @config option don't need anymore if you're use Selenium-like engine with custom cookies setting.
 > * Small changes in design (check the readme again to see what was changed)
 > * Stats database with a web dashboard were removed
-> * Again, massive refactor. Code now looks much better than it was before.
 
 <br>
 
-> Note: this readme is for `1.3.2` gem version. CHANGELOG [here](CHANGELOG.md).
+> Note: this readme is for `1.4.0` gem version. CHANGELOG [here](CHANGELOG.md).
 
 Kimurai is a modern web scraping framework written in Ruby which **works out of box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
 
@@ -1592,6 +1589,12 @@ end
   # Format: same like for `skip_request_errors` option.
   retry_request_errors: [Net::ReadTimeout],
 
+  # Handle page encoding while parsing html response using Nokogiri. There are two modes:
+  # Auto (`:auto`) (try to fetch correct encoding from <meta http-equiv="Content-Type"> or <meta charset> tags)
+  # Set required encoding manually, example: `encoding: "GB2312"` (Set required encoding manually)
+  # Default this option is unset.
+  encoding: nil,
+
   # Restart browser if one of the options is true:
   restart_if: {
     # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
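The two modes of the new `encoding` option can be illustrated with a small standard-library sketch. This is not Kimurai's implementation (the README comment says the option is applied while parsing with Nokogiri); it swaps in `String#encode` so the example stays self-contained. The helper names `detect_meta_charset` and `decode_body` are hypothetical.

```ruby
# Hypothetical helper: look for a charset declared in a <meta> tag.
# It works on a binary copy so undecoded bytes cannot break the regex match.
def detect_meta_charset(html)
  html.b[/<meta[^>]+charset\s*=\s*["']?([\w-]+)/i, 1]
end

# Hypothetical helper: return the body transcoded to UTF-8 according to an
# `encoding` setting (nil = leave untouched, :auto = sniff the <meta> tags,
# a String = use that encoding name as given).
def decode_body(body, encoding: nil)
  charset = encoding == :auto ? detect_meta_charset(body) : encoding
  return body if charset.nil?

  body.dup.force_encoding(charset).encode("UTF-8")
end
```

With these helpers, `decode_body(raw, encoding: :auto)` would rely on the page's own `<meta>` declaration, while `decode_body(raw, encoding: "GB2312")` corresponds to setting the encoding manually as in the README comment.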
@@ -1740,8 +1743,6 @@ end
 ### Crawl
 To run a particular spider in the project, run: `$ bundle exec kimurai crawl example_spider`. Don't forget to add `bundle exec` before command to load required environment.
 
-You can provide an additional option `--continue` to use [persistence storage database](#persistence-database-for-the-storage) feature.
-
 ### List
 To list all project spiders, run: `$ bundle exec kimurai list`
 
@@ -3,6 +3,8 @@ require_relative 'base/storage'
 
 module Kimurai
   class Base
+    class InvalidUrlError < StandardError; end
+
     # don't deep merge config's headers hash option
     DMERGE_EXCLUDE = [:headers]
 
@@ -171,7 +173,7 @@ module Kimurai
     attr_accessor :with_info
 
     def initialize(engine = self.class.engine, config: {})
-      @engine = engine
+      @engine = engine || self.class.engine
       @config = self.class.config.deep_merge_excl(config, DMERGE_EXCLUDE)
       @pipelines = self.class.pipelines.map do |pipeline_name|
         klass = Pipeline.descendants.find { |kl| kl.name == pipeline_name }
@@ -189,6 +191,8 @@ module Kimurai
     end
 
     def request_to(handler, delay = nil, url:, data: {}, response_type: :html)
+      raise InvalidUrlError, "Requested url is invalid: #{url}" unless URI.parse(url).kind_of?(URI::HTTP)
+
       if @config[:skip_duplicate_requests] && !unique_request?(url)
         add_event(:duplicate_requests) if self.with_info
         logger.warn "Spider: request_to: not unique url: #{url}, skipped" and return
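The guard added to `request_to` can be tried in isolation. The sketch below reuses the same `URI.parse(...).kind_of?(URI::HTTP)` check from the hunk above; the standalone `InvalidUrlError` class and the `validate_url!` helper are hypothetical stand-ins for `Kimurai::Base::InvalidUrlError` and the inline check.

```ruby
require "uri"

# Stand-in for Kimurai::Base::InvalidUrlError (defined here only for the demo).
class InvalidUrlError < StandardError; end

# Hypothetical helper wrapping the same check the diff adds to request_to.
# URI::HTTPS is a subclass of URI::HTTP, so https URLs pass as well.
def validate_url!(url)
  return url if URI.parse(url).kind_of?(URI::HTTP)

  raise InvalidUrlError, "Requested url is invalid: #{url}"
end
```

A scheme-less string such as `"example.com"` parses to a `URI::Generic` and is therefore rejected, as is `"ftp://example.com"`. Strings that are not parseable at all (for instance ones containing a space) raise `URI::InvalidURIError` from `URI.parse` itself, before the guard runs.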
@@ -1,38 +1,20 @@
 module Kimurai
-  class BrowserBuilder
-    AVAILABLE_ENGINES = [
-      :mechanize,
-      :mechanize_standalone,
-      :poltergeist_phantomjs,
-      :selenium_firefox,
-      :selenium_chrome
-    ]
-
+  module BrowserBuilder
     def self.build(engine, config = {}, spider:)
-      unless AVAILABLE_ENGINES.include? engine
-        raise "BrowserBuilder: wrong name of engine, available engines: #{AVAILABLE_ENGINES.join(', ')}"
-      end
-
       if config[:browser].present?
         raise "++++++ BrowserBuilder: browser option is depricated. Now all sub-options inside " \
           "`browser` should be placed right into `@config` hash, without `browser` parent key.\n" \
           "See more here: https://github.com/vifreefly/kimuraframework/blob/master/CHANGELOG.md#breaking-changes-110 ++++++"
       end
 
-      case engine
-      when :mechanize
-        require_relative 'browser_builder/mechanize_builder'
-        MechanizeBuilder.new(config, spider: spider).build
-      when :selenium_chrome
-        require_relative 'browser_builder/selenium_chrome_builder'
-        SeleniumChromeBuilder.new(config, spider: spider).build
-      when :poltergeist_phantomjs
-        require_relative 'browser_builder/poltergeist_phantomjs_builder'
-        PoltergeistPhantomJSBuilder.new(config, spider: spider).build
-      when :selenium_firefox
-        require_relative 'browser_builder/selenium_firefox_builder'
-        SeleniumFirefoxBuilder.new(config, spider: spider).build
+      begin
+        require "kimurai/browser_builder/#{engine}_builder"
+      rescue LoadError => e
       end
+
+      builder_class_name = "#{engine}_builder".classify
+      builder = "Kimurai::BrowserBuilder::#{builder_class_name}".constantize
+      builder.new(config, spider: spider).build
     end
   end
 end
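The new `build` derives the builder class from the engine name instead of listing every engine in a `case`. The sketch below approximates that lookup in plain Ruby, without Active Support: `split`/`capitalize` stands in for `classify` and `const_get` for `constantize`. The `Demo` namespace, its empty classes, and both helper methods are hypothetical and exist only for illustration.

```ruby
# Stand-ins for the real builder classes (hypothetical, illustration only).
module Demo
  module BrowserBuilder
    class MechanizeBuilder; end
    class PoltergeistPhantomjsBuilder; end
  end
end

# Plain-Ruby approximation of `"#{engine}_builder".classify`
# for simple snake_case engine names.
def builder_class_name(engine)
  "#{engine}_builder".split("_").map(&:capitalize).join
end

# Approximation of `constantize`: look the class up inside the namespace only.
def resolve_builder(engine)
  Demo::BrowserBuilder.const_get(builder_class_name(engine), false)
end
```

Under this naming rule `:poltergeist_phantomjs` maps to `PoltergeistPhantomjsBuilder`, which is presumably why the class is renamed from `PoltergeistPhantomJSBuilder` in this release. Because the `LoadError` is rescued silently in the diff, an unknown engine name would most likely surface as a `NameError` from the constant lookup rather than as the old "wrong name of engine" message.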
@@ -4,147 +4,151 @@ require_relative '../capybara_configuration'
 require_relative '../capybara_ext/mechanize/driver'
 require_relative '../capybara_ext/session'
 
-module Kimurai
-  class BrowserBuilder
-    class MechanizeBuilder
-      attr_reader :logger, :spider
-
-      def initialize(config, spider:)
-        @config = config
-        @spider = spider
-        @logger = spider.logger
-      end
-
-      def build
-        # Register driver
-        Capybara.register_driver :mechanize do |app|
-          driver = Capybara::Mechanize::Driver.new("app")
-          # keep the history as small as possible (by default it's unlimited)
-          driver.configure { |a| a.history.max_size = 2 }
-          driver
-        end
+module Kimurai::BrowserBuilder
+  class MechanizeBuilder
+    attr_reader :logger, :spider
+
+    def initialize(config, spider:)
+      @config = config
+      @spider = spider
+      @logger = spider.logger
+    end
 
-        # Create browser instance (Capybara session)
-        @browser = Capybara::Session.new(:mechanize)
-        @browser.spider = spider
-        logger.debug "BrowserBuilder (mechanize): created browser instance"
+    def build
+      # Register driver
+      Capybara.register_driver :mechanize do |app|
+        driver = Capybara::Mechanize::Driver.new("app")
+        # keep the history as small as possible (by default it's unlimited)
+        driver.configure { |a| a.history.max_size = 2 }
+        driver
+      end
 
-        if @config[:extensions].present?
-          logger.error "BrowserBuilder (mechanize): `extensions` option not supported, skipped"
-        end
+      # Create browser instance (Capybara session)
+      @browser = Capybara::Session.new(:mechanize)
+      @browser.spider = spider
+      logger.debug "BrowserBuilder (mechanize): created browser instance"
 
-        # Proxy
-        if proxy = @config[:proxy].presence
-          proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
-          ip, port, type = proxy_string.split(":")
-
-          if type == "http"
-            @browser.driver.set_proxy(*proxy_string.split(":"))
-            logger.debug "BrowserBuilder (mechanize): enabled http proxy, ip: #{ip}, port: #{port}"
-          else
-            logger.error "BrowserBuilder (mechanize): can't set #{type} proxy (not supported), skipped"
-          end
-        end
+      if @config[:extensions].present?
+        logger.error "BrowserBuilder (mechanize): `extensions` option not supported, skipped"
+      end
 
-        # SSL
-        if ssl_cert_path = @config[:ssl_cert_path].presence
-          @browser.driver.browser.agent.http.ca_file = ssl_cert_path
-          logger.debug "BrowserBuilder (mechanize): enabled custom ssl_cert"
-        end
+      # Proxy
+      if proxy = @config[:proxy].presence
+        proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
+        ip, port, type = proxy_string.split(":")
 
-        if @config[:ignore_ssl_errors].present?
-          @browser.driver.browser.agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
-          logger.debug "BrowserBuilder (mechanize): enabled ignore_ssl_errors"
+        if type == "http"
+          @browser.driver.set_proxy(*proxy_string.split(":"))
+          logger.debug "BrowserBuilder (mechanize): enabled http proxy, ip: #{ip}, port: #{port}"
+        else
+          logger.error "BrowserBuilder (mechanize): can't set #{type} proxy (not supported), skipped"
         end
+      end
 
-        # Headers
-        if headers = @config[:headers].presence
-          @browser.driver.headers = headers
-          logger.debug "BrowserBuilder (mechanize): enabled custom headers"
-        end
+      # SSL
+      if ssl_cert_path = @config[:ssl_cert_path].presence
+        @browser.driver.browser.agent.http.ca_file = ssl_cert_path
+        logger.debug "BrowserBuilder (mechanize): enabled custom ssl_cert"
+      end
 
-        if user_agent = @config[:user_agent].presence
-          user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
+      if @config[:ignore_ssl_errors].present?
+        @browser.driver.browser.agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
+        logger.debug "BrowserBuilder (mechanize): enabled ignore_ssl_errors"
+      end
 
-          @browser.driver.add_header("User-Agent", user_agent_string)
-          logger.debug "BrowserBuilder (mechanize): enabled custom user_agent"
-        end
+      # Headers
+      if headers = @config[:headers].presence
+        @browser.driver.headers = headers
+        logger.debug "BrowserBuilder (mechanize): enabled custom headers"
+      end
 
-        # Cookies
-        if cookies = @config[:cookies].presence
-          cookies.each do |cookie|
-            @browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
-          end
+      if user_agent = @config[:user_agent].presence
+        user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
 
-          logger.debug "BrowserBuilder (mechanize): enabled custom cookies"
-        end
+        @browser.driver.add_header("User-Agent", user_agent_string)
+        logger.debug "BrowserBuilder (mechanize): enabled custom user_agent"
+      end
 
-        # Browser instance options
-        # skip_request_errors
-        if skip_errors = @config[:skip_request_errors].presence
-          @browser.config.skip_request_errors = skip_errors
-          logger.debug "BrowserBuilder (mechanize): enabled skip_request_errors"
+      # Cookies
+      if cookies = @config[:cookies].presence
+        cookies.each do |cookie|
+          @browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
         end
 
-        # retry_request_errors
-        if retry_errors = @config[:retry_request_errors].presence
-          @browser.config.retry_request_errors = retry_errors
-          logger.debug "BrowserBuilder (mechanize): enabled retry_request_errors"
-        end
+        logger.debug "BrowserBuilder (mechanize): enabled custom cookies"
+      end
 
-        # restart_if
-        if @config[:restart_if].present?
-          logger.warn "BrowserBuilder (mechanize): restart_if options not supported by Mechanize, skipped"
-        end
+      # Browser instance options
+      # skip_request_errors
+      if skip_errors = @config[:skip_request_errors].presence
+        @browser.config.skip_request_errors = skip_errors
+        logger.debug "BrowserBuilder (mechanize): enabled skip_request_errors"
+      end
 
-        # before_request clear_cookies
-        if @config.dig(:before_request, :clear_cookies)
-          @browser.config.before_request[:clear_cookies] = true
-          logger.debug "BrowserBuilder (mechanize): enabled before_request.clear_cookies"
-        end
+      # retry_request_errors
+      if retry_errors = @config[:retry_request_errors].presence
+        @browser.config.retry_request_errors = retry_errors
+        logger.debug "BrowserBuilder (mechanize): enabled retry_request_errors"
+      end
 
-        # before_request clear_and_set_cookies
-        if @config.dig(:before_request, :clear_and_set_cookies)
-          if cookies = @config[:cookies].presence
-            @browser.config.cookies = cookies
-            @browser.config.before_request[:clear_and_set_cookies] = true
-            logger.debug "BrowserBuilder (mechanize): enabled before_request.clear_and_set_cookies"
-          else
-            logger.error "BrowserBuilder (mechanize): cookies should be present to enable before_request.clear_and_set_cookies, skipped"
-          end
-        end
+      # restart_if
+      if @config[:restart_if].present?
+        logger.warn "BrowserBuilder (mechanize): restart_if options not supported by Mechanize, skipped"
+      end
+
+      # before_request clear_cookies
+      if @config.dig(:before_request, :clear_cookies)
+        @browser.config.before_request[:clear_cookies] = true
+        logger.debug "BrowserBuilder (mechanize): enabled before_request.clear_cookies"
+      end
 
-        # before_request change_user_agent
-        if @config.dig(:before_request, :change_user_agent)
-          if @config[:user_agent].present? && @config[:user_agent].class == Proc
-            @browser.config.user_agent = @config[:user_agent]
-            @browser.config.before_request[:change_user_agent] = true
-            logger.debug "BrowserBuilder (mechanize): enabled before_request.change_user_agent"
-          else
-            logger.error "BrowserBuilder (mechanize): user_agent should be present and has lambda format to enable before_request.change_user_agent, skipped"
-          end
+      # before_request clear_and_set_cookies
+      if @config.dig(:before_request, :clear_and_set_cookies)
+        if cookies = @config[:cookies].presence
+          @browser.config.cookies = cookies
+          @browser.config.before_request[:clear_and_set_cookies] = true
+          logger.debug "BrowserBuilder (mechanize): enabled before_request.clear_and_set_cookies"
+        else
+          logger.error "BrowserBuilder (mechanize): cookies should be present to enable before_request.clear_and_set_cookies, skipped"
         end
+      end
 
-        # before_request change_proxy
-        if @config.dig(:before_request, :change_proxy)
-          if @config[:proxy].present? && @config[:proxy].class == Proc
-            @browser.config.proxy = @config[:proxy]
-            @browser.config.before_request[:change_proxy] = true
-            logger.debug "BrowserBuilder (mechanize): enabled before_request.change_proxy"
-          else
-            logger.error "BrowserBuilder (mechanize): proxy should be present and has lambda format to enable before_request.change_proxy, skipped"
-          end
+      # before_request change_user_agent
+      if @config.dig(:before_request, :change_user_agent)
+        if @config[:user_agent].present? && @config[:user_agent].class == Proc
+          @browser.config.user_agent = @config[:user_agent]
+          @browser.config.before_request[:change_user_agent] = true
+          logger.debug "BrowserBuilder (mechanize): enabled before_request.change_user_agent"
+        else
+          logger.error "BrowserBuilder (mechanize): user_agent should be present and has lambda format to enable before_request.change_user_agent, skipped"
         end
+      end
 
-        # before_request delay
-        if delay = @config.dig(:before_request, :delay).presence
-          @browser.config.before_request[:delay] = delay
-          logger.debug "BrowserBuilder (mechanize): enabled before_request.delay"
+      # before_request change_proxy
+      if @config.dig(:before_request, :change_proxy)
+        if @config[:proxy].present? && @config[:proxy].class == Proc
+          @browser.config.proxy = @config[:proxy]
+          @browser.config.before_request[:change_proxy] = true
+          logger.debug "BrowserBuilder (mechanize): enabled before_request.change_proxy"
+        else
+          logger.error "BrowserBuilder (mechanize): proxy should be present and has lambda format to enable before_request.change_proxy, skipped"
         end
+      end
+
+      # before_request delay
+      if delay = @config.dig(:before_request, :delay).presence
+        @browser.config.before_request[:delay] = delay
+        logger.debug "BrowserBuilder (mechanize): enabled before_request.delay"
+      end
 
-        # return Capybara session instance
-        @browser
+      # encoding
+      if encoding = @config[:encoding]
+        @browser.config.encoding = encoding
+        logger.debug "BrowserBuilder (mechanize): enabled encoding: #{encoding}"
       end
+
+      # return Capybara session instance
+      @browser
     end
   end
 end
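The proxy handling in this builder splits a colon-separated string and only accepts the `http` type. A minimal sketch of that parsing step follows, assuming the `ip:port:type` layout implied by `ip, port, type = proxy_string.split(":")`; `parse_proxy` and `mechanize_supports?` are hypothetical helpers, not part of Kimurai.

```ruby
# Hypothetical helper: resolve a proxy setting (String or Proc) and split it
# into the three leading fields the builder destructures.
def parse_proxy(proxy)
  proxy_string = (proxy.is_a?(Proc) ? proxy.call : proxy).strip
  ip, port, type = proxy_string.split(":")
  { ip: ip, port: port, type: type }
end

# Mirrors the `type == "http"` branch of the Mechanize builder.
def mechanize_supports?(proxy)
  parse_proxy(proxy)[:type] == "http"
end
```

Accepting a `Proc` as well as a `String` matches the `proxy.class == Proc ? proxy.call : proxy` expression in the hunk above, which lets a spider supply a different proxy each time the value is resolved.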
@@ -4,168 +4,172 @@ require_relative '../capybara_configuration'
 require_relative '../capybara_ext/poltergeist/driver'
 require_relative '../capybara_ext/session'
 
-module Kimurai
-  class BrowserBuilder
-    class PoltergeistPhantomJSBuilder
-      attr_reader :logger, :spider
-
-      def initialize(config, spider:)
-        @config = config
-        @spider = spider
-        @logger = spider.logger
-      end
+module Kimurai::BrowserBuilder
+  class PoltergeistPhantomjsBuilder
+    attr_reader :logger, :spider
+
+    def initialize(config, spider:)
+      @config = config
+      @spider = spider
+      @logger = spider.logger
+    end
 
-      def build
-        # Register driver
-        Capybara.register_driver :poltergeist_phantomjs do |app|
-          # Create driver options
-          driver_options = {
-            js_errors: false, debug: false, inspector: false, phantomjs_options: []
-          }
-
-          if extensions = @config[:extensions].presence
-            driver_options[:extensions] = extensions
-            logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled extensions"
-          end
-
-          # Window size
-          if size = @config[:window_size].presence
-            driver_options[:window_size] = size
-            logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled window_size"
-          end
-
-          # SSL
-          if ssl_cert_path = @config[:ssl_cert_path].presence
-            driver_options[:phantomjs_options] << "--ssl-certificates-path=#{ssl_cert_path}"
-            logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom ssl_cert"
-          end
-
-          if @config[:ignore_ssl_errors].present?
-            driver_options[:phantomjs_options].push("--ignore-ssl-errors=yes", "--ssl-protocol=any")
-            logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled ignore_ssl_errors"
-          end
-
-          # Disable images
-          if @config[:disable_images].present?
-            driver_options[:phantomjs_options] << "--load-images=no"
-            logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled disable_images"
-          end
-
-          Capybara::Poltergeist::Driver.new(app, driver_options)
+    def build
+      # Register driver
+      Capybara.register_driver :poltergeist_phantomjs do |app|
+        # Create driver options
+        driver_options = {
+          js_errors: false, debug: false, inspector: false, phantomjs_options: []
+        }
+
+        if extensions = @config[:extensions].presence
+          driver_options[:extensions] = extensions
+          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled extensions"
         end
 
-        # Create browser instance (Capybara session)
-        @browser = Capybara::Session.new(:poltergeist_phantomjs)
-        @browser.spider = spider
-        logger.debug "BrowserBuilder (poltergeist_phantomjs): created browser instance"
-
-        # Proxy
-        if proxy = @config[:proxy].presence
-          proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
-          ip, port, type = proxy_string.split(":")
-
-          if %w(http socks5).include?(type)
-            @browser.driver.set_proxy(*proxy_string.split(":"))
-            logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled #{type} proxy, ip: #{ip}, port: #{port}"
-          else
-            logger.error "BrowserBuilder (poltergeist_phantomjs): wrong type of proxy: #{type}, skipped"
-          end
+        # Window size
+        if size = @config[:window_size].presence
+          driver_options[:window_size] = size
+          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled window_size"
         end
 
-        # Headers
-        if headers = @config[:headers].presence
-          @browser.driver.headers = headers
-          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom headers"
+        # SSL
+        if ssl_cert_path = @config[:ssl_cert_path].presence
+          driver_options[:phantomjs_options] << "--ssl-certificates-path=#{ssl_cert_path}"
+          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom ssl_cert"
         end
 
-        if user_agent = @config[:user_agent].presence
-          user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
+        if @config[:ignore_ssl_errors].present?
+          driver_options[:phantomjs_options].push("--ignore-ssl-errors=yes", "--ssl-protocol=any")
+          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled ignore_ssl_errors"
+        end
 
-          @browser.driver.add_header("User-Agent", user_agent_string)
-          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom user_agent"
+        # Disable images
+        if @config[:disable_images].present?
+          driver_options[:phantomjs_options] << "--load-images=no"
+          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled disable_images"
         end
 
-        # Cookies
-        if cookies = @config[:cookies].presence
-          cookies.each do |cookie|
-            @browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
-          end
+        Capybara::Poltergeist::Driver.new(app, driver_options)
+      end
 
-          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom cookies"
+      # Create browser instance (Capybara session)
+      @browser = Capybara::Session.new(:poltergeist_phantomjs)
+      @browser.spider = spider
+      logger.debug "BrowserBuilder (poltergeist_phantomjs): created browser instance"
+
+      # Proxy
+      if proxy = @config[:proxy].presence
+        proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
+        ip, port, type = proxy_string.split(":")
+
+        if %w(http socks5).include?(type)
+          @browser.driver.set_proxy(*proxy_string.split(":"))
+          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled #{type} proxy, ip: #{ip}, port: #{port}"
+        else
+          logger.error "BrowserBuilder (poltergeist_phantomjs): wrong type of proxy: #{type}, skipped"
         end
+      end
 
-        # Browser instance options
-        # skip_request_errors
-        if skip_errors = @config[:skip_request_errors].presence
-          @browser.config.skip_request_errors = skip_errors
-          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled skip_request_errors"
-        end
+      # Headers
+      if headers = @config[:headers].presence
+        @browser.driver.headers = headers
+        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom headers"
+      end
 
-        # retry_request_errors
-        if retry_errors = @config[:retry_request_errors].presence
-          @browser.config.retry_request_errors = retry_errors
-          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled retry_request_errors"
-        end
+      if user_agent = @config[:user_agent].presence
+        user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
 
-        # restart_if
-        if requests_limit = @config.dig(:restart_if, :requests_limit).presence
-          @browser.config.restart_if[:requests_limit] = requests_limit
-          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled restart_if.requests_limit >= #{requests_limit}"
-        end
+        @browser.driver.add_header("User-Agent", user_agent_string)
+        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom user_agent"
+      end
 
-        if memory_limit = @config.dig(:restart_if, :memory_limit).presence
-          @browser.config.restart_if[:memory_limit] = memory_limit
-          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled restart_if.memory_limit >= #{memory_limit}"
+      # Cookies
+      if cookies = @config[:cookies].presence
+        cookies.each do |cookie|
+          @browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
         end
 
-        # before_request clear_cookies
-        if @config.dig(:before_request, :clear_cookies)
-          @browser.config.before_request[:clear_cookies] = true
-          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.clear_cookies"
-        end
+        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom cookies"
+      end
 
-        # before_request clear_and_set_cookies
-        if @config.dig(:before_request, :clear_and_set_cookies)
-          if cookies = @config[:cookies].presence
-            @browser.config.cookies = cookies
-            @browser.config.before_request[:clear_and_set_cookies] = true
-            logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.clear_and_set_cookies"
-          else
-            logger.error "BrowserBuilder (poltergeist_phantomjs): cookies should be present to enable before_request.clear_and_set_cookies, skipped"
-          end
-        end
+      # Browser instance options
+      # skip_request_errors
+      if skip_errors = @config[:skip_request_errors].presence
+        @browser.config.skip_request_errors = skip_errors
+        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled skip_request_errors"
+      end
 
-        # before_request change_user_agent
-        if @config.dig(:before_request, :change_user_agent)
-          if @config[:user_agent].present? && @config[:user_agent].class == Proc
-            @browser.config.user_agent = @config[:user_agent]
-            @browser.config.before_request[:change_user_agent] = true
-            logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.change_user_agent"
-          else
-            logger.error "BrowserBuilder (poltergeist_phantomjs): user_agent should be present and has lambda format to enable before_request.change_user_agent, skipped"
-          end
+      # retry_request_errors
+      if retry_errors = @config[:retry_request_errors].presence
+        @browser.config.retry_request_errors = retry_errors
+        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled retry_request_errors"
+      end
+
+      # restart_if
+      if requests_limit = @config.dig(:restart_if, :requests_limit).presence
+        @browser.config.restart_if[:requests_limit] = requests_limit
+        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled restart_if.requests_limit >= #{requests_limit}"
+      end
+
+      if memory_limit = @config.dig(:restart_if, :memory_limit).presence
+        @browser.config.restart_if[:memory_limit] = memory_limit
+        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled restart_if.memory_limit >= #{memory_limit}"
+      end
+
+      # before_request clear_cookies
+      if @config.dig(:before_request, :clear_cookies)
+        @browser.config.before_request[:clear_cookies] = true
+        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.clear_cookies"
+      end
+
+      # before_request clear_and_set_cookies
+      if @config.dig(:before_request, :clear_and_set_cookies)
+        if cookies = @config[:cookies].presence
+          @browser.config.cookies = cookies
+          @browser.config.before_request[:clear_and_set_cookies] = true
+          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.clear_and_set_cookies"
+        else
+          logger.error "BrowserBuilder (poltergeist_phantomjs): cookies should be present to enable before_request.clear_and_set_cookies, skipped"
         end
+      end
 
-        # before_request change_proxy
-        if @config.dig(:before_request, :change_proxy)
-          if @config[:proxy].present? && @config[:proxy].class == Proc
-            @browser.config.proxy = @config[:proxy]
-            @browser.config.before_request[:change_proxy] = true
-            logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.change_proxy"
-          else
-            logger.error "BrowserBuilder (poltergeist_phantomjs): proxy should be present and has lambda format to enable before_request.change_proxy, skipped"
-          end
+      # before_request change_user_agent
+      if @config.dig(:before_request, :change_user_agent)
+        if @config[:user_agent].present? && @config[:user_agent].class == Proc
+          @browser.config.user_agent = @config[:user_agent]
+          @browser.config.before_request[:change_user_agent] = true
+          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.change_user_agent"
+        else
+          logger.error "BrowserBuilder (poltergeist_phantomjs): user_agent should be present and has lambda format to enable before_request.change_user_agent, skipped"
         end
+      end
 
-        # before_request delay
-        if delay = @config.dig(:before_request, :delay).presence
-          @browser.config.before_request[:delay] = delay
-          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.delay"
+      # before_request change_proxy
+      if @config.dig(:before_request, :change_proxy)
+        if @config[:proxy].present? && @config[:proxy].class == Proc
+          @browser.config.proxy = @config[:proxy]
+          @browser.config.before_request[:change_proxy] = true
+          logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.change_proxy"
+        else
+          logger.error "BrowserBuilder (poltergeist_phantomjs): proxy should be present and has lambda format to enable before_request.change_proxy, skipped"
         end
+      end
 
-        # return Capybara session instance
-        @browser
+      # before_request delay
+      if delay = @config.dig(:before_request, :delay).presence
+        @browser.config.before_request[:delay] = delay
+        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.delay"
       end
+
+      # encoding
+      if encoding = @config[:encoding]
+        @browser.config.encoding = encoding
+        logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled encoding: #{encoding}"
+      end
+
+      # return Capybara session instance
+      @browser
     end
   end
 end