kimurai 1.3.2 → 1.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 71ac47a69324914eaa098231d72b4af453e9e222d4fc7b19f0b2ef88449c7c18
4
- data.tar.gz: e5ac63a13897f92118b3bbe33bc55f8f46e4970333ffeedf88c3f64e2df791e8
3
+ metadata.gz: 9efd6a3590e9df217a803d9e7c806b052d1c3c1cd9887294d7c736ee19fc8e7b
4
+ data.tar.gz: 14d14793e808f5122158ae315da53bbb7484772274cc332e96064300016a3974
5
5
  SHA512:
6
- metadata.gz: 11555b24a707d857c37de7952654cc2d5cdc5d5c188cd15aa364b02126e3abeff71b65a86ddfae276fc8d2136f7b2e0b3b25afe7b798e9af1674bfb0b6796c03
7
- data.tar.gz: '094fb4c7cac6a326c76758c07e3ca9e09ab36d68e8da3d80465493c93a6fa5a664e332ec55ea3d5b05f1d52dd3327bd732c43149db95b04e59bf2fe85f36ec15'
6
+ metadata.gz: 6224c7b4bccdbe92b610cd47fa04ed01dc9366fc7b12fb35edac218d71cdd24b7644fd7e5ee5b195d389395f6d147bc7ec8716c33e2d9227fde1cc1204b4a1e4
7
+ data.tar.gz: 2c6789e7c62dfe999b641cae172f28c0eb609860de4e123fd0305934ef251fedfa2e6a299d3705f395bdbd483d3cd1004ec1804a98a5517ad218a789d98702f5
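Note on the checksums above: `checksums.yaml` records SHA256 and SHA512 digests for the two archives packed inside the `.gem` file, `metadata.gz` and `data.tar.gz`. A minimal sketch of how such a digest could be recomputed with Ruby's standard library follows. The helper name `sha256_matches?` is hypothetical, and the example uses a well-known test vector rather than real gem contents.

```ruby
require "digest"

# Hypothetical helper: compare raw bytes against an expected hex digest.
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Standard SHA-256 test vector for the string "abc" (not taken from the gem).
abc_digest = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

puts sha256_matches?("abc", abc_digest)  # prints true
puts sha256_matches?("abd", abc_digest)  # prints false
```

For a real archive, the bytes would come from a binary read such as `File.binread("data.tar.gz")` after unpacking the `.gem` file, which is a plain tar archive.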
data/CHANGELOG.md CHANGED
@@ -1,4 +1,12 @@
1
1
  # CHANGELOG
2
+ ## 1.4.0
3
+ ### New
4
+ * Add `encoding` config option (see [All available config options](https://github.com/vifreefly/kimuraframework#all-available-config-options))
5
+ * Validate url before processing a request (Base#request_to)
6
+
7
+ ### Fixes
8
+ * Fix console command bug (see [issue 21](https://github.com/vifreefly/kimuraframework/issues/21))
9
+
2
10
  ## 1.3.2
3
11
  ### Fixes
4
12
  * In the project template, set Ruby version as >= 2.5 (before was hard-coded to 2.5.1)
data/README.md CHANGED
@@ -10,15 +10,12 @@
10
10
  > * The code was massively refactored for a [support](#using-kimurai-inside-existing-ruby-application) to run spiders multiple times from inside a single process. Now it's possible to run Kimurai spiders using background jobs like Sidekiq.
11
11
  > * `require 'kimurai'` doesn't require any gems except Active Support. Only when a particular spider [starts](#crawl-method), Capybara will be required with a specific driver.
12
12
  > * Although Kimurai [extends](lib/kimurai/capybara_ext) Capybara (all the magic happens inside [extended](lib/kimurai/capybara_ext/session.rb) `Capybara::Session#visit` method), session instances which were created manually will behave normally.
13
- > * No spaghetti code with `case/when/end` blocks anymore. All drivers [were extended](lib/kimurai/capybara_ext) to support unified methods for cookies, proxies, headers, etc.
14
- > * `selenium_url_to_set_cookies` @config option don't need anymore if you're use Selenium-like engine with custom cookies setting.
15
13
  > * Small changes in design (check the readme again to see what was changed)
16
14
  > * Stats database with a web dashboard were removed
17
- > * Again, massive refactor. Code now looks much better than it was before.
18
15
 
19
16
  <br>
20
17
 
21
- > Note: this readme is for `1.3.2` gem version. CHANGELOG [here](CHANGELOG.md).
18
+ > Note: this readme is for `1.4.0` gem version. CHANGELOG [here](CHANGELOG.md).
22
19
 
23
20
  Kimurai is a modern web scraping framework written in Ruby which **works out of box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
24
21
 
@@ -1592,6 +1589,12 @@ end
1592
1589
  # Format: same like for `skip_request_errors` option.
1593
1590
  retry_request_errors: [Net::ReadTimeout],
1594
1591
 
1592
+ # Handle page encoding while parsing html response using Nokogiri. There are two modes:
1593
+ # Auto (`:auto`) (try to fetch correct encoding from <meta http-equiv="Content-Type"> or <meta charset> tags)
1594
+ # Set required encoding manually, example: `encoding: "GB2312"` (Set required encoding manually)
1595
+ # Default this option is unset.
1596
+ encoding: nil,
1597
+
1595
1598
  # Restart browser if one of the options is true:
1596
1599
  restart_if: {
1597
1600
  # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
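Note on the `encoding` option added in this hunk: the `:auto` mode is described as reading the charset from `<meta>` tags. The sketch below only illustrates that idea with a regular expression from the standard library. It is not Kimurai's or Nokogiri's implementation, and `detect_meta_charset` is a hypothetical name.

```ruby
# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i

# Hypothetical helper: return the charset declared in a <meta> tag, or nil.
def detect_meta_charset(html)
  match = html.match(META_CHARSET)
  match && match[1]
end

html5  = '<html><head><meta charset="utf-8"></head></html>'
legacy = '<meta http-equiv="Content-Type" content="text/html; charset=GB2312">'

puts detect_meta_charset(html5)          # prints utf-8
puts detect_meta_charset(legacy)         # prints GB2312
p detect_meta_charset("<p>no meta</p>")  # prints nil
```

When a page declares no charset or declares the wrong one, setting the value manually (for example `encoding: "GB2312"`, as in the README text above) is the documented alternative.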
@@ -1740,8 +1743,6 @@ end
1740
1743
  ### Crawl
1741
1744
  To run a particular spider in the project, run: `$ bundle exec kimurai crawl example_spider`. Don't forget to add `bundle exec` before command to load required environment.
1742
1745
 
1743
- You can provide an additional option `--continue` to use [persistence storage database](#persistence-database-for-the-storage) feature.
1744
-
1745
1746
  ### List
1746
1747
  To list all project spiders, run: `$ bundle exec kimurai list`
1747
1748
 
data/lib/kimurai/base.rb CHANGED
@@ -3,6 +3,8 @@ require_relative 'base/storage'
3
3
 
4
4
  module Kimurai
5
5
  class Base
6
+ class InvalidUrlError < StandardError; end
7
+
6
8
  # don't deep merge config's headers hash option
7
9
  DMERGE_EXCLUDE = [:headers]
8
10
 
@@ -171,7 +173,7 @@ module Kimurai
171
173
  attr_accessor :with_info
172
174
 
173
175
  def initialize(engine = self.class.engine, config: {})
174
- @engine = engine
176
+ @engine = engine || self.class.engine
175
177
  @config = self.class.config.deep_merge_excl(config, DMERGE_EXCLUDE)
176
178
  @pipelines = self.class.pipelines.map do |pipeline_name|
177
179
  klass = Pipeline.descendants.find { |kl| kl.name == pipeline_name }
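Note on the `engine || self.class.engine` change in this hunk: a Ruby default argument is used only when the argument is omitted, not when `nil` is passed explicitly, so the extra `||` covers callers that forward a `nil` engine. A small standalone sketch follows. The `Spider` class and its `:mechanize` default are illustrative stand-ins, not Kimurai's classes.

```ruby
class Spider
  def self.engine
    :mechanize
  end

  attr_reader :old_style, :new_style

  def initialize(engine = self.class.engine)
    @old_style = engine                       # 1.3.2 assignment: nil stays nil
    @new_style = engine || self.class.engine  # 1.4.0 assignment: nil falls back
  end
end

omitted  = Spider.new
explicit = Spider.new(nil)

p omitted.old_style   # prints :mechanize
p explicit.old_style  # prints nil
p explicit.new_style  # prints :mechanize
```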
@@ -189,6 +191,8 @@ module Kimurai
189
191
  end
190
192
 
191
193
  def request_to(handler, delay = nil, url:, data: {}, response_type: :html)
194
+ raise InvalidUrlError, "Requested url is invalid: #{url}" unless URI.parse(url).kind_of?(URI::HTTP)
195
+
192
196
  if @config[:skip_duplicate_requests] && !unique_request?(url)
193
197
  add_event(:duplicate_requests) if self.with_info
194
198
  logger.warn "Spider: request_to: not unique url: #{url}, skipped" and return
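Note on the guard added to `request_to`: the check accepts anything that `URI.parse` returns as a `URI::HTTP` instance. The sketch below wraps the same predicate in a hypothetical helper, `http_url?`, to show which inputs pass. It uses only the standard library and the example URLs are made up.

```ruby
require "uri"

# Same predicate as the guard in the hunk above, wrapped in a hypothetical helper.
def http_url?(url)
  URI.parse(url).kind_of?(URI::HTTP)
end

p http_url?("http://example.com/page")   # prints true
p http_url?("https://example.com/page")  # prints true (URI::HTTPS subclasses URI::HTTP)
p http_url?("ftp://example.com/file")    # prints false
p http_url?("/relative/path")            # prints false (parsed as URI::Generic)

# A string that cannot be parsed at all raises before the kind_of? check runs.
begin
  http_url?("http://exa mple.com")
rescue URI::InvalidURIError => e
  puts "URI::InvalidURIError: #{e.message}"
end
```

One consequence worth noting: with the guard as written, a relative or non-HTTP URL raises `InvalidUrlError`, while an unparseable string would surface as `URI::InvalidURIError` from `URI.parse` itself.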
data/lib/kimurai/browser_builder.rb CHANGED
@@ -1,38 +1,20 @@
1
1
  module Kimurai
2
- class BrowserBuilder
3
- AVAILABLE_ENGINES = [
4
- :mechanize,
5
- :mechanize_standalone,
6
- :poltergeist_phantomjs,
7
- :selenium_firefox,
8
- :selenium_chrome
9
- ]
10
-
2
+ module BrowserBuilder
11
3
  def self.build(engine, config = {}, spider:)
12
- unless AVAILABLE_ENGINES.include? engine
13
- raise "BrowserBuilder: wrong name of engine, available engines: #{AVAILABLE_ENGINES.join(', ')}"
14
- end
15
-
16
4
  if config[:browser].present?
17
5
  raise "++++++ BrowserBuilder: browser option is depricated. Now all sub-options inside " \
18
6
  "`browser` should be placed right into `@config` hash, without `browser` parent key.\n" \
19
7
  "See more here: https://github.com/vifreefly/kimuraframework/blob/master/CHANGELOG.md#breaking-changes-110 ++++++"
20
8
  end
21
9
 
22
- case engine
23
- when :mechanize
24
- require_relative 'browser_builder/mechanize_builder'
25
- MechanizeBuilder.new(config, spider: spider).build
26
- when :selenium_chrome
27
- require_relative 'browser_builder/selenium_chrome_builder'
28
- SeleniumChromeBuilder.new(config, spider: spider).build
29
- when :poltergeist_phantomjs
30
- require_relative 'browser_builder/poltergeist_phantomjs_builder'
31
- PoltergeistPhantomJSBuilder.new(config, spider: spider).build
32
- when :selenium_firefox
33
- require_relative 'browser_builder/selenium_firefox_builder'
34
- SeleniumFirefoxBuilder.new(config, spider: spider).build
10
+ begin
11
+ require "kimurai/browser_builder/#{engine}_builder"
12
+ rescue LoadError => e
35
13
  end
14
+
15
+ builder_class_name = "#{engine}_builder".classify
16
+ builder = "Kimurai::BrowserBuilder::#{builder_class_name}".constantize
17
+ builder.new(config, spider: spider).build
36
18
  end
37
19
  end
38
20
  end
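Note on the rewritten `build`: the `case` statement is replaced by a naming convention in which the engine symbol is turned into a file name and a class name. The sketch below approximates that lookup using only the standard library. `camelize` and `builder_for` are simplified stand-ins for Active Support's `classify` and `constantize`, and the `Builders` module with its empty class is hypothetical.

```ruby
module Builders
  class PoltergeistPhantomjsBuilder; end
end

# Simplified stand-in for String#classify:
# "selenium_chrome_builder" becomes "SeleniumChromeBuilder".
def camelize(name)
  name.to_s.split("_").map(&:capitalize).join
end

# Simplified stand-in for String#constantize, scoped to the Builders module.
def builder_for(engine)
  Builders.const_get(camelize("#{engine}_builder"))
end

puts camelize("poltergeist_phantomjs_builder")  # prints PoltergeistPhantomjsBuilder
p builder_for(:poltergeist_phantomjs)           # prints Builders::PoltergeistPhantomjsBuilder

# An engine without a matching class fails at constant lookup.
begin
  builder_for(:unknown_engine)
rescue NameError => e
  puts "NameError: #{e.message}"
end
```

This convention would explain why the PhantomJS builder class is spelled `PoltergeistPhantomjsBuilder` in 1.4.0: camel-casing `poltergeist_phantomjs` yields `Phantomjs`, not `PhantomJS`. It also suggests that, with the `AVAILABLE_ENGINES` check removed and `LoadError` rescued silently, an unknown engine name presumably fails with a `NameError` from the constant lookup instead of the earlier descriptive message.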
data/lib/kimurai/browser_builder/mechanize_builder.rb CHANGED
@@ -4,147 +4,151 @@ require_relative '../capybara_configuration'
4
4
  require_relative '../capybara_ext/mechanize/driver'
5
5
  require_relative '../capybara_ext/session'
6
6
 
7
- module Kimurai
8
- class BrowserBuilder
9
- class MechanizeBuilder
10
- attr_reader :logger, :spider
11
-
12
- def initialize(config, spider:)
13
- @config = config
14
- @spider = spider
15
- @logger = spider.logger
16
- end
17
-
18
- def build
19
- # Register driver
20
- Capybara.register_driver :mechanize do |app|
21
- driver = Capybara::Mechanize::Driver.new("app")
22
- # keep the history as small as possible (by default it's unlimited)
23
- driver.configure { |a| a.history.max_size = 2 }
24
- driver
25
- end
7
+ module Kimurai::BrowserBuilder
8
+ class MechanizeBuilder
9
+ attr_reader :logger, :spider
10
+
11
+ def initialize(config, spider:)
12
+ @config = config
13
+ @spider = spider
14
+ @logger = spider.logger
15
+ end
26
16
 
27
- # Create browser instance (Capybara session)
28
- @browser = Capybara::Session.new(:mechanize)
29
- @browser.spider = spider
30
- logger.debug "BrowserBuilder (mechanize): created browser instance"
17
+ def build
18
+ # Register driver
19
+ Capybara.register_driver :mechanize do |app|
20
+ driver = Capybara::Mechanize::Driver.new("app")
21
+ # keep the history as small as possible (by default it's unlimited)
22
+ driver.configure { |a| a.history.max_size = 2 }
23
+ driver
24
+ end
31
25
 
32
- if @config[:extensions].present?
33
- logger.error "BrowserBuilder (mechanize): `extensions` option not supported, skipped"
34
- end
26
+ # Create browser instance (Capybara session)
27
+ @browser = Capybara::Session.new(:mechanize)
28
+ @browser.spider = spider
29
+ logger.debug "BrowserBuilder (mechanize): created browser instance"
35
30
 
36
- # Proxy
37
- if proxy = @config[:proxy].presence
38
- proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
39
- ip, port, type = proxy_string.split(":")
40
-
41
- if type == "http"
42
- @browser.driver.set_proxy(*proxy_string.split(":"))
43
- logger.debug "BrowserBuilder (mechanize): enabled http proxy, ip: #{ip}, port: #{port}"
44
- else
45
- logger.error "BrowserBuilder (mechanize): can't set #{type} proxy (not supported), skipped"
46
- end
47
- end
31
+ if @config[:extensions].present?
32
+ logger.error "BrowserBuilder (mechanize): `extensions` option not supported, skipped"
33
+ end
48
34
 
49
- # SSL
50
- if ssl_cert_path = @config[:ssl_cert_path].presence
51
- @browser.driver.browser.agent.http.ca_file = ssl_cert_path
52
- logger.debug "BrowserBuilder (mechanize): enabled custom ssl_cert"
53
- end
35
+ # Proxy
36
+ if proxy = @config[:proxy].presence
37
+ proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
38
+ ip, port, type = proxy_string.split(":")
54
39
 
55
- if @config[:ignore_ssl_errors].present?
56
- @browser.driver.browser.agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
57
- logger.debug "BrowserBuilder (mechanize): enabled ignore_ssl_errors"
40
+ if type == "http"
41
+ @browser.driver.set_proxy(*proxy_string.split(":"))
42
+ logger.debug "BrowserBuilder (mechanize): enabled http proxy, ip: #{ip}, port: #{port}"
43
+ else
44
+ logger.error "BrowserBuilder (mechanize): can't set #{type} proxy (not supported), skipped"
58
45
  end
46
+ end
59
47
 
60
- # Headers
61
- if headers = @config[:headers].presence
62
- @browser.driver.headers = headers
63
- logger.debug "BrowserBuilder (mechanize): enabled custom headers"
64
- end
48
+ # SSL
49
+ if ssl_cert_path = @config[:ssl_cert_path].presence
50
+ @browser.driver.browser.agent.http.ca_file = ssl_cert_path
51
+ logger.debug "BrowserBuilder (mechanize): enabled custom ssl_cert"
52
+ end
65
53
 
66
- if user_agent = @config[:user_agent].presence
67
- user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
54
+ if @config[:ignore_ssl_errors].present?
55
+ @browser.driver.browser.agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
56
+ logger.debug "BrowserBuilder (mechanize): enabled ignore_ssl_errors"
57
+ end
68
58
 
69
- @browser.driver.add_header("User-Agent", user_agent_string)
70
- logger.debug "BrowserBuilder (mechanize): enabled custom user_agent"
71
- end
59
+ # Headers
60
+ if headers = @config[:headers].presence
61
+ @browser.driver.headers = headers
62
+ logger.debug "BrowserBuilder (mechanize): enabled custom headers"
63
+ end
72
64
 
73
- # Cookies
74
- if cookies = @config[:cookies].presence
75
- cookies.each do |cookie|
76
- @browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
77
- end
65
+ if user_agent = @config[:user_agent].presence
66
+ user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
78
67
 
79
- logger.debug "BrowserBuilder (mechanize): enabled custom cookies"
80
- end
68
+ @browser.driver.add_header("User-Agent", user_agent_string)
69
+ logger.debug "BrowserBuilder (mechanize): enabled custom user_agent"
70
+ end
81
71
 
82
- # Browser instance options
83
- # skip_request_errors
84
- if skip_errors = @config[:skip_request_errors].presence
85
- @browser.config.skip_request_errors = skip_errors
86
- logger.debug "BrowserBuilder (mechanize): enabled skip_request_errors"
72
+ # Cookies
73
+ if cookies = @config[:cookies].presence
74
+ cookies.each do |cookie|
75
+ @browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
87
76
  end
88
77
 
89
- # retry_request_errors
90
- if retry_errors = @config[:retry_request_errors].presence
91
- @browser.config.retry_request_errors = retry_errors
92
- logger.debug "BrowserBuilder (mechanize): enabled retry_request_errors"
93
- end
78
+ logger.debug "BrowserBuilder (mechanize): enabled custom cookies"
79
+ end
94
80
 
95
- # restart_if
96
- if @config[:restart_if].present?
97
- logger.warn "BrowserBuilder (mechanize): restart_if options not supported by Mechanize, skipped"
98
- end
81
+ # Browser instance options
82
+ # skip_request_errors
83
+ if skip_errors = @config[:skip_request_errors].presence
84
+ @browser.config.skip_request_errors = skip_errors
85
+ logger.debug "BrowserBuilder (mechanize): enabled skip_request_errors"
86
+ end
99
87
 
100
- # before_request clear_cookies
101
- if @config.dig(:before_request, :clear_cookies)
102
- @browser.config.before_request[:clear_cookies] = true
103
- logger.debug "BrowserBuilder (mechanize): enabled before_request.clear_cookies"
104
- end
88
+ # retry_request_errors
89
+ if retry_errors = @config[:retry_request_errors].presence
90
+ @browser.config.retry_request_errors = retry_errors
91
+ logger.debug "BrowserBuilder (mechanize): enabled retry_request_errors"
92
+ end
105
93
 
106
- # before_request clear_and_set_cookies
107
- if @config.dig(:before_request, :clear_and_set_cookies)
108
- if cookies = @config[:cookies].presence
109
- @browser.config.cookies = cookies
110
- @browser.config.before_request[:clear_and_set_cookies] = true
111
- logger.debug "BrowserBuilder (mechanize): enabled before_request.clear_and_set_cookies"
112
- else
113
- logger.error "BrowserBuilder (mechanize): cookies should be present to enable before_request.clear_and_set_cookies, skipped"
114
- end
115
- end
94
+ # restart_if
95
+ if @config[:restart_if].present?
96
+ logger.warn "BrowserBuilder (mechanize): restart_if options not supported by Mechanize, skipped"
97
+ end
98
+
99
+ # before_request clear_cookies
100
+ if @config.dig(:before_request, :clear_cookies)
101
+ @browser.config.before_request[:clear_cookies] = true
102
+ logger.debug "BrowserBuilder (mechanize): enabled before_request.clear_cookies"
103
+ end
116
104
 
117
- # before_request change_user_agent
118
- if @config.dig(:before_request, :change_user_agent)
119
- if @config[:user_agent].present? && @config[:user_agent].class == Proc
120
- @browser.config.user_agent = @config[:user_agent]
121
- @browser.config.before_request[:change_user_agent] = true
122
- logger.debug "BrowserBuilder (mechanize): enabled before_request.change_user_agent"
123
- else
124
- logger.error "BrowserBuilder (mechanize): user_agent should be present and has lambda format to enable before_request.change_user_agent, skipped"
125
- end
105
+ # before_request clear_and_set_cookies
106
+ if @config.dig(:before_request, :clear_and_set_cookies)
107
+ if cookies = @config[:cookies].presence
108
+ @browser.config.cookies = cookies
109
+ @browser.config.before_request[:clear_and_set_cookies] = true
110
+ logger.debug "BrowserBuilder (mechanize): enabled before_request.clear_and_set_cookies"
111
+ else
112
+ logger.error "BrowserBuilder (mechanize): cookies should be present to enable before_request.clear_and_set_cookies, skipped"
126
113
  end
114
+ end
127
115
 
128
- # before_request change_proxy
129
- if @config.dig(:before_request, :change_proxy)
130
- if @config[:proxy].present? && @config[:proxy].class == Proc
131
- @browser.config.proxy = @config[:proxy]
132
- @browser.config.before_request[:change_proxy] = true
133
- logger.debug "BrowserBuilder (mechanize): enabled before_request.change_proxy"
134
- else
135
- logger.error "BrowserBuilder (mechanize): proxy should be present and has lambda format to enable before_request.change_proxy, skipped"
136
- end
116
+ # before_request change_user_agent
117
+ if @config.dig(:before_request, :change_user_agent)
118
+ if @config[:user_agent].present? && @config[:user_agent].class == Proc
119
+ @browser.config.user_agent = @config[:user_agent]
120
+ @browser.config.before_request[:change_user_agent] = true
121
+ logger.debug "BrowserBuilder (mechanize): enabled before_request.change_user_agent"
122
+ else
123
+ logger.error "BrowserBuilder (mechanize): user_agent should be present and has lambda format to enable before_request.change_user_agent, skipped"
137
124
  end
125
+ end
138
126
 
139
- # before_request delay
140
- if delay = @config.dig(:before_request, :delay).presence
141
- @browser.config.before_request[:delay] = delay
142
- logger.debug "BrowserBuilder (mechanize): enabled before_request.delay"
127
+ # before_request change_proxy
128
+ if @config.dig(:before_request, :change_proxy)
129
+ if @config[:proxy].present? && @config[:proxy].class == Proc
130
+ @browser.config.proxy = @config[:proxy]
131
+ @browser.config.before_request[:change_proxy] = true
132
+ logger.debug "BrowserBuilder (mechanize): enabled before_request.change_proxy"
133
+ else
134
+ logger.error "BrowserBuilder (mechanize): proxy should be present and has lambda format to enable before_request.change_proxy, skipped"
143
135
  end
136
+ end
137
+
138
+ # before_request delay
139
+ if delay = @config.dig(:before_request, :delay).presence
140
+ @browser.config.before_request[:delay] = delay
141
+ logger.debug "BrowserBuilder (mechanize): enabled before_request.delay"
142
+ end
144
143
 
145
- # return Capybara session instance
146
- @browser
144
+ # encoding
145
+ if encoding = @config[:encoding]
146
+ @browser.config.encoding = encoding
147
+ logger.debug "BrowserBuilder (mechanize): enabled encoding: #{encoding}"
147
148
  end
149
+
150
+ # return Capybara session instance
151
+ @browser
148
152
  end
149
153
  end
150
154
  end
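Note on the proxy handling in the builder above: the `proxy` option may be a string or a Proc returning one, and the string is split on `:` into ip, port and type. The sketch below isolates that parsing step. `parse_proxy` is a hypothetical helper, the addresses are made up, and `is_a?(Proc)` is used in place of the builder's `proxy.class == Proc` comparison.

```ruby
# Hypothetical helper mirroring the builder's parsing of the proxy option.
# Accepts a String or a Proc returning a String formatted "ip:port:type".
def parse_proxy(proxy)
  proxy_string = (proxy.is_a?(Proc) ? proxy.call : proxy).strip
  ip, port, type = proxy_string.split(":")
  { ip: ip, port: port, type: type }
end

static  = parse_proxy("127.0.0.1:8080:http")
dynamic = parse_proxy(-> { " 10.0.0.5:1080:socks5 \n" })

puts static[:ip]     # prints 127.0.0.1
puts static[:type]   # prints http
puts dynamic[:port]  # prints 1080
puts dynamic[:type]  # prints socks5
```

In the Mechanize builder only the `http` type is applied. Any other type is logged as unsupported and skipped.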
data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb CHANGED
@@ -4,168 +4,172 @@ require_relative '../capybara_configuration'
4
4
  require_relative '../capybara_ext/poltergeist/driver'
5
5
  require_relative '../capybara_ext/session'
6
6
 
7
- module Kimurai
8
- class BrowserBuilder
9
- class PoltergeistPhantomJSBuilder
10
- attr_reader :logger, :spider
11
-
12
- def initialize(config, spider:)
13
- @config = config
14
- @spider = spider
15
- @logger = spider.logger
16
- end
7
+ module Kimurai::BrowserBuilder
8
+ class PoltergeistPhantomjsBuilder
9
+ attr_reader :logger, :spider
10
+
11
+ def initialize(config, spider:)
12
+ @config = config
13
+ @spider = spider
14
+ @logger = spider.logger
15
+ end
17
16
 
18
- def build
19
- # Register driver
20
- Capybara.register_driver :poltergeist_phantomjs do |app|
21
- # Create driver options
22
- driver_options = {
23
- js_errors: false, debug: false, inspector: false, phantomjs_options: []
24
- }
25
-
26
- if extensions = @config[:extensions].presence
27
- driver_options[:extensions] = extensions
28
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled extensions"
29
- end
30
-
31
- # Window size
32
- if size = @config[:window_size].presence
33
- driver_options[:window_size] = size
34
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled window_size"
35
- end
36
-
37
- # SSL
38
- if ssl_cert_path = @config[:ssl_cert_path].presence
39
- driver_options[:phantomjs_options] << "--ssl-certificates-path=#{ssl_cert_path}"
40
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom ssl_cert"
41
- end
42
-
43
- if @config[:ignore_ssl_errors].present?
44
- driver_options[:phantomjs_options].push("--ignore-ssl-errors=yes", "--ssl-protocol=any")
45
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled ignore_ssl_errors"
46
- end
47
-
48
- # Disable images
49
- if @config[:disable_images].present?
50
- driver_options[:phantomjs_options] << "--load-images=no"
51
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled disable_images"
52
- end
53
-
54
- Capybara::Poltergeist::Driver.new(app, driver_options)
17
+ def build
18
+ # Register driver
19
+ Capybara.register_driver :poltergeist_phantomjs do |app|
20
+ # Create driver options
21
+ driver_options = {
22
+ js_errors: false, debug: false, inspector: false, phantomjs_options: []
23
+ }
24
+
25
+ if extensions = @config[:extensions].presence
26
+ driver_options[:extensions] = extensions
27
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled extensions"
55
28
  end
56
29
 
57
- # Create browser instance (Capybara session)
58
- @browser = Capybara::Session.new(:poltergeist_phantomjs)
59
- @browser.spider = spider
60
- logger.debug "BrowserBuilder (poltergeist_phantomjs): created browser instance"
61
-
62
- # Proxy
63
- if proxy = @config[:proxy].presence
64
- proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
65
- ip, port, type = proxy_string.split(":")
66
-
67
- if %w(http socks5).include?(type)
68
- @browser.driver.set_proxy(*proxy_string.split(":"))
69
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled #{type} proxy, ip: #{ip}, port: #{port}"
70
- else
71
- logger.error "BrowserBuilder (poltergeist_phantomjs): wrong type of proxy: #{type}, skipped"
72
- end
30
+ # Window size
31
+ if size = @config[:window_size].presence
32
+ driver_options[:window_size] = size
33
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled window_size"
73
34
  end
74
35
 
75
- # Headers
76
- if headers = @config[:headers].presence
77
- @browser.driver.headers = headers
78
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom headers"
36
+ # SSL
37
+ if ssl_cert_path = @config[:ssl_cert_path].presence
38
+ driver_options[:phantomjs_options] << "--ssl-certificates-path=#{ssl_cert_path}"
39
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom ssl_cert"
79
40
  end
80
41
 
81
- if user_agent = @config[:user_agent].presence
82
- user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
42
+ if @config[:ignore_ssl_errors].present?
43
+ driver_options[:phantomjs_options].push("--ignore-ssl-errors=yes", "--ssl-protocol=any")
44
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled ignore_ssl_errors"
45
+ end
83
46
 
84
- @browser.driver.add_header("User-Agent", user_agent_string)
85
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom user_agent"
47
+ # Disable images
48
+ if @config[:disable_images].present?
49
+ driver_options[:phantomjs_options] << "--load-images=no"
50
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled disable_images"
86
51
  end
87
52
 
88
- # Cookies
89
- if cookies = @config[:cookies].presence
90
- cookies.each do |cookie|
91
- @browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
92
- end
53
+ Capybara::Poltergeist::Driver.new(app, driver_options)
54
+ end
93
55
 
94
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom cookies"
56
+ # Create browser instance (Capybara session)
57
+ @browser = Capybara::Session.new(:poltergeist_phantomjs)
58
+ @browser.spider = spider
59
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): created browser instance"
60
+
61
+ # Proxy
62
+ if proxy = @config[:proxy].presence
63
+ proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
64
+ ip, port, type = proxy_string.split(":")
65
+
66
+ if %w(http socks5).include?(type)
67
+ @browser.driver.set_proxy(*proxy_string.split(":"))
68
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled #{type} proxy, ip: #{ip}, port: #{port}"
69
+ else
70
+ logger.error "BrowserBuilder (poltergeist_phantomjs): wrong type of proxy: #{type}, skipped"
95
71
  end
72
+ end
96
73
 
97
- # Browser instance options
98
- # skip_request_errors
99
- if skip_errors = @config[:skip_request_errors].presence
100
- @browser.config.skip_request_errors = skip_errors
101
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled skip_request_errors"
102
- end
74
+ # Headers
75
+ if headers = @config[:headers].presence
76
+ @browser.driver.headers = headers
77
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom headers"
78
+ end
103
79
 
104
- # retry_request_errors
105
- if retry_errors = @config[:retry_request_errors].presence
106
- @browser.config.retry_request_errors = retry_errors
107
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled retry_request_errors"
108
- end
80
+ if user_agent = @config[:user_agent].presence
81
+ user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
109
82
 
110
- # restart_if
111
- if requests_limit = @config.dig(:restart_if, :requests_limit).presence
112
- @browser.config.restart_if[:requests_limit] = requests_limit
113
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled restart_if.requests_limit >= #{requests_limit}"
114
- end
83
+ @browser.driver.add_header("User-Agent", user_agent_string)
84
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom user_agent"
85
+ end
115
86
 
116
- if memory_limit = @config.dig(:restart_if, :memory_limit).presence
117
- @browser.config.restart_if[:memory_limit] = memory_limit
118
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled restart_if.memory_limit >= #{memory_limit}"
87
+ # Cookies
88
+ if cookies = @config[:cookies].presence
89
+ cookies.each do |cookie|
90
+ @browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
119
91
  end
120
92
 
121
- # before_request clear_cookies
122
- if @config.dig(:before_request, :clear_cookies)
123
- @browser.config.before_request[:clear_cookies] = true
124
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.clear_cookies"
125
- end
93
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom cookies"
94
+ end
126
95
 
127
- # before_request clear_and_set_cookies
128
- if @config.dig(:before_request, :clear_and_set_cookies)
129
- if cookies = @config[:cookies].presence
130
- @browser.config.cookies = cookies
131
- @browser.config.before_request[:clear_and_set_cookies] = true
132
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.clear_and_set_cookies"
133
- else
134
- logger.error "BrowserBuilder (poltergeist_phantomjs): cookies should be present to enable before_request.clear_and_set_cookies, skipped"
135
- end
136
- end
96
+ # Browser instance options
97
+ # skip_request_errors
98
+ if skip_errors = @config[:skip_request_errors].presence
99
+ @browser.config.skip_request_errors = skip_errors
100
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled skip_request_errors"
101
+ end
137
102
 
138
- # before_request change_user_agent
139
- if @config.dig(:before_request, :change_user_agent)
140
- if @config[:user_agent].present? && @config[:user_agent].class == Proc
141
- @browser.config.user_agent = @config[:user_agent]
142
- @browser.config.before_request[:change_user_agent] = true
143
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.change_user_agent"
144
- else
145
- logger.error "BrowserBuilder (poltergeist_phantomjs): user_agent should be present and has lambda format to enable before_request.change_user_agent, skipped"
146
- end
103
+ # retry_request_errors
104
+ if retry_errors = @config[:retry_request_errors].presence
105
+ @browser.config.retry_request_errors = retry_errors
106
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled retry_request_errors"
107
+ end
108
+
109
+ # restart_if
110
+ if requests_limit = @config.dig(:restart_if, :requests_limit).presence
111
+ @browser.config.restart_if[:requests_limit] = requests_limit
112
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled restart_if.requests_limit >= #{requests_limit}"
113
+ end
114
+
115
+ if memory_limit = @config.dig(:restart_if, :memory_limit).presence
116
+ @browser.config.restart_if[:memory_limit] = memory_limit
117
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled restart_if.memory_limit >= #{memory_limit}"
118
+ end
119
+
120
+ # before_request clear_cookies
121
+ if @config.dig(:before_request, :clear_cookies)
122
+ @browser.config.before_request[:clear_cookies] = true
123
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.clear_cookies"
124
+ end
125
+
126
+ # before_request clear_and_set_cookies
127
+ if @config.dig(:before_request, :clear_and_set_cookies)
128
+ if cookies = @config[:cookies].presence
129
+ @browser.config.cookies = cookies
130
+ @browser.config.before_request[:clear_and_set_cookies] = true
131
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.clear_and_set_cookies"
132
+ else
133
+ logger.error "BrowserBuilder (poltergeist_phantomjs): cookies should be present to enable before_request.clear_and_set_cookies, skipped"
147
134
  end
135
+ end
148
136
 
149
- # before_request change_proxy
150
- if @config.dig(:before_request, :change_proxy)
151
- if @config[:proxy].present? && @config[:proxy].class == Proc
152
- @browser.config.proxy = @config[:proxy]
153
- @browser.config.before_request[:change_proxy] = true
154
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.change_proxy"
155
- else
156
- logger.error "BrowserBuilder (poltergeist_phantomjs): proxy should be present and has lambda format to enable before_request.change_proxy, skipped"
157
- end
137
+ # before_request change_user_agent
138
+ if @config.dig(:before_request, :change_user_agent)
139
+ if @config[:user_agent].present? && @config[:user_agent].class == Proc
140
+ @browser.config.user_agent = @config[:user_agent]
141
+ @browser.config.before_request[:change_user_agent] = true
142
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.change_user_agent"
143
+ else
144
+ logger.error "BrowserBuilder (poltergeist_phantomjs): user_agent should be present and has lambda format to enable before_request.change_user_agent, skipped"
158
145
  end
146
+ end
159
147
 
160
- # before_request delay
161
- if delay = @config.dig(:before_request, :delay).presence
162
- @browser.config.before_request[:delay] = delay
163
- logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.delay"
148
+ # before_request change_proxy
149
+ if @config.dig(:before_request, :change_proxy)
150
+ if @config[:proxy].present? && @config[:proxy].class == Proc
151
+ @browser.config.proxy = @config[:proxy]
152
+ @browser.config.before_request[:change_proxy] = true
153
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.change_proxy"
154
+ else
155
+ logger.error "BrowserBuilder (poltergeist_phantomjs): proxy should be present and has lambda format to enable before_request.change_proxy, skipped"
164
156
  end
157
+ end
165
158
 
166
- # return Capybara session instance
167
- @browser
159
+ # before_request delay
160
+ if delay = @config.dig(:before_request, :delay).presence
161
+ @browser.config.before_request[:delay] = delay
162
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.delay"
168
163
  end
164
+
165
+ # encoding
166
+ if encoding = @config[:encoding]
167
+ @browser.config.encoding = encoding
168
+ logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled encoding: #{encoding}"
169
+ end
170
+
171
+ # return Capybara session instance
172
+ @browser
169
173
  end
170
174
  end
171
175
  end
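Note on the driver options in the builder above: three config keys are translated into PhantomJS command-line flags before the driver is created. The sketch below collects that mapping in one hypothetical helper, `phantomjs_options`. It uses plain truthiness where the builder uses Active Support's `present?` and `presence`, so blank strings are treated differently here, and the certificate path is made up.

```ruby
# Hypothetical helper: derive PhantomJS flags from a config hash,
# following the three conditions in the builder above.
def phantomjs_options(config)
  options = []
  options << "--ssl-certificates-path=#{config[:ssl_cert_path]}" if config[:ssl_cert_path]
  options.push("--ignore-ssl-errors=yes", "--ssl-protocol=any") if config[:ignore_ssl_errors]
  options << "--load-images=no" if config[:disable_images]
  options
end

# Prints one flag per line.
puts phantomjs_options({ ignore_ssl_errors: true, disable_images: true })
puts phantomjs_options({ ssl_cert_path: "/tmp/example-ca.pem" })
```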