kimurai 1.3.2 → 1.4.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +8 -0
- data/README.md +7 -6
- data/lib/kimurai/base.rb +5 -1
- data/lib/kimurai/browser_builder.rb +8 -26
- data/lib/kimurai/browser_builder/mechanize_builder.rb +120 -116
- data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb +139 -135
- data/lib/kimurai/browser_builder/selenium_chrome_builder.rb +152 -148
- data/lib/kimurai/browser_builder/selenium_firefox_builder.rb +161 -157
- data/lib/kimurai/capybara_ext/session.rb +10 -1
- data/lib/kimurai/capybara_ext/session/config.rb +1 -1
- data/lib/kimurai/cli.rb +5 -1
- data/lib/kimurai/template/Gemfile +1 -1
- data/lib/kimurai/template/spiders/application_spider.rb +6 -0
- data/lib/kimurai/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 9efd6a3590e9df217a803d9e7c806b052d1c3c1cd9887294d7c736ee19fc8e7b
+data.tar.gz: 14d14793e808f5122158ae315da53bbb7484772274cc332e96064300016a3974
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 6224c7b4bccdbe92b610cd47fa04ed01dc9366fc7b12fb35edac218d71cdd24b7644fd7e5ee5b195d389395f6d147bc7ec8716c33e2d9227fde1cc1204b4a1e4
+data.tar.gz: 2c6789e7c62dfe999b641cae172f28c0eb609860de4e123fd0305934ef251fedfa2e6a299d3705f395bdbd483d3cd1004ec1804a98a5517ad218a789d98702f5
data/CHANGELOG.md
CHANGED
@@ -1,4 +1,12 @@
 # CHANGELOG
+## 1.4.0
+### New
+* Add `encoding` config option (see [All available config options](https://github.com/vifreefly/kimuraframework#all-available-config-options))
+* Validate url before processing a request (Base#request_to)
+
+### Fixes
+* Fix console command bug (see [issue 21](https://github.com/vifreefly/kimuraframework/issues/21))
+
 ## 1.3.2
 ### Fixes
 * In the project template, set Ruby version as >= 2.5 (before was hard-coded to 2.5.1)
data/README.md
CHANGED
@@ -10,15 +10,12 @@
 > * The code was massively refactored for a [support](#using-kimurai-inside-existing-ruby-application) to run spiders multiple times from inside a single process. Now it's possible to run Kimurai spiders using background jobs like Sidekiq.
 > * `require 'kimurai'` doesn't require any gems except Active Support. Only when a particular spider [starts](#crawl-method), Capybara will be required with a specific driver.
 > * Although Kimurai [extends](lib/kimurai/capybara_ext) Capybara (all the magic happens inside [extended](lib/kimurai/capybara_ext/session.rb) `Capybara::Session#visit` method), session instances which were created manually will behave normally.
-> * No spaghetti code with `case/when/end` blocks anymore. All drivers [were extended](lib/kimurai/capybara_ext) to support unified methods for cookies, proxies, headers, etc.
-> * `selenium_url_to_set_cookies` @config option don't need anymore if you're use Selenium-like engine with custom cookies setting.
 > * Small changes in design (check the readme again to see what was changed)
 > * Stats database with a web dashboard were removed
-> * Again, massive refactor. Code now looks much better than it was before.
 
 <br>
 
-> Note: this readme is for `1.
+> Note: this readme is for `1.4.0` gem version. CHANGELOG [here](CHANGELOG.md).
 
 Kimurai is a modern web scraping framework written in Ruby which **works out of box with Headless Chromium/Firefox, PhantomJS**, or simple HTTP requests and **allows to scrape and interact with JavaScript rendered websites.**
 
@@ -1592,6 +1589,12 @@ end
 # Format: same like for `skip_request_errors` option.
 retry_request_errors: [Net::ReadTimeout],
 
+# Handle page encoding while parsing html response using Nokogiri. There are two modes:
+# Auto (`:auto`) (try to fetch correct encoding from <meta http-equiv="Content-Type"> or <meta charset> tags)
+# Set required encoding manually, example: `encoding: "GB2312"` (Set required encoding manually)
+# Default this option is unset.
+encoding: nil,
+
 # Restart browser if one of the options is true:
 restart_if: {
 # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
@@ -1740,8 +1743,6 @@ end
 ### Crawl
 To run a particular spider in the project, run: `$ bundle exec kimurai crawl example_spider`. Don't forget to add `bundle exec` before command to load required environment.
 
-You can provide an additional option `--continue` to use [persistence storage database](#persistence-database-for-the-storage) feature.
-
 ### List
 To list all project spiders, run: `$ bundle exec kimurai list`
 
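The `encoding` option added to the README above has three states: unset (`nil`), `:auto`, or an explicit encoding name. The following is a minimal Ruby sketch of how such an option could be resolved, under stated assumptions: `resolve_encoding` is a hypothetical helper that is not part of Kimurai, and a simple regular expression stands in for the Nokogiri-based lookup the README mentions, which this diff does not show.

```ruby
# Hypothetical helper, for illustration only; Kimurai's own lookup is not shown in this diff.
# Returns the encoding name to pass to the HTML parser, or nil to keep the parser default.
def resolve_encoding(html, option)
  case option
  when nil
    # Option unset: do not force any encoding.
    nil
  when :auto
    # Covers both <meta charset="..."> and
    # <meta http-equiv="Content-Type" content="...; charset=...">.
    html[/<meta[^>]+charset\s*=\s*["']?\s*([\w\-]+)/i, 1]
  else
    # Explicit value such as "GB2312": use it as given.
    option.to_s
  end
end
```

In this sketch `:auto` returns `nil` when no charset declaration is found, so the caller would fall back to the parser default in that case.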
data/lib/kimurai/base.rb
CHANGED
@@ -3,6 +3,8 @@ require_relative 'base/storage'
 
 module Kimurai
 class Base
+class InvalidUrlError < StandardError; end
+
 # don't deep merge config's headers hash option
 DMERGE_EXCLUDE = [:headers]
 
@@ -171,7 +173,7 @@ module Kimurai
 attr_accessor :with_info
 
 def initialize(engine = self.class.engine, config: {})
-@engine = engine
+@engine = engine || self.class.engine
 @config = self.class.config.deep_merge_excl(config, DMERGE_EXCLUDE)
 @pipelines = self.class.pipelines.map do |pipeline_name|
 klass = Pipeline.descendants.find { |kl| kl.name == pipeline_name }
@@ -189,6 +191,8 @@ module Kimurai
 end
 
 def request_to(handler, delay = nil, url:, data: {}, response_type: :html)
+raise InvalidUrlError, "Requested url is invalid: #{url}" unless URI.parse(url).kind_of?(URI::HTTP)
+
 if @config[:skip_duplicate_requests] && !unique_request?(url)
 add_event(:duplicate_requests) if self.with_info
 logger.warn "Spider: request_to: not unique url: #{url}, skipped" and return
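The guard added to `request_to` accepts a URL only if `URI.parse` returns a `URI::HTTP` instance. Since `URI::HTTPS` is a subclass of `URI::HTTP`, both schemes pass, while relative paths and other schemes are rejected. A standalone sketch of the same check follows; the method name `validate_url!` and the top-level error class are assumptions made so the example runs outside Kimurai.

```ruby
require "uri"

# Stand-in for Kimurai::Base::InvalidUrlError so the sketch is self-contained.
class InvalidUrlError < StandardError; end

# Hypothetical wrapper around the check used in the diff above.
def validate_url!(url)
  # URI::HTTPS inherits from URI::HTTP, so https URLs satisfy kind_of? as well.
  unless URI.parse(url).kind_of?(URI::HTTP)
    raise InvalidUrlError, "Requested url is invalid: #{url}"
  end
  url
end

validate_url!("https://example.com/page") # passes and returns the url

begin
  validate_url!("/relative/path")          # parses as URI::Generic, so it is rejected
rescue InvalidUrlError => e
  puts e.message
end
```

Note that a string `URI.parse` cannot parse at all raises `URI::InvalidURIError` from the standard library before the `kind_of?` check is reached, so callers may see either exception.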
data/lib/kimurai/browser_builder.rb
CHANGED
@@ -1,38 +1,20 @@
 module Kimurai
-
-AVAILABLE_ENGINES = [
-:mechanize,
-:mechanize_standalone,
-:poltergeist_phantomjs,
-:selenium_firefox,
-:selenium_chrome
-]
-
+module BrowserBuilder
 def self.build(engine, config = {}, spider:)
-unless AVAILABLE_ENGINES.include? engine
-raise "BrowserBuilder: wrong name of engine, available engines: #{AVAILABLE_ENGINES.join(', ')}"
-end
-
 if config[:browser].present?
 raise "++++++ BrowserBuilder: browser option is depricated. Now all sub-options inside " \
 "`browser` should be placed right into `@config` hash, without `browser` parent key.\n" \
 "See more here: https://github.com/vifreefly/kimuraframework/blob/master/CHANGELOG.md#breaking-changes-110 ++++++"
 end
 
-
-
-
-MechanizeBuilder.new(config, spider: spider).build
-when :selenium_chrome
-require_relative 'browser_builder/selenium_chrome_builder'
-SeleniumChromeBuilder.new(config, spider: spider).build
-when :poltergeist_phantomjs
-require_relative 'browser_builder/poltergeist_phantomjs_builder'
-PoltergeistPhantomJSBuilder.new(config, spider: spider).build
-when :selenium_firefox
-require_relative 'browser_builder/selenium_firefox_builder'
-SeleniumFirefoxBuilder.new(config, spider: spider).build
+begin
+require "kimurai/browser_builder/#{engine}_builder"
+rescue LoadError => e
 end
+
+builder_class_name = "#{engine}_builder".classify
+builder = "Kimurai::BrowserBuilder::#{builder_class_name}".constantize
+builder.new(config, spider: spider).build
 end
 end
 end
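The rewritten `BrowserBuilder.build` replaces the hard-coded `case` with a name-based lookup: the engine symbol becomes a file name and a class name. The sketch below illustrates that convention in plain Ruby. It is not the gem's code: `DemoBuilders`, its two classes, and `builder_for` are hypothetical, and `split`/`capitalize`/`const_get` stand in for Active Support's `classify` and `constantize`.

```ruby
# Hypothetical namespace standing in for Kimurai::BrowserBuilder.
module DemoBuilders
  class MechanizeBuilder
    def build
      :mechanize_session
    end
  end

  class PoltergeistPhantomjsBuilder
    def build
      :poltergeist_session
    end
  end
end

# Map an engine symbol to a builder class by naming convention,
# e.g. :poltergeist_phantomjs -> "PoltergeistPhantomjsBuilder".
def builder_for(engine)
  class_name = "#{engine}_builder".split("_").map(&:capitalize).join
  # `false` restricts the lookup to DemoBuilders itself (no fallback to Object).
  DemoBuilders.const_get(class_name, false)
rescue NameError
  raise ArgumentError, "unknown engine: #{engine}"
end

puts builder_for(:mechanize).new.build
```

The convention also explains why the PhantomJS builder class appears as `PoltergeistPhantomjsBuilder` in this release: a camel-cased name derived from `poltergeist_phantomjs_builder` has a lowercase `js`.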
data/lib/kimurai/browser_builder/mechanize_builder.rb
CHANGED
@@ -4,147 +4,151 @@ require_relative '../capybara_configuration'
|
|
4
4
|
require_relative '../capybara_ext/mechanize/driver'
|
5
5
|
require_relative '../capybara_ext/session'
|
6
6
|
|
7
|
-
module Kimurai
|
8
|
-
class
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
|
16
|
-
end
|
17
|
-
|
18
|
-
def build
|
19
|
-
# Register driver
|
20
|
-
Capybara.register_driver :mechanize do |app|
|
21
|
-
driver = Capybara::Mechanize::Driver.new("app")
|
22
|
-
# keep the history as small as possible (by default it's unlimited)
|
23
|
-
driver.configure { |a| a.history.max_size = 2 }
|
24
|
-
driver
|
25
|
-
end
|
7
|
+
module Kimurai::BrowserBuilder
|
8
|
+
class MechanizeBuilder
|
9
|
+
attr_reader :logger, :spider
|
10
|
+
|
11
|
+
def initialize(config, spider:)
|
12
|
+
@config = config
|
13
|
+
@spider = spider
|
14
|
+
@logger = spider.logger
|
15
|
+
end
|
26
16
|
|
27
|
-
|
28
|
-
|
29
|
-
|
30
|
-
|
17
|
+
def build
|
18
|
+
# Register driver
|
19
|
+
Capybara.register_driver :mechanize do |app|
|
20
|
+
driver = Capybara::Mechanize::Driver.new("app")
|
21
|
+
# keep the history as small as possible (by default it's unlimited)
|
22
|
+
driver.configure { |a| a.history.max_size = 2 }
|
23
|
+
driver
|
24
|
+
end
|
31
25
|
|
32
|
-
|
33
|
-
|
34
|
-
|
26
|
+
# Create browser instance (Capybara session)
|
27
|
+
@browser = Capybara::Session.new(:mechanize)
|
28
|
+
@browser.spider = spider
|
29
|
+
logger.debug "BrowserBuilder (mechanize): created browser instance"
|
35
30
|
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
ip, port, type = proxy_string.split(":")
|
40
|
-
|
41
|
-
if type == "http"
|
42
|
-
@browser.driver.set_proxy(*proxy_string.split(":"))
|
43
|
-
logger.debug "BrowserBuilder (mechanize): enabled http proxy, ip: #{ip}, port: #{port}"
|
44
|
-
else
|
45
|
-
logger.error "BrowserBuilder (mechanize): can't set #{type} proxy (not supported), skipped"
|
46
|
-
end
|
47
|
-
end
|
31
|
+
if @config[:extensions].present?
|
32
|
+
logger.error "BrowserBuilder (mechanize): `extensions` option not supported, skipped"
|
33
|
+
end
|
48
34
|
|
49
|
-
|
50
|
-
|
51
|
-
|
52
|
-
|
53
|
-
end
|
35
|
+
# Proxy
|
36
|
+
if proxy = @config[:proxy].presence
|
37
|
+
proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
|
38
|
+
ip, port, type = proxy_string.split(":")
|
54
39
|
|
55
|
-
if
|
56
|
-
@browser.driver.
|
57
|
-
logger.debug "BrowserBuilder (mechanize): enabled
|
40
|
+
if type == "http"
|
41
|
+
@browser.driver.set_proxy(*proxy_string.split(":"))
|
42
|
+
logger.debug "BrowserBuilder (mechanize): enabled http proxy, ip: #{ip}, port: #{port}"
|
43
|
+
else
|
44
|
+
logger.error "BrowserBuilder (mechanize): can't set #{type} proxy (not supported), skipped"
|
58
45
|
end
|
46
|
+
end
|
59
47
|
|
60
|
-
|
61
|
-
|
62
|
-
|
63
|
-
|
64
|
-
|
48
|
+
# SSL
|
49
|
+
if ssl_cert_path = @config[:ssl_cert_path].presence
|
50
|
+
@browser.driver.browser.agent.http.ca_file = ssl_cert_path
|
51
|
+
logger.debug "BrowserBuilder (mechanize): enabled custom ssl_cert"
|
52
|
+
end
|
65
53
|
|
66
|
-
|
67
|
-
|
54
|
+
if @config[:ignore_ssl_errors].present?
|
55
|
+
@browser.driver.browser.agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
|
56
|
+
logger.debug "BrowserBuilder (mechanize): enabled ignore_ssl_errors"
|
57
|
+
end
|
68
58
|
|
69
|
-
|
70
|
-
|
71
|
-
|
59
|
+
# Headers
|
60
|
+
if headers = @config[:headers].presence
|
61
|
+
@browser.driver.headers = headers
|
62
|
+
logger.debug "BrowserBuilder (mechanize): enabled custom headers"
|
63
|
+
end
|
72
64
|
|
73
|
-
|
74
|
-
|
75
|
-
cookies.each do |cookie|
|
76
|
-
@browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
|
77
|
-
end
|
65
|
+
if user_agent = @config[:user_agent].presence
|
66
|
+
user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
|
78
67
|
|
79
|
-
|
80
|
-
|
68
|
+
@browser.driver.add_header("User-Agent", user_agent_string)
|
69
|
+
logger.debug "BrowserBuilder (mechanize): enabled custom user_agent"
|
70
|
+
end
|
81
71
|
|
82
|
-
|
83
|
-
|
84
|
-
|
85
|
-
@browser.
|
86
|
-
logger.debug "BrowserBuilder (mechanize): enabled skip_request_errors"
|
72
|
+
# Cookies
|
73
|
+
if cookies = @config[:cookies].presence
|
74
|
+
cookies.each do |cookie|
|
75
|
+
@browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
|
87
76
|
end
|
88
77
|
|
89
|
-
|
90
|
-
|
91
|
-
@browser.config.retry_request_errors = retry_errors
|
92
|
-
logger.debug "BrowserBuilder (mechanize): enabled retry_request_errors"
|
93
|
-
end
|
78
|
+
logger.debug "BrowserBuilder (mechanize): enabled custom cookies"
|
79
|
+
end
|
94
80
|
|
95
|
-
|
96
|
-
|
97
|
-
|
98
|
-
|
81
|
+
# Browser instance options
|
82
|
+
# skip_request_errors
|
83
|
+
if skip_errors = @config[:skip_request_errors].presence
|
84
|
+
@browser.config.skip_request_errors = skip_errors
|
85
|
+
logger.debug "BrowserBuilder (mechanize): enabled skip_request_errors"
|
86
|
+
end
|
99
87
|
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
88
|
+
# retry_request_errors
|
89
|
+
if retry_errors = @config[:retry_request_errors].presence
|
90
|
+
@browser.config.retry_request_errors = retry_errors
|
91
|
+
logger.debug "BrowserBuilder (mechanize): enabled retry_request_errors"
|
92
|
+
end
|
105
93
|
|
106
|
-
|
107
|
-
|
108
|
-
|
109
|
-
|
110
|
-
|
111
|
-
|
112
|
-
|
113
|
-
|
114
|
-
|
115
|
-
|
94
|
+
# restart_if
|
95
|
+
if @config[:restart_if].present?
|
96
|
+
logger.warn "BrowserBuilder (mechanize): restart_if options not supported by Mechanize, skipped"
|
97
|
+
end
|
98
|
+
|
99
|
+
# before_request clear_cookies
|
100
|
+
if @config.dig(:before_request, :clear_cookies)
|
101
|
+
@browser.config.before_request[:clear_cookies] = true
|
102
|
+
logger.debug "BrowserBuilder (mechanize): enabled before_request.clear_cookies"
|
103
|
+
end
|
116
104
|
|
117
|
-
|
118
|
-
|
119
|
-
|
120
|
-
|
121
|
-
|
122
|
-
|
123
|
-
|
124
|
-
|
125
|
-
end
|
105
|
+
# before_request clear_and_set_cookies
|
106
|
+
if @config.dig(:before_request, :clear_and_set_cookies)
|
107
|
+
if cookies = @config[:cookies].presence
|
108
|
+
@browser.config.cookies = cookies
|
109
|
+
@browser.config.before_request[:clear_and_set_cookies] = true
|
110
|
+
logger.debug "BrowserBuilder (mechanize): enabled before_request.clear_and_set_cookies"
|
111
|
+
else
|
112
|
+
logger.error "BrowserBuilder (mechanize): cookies should be present to enable before_request.clear_and_set_cookies, skipped"
|
126
113
|
end
|
114
|
+
end
|
127
115
|
|
128
|
-
|
129
|
-
|
130
|
-
|
131
|
-
|
132
|
-
|
133
|
-
|
134
|
-
|
135
|
-
|
136
|
-
end
|
116
|
+
# before_request change_user_agent
|
117
|
+
if @config.dig(:before_request, :change_user_agent)
|
118
|
+
if @config[:user_agent].present? && @config[:user_agent].class == Proc
|
119
|
+
@browser.config.user_agent = @config[:user_agent]
|
120
|
+
@browser.config.before_request[:change_user_agent] = true
|
121
|
+
logger.debug "BrowserBuilder (mechanize): enabled before_request.change_user_agent"
|
122
|
+
else
|
123
|
+
logger.error "BrowserBuilder (mechanize): user_agent should be present and has lambda format to enable before_request.change_user_agent, skipped"
|
137
124
|
end
|
125
|
+
end
|
138
126
|
|
139
|
-
|
140
|
-
|
141
|
-
|
142
|
-
|
127
|
+
# before_request change_proxy
|
128
|
+
if @config.dig(:before_request, :change_proxy)
|
129
|
+
if @config[:proxy].present? && @config[:proxy].class == Proc
|
130
|
+
@browser.config.proxy = @config[:proxy]
|
131
|
+
@browser.config.before_request[:change_proxy] = true
|
132
|
+
logger.debug "BrowserBuilder (mechanize): enabled before_request.change_proxy"
|
133
|
+
else
|
134
|
+
logger.error "BrowserBuilder (mechanize): proxy should be present and has lambda format to enable before_request.change_proxy, skipped"
|
143
135
|
end
|
136
|
+
end
|
137
|
+
|
138
|
+
# before_request delay
|
139
|
+
if delay = @config.dig(:before_request, :delay).presence
|
140
|
+
@browser.config.before_request[:delay] = delay
|
141
|
+
logger.debug "BrowserBuilder (mechanize): enabled before_request.delay"
|
142
|
+
end
|
144
143
|
|
145
|
-
|
146
|
-
|
144
|
+
# encoding
|
145
|
+
if encoding = @config[:encoding]
|
146
|
+
@browser.config.encoding = encoding
|
147
|
+
logger.debug "BrowserBuilder (mechanize): enabled encoding: #{encoding}"
|
147
148
|
end
|
149
|
+
|
150
|
+
# return Capybara session instance
|
151
|
+
@browser
|
148
152
|
end
|
149
153
|
end
|
150
154
|
end
|
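Both builders in this diff read the proxy the same way: the option is either a string or a `Proc` returning one, the result is stripped, and the first three colon-separated fields are taken as ip, port and type. A small sketch of that parsing step follows, assuming a hypothetical helper `parse_proxy` that takes the list of supported types as an argument (the builders hard-code it: `http` for Mechanize, `http` and `socks5` for Poltergeist).

```ruby
# Hypothetical helper mirroring the proxy handling in the builders above.
# Returns a hash for a supported proxy, or nil when the type is not supported.
def parse_proxy(proxy, supported_types)
  # A Proc is called to obtain the string; a plain string is used as-is.
  proxy_string = (proxy.is_a?(Proc) ? proxy.call : proxy).strip
  ip, port, type = proxy_string.split(":")
  return nil unless supported_types.include?(type)

  { ip: ip, port: port, type: type }
end

p parse_proxy("127.0.0.1:8080:http", %w[http])
p parse_proxy(-> { " 10.0.0.1:1080:socks5 " }, %w[http socks5])
```

Returning `nil` here corresponds to the branch where the builders log an error and skip the proxy instead of raising.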
data/lib/kimurai/browser_builder/poltergeist_phantomjs_builder.rb
CHANGED
@@ -4,168 +4,172 @@ require_relative '../capybara_configuration'
|
|
4
4
|
require_relative '../capybara_ext/poltergeist/driver'
|
5
5
|
require_relative '../capybara_ext/session'
|
6
6
|
|
7
|
-
module Kimurai
|
8
|
-
class
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
|
16
|
-
end
|
7
|
+
module Kimurai::BrowserBuilder
|
8
|
+
class PoltergeistPhantomjsBuilder
|
9
|
+
attr_reader :logger, :spider
|
10
|
+
|
11
|
+
def initialize(config, spider:)
|
12
|
+
@config = config
|
13
|
+
@spider = spider
|
14
|
+
@logger = spider.logger
|
15
|
+
end
|
17
16
|
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
|
28
|
-
|
29
|
-
end
|
30
|
-
|
31
|
-
# Window size
|
32
|
-
if size = @config[:window_size].presence
|
33
|
-
driver_options[:window_size] = size
|
34
|
-
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled window_size"
|
35
|
-
end
|
36
|
-
|
37
|
-
# SSL
|
38
|
-
if ssl_cert_path = @config[:ssl_cert_path].presence
|
39
|
-
driver_options[:phantomjs_options] << "--ssl-certificates-path=#{ssl_cert_path}"
|
40
|
-
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom ssl_cert"
|
41
|
-
end
|
42
|
-
|
43
|
-
if @config[:ignore_ssl_errors].present?
|
44
|
-
driver_options[:phantomjs_options].push("--ignore-ssl-errors=yes", "--ssl-protocol=any")
|
45
|
-
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled ignore_ssl_errors"
|
46
|
-
end
|
47
|
-
|
48
|
-
# Disable images
|
49
|
-
if @config[:disable_images].present?
|
50
|
-
driver_options[:phantomjs_options] << "--load-images=no"
|
51
|
-
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled disable_images"
|
52
|
-
end
|
53
|
-
|
54
|
-
Capybara::Poltergeist::Driver.new(app, driver_options)
|
17
|
+
def build
|
18
|
+
# Register driver
|
19
|
+
Capybara.register_driver :poltergeist_phantomjs do |app|
|
20
|
+
# Create driver options
|
21
|
+
driver_options = {
|
22
|
+
js_errors: false, debug: false, inspector: false, phantomjs_options: []
|
23
|
+
}
|
24
|
+
|
25
|
+
if extensions = @config[:extensions].presence
|
26
|
+
driver_options[:extensions] = extensions
|
27
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled extensions"
|
55
28
|
end
|
56
29
|
|
57
|
-
#
|
58
|
-
|
59
|
-
|
60
|
-
|
61
|
-
|
62
|
-
# Proxy
|
63
|
-
if proxy = @config[:proxy].presence
|
64
|
-
proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
|
65
|
-
ip, port, type = proxy_string.split(":")
|
66
|
-
|
67
|
-
if %w(http socks5).include?(type)
|
68
|
-
@browser.driver.set_proxy(*proxy_string.split(":"))
|
69
|
-
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled #{type} proxy, ip: #{ip}, port: #{port}"
|
70
|
-
else
|
71
|
-
logger.error "BrowserBuilder (poltergeist_phantomjs): wrong type of proxy: #{type}, skipped"
|
72
|
-
end
|
30
|
+
# Window size
|
31
|
+
if size = @config[:window_size].presence
|
32
|
+
driver_options[:window_size] = size
|
33
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled window_size"
|
73
34
|
end
|
74
35
|
|
75
|
-
#
|
76
|
-
if
|
77
|
-
|
78
|
-
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom
|
36
|
+
# SSL
|
37
|
+
if ssl_cert_path = @config[:ssl_cert_path].presence
|
38
|
+
driver_options[:phantomjs_options] << "--ssl-certificates-path=#{ssl_cert_path}"
|
39
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom ssl_cert"
|
79
40
|
end
|
80
41
|
|
81
|
-
if
|
82
|
-
|
42
|
+
if @config[:ignore_ssl_errors].present?
|
43
|
+
driver_options[:phantomjs_options].push("--ignore-ssl-errors=yes", "--ssl-protocol=any")
|
44
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled ignore_ssl_errors"
|
45
|
+
end
|
83
46
|
|
84
|
-
|
85
|
-
|
47
|
+
# Disable images
|
48
|
+
if @config[:disable_images].present?
|
49
|
+
driver_options[:phantomjs_options] << "--load-images=no"
|
50
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled disable_images"
|
86
51
|
end
|
87
52
|
|
88
|
-
|
89
|
-
|
90
|
-
cookies.each do |cookie|
|
91
|
-
@browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
|
92
|
-
end
|
53
|
+
Capybara::Poltergeist::Driver.new(app, driver_options)
|
54
|
+
end
|
93
55
|
|
94
|
-
|
56
|
+
# Create browser instance (Capybara session)
|
57
|
+
@browser = Capybara::Session.new(:poltergeist_phantomjs)
|
58
|
+
@browser.spider = spider
|
59
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): created browser instance"
|
60
|
+
|
61
|
+
# Proxy
|
62
|
+
if proxy = @config[:proxy].presence
|
63
|
+
proxy_string = (proxy.class == Proc ? proxy.call : proxy).strip
|
64
|
+
ip, port, type = proxy_string.split(":")
|
65
|
+
|
66
|
+
if %w(http socks5).include?(type)
|
67
|
+
@browser.driver.set_proxy(*proxy_string.split(":"))
|
68
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled #{type} proxy, ip: #{ip}, port: #{port}"
|
69
|
+
else
|
70
|
+
logger.error "BrowserBuilder (poltergeist_phantomjs): wrong type of proxy: #{type}, skipped"
|
95
71
|
end
|
72
|
+
end
|
96
73
|
|
97
|
-
|
98
|
-
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
end
|
74
|
+
# Headers
|
75
|
+
if headers = @config[:headers].presence
|
76
|
+
@browser.driver.headers = headers
|
77
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom headers"
|
78
|
+
end
|
103
79
|
|
104
|
-
|
105
|
-
|
106
|
-
@browser.config.retry_request_errors = retry_errors
|
107
|
-
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled retry_request_errors"
|
108
|
-
end
|
80
|
+
if user_agent = @config[:user_agent].presence
|
81
|
+
user_agent_string = (user_agent.class == Proc ? user_agent.call : user_agent).strip
|
109
82
|
|
110
|
-
|
111
|
-
|
112
|
-
|
113
|
-
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled restart_if.requests_limit >= #{requests_limit}"
|
114
|
-
end
|
83
|
+
@browser.driver.add_header("User-Agent", user_agent_string)
|
84
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom user_agent"
|
85
|
+
end
|
115
86
|
|
116
|
-
|
117
|
-
|
118
|
-
|
87
|
+
# Cookies
|
88
|
+
if cookies = @config[:cookies].presence
|
89
|
+
cookies.each do |cookie|
|
90
|
+
@browser.driver.set_cookie(cookie[:name], cookie[:value], cookie)
|
119
91
|
end
|
120
92
|
|
121
|
-
|
122
|
-
|
123
|
-
@browser.config.before_request[:clear_cookies] = true
|
124
|
-
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.clear_cookies"
|
125
|
-
end
|
93
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled custom cookies"
|
94
|
+
end
|
126
95
|
|
127
|
-
|
128
|
-
|
129
|
-
|
130
|
-
|
131
|
-
|
132
|
-
|
133
|
-
else
|
134
|
-
logger.error "BrowserBuilder (poltergeist_phantomjs): cookies should be present to enable before_request.clear_and_set_cookies, skipped"
|
135
|
-
end
|
136
|
-
end
|
96
|
+
# Browser instance options
|
97
|
+
# skip_request_errors
|
98
|
+
if skip_errors = @config[:skip_request_errors].presence
|
99
|
+
@browser.config.skip_request_errors = skip_errors
|
100
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled skip_request_errors"
|
101
|
+
end
|
137
102
|
|
138
|
-
|
139
|
-
|
140
|
-
|
141
|
-
|
142
|
-
|
143
|
-
|
144
|
-
|
145
|
-
|
146
|
-
|
103
|
+
# retry_request_errors
|
104
|
+
if retry_errors = @config[:retry_request_errors].presence
|
105
|
+
@browser.config.retry_request_errors = retry_errors
|
106
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled retry_request_errors"
|
107
|
+
end
|
108
|
+
|
109
|
+
# restart_if
|
110
|
+
if requests_limit = @config.dig(:restart_if, :requests_limit).presence
|
111
|
+
@browser.config.restart_if[:requests_limit] = requests_limit
|
112
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled restart_if.requests_limit >= #{requests_limit}"
|
113
|
+
end
|
114
|
+
|
115
|
+
if memory_limit = @config.dig(:restart_if, :memory_limit).presence
|
116
|
+
@browser.config.restart_if[:memory_limit] = memory_limit
|
117
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled restart_if.memory_limit >= #{memory_limit}"
|
118
|
+
end
|
119
|
+
|
120
|
+
# before_request clear_cookies
|
121
|
+
if @config.dig(:before_request, :clear_cookies)
|
122
|
+
@browser.config.before_request[:clear_cookies] = true
|
123
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.clear_cookies"
|
124
|
+
end
|
125
|
+
|
126
|
+
# before_request clear_and_set_cookies
|
127
|
+
if @config.dig(:before_request, :clear_and_set_cookies)
|
128
|
+
if cookies = @config[:cookies].presence
|
129
|
+
@browser.config.cookies = cookies
|
130
|
+
@browser.config.before_request[:clear_and_set_cookies] = true
|
131
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.clear_and_set_cookies"
|
132
|
+
else
|
133
|
+
logger.error "BrowserBuilder (poltergeist_phantomjs): cookies should be present to enable before_request.clear_and_set_cookies, skipped"
|
147
134
|
end
|
135
|
+
end
|
148
136
|
|
149
|
-
|
150
|
-
|
151
|
-
|
152
|
-
|
153
|
-
|
154
|
-
|
155
|
-
|
156
|
-
|
157
|
-
end
|
137
|
+
# before_request change_user_agent
|
138
|
+
if @config.dig(:before_request, :change_user_agent)
|
139
|
+
if @config[:user_agent].present? && @config[:user_agent].class == Proc
|
140
|
+
@browser.config.user_agent = @config[:user_agent]
|
141
|
+
@browser.config.before_request[:change_user_agent] = true
|
142
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.change_user_agent"
|
143
|
+
else
|
144
|
+
logger.error "BrowserBuilder (poltergeist_phantomjs): user_agent should be present and has lambda format to enable before_request.change_user_agent, skipped"
|
158
145
|
end
|
146
|
+
end
|
159
147
|
|
160
|
-
|
161
|
-
|
162
|
-
|
163
|
-
|
148
|
+
# before_request change_proxy
|
149
|
+
if @config.dig(:before_request, :change_proxy)
|
150
|
+
if @config[:proxy].present? && @config[:proxy].class == Proc
|
151
|
+
@browser.config.proxy = @config[:proxy]
|
152
|
+
@browser.config.before_request[:change_proxy] = true
|
153
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.change_proxy"
|
154
|
+
else
|
155
|
+
logger.error "BrowserBuilder (poltergeist_phantomjs): proxy should be present and has lambda format to enable before_request.change_proxy, skipped"
|
164
156
|
end
|
157
|
+
end
|
165
158
|
|
166
|
-
|
167
|
-
|
159
|
+
# before_request delay
|
160
|
+
if delay = @config.dig(:before_request, :delay).presence
|
161
|
+
@browser.config.before_request[:delay] = delay
|
162
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled before_request.delay"
|
168
163
|
end
|
164
|
+
|
165
|
+
# encoding
|
166
|
+
if encoding = @config[:encoding]
|
167
|
+
@browser.config.encoding = encoding
|
168
|
+
logger.debug "BrowserBuilder (poltergeist_phantomjs): enabled encoding: #{encoding}"
|
169
|
+
end
|
170
|
+
|
171
|
+
# return Capybara session instance
|
172
|
+
@browser
|
169
173
|
end
|
170
174
|
end
|
171
175
|
end
|
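The `before_request.change_user_agent` branch in the builders above is enabled only when `user_agent` is present and is a `Proc`; a fixed string cannot produce a different value per request. A minimal sketch of that rule, assuming a hypothetical helper `rotating_user_agent` and made-up user agent strings:

```ruby
# Hypothetical helper: returns the callable to use for per-request rotation,
# or nil when rotation is not requested or the option is not a Proc.
def rotating_user_agent(config)
  return nil unless config.dig(:before_request, :change_user_agent)

  user_agent = config[:user_agent]
  user_agent.is_a?(Proc) ? user_agent : nil
end

# Example config; the user agent strings are placeholders.
config = {
  user_agent: -> { ["Agent/1.0", "Agent/2.0"].sample },
  before_request: { change_user_agent: true }
}

rotator = rotating_user_agent(config)
puts rotator.call if rotator
```

With a plain string in `user_agent`, the helper returns `nil`, which corresponds to the branch where the builders log an error and skip the option.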