rubium 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/README.md +41 -8
- data/lib/rubium/browser.rb +72 -13
- data/lib/rubium/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 42ae899603df19fcd01dc7c565aabcc162be29b7e9a12e508ba1bf3b2c26002a
+  data.tar.gz: 3f5ee794cdb712200821de3b9ecd17cbe6edc085b70e1af393f88fab98ef3de1
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8d36c239dbe64b1693756615fe7aadf111aa71646fa226320bb85c69945f79ae921f29533ed7f6f6d376a78c82ddd22d4e517aaef536732f8bc72b40ba7010fb
+  data.tar.gz: 80a67649a61f32bf80f4d28195f9da07ec3528c801b00a79908a8baa9103577a70ce7a29e374d66fbf7f9ecbfee8b103a2f7c03c88ad4ffd22f827c8162351e2
data/README.md
CHANGED
@@ -1,6 +1,8 @@
 # Rubium
 
-Rubium
+> **Rubium updated to 0.2.0 version!** Added new options like `set_cookies`, `restart_after`, `urls_blacklist`, `disable_images` and others. Check the readme below:
+
+Rubium is a handy wrapper around [chrome_remote](https://github.com/cavalle/chrome_remote) gem. It adds browsers instances handling, and some Capybara-like methods. It is very lightweight (250 lines of code in the main `Rubium::Browser` class for now) and doens't use Selenium or Capybara. Consider Rubium as a _very simple_ and _basic_ implementation of [Puppeteer](https://github.com/GoogleChrome/puppeteer) in Ruby language.
 
 You can use Rubium as a lightweight alternative to Selenium/Capybara/Watir if you need to perform some operations (like web scraping) using Headless Chromium and Ruby. Of course, the API currently doesn't has a lot of methods to automate browser, but it has the most frequently used and basic ones.
 
@@ -22,6 +24,12 @@ browser.click("some selector")
 # Get current cookies:
 browser.cookies
 
+# Set cookies (Array of hashes):
+browser.set_cookies([
+  { name: "some_cookie_name", value: "some_cookie_value", domain: ".some-cookie-domain.com" },
+  { name: "another_cookie_name", value: "another_cookie_value", domain: ".another-cookie-domain.com" }
+])
+
 # Fill in some field:
 browser.fill_in("some field selector", "Some text")
 
@@ -39,7 +47,8 @@ browser.evaluate_on_new_document(File.read "browser_inject.js")
 # Evaluate JS code expression:
 browser.execute_script("JS code string")
 
-# Access chrome_remote client directly:
+# Access chrome_remote client (instance of ChromeRemote class) directly:
+# See more here: https://github.com/cavalle/chrome_remote#using-the-chromeremote-api
 browser.client
 
 # Close browser:
@@ -49,18 +58,42 @@ browser.close
 browser.restart!
 ```
 
-There are some options which you can provide while creating browser instance:
+**There are some options** which you can provide while creating browser instance:
 
 ```ruby
 browser = Rubium::Browser.new(
-  debugging_port: 9222,
-  headless: false,
-
-
+  debugging_port: 9222, # custom debugging port. Default is any available port.
+  headless: false, # Run browser in normal (not headless) mode. Default is headless.
+  window_size: [1600, 900], # Custom window size. Default is unset.
+  user_agent: "Some user agent", # Custom user-agent.
+  proxy_server: "http://1.1.1.1:8080", # Set proxy.
+  extension_code: "Some JS code string", # Inject custom JS code on each page. See above `evaluate_on_new_document`
+  cookies: [], # Set custom cookies, see above `set_cookies`
+  restart_after: 25, # Automatically restart browser after N processed requests
+  enable_logger: true, # Enable logger to log info about processing requests
+  max_timeout: 30, # How long to wait (in seconds) until page will be fully loaded. Default 60 sec.
+  urls_blacklist: ["*some-domain.com*"], # Skip all requests which match provided patterns (wildcard allowed).
+  disable_images: true # Do not download images.
 )
 ```
 
-
+Note that for options `user_agent` and `proxy_server` you can provide `lambda` object instead of string:
+
+```ruby
+USER_AGENTS = ["Safari", "Mozilla", "IE", "Chrome"]
+PROXIES = ["http://1.1.1.1:8080", "http://2.2.2.2:8080", "http://3.3.3.3:8080"]
+
+browser = Rubium::Browser.new(
+  user_agent: -> { USER_AGENTS.sample },
+  proxy_server: -> { PROXIES.sample },
+  restart_after: 25
+)
+```
+
+> What for: Chrome doesn't provide an API to change proxies on the fly (after browser has been started). It is possible to set proxy while starting Chrome instance by providing CLI argument only. On the other hand, Rubium allows you to automatically restart browser (`restart_after` option) after N processed requests. On each restart, if options `user_agent` and/or `proxy_server` has lambda format, then lambda will be called to fetch fresh value. Thus it's possible to rotate proxies/user-agents without any much effort.
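Review note on the "What for" paragraph: the string-or-lambda behavior it describes can be sketched in plain Ruby. This is only an illustration under assumptions: `resolve_option` is a hypothetical helper, not a method shown in this diff, and the rotation uses `Array#cycle` instead of `sample` so the result is deterministic.

```ruby
# Hypothetical helper (not part of Rubium): return the option's value,
# calling it first when it responds to #call (a lambda or proc).
def resolve_option(value)
  value.respond_to?(:call) ? value.call : value
end

PROXIES = ["http://1.1.1.1:8080", "http://2.2.2.2:8080"].freeze

static_proxy = "http://3.3.3.3:8080"

# Deterministic rotation for the demo; the README example uses PROXIES.sample.
rotation      = PROXIES.cycle
dynamic_proxy = -> { rotation.next }

# Each "restart" would resolve the option again and get a fresh value.
first  = resolve_option(dynamic_proxy)
second = resolve_option(dynamic_proxy)
fixed  = resolve_option(static_proxy)

puts first, second, fixed
```

A plain string passes through unchanged, so callers can mix static and rotating options freely.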
+
+
+**You can provide custom Chrome binary** path this way:
 
 ```ruby
 Rubium.configure do |config|
data/lib/rubium/browser.rb
CHANGED
@@ -4,6 +4,7 @@ require 'random-port'
 require 'cliver'
 require 'timeout'
 require 'securerandom'
+require 'logger'
 
 at_exit do
   Rubium::Browser.running_pids.each { |pid| Process.kill("HUP", pid) }
@@ -14,6 +15,7 @@ module Rubium
     class ConfigurationError < StandardError; end
 
     MAX_CONNECT_WAIT_TIME = 2
+    MAX_DEFAULT_TIMEOUT = 60
 
     class << self
       def ports_pool
@@ -25,26 +27,38 @@ module Rubium
       end
     end
 
-    attr_reader :client, :devtools_url, :pid, :port, :options
+    attr_reader :client, :devtools_url, :pid, :port, :options, :processed_requests_count, :logger
 
     def initialize(options = {})
       @options = options
+
+      if @options[:enable_logger]
+        @logger = Logger.new(STDOUT)
+        @logger.progname = self.class.to_s
+      end
+
       create_browser
     end
 
     def restart!
+      logger.info "Restarting..." if options[:enable_logger]
+
       close
       create_browser
     end
 
     def close
-
+      if closed?
+        logger.info "Browser already has been closed" if options[:enable_logger]
+      else
         Process.kill("HUP", @pid)
         self.class.running_pids.delete(@pid)
         self.class.ports_pool.release(@port)
 
         FileUtils.rm_rf(@data_dir) if Dir.exist?(@data_dir)
         @closed = true
+
+        logger.info "Closed browser" if options[:enable_logger]
       end
     end
 
@@ -54,14 +68,28 @@ module Rubium
       @closed
     end
 
-    def goto(url, wait:
+    def goto(url, wait: options[:max_timeout] || MAX_DEFAULT_TIMEOUT)
+      logger.info "Started request: #{url}" if options[:enable_logger]
+      if options[:restart_after] && processed_requests_count >= options[:restart_after]
+        restart!
+      end
+
       response = @client.send_cmd "Page.navigate", url: url
 
-
-
-
-
+      # By default, after Page.navigate we should wait till page will load completely
+      # using Page.loadEventFired. But on some websites with Ajax navigation, Page.loadEventFired
+      # will stuck forever. In this case you can provide `wait: false` option to skip waiting.
+      if wait != false
+        # https://chromedevtools.github.io/devtools-protocol/tot/Page#event-frameStoppedLoading
+        Timeout.timeout(wait) do
+          @client.wait_for do |event_name, event_params|
+            event_name == "Page.frameStoppedLoading" && event_params["frameId"] == response["frameId"]
+          end
+        end
       end
+
+      @processed_requests_count += 1
+      logger.info "Finished request: #{url}" if options[:enable_logger]
     end
 
     alias_method :visit, :goto
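Review note on the `goto` hunk: the restart bookkeeping (check the counter before navigating, increment it afterwards, reset it when the browser is created) can be exercised without Chrome. The sketch below is an assumption-laden stand-in: `CountingBrowser` and its `restarts` reader are hypothetical names, and no DevTools commands are sent.

```ruby
# Hypothetical stand-in for the counter logic only; it never launches Chrome.
class CountingBrowser
  attr_reader :processed_requests_count, :restarts

  def initialize(restart_after: nil)
    @restart_after = restart_after
    @restarts = 0
    create_browser
  end

  def goto(url)
    # Restart before navigating once the threshold has been reached.
    restart! if @restart_after && processed_requests_count >= @restart_after

    # The real method would send Page.navigate for `url` here.
    @processed_requests_count += 1
  end

  private

  def restart!
    @restarts += 1
    create_browser
  end

  # Mirrors the diff: every new browser starts with a zeroed counter.
  def create_browser
    @processed_requests_count = 0
  end
end

browser = CountingBrowser.new(restart_after: 2)
5.times { |i| browser.goto("https://example.com/#{i}") }

puts browser.restarts
puts browser.processed_requests_count
```

Because the check runs before navigation, the restart happens on the request that follows the Nth one, not immediately after it.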
@@ -106,21 +134,21 @@ module Rubium
     end
 
     def click(selector)
-      @client.send_cmd "Runtime.evaluate", expression: <<~
+      @client.send_cmd "Runtime.evaluate", expression: <<~js
         document.querySelector("#{selector}").click();
-
+      js
     end
 
     # https://github.com/cyrus-and/chrome-remote-interface/issues/226#issuecomment-320247756
     # https://stackoverflow.com/a/18937620
     def send_key_on(selector, key)
-      @client.send_cmd "Runtime.evaluate", expression: <<~
+      @client.send_cmd "Runtime.evaluate", expression: <<~js
         document.querySelector("#{selector}").dispatchEvent(
           new KeyboardEvent("keydown", {
             bubbles: true, cancelable: true, keyCode: #{key}
           })
         );
-
+      js
     end
 
     # https://github.com/GoogleChrome/puppeteer/blob/master/lib/Page.js#L784
@@ -130,15 +158,24 @@ module Rubium
       @client.send_cmd "Page.addScriptToEvaluateOnNewDocument", source: script
     end
 
+    ###
+
     def cookies
       response = @client.send_cmd "Network.getCookies"
       response["cookies"]
     end
 
+    # https://chromedevtools.github.io/devtools-protocol/tot/Network#method-setCookies
+    def set_cookies(cookies)
+      @client.send_cmd "Network.setCookies", cookies: cookies
+    end
+
+    ###
+
     def fill_in(selector, text)
-      execute_script <<~
+      execute_script <<~js
         document.querySelector("#{selector}").value = "#{text}"
-
+      js
     end
 
     def execute_script(script)
@@ -148,6 +185,8 @@ module Rubium
     private
 
     def create_browser
+      @processed_requests_count = 0
+
       @port = options[:debugging_port] || self.class.ports_pool.acquire
       @data_dir = "/tmp/rubium_profile_#{SecureRandom.hex}"
 
@@ -196,6 +235,26 @@ module Rubium
       @client.send_cmd "Page.enable"
 
       evaluate_on_new_document(options[:extension_code]) if options[:extension_code]
+
+      set_cookies(options[:cookies]) if options[:cookies]
+
+      if options[:urls_blacklist] || options[:disable_images]
+        urls = []
+
+        if options[:urls_blacklist]
+          urls += options[:urls_blacklist]
+        end
+
+        if options[:disable_images]
+          urls += %w(jpg jpeg png gif swf svg tif).map { |ext| ["*.#{ext}", "*.#{ext}?*"] }.flatten
+          urls << "data:image*"
+        end
+
+        @client.send_cmd "Network.setBlockedURLs", urls: urls
+      end
+
+
+      logger.info "Opened browser" if options[:enable_logger]
     end
 
     def convert_proxy(proxy_string)
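Review note on the `urls_blacklist` / `disable_images` hunk: the list passed to `Network.setBlockedURLs` is built from two optional sources. A minimal standalone sketch of that assembly follows. It assumes a hypothetical `blocked_url_patterns` helper and `IMAGE_EXTENSIONS` constant (neither exists in the gem as shown here) and uses `flat_map` in place of `map { ... }.flatten`, which yields the same list.

```ruby
# Hypothetical standalone helper mirroring the pattern-building logic in the diff.
IMAGE_EXTENSIONS = %w(jpg jpeg png gif swf svg tif).freeze

def blocked_url_patterns(urls_blacklist: nil, disable_images: false)
  urls = []
  urls += urls_blacklist if urls_blacklist

  if disable_images
    # Two wildcard patterns per extension: a bare suffix and one followed by a query string.
    urls += IMAGE_EXTENSIONS.flat_map { |ext| ["*.#{ext}", "*.#{ext}?*"] }
    # Inline images embedded as data URIs.
    urls << "data:image*"
  end

  urls
end

patterns = blocked_url_patterns(
  urls_blacklist: ["*some-domain.com*"],
  disable_images: true
)

puts patterns.size
puts patterns.first(3).inspect
```

Keeping the list construction separate from the DevTools call would make it easy to inspect which patterns end up blocked.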
data/lib/rubium/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: rubium
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Victor Afanasev
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2019-01-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: chrome_remote