rubium 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 124da1f21fad244bbdeb8adee803043896502e4c58964753bbba6ea8083df750
-   data.tar.gz: 13cda3fb2bd4f121700d08f03a8fbc3b15e38b785dca30074afa13121f4592b0
+   metadata.gz: 42ae899603df19fcd01dc7c565aabcc162be29b7e9a12e508ba1bf3b2c26002a
+   data.tar.gz: 3f5ee794cdb712200821de3b9ecd17cbe6edc085b70e1af393f88fab98ef3de1
  SHA512:
-   metadata.gz: e724053ddf9d97bbaf77db2c3647e1988c5bbace1a56b2f55f72afde1758ace0aaabd1f9d1771647361f8a863a6c33c35a8b616a7b8c4bd4f61278a750b33af0
-   data.tar.gz: 8decfacf86c9751c6ae39e61cae9b8fa1b29a3cb769f8ac3e11e15122364173bf0d20e649e058fe150023e1a7ea38ac13fd919d14abddef6fbfa961d64ca6987
+   metadata.gz: 8d36c239dbe64b1693756615fe7aadf111aa71646fa226320bb85c69945f79ae921f29533ed7f6f6d376a78c82ddd22d4e517aaef536732f8bc72b40ba7010fb
+   data.tar.gz: 80a67649a61f32bf80f4d28195f9da07ec3528c801b00a79908a8baa9103577a70ce7a29e374d66fbf7f9ecbfee8b103a2f7c03c88ad4ffd22f827c8162351e2
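The digests above could be checked against a downloaded archive roughly as follows. This is a minimal sketch using Ruby's standard `digest` and `yaml` libraries; the helper name `digests_match?` and its arguments are hypothetical and not part of the gem or of RubyGems.

```ruby
require "digest"
require "yaml"

# Hypothetical helper: returns true when the file at `path` matches both the
# SHA256 and SHA512 digests recorded for `entry` (e.g. "data.tar.gz") in a
# checksums.yaml document passed in as a string.
def digests_match?(checksums_yaml, entry, path)
  checksums = YAML.safe_load(checksums_yaml)

  checksums.fetch("SHA256").fetch(entry) == Digest::SHA256.file(path).hexdigest &&
    checksums.fetch("SHA512").fetch(entry) == Digest::SHA512.file(path).hexdigest
end
```

A call such as `digests_match?(File.read("checksums.yaml"), "data.tar.gz", "data.tar.gz")` would then report whether the unpacked archive matches the published values.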
data/README.md CHANGED
@@ -1,6 +1,8 @@
  # Rubium

- Rubium is a handy wrapper around [chrome_remote](https://github.com/cavalle/chrome_remote) gem. It adds browsers instances handling, and some Capybara-like methods. It is very lightweight (200 lines of code in the main `Rubium::Browser` class for now) and doens't use Selenium or Capybara.
+ > **Rubium updated to 0.2.0 version!** Added new options like `set_cookies`, `restart_after`, `urls_blacklist`, `disable_images` and others. Check the readme below:
+
+ Rubium is a handy wrapper around [chrome_remote](https://github.com/cavalle/chrome_remote) gem. It adds browsers instances handling, and some Capybara-like methods. It is very lightweight (250 lines of code in the main `Rubium::Browser` class for now) and doens't use Selenium or Capybara. Consider Rubium as a _very simple_ and _basic_ implementation of [Puppeteer](https://github.com/GoogleChrome/puppeteer) in Ruby language.

  You can use Rubium as a lightweight alternative to Selenium/Capybara/Watir if you need to perform some operations (like web scraping) using Headless Chromium and Ruby. Of course, the API currently doesn't has a lot of methods to automate browser, but it has the most frequently used and basic ones.

@@ -22,6 +24,12 @@ browser.click("some selector")
  # Get current cookies:
  browser.cookies

+ # Set cookies (Array of hashes):
+ browser.set_cookies([
+   { name: "some_cookie_name", value: "some_cookie_value", domain: ".some-cookie-domain.com" },
+   { name: "another_cookie_name", value: "another_cookie_value", domain: ".another-cookie-domain.com" }
+ ])
+
  # Fill in some field:
  browser.fill_in("some field selector", "Some text")

@@ -39,7 +47,8 @@ browser.evaluate_on_new_document(File.read "browser_inject.js")
  # Evaluate JS code expression:
  browser.execute_script("JS code string")

- # Access chrome_remote client directly:
+ # Access chrome_remote client (instance of ChromeRemote class) directly:
+ # See more here: https://github.com/cavalle/chrome_remote#using-the-chromeremote-api
  browser.client

  # Close browser:
@@ -49,18 +58,42 @@ browser.close
  browser.restart!
  ```

- There are some options which you can provide while creating browser instance:
+ **There are some options** which you can provide while creating browser instance:

  ```ruby
  browser = Rubium::Browser.new(
-   debugging_port: 9222, # custom debugging port
-   headless: false, # Run browser in normal (not headless) mode
-   user_agent: "Some user agent", # Custom user-agent
-   proxy_server: "http://1.1.1.1:8080", # Set proxy
+   debugging_port: 9222, # custom debugging port. Default is any available port.
+   headless: false, # Run browser in normal (not headless) mode. Default is headless.
+   window_size: [1600, 900], # Custom window size. Default is unset.
+   user_agent: "Some user agent", # Custom user-agent.
+   proxy_server: "http://1.1.1.1:8080", # Set proxy.
+   extension_code: "Some JS code string", # Inject custom JS code on each page. See above `evaluate_on_new_document`
+   cookies: [], # Set custom cookies, see above `set_cookies`
+   restart_after: 25, # Automatically restart browser after N processed requests
+   enable_logger: true, # Enable logger to log info about processing requests
+   max_timeout: 30, # How long to wait (in seconds) until page will be fully loaded. Default 60 sec.
+   urls_blacklist: ["*some-domain.com*"], # Skip all requests which match provided patterns (wildcard allowed).
+   disable_images: true # Do not download images.
  )
  ```

- You can provide custom Chrome binary path this way:
+ Note that for options `user_agent` and `proxy_server` you can provide `lambda` object instead of string:
+
+ ```ruby
+ USER_AGENTS = ["Safari", "Mozilla", "IE", "Chrome"]
+ PROXIES = ["http://1.1.1.1:8080", "http://2.2.2.2:8080", "http://3.3.3.3:8080"]
+
+ browser = Rubium::Browser.new(
+   user_agent: -> { USER_AGENTS.sample },
+   proxy_server: -> { PROXIES.sample },
+   restart_after: 25
+ )
+ ```
+
+ > What for: Chrome doesn't provide an API to change proxies on the fly (after browser has been started). It is possible to set proxy while starting Chrome instance by providing CLI argument only. On the other hand, Rubium allows you to automatically restart browser (`restart_after` option) after N processed requests. On each restart, if options `user_agent` and/or `proxy_server` has lambda format, then lambda will be called to fetch fresh value. Thus it's possible to rotate proxies/user-agents without any much effort.
+
+
+ **You can provide custom Chrome binary** path this way:

  ```ruby
  Rubium.configure do |config|
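The rotation behaviour described in the README hunk above (lambda options re-evaluated on each restart, with a restart triggered after N processed requests) can be illustrated without launching Chrome. The following is only a sketch: `RotatingSession` and `resolve_option` are hypothetical names, and the class tracks settings and a counter rather than driving a real browser.

```ruby
# Resolves an option that may be either a plain value or a lambda.
def resolve_option(value)
  value.respond_to?(:call) ? value.call : value
end

# Toy stand-in for a browser: it only tracks settings and a request counter.
class RotatingSession
  attr_reader :user_agent, :proxy_server, :restarts

  def initialize(options = {})
    @options = options
    @restarts = 0
    start
  end

  # Restart first if the previous session already handled `restart_after`
  # requests, then count the current request.
  def visit(_url)
    restart! if @options[:restart_after] && @processed >= @options[:restart_after]
    @processed += 1
  end

  def restart!
    @restarts += 1
    start
  end

  private

  # Each (re)start resets the counter and re-evaluates lambda options,
  # which is what allows a fresh proxy or user agent to be picked up.
  def start
    @processed = 0
    @user_agent = resolve_option(@options[:user_agent])
    @proxy_server = resolve_option(@options[:proxy_server])
  end
end
```

With `restart_after: 2`, the third `visit` call triggers a restart, and a lambda passed as `proxy_server` is called again at that point while a plain string stays unchanged.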
@@ -4,6 +4,7 @@ require 'random-port'
  require 'cliver'
  require 'timeout'
  require 'securerandom'
+ require 'logger'

  at_exit do
    Rubium::Browser.running_pids.each { |pid| Process.kill("HUP", pid) }
@@ -14,6 +15,7 @@ module Rubium
      class ConfigurationError < StandardError; end

      MAX_CONNECT_WAIT_TIME = 2
+     MAX_DEFAULT_TIMEOUT = 60

      class << self
        def ports_pool
@@ -25,26 +27,38 @@ module Rubium
        end
      end

-     attr_reader :client, :devtools_url, :pid, :port, :options
+     attr_reader :client, :devtools_url, :pid, :port, :options, :processed_requests_count, :logger

      def initialize(options = {})
        @options = options
+
+       if @options[:enable_logger]
+         @logger = Logger.new(STDOUT)
+         @logger.progname = self.class.to_s
+       end
+
        create_browser
      end

      def restart!
+       logger.info "Restarting..." if options[:enable_logger]
+
        close
        create_browser
      end

      def close
-       unless closed?
+       if closed?
+         logger.info "Browser already has been closed" if options[:enable_logger]
+       else
          Process.kill("HUP", @pid)
          self.class.running_pids.delete(@pid)
          self.class.ports_pool.release(@port)

          FileUtils.rm_rf(@data_dir) if Dir.exist?(@data_dir)
          @closed = true
+
+         logger.info "Closed browser" if options[:enable_logger]
        end
      end

@@ -54,14 +68,28 @@ module Rubium
        @closed
      end

-     def goto(url, wait: 30)
+     def goto(url, wait: options[:max_timeout] || MAX_DEFAULT_TIMEOUT)
+       logger.info "Started request: #{url}" if options[:enable_logger]
+       if options[:restart_after] && processed_requests_count >= options[:restart_after]
+         restart!
+       end
+
        response = @client.send_cmd "Page.navigate", url: url

-       if wait
-         Timeout.timeout(wait) { @client.wait_for "Page.loadEventFired" }
-       else
-         response
+       # By default, after Page.navigate we should wait till page will load completely
+       # using Page.loadEventFired. But on some websites with Ajax navigation, Page.loadEventFired
+       # will stuck forever. In this case you can provide `wait: false` option to skip waiting.
+       if wait != false
+         # https://chromedevtools.github.io/devtools-protocol/tot/Page#event-frameStoppedLoading
+         Timeout.timeout(wait) do
+           @client.wait_for do |event_name, event_params|
+             event_name == "Page.frameStoppedLoading" && event_params["frameId"] == response["frameId"]
+           end
+         end
        end
+
+       @processed_requests_count += 1
+       logger.info "Finished request: #{url}" if options[:enable_logger]
      end

      alias_method :visit, :goto
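The new waiting logic in `goto` (block until a `Page.frameStoppedLoading` event arrives for the navigated frame, bounded by a timeout) can be exercised in isolation. This sketch assumes a stand-in client: `FakeClient` and `wait_until_frame_stopped` are hypothetical names, and `FakeClient#wait_for` only mimics the block form used in the hunk above, not the real chrome_remote client.

```ruby
require "timeout"

# Hypothetical stand-in for a DevTools client: replays a fixed list of events.
class FakeClient
  def initialize(events)
    @events = events
  end

  # Yields [name, params] pairs until the block returns true, then returns
  # the matching params. If nothing matches, it blocks so that the caller's
  # timeout is what ends the wait.
  def wait_for
    @events.each do |name, params|
      return params if yield(name, params)
    end
    sleep
  end
end

# Waits for the frame identified by `frame_id` to stop loading, raising
# Timeout::Error if that does not happen within `timeout_seconds`.
def wait_until_frame_stopped(client, frame_id, timeout_seconds)
  Timeout.timeout(timeout_seconds) do
    client.wait_for do |event_name, event_params|
      event_name == "Page.frameStoppedLoading" && event_params["frameId"] == frame_id
    end
  end
end
```

Matching on `frameId` means events from other frames (iframes, for example) are skipped, so only the frame returned by `Page.navigate` ends the wait.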
@@ -106,21 +134,21 @@ module Rubium
      end

      def click(selector)
-       @client.send_cmd "Runtime.evaluate", expression: <<~JS
+       @client.send_cmd "Runtime.evaluate", expression: <<~js
          document.querySelector("#{selector}").click();
-       JS
+       js
      end

      # https://github.com/cyrus-and/chrome-remote-interface/issues/226#issuecomment-320247756
      # https://stackoverflow.com/a/18937620
      def send_key_on(selector, key)
-       @client.send_cmd "Runtime.evaluate", expression: <<~JS
+       @client.send_cmd "Runtime.evaluate", expression: <<~js
          document.querySelector("#{selector}").dispatchEvent(
            new KeyboardEvent("keydown", {
              bubbles: true, cancelable: true, keyCode: #{key}
            })
          );
-       JS
+       js
      end

      # https://github.com/GoogleChrome/puppeteer/blob/master/lib/Page.js#L784
@@ -130,15 +158,24 @@ module Rubium
        @client.send_cmd "Page.addScriptToEvaluateOnNewDocument", source: script
      end

+     ###
+
      def cookies
        response = @client.send_cmd "Network.getCookies"
        response["cookies"]
      end

+     # https://chromedevtools.github.io/devtools-protocol/tot/Network#method-setCookies
+     def set_cookies(cookies)
+       @client.send_cmd "Network.setCookies", cookies: cookies
+     end
+
+     ###
+
      def fill_in(selector, text)
-       execute_script <<~HEREDOC
+       execute_script <<~js
          document.querySelector("#{selector}").value = "#{text}"
-       HEREDOC
+       js
      end

      def execute_script(script)
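One thing worth noting about `fill_in` in the hunk above: `selector` and `text` are interpolated directly between double quotes, so a value that itself contains a double quote would produce an invalid JS expression. A possible alternative, shown here only as a sketch and not as what the gem does, is to emit both arguments as JSON string literals. The helper name `fill_in_expression` is hypothetical.

```ruby
require "json"

# Hypothetical helper: builds the JS expression for filling a field, with
# both arguments escaped as JSON string literals so that embedded quotes
# and backslashes cannot break out of the string.
def fill_in_expression(selector, text)
  "document.querySelector(#{selector.to_json}).value = #{text.to_json}"
end
```

The resulting string could then be passed to `execute_script` in place of the heredoc.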
@@ -148,6 +185,8 @@ module Rubium
      private

      def create_browser
+       @processed_requests_count = 0
+
        @port = options[:debugging_port] || self.class.ports_pool.acquire
        @data_dir = "/tmp/rubium_profile_#{SecureRandom.hex}"

@@ -196,6 +235,26 @@ module Rubium
        @client.send_cmd "Page.enable"

        evaluate_on_new_document(options[:extension_code]) if options[:extension_code]
+
+       set_cookies(options[:cookies]) if options[:cookies]
+
+       if options[:urls_blacklist] || options[:disable_images]
+         urls = []
+
+         if options[:urls_blacklist]
+           urls += options[:urls_blacklist]
+         end
+
+         if options[:disable_images]
+           urls += %w(jpg jpeg png gif swf svg tif).map { |ext| ["*.#{ext}", "*.#{ext}?*"] }.flatten
+           urls << "data:image*"
+         end
+
+         @client.send_cmd "Network.setBlockedURLs", urls: urls
+       end
+
+
+       logger.info "Opened browser" if options[:enable_logger]
      end

      def convert_proxy(proxy_string)
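The pattern list that the hunk above sends to `Network.setBlockedURLs` can be reproduced as a standalone function to see exactly which wildcards result from `urls_blacklist` and `disable_images`. This is a sketch mirroring the added lines; `blocked_url_patterns` and `IMAGE_EXTENSIONS` are hypothetical names, not part of the gem.

```ruby
# Extensions taken from the hunk above.
IMAGE_EXTENSIONS = %w[jpg jpeg png gif swf svg tif].freeze

# Hypothetical helper: combines user-supplied blacklist patterns with the
# image patterns, in the same order as the code in the hunk.
def blocked_url_patterns(urls_blacklist: nil, disable_images: false)
  patterns = Array(urls_blacklist).dup

  if disable_images
    # Two patterns per extension: a bare suffix and one followed by a query string.
    patterns += IMAGE_EXTENSIONS.flat_map { |ext| ["*.#{ext}", "*.#{ext}?*"] }
    patterns << "data:image*"
  end

  patterns
end
```

For example, `blocked_url_patterns(urls_blacklist: ["*some-domain.com*"], disable_images: true)` yields the user pattern first, then fourteen extension patterns, then `"data:image*"`.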
@@ -1,3 +1,3 @@
  module Rubium
-   VERSION = "0.1.0"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: rubium
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.2.0
  platform: ruby
  authors:
  - Victor Afanasev
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2018-12-19 00:00:00.000000000 Z
+ date: 2019-01-05 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: chrome_remote