sunbro 0.0.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: b1c6c6f4683dda9c20055c2f766462986b23bc86
4
+ data.tar.gz: 27a8682eae376d1cb0675e99e56a04ebe89ee832
5
+ SHA512:
6
+ metadata.gz: cc97aa66162c983c490bd713c372624bdf65e66145ddb6c51e2de4984002ed10c984d97069345c0a6bfd54c7c1a12fd4edc2ae4efdc1aaffef7d4a59da7249f9
7
+ data.tar.gz: 587c366d59f326b0141e231b19b42aa1b5b148acb717fac8b7d01c48e8cf4476d83ea499a208283a785f43dd1f556ac569eac4b7307d95406cfa5bd937b2df6b
data/.gitignore ADDED
@@ -0,0 +1,19 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ .rspec
7
+ Gemfile.lock
8
+ InstalledFiles
9
+ _yardoc
10
+ coverage
11
+ doc/
12
+ lib/bundler/man
13
+ pkg
14
+ rdoc
15
+ spec/reports
16
+ test/tmp
17
+ test/version_tmp
18
+ tmp
19
+ .idea/
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in sunbro.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2014 Jon Stokes
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,120 @@
1
+ # Sunbro
2
+
3
+ Some code that I use to crawl the web at scale with Poltergeist and
4
+ PhantomJS (cf. [stretched.io](https://github.com/jonstokes/stretched.io)). Uses a bunch of code from the venerable [anemone gem](https://github.com/chriskite/anemone). Released in the spirit of jolly cooperation.
5
+
6
+ ## Installation
7
+
8
+ Add this line to your application's Gemfile:
9
+
10
+ gem 'sunbro'
11
+
12
+ And then execute:
13
+
14
+ $ bundle
15
+
16
+ Or install it yourself as:
17
+
18
+ $ gem install sunbro
19
+
20
+ ## Usage
21
+
22
+ I use sunbro to crawl the web at scale via Sidekiq on EC2. I've found
23
+ that web scraping with capybara/poltergeist + phantomjs is a giant pain
24
+ on JRuby (for various reasons that you'll encounter once you try it),
25
+ and this gem is basically my collection of fixes that makes it actually
26
+ work. And it works pretty well; I use it in production to crawl
27
+ 230 sites and counting.
28
+
29
+ Here's an example of a worker that looks something like what you might find in my code:
30
+
31
+ ```ruby
32
+ class CrawlerWorker
33
+
34
+ def perform(opts)
35
+ @connection = Sunbro::Connection.new
36
+ return unless @links = opts[:links]
37
+
38
+ @links.each do |link|
39
+ next unless page = @connection.get_page(link)
40
+ puts "Page #{page.url} returned code #{page.code} with body size #{page.body.size}"
41
+ end
42
+
43
+ ensure
44
+ @connection.close
45
+ end
46
+
47
+ end
48
+ ```
49
+
50
+ The above uses `net-http` to fetch pages, and it pools the
50
+ connections. This is all you need most of the time. However, if you're scraping
52
+ a page that is AJAX-heavy, that's where you'll get the most out of sunbro.
53
+ To use phantomjs to scrape a page, you'll want to call `connection.render_page(link)`.
54
+ This renders the JS on the page, but doesn't download any images.
55
+
56
+ The one option to either `get_page` or `render_page` is
57
+ `:force_format`, which can be one of `:html`, `:xml`, or `:auto`. If the
58
+ option is set to `:html`, then `Nokogiri::HTML` will be used to parse
59
+ `page.body`; if it's set to `:xml`, then `Nokogiri::XML` is used. If
60
+ it's set to `:auto` or `nil`, `Nokogiri.parse` is called.
61
+
62
+ ## Configuration
63
+
64
+ You can configure a few options in a `config/initializers/sunbro.rb`
65
+ file, as follows:
66
+
67
+ ```ruby
68
+ Sunbro::Settings.configure do |config|
69
+ config.user_agent = ENV['USER_AGENT_STRING1']
70
+ config.phantomjs_user_agent = ENV['USER_AGENT_STRING2']
71
+ config.page_format = :auto
72
+ end
73
+ ```
74
+
75
+ ## PhantomJS zombie process monkey patch
76
+
77
+ I use the following monkey patch for PhantomJS, because it has zombie
78
+ process issues when it comes to JRuby. This monkey patch kills some minor
79
+ PhantomJS functionality that I don't use, and you can read more about
80
+ what it does and why in [this blog post](http://jonstokes.com/2014/07/07/monkey-patching-poltergeist-for-web-scraping-with-jruby/).
81
+
82
+ I put this in `config/initializers/phantomjs.rb`:
83
+
84
+ ```ruby
85
+ require "capybara"
86
+ require "capybara/poltergeist"
87
+ require "capybara/poltergeist/utility"
88
+
89
+ module Capybara::Poltergeist
90
+ Client.class_eval do
91
+ def start
92
+ @pid = Process.spawn(*command.map(&:to_s), pgroup: true)
93
+ ObjectSpace.define_finalizer(self, self.class.process_killer(@pid))
94
+ end
95
+
96
+ def stop
97
+ if pid
98
+ kill_phantomjs
99
+ ObjectSpace.undefine_finalizer(self)
100
+ end
101
+ end
102
+ end
103
+ end
104
+ ```
105
+
106
+ ## Next steps
107
+
108
+ Right now, this is more of a bag of code than a bona fide user-friendly
109
+ gem. One next step would be to add some configuration options for PhantomJS
110
+ that get passed via `render_page` to poltergeist and then on to the
111
+ command line. Another would be to use `net-http-persistent`, which is
112
+ actually included here as a dependency but isn't yet used.
113
+
114
+ ## Contributing
115
+
116
+ 1. Fork it ( http://github.com/<my-github-username>/sunbro/fork )
117
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
118
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
119
+ 4. Push to the branch (`git push origin my-new-feature`)
120
+ 5. Create new Pull Request
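The parser choice that the README's `:force_format` paragraph describes can be sketched without Nokogiri. This is a minimal illustration, not the gem's code: `parser_for` is a hypothetical helper that returns a symbol naming the parser that would be picked (`:xml` for `Nokogiri::XML`, `:html` for `Nokogiri::HTML`, `:auto` for `Nokogiri.parse`), on the assumption that an unforced page falls back to its content type.

```ruby
# Content types treated as XML or HTML when no format is forced (assumed lists).
XML_TYPES  = %r{\A(text/xml|application/xml)\b}
HTML_TYPES = %r{\A(text/html|application/xhtml\+xml)\b}

# Hypothetical helper: name the parser for a page, given its content type
# and an optional forced format (:xml, :html, :auto, or nil).
def parser_for(content_type, force_format = nil)
  type = content_type.to_s
  return nil if type.start_with?('image/')        # images are never parsed
  return force_format if [:xml, :html].include?(force_format)
  return :auto unless force_format.nil?           # e.g. an explicit :auto
  return :xml  if type.match?(XML_TYPES)
  return :html if type.match?(HTML_TYPES)
  :auto
end
```

For example, `parser_for('text/html', :xml)` picks `:xml` because a forced format wins over the content type, while `parser_for('application/json')` falls through to `:auto`.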
data/Rakefile ADDED
@@ -0,0 +1 @@
1
+ require "bundler/gem_tasks"
data/lib/sunbro.rb ADDED
@@ -0,0 +1,44 @@
1
+ require 'nokogiri'
2
+ require 'capybara/poltergeist'
3
+ require 'net/http/persistent'
4
+ require 'webrick/cookie'
5
+
6
+ %w(
7
+ sunbro/version
8
+ sunbro/settings
9
+ sunbro/dynamic_http
10
+ sunbro/http
11
+ sunbro/page
12
+ ).each do |f|
13
+ require f
14
+ end
15
+
16
+ module Sunbro
17
+ MAX_RETRIES = 5
18
+
19
+ def get_page(link, opts={})
20
+ @http ||= HTTP.new
21
+ fetch_with_connection(@http, link, opts)
22
+ end
23
+
24
+ def render_page(link, opts={})
25
+ @dhttp ||= DynamicHTTP.new
26
+ fetch_with_connection(@dhttp, link, opts)
27
+ end
28
+
29
+ def fetch_with_connection(conn, link, opts)
30
+ page, tries = nil, MAX_RETRIES
31
+ begin
32
+ page = conn.fetch_page(link, opts)
33
+ sleep 1
34
+ end until page.try(:present?) || (tries -= 1).zero?
35
+ page.discard_doc! if page && !page.is_valid?
36
+ page
37
+ end
38
+
39
+ def close_http_connections
40
+ @http.close if @http
41
+ @dhttp.close if @dhttp
42
+ rescue IOError
43
+ end
44
+ end
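The bounded retry in `fetch_with_connection` can be tried in isolation. The following is a minimal sketch under stated assumptions: `FlakyConnection` and `fetch_with_retries` are hypothetical names, the connection returns `nil` on failure, and the pause happens only between failed attempts (the module above sleeps after every attempt).

```ruby
MAX_RETRIES = 5

# Hypothetical stand-in for a connection that fails a fixed number of times.
class FlakyConnection
  attr_reader :calls

  def initialize(failures)
    @failures = failures
    @calls = 0
  end

  # Returns nil until the configured number of failures has been used up.
  def fetch_page(_link)
    @calls += 1
    @calls > @failures ? '<html>ok</html>' : nil
  end
end

# Try up to max_tries times; return the first non-nil page, or nil.
def fetch_with_retries(conn, link, max_tries: MAX_RETRIES, pause: 0)
  page = nil
  max_tries.times do
    page = conn.fetch_page(link)
    break if page
    sleep pause
  end
  page
end
```

A connection that fails twice succeeds on the third call; one that always fails is abandoned after `MAX_RETRIES` attempts and yields `nil`, so callers still need a nil check.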
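data/lib/sunbro/connection.rb ADDED
@@ -0,0 +1,12 @@
1
+ module Sunbro
2
+ class Connection
3
+ # Mix in get_page, render_page and close_http_connections as instance methods
4
+ include Sunbro
5
+ attr_reader :http, :dhttp
6
+
7
+ def close
8
+ close_http_connections
9
+ end
10
+ end
11
+ end
12
+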
data/lib/sunbro/dynamic_http.rb ADDED
@@ -0,0 +1,99 @@
1
+ module Sunbro
2
+ class DynamicHTTP
3
+ attr_reader :session
4
+
5
+ def initialize(opts = {})
6
+ @opts = opts
7
+ new_session
8
+ end
9
+
10
+ def close
11
+ @session.driver.quit
12
+ end
13
+
14
+ def new_session
15
+ Capybara.register_driver :poltergeist do |app|
16
+ Capybara::Poltergeist::Driver.new(
17
+ app,
18
+ js_errors: false,
19
+ phantomjs_options: ['--load-images=no', '--ignore-ssl-errors=yes']
20
+ )
21
+ end
22
+ Capybara.default_driver = :poltergeist
23
+ Capybara.javascript_driver = :poltergeist
24
+ Capybara.run_server = false
25
+ @session = Capybara::Session.new(:poltergeist)
26
+ @session.driver.headers = {
27
+ 'User-Agent' => user_agent
28
+ }
29
+ @session
30
+ end
31
+
32
+ def user_agent
33
+ @opts[:agent] || Settings.phantomjs_user_agent
34
+ end
35
+
36
+ def restart_session
37
+ close
38
+ new_session
39
+ end
40
+
41
+ #
42
+ # Create new Pages from the response of an HTTP request to *url*,
43
+ # including redirects
44
+ #
45
+ def fetch_page(url, opts={})
46
+ begin
47
+ tries ||= 5
48
+ get_page(url, opts)
49
+ rescue Capybara::Poltergeist::DeadClient, Errno::EPIPE, NoMethodError, Capybara::Poltergeist::BrowserError => e
50
+ restart_session
51
+ retry unless (tries -= 1).zero?
52
+ close
53
+ raise e
54
+ end
55
+ end
56
+
57
+ def get_page(url, opts)
58
+ reset = opts.fetch(:reset, true)
59
+ session.visit(url.to_s)
60
+ page = create_page_from_session(url, session, opts)
61
+ session.reset! if reset
62
+ page
63
+ rescue Capybara::Poltergeist::TimeoutError => e
64
+ restart_session
65
+ return Page.new(url, :error => e)
66
+ end
67
+
68
+ private
69
+
70
+ def create_page_from_session(url, session, opts)
71
+ url = url.to_s
72
+ if url == session.current_url
73
+ Page.new(
74
+ session.current_url,
75
+ :body => session.html.dup,
76
+ :code => session.status_code,
77
+ :headers => session.response_headers,
78
+ :force_format => (opts[:force_format] || default_page_format)
79
+ )
80
+ else
81
+ Page.new(
82
+ session.current_url,
83
+ :body => session.html.dup,
84
+ :code => 301,
85
+ :redirect_from => url,
86
+ :headers => session.response_headers,
87
+ :force_format => (opts[:force_format] || default_page_format)
88
+ )
89
+ end
90
+ end
91
+
92
+ def default_page_format
93
+ # Don't force the page format if the default format is set to :auto
94
+ return unless [:xml, :html].include? Settings.page_format
95
+ Settings.page_format
96
+ end
97
+
98
+ end
99
+ end
data/lib/sunbro/http.rb ADDED
@@ -0,0 +1,217 @@
1
+ module Sunbro
2
+ class HTTP
3
+ # Maximum number of redirects to follow on each get_response
4
+ REDIRECT_LIMIT = 5
5
+
6
+ def initialize(opts = {})
7
+ @connections = {}
8
+ @opts = opts
9
+ end
10
+
11
+ def close
12
+ @connections.each do |host, ports|
13
+ ports.each do |port, connection|
14
+ connection.finish
15
+ end
16
+ end
17
+ end
18
+
19
+ #
20
+ # Fetch a single Page from the response of an HTTP request to *url*.
21
+ # Just gets the final destination page.
22
+ #
23
+ def fetch_page(url, opts={})
24
+ original_url = url.dup
25
+ pages = fetch_pages(url, opts)
26
+ if pages.count == 1
27
+ page = pages.first
28
+ page.url = original_url
29
+ page
30
+ else
31
+ page = pages.last
32
+ page.redirect_from = original_url
33
+ page
34
+ end
35
+ end
36
+
37
+ #
38
+ # Create new Pages from the response of an HTTP request to *url*,
39
+ # including redirects
40
+ #
41
+ def fetch_pages(url, opts={})
42
+ referer, depth = opts[:referer], opts[:depth]
43
+ force_format = opts[:force_format] || default_page_format
44
+ begin
45
+ url = convert_to_uri(url) unless url.is_a?(URI)
46
+ pages = []
47
+ get(url, referer) do |response, code, location, redirect_to, response_time|
48
+ pages << Page.new(location, :body => response.body.dup,
49
+ :code => code,
50
+ :headers => response.to_hash,
51
+ :referer => referer,
52
+ :depth => depth,
53
+ :redirect_to => redirect_to,
54
+ :response_time => response_time,
55
+ :force_format => force_format)
56
+ end
57
+
58
+ return pages
59
+ rescue StandardError => e
60
+ if verbose?
61
+ puts e.inspect
62
+ puts e.backtrace
63
+ end
64
+ return [Page.new(url, :error => e)]
65
+ end
66
+ end
67
+
68
+ #
69
+ # Convert the link to a valid URI if possible
70
+ #
71
+ def convert_to_uri(url)
72
+ URI(url)
73
+ rescue URI::InvalidURIError
74
+ URI(URI.escape(url))
75
+ end
76
+
77
+ #
78
+ # The maximum number of redirects to follow
79
+ #
80
+ def redirect_limit
81
+ @opts[:redirect_limit] || REDIRECT_LIMIT
82
+ end
83
+
84
+ #
85
+ # The user-agent string which will be sent with each request,
86
+ # or nil if no such option is set
87
+ #
88
+ def user_agent
89
+ @opts[:agent] || Settings.user_agent
90
+ end
91
+
92
+ #
93
+ # Does this HTTP client accept cookies from the server?
94
+ #
95
+ def accept_cookies?
96
+ @opts[:accept_cookies]
97
+ end
98
+
99
+ #
100
+ # The proxy address string
101
+ #
102
+ def proxy_host
103
+ @opts[:proxy_host]
104
+ end
105
+
106
+ #
107
+ # The proxy port
108
+ #
109
+ def proxy_port
110
+ @opts[:proxy_port]
111
+ end
112
+
113
+ #
114
+ # HTTP read timeout in seconds
115
+ #
116
+ def read_timeout
117
+ @opts[:read_timeout]
118
+ end
119
+
120
+ private
121
+
122
+ #
123
+ # Retrieve HTTP responses for *url*, including redirects.
124
+ # Yields the response object, response code, and URI location
125
+ # for each response.
126
+ #
127
+ def get(url, referer = nil)
128
+ limit = redirect_limit
129
+ loc = url
130
+ begin
131
+ # if redirected to a relative url, merge it with the host of the original
132
+ # request url
133
+ loc = url.merge(loc) if loc.relative?
134
+
135
+ response, response_time = get_response(loc, referer)
136
+ code = Integer(response.code)
137
+ redirect_to = response.is_a?(Net::HTTPRedirection) ? URI(response['location']).normalize : nil
138
+ yield response, code, loc, redirect_to, response_time
139
+ limit -= 1
140
+ end while (loc = redirect_to) && allowed?(redirect_to, url) && limit > 0
141
+ end
142
+
143
+ #
144
+ # Get an HTTPResponse for *url*, sending the appropriate User-Agent string
145
+ #
146
+ def get_response(url, referer = nil)
147
+ full_path = url.query.nil? ? url.path : "#{url.path}?#{url.query}"
148
+
149
+ opts = {}
150
+ opts['User-Agent'] = user_agent if user_agent
151
+ opts['Referer'] = referer.to_s if referer
152
+
153
+ retries = 0
154
+ begin
155
+ start = Time.now()
156
+ # format request
157
+ req = Net::HTTP::Get.new(full_path, opts)
158
+ # HTTP Basic authentication
159
+ req.basic_auth url.user, url.password if url.user
160
+ response = connection(url).request(req)
161
+ finish = Time.now()
162
+ response_time = ((finish - start) * 1000).round
163
+ return response, response_time
164
+ rescue Timeout::Error, Net::HTTPBadResponse, EOFError => e
165
+ puts e.inspect if verbose?
166
+ refresh_connection(url)
167
+ retries += 1
168
+ retry unless retries > 3
169
+ end
170
+ end
171
+
172
+ def connection(url)
173
+ @connections[url.host] ||= {}
174
+
175
+ if conn = @connections[url.host][url.port]
176
+ return conn
177
+ end
178
+
179
+ refresh_connection url
180
+ end
181
+
182
+ def refresh_connection(url)
183
+ http = Net::HTTP.new(url.host, url.port, proxy_host, proxy_port)
184
+
185
+ http.read_timeout = read_timeout if !!read_timeout
186
+
187
+ if url.scheme == 'https'
188
+ http.use_ssl = true
189
+ http.verify_mode = OpenSSL::SSL::VERIFY_NONE
190
+ end
191
+
192
+ @connections[url.host][url.port] = http.start
193
+ end
194
+
195
+ def verbose?
196
+ @opts[:verbose]
197
+ end
198
+
199
+ #
200
+ # Allowed to connect to the requested url?
201
+ #
202
+ def allowed?(to_url, from_url)
203
+ to_url.host.nil? || (to_url.host.sub("www.","") == from_url.host.sub("www.",""))
204
+ rescue
205
+ true
206
+ end
207
+
208
+ private
209
+
210
+ def default_page_format
211
+ # Don't force the page format if the default format is set to :auto
212
+ return unless [:xml, :html].include? Settings.page_format
213
+ Settings.page_format
214
+ end
215
+
216
+ end
217
+ end
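The redirect handling in `get` and `allowed?` (follow at most `REDIRECT_LIMIT` hops, stay on the same host modulo a `www.` prefix, resolve relative targets against the original URL) can be sketched without any network access. This is an illustration only: `follow`, `same_site?` and the `responses` lookup table are hypothetical, and the `www.` prefix is stripped only at the start of the host, which is slightly stricter than the `sub("www.", "")` call above.

```ruby
require 'uri'

REDIRECT_LIMIT = 5

# True when both URIs point at the same host, ignoring a leading "www.".
def same_site?(to, from)
  to.host.nil? || to.host.sub(/\Awww\./, '') == from.host.sub(/\Awww\./, '')
end

# `responses` is a hypothetical table mapping a URL string to its redirect
# target (absolute or relative), or to nil when the page is final.
# Returns the list of URLs that would be requested, in order.
def follow(start, responses, limit: REDIRECT_LIMIT)
  origin = URI(start)
  loc = origin
  chain = []
  while limit > 0
    chain << loc.to_s
    target = responses[loc.to_s]
    break unless target
    nxt = origin.merge(target)          # resolves relative targets
    break unless same_site?(nxt, origin)
    loc = nxt
    limit -= 1
  end
  chain
end
```

With a table that redirects `/a` to `/b`, then to `http://www.example.com/c`, then off-site, the chain stops at the third URL; a page that redirects to itself is requested `REDIRECT_LIMIT` times and then abandoned.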
data/lib/sunbro/page.rb ADDED
@@ -0,0 +1,242 @@
1
+ module Sunbro
2
+ class Page
3
+
4
+ # The URL of the page
5
+ attr_accessor :url
6
+ # The raw HTTP response body of the page
7
+ attr_reader :body
8
+ # Headers of the HTTP response
9
+ attr_reader :headers
10
+ # URL of the page this one redirected to, if any
11
+ attr_reader :redirect_to
12
+ # Exception object, if one was raised during HTTP#fetch_page
13
+ attr_reader :error
14
+
15
+ # Integer response code of the page
16
+ attr_accessor :code
17
+ # Boolean indicating whether or not this page has been visited in PageStore#shortest_paths!
18
+ attr_accessor :visited
19
+ # Depth of this page from the root of the crawl. This is not necessarily the
20
+ # shortest path; use PageStore#shortest_paths! to find that value.
21
+ attr_accessor :depth
22
+ # URL of the page that brought us to this page
23
+ attr_accessor :referer
24
+ # Response time of the request for this page in milliseconds
25
+ attr_accessor :response_time
26
+
27
+ attr_accessor :redirect_from
28
+
29
+ #
30
+ # Create a new page
31
+ #
32
+ def initialize(url, params = {})
33
+ @url = url
34
+
35
+ @code = params[:code]
36
+ @headers = params[:headers] || {}
37
+ @headers['content-type'] ||= ['']
38
+ @aliases = Array(params[:aka]).compact
39
+ @referer = params[:referer]
40
+ @depth = params[:depth] || 0
41
+ @redirect_to = to_absolute(params[:redirect_to])
42
+ @response_time = params[:response_time]
43
+ @error = params[:error]
44
+ @fetched = !params[:code].nil?
45
+ @force_format = params[:force_format]
46
+ @body = params[:body]
47
+ @redirect_from = params[:redirect_from]
48
+ end
49
+
50
+ #
51
+ # Nokogiri document for the HTML body
52
+ #
53
+ def doc
54
+ @doc ||= begin
55
+ if image?
56
+ nil
57
+ elsif should_parse_as?(:xml)
58
+ Nokogiri::XML(@body, @url.to_s)
59
+ elsif should_parse_as?(:html)
60
+ Nokogiri::HTML(@body, @url.to_s)
61
+ elsif @body
62
+ Nokogiri.parse(@body, @url.to_s)
63
+ end
64
+ end
65
+ end
66
+
67
+ def is_valid?
68
+ (url != "about:blank") && !not_found? && present?
69
+ end
70
+
71
+ def present?
72
+ !error && code && body.present? && doc
73
+ end
74
+
75
+ #
76
+ # Delete the Nokogiri document and response body to conserve memory
77
+ #
78
+ def discard_doc!
79
+ @doc = @body = nil
80
+ end
81
+
82
+ #
83
+ # Was the page successfully fetched?
84
+ # +true+ if the page was fetched with no error, +false+ otherwise.
85
+ #
86
+ def fetched?
87
+ @fetched
88
+ end
89
+
90
+ #
91
+ # Array of cookies received with this page as WEBrick::Cookie objects.
92
+ #
93
+ def cookies
94
+ WEBrick::Cookie.parse_set_cookies(@headers['Set-Cookie']) rescue []
95
+ end
96
+
97
+ #
98
+ # The content-type returned by the HTTP request for this page
99
+ #
100
+ def content_type
101
+ headers['content-type'].first
102
+ end
103
+
104
+ #
105
+ # Returns +true+ if the page is an image, returns +false+
106
+ # otherwise.
107
+ #
108
+ def image?
109
+ !!(content_type =~ %r{^(image/)\b})
110
+ end
111
+
112
+ #
113
+ # Returns +true+ if the page is an HTML document, returns +false+
114
+ # otherwise.
115
+ #
116
+ def html?
117
+ !!(content_type =~ %r{^(text/html|application/xhtml\+xml)\b})
118
+ end
119
+
120
+ #
121
+ # Returns +true+ if the page is an XML document, returns +false+
122
+ # otherwise.
123
+ #
124
+ def xml?
125
+ !!(content_type =~ %r{^(text/xml|application/xml)\b})
126
+ end
127
+
128
+ #
129
+ # Returns +true+ if the page is an HTTP redirect, returns +false+
130
+ # otherwise.
131
+ #
132
+ def redirect?
133
+ (300..307).include?(@code)
134
+ end
135
+
136
+ #
137
+ # Returns +true+ if the page was not found (returned 404 code),
138
+ # returns +false+ otherwise.
139
+ #
140
+ def not_found?
141
+ 404 == @code
142
+ end
143
+
144
+ #
145
+ # Base URI from the HTML doc head element
146
+ # http://www.w3.org/TR/html4/struct/links.html#edef-BASE
147
+ #
148
+ def base
149
+ @base = if doc
150
+ href = doc.search('//head/base/@href')
151
+ URI(href.to_s) unless href.nil? rescue nil
152
+ end unless @base
153
+
154
+ return nil if @base && @base.to_s().empty?
155
+ @base
156
+ end
157
+
158
+
159
+ #
160
+ # Converts relative URL *link* into an absolute URL based on the
161
+ # location of the page
162
+ #
163
+ def to_absolute(link)
164
+ return nil if link.nil?
165
+
166
+ # remove anchor
167
+ link = URI.encode(URI.decode(link.to_s.gsub(/#[a-zA-Z0-9_-]*$/,'')))
168
+
169
+ relative = URI(link)
170
+ absolute = base ? base.merge(relative) : @url.merge(relative)
171
+
172
+ absolute.path = '/' if absolute.path.empty?
173
+
174
+ return absolute
175
+ end
176
+
177
+ #
178
+ # Returns +true+ if *uri* is in the same domain as the page, returns
179
+ # +false+ otherwise
180
+ #
181
+ def in_domain?(uri)
182
+ uri.host == @url.host
183
+ end
184
+
185
+ def marshal_dump
186
+ [@url, @headers, @body, @links, @code, @visited, @depth, @referer, @redirect_to, @response_time, @fetched]
187
+ end
188
+
189
+ def marshal_load(ary)
190
+ @url, @headers, @body, @links, @code, @visited, @depth, @referer, @redirect_to, @response_time, @fetched = ary
191
+ end
192
+
193
+ def to_hash
194
+ {
195
+ 'url' => @url.to_s,
196
+ 'headers' => headers.to_json,
197
+ 'body' => @body,
198
+ 'code' => @code,
199
+ 'error' => (@error ? @error.to_s : nil),
200
+ 'visited' => @visited,
201
+ 'referer' => (@referer ? @referer.to_s : nil),
202
+ 'redirect_to' => (@redirect_to ? @redirect_to.to_s : nil),
203
+ 'redirect_from' => (@redirect_from ? @redirect_from.to_s : nil),
204
+ 'response_time' => @response_time,
205
+ 'fetched' => @fetched
206
+ }.reject { |k, v| v.nil? }
207
+ end
208
+
209
+ def self.from_hash(hash)
210
+ page = self.new(URI(hash['url']))
211
+ {'@headers' => JSON.load(hash['headers']),
212
+ '@body' => hash['body'],
213
+ '@code' => hash['code'].to_i,
214
+ '@error' => hash['error'],
215
+ '@visited' => hash['visited'],
216
+ '@referer' => hash['referer'],
217
+ '@redirect_to' => (hash['redirect_to'].present?) ? URI(hash['redirect_to']) : nil,
218
+ '@redirect_from' => (hash['redirect_from'].present?) ? URI(hash['redirect_from']) : nil,
219
+ '@response_time' => hash['response_time'].to_i,
220
+ '@fetched' => hash['fetched']
221
+ }.each do |var, value|
222
+ page.instance_variable_set(var, value)
223
+ end
224
+ page
225
+ end
226
+
227
+ private
228
+
229
+ def cleanup_encoding(source)
230
+ return source unless source && (html? || xml? || @force_format)
231
+ text = source.dup
232
+ text.encode!('UTF-16', 'UTF-8', {:invalid => :replace, :undef => :replace, :replace => '?'})
233
+ text.encode('UTF-8', 'UTF-16')
234
+ end
235
+
236
+ def should_parse_as?(format)
237
+ return false unless @body
238
+ return @force_format == format if @force_format
239
+ send("#{format}?")
240
+ end
241
+ end
242
+ end
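`cleanup_encoding` scrubs invalid bytes by round-tripping through UTF-16. On Ruby 2.1 and later, `String#scrub` achieves a similar result in one step. The sketch below swaps in that technique; `scrub_utf8` is a hypothetical helper and not part of the gem, and it assumes the input is meant to be UTF-8.

```ruby
# Hypothetical helper: replace invalid UTF-8 byte sequences with '?'.
# Returns nil unchanged so callers can pass an absent body through.
def scrub_utf8(source)
  return source if source.nil?
  text = source.dup.force_encoding(Encoding::UTF_8)  # dup: never mutate the caller's string
  text.scrub('?')
end
```

The result always satisfies `valid_encoding?`, so it can be handed to a parser that expects well-formed UTF-8.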
data/lib/sunbro/settings.rb ADDED
@@ -0,0 +1,36 @@
1
+ require 'hashie'
2
+
3
+ module Sunbro
4
+ module Settings
5
+
6
+ DEFAULTS = {
7
+ user_agent: "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36",
8
+ phantomjs_user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X)",
9
+ page_format: :auto
10
+ }
11
+
12
+ def self.configure
13
+ $sunbro_configuration ||= Hashie::Mash.new
14
+ yield $sunbro_configuration
15
+ end
16
+
17
+ def self.user_agent
18
+ return DEFAULTS[:user_agent] unless configured?
19
+ $sunbro_configuration.user_agent || DEFAULTS[:user_agent]
20
+ end
21
+
22
+ def self.phantomjs_user_agent
23
+ return DEFAULTS[:phantomjs_user_agent] unless configured?
24
+ $sunbro_configuration.phantomjs_user_agent || DEFAULTS[:phantomjs_user_agent]
25
+ end
26
+
27
+ def self.page_format
28
+ return DEFAULTS[:page_format] unless configured?
29
+ $sunbro_configuration.page_format || DEFAULTS[:page_format]
30
+ end
31
+
32
+ def self.configured?
33
+ !!$sunbro_configuration
34
+ end
35
+ end
36
+ end
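A settings module with per-key defaults can also be written without a global variable or Hashie. This is a minimal sketch, assuming a plain Hash of overrides is enough; `SketchSettings`, `get`, `overrides`, `reset!` and the `ExampleBot/1.0` value are all hypothetical names chosen for illustration.

```ruby
module SketchSettings
  DEFAULTS = {
    user_agent:  'ExampleBot/1.0',  # placeholder value, not a real agent string
    page_format: :auto
  }.freeze

  # Overrides live in a module-level instance variable, not a global.
  def self.overrides
    @overrides ||= {}
  end

  # Yields the overrides hash so callers can set individual keys.
  def self.configure
    yield overrides
  end

  # Returns the override for key, or the default when none was set.
  def self.get(key)
    value = overrides[key]
    value.nil? ? DEFAULTS.fetch(key) : value
  end

  # Drops all overrides; handy between test examples.
  def self.reset!
    @overrides = {}
  end
end
```

Setting only `page_format` inside `configure` leaves `get(:user_agent)` at its default, since each key falls back independently.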
data/lib/sunbro/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module Sunbro
2
+ VERSION = "0.0.1"
3
+ end
data/spec/page_spec.rb ADDED
@@ -0,0 +1,67 @@
1
+ require 'spec_helper'
2
+ require 'mocktra'
3
+
4
+ describe Sunbro::Page do
5
+
6
+ before :each do
7
+ @http = Sunbro::HTTP.new(verbose: true)
8
+
9
+ Mocktra("www.retailer.com") do
10
+ get '/1.html' do
11
+ "<html><head><title>Title</title></head><body><p>Body text</p></body></html>"
12
+ end
13
+ end
14
+ end
15
+
16
+ describe "#initialize" do
17
+ it "scrubs invalid UTF-8 from @body by converting to UTF-16, then back again" do
18
+ # See http://stackoverflow.com/a/8873922/1169868
19
+ pending "Example"
20
+ fail
21
+ end
22
+ end
23
+
24
+ describe "#fetch_page" do
25
+ it "fetches a single page" do
26
+ url = "http://www.retailer.com/1.html"
27
+
28
+ page = @http.fetch_page(url)
29
+ expect(page.url.to_s).to eq(url)
30
+ expect(page.redirect_to).to be_nil
31
+ expect(page.redirect_from).to be_nil
32
+
33
+ end
34
+
35
+ it "preserves the original url in redirect_from after a redirect" do
36
+ pending "Figure out how to make this work with Mocktra"
37
+ fail
38
+ end
39
+
40
+ end
41
+
42
+ describe "#doc" do
43
+ it "uses the correct Nokogiri parser to parse html or xml, or lets Nokogiri guess" do
44
+ pending "Example"
45
+ fail
46
+ end
47
+ end
48
+
49
+
50
+ describe "#should_parse_as?", no_es: true do
51
+ it "returns true if Nokogiri should try to parse the page with the supplied format, false otherwise" do
52
+ url = "http://www.retailer.com/1.html"
53
+
54
+ page = @http.fetch_page(url)
55
+ expect(page.send(:should_parse_as?, :xml)).to eq(false)
56
+ expect(page.send(:should_parse_as?, :html)).to eq(true)
57
+
58
+ page = @http.fetch_page(url, force_format: :html)
59
+ expect(page.send(:should_parse_as?, :xml)).to eq(false)
60
+ expect(page.send(:should_parse_as?, :html)).to eq(true)
61
+
62
+ page = @http.fetch_page(url, force_format: :xml)
63
+ expect(page.send(:should_parse_as?, :xml)).to eq(true)
64
+ expect(page.send(:should_parse_as?, :html)).to eq(false)
65
+ end
66
+ end
67
+ end
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,13 @@
1
+ require 'rubygems'
2
+ require 'bundler/setup'
3
+ require 'active_support'
4
+ require 'sunbro'
5
+
6
+ RSpec.configure do |config|
7
+ # Run specs in random order to surface order dependencies. If you find an
8
+ # order dependency and want to debug it, you can fix the order by providing
9
+ # the seed, which is printed after each run.
10
+ # --seed 1234
11
+ config.order = "random"
12
+ end
13
+
data/sunbro.gemspec ADDED
@@ -0,0 +1,32 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'sunbro/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "sunbro"
8
+ spec.version = Sunbro::VERSION
9
+ spec.authors = ["Jon Stokes"]
10
+ spec.email = ["jon@jonstokes.com"]
11
+ spec.summary = %q{Some code that I use to crawl the web at scale. Shared in the spirit of jolly cooperation.}
12
+ spec.description = %q{Requires phantomjs.}
13
+ spec.homepage = ""
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files -z`.split("\x0")
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_dependency "nokogiri"
22
+ spec.add_dependency "capybara"
23
+ spec.add_dependency "poltergeist"
24
+ spec.add_dependency "net-http-persistent"
25
+ spec.add_dependency "activesupport"
26
+ spec.add_dependency "hashie"
27
+
28
+ spec.add_development_dependency "bundler", "~> 1.5"
29
+ spec.add_development_dependency "rake"
30
+ spec.add_development_dependency "rspec"
31
+ spec.add_development_dependency "mocktra"
32
+ end
data/tasks/rspec.rake ADDED
@@ -0,0 +1,4 @@
1
+ require 'rspec/core/rake_task'
2
+
3
+ RSpec::Core::RakeTask.new(:spec)
4
+
metadata ADDED
@@ -0,0 +1,202 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: sunbro
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ platform: ruby
6
+ authors:
7
+ - Jon Stokes
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2015-02-06 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ requirement: !ruby/object:Gem::Requirement
15
+ requirements:
16
+ - - '>='
17
+ - !ruby/object:Gem::Version
18
+ version: '0'
19
+ name: nokogiri
20
+ prerelease: false
21
+ type: :runtime
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - '>='
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ requirement: !ruby/object:Gem::Requirement
29
+ requirements:
30
+ - - '>='
31
+ - !ruby/object:Gem::Version
32
+ version: '0'
33
+ name: capybara
34
+ prerelease: false
35
+ type: :runtime
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - '>='
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ requirement: !ruby/object:Gem::Requirement
43
+ requirements:
44
+ - - '>='
45
+ - !ruby/object:Gem::Version
46
+ version: '0'
47
+ name: poltergeist
48
+ prerelease: false
49
+ type: :runtime
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - '>='
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ - !ruby/object:Gem::Dependency
56
+ requirement: !ruby/object:Gem::Requirement
57
+ requirements:
58
+ - - '>='
59
+ - !ruby/object:Gem::Version
60
+ version: '0'
61
+ name: net-http-persistent
62
+ prerelease: false
63
+ type: :runtime
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - '>='
67
+ - !ruby/object:Gem::Version
68
+ version: '0'
69
+ - !ruby/object:Gem::Dependency
70
+ requirement: !ruby/object:Gem::Requirement
71
+ requirements:
72
+ - - '>='
73
+ - !ruby/object:Gem::Version
74
+ version: '0'
75
+ name: activesupport
76
+ prerelease: false
77
+ type: :runtime
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - '>='
81
+ - !ruby/object:Gem::Version
82
+ version: '0'
83
+ - !ruby/object:Gem::Dependency
84
+ requirement: !ruby/object:Gem::Requirement
85
+ requirements:
86
+ - - '>='
87
+ - !ruby/object:Gem::Version
88
+ version: '0'
89
+ name: hashie
90
+ prerelease: false
91
+ type: :runtime
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - '>='
95
+ - !ruby/object:Gem::Version
96
+ version: '0'
97
+ - !ruby/object:Gem::Dependency
98
+ requirement: !ruby/object:Gem::Requirement
99
+ requirements:
100
+ - - ~>
101
+ - !ruby/object:Gem::Version
102
+ version: '1.5'
103
+ name: bundler
104
+ prerelease: false
105
+ type: :development
106
+ version_requirements: !ruby/object:Gem::Requirement
107
+ requirements:
108
+ - - ~>
109
+ - !ruby/object:Gem::Version
110
+ version: '1.5'
111
+ - !ruby/object:Gem::Dependency
112
+ requirement: !ruby/object:Gem::Requirement
113
+ requirements:
114
+ - - '>='
115
+ - !ruby/object:Gem::Version
116
+ version: '0'
117
+ name: rake
118
+ prerelease: false
119
+ type: :development
120
+ version_requirements: !ruby/object:Gem::Requirement
121
+ requirements:
122
+ - - '>='
123
+ - !ruby/object:Gem::Version
124
+ version: '0'
125
+ - !ruby/object:Gem::Dependency
126
+ requirement: !ruby/object:Gem::Requirement
127
+ requirements:
128
+ - - '>='
129
+ - !ruby/object:Gem::Version
130
+ version: '0'
131
+ name: rspec
132
+ prerelease: false
133
+ type: :development
134
+ version_requirements: !ruby/object:Gem::Requirement
135
+ requirements:
136
+ - - '>='
137
+ - !ruby/object:Gem::Version
138
+ version: '0'
139
+ - !ruby/object:Gem::Dependency
140
+ requirement: !ruby/object:Gem::Requirement
141
+ requirements:
142
+ - - '>='
143
+ - !ruby/object:Gem::Version
144
+ version: '0'
145
+ name: mocktra
146
+ prerelease: false
147
+ type: :development
148
+ version_requirements: !ruby/object:Gem::Requirement
149
+ requirements:
150
+ - - '>='
151
+ - !ruby/object:Gem::Version
152
+ version: '0'
153
+ description: Requires phantomjs.
154
+ email:
155
+ - jon@jonstokes.com
156
+ executables: []
157
+ extensions: []
158
+ extra_rdoc_files: []
159
+ files:
160
+ - .gitignore
161
+ - Gemfile
162
+ - LICENSE.txt
163
+ - README.md
164
+ - Rakefile
165
+ - lib/sunbro.rb
166
+ - lib/sunbro/connection.rb
167
+ - lib/sunbro/dynamic_http.rb
168
+ - lib/sunbro/http.rb
169
+ - lib/sunbro/page.rb
170
+ - lib/sunbro/settings.rb
171
+ - lib/sunbro/version.rb
172
+ - spec/page_spec.rb
173
+ - spec/spec_helper.rb
174
+ - sunbro.gemspec
175
+ - tasks/rspec.rake
176
+ homepage: ''
177
+ licenses:
178
+ - MIT
179
+ metadata: {}
180
+ post_install_message:
181
+ rdoc_options: []
182
+ require_paths:
183
+ - lib
184
+ required_ruby_version: !ruby/object:Gem::Requirement
185
+ requirements:
186
+ - - '>='
187
+ - !ruby/object:Gem::Version
188
+ version: '0'
189
+ required_rubygems_version: !ruby/object:Gem::Requirement
190
+ requirements:
191
+ - - '>='
192
+ - !ruby/object:Gem::Version
193
+ version: '0'
194
+ requirements: []
195
+ rubyforge_project:
196
+ rubygems_version: 2.4.5
197
+ signing_key:
198
+ specification_version: 4
199
+ summary: Some code that I use to crawl the web at scale. Shared in the spirit of jolly cooperation.
200
+ test_files:
201
+ - spec/page_spec.rb
202
+ - spec/spec_helper.rb