sunbro 0.0.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: b1c6c6f4683dda9c20055c2f766462986b23bc86
4
+ data.tar.gz: 27a8682eae376d1cb0675e99e56a04ebe89ee832
5
+ SHA512:
6
+ metadata.gz: cc97aa66162c983c490bd713c372624bdf65e66145ddb6c51e2de4984002ed10c984d97069345c0a6bfd54c7c1a12fd4edc2ae4efdc1aaffef7d4a59da7249f9
7
+ data.tar.gz: 587c366d59f326b0141e231b19b42aa1b5b148acb717fac8b7d01c48e8cf4476d83ea499a208283a785f43dd1f556ac569eac4b7307d95406cfa5bd937b2df6b
data/.gitignore ADDED
@@ -0,0 +1,19 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ .rspec
7
+ Gemfile.lock
8
+ InstalledFiles
9
+ _yardoc
10
+ coverage
11
+ doc/
12
+ lib/bundler/man
13
+ pkg
14
+ rdoc
15
+ spec/reports
16
+ test/tmp
17
+ test/version_tmp
18
+ tmp
19
+ .idea/
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in sunbro.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2014 Jon Stokes
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,120 @@
1
+ # Sunbro
2
+
3
+ Some code that I use to crawl the web at scale with Poltergeist and
4
+ PhantomJS (cf. [stretched.io](https://github.com/jonstokes/stretched.io)). Uses a bunch of code from the venerable [anemone gem](https://github.com/chriskite/anemone). Released in the spirit of jolly cooperation.
5
+
6
+ ## Installation
7
+
8
+ Add this line to your application's Gemfile:
9
+
10
+ gem 'sunbro'
11
+
12
+ And then execute:
13
+
14
+ $ bundle
15
+
16
+ Or install it yourself as:
17
+
18
+ $ gem install sunbro
19
+
20
+ ## Usage
21
+
22
+ I use sunbro to crawl the web at scale via Sidekiq on EC2. I've found
23
+ that web scraping with capybara/poltergeist + phantomjs is a giant pain
24
+ on JRuby (for various reasons that you'll encounter once you try it),
25
+ and this gem is basically my collection of fixes that makes it actually
26
+ work. And it works pretty well; I use it in production to crawl
27
+ 230 sites and counting.
28
+
29
+ Here's an example of a worker that looks something like what you might find in my code:
30
+
31
+ ```ruby
32
+ class CrawlerWorker
33
+
34
+ def perform(opts)
35
+ @connection = Sunbro::Connection.new
36
+ return unless @links = opts[:links]
37
+
38
+ @links.each do |link|
39
+ next unless page = @connection.get_page(link)
40
+ puts "Page #{page.url} returned code #{page.code} with body size #{page.body.size}"
41
+ end
42
+
43
+ ensure
44
+ @connection.close
45
+ end
46
+
47
+ end
48
+ ```
49
+
50
+ The above uses `Net::HTTP` to fetch pages, and it pools its connections
51
+ by host and port. This is all you need most of the time. However, if you're scraping
52
+ a page that is AJAX-heavy, that's where you'll get the most out of sunbro.
53
+ To use phantomjs to scrape a page, you'll want to call `connection.render_page(link)`.
54
+ This renders the JS on the page, but doesn't download any images.
55
+
56
+ The one option to either `get_page` or `render_page` is
57
+ `:force_format`, which can be one of `:html`, `:xml`, or `:auto`. If the
58
+ option is set to `:html`, then `Nokogiri::HTML` will be used to parse
59
+ `page.body`; if it's set to `:xml`, then `Nokogiri::XML` is used. If
60
+ it's set to `:auto` or `nil`, `Nokogiri.parse` is called.
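To make the format rules concrete, here is a minimal sketch of the decision as I read it from `Page#doc` and `Page#should_parse_as?` in this release. The helper name `parser_for` is hypothetical and not part of sunbro; it returns a symbol instead of calling Nokogiri so that it runs with the standard library alone. One nuance the sketch assumes: when no format is forced at all (`nil`), the Content-Type header is consulted before falling back to `Nokogiri.parse`.

```ruby
# Hypothetical helper (not part of sunbro): returns which parser would be
# chosen for a page, given the forced format and the Content-Type header.
def parser_for(force_format, content_type)
  # An explicit format always wins; :auto skips the Content-Type check
  # and leaves the guessing to Nokogiri.parse.
  return force_format if [:html, :xml, :auto].include?(force_format)

  type = content_type.to_s
  if type.match?(%r{\A(text/xml|application/xml)\b})
    :xml    # would map to Nokogiri::XML
  elsif type.match?(%r{\A(text/html|application/xhtml\+xml)\b})
    :html   # would map to Nokogiri::HTML
  else
    :auto   # would map to Nokogiri.parse
  end
end
```

For example, `parser_for(nil, "text/xml; charset=utf-8")` picks `:xml`, while `parser_for(:html, "text/xml")` picks `:html` because the forced format takes priority.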
61
+
62
+ ## Configuration
63
+
64
+ You can configure a few options in a `config/initializers/sunbro.rb`
65
+ file, as follows:
66
+
67
+ ```ruby
68
+ Sunbro::Settings.configure do |config|
69
+ config.user_agent = ENV['USER_AGENT_STRING1']
70
+ config.phantomjs_user_agent = ENV['USER_AGENT_STRING2']
71
+ config.page_format = :auto
72
+ end
73
+ ```
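A caveat worth knowing, as far as I can tell from `settings.rb`: the built-in defaults are only returned when `configure` has never been called, so a key you leave out of the block reads back as `nil`. The sketch below shows one way a per-key fallback could look. `SettingsSketch`, its `fetch` method, and the placeholder user-agent strings are all hypothetical, and it uses a plain `Struct` rather than `Hashie::Mash` to stay within the standard library.

```ruby
# Hypothetical sketch of per-key defaults; not sunbro's actual Settings module.
module SettingsSketch
  DEFAULTS = {
    user_agent: "ExampleAgent/1.0",             # placeholder value
    phantomjs_user_agent: "ExamplePhantom/1.0", # placeholder value
    page_format: :auto
  }.freeze

  Config = Struct.new(*DEFAULTS.keys)

  # Yields a config object whose fields can be assigned in the block.
  def self.configure
    @config ||= Config.new
    yield @config
  end

  # Returns the configured value, or the default when the key was never set.
  def self.fetch(key)
    value = @config && @config[key]
    value.nil? ? DEFAULTS.fetch(key) : value
  end
end
```

With this shape, setting only `user_agent` in the block still leaves `page_format` at `:auto`.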
74
+
75
+ ## PhantomJS zombie process monkey patch
76
+
77
+ I use the following monkey patch for PhantomJS, because it has zombie
78
+ process issues when it comes to JRuby. This monkey patch kills some minor
79
+ PhantomJS functionality that I don't use, and you can read more about
80
+ what it does and why, in [this blog post](http://jonstokes.com/2014/07/07/monkey-patching-poltergeist-for-web-scraping-with-jruby/).
81
+
82
+ I put this in `config/initializers/phantomjs.rb`
83
+
84
+ ```ruby
85
+ require "capybara"
86
+ require "capybara/poltergeist"
87
+ require "capybara/poltergeist/utility"
88
+
89
+ module Capybara::Poltergeist
90
+ Client.class_eval do
91
+ def start
92
+ @pid = Process.spawn(*command.map(&:to_s), pgroup: true)
93
+ ObjectSpace.define_finalizer(self, self.class.process_killer(@pid))
94
+ end
95
+
96
+ def stop
97
+ if pid
98
+ kill_phantomjs
99
+ ObjectSpace.undefine_finalizer(self)
100
+ end
101
+ end
102
+ end
103
+ end
104
+ ```
105
+
106
+ ## Next steps
107
+
108
+ Right now, this is more of a bag of code than a bona fide user-friendly
109
+ gem. One next step would be to add some configuration options for PhantomJS
110
+ that get passed via `render_page` to poltergeist and then on to the
111
+ command line. Another would be to use `net-http-persistent`, which is
112
+ actually included here as a dependency but isn't yet used.
113
+
114
+ ## Contributing
115
+
116
+ 1. Fork it ( http://github.com/<my-github-username>/sunbro/fork )
117
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
118
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
119
+ 4. Push to the branch (`git push origin my-new-feature`)
120
+ 5. Create new Pull Request
data/Rakefile ADDED
@@ -0,0 +1 @@
1
+ require "bundler/gem_tasks"
data/lib/sunbro.rb ADDED
@@ -0,0 +1,44 @@
1
+ require 'nokogiri'
2
+ require 'capybara/poltergeist'
3
+ require 'net/http/persistent'
4
+ require 'webrick/cookie'
5
+
6
+ %w(
7
+ sunbro/version
8
+ sunbro/settings
9
+ sunbro/dynamic_http
10
+ sunbro/http
11
+ sunbro/page
12
+ ).each do |f|
13
+ require f
14
+ end
15
+
16
+ module Sunbro
17
+ MAX_RETRIES = 5
18
+
19
+ def get_page(link, opts={})
20
+ @http ||= HTTP.new
21
+ fetch_with_connection(@http, link, opts)
22
+ end
23
+
24
+ def render_page(link, opts={})
25
+ @dhttp ||= DynamicHTTP.new
26
+ fetch_with_connection(@dhttp, link, opts)
27
+ end
28
+
29
+ def fetch_with_connection(conn, link, opts)
30
+ page, tries = nil, MAX_RETRIES
31
+ begin
32
+ page = conn.fetch_page(link, opts)
33
+ sleep 1
34
+ end until page.try(:present?) || (tries -= 1).zero?
35
+ page.discard_doc! if page && !page.is_valid?
36
+ page
37
+ end
38
+
39
+ def close_http_connections
40
+ @http.close if @http
41
+ @dhttp.close if @dhttp
42
+ rescue IOError
43
+ end
44
+ end
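The retry loop in `fetch_with_connection` can be read as "call the fetcher until it returns something usable or the attempts run out". Below is a minimal, standalone sketch of that pattern. `fetch_with_retries` and its keyword arguments are hypothetical names, and unlike the original it only sleeps between failed attempts, which is a deliberate variation rather than the gem's behavior.

```ruby
MAX_RETRIES = 5

# Hypothetical helper: yields the attempt index until the block returns a
# truthy value or max_tries attempts have been made. Returns the last result.
def fetch_with_retries(max_tries: MAX_RETRIES, delay: 1)
  result = nil
  max_tries.times do |attempt|
    result = yield(attempt)
    break if result
    # Pause before the next attempt, but not after the final one.
    sleep delay if attempt < max_tries - 1
  end
  result
end
```

A caller would wrap the real fetch in the block, for example `fetch_with_retries { conn.fetch_page(link, opts) }`, and then check the result for `nil` before using it.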
data/lib/sunbro/connection.rb ADDED
@@ -0,0 +1,9 @@
1
+ class Connection
2
+ include Sunbro
3
+ attr_reader :http, :dhttp
4
+
5
+ def close
6
+ close_http_connections
7
+ end
8
+ end
9
+
data/lib/sunbro/dynamic_http.rb ADDED
@@ -0,0 +1,99 @@
1
+ module Sunbro
2
+ class DynamicHTTP
3
+ attr_reader :session
4
+
5
+ def initialize(opts = {})
6
+ @opts = opts
7
+ new_session
8
+ end
9
+
10
+ def close
11
+ @session.driver.quit
12
+ end
13
+
14
+ def new_session
15
+ Capybara.register_driver :poltergeist do |app|
16
+ Capybara::Poltergeist::Driver.new(
17
+ app,
18
+ js_errors: false,
19
+ phantomjs_options: ['--load-images=no', '--ignore-ssl-errors=yes']
20
+ )
21
+ end
22
+ Capybara.default_driver = :poltergeist
23
+ Capybara.javascript_driver = :poltergeist
24
+ Capybara.run_server = false
25
+ @session = Capybara::Session.new(:poltergeist)
26
+ @session.driver.headers = {
27
+ 'User-Agent' => user_agent
28
+ }
29
+ @session
30
+ end
31
+
32
+ def user_agent
33
+ @opts[:agent] || Settings.phantomjs_user_agent
34
+ end
35
+
36
+ def restart_session
37
+ close
38
+ new_session
39
+ end
40
+
41
+ #
42
+ # Create new Pages from the response of an HTTP request to *url*,
43
+ # including redirects
44
+ #
45
+ def fetch_page(url, opts={})
46
+ begin
47
+ tries ||= 5
48
+ get_page(url, opts)
49
+ rescue Capybara::Poltergeist::DeadClient, Errno::EPIPE, NoMethodError, Capybara::Poltergeist::BrowserError => e
50
+ restart_session
51
+ retry unless (tries -= 1).zero?
52
+ close
53
+ raise e
54
+ end
55
+ end
56
+
57
+ def get_page(url, opts)
58
+ reset = opts.fetch(:reset, true)
59
+ session.visit(url.to_s)
60
+ page = create_page_from_session(url, session, opts)
61
+ session.reset! if reset
62
+ page
63
+ rescue Capybara::Poltergeist::TimeoutError => e
64
+ restart_session
65
+ return Page.new(url, :error => e)
66
+ end
67
+
68
+ private
69
+
70
+ def create_page_from_session(url, session, opts)
71
+ url = url.to_s
72
+ if url == session.current_url
73
+ Page.new(
74
+ session.current_url,
75
+ :body => session.html.dup,
76
+ :code => session.status_code,
77
+ :headers => session.response_headers,
78
+ :force_format => (opts[:force_format] || default_page_format)
79
+ )
80
+ else
81
+ Page.new(
82
+ session.current_url,
83
+ :body => session.html.dup,
84
+ :code => 301,
85
+ :redirect_from => url,
86
+ :headers => session.response_headers,
87
+ :force_format => (opts[:force_format] || default_page_format)
88
+ )
89
+ end
90
+ end
91
+
92
+ def default_page_format
93
+ # Don't force the page format if the default format is set to :auto
94
+ return unless [:xml, :html].include? Settings.page_format
95
+ Settings.page_format
96
+ end
97
+
98
+ end
99
+ end
data/lib/sunbro/http.rb ADDED
@@ -0,0 +1,217 @@
1
+ module Sunbro
2
+ class HTTP
3
+ # Maximum number of redirects to follow on each get_response
4
+ REDIRECT_LIMIT = 5
5
+
6
+ def initialize(opts = {})
7
+ @connections = {}
8
+ @opts = opts
9
+ end
10
+
11
+ def close
12
+ @connections.each do |host, ports|
13
+ ports.each do |port, connection|
14
+ connection.finish
15
+ end
16
+ end
17
+ end
18
+
19
+ #
20
+ # Fetch a single Page from the response of an HTTP request to *url*.
21
+ # Just gets the final destination page.
22
+ #
23
+ def fetch_page(url, opts={})
24
+ original_url = url.dup
25
+ pages = fetch_pages(url, opts)
26
+ if pages.count == 1
27
+ page = pages.first
28
+ page.url = original_url
29
+ page
30
+ else
31
+ page = pages.last
32
+ page.redirect_from = original_url
33
+ page
34
+ end
35
+ end
36
+
37
+ #
38
+ # Create new Pages from the response of an HTTP request to *url*,
39
+ # including redirects
40
+ #
41
+ def fetch_pages(url, opts={})
42
+ referer, depth = opts[:referer], opts[:depth]
43
+ force_format = opts[:force_format] || default_page_format
44
+ begin
45
+ url = convert_to_uri(url) unless url.is_a?(URI)
46
+ pages = []
47
+ get(url, referer) do |response, code, location, redirect_to, response_time|
48
+ pages << Page.new(location, :body => response.body.dup,
49
+ :code => code,
50
+ :headers => response.to_hash,
51
+ :referer => referer,
52
+ :depth => depth,
53
+ :redirect_to => redirect_to,
54
+ :response_time => response_time,
55
+ :force_format => force_format)
56
+ end
57
+
58
+ return pages
59
+ rescue StandardError => e
60
+ if verbose?
61
+ puts e.inspect
62
+ puts e.backtrace
63
+ end
64
+ return [Page.new(url, :error => e)]
65
+ end
66
+ end
67
+
68
+ #
69
+ # Convert the link to a valid URI if possible
70
+ #
71
+ def convert_to_uri(url)
72
+ URI(url)
73
+ rescue URI::InvalidURIError
74
+ URI(URI.escape(url))
75
+ end
76
+
77
+ #
78
+ # The maximum number of redirects to follow
79
+ #
80
+ def redirect_limit
81
+ @opts[:redirect_limit] || REDIRECT_LIMIT
82
+ end
83
+
84
+ #
85
+ # The user-agent string which will be sent with each request,
86
+ # or nil if no such option is set
87
+ #
88
+ def user_agent
89
+ @opts[:agent] || Settings.user_agent
90
+ end
91
+
92
+ #
93
+ # Does this HTTP client accept cookies from the server?
94
+ #
95
+ def accept_cookies?
96
+ @opts[:accept_cookies]
97
+ end
98
+
99
+ #
100
+ # The proxy address string
101
+ #
102
+ def proxy_host
103
+ @opts[:proxy_host]
104
+ end
105
+
106
+ #
107
+ # The proxy port
108
+ #
109
+ def proxy_port
110
+ @opts[:proxy_port]
111
+ end
112
+
113
+ #
114
+ # HTTP read timeout in seconds
115
+ #
116
+ def read_timeout
117
+ @opts[:read_timeout]
118
+ end
119
+
120
+ private
121
+
122
+ #
123
+ # Retrieve HTTP responses for *url*, including redirects.
124
+ # Yields the response object, response code, and URI location
125
+ # for each response.
126
+ #
127
+ def get(url, referer = nil)
128
+ limit = redirect_limit
129
+ loc = url
130
+ begin
131
+ # if redirected to a relative url, merge it with the host of the original
132
+ # request url
133
+ loc = url.merge(loc) if loc.relative?
134
+
135
+ response, response_time = get_response(loc, referer)
136
+ code = Integer(response.code)
137
+ redirect_to = response.is_a?(Net::HTTPRedirection) ? URI(response['location']).normalize : nil
138
+ yield response, code, loc, redirect_to, response_time
139
+ limit -= 1
140
+ end while (loc = redirect_to) && allowed?(redirect_to, url) && limit > 0
141
+ end
142
+
143
+ #
144
+ # Get an HTTPResponse for *url*, sending the appropriate User-Agent string
145
+ #
146
+ def get_response(url, referer = nil)
147
+ full_path = url.query.nil? ? url.path : "#{url.path}?#{url.query}"
148
+
149
+ opts = {}
150
+ opts['User-Agent'] = user_agent if user_agent
151
+ opts['Referer'] = referer.to_s if referer
152
+
153
+ retries = 0
154
+ begin
155
+ start = Time.now()
156
+ # format request
157
+ req = Net::HTTP::Get.new(full_path, opts)
158
+ # HTTP Basic authentication
159
+ req.basic_auth url.user, url.password if url.user
160
+ response = connection(url).request(req)
161
+ finish = Time.now()
162
+ response_time = ((finish - start) * 1000).round
163
+ return response, response_time
164
+ rescue Timeout::Error, Net::HTTPBadResponse, EOFError => e
165
+ puts e.inspect if verbose?
166
+ refresh_connection(url)
167
+ retries += 1
168
+ retry unless retries > 3
169
+ end
170
+ end
171
+
172
+ def connection(url)
173
+ @connections[url.host] ||= {}
174
+
175
+ if conn = @connections[url.host][url.port]
176
+ return conn
177
+ end
178
+
179
+ refresh_connection url
180
+ end
181
+
182
+ def refresh_connection(url)
183
+ http = Net::HTTP.new(url.host, url.port, proxy_host, proxy_port)
184
+
185
+ http.read_timeout = read_timeout if read_timeout
186
+
187
+ if url.scheme == 'https'
188
+ http.use_ssl = true
189
+ http.verify_mode = OpenSSL::SSL::VERIFY_NONE
190
+ end
191
+
192
+ @connections[url.host][url.port] = http.start
193
+ end
194
+
195
+ def verbose?
196
+ @opts[:verbose]
197
+ end
198
+
199
+ #
200
+ # Allowed to connect to the requested url?
201
+ #
202
+ def allowed?(to_url, from_url)
203
+ to_url.host.nil? || (to_url.host.sub("www.","") == from_url.host.sub("www.",""))
204
+ rescue
205
+ true
206
+ end
207
+
208
+ private
209
+
210
+ def default_page_format
211
+ # Don't force the page format if the default format is set to :auto
212
+ return unless [:xml, :html].include? Settings.page_format
213
+ Settings.page_format
214
+ end
215
+
216
+ end
217
+ end
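The redirect handling in `HTTP#get` rests on two ideas: a relative `Location` is merged with the request URL, and a redirect is only followed when it stays on the same site, ignoring a `www.` prefix. Here is a small sketch of both using only the standard `uri` library. The helper names are hypothetical, and `delete_prefix` (Ruby 2.5+) is a swapped-in variation: the original uses `sub("www.", "")`, which would also strip a `www.` that appears later in the host name.

```ruby
require 'uri'

# Hypothetical helper: host name without a leading "www.", lowercased.
def bare_host(uri)
  uri.host.to_s.downcase.delete_prefix("www.")
end

# True when the redirect target is relative or stays on the same site.
def same_site_redirect?(to_url, from_url)
  to = URI(to_url)
  from = URI(from_url)
  to.host.nil? || bare_host(to) == bare_host(from)
end

# Resolves a possibly relative Location header against the request URL.
def resolve_redirect(from_url, location)
  URI(from_url).merge(location).to_s
end
```

So a redirect from `http://www.retailer.com/a` to `http://retailer.com/b` counts as same-site, while one to `http://other.com/` does not.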
data/lib/sunbro/page.rb ADDED
@@ -0,0 +1,242 @@
1
+ module Sunbro
2
+ class Page
3
+
4
+ # The URL of the page
5
+ attr_accessor :url
6
+ # The raw HTTP response body of the page
7
+ attr_reader :body
8
+ # Headers of the HTTP response
9
+ attr_reader :headers
10
+ # URL of the page this one redirected to, if any
11
+ attr_reader :redirect_to
12
+ # Exception object, if one was raised during HTTP#fetch_page
13
+ attr_reader :error
14
+
15
+ # Integer response code of the page
16
+ attr_accessor :code
17
+ # Boolean indicating whether or not this page has been visited in PageStore#shortest_paths!
18
+ attr_accessor :visited
19
+ # Depth of this page from the root of the crawl. This is not necessarily the
20
+ # shortest path; use PageStore#shortest_paths! to find that value.
21
+ attr_accessor :depth
22
+ # URL of the page that brought us to this page
23
+ attr_accessor :referer
24
+ # Response time of the request for this page in milliseconds
25
+ attr_accessor :response_time
26
+
27
+ attr_accessor :redirect_from
28
+
29
+ #
30
+ # Create a new page
31
+ #
32
+ def initialize(url, params = {})
33
+ @url = url
34
+
35
+ @code = params[:code]
36
+ @headers = params[:headers] || {}
37
+ @headers['content-type'] ||= ['']
38
+ @aliases = Array(params[:aka]).compact
39
+ @referer = params[:referer]
40
+ @depth = params[:depth] || 0
41
+ @redirect_to = to_absolute(params[:redirect_to])
42
+ @response_time = params[:response_time]
43
+ @error = params[:error]
44
+ @fetched = !params[:code].nil?
45
+ @force_format = params[:force_format]
46
+ @body = params[:body]
47
+ @redirect_from = params[:redirect_from]
48
+ end
49
+
50
+ #
51
+ # Nokogiri document for the HTML body
52
+ #
53
+ def doc
54
+ @doc ||= begin
55
+ if image?
56
+ nil
57
+ elsif should_parse_as?(:xml)
58
+ Nokogiri::XML(@body, @url.to_s)
59
+ elsif should_parse_as?(:html)
60
+ Nokogiri::HTML(@body, @url.to_s)
61
+ elsif @body
62
+ Nokogiri.parse(@body, @url.to_s)
63
+ end
64
+ end
65
+ end
66
+
67
+ def is_valid?
68
+ (url != "about:blank") && !not_found? && present?
69
+ end
70
+
71
+ def present?
72
+ !error && code && body.present? && doc
73
+ end
74
+
75
+ #
76
+ # Delete the Nokogiri document and response body to conserve memory
77
+ #
78
+ def discard_doc!
79
+ @doc = @body = nil
80
+ end
81
+
82
+ #
83
+ # Was the page successfully fetched?
84
+ # +true+ if the page was fetched with no error, +false+ otherwise.
85
+ #
86
+ def fetched?
87
+ @fetched
88
+ end
89
+
90
+ #
91
+ # Array of cookies received with this page as WEBrick::Cookie objects.
92
+ #
93
+ def cookies
94
+ WEBrick::Cookie.parse_set_cookies(@headers['Set-Cookie']) rescue []
95
+ end
96
+
97
+ #
98
+ # The content-type returned by the HTTP request for this page
99
+ #
100
+ def content_type
101
+ headers['content-type'].first
102
+ end
103
+
104
+ #
105
+ # Returns +true+ if the page is an image, returns +false+
106
+ # otherwise.
107
+ #
108
+ def image?
109
+ !!(content_type =~ %r{^(image/)\b})
110
+ end
111
+
112
+ #
113
+ # Returns +true+ if the page is a HTML document, returns +false+
114
+ # otherwise.
115
+ #
116
+ def html?
117
+ !!(content_type =~ %r{^(text/html|application/xhtml\+xml)\b})
118
+ end
119
+
120
+ #
121
+ # Returns +true+ if the page is a XML document, returns +false+
122
+ # otherwise.
123
+ #
124
+ def xml?
125
+ !!(content_type =~ %r{^(text/xml|application/xml)\b})
126
+ end
127
+
128
+ #
129
+ # Returns +true+ if the page is a HTTP redirect, returns +false+
130
+ # otherwise.
131
+ #
132
+ def redirect?
133
+ (300..307).include?(@code)
134
+ end
135
+
136
+ #
137
+ # Returns +true+ if the page was not found (returned 404 code),
138
+ # returns +false+ otherwise.
139
+ #
140
+ def not_found?
141
+ 404 == @code
142
+ end
143
+
144
+ #
145
+ # Base URI from the HTML doc head element
146
+ # http://www.w3.org/TR/html4/struct/links.html#edef-BASE
147
+ #
148
+ def base
149
+ @base = if doc
150
+ href = doc.search('//head/base/@href')
151
+ URI(href.to_s) unless href.nil? rescue nil
152
+ end unless @base
153
+
154
+ return nil if @base && @base.to_s().empty?
155
+ @base
156
+ end
157
+
158
+
159
+ #
160
+ # Converts relative URL *link* into an absolute URL based on the
161
+ # location of the page
162
+ #
163
+ def to_absolute(link)
164
+ return nil if link.nil?
165
+
166
+ # remove anchor
167
+ link = URI.encode(URI.decode(link.to_s.gsub(/#[a-zA-Z0-9_-]*$/,'')))
168
+
169
+ relative = URI(link)
170
+ absolute = base ? base.merge(relative) : @url.merge(relative)
171
+
172
+ absolute.path = '/' if absolute.path.empty?
173
+
174
+ return absolute
175
+ end
176
+
177
+ #
178
+ # Returns +true+ if *uri* is in the same domain as the page, returns
179
+ # +false+ otherwise
180
+ #
181
+ def in_domain?(uri)
182
+ uri.host == @url.host
183
+ end
184
+
185
+ def marshal_dump
186
+ [@url, @headers, @body, @links, @code, @visited, @depth, @referer, @redirect_to, @response_time, @fetched]
187
+ end
188
+
189
+ def marshal_load(ary)
190
+ @url, @headers, @body, @links, @code, @visited, @depth, @referer, @redirect_to, @response_time, @fetched = ary
191
+ end
192
+
193
+ def to_hash
194
+ {
195
+ 'url' => @url.to_s,
196
+ 'headers' => headers.to_json,
197
+ 'body' => @body,
198
+ 'code' => @code,
199
+ 'error' => (@error ? @error.to_s : nil),
200
+ 'visited' => @visited,
201
+ 'referer' => (@referer ? @referer.to_s : nil),
202
+ 'redirect_to' => (@redirect_to ? @redirect_to.to_s : nil),
203
+ 'redirect_from' => (@redirect_from ? @redirect_from.to_s : nil),
204
+ 'response_time' => @response_time,
205
+ 'fetched' => @fetched
206
+ }.reject { |k, v| v.nil? }
207
+ end
208
+
209
+ def self.from_hash(hash)
210
+ page = self.new(URI(hash['url']))
211
+ {'@headers' => JSON.load(hash['headers']),
212
+ '@body' => hash['body'],
213
+ '@code' => hash['code'].to_i,
214
+ '@error' => hash['error'],
215
+ '@visited' => hash['visited'],
216
+ '@referer' => hash['referer'],
217
+ '@redirect_to' => (hash['redirect_to'].present?) ? URI(hash['redirect_to']) : nil,
218
+ '@redirect_from' => (hash['redirect_from'].present?) ? URI(hash['redirect_from']) : nil,
219
+ '@response_time' => hash['response_time'].to_i,
220
+ '@fetched' => hash['fetched']
221
+ }.each do |var, value|
222
+ page.instance_variable_set(var, value)
223
+ end
224
+ page
225
+ end
226
+
227
+ private
228
+
229
+ def cleanup_encoding(source)
230
+ return source unless source && (html? || xml? || @force_format)
231
+ text = source.dup
232
+ text.encode!('UTF-16', 'UTF-8', :invalid => :replace, :undef => :replace, :replace => '?')
233
+ text.encode('UTF-8', 'UTF-16')
234
+ end
235
+
236
+ def should_parse_as?(format)
237
+ return false unless @body
238
+ return @force_format == format if @force_format
239
+ send("#{format}?")
240
+ end
241
+ end
242
+ end
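The private `cleanup_encoding` method relies on a UTF-8 to UTF-16 round trip to replace invalid bytes. A standalone sketch of that trick follows; `scrub_via_utf16` is a hypothetical name, and the sketch assumes the input is meant to be UTF-8 and passes the replacement options as keywords.

```ruby
# Hypothetical helper: replaces invalid UTF-8 byte sequences with "?" by
# transcoding to UTF-16 and back, as in the round trip described above.
def scrub_via_utf16(source)
  return source if source.nil?

  # Work on a copy and treat the bytes as UTF-8 before transcoding.
  text = source.dup.force_encoding(Encoding::UTF_8)
  text.encode('UTF-16', 'UTF-8', invalid: :replace, undef: :replace, replace: '?')
      .encode('UTF-8', 'UTF-16')
end
```

On Ruby 2.1 and later, `String#scrub` does the same job in one call, for example `text.scrub('?')`.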
data/lib/sunbro/settings.rb ADDED
@@ -0,0 +1,36 @@
1
+ require 'hashie'
2
+
3
+ module Sunbro
4
+ module Settings
5
+
6
+ DEFAULTS = {
7
+ user_agent: "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36",
8
+ phantomjs_user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X)",
9
+ page_format: :auto
10
+ }
11
+
12
+ def self.configure
13
+ $sunbro_configuration ||= Hashie::Mash.new
14
+ yield $sunbro_configuration
15
+ end
16
+
17
+ def self.user_agent
18
+ return DEFAULTS[:user_agent] unless configured?
19
+ $sunbro_configuration.user_agent
20
+ end
21
+
22
+ def self.phantomjs_user_agent
23
+ return DEFAULTS[:phantomjs_user_agent] unless configured?
24
+ $sunbro_configuration.phantomjs_user_agent
25
+ end
26
+
27
+ def self.page_format
28
+ return DEFAULTS[:page_format] unless configured?
29
+ $sunbro_configuration.page_format
30
+ end
31
+
32
+ def self.configured?
33
+ !!$sunbro_configuration
34
+ end
35
+ end
36
+ end
data/lib/sunbro/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module Sunbro
2
+ VERSION = "0.0.1"
3
+ end
data/spec/page_spec.rb ADDED
@@ -0,0 +1,67 @@
1
+ require 'spec_helper'
2
+ require 'mocktra'
3
+
4
+ describe Sunbro::Page do
5
+
6
+ before :each do
7
+ @http = Sunbro::HTTP.new(verbose: true)
8
+
9
+ Mocktra("www.retailer.com") do
10
+ get '/1.html' do
11
+ "<html><head><title>Title</title></head><body><p>Body text</p></body></html>"
12
+ end
13
+ end
14
+ end
15
+
16
+ describe "#initialize" do
17
+ it "scrubs invalid UTF-8 from @body by converting to UTF-16, then back again" do
18
+ # See http://stackoverflow.com/a/8873922/1169868
19
+ pending "Example"
20
+ fail
21
+ end
22
+ end
23
+
24
+ describe "#fetch_page" do
25
+ it "fetches a single page" do
26
+ url = "http://www.retailer.com/1.html"
27
+
28
+ page = @http.fetch_page(url)
29
+ expect(page.url.to_s).to eq(url)
30
+ expect(page.redirect_to).to be_nil
31
+ expect(page.redirect_from).to be_nil
32
+
33
+ end
34
+
35
+ it "preserves the original url in redirect_from after a redirect" do
36
+ pending "Figure out how to make this work with Mocktra"
37
+ fail
38
+ end
39
+
40
+ end
41
+
42
+ describe "#doc" do
43
+ it "uses the correct Nokogiri parser to parse html or xml, or lets Nokogiri guess" do
44
+ pending "Example"
45
+ fail
46
+ end
47
+ end
48
+
49
+
50
+ describe "#should_parse_as?", no_es: true do
51
+ it "returns true if Nokogiri should try to parse the page with the supplied format, false otherwise" do
52
+ url = "http://www.retailer.com/1.html"
53
+
54
+ page = @http.fetch_page(url)
55
+ expect(page.send(:should_parse_as?, :xml)).to eq(false)
56
+ expect(page.send(:should_parse_as?, :html)).to eq(true)
57
+
58
+ page = @http.fetch_page(url, force_format: :html)
59
+ expect(page.send(:should_parse_as?, :xml)).to eq(false)
60
+ expect(page.send(:should_parse_as?, :html)).to eq(true)
61
+
62
+ page = @http.fetch_page(url, force_format: :xml)
63
+ expect(page.send(:should_parse_as?, :xml)).to eq(true)
64
+ expect(page.send(:should_parse_as?, :html)).to eq(false)
65
+ end
66
+ end
67
+ end
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,13 @@
1
+ require 'rubygems'
2
+ require 'bundler/setup'
3
+ require 'active_support'
4
+ require 'sunbro'
5
+
6
+ RSpec.configure do |config|
7
+ # Run specs in random order to surface order dependencies. If you find an
8
+ # order dependency and want to debug it, you can fix the order by providing
9
+ # the seed, which is printed after each run.
10
+ # --seed 1234
11
+ config.order = "random"
12
+ end
13
+
data/sunbro.gemspec ADDED
@@ -0,0 +1,32 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'sunbro/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "sunbro"
8
+ spec.version = Sunbro::VERSION
9
+ spec.authors = ["Jon Stokes"]
10
+ spec.email = ["jon@jonstokes.com"]
11
+ spec.summary = %q{Some code that I use to crawl the web at scale. Shared in the spirit of jolly cooperation.}
12
+ spec.description = %q{Requires phantomjs.}
13
+ spec.homepage = ""
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files -z`.split("\x0")
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_dependency "nokogiri"
22
+ spec.add_dependency "capybara"
23
+ spec.add_dependency "poltergeist"
24
+ spec.add_dependency "net-http-persistent"
25
+ spec.add_dependency "activesupport"
26
+ spec.add_dependency "hashie"
27
+
28
+ spec.add_development_dependency "bundler", "~> 1.5"
29
+ spec.add_development_dependency "rake"
30
+ spec.add_development_dependency "rspec"
31
+ spec.add_development_dependency "mocktra"
32
+ end
data/tasks/rspec.rake ADDED
@@ -0,0 +1,4 @@
1
+ require 'rspec/core/rake_task'
2
+
3
+ RSpec::Core::RakeTask.new(:spec)
4
+
metadata ADDED
@@ -0,0 +1,202 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: sunbro
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ platform: ruby
6
+ authors:
7
+ - Jon Stokes
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2015-02-06 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ requirement: !ruby/object:Gem::Requirement
15
+ requirements:
16
+ - - '>='
17
+ - !ruby/object:Gem::Version
18
+ version: '0'
19
+ name: nokogiri
20
+ prerelease: false
21
+ type: :runtime
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - '>='
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ requirement: !ruby/object:Gem::Requirement
29
+ requirements:
30
+ - - '>='
31
+ - !ruby/object:Gem::Version
32
+ version: '0'
33
+ name: capybara
34
+ prerelease: false
35
+ type: :runtime
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - '>='
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ requirement: !ruby/object:Gem::Requirement
43
+ requirements:
44
+ - - '>='
45
+ - !ruby/object:Gem::Version
46
+ version: '0'
47
+ name: poltergeist
48
+ prerelease: false
49
+ type: :runtime
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - '>='
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ - !ruby/object:Gem::Dependency
56
+ requirement: !ruby/object:Gem::Requirement
57
+ requirements:
58
+ - - '>='
59
+ - !ruby/object:Gem::Version
60
+ version: '0'
61
+ name: net-http-persistent
62
+ prerelease: false
63
+ type: :runtime
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - '>='
67
+ - !ruby/object:Gem::Version
68
+ version: '0'
69
+ - !ruby/object:Gem::Dependency
70
+ requirement: !ruby/object:Gem::Requirement
71
+ requirements:
72
+ - - '>='
73
+ - !ruby/object:Gem::Version
74
+ version: '0'
75
+ name: activesupport
76
+ prerelease: false
77
+ type: :runtime
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - '>='
81
+ - !ruby/object:Gem::Version
82
+ version: '0'
83
+ - !ruby/object:Gem::Dependency
84
+ requirement: !ruby/object:Gem::Requirement
85
+ requirements:
86
+ - - '>='
87
+ - !ruby/object:Gem::Version
88
+ version: '0'
89
+ name: hashie
90
+ prerelease: false
91
+ type: :runtime
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - '>='
95
+ - !ruby/object:Gem::Version
96
+ version: '0'
97
+ - !ruby/object:Gem::Dependency
98
+ requirement: !ruby/object:Gem::Requirement
99
+ requirements:
100
+ - - ~>
101
+ - !ruby/object:Gem::Version
102
+ version: '1.5'
103
+ name: bundler
104
+ prerelease: false
105
+ type: :development
106
+ version_requirements: !ruby/object:Gem::Requirement
107
+ requirements:
108
+ - - ~>
109
+ - !ruby/object:Gem::Version
110
+ version: '1.5'
111
+ - !ruby/object:Gem::Dependency
112
+ requirement: !ruby/object:Gem::Requirement
113
+ requirements:
114
+ - - '>='
115
+ - !ruby/object:Gem::Version
116
+ version: '0'
117
+ name: rake
118
+ prerelease: false
119
+ type: :development
120
+ version_requirements: !ruby/object:Gem::Requirement
121
+ requirements:
122
+ - - '>='
123
+ - !ruby/object:Gem::Version
124
+ version: '0'
125
+ - !ruby/object:Gem::Dependency
126
+ requirement: !ruby/object:Gem::Requirement
127
+ requirements:
128
+ - - '>='
129
+ - !ruby/object:Gem::Version
130
+ version: '0'
131
+ name: rspec
132
+ prerelease: false
133
+ type: :development
134
+ version_requirements: !ruby/object:Gem::Requirement
135
+ requirements:
136
+ - - '>='
137
+ - !ruby/object:Gem::Version
138
+ version: '0'
139
+ - !ruby/object:Gem::Dependency
140
+ requirement: !ruby/object:Gem::Requirement
141
+ requirements:
142
+ - - '>='
143
+ - !ruby/object:Gem::Version
144
+ version: '0'
145
+ name: mocktra
146
+ prerelease: false
147
+ type: :development
148
+ version_requirements: !ruby/object:Gem::Requirement
149
+ requirements:
150
+ - - '>='
151
+ - !ruby/object:Gem::Version
152
+ version: '0'
153
+ description: Requires phantomjs.
154
+ email:
155
+ - jon@jonstokes.com
156
+ executables: []
157
+ extensions: []
158
+ extra_rdoc_files: []
159
+ files:
160
+ - .gitignore
161
+ - Gemfile
162
+ - LICENSE.txt
163
+ - README.md
164
+ - Rakefile
165
+ - lib/sunbro.rb
166
+ - lib/sunbro/connection.rb
167
+ - lib/sunbro/dynamic_http.rb
168
+ - lib/sunbro/http.rb
169
+ - lib/sunbro/page.rb
170
+ - lib/sunbro/settings.rb
171
+ - lib/sunbro/version.rb
172
+ - spec/page_spec.rb
173
+ - spec/spec_helper.rb
174
+ - sunbro.gemspec
175
+ - tasks/rspec.rake
176
+ homepage: ''
177
+ licenses:
178
+ - MIT
179
+ metadata: {}
180
+ post_install_message:
181
+ rdoc_options: []
182
+ require_paths:
183
+ - lib
184
+ required_ruby_version: !ruby/object:Gem::Requirement
185
+ requirements:
186
+ - - '>='
187
+ - !ruby/object:Gem::Version
188
+ version: '0'
189
+ required_rubygems_version: !ruby/object:Gem::Requirement
190
+ requirements:
191
+ - - '>='
192
+ - !ruby/object:Gem::Version
193
+ version: '0'
194
+ requirements: []
195
+ rubyforge_project:
196
+ rubygems_version: 2.4.5
197
+ signing_key:
198
+ specification_version: 4
199
+ summary: Some code that I use to crawl the web at scale. Shared in the spirit of jolly cooperation.
200
+ test_files:
201
+ - spec/page_spec.rb
202
+ - spec/spec_helper.rb