RubyGems - sunbro - Versions diffs - 0.0.1 - Mend

sunbro 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: b1c6c6f4683dda9c20055c2f766462986b23bc86
+  data.tar.gz: 27a8682eae376d1cb0675e99e56a04ebe89ee832
+SHA512:
+  metadata.gz: cc97aa66162c983c490bd713c372624bdf65e66145ddb6c51e2de4984002ed10c984d97069345c0a6bfd54c7c1a12fd4edc2ae4efdc1aaffef7d4a59da7249f9
+  data.tar.gz: 587c366d59f326b0141e231b19b42aa1b5b148acb717fac8b7d01c48e8cf4476d83ea499a208283a785f43dd1f556ac569eac4b7307d95406cfa5bd937b2df6b
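
The digests above can be checked against a downloaded archive. The following is a minimal sketch using only Ruby's standard library; `sha512_matches?` is a hypothetical helper name, and the demo runs on a throwaway temp file because the real `metadata.gz` and `data.tar.gz` are not part of this diff.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: true when the SHA-512 hex digest of the file at
# `path` equals `expected_hex` (compared case-insensitively).
def sha512_matches?(path, expected_hex)
  Digest::SHA512.file(path).hexdigest == expected_hex.downcase
end

# Demo on a temp file; in practice `path` would point at the unpacked
# data.tar.gz and `expected_hex` would come from checksums.yaml.
Tempfile.create('sunbro-demo') do |f|
  f.write('hello')
  f.flush
  puts sha512_matches?(f.path, Digest::SHA512.hexdigest('hello'))
end
```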

data/.gitignore ADDED

@@ -0,0 +1,19 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+.rspec
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+test/tmp
+test/version_tmp
+tmp
+.idea/

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in sunbro.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2014 Jon Stokes
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,120 @@
+# Sunbro
+Some code that I use to crawl the web at scale with Poltergeist and
+PhantomJS (cf. [stretched.io](https://github.com/jonstokes/stretched.io)). Uses a bunch of code from the venerable [anemone gem](https://github.com/chriskite/anemone). Released in the spirit of jolly cooperation.
+## Installation
+Add this line to your application's Gemfile:
+    gem 'sunbro'
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install sunbro
+## Usage
+I use sunbro to crawl the web at scale via Sidekiq on EC2. I've found
+that web scraping with capybara/poltergeist + phantomjs is a giant pain
+on JRuby (for various reasons that you'll encounter once you try it),
+and this gem is basically my collection of fixes that makes it actually
+work. And it works pretty well; I use it in production to crawl
+230 sites and counting.
+Here's an example of a worker that looks something like what you might find in my code:
+```ruby
+class CrawlerWorker
+  def perform(opts)
+    @connection = Sunbro::Connection.new
+    return unless @links = opts[:links]
+    @links.each do |link|
+      next unless page = @connection.get_page(link)
+      puts "Page #{page.url} returned code #{page.code} with body size #{page.body.size}"
+    end
+  ensure
+    @connection.close
+  end
+end
+```
+The above uses `net/http` to fetch pages, and it pools connections
+per host and port. This is all you need most of the time. However, if you're scraping
+a page that is AJAX-heavy, that's where you'll get the most out of sunbro.
+To use phantomjs to scrape a page, you'll want to call `connection.render_page(link)`.
+This renders the JS on the page, but doesn't download any images.
+The one option to either `get_page` or `render_page` is
+`:force_format`, which can be one of `:html`, `:xml`, or `:auto`. If the
+option is set to `:html`, then `Nokogiri::HTML` will be used to parse
+`page.body`; if it's set to `:xml`, then `Nokogiri::XML` is used. If
+it's set to `:auto` or `nil`, `Nokogiri.parse` is called.
+## Configuration
+You can configure a few options in a `config/initializers/sunbro.rb`
+file, as follows:
+```ruby
+Sunbro::Settings.configure do |config|
+  config.user_agent = ENV['USER_AGENT_STRING1']
+  config.phantomjs_user_agent = ENV['USER_AGENT_STRING2']
+  config.page_format = :auto
+end
+```
+## PhantomJS zombie process monkey patch
+I use the following monkey patch for PhantomJS, because it has zombie
+process issues when it comes to JRuby. This monkey patch kills some minor
+PhantomJS functionality that I don't use, and you can read more about
+what it does and why, in [this blog post](http://jonstokes.com/2014/07/07/monkey-patching-poltergeist-for-web-scraping-with-jruby/).
+I put this in `config/initializers/phantomjs.rb`:
+```ruby
+require "capybara"
+require "capybara/poltergeist"
+require "capybara/poltergeist/utility"
+module Capybara::Poltergeist
+  Client.class_eval do
+    def start
+      @pid = Process.spawn(*command.map(&:to_s), pgroup: true)
+      ObjectSpace.define_finalizer(self, self.class.process_killer(@pid))
+    end
+    def stop
+      if pid
+        kill_phantomjs
+        ObjectSpace.undefine_finalizer(self)
+      end
+    end
+  end
+end
+```
+## Next steps
+Right now, this is more of a bag of code than a bona fide user-friendly
+gem. One next step would be to add some configuration options for PhantomJS
+that get passed via `render_page` to poltergeist and then on to the
+command line. Another would be to use `net-http-persistent`, which is
+actually included here as a dependency but isn't yet used.
+## Contributing
+1. Fork it ( http://github.com/<my-github-username>/sunbro/fork )
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request

data/Rakefile ADDED

@@ -0,0 +1 @@
+require "bundler/gem_tasks"

data/lib/sunbro.rb ADDED

@@ -0,0 +1,44 @@
+require 'nokogiri'
+require 'capybara/poltergeist'
+require 'net/http/persistent'
+require 'webrick/cookie'
+%w(
+  sunbro/version
+  sunbro/settings
+  sunbro/dynamic_http
+  sunbro/http
+  sunbro/page
+).each do |f|
+  require f
+end
+module Sunbro
+  MAX_RETRIES = 5
+  def get_page(link, opts={})
+    @http ||= HTTP.new
+    fetch_with_connection(@http, link, opts)
+  end
+  def render_page(link, opts={})
+    @dhttp ||= DynamicHTTP.new
+    fetch_with_connection(@dhttp, link, opts)
+  end
+  def fetch_with_connection(conn, link, opts)
+    page, tries = nil, MAX_RETRIES
+    begin
+      page = conn.fetch_page(link, opts)
+      sleep 1
+    end until page.try(:present?) || (tries -= 1).zero?
+    page.discard_doc! unless page.is_valid?
+    page
+  end
+  def close_http_connections
+    @http.close if @http
+    @dhttp.close if @dhttp
+  rescue IOError
+  end
+end
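
A note on `fetch_with_connection` above: it relies on `try`, which comes from ActiveSupport rather than core Ruby, and if every attempt returns `nil` the following `page.discard_doc!` call would be made on `nil`. The bounded-retry idea can be sketched without either dependency. This is a minimal sketch, not the gem's code; `fetch_with_retries`, `max_tries`, and `pause` are hypothetical names.

```ruby
# Hypothetical helper, not part of sunbro: call the block up to
# `max_tries` times and return the first non-nil result, or nil if
# every attempt came back empty.
def fetch_with_retries(max_tries: 5, pause: 0)
  max_tries.times do |attempt|
    result = yield(attempt)
    return result unless result.nil?
    # Only pause between attempts, never after the last one.
    sleep(pause) if pause > 0 && attempt < max_tries - 1
  end
  nil
end

# Usage: the block stands in for conn.fetch_page(link, opts).
page = fetch_with_retries(max_tries: 5) { |attempt| attempt == 2 ? :page : nil }
puts page.inspect
```

Returning `nil` explicitly leaves it to the caller to decide what an exhausted retry budget means, instead of calling methods on a page that may not exist.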

data/lib/sunbro/connection.rb ADDED

@@ -0,0 +1,9 @@
+class Connection
+  extend Sunbro
+  attr_reader :http, :dhttp
+  def close
+    close_http_connections
+  end
+end
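
The README example calls `Sunbro::Connection.new` and then `@connection.get_page(link)`, while the class in this hunk is defined at the top level and uses `extend Sunbro`; `sunbro/connection` is also absent from the require list in `lib/sunbro.rb`. So the README example may not run as written against this version. The difference between `extend` and `include` can be sketched with stand-in names (`Fetching`, `ExtendedConnection`, and `IncludedConnection` are hypothetical and exist only for this illustration):

```ruby
# Hypothetical stand-in for the Sunbro module.
module Fetching
  def get_page(link)
    "fetched #{link}"
  end
end

class ExtendedConnection
  extend Fetching    # get_page becomes a method on the class itself
end

class IncludedConnection
  include Fetching   # get_page becomes a method on each instance
end

puts ExtendedConnection.respond_to?(:get_page)      # the class responds
puts ExtendedConnection.new.respond_to?(:get_page)  # an instance does not
puts IncludedConnection.new.get_page('http://example.com/')
```

If instance-level calls like the one in the README are the goal, `include` is the form that provides them.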

data/lib/sunbro/dynamic_http.rb ADDED

@@ -0,0 +1,99 @@
+module Sunbro
+  class DynamicHTTP
+    attr_reader :session
+    def initialize(opts = {})
+      @opts = opts
+      new_session
+    end
+    def close
+      @session.driver.quit
+    end
+    def new_session
+      Capybara.register_driver :poltergeist do |app|
+        Capybara::Poltergeist::Driver.new(
+          app,
+          js_errors: false,
+          phantomjs_options: ['--load-images=no', '--ignore-ssl-errors=yes']
+        )
+      end
+      Capybara.default_driver = :poltergeist
+      Capybara.javascript_driver = :poltergeist
+      Capybara.run_server = false
+      @session = Capybara::Session.new(:poltergeist)
+      @session.driver.headers = {
+        'User-Agent' => user_agent
+      }
+      @session
+    end
+    def user_agent
+      @opts[:agent] || Settings.phantomjs_user_agent
+    end
+    def restart_session
+      close
+      new_session
+    end
+    #
+    # Create new Pages from the response of an HTTP request to *url*,
+    # including redirects
+    #
+    def fetch_page(url, opts={})
+      begin
+        tries ||= 5
+        get_page(url, opts)
+      rescue Capybara::Poltergeist::DeadClient, Errno::EPIPE, NoMethodError, Capybara::Poltergeist::BrowserError => e
+        restart_session
+        retry unless (tries -= 1).zero?
+        close
+        raise e
+      end
+    end
+    def get_page(url, opts)
+      reset = opts.fetch(:reset) rescue true
+      session.visit(url.to_s)
+      page = create_page_from_session(url, session, opts)
+      session.reset! if reset
+      page
+    rescue Capybara::Poltergeist::TimeoutError => e
+      restart_session
+      return Page.new(url, :error => e)
+    end
+    private
+    def create_page_from_session(url, session, opts)
+      url = url.to_s
+      if url == session.current_url
+        Page.new(
+            session.current_url,
+            :body => session.html.dup,
+            :code => session.status_code,
+            :headers => session.response_headers,
+            :force_format => (opts[:force_format] || default_page_format)
+        )
+      else
+        Page.new(
+            session.current_url,
+            :body => session.html.dup,
+            :code => 301,
+            :redirect_from => url,
+            :headers => session.response_headers,
+            :force_format => (opts[:force_format] || default_page_format)
+        )
+      end
+    end
+    def default_page_format
+      # Don't force the page format if the default format is set to :auto
+      return unless [:xml, :html].include? Settings.page_format
+      Settings.page_format
+    end
+  end
+end
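
The `fetch_page` method above combines `rescue` with `retry` to restart the browser session a bounded number of times. The same pattern can be sketched in isolation. This is a sketch under assumptions: `FlakyError` stands in for the Poltergeist error classes, and `with_restarts` and `on_retry` are hypothetical names.

```ruby
# Hypothetical error class standing in for the Poltergeist errors.
class FlakyError < StandardError; end

# Run the block, and on FlakyError call `on_retry` and try again,
# giving up (and re-raising) after `max_tries` attempts in total.
def with_restarts(max_tries: 5, on_retry: -> {})
  tries = max_tries
  begin
    yield
  rescue FlakyError
    tries -= 1
    raise if tries.zero?
    on_retry.call   # e.g. restart the browser session
    retry
  end
end

attempts = 0
value = with_restarts(max_tries: 3) do
  attempts += 1
  raise FlakyError if attempts < 3
  :ok
end
puts value.inspect
```

Separately, `opts.fetch(:reset, true)` expresses the same default as `opts.fetch(:reset) rescue true` without using an inline `rescue`, which would also hide unrelated errors.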

data/lib/sunbro/http.rb ADDED

@@ -0,0 +1,217 @@
+module Sunbro
+  class HTTP
+    # Maximum number of redirects to follow on each get_response
+    REDIRECT_LIMIT = 5
+    def initialize(opts = {})
+      @connections = {}
+      @opts = opts
+    end
+    def close
+      @connections.each do |host, ports|
+        ports.each do |port, connection|
+          connection.finish
+        end
+      end
+    end
+    #
+    # Fetch a single Page from the response of an HTTP request to *url*.
+    # Just gets the final destination page.
+    #
+    def fetch_page(url, opts={})
+      original_url = url.dup
+      pages = fetch_pages(url, opts)
+      if pages.count == 1
+        page = pages.first
+        page.url = original_url
+        page
+      else
+        page = pages.last
+        page.redirect_from = original_url
+        page
+      end
+    end
+    #
+    # Create new Pages from the response of an HTTP request to *url*,
+    # including redirects
+    #
+    def fetch_pages(url, opts={})
+      referer, depth = opts[:referer], opts[:depth]
+      force_format = opts[:force_format] || default_page_format
+      begin
+        url = convert_to_uri(url) unless url.is_a?(URI)
+        pages = []
+        get(url, referer) do |response, code, location, redirect_to, response_time|
+          pages << Page.new(location, :body => response.body.dup,
+                                      :code => code,
+                                      :headers => response.to_hash,
+                                      :referer => referer,
+                                      :depth => depth,
+                                      :redirect_to => redirect_to,
+                                      :response_time => response_time,
+                                      :force_format => force_format)
+        end
+        return pages
+      rescue Exception => e
+        if verbose?
+          puts e.inspect
+          puts e.backtrace
+        end
+        return [Page.new(url, :error => e)]
+      end
+    end
+    #
+    # Convert the link to a valid URI if possible
+    #
+    def convert_to_uri(url)
+      URI(url)
+    rescue URI::InvalidURIError
+      URI(URI.escape(url))
+    end
+    #
+    # The maximum number of redirects to follow
+    #
+    def redirect_limit
+      @opts[:redirect_limit] || REDIRECT_LIMIT
+    end
+    #
+    # The user-agent string which will be sent with each request,
+    # or nil if no such option is set
+    #
+    def user_agent
+      @opts[:agent] || Settings.user_agent
+    end
+    #
+    # Does this HTTP client accept cookies from the server?
+    #
+    def accept_cookies?
+      @opts[:accept_cookies]
+    end
+    #
+    # The proxy address string
+    #
+    def proxy_host
+      @opts[:proxy_host]
+    end
+    #
+    # The proxy port
+    #
+    def proxy_port
+      @opts[:proxy_port]
+    end
+    #
+    # HTTP read timeout in seconds
+    #
+    def read_timeout
+      @opts[:read_timeout]
+    end
+    private
+    #
+    # Retrieve HTTP responses for *url*, including redirects.
+    # Yields the response object, response code, and URI location
+    # for each response.
+    #
+    def get(url, referer = nil)
+      limit = redirect_limit
+      loc = url
+      begin
+          # if redirected to a relative url, merge it with the host of the original
+          # request url
+          loc = url.merge(loc) if loc.relative?
+          response, response_time = get_response(loc, referer)
+          code = Integer(response.code)
+          redirect_to = response.is_a?(Net::HTTPRedirection) ? URI(response['location']).normalize : nil
+          yield response, code, loc, redirect_to, response_time
+          limit -= 1
+      end while (loc = redirect_to) && allowed?(redirect_to, url) && limit > 0
+    end
+    #
+    # Get an HTTPResponse for *url*, sending the appropriate User-Agent string
+    #
+    def get_response(url, referer = nil)
+      full_path = url.query.nil? ? url.path : "#{url.path}?#{url.query}"
+      opts = {}
+      opts['User-Agent'] = user_agent if user_agent
+      opts['Referer'] = referer.to_s if referer
+      retries = 0
+      begin
+        start = Time.now()
+        # format request
+        req = Net::HTTP::Get.new(full_path, opts)
+        # HTTP Basic authentication
+        req.basic_auth url.user, url.password if url.user
+        response = connection(url).request(req)
+        finish = Time.now()
+        response_time = ((finish - start) * 1000).round
+        return response, response_time
+      rescue Timeout::Error, Net::HTTPBadResponse, EOFError => e
+        puts e.inspect if verbose?
+        refresh_connection(url)
+        retries += 1
+        retry unless retries > 3
+      end
+    end
+    def connection(url)
+      @connections[url.host] ||= {}
+      if conn = @connections[url.host][url.port]
+        return conn
+      end
+      refresh_connection url
+    end
+    def refresh_connection(url)
+      http = Net::HTTP.new(url.host, url.port, proxy_host, proxy_port)
+      http.read_timeout = read_timeout if !!read_timeout
+      if url.scheme == 'https'
+        http.use_ssl = true
+        http.verify_mode = OpenSSL::SSL::VERIFY_NONE
+      end
+      @connections[url.host][url.port] = http.start
+    end
+    def verbose?
+      @opts[:verbose]
+    end
+    #
+    # Allowed to connect to the requested url?
+    #
+    def allowed?(to_url, from_url)
+      to_url.host.nil? || (to_url.host.sub("www.","") == from_url.host.sub("www.",""))
+    rescue
+      true
+    end
+    private
+    def default_page_format
+      # Don't force the page format if the default format is set to :auto
+      return unless [:xml, :html].include? Settings.page_format
+      Settings.page_format
+    end
+  end
+end
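
The `allowed?` check above decides whether a redirect stays on the same site by comparing hosts with `"www."` removed. Because `sub("www.", "")` removes the first occurrence anywhere in the host, a host such as `awww.example.com` would also be altered. A minimal sketch of a prefix-only variant follows; `same_site?` is a hypothetical name, and `String#delete_prefix` requires Ruby 2.5 or later.

```ruby
require 'uri'

# Hypothetical variant of the host comparison: strips only a leading
# "www." before comparing the two hosts.
def same_site?(to_url, from_url)
  return true if to_url.host.nil?   # relative redirect: no host to compare
  to_url.host.delete_prefix('www.') == from_url.host.delete_prefix('www.')
end

puts same_site?(URI('http://www.example.com/a'), URI('http://example.com/b'))
puts same_site?(URI('http://other.com/'), URI('http://example.com/'))
```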

data/lib/sunbro/page.rb ADDED

@@ -0,0 +1,242 @@
+module Sunbro
+  class Page
+    # The URL of the page
+    attr_accessor :url
+    # The raw HTTP response body of the page
+    attr_reader :body
+    # Headers of the HTTP response
+    attr_reader :headers
+    # URL of the page this one redirected to, if any
+    attr_reader :redirect_to
+    # Exception object, if one was raised during HTTP#fetch_page
+    attr_reader :error
+    # Integer response code of the page
+    attr_accessor :code
+    # Boolean indicating whether or not this page has been visited in PageStore#shortest_paths!
+    attr_accessor :visited
+    # Depth of this page from the root of the crawl. This is not necessarily the
+    # shortest path; use PageStore#shortest_paths! to find that value.
+    attr_accessor :depth
+    # URL of the page that brought us to this page
+    attr_accessor :referer
+    # Response time of the request for this page in milliseconds
+    attr_accessor :response_time
+    attr_accessor :redirect_from
+    #
+    # Create a new page
+    #
+    def initialize(url, params = {})
+      @url = url
+      @code = params[:code]
+      @headers = params[:headers] || {}
+      @headers['content-type'] ||= ['']
+      @aliases = Array(params[:aka]).compact
+      @referer = params[:referer]
+      @depth = params[:depth] || 0
+      @redirect_to = to_absolute(params[:redirect_to])
+      @response_time = params[:response_time]
+      @error = params[:error]
+      @fetched = !params[:code].nil?
+      @force_format = params[:force_format]
+      @body = params[:body]
+      @redirect_from = params[:redirect_from]
+    end
+    #
+    # Nokogiri document for the HTML body
+    #
+    def doc
+      @doc ||= begin
+        if image?
+          nil
+        elsif should_parse_as?(:xml)
+          Nokogiri::XML(@body, @url.to_s)
+        elsif should_parse_as?(:html)
+          Nokogiri::HTML(@body, @url.to_s)
+        elsif @body
+          Nokogiri.parse(@body, @url.to_s)
+        end
+      end
+    end
+    def is_valid?
+      (url != "about:blank") && !not_found? && present?
+    end
+    def present?
+      !error && code && body.present? && doc
+    end
+    #
+    # Delete the Nokogiri document and response body to conserve memory
+    #
+    def discard_doc!
+      @doc = @body = nil
+    end
+    #
+    # Was the page successfully fetched?
+    # +true+ if the page was fetched with no error, +false+ otherwise.
+    #
+    def fetched?
+      @fetched
+    end
+    #
+    # Array of cookies received with this page as WEBrick::Cookie objects.
+    #
+    def cookies
+      WEBrick::Cookie.parse_set_cookies(@headers['Set-Cookie']) rescue []
+    end
+    #
+    # The content-type returned by the HTTP request for this page
+    #
+    def content_type
+      headers['content-type'].first
+    end
+    #
+    # Returns +true+ if the page is an image, returns +false+
+    # otherwise.
+    #
+    def image?
+      !!(content_type =~ %r{^(image/)\b})
+    end
+    #
+    # Returns +true+ if the page is a HTML document, returns +false+
+    # otherwise.
+    #
+    def html?
+      !!(content_type =~ %r{^(text/html|application/xhtml\+xml)\b})
+    end
+    #
+    # Returns +true+ if the page is a XML document, returns +false+
+    # otherwise.
+    #
+    def xml?
+      !!(content_type =~ %r{^(text/xml|application/xml)\b})
+    end
+    #
+    # Returns +true+ if the page is a HTTP redirect, returns +false+
+    # otherwise.
+    #
+    def redirect?
+      (300..307).include?(@code)
+    end
+    #
+    # Returns +true+ if the page was not found (returned 404 code),
+    # returns +false+ otherwise.
+    #
+    def not_found?
+      404 == @code
+    end
+    #
+    # Base URI from the HTML doc head element
+    # http://www.w3.org/TR/html4/struct/links.html#edef-BASE
+    #
+    def base
+      @base = if doc
+        href = doc.search('//head/base/@href')
+        URI(href.to_s) unless href.nil? rescue nil
+      end unless @base
+      return nil if @base && @base.to_s().empty?
+      @base
+    end
+    #
+    # Converts relative URL *link* into an absolute URL based on the
+    # location of the page
+    #
+    def to_absolute(link)
+      return nil if link.nil?
+      # remove anchor
+      link = URI.encode(URI.decode(link.to_s.gsub(/#[a-zA-Z0-9_-]*$/,'')))
+      relative = URI(link)
+      absolute = base ? base.merge(relative) : @url.merge(relative)
+      absolute.path = '/' if absolute.path.empty?
+      return absolute
+    end
+    #
+    # Returns +true+ if *uri* is in the same domain as the page, returns
+    # +false+ otherwise
+    #
+    def in_domain?(uri)
+      uri.host == @url.host
+    end
+    def marshal_dump
+      [@url, @headers, @body, @links, @code, @visited, @depth, @referer, @redirect_to, @response_time, @fetched]
+    end
+    def marshal_load(ary)
+      @url, @headers, @body, @links, @code, @visited, @depth, @referer, @redirect_to, @response_time, @fetched = ary
+    end
+    def to_hash
+      {
+        'url'           => @url.to_s,
+        'headers'       => headers.to_json,
+        'body'          => @body,
+        'code'          => @code,
+        'error'         => (@error ? @error.to_s : nil),
+        'visited'       => @visited,
+        'referer'       => (@referer ? @referer.to_s : nil),
+        'redirect_to'   => (@redirect_to ? @redirect_to.to_s : nil),
+        'redirect_from' => (@redirect_from ? @redirect_from.to_s : nil),
+        'response_time' => @response_time,
+        'fetched'       => @fetched
+      }.reject { |k, v| v.nil? }
+    end
+    def self.from_hash(hash)
+      page = self.new(URI(hash['url']))
+      {'@headers'       => JSON.load(hash['headers']),
+       '@body'          => hash['body'],
+       '@code'          => hash['code'].to_i,
+       '@error'         => hash['error'],
+       '@visited'       => hash['visited'],
+       '@referer'       => hash['referer'],
+       '@redirect_to'   => (hash['redirect_to'].present?) ? URI(hash['redirect_to']) : nil,
+       '@redirect_from' => (hash['redirect_from'].present?) ? URI(hash['redirect_from']) : nil,
+       '@response_time' => hash['response_time'].to_i,
+       '@fetched'       => hash['fetched']
+      }.each do |var, value|
+        page.instance_variable_set(var, value)
+      end
+      page
+    end
+    private
+    def cleanup_encoding(source)
+      return source unless source && (html? || xml? || @force_format)
+      text = source.dup
+      text.encode!('UTF-16', 'UTF-8', {:invalid => :replace, :undef => :replace, :replace => '?'})
+      text.encode('UTF-8', 'UTF-16')
+    end
+    def should_parse_as?(format)
+      return false unless @body
+      return @force_format == format if @force_format
+      send("#{format}?")
+    end
+  end
+end
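
The parser choice made by `doc` and `should_parse_as?` can be summarized as a small decision function. The sketch below is dependency-free and only returns a symbol naming the parser that would be chosen; `parser_for` is a hypothetical name, and the sketch escapes the plus sign in `application/xhtml+xml` so that it matches literally.

```ruby
# Hypothetical mirror of the parser choice, without Nokogiri:
#   nil   -> not parsed (images)
#   :xml  -> Nokogiri::XML would be used
#   :html -> Nokogiri::HTML would be used
#   :auto -> Nokogiri.parse would be left to guess
def parser_for(force_format, content_type)
  type = content_type.to_s
  return nil if type.match?(%r{\Aimage/})
  if force_format
    # An explicit format wins over the content type.
    return %i[xml html].include?(force_format) ? force_format : :auto
  end
  return :xml  if type.match?(%r{\A(text/xml|application/xml)\b})
  return :html if type.match?(%r{\A(text/html|application/xhtml\+xml)\b})
  :auto
end

puts parser_for(nil, 'text/html; charset=utf-8').inspect
puts parser_for(:xml, 'text/html').inspect
```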

data/lib/sunbro/settings.rb ADDED

@@ -0,0 +1,36 @@
+require 'hashie'
+module Sunbro
+  module Settings
+    DEFAULTS = {
+      user_agent:           "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36",
+      phantomjs_user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X)",
+      page_format:          :auto
+    }
+    def self.configure
+      $sunbro_configuration ||= Hashie::Mash.new
+      yield $sunbro_configuration
+    end
+    def self.user_agent
+      return DEFAULTS[:user_agent] unless configured?
+      $sunbro_configuration.user_agent
+    end
+    def self.phantomjs_user_agent
+      return DEFAULTS[:phantomjs_user_agent] unless configured?
+      $sunbro_configuration.phantomjs_user_agent
+    end
+    def self.page_format
+      return DEFAULTS[:page_format] unless configured?
+      $sunbro_configuration.page_format
+    end
+    def self.configured?
+      !!$sunbro_configuration
+    end
+  end
+end
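
In the hunk above, each reader falls back to `DEFAULTS` only when `configure` has never been called. Since a `Hashie::Mash` returns `nil` for keys that were never set, configuring only one option appears to leave the others as `nil` rather than at their defaults. A per-key fallback can be sketched with a plain `Hash`; `DemoSettings`, `reset!`, and `fetch` are hypothetical names, and the default values here are placeholders.

```ruby
# Hypothetical stand-in for Sunbro::Settings built on a plain Hash.
module DemoSettings
  DEFAULTS = { user_agent: 'demo-agent', page_format: :auto }.freeze

  def self.configure
    @overrides ||= {}
    yield @overrides
  end

  def self.reset!
    @overrides = nil
  end

  # Per-key fallback: a key that was never set yields its default,
  # whether or not configure has been called.
  def self.fetch(key)
    (@overrides || {}).fetch(key) { DEFAULTS.fetch(key) }
  end
end

DemoSettings.configure { |config| config[:user_agent] = 'custom-agent' }
puts DemoSettings.fetch(:user_agent)
puts DemoSettings.fetch(:page_format).inspect
```

Keeping the overrides in a module-level instance variable also avoids the global `$sunbro_configuration`, though that is a design preference rather than a requirement.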

data/lib/sunbro/version.rb ADDED

@@ -0,0 +1,3 @@
+module Sunbro
+  VERSION = "0.0.1"
+end

data/spec/page_spec.rb ADDED

@@ -0,0 +1,67 @@
+require 'spec_helper'
+require 'mocktra'
+describe Sunbro::Page do
+  before :each do
+    @http = Sunbro::HTTP.new(verbose: true)
+    Mocktra("www.retailer.com") do
+      get '/1.html' do
+        "<html><head><title>Title</title></head><body><p>Body text</p></body></html>"
+      end
+    end
+  end
+  describe "#initialize" do
+    it "scrubs invalid UTF-8 from @body by converting to UTF-16, then back again" do
+      # See http://stackoverflow.com/a/8873922/1169868
+      pending "Example"
+      fail
+    end
+  end
+  describe "#fetch_page" do
+    it "fetches a single page" do
+      url = "http://www.retailer.com/1.html"
+      page = @http.fetch_page(url)
+      expect(page.url.to_s).to eq(url)
+      expect(page.redirect_to).to be_nil
+      expect(page.redirect_from).to be_nil
+    end
+    it "preserves the original url in redirect_from after a redirect" do
+      pending "Figure out how to make this work with Mocktra"
+      fail
+    end
+  end
+  describe "#doc" do
+    it "uses the correct Nokogiri parser to parse html or xml, or lets Nokogiri guess" do
+      pending "Example"
+      fail
+    end
+  end
+  describe "#should_parse_as?", no_es: true do
+    it "returns true if Nokogiri should try to parse the page with the supplied format, false otherwise" do
+      url = "http://www.retailer.com/1.html"
+      page = @http.fetch_page(url)
+      expect(page.send(:should_parse_as?, :xml)).to eq(false)
+      expect(page.send(:should_parse_as?, :html)).to eq(true)
+      page = @http.fetch_page(url, force_format: :html)
+      expect(page.send(:should_parse_as?, :xml)).to eq(false)
+      expect(page.send(:should_parse_as?, :html)).to eq(true)
+      page = @http.fetch_page(url, force_format: :xml)
+      expect(page.send(:should_parse_as?, :xml)).to eq(true)
+      expect(page.send(:should_parse_as?, :html)).to eq(false)
+    end
+  end
+end
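
The pending example about scrubbing invalid UTF-8 refers to the UTF-8 to UTF-16 round trip in `cleanup_encoding` in `page.rb`. As an illustration of the behavior such a spec might assert, here is a sketch that swaps in `String#scrub` (available since Ruby 2.1) for the round trip; `scrub_utf8` is a hypothetical helper, not a method of the gem.

```ruby
# Hypothetical helper: replace invalid UTF-8 byte sequences with "?".
# Uses String#scrub rather than the UTF-16 round trip.
def scrub_utf8(source)
  return source if source.nil?
  source.dup.force_encoding(Encoding::UTF_8).scrub('?')
end

dirty = "abc\xFFdef"   # \xFF is never valid in UTF-8
puts scrub_utf8(dirty)
puts scrub_utf8(dirty).valid_encoding?
```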

data/spec/spec_helper.rb ADDED

@@ -0,0 +1,13 @@
+require 'rubygems'
+require 'bundler/setup'
+require 'active_support'
+require 'sunbro'
+RSpec.configure do |config|
+  # Run specs in random order to surface order dependencies. If you find an
+  # order dependency and want to debug it, you can fix the order by providing
+  # the seed, which is printed after each run.
+  #     --seed 1234
+  config.order = "random"
+end

data/sunbro.gemspec ADDED

@@ -0,0 +1,32 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'sunbro/version'
+Gem::Specification.new do |spec|
+  spec.name          = "sunbro"
+  spec.version       = Sunbro::VERSION
+  spec.authors       = ["Jon Stokes"]
+  spec.email         = ["jon@jonstokes.com"]
+  spec.summary       = %q{Some code that I use to crawl the web at scale. Shared in the spirit of jolly cooperation.}
+  spec.description   = %q{Requires phantomjs.}
+  spec.homepage      = ""
+  spec.license       = "MIT"
+  spec.files         = `git ls-files -z`.split("\x0")
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+  spec.add_dependency "nokogiri"
+  spec.add_dependency "capybara"
+  spec.add_dependency "poltergeist"
+  spec.add_dependency "net-http-persistent"
+  spec.add_dependency "activesupport"
+  spec.add_dependency "hashie"
+  spec.add_development_dependency "bundler", "~> 1.5"
+  spec.add_development_dependency "rake"
+  spec.add_development_dependency "rspec"
+  spec.add_development_dependency "mocktra"
+end

data/tasks/rspec.rake ADDED

@@ -0,0 +1,4 @@
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new(:spec)

metadata ADDED

@@ -0,0 +1,202 @@
+--- !ruby/object:Gem::Specification
+name: sunbro
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- Jon Stokes
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2015-02-06 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  name: nokogiri
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  name: capybara
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  name: poltergeist
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  name: net-http-persistent
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  name: activesupport
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  name: hashie
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.5'
+  name: bundler
+  prerelease: false
+  type: :development
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.5'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  name: rake
+  prerelease: false
+  type: :development
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  name: rspec
+  prerelease: false
+  type: :development
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  name: mocktra
+  prerelease: false
+  type: :development
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description: Requires phantomjs.
+email:
+- jon@jonstokes.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- lib/sunbro.rb
+- lib/sunbro/connection.rb
+- lib/sunbro/dynamic_http.rb
+- lib/sunbro/http.rb
+- lib/sunbro/page.rb
+- lib/sunbro/settings.rb
+- lib/sunbro/version.rb
+- spec/page_spec.rb
+- spec/spec_helper.rb
+- sunbro.gemspec
+- tasks/rspec.rake
+homepage: ''
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.4.5
+signing_key:
+specification_version: 4
+summary: Some code that I use to crawl the web at scale. Shared in the spirit of jolly cooperation.
+test_files:
+- spec/page_spec.rb
+- spec/spec_helper.rb