staticizer 0.0.8 → 0.0.9
- checksums.yaml +4 -4
- data/README.md +48 -13
- data/Rakefile +6 -0
- data/lib/staticizer/command.rb +10 -1
- data/lib/staticizer/crawler.rb +92 -72
- data/lib/staticizer/version.rb +1 -1
- data/staticizer.gemspec +1 -0
- data/tests/crawler_test.rb +75 -10
- data/tests/fake_page.html +288 -0
- metadata +16 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8b1b737ad4357e646eb47042fbede1e239ec6c5f963ce1c25b48a810d2efb31d
+  data.tar.gz: 186898d812bca0e4ad732bae97d91a25cb43d76643435d6f9d9a9a7ca78a76b4
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 96d6225ae4416784dd80f80d2cce73418494186b9135258e59880ae536d95c23841a0baca5433aa8baf32689b47e7bd575d80c0ab609f88b84a5463dddc4a27b
+  data.tar.gz: 8329950984e4556791cb87f17618905146add7133401288ca270af35dc37a9fde49c102201acd44529a1bd96e7b808d67cd40fe3ccb20e5dac90907642108068
data/README.md
CHANGED
@@ -9,15 +9,30 @@ website. If the website goes down this backup would be available
 with reduced functionality.
 
 S3 and Route 53 provide an great way to host a static emergency backup for a website.
-See this article - http://aws.typepad.com/aws/2013/02/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting.html
-. In our experience it works
-
-
-
-
-the
+See this article - http://aws.typepad.com/aws/2013/02/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting.html
+. In our experience it works well and is incredibly cheap. Our average sized website
+with a few hundred pages and assets is less than US$1 a month.
+
+We tried using existing tools httrack/wget to crawl and create a static version
+of the site to upload to S3, but we found that they did not work well with S3 hosting.
+We wanted the site uploaded to S3 to respond to the *exact* same URLs (where possible) as
+the existing site. This way when the site goes down incoming links from Google search
 results etc. will still work.
 
+## TODO
+
+* Abillity to specify AWS credentials via file or environment options
+* Tests!
+* Decide what to do with URLs with query strings. Currently they are crawled and uploaded to S3, but those keys cannot be accessed. ex http://squaremill.com/file?test=1 will be uploaded with the key file?test=1, but can only be accessed by encoding the ? like this %3Ftest=1
+* Create a 404 file on S3
+* Provide the option to rewrite absolute URLs to relative urls so that hosting can work on a different domain.
+* Multithread the crawler
+* Check for too many redirects
+* Provide regex options for what urls are scraped
+* Better handling of incorrect server mime types (ex. server returns text/plain for css instead of text/css)
+* Provide more options for uploading (upload via scp, ftp, custom etc.). Split out save/uploading into an interface.
+* Handle large files in a more memory efficient way by streaming uploads/downloads
+
 ## Installation
 
 Add this line to your application's Gemfile:
@@ -30,12 +45,11 @@ And then execute:
 
 
 Or install it yourself as:
-    $ gem install
+    $ gem install staticizer
 
 ## Command line usage
 
-
-
+Staticizer can be used through the commandline tool or by requiring the library.
 
 ### Crawl a website and write to disk
 
@@ -61,6 +75,8 @@ This will only crawl urls in the domain squaremill.com
 
     s = Staticizer::Crawler.new("http://squaremill.com",
       :aws => {
+        :region => "us-west-1",
+        :endpoint => "http://s3.amazonaws.com",
         :bucket_name => "www.squaremill.com",
         :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
         :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
@@ -73,10 +89,27 @@ This will only crawl urls in the domain squaremill.com
     s = Staticizer::Crawler.new("http://squaremill.com", :output_dir => "/tmp/crawl")
     s.crawl
 
+
+### Crawl a website and make all pages contain 'noindex' meta tag
+
+    s = Staticizer::Crawler.new("http://squaremill.com",
+      :output_dir => "/tmp/crawl",
+      :process_body => lambda {|body, uri, opts|
+        # not the best regex, but it will do for our use
+        body = body.gsub(/<meta\s+name=['"]robots[^>]+>/i,'')
+        body = body.gsub(/<head>/i,"<head>\n<meta name='robots' content='noindex'>")
+        body
+      }
+    )
+    s.crawl
+
+
 ### Crawl a website and rewrite all non www urls to www
 
     s = Staticizer::Crawler.new("http://squaremill.com",
       :aws => {
+        :region => "us-west-1",
+        :endpoint => "http://s3.amazonaws.com",
         :bucket_name => "www.squaremill.com",
         :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
         :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
@@ -92,14 +125,16 @@ This will only crawl urls in the domain squaremill.com
     )
     s.crawl
 
-##
+## Crawler Options
 
 * :aws - Hash of connection options passed to aws/sdk gem
-* :filter_url -
+* :filter_url - lambda called to see if a discovered URL should be crawled, return the url (can be modified) to crawl, return nil otherwise
 * :output_dir - if writing a site to disk the directory to write to, will be created if it does not exist
 * :logger - A logger object responding to the usual Ruby Logger methods.
 * :log_level - Log level - defaults to INFO.
-
+* :valid_domains - Array of domains that should be crawled. Domains not in this list will be ignored.
+* :process_body - lambda called to pre-process body of content before writing it out.
+* :skip_write - don't write retrieved files to disk or s3, just crawl the site (can be used to find 404s etc.)
 
 ## Contributing
 
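The `:process_body` lambda shown in the new README example can be exercised standalone, since it only transforms a string; this sketch copies the lambda from the diff and runs it on a hypothetical page (no crawling or AWS setup needed):

```ruby
# The :process_body hook from the README, applied to a plain string.
process_body = lambda do |body, uri, opts|
  # not the best regex, but it will do for our use
  body = body.gsub(/<meta\s+name=['"]robots[^>]+>/i, '')
  body.gsub(/<head>/i, "<head>\n<meta name='robots' content='noindex'>")
end

html = "<html><head><meta name='robots' content='index,follow'></head><body>Hi</body></html>"
puts process_body.call(html, nil, {})
```

The existing robots meta tag is stripped and a `noindex` tag is injected right after `<head>`, which is what keeps the static backup out of search indexes.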
data/Rakefile
CHANGED
data/lib/staticizer/command.rb
CHANGED
@@ -18,6 +18,11 @@ module Staticizer
         options[:aws][:bucket_name] = v
       end
 
+      opts.on("--aws-region [STRING]", "AWS Region of S3 bucket") do |v|
+        options[:aws] ||= {}
+        options[:aws][:region] = v
+      end
+
       opts.on("--aws-access-key [STRING]", "AWS Access Key ID") do |v|
         options[:aws] ||= {}
         options[:aws][:access_key_id] = v
@@ -44,6 +49,10 @@ module Staticizer
         options[:logger] = Logger.new(v)
       end
 
+      opts.on("--skip-write [PATH]", "Don't write out files to disk or s3") do |v|
+        options[:skip_write] = true
+      end
+
       opts.on("--valid-domains x,y,z", Array, "Comma separated list of domains that should be crawled, other domains will be ignored") do |v|
         options[:valid_domains] = v
       end
@@ -55,7 +64,7 @@ module Staticizer
       end
     end
 
-    begin
+    begin
       parser.parse!(args)
       initial_page = ARGV.pop
       raise ArgumentError, "Need to specify an initial URL to start the crawl" unless initial_page
data/lib/staticizer/crawler.rb
CHANGED
@@ -6,6 +6,9 @@ require 'logger'
 
 module Staticizer
   class Crawler
+    attr_reader :url_queue
+    attr_accessor :output_dir
+
     def initialize(initial_page, opts = {})
       if initial_page.nil?
         raise ArgumentError, "Initial page required"
@@ -14,24 +17,36 @@ module Staticizer
       @opts = opts.dup
       @url_queue = []
       @processed_urls = []
-      @opts[:output_dir]
+      @output_dir = @opts[:output_dir] || File.expand_path("crawl/")
       @log = @opts[:logger] || Logger.new(STDOUT)
       @log.level = @opts[:log_level] || Logger::INFO
 
       if @opts[:aws]
         bucket_name = @opts[:aws].delete(:bucket_name)
-
-        @s3_bucket =
-        @s3_bucket.acl = :public_read
+        Aws.config.update(opts[:aws])
+        @s3_bucket = Aws::S3::Resource.new.bucket(bucket_name)
       end
 
       if @opts[:valid_domains].nil?
         uri = URI.parse(initial_page)
         @opts[:valid_domains] ||= [uri.host]
       end
+
+      if @opts[:process_body]
+        @process_body = @opts[:process_body]
+      end
+
       add_url(initial_page)
     end
 
+    def log_level
+      @log.level
+    end
+
+    def log_level=(level)
+      @log.level = level
+    end
+
     def crawl
       @log.info("Starting crawl")
       while(@url_queue.length > 0)
@@ -42,15 +57,6 @@ module Staticizer
       @log.info("Finished crawl")
     end
 
-    def extract_videos(doc, base_uri)
-      doc.xpath("//video").map do |video|
-        sources = video.xpath("//source/@src").map {|src| make_absolute(base_uri, src)}
-        poster = video.attributes["poster"].to_s
-        make_absolute(base_uri, poster)
-        [poster, sources]
-      end.flatten.uniq.compact
-    end
-
     def extract_hrefs(doc, base_uri)
       doc.xpath("//a/@href").map {|href| make_absolute(base_uri, href) }
     end
@@ -63,17 +69,21 @@ module Staticizer
       doc.xpath("//link/@href").map {|href| make_absolute(base_uri, href) }
     end
 
+    def extract_videos(doc, base_uri)
+      doc.xpath("//video").map do |video|
+        sources = video.xpath("//source/@src").map {|src| make_absolute(base_uri, src)}
+        poster = video.attributes["poster"].to_s
+        make_absolute(base_uri, poster)
+        [poster, sources]
+      end.flatten.uniq.compact
+    end
+
     def extract_scripts(doc, base_uri)
       doc.xpath("//script/@src").map {|src| make_absolute(base_uri, src) }
     end
 
     def extract_css_urls(css, base_uri)
-      css.scan(/url\(([
-      path = src[0]
-      # URLS in css can be wrapped with " or 'ex: url("http:://something/"), strip these
-      path = path.strip.gsub(/^['"]/, "").gsub(/['"]$/,"")
-      make_absolute(base_uri, path)
-      end
+      css.scan(/url\(\s*['"]?(.+?)['"]?\s*\)/).map {|src| make_absolute(base_uri, src[0]) }
     end
 
     def add_urls(urls, info = {})
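The rewritten `extract_css_urls` collapses the old quote-stripping loop into one regex whose optional `['"]?` groups absorb the quotes. A stdlib-only sketch of that regex, with `URI.join` standing in for the crawler's `make_absolute` helper and hypothetical CSS input:

```ruby
require 'uri'

# Sample CSS covering the three quoting styles url() accepts.
css = <<~CSS
  body  { background: url("/images/bg.png"); }
  .logo { background-image: url('http://example.com/logo.png'); }
  .icon { background: url(sprite.gif); }
CSS

base = "http://example.com/css/site.css"
# Same scan as the new extract_css_urls: quotes and padding stay
# outside the capture group, so src[0] is the bare path.
urls = css.scan(/url\(\s*['"]?(.+?)['"]?\s*\)/).map { |src| URI.join(base, src[0]).to_s }
p urls
```

Relative paths resolve against the stylesheet's URL, while already-absolute URLs pass through unchanged.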
@@ -81,14 +91,11 @@ module Staticizer
     end
 
     def make_absolute(base_uri, href)
-
-
-
+      dup_uri = base_uri.dup
+      dup_uri.query = nil
+      if href.to_s =~ /https?/i
         href.to_s.gsub(" ", "+")
       else
-        dup_uri = base_uri.dup
-        # Remove the query params as otherwise will try use those when making absolute uri
-        dup_uri.query = nil
         URI::join(dup_uri.to_s, href).to_s
       end
     rescue StandardError => e
@@ -110,22 +117,23 @@ module Staticizer
       @url_queue << [url, info]
     end
 
-    def save_page(response, uri
+    def save_page(response, uri)
+      return if @opts[:skip_write]
       if @opts[:aws]
-        save_page_to_aws(response, uri
+        save_page_to_aws(response, uri)
       else
-        save_page_to_disk(response, uri
+        save_page_to_disk(response, uri)
       end
     end
 
-    def save_page_to_disk(response, uri
+    def save_page_to_disk(response, uri)
       path = uri.path
-      path += "?#{uri.query}" if uri.query
+      path += "?#{uri.query}" if uri.query
 
       path_segments = path.scan(%r{[^/]*/})
       filename = path.include?("/") ? path[path.rindex("/")+1..-1] : path
 
-      current = @
+      current = @output_dir
       FileUtils.mkdir_p(current) unless File.exist?(current)
 
       # Create all the directories necessary for this file
@@ -145,71 +153,77 @@ module Staticizer
       end
 
       body = response.respond_to?(:read_body) ? response.read_body : response
-      body =
+      body = process_body(body, uri, {})
       outfile = File.join(current, "/#{filename}")
-
       if filename == ""
         indexfile = File.join(outfile, "/index.html")
-        return if opts[:no_overwrite] && File.exists?(indexfile)
         @log.info "Saving #{indexfile}"
         File.open(indexfile, "wb") {|f| f << body }
       elsif File.directory?(outfile)
         dirfile = outfile + ".d"
-        outfile = File.join(outfile, "/index.html")
-        return if opts[:no_overwrite] && File.exists?(outfile)
         @log.info "Saving #{dirfile}"
         File.open(dirfile, "wb") {|f| f << body }
-        FileUtils.cp(dirfile, outfile)
+        FileUtils.cp(dirfile, File.join(outfile, "/index.html"))
       else
-        return if opts[:no_overwrite] && File.exists?(outfile)
         @log.info "Saving #{outfile}"
         File.open(outfile, "wb") {|f| f << body }
       end
     end
 
-    def save_page_to_aws(response, uri
+    def save_page_to_aws(response, uri)
       key = uri.path
       key += "?#{uri.query}" if uri.query
+      key = key.gsub(%r{/$},"/index.html")
       key = key.gsub(%r{^/},"")
       key = "index.html" if key == ""
       # Upload this file directly to AWS::S3
-      opts = {:acl =>
+      opts = {:acl => "public-read"}
       opts[:content_type] = response['content-type'] rescue "text/html"
       @log.info "Uploading #{key} to s3 with content type #{opts[:content_type]}"
       if response.respond_to?(:read_body)
-
+        body = process_body(response.read_body, uri, opts)
+        @s3_bucket.object(key).put(opts.merge(body: body))
       else
-
-
+        body = process_body(response, uri, opts)
+        @s3_bucket.object(key).put(opts.merge(body: body))
+      end
     end
-
+
     def process_success(response, parsed_uri)
       url = parsed_uri.to_s
+      if @opts[:filter_process]
+        return if @opts[:filter_process].call(response, parsed_uri)
+      end
       case response['content-type']
       when /css/
-        save_page(response, parsed_uri
-        add_urls(extract_css_urls(response.body,
+        save_page(response, parsed_uri)
+        add_urls(extract_css_urls(response.body, url), {:type_hint => "css_url"})
       when /html/
-
-
-        doc
+        save_page(response, parsed_uri)
+        doc = Nokogiri::HTML(response.body)
+        add_urls(extract_links(doc, url), {:type_hint => "link"})
+        add_urls(extract_scripts(doc, url), {:type_hint => "script"})
+        add_urls(extract_images(doc, url), {:type_hint => "image"})
+        add_urls(extract_css_urls(response.body, url), {:type_hint => "css_url"})
         add_urls(extract_videos(doc, parsed_uri), {:type_hint => "video"})
-        add_urls(
-        add_urls(extract_scripts(doc, parsed_uri), {:type_hint => "script"})
-        add_urls(extract_images(doc, parsed_uri), {:type_hint => "image"})
-        add_urls(extract_hrefs(doc, parsed_uri), {:type_hint => "href"})
-        # extract inline style="background-image:url('https://')" type of urls
-        add_urls(extract_css_urls(body, parsed_uri), {:type_hint => "css_url"})
+        add_urls(extract_hrefs(doc, url), {:type_hint => "href"}) unless @opts[:single_page]
       else
-        save_page(response, parsed_uri
+        save_page(response, parsed_uri)
       end
     end
 
     # If we hit a redirect we save the redirect as a meta refresh page
     # TODO: for AWS S3 hosting we could instead create a redirect?
-    def process_redirect(url, destination_url
+    def process_redirect(url, destination_url)
       body = "<html><head><META http-equiv='refresh' content='0;URL=\"#{destination_url}\"'></head><body>You are being redirected to <a href='#{destination_url}'>#{destination_url}</a>.</body></html>"
-      save_page(body, url
+      save_page(body, url)
+    end
+
+    def process_body(body, uri, opts)
+      if @process_body
+        body = @process_body.call(body, uri, opts)
+      end
+      body
     end
 
     # Fetch a URI and save it to disk
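The new `gsub(%r{/$},"/index.html")` line in `save_page_to_aws` is what maps directory-style URLs onto keys that S3 static hosting can serve. The key derivation can be isolated as a pure function (upload omitted, since it needs the aws-sdk; input paths are hypothetical):

```ruby
# S3 key normalization from save_page_to_aws, as a pure function.
def s3_key(path, query = nil)
  key = path.dup
  key += "?#{query}" if query
  key = key.gsub(%r{/$}, "/index.html")  # trailing slash -> index.html
  key = key.gsub(%r{^/}, "")             # S3 keys have no leading slash
  key = "index.html" if key == ""        # bare domain root
  key
end

p s3_key("/")             # root page
p s3_key("/about/")       # directory-style URL
p s3_key("/css/site.css") # plain asset
```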
@@ -218,31 +232,37 @@ module Staticizer
       parsed_uri = URI(url)
 
       @log.debug "Fetching #{parsed_uri}"
-
+
       # Attempt to use an already open Net::HTTP connection
       key = parsed_uri.host + parsed_uri.port.to_s
       connection = @http_connections[key]
       if connection.nil?
         connection = Net::HTTP.new(parsed_uri.host, parsed_uri.port)
-        connection.use_ssl = true if parsed_uri.scheme == "https"
+        connection.use_ssl = true if parsed_uri.scheme.downcase == "https"
         @http_connections[key] = connection
       end
 
       request = Net::HTTP::Get.new(parsed_uri.request_uri)
-
-
-
-
-
-
-
-
-
-
-
+      begin
+        connection.request(request) do |response|
+          case response
+          when Net::HTTPSuccess
+            process_success(response, parsed_uri)
+          when Net::HTTPRedirection
+            redirect_url = response['location']
+            @log.debug "Processing redirect to #{redirect_url}"
+            process_redirect(parsed_uri, redirect_url)
+            add_url(redirect_url)
+          else
+            @log.error "Error #{response.code}:#{response.message} fetching url #{url}"
+          end
         end
+      rescue OpenSSL::SSL::SSLError => e
+        @log.error "SSL Error #{e.message} fetching url #{url}"
+      rescue Errno::ECONNRESET => e
+        @log.error "Error #{e.class}:#{e.message} fetching url #{url}"
       end
     end
 
   end
-end
+end
data/lib/staticizer/version.rb
CHANGED
data/staticizer.gemspec
CHANGED
@@ -20,6 +20,7 @@ Gem::Specification.new do |spec|
 
   spec.add_development_dependency "bundler", "~> 1.3"
   spec.add_development_dependency "rake"
+  spec.add_development_dependency "webmock"
 
   spec.add_runtime_dependency 'nokogiri'
   spec.add_runtime_dependency 'aws-sdk'
data/tests/crawler_test.rb
CHANGED
@@ -1,15 +1,80 @@
 require 'minitest/autorun'
+require 'ostruct'
 
-
+lib = File.expand_path(File.dirname(__FILE__) + '/../lib')
+$LOAD_PATH.unshift(lib) if File.directory?(lib) && !$LOAD_PATH.include?(lib)
+
+require 'staticizer'
 
 class TestFilePaths < MiniTest::Unit::TestCase
-
-
-
-
-
-
-
-  "
-
+  def setup
+    @crawler = Staticizer::Crawler.new("http://test.com")
+    @crawler.log_level = Logger::FATAL
+    @fake_page = File.read(File.expand_path(File.dirname(__FILE__) + "/fake_page.html"))
+  end
+
+  def test_save_page_to_disk
+    fake_response = OpenStruct.new(:read_body => "test", :body => "test")
+    file_paths = {
+      "http://test.com" => "index.html",
+      "http://test.com/" => "index.html",
+      "http://test.com/asdfdf/dfdf" => "/asdfdf/dfdf",
+      "http://test.com/asdfdf/dfdf/" => ["/asdfdf/dfdf","/asdfdf/dfdf/index.html"],
+      "http://test.com/asdfad/asdffd.test" => "/asdfad/asdffd.test",
+      "http://test.com/?asdfsd=12312" => "/?asdfsd=12312",
+      "http://test.com/asdfad/asdffd.test?123=sdff" => "/asdfad/asdffd.test?123=sdff",
+    }
+
+    # TODO: Stub out file system using https://github.com/defunkt/fakefs?
+    outputdir = "/tmp/staticizer_crawl_test"
+    FileUtils.rm_rf(outputdir)
+    @crawler.output_dir = outputdir
+
+    file_paths.each do |k,v|
+      @crawler.save_page_to_disk(fake_response, URI.parse(k))
+      [v].flatten.each do |file|
+        expected = File.expand_path(outputdir + "/#{file}")
+        assert File.exists?(expected), "File #{expected} not created for url #{k}"
+      end
+    end
+  end
+
+  def test_save_page_to_aws
+  end
+
+  def test_add_url_with_valid_domains
+    test_url = "http://test.com/test"
+    @crawler.add_url(test_url)
+    assert(@crawler.url_queue[-1] == [test_url, {}], "URL #{test_url} not added to queue")
+  end
+
+  def test_add_url_with_filter
+  end
+
+  def test_initialize_options
+  end
+
+  def test_process_url
+  end
+
+  def test_make_absolute
+  end
+
+  def test_link_extraction
+  end
+
+  def test_href_extraction
+  end
+
+  def test_css_extraction
+  end
+
+  def test_css_url_extraction
+  end
+
+  def test_image_extraction
+  end
+
+  def test_script_extraction
+  end
 end
data/tests/fake_page.html
ADDED
@@ -0,0 +1,288 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<title>Web Application Design and Development — Square Mill Labs</title>
+<meta content="authenticity_token" name="csrf-param" />
+<meta content="LshjtNLXmjVY9NINXYQds+2Ur+jxUtqKVjjbDbVl+9w=" name="csrf-token" />
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
+<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
+<meta property="og:type" content="website">
+<meta property="og:url" content="http://squaremill.com/">
+<meta property="og:image" content="">
+<meta name="viewport" content="width=device-width, maximum-scale=1.0, initial-scale=1.0">
+<meta name="description" content="Web Application Design and Development — Square Mill Labs">
+<link rel="shortcut icon" type="image/png" href="http://squaremill.com/assets/icons/favicon-0fecbe6b20ff5bdf623357a3fac76b4b.png">
+<link data-turbolinks-track="true" href="/assets/mn_application-5ddad96f16e03ad2137bf02270506e61.css" media="all" rel="stylesheet" />
+<!--[if lt IE 9]>
+<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
+<![endif]-->
+
+<script type="text/javascript" src="//use.typekit.net/cjr4fwy.js"></script>
+<script type="text/javascript">try{Typekit.load();}catch(e){}</script>
+</head>
+
+<body id="public">
+<script type="text/javascript">
+
+var _gaq = _gaq || [];
+_gaq.push(['_setAccount', 'UA-30460332-1']);
+_gaq.push(['_setDomainName', 'squaremill.com']);
+_gaq.push(['_trackPageview']);
+
+(function() {
+var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+})();
+
+</script>
+
+
+<header id="header">
+<nav class="nav container">
+<a class="branding" href="http://squaremill.com/" rel="home" title="Square Mill - Digital Products for Web and Mobile">
+<img alt="Square Mill Logo" class="logo" height="16" src="/assets/m2-wordmark-black-97525464acd136ce26b77e39c7ed2ba3.png" width="128" />
+<p class="description">
+Digital Products for Web and Mobile
+</p>
+</a> <a class="menu-trigger" href="#">Menu</a>
+<div class="main-nav">
+<ul class="container">
+<li><a href="/projects">Projects</a></li>
+<li><a href="/about">About Us</a></li>
+<li><a href="/blog">Blog</a></li>
+
+<!-- <li class="link-biography"><a href="http://squaremill.com/#biography">People</a></li> -->
+</ul>
+</div>
+
+</nav>
+
+</header>
+
+
+
+<div id="site-content">
+
+
+
+
+<div class="container" id="home-projects">
+<section class="big-promo">
+<div class="project" id="project-7">
+<a href="/projects/bon-voyaging">
+<div class="devices">
+<div class="laptop device">
+<img alt="" class="chrome" src="/assets/projects/macbook-pro-1be04a78f99b40e2c676d78fc24fcc3d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/desktop_image/7/bonvoyaging-desktop.jpg" />
+</div>
+</div>
+
+<div class="handheld device">
+<img alt="" class="chrome" src="/assets/projects/iphone5-c1ea3a16be931f1e80bacab5cfec932d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/iphone_image/7/bonvoyagin-handheld.jpg" />
+</div>
+</div>
+</div>
+
+<div class="project-description">
+<div class="summary">
+<h2>Bon Voyaging <i class="icon-play-sign"></i></h2>
+<p>Bon Voyaging enables discerning travelers to expertly envision their next voyage from inspiration to exploration. Powerful search tools and a interactive javascript interface make planning trips fun.</p>
+</div>
+</div>
+</a>
+</div>
+</section>
+<section class="big-promo">
+<div class="project" id="project-1">
+<a href="/projects/kpcb-fellows">
+<div class="devices">
+<div class="laptop device">
+<img alt="" class="chrome" src="/assets/projects/macbook-pro-1be04a78f99b40e2c676d78fc24fcc3d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/desktop_image/1/kpcb-fellows-screenshot.jpg" />
+</div>
+</div>
+
+<div class="handheld device">
+<img alt="" class="chrome" src="/assets/projects/iphone5-c1ea3a16be931f1e80bacab5cfec932d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/iphone_image/1/kpcb-fellows-iphone-screenshot.jpg" />
+</div>
+</div>
+</div>
+
+<div class="project-description">
+<div class="summary">
+<h2>KPCB Fellows Website and Brand <i class="icon-play-sign"></i></h2>
+<p>The Fellows Program is a three-month work-based program that pairs top U.S. Engineering, Design and Product Design students with leading technology companies</p>
+</div>
+</div>
+</a>
+</div>
+</section>
+<section class="big-promo">
+<div class="project" id="project-2">
+<a href="/projects/thomson-reuters-messenger">
+<div class="devices">
+<div class="laptop device no-handheld">
+<img alt="" class="chrome" src="/assets/projects/macbook-pro-1be04a78f99b40e2c676d78fc24fcc3d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/desktop_image/2/thomson-reuters-messenger-desktop.png" />
+</div>
+</div>
+
+</div>
+
+<div class="project-description">
+<div class="summary">
+<h2>Thomson Reuters Messenger <i class="icon-play-sign"></i></h2>
+<p>Messenger is an html5 / javascript instant messenger application for financial professionals</p>
+</div>
+</div>
+</a>
+</div>
+</section>
+<section class="big-promo">
+<div class="project" id="project-3">
+<a href="/projects/kleiner-perkins-caufield-byers-digital-presence">
+<div class="devices">
+<div class="laptop device">
+<img alt="" class="chrome" src="/assets/projects/macbook-pro-1be04a78f99b40e2c676d78fc24fcc3d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/desktop_image/3/kpcb-screenshot.jpg" />
+</div>
+</div>
+
+<div class="handheld device">
+<img alt="" class="chrome" src="/assets/projects/iphone5-c1ea3a16be931f1e80bacab5cfec932d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/iphone_image/3/kpcb-iphone-screenshot.jpg" />
+</div>
+</div>
+</div>
+
+<div class="project-description">
+<div class="summary">
+<h2>Kleiner Perkins Caufield & Byers Digital Presence <i class="icon-play-sign"></i></h2>
+<p>KPCB is a venture capital stalwart located in Silicon Valley with over 40 years of tech and science investment.</p>
+</div>
+</div>
+</a>
+</div>
+</section>
+</div>
+
+<section class="clients full-width">
+<div class="container">
+<h2>Clients</h2>
+<ul class="hlist">
+<li><a href="http://kpcb.com" rel="friend" target="_blank" title="KPCB's Website"><img alt="KPCB" src="/uploads/client/image/1/home_logo_kpcb-logo.png" /></a></li>
+<li><a href="http://thomsonreuters.com" rel="friend" target="_blank" title="Thomson Reuters's Website"><img alt="Thomson Reuters" src="/uploads/client/image/2/home_logo_thomsonreuters.png" /></a></li>
+<li><a href="http://sumzero.com" rel="friend" target="_blank" title="SumZero's Website"><img alt="SumZero" src="/uploads/client/image/3/home_logo_sumzero.png" /></a></li>
+<li><a href="http://marlboroughgallery.com" rel="friend" target="_blank" title="Marlborough Gallery's Website"><img alt="Marlborough Gallery" src="/uploads/client/image/4/home_logo_marlborough.png" /></a></li>
+<li><a href="http://flurry.com" rel="friend" target="_blank" title="Flurry Analytics's Website"><img alt="Flurry Analytics" src="/uploads/client/image/8/home_logo_flurry.png" /></a></li>
+</ul>
+</div>
+</section>
+
+<section class="quote">
+<blockquote>
+<p>"Square Mill really took the time to understand our business and think strategically about how we want to engage and communicate with our entrepreneurs online. Together, their small team is responsive, nimble and efficient and has the deep design and technical chops to back it up."</p>
+<small><a href="http://kpcb.com/partner/christina-lee" rel="friend" title="Christina Lee, Operating Partner at KPCB">Christina Lee</a>, <em>Operating Partner at KPCB</em></small>
+</blockquote>
+</section>
+
+
+
+
+</div>
+
+<footer id="footer">
+<section class="container">
+<a class="logo" href="http://squaremill.com/" rel="home">
+<img alt="Square Mill Logo" height="64" src="/assets/md-logo-black-942423ecfd86c43ec6f13f163ea03f97.png" width="64" />
+</a> <div class="main-nav">
+<ul class="container">
+<li><a href="/projects">Projects</a></li>
+<li><a href="/about">About Us</a></li>
+<li><a href="/blog">Blog</a></li>
+
+<!-- <li class="link-biography"><a href="http://squaremill.com/#biography">People</a></li> -->
+</ul>
+</div>
+
+</section>
+
+<p class="copyright">
+© 2014 Square Mill Labs, LLC. All rights reserved.
+</p>
+</footer>
+
+<script type="text/javascript" src="http://code.jquery.com/jquery-2.0.0.js"></script>
+<script type="text/javascript" src="http://code.jquery.com/jquery-migrate-1.1.1.js"></script>
+<script src="/assets/mn_application-82f6787dca307be34ec0c9fa6b7ba7d4.js"></script>
+<script>
+$(document).ready(function() {
+
+var controller = $.superscrollorama({
+triggerAtCenter: true,
+playoutAnimations: true
+});
+
+if ( $(window).width() >= 767 ) {
+controller.addTween('#project-7',
+TweenMax.from($('#project-7'), .7, {
+css:{"opacity":"0"},
+onComplete: function(){
+$('#project-7').toggleClass('active-in')
+}
+}),
+300, // duration of scroll in pixel units
+-100, // scroll offset (from center of viewport)
+true
+);
+controller.addTween('#project-1',
+TweenMax.from($('#project-1'), .7, {
+css:{"opacity":"0"},
+onComplete: function(){
+$('#project-1').toggleClass('active-in')
+}
+}),
+300, // duration of scroll in pixel units
+-100, // scroll offset (from center of viewport)
+true
+);
+controller.addTween('#project-2',
+TweenMax.from($('#project-2'), .7, {
+css:{"opacity":"0"},
+onComplete: function(){
+$('#project-2').toggleClass('active-in')
+}
+}),
+300, // duration of scroll in pixel units
+-100, // scroll offset (from center of viewport)
+true
+);
+controller.addTween('#project-3',
+TweenMax.from($('#project-3'), .7, {
+css:{"opacity":"0"},
+onComplete: function(){
+$('#project-3').toggleClass('active-in')
+}
+}),
+300, // duration of scroll in pixel units
+-100, // scroll offset (from center of viewport)
+true
+);
+}
+
+});
+</script>
+
+</body>
+</html>
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: staticizer
 version: !ruby/object:Gem::Version
-  version: 0.0.8
+  version: 0.0.9
 platform: ruby
 authors:
 - Conor Hunt
@@ -38,6 +38,20 @@ dependencies:
   - - ">="
   - !ruby/object:Gem::Version
     version: '0'
+- !ruby/object:Gem::Dependency
+  name: webmock
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
 - !ruby/object:Gem::Dependency
   name: nokogiri
   requirement: !ruby/object:Gem::Requirement
@@ -87,6 +101,7 @@ files:
 - lib/staticizer/version.rb
 - staticizer.gemspec
 - tests/crawler_test.rb
+- tests/fake_page.html
 homepage: https://github.com/SquareMill/staticizer
 licenses:
 - MIT